
Is VAM a Sham? Depends on the Question You’re Asking.

  1. “Does data source X provide useful information?” and
  2. “Should data source X be used for purpose Y?”

are two very different questions.  Unfortunately, education researchers, writers, and advocates conflate them far too often, and that conflation frequently leads to bad policy recommendations.

This problem surfaces especially often in debates about value added modeling (VAM), a statistical method aimed at capturing a teacher’s effectiveness in the classroom.  Drawing on a new paper from economists Raj Chetty, John Friedman, and Jonah Rockoff, Andrew Flowers writes, in response to question 1 above, that we’re pretty good at “the science of grading teachers” with VAM results.  Flowers weighs in on question 2 as well, arguing that Chetty et al.’s work means that “administrators can legitimately use value-added scores to hire, fire and otherwise evaluate teacher performance.”

In terms of question 1, the idea that VAM research indicates that we’re pretty good at “grading teachers” is itself debatable.  Flowers doesn’t conduct an extensive survey of researchers or research, but focuses on six well-known veterans of VAM debates, including several of the more outspoken defenders of the metric (Chetty and Thomas Kane specifically; Friedman, Rockoff, and Douglas Staiger are also longtime VAM supporters).  While many respected academics caution about VAM’s limitations and/or have more nuanced positions on its use, Jesse Rothstein is the only one Flowers cites.

In fact, whether VAM estimates are systematically biased (Rothstein’s argument) or not (Chetty et al.’s contention), there are legitimate questions about whether VAM results are valid (whether or not they are really capturing “teacher effectiveness” in the way that most people think about it).  VAM estimates correlate surprisingly little with other measures aimed at capturing effective teaching (like experts’ assessments of classroom instruction).  They’re also notoriously unstable, meaning that a teacher’s scores bounce around a lot depending on the year and test studied.  While other methods of evaluating teacher effectiveness have similar issues and there are certain approaches to VAM (not commonly used) that are more useful than others, it’s perfectly reasonable to argue that we’re still pretty bad at “grading teachers.”
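To make the instability point concrete, here’s a minimal simulation sketch in Python.  The numbers are made up rather than drawn from any particular study; the only assumption is that a teacher’s true effect accounts for a modest share of the variation in her single-year result, with the rest coming from different students, different tests, and plain luck.

```python
import numpy as np

# Minimal sketch with made-up numbers: each teacher's single-year, VAM-style
# estimate is her true effect plus a large dose of non-teacher variation.
rng = np.random.default_rng(0)
n_teachers = 1000
true_effect = rng.normal(0, 1, n_teachers)   # each teacher's "real" contribution (assumed scale)
noise_sd = 2.0                               # assumed non-teacher variation (students, tests, luck)

year1 = true_effect + rng.normal(0, noise_sd, n_teachers)  # estimate in year 1
year2 = true_effect + rng.normal(0, noise_sd, n_teachers)  # same teachers, year 2

r = np.corrcoef(year1, year2)[0, 1]
print(f"Year-to-year correlation of the estimates: {r:.2f}")  # roughly 0.2
```

Under these assumptions the same teachers’ scores correlate at only about 0.2 from one year to the next – a teacher’s ranking this year tells you relatively little about her ranking next year, which is exactly the kind of bouncing around critics have in mind.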

More importantly, however, debates about bias, validity, and stability in VAM actually have much less to do with the answer to question 2 – should we use VAM to evaluate teachers in the way its proponents recommend? – than many people think.  To understand why, we need look no further than two of the core purposes of teacher evaluation, purposes on which everyone from teachers unions to education reform organizations generally agrees (at least rhetorically).

1) One core purpose of teacher evaluation is helping teachers improve. Making VAM results a defined percentage of a teacher’s evaluation is not useful for this purpose even if we assume VAM results are unbiased, valid, and stable.  Such a policy may actually undermine teacher improvement, and hence the quality of instruction that students receive.

For starters, a VAM score is opaque.  Teachers cannot match their VAM results back to specific questions on a test or use them to figure out what their students did or didn’t know.  VAM may be able to tell a teacher if her students did well or poorly on a specific test, but not why students did well or poorly.  In other words, a VAM score provides no actionable feedback. It does not indicate anything about what a teacher can do to help her students learn.

In addition, VAM results are outcomes over which a teacher has very limited control – research typically finds that teachers account for less than a fifth of the variation in student test scores (the rest is mostly random error and outside-of-school factors).  If a teacher’s VAM results look good, that might be because the teacher did something well, but it also might be because the teacher got lucky, or because some other factor contributed to her students’ success.  The tendency to view VAM results as indicative of whether or not a teacher did a good job – a common side effect of making VAM results a defined percentage of a teacher’s evaluation – is thus misguided (and a potential recipe for the reinforcement of unhelpful behaviors).  This concern is especially germane because VAM results are often viewed as “grades” by the teachers receiving them – even if they are only a small percentage of a teacher’s evaluation “score” – and thus threaten to overwhelm other, potentially productive elements of an evaluation conversation.
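The “less than a fifth” figure above is my summary of the research; the sketch below just illustrates what a share that small implies, using an assumed split of 15% teacher versus 85% everything else for a single year’s result.

```python
import numpy as np

# Illustration only (assumed 15% / 85% split): how often does a genuinely
# above-average teacher end up with a below-average single-year result?
rng = np.random.default_rng(1)
n = 100_000
teacher_share = 0.15                                    # assumed share of the variation
teacher = rng.normal(0, np.sqrt(teacher_share), n)      # the part the teacher controls
other = rng.normal(0, np.sqrt(1 - teacher_share), n)    # luck and outside-of-school factors

observed = teacher + other                              # what the score actually reflects
flipped = np.mean((observed < 0)[teacher > 0])
print(f"Above-average teachers with below-average results: {flipped:.0%}")  # about 37% here
```

Under those assumptions, well over a third of genuinely above-average teachers post a below-average result in any given year – which is why reading a single score as a verdict on whether the teacher “did a good job” is so misguided.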

A better evaluation system would focus on actionable feedback about things over which a teacher has direct control.  Student performance should absolutely be included in the teacher evaluation process, but instead of making VAM a defined percentage of a teacher’s evaluation (part of a “grade”), evaluators should give teachers feedback on how well they use information about student performance to analyze their teaching practices and adapt their instruction accordingly.  This approach, unlike the approach favored by many VAM proponents, would help a teacher improve over time.

2) A second core purpose of teacher evaluation is to help evaluators make personnel decisions. Relative to the evaluation system described above – one that focuses on actions over which a teacher has control – making VAM results a defined percentage of teacher evaluations does not help us with this purpose, either.  Suppose a teacher gets a bad VAM result.  If that result is consistent with classroom observation data, the quality of assigned work, and various other elements of the teacher’s practice, an evaluator shouldn’t need it to conclude that the teacher is ineffective.

If there is a discrepancy between the VAM result and the other measures, on the other hand, there are a few possibilities.  The VAM results might have been unlucky.  The teaching practices the teacher employed might not be as useful as the teacher or evaluator thought they would be.  Or perhaps VAM isn’t a very good indicator of teacher quality (there’s also a possibility that the various other measures aren’t good indicators of teacher quality, but the measures suggested all have more face validity – meaning that they’re more intuitively likely to reflect relevant information – than do VAM results).  Under any of these alternative scenarios, using VAM results as a defined percentage of a teacher’s evaluation makes us more likely both to fire teachers who might actually be good at their jobs and to reward teachers who might not be.

When we evaluate schools on student outcomes, we reward (and punish) them for factors they don’t directly control. A more intelligent and fair approach would evaluate the actions schools take in pursuit of better student outcomes, not the outcomes themselves.

To be fair, question 1 could have some relevance for this purpose of teacher evaluation; if VAM results were an excellent indicator of teaching quality (again, they aren’t, but let’s suspend disbelief for a moment), that would negate one of the concerns above and make us more confident in using VAM for reward and punishment.  Yet even in this case the defined-percentage approach would hold little if any advantage over the properly designed evaluation system described above in helping administrators make personnel decisions, and it would be significantly more likely both to feel unfair to teachers and to result in a variety of negative consequences.

I’ve had many conversations with proponents of making VAM a defined percentage of teacher evaluations, and not a single one has been able to explain why their approach to VAM is better than an alternative approach that focuses on aspects of teaching practice – like creating a safe classroom environment, lesson planning, analyzing student data, and delivering high-quality instruction – over which teachers have more control.

So while the answer to question 1 in the case of VAM is that, despite its shortcomings, it may provide useful information, the answer to question 2 – should VAM results be used as a defined percentage of teacher evaluations? – is a resounding “no.”  And those who understand the crucial distinction between the two questions know that no amount of papers, articles, or researcher opinions, however interesting or useful for other purposes they may be, is ever going to change that fact.


Approaching Education Data the Nate Silver Way

My girlfriend’s very hospitable and generous family gave me some great gifts for the holidays when I stayed with them in upstate New York.  As I rocked my new Teach For America T-shirt in the Rochester airport on Christmas Eve, a cursory first read of Nate Silver’s new book, The Signal and the Noise, inspired me to write this post.

While most people probably know Silver for his election predictions and designation in 2009 as one of the world’s 100 Most Influential People, Silver has been my baseball stat guru for considerably longer than he’s been doing political analysis.  In one of my favorite books of all time, Baseball Between the Numbers, Silver penned a brilliant examination of clutch hitting that I still quote at least four or five times a year.  I have generally found Silver’s arguments compelling not just because of his statistical brilliance, but also because of his high standards for data collection and analysis, evident in the following passage from the introduction of his book:

The numbers have no way of speaking for themselves.  We speak for them.  We imbue them with meaning…[W]e may construe them in self-serving ways that are detached from their objective reality…Before we demand more of our data, we need to demand more of ourselves.

In few fields are Silver’s words as relevant as in education.  While the phrase “data-driven” has become ubiquitous in discussions of school reform and high-quality instruction, most people discussing education have very little understanding of what the statistics actually say.  As I’ve written before, many studies that reformers reference to push their policy agendas are methodologically unsound, and many more have findings very different from the summaries that make it into the news.

It’s hard to know how many reformers just don’t understand statistics, how many fall victim to confirmation bias, and how many intentionally mislead people.  But no matter the reason for their errors, those of us who care about student outcomes have a responsibility to identify statistical misinterpretation and manipulation and correct it.  Policy changes based on bad data and shoddy analyses won’t help (and will quite possibly harm) low-income students.

Fortunately, I believe one simple practice can help us identify truth in education research: read the full text of education research articles.

Yes, reading the full text of academic research papers can be time-consuming and, at times, mind-numbingly dull, but it is vitally important if you want to understand research findings.  Sound bites on education studies rarely provide accurate information.  In a Facebook comment following my most recent post about TFA, a former classmate of mine referenced a 2011 study co-authored by Raj Chetty to argue that we can’t blame the achievement gap on poverty.  “If you leave a low value-added teacher in your school for 10 years, rather than replacing him with an average teacher, you are hypothetically talking about $2.5 million in lost income,” claims one of the co-authors of the study in a New York Times article.  Sounds impressive.  Look under the hood, however, and we find that, even assuming the study’s methodology is foolproof (it isn’t), the actual evidence can at best show an average difference of $182 in the annual salaries of 28-year-olds.
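To see how a modest per-person effect becomes a multimillion-dollar headline, here’s a rough back-of-envelope in Python.  The class size and career length are my assumptions, not the study’s, and I’m ignoring the study’s discounting entirely, so the output is only meant to show orders of magnitude.

```python
# Back-of-envelope only; class size and career length are assumptions, and
# no discounting is applied, so treat these as orders of magnitude.
headline_total = 2_500_000   # the quoted lost-income figure for 10 years of teaching
years_teaching = 10          # "leave a low value-added teacher in your school for 10 years"
class_size = 28              # assumed students per classroom per year
career_years = 40            # assumed working career per student

per_classroom_year = headline_total / years_teaching     # ~$250,000 per class taught
per_student_lifetime = per_classroom_year / class_size   # ~$8,900 per student, lifetime
per_student_year = per_student_lifetime / career_years   # ~$220 per student, per year

print(f"Per classroom-year:    ${per_classroom_year:,.0f}")
print(f"Per student, lifetime: ${per_student_lifetime:,.0f}")
print(f"Per student, per year: ${per_student_year:,.0f}")
```

Spread across hundreds of students and four decades of earnings apiece, the headline number dissolves into a couple hundred dollars per person per year – the same ballpark as the $182 figure above.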

As I’ve mentioned before, there’s also a poor statistical basis for linking students’ standardized test scores to teacher evaluation systems.  Otherwise useful analyses can give readers the wrong impression when they gloss over or omit this fact, a point underscored by a recent article describing an analysis of IMPACT (the D.C. Public Schools teacher evaluation system).  The full text of the study provides strong evidence that the success of D.C.’s system thus far has been achieved despite a lack of variation in standardized test score results among teachers in different effectiveness categories.  Instead, the successes of the D.C. evaluation system are driven by practices teachers unions frequently support, like robust and meaningful classroom observations that more accurately measure teacher effectiveness.

Policymakers have misled the public with PISA data as well.  In a recent interview with MSNBC’s Chris Hayes, Michelle Rhee made the oft-repeated claim that U.S. schools are failing because American students, in aggregate, score lower on international tests than their peers in other countries.  Yet, as Hayes pointed out, it is abundantly clear from a more thorough analysis that poverty explains the PISA results much better than school quality, not least because poor U.S. students have been doing better on international tests than poor students elsewhere for several years.

I would, in general, recommend skepticism when reading articles on education, but I’d recommend skepticism in particular when someone offers a statistic suggesting that school-related changes can solve the achievement gap.  The clearest conclusion in education research right now is that poverty explains the majority of the variation in student outcomes.  The full text of Chetty’s most recent study defending value-added models acknowledges that “differences in teacher quality are not the primary reason that high SES students currently do much better than their low SES peers” and that “differences in [kindergarten through eighth grade] teacher quality account for only…7% of the test score differences” between low- and high-income schools.  In fact, that more recent study performs a hypothetical experiment in which the lowest-performing low-income students receive the “best” teachers and the highest-performing affluent students receive the “worst” teachers from kindergarten through eighth grade and concludes that the affluent students would still outperform the poor students on average (albeit by a much smaller margin).  Hayes made the same point to Rhee that I made in my last post: because student achievement is influenced significantly more by poverty than by schools, discussions about how to meet our students’ needs must address income inequality in addition to evidence-based school reforms.  We can’t be advocates for poor students and exclude policies that address poverty from our recommendations.

When deciding which school-based recommendations to make, we must remember that writers and policymakers all too often misunderstand education research.  Many reformers selectively highlight decontextualized research that supports their already-formed opinions.  Our students, on the other hand, depend on us to combat misleading claims by doing our due diligence, unveiling erroneous interpretations, and ensuring that sound data and accurate statistical analyses drive decision-making. They rely on us to adopt Nate Silver’s approach to baseball statistics: continuously ask questions, keep an open mind about potential answers, and conduct thorough statistical analyses to better understand reality.  They rely on us to distinguish statistical significance from real-world relevance.  As Silver writes about data in the information age more generally, education research “will produce progress – eventually.  How quickly it does, and whether we regress in the meantime, depends on us.”

Update: Gary Rubinstein and Bruce Baker (thanks for the heads up, Demian Godon) have similar orientations to education research – while we don’t always agree, I appreciate their approach to statistical analysis.

Update 2 (6/8/14): Matthew Di Carlo is an excellent read for anyone interested in thoughtful analysis of educational issues.

Update 3 (7/8/14): The Raj Chetty study linked above seems to have been modified – the pieces I quoted have disappeared.  Not sure when that happened, or why, but I’d love to hear an explanation from the authors and see a link to the original.
