Is VAM a Sham? Depends on the Question You’re Asking.

  1. “Does data source X provide useful information?” and
  2. “Should data source X be used for purpose Y?”

are two very different questions.  Unfortunately, education researchers, writers, and advocates conflate these questions far too frequently, and the result is bad policy recommendations.

This problem surfaces especially often in debates about value added modeling (VAM), a statistical method aimed at capturing a teacher’s effectiveness in the classroom.  Based on a new paper from economists Raj Chetty, John Friedman, and Jonah Rockoff, Andrew Flowers writes, in response to question 1 above, that we’re pretty good at “the science of grading teachers” with VAM results.  Flowers weighs in on question 2 as well, arguing that Chetty et al.’s work means that “administrators can legitimately use value-added scores to hire, fire and otherwise evaluate teacher performance.”

In terms of question 1, the idea that VAM research indicates that we’re pretty good at “grading teachers” is itself debatable.  Flowers doesn’t conduct an extensive survey of researchers or research, but focuses on six well-known veterans of VAM debates, including several of the more outspoken defenders of the metric (Chetty and Thomas Kane specifically; Friedman, Rockoff, and Douglas Staiger are also longtime VAM supporters).  While many respected academics caution about VAM’s limitations and/or have more nuanced positions on its use, Jesse Rothstein is the only one Flowers cites.

In fact, whether VAM estimates are systematically biased (Rothstein’s argument) or not (Chetty et al.’s contention), there are legitimate questions about whether VAM results are valid (whether or not they are really capturing “teacher effectiveness” in the way that most people think about it).  VAM estimates correlate surprisingly little with other measures aimed at capturing effective teaching (like experts’ assessments of classroom instruction).  They’re also notoriously unstable, meaning that a teacher’s scores bounce around a lot depending on the year and test studied.  While other methods of evaluating teacher effectiveness have similar issues and there are certain approaches to VAM (not commonly used) that are more useful than others, it’s perfectly reasonable to argue that we’re still pretty bad at “grading teachers.”
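To see how that instability can arise even in a clean setting, here’s a minimal simulation sketch (all parameter values are invented for illustration – this is not any published model): the same teachers, with fixed “true” effects, are scored in two consecutive years, and the year-to-year correlation of their estimates comes out well below 1 because classroom-level noise swamps the stable teacher signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, class_size = 200, 25

# Stable "true" teacher effects (sd 0.10 in test-score units, an assumption)
true_effect = rng.normal(0.0, 0.10, n_teachers)

def yearly_vam(rng):
    """One year of a toy VAM: each teacher's estimate is her classroom's
    mean score gain, which mixes her true effect with factors she doesn't
    control (the student draw, measurement error, luck)."""
    noise = rng.normal(0.0, 0.50, (n_teachers, class_size))  # non-teacher factors
    gains = true_effect[:, None] + noise
    return gains.mean(axis=1)

year1, year2 = yearly_vam(rng), yearly_vam(rng)
r = np.corrcoef(year1, year2)[0, 1]
print(f"year-to-year correlation of the estimates: {r:.2f}")
```

Even though every teacher’s underlying effect is identical across the two years by construction, the estimates correlate only moderately – the “bounce” comes entirely from factors outside the teacher’s control.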

More importantly, however, debates about bias, validity, and stability in VAM actually have much less to do with the answer to question 2 – should we use VAM to evaluate teachers in the way its proponents recommend? – than many people think.  To understand why, we need look no further than two of the core purposes of teacher evaluation, purposes on which everyone from teachers unions to education reform organizations generally agrees (at least rhetorically).

1) One core purpose of teacher evaluation is helping teachers improve. Making VAM results a defined percentage of a teacher’s evaluation is not useful for this purpose even if we assume VAM results are unbiased, valid, and stable.  Such a policy may actually undermine teacher improvement, and hence the quality of instruction that students receive.

For starters, a VAM score is opaque.  Teachers cannot match their VAM results back to specific questions on a test or use them to figure out what their students did or didn’t know.  VAM may be able to tell a teacher if her students did well or poorly on a specific test, but not why students did well or poorly.  In other words, a VAM score provides no actionable feedback. It does not indicate anything about what a teacher can do to help her students learn.

In addition, VAM results are outcomes over which a teacher has very limited control – research typically finds that teachers account for less than a fifth of the variation in student test scores (the rest is mostly random error and outside-of-school factors).  If a teacher’s VAM results look good, that might be because the teacher did something well, but it also might be because the teacher got lucky, or because some other factor contributed to her students’ success.  The tendency to view VAM results as indicative of whether or not a teacher did a good job – a common side effect of making VAM results a defined percentage of a teacher’s evaluation – is thus misguided (and a potential recipe for the reinforcement of unhelpful behaviors).  This concern is especially germane because VAM results are often viewed as “grades” by the teachers receiving them – even if they are only a small percentage of a teacher’s evaluation “score” – and thus threaten to overwhelm other, potentially productive elements of an evaluation conversation.
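A back-of-the-envelope sketch makes the variance decomposition concrete (the component sizes below are assumptions chosen to roughly match the “less than a fifth” figure, not estimates from any particular study):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # simulated students

# Assumed variance components, in standardized test-score units
teacher = rng.normal(0.0, np.sqrt(0.10), n)  # teacher contribution
outside = rng.normal(0.0, np.sqrt(0.60), n)  # family, prior achievement, peers, ...
error   = rng.normal(0.0, np.sqrt(0.30), n)  # measurement error and luck

score = teacher + outside + error
teacher_share = teacher.var() / score.var()
print(f"teacher share of score variance: {teacher_share:.0%}")
```

With these (invented) components the teacher’s share comes out around 10% – so even a genuinely good year moves a teacher’s VAM result far less than the surrounding noise does.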

A better evaluation system would focus on actionable feedback about things over which a teacher has direct control.  Student performance should absolutely be included in the teacher evaluation process, but instead of making VAM a defined percentage of a teacher’s evaluation (part of a “grade”), evaluators should give teachers feedback on how well they use information about student performance to analyze their teaching practices and adapt their instruction accordingly.  This approach, unlike the approach favored by many VAM proponents, would help a teacher improve over time.

2) A second core purpose of teacher evaluation is to help evaluators make personnel decisions. Relative to the evaluation system described above – one that focuses on actions over which a teacher has control – making VAM results a defined percentage of teacher evaluations does not help us with this purpose, either.  Suppose a teacher gets a bad VAM result.  If that result is consistent with classroom observation data, the quality of assigned work, and various other elements of the teacher’s practice, an evaluator shouldn’t need it to conclude that the teacher is ineffective.

If there is a discrepancy between the VAM result and the other measures, on the other hand, there are a few possibilities.  The VAM results might have been unlucky.  The teaching practices the teacher employed might not be as useful as the teacher or evaluator thought they would be.  Or perhaps VAM isn’t a very good indicator of teacher quality (there’s also a possibility that the various other measures aren’t good indicators of teacher quality, but the measures suggested all have more face validity – meaning that they’re more intuitively likely to reflect relevant information – than do VAM results).  Under any of these alternative scenarios, using VAM results as a defined percentage of a teacher’s evaluation makes us more likely both to fire teachers who might actually be good at their jobs and to reward teachers who might not be.

When we evaluate schools on student outcomes, we reward (and punish) them for factors they don’t directly control. A more intelligent and fair approach would evaluate the actions schools take in pursuit of better student outcomes, not the outcomes themselves.


To be fair, question 1 could have some relevance for this purpose of teacher evaluation; if VAM results were an excellent indicator of teaching quality (again, they aren’t, but let’s suspend disbelief for a moment), that would negate one of the concerns above and make us more confident in using VAM for reward and punishment.  Yet even in this case the defined-percentage approach would hold little if any advantage over the properly designed evaluation system described above in helping administrators make personnel decisions, and it would be significantly more likely both to feel unfair to teachers and to result in a variety of negative consequences.

I’ve had many conversations with proponents of making VAM a defined percentage of teacher evaluations, and not a single one has been able to explain why their approach to VAM is better than an alternative approach that focuses on aspects of teaching practice – like creating a safe classroom environment, lesson planning, analyzing student data, and delivering high-quality instruction – over which teachers have more control.

So while the answer to question 1 in the case of VAM is that, despite its shortcomings, it may provide useful information, the answer to question 2 – should VAM results be used as a defined percentage of teacher evaluations? – is a resounding “no.”  And those who understand the crucial distinction between the two questions know that no amount of papers, articles, or researcher opinions, however interesting or useful for other purposes they may be, is ever going to change that fact.


Filed under Education

8 responses to “Is VAM a Sham? Depends on the Question You’re Asking.”

  1. Matt Barnum

    Ben – I think where you miss the mark is with what I see as an overly simplistic/binary description of what to do when a VAM score and an observation score differ. To me this suggests a failure to think at the margin.

    Imagine a situation where a teacher’s observations are mediocre and a principal is deciding whether to, say, dismiss her. Wouldn’t you as a school leader want the VAM score to help inform that decision? If it was very bad, dismissal might be appropriate; if it too was mediocre, then maybe dismissal but maybe another year of intensive support; if it was above average, then probably no dismissal and perhaps a reflection on the observation protocol.

    I think this is particularly the case when we have, via the Chetty study, evidence of validity much stronger than any such validity evidence for observations. (I agree about face validity, but it only gets you so far.) In general it seems like the more (valid) information we have to make personnel decisions, the better.

    I’m also wondering if your position here is falsifiable? And if so what evidence would it take to change your mind?

    Just to end with a couple points of agreement: I don’t think test scores should be a set proportion of teachers’ evaluations. I think this is an overly mechanical form of evaluation that disempowers school leaders and is rarely (if ever?) seen in the evaluation of other professionals.

    Secondly, I think there have been some major mistakes in how reformers have used tests, particularly in teacher evaluation. I wrote about this extensively in what I think was a very tough (but I think fair) assessment of Arne Duncan’s legacy https://www.the74million.org/article/arne-duncans-wrong-turn-on-reform-how-federal-dollars-fueled-the-testing-backlash

    Matt

    • Thanks, Matt, for the thoughtful comment. I appreciate many of the issues you highlighted in your article about Duncan. However, after reading it, I’m not sure I understand your position on the usage of test scores in teacher evaluation. My interpretation is that you don’t believe in defined percentages but do believe in telling an administrator to weigh, however he or she pleases, a large basket of factors that includes VAM – is that right? If so, I think that’s a little better but still misguided, as I’ll explain below.

      A couple notes about your hypothetical evaluation scenario:

      1) Thinking at the margin is certainly valuable in any policy discussion, but an important thing to remember about it is that we rarely encounter the marginal scenario. Suppose approach A is best in 99% of cases and approach B is best in 1% of cases. We should generally prefer approach A, shouldn’t we? That calculation may change, of course, if the 1% of cases for which approach B works better are extremely significant, but even in this case, we aren’t designing policy in a vacuum – we can potentially think of policy complements to address these anomalous cases instead of implementing policy that’s worse for the other 99%.

      2) I disagree about the implications of Chetty’s research – it does not indicate that VAM has more validity than other measures aimed at assessing teaching quality. In particular, I think the significantly higher face validity of multiple observations by multiple trained observers is considerably more compelling given the comparable stability of the two measures (see http://www.ajeforum.com/on-the-instability-of-teacher-effectiveness-measures-by-morgan-polikoff/). Lots of studies have shown issues with classroom observations, but the same is true for VAM.

      Based on #2, therefore, your hypothetical is really easy – if a teacher’s overall evaluation on other metrics was too close to call and the teacher had an outlier VAM score, a school leader would be foolish to make the VAM score the deciding factor in an employment decision.

      Since I argue in my piece that the appropriate policy doesn’t really depend on VAM’s validity, however, let’s imagine you’re right and VAM is pretty valid. In this case, if I were a school leader, I would certainly be tempted to use VAM. Yet it would be a mistake to allow me to do so, and the reason gets back to the formative purpose of teacher evaluation and my responsibility as a principal. If I can’t make a call based on the other measures and VAM is valid, that means I don’t really know what I’m doing when it comes to the other measures. That could be my fault as a school leader, or it could be a reflection of our general inability to identify the components of good teaching – either way, we’d have a bigger problem on our hands than one teacher who wasn’t as good as (or was better than) the other measures indicated, a problem that would require significantly more attention. Allowing me to make the decision based on VAM would help me evade improvement as a school leader and/or disincentivize exploration into productive elements of teaching. And a negative decision would do so at the expense of a teacher – a human being – who was not given useful guidance about how to get better.

      I like to think of the classroom as a microcosm in these scenarios – would it be appropriate for a teacher to make a negative decision about a student’s future when the teacher had failed to help the student reach his or her potential? I don’t think so, and I suspect you wouldn’t think so, either.

      There are of course differences between the classroom and an employment setting, but not as many as a lot of adults pretend – people are people, and we should treat them like people, even after they’ve turned 18. We absolutely must think about the good of the students in the employment situation, but not just the short-term good – the long-term good matters as well. There’s the proximate question of whether our practices make them more likely to be exposed to better teaching in the long run, and I’d say the answer with your approach is still no, even with a “VAM is valid” assumption. And the more important question, in my opinion, is: what type of society do we want students to grow up in?

      In terms of what it would take for you to convince me that your approach is better, you’d have to show that, compared to alternative systems like the one I’ve written about, the inclusion of VAM in teacher evaluations can help teachers improve more and is better for making appropriate personnel decisions. You’d also need to present a compelling rationale about why all the theory and research about such measures in other fields doesn’t apply to teaching, and it would be great if you could get teachers to like your proposal.

      I recognize that that’s a pretty tough falsification test. But I think that’s appropriate given that the conclusions in this piece are grounded in a combination of a pretty extensive body of research in other domains, the compelling theory, and my experience teaching and serving as an instructional coach to other teachers, all of which point in the same direction.

  2. Matt Barnum

    Ben –

    Here’s how I understand your argument. Correct me if I’m misstating any of it:

    1) If we’ve developed a quality observation system, VAM is unnecessary.
    2) If we haven’t developed a quality observation system, we have much bigger problems that VAM won’t solve and we should focus our energy on improving the observation system.

    The first argument I find odd. Most systems for making high-stakes decisions rely on multiple measures for two main reasons: 1) all metrics, even extremely good ones, are imperfect and have measurement error 2) different metrics capture different aspects of quality that we care about.

    Your analogy to how we treat students makes this point. Teachers rely on many assignments to give grades (including high-stakes tests!); colleges rely on many measures to make admissions decisions (including high-stakes tests!).

    To your second point, yes, this is the world we live in. There are lots of reasons to be concerned about the quality of teacher observations. We should be working hard to improve how observations are conducted (which I think is happening), but meanwhile we have to make personnel decisions. I don’t understand why we wouldn’t use the tools we have at our disposal to do so. And I don’t understand why you assume using VAM will inevitably hurt the teacher in this case. It might very well save a teacher from a negative personnel decision created by our imperfect observation protocols!

    I do believe that your position is essentially nonfalsifiable. For example, we do have evidence that VAM helps us make better personnel decisions. A recent study by Kraft showed that combining VAM and observations was best among competing options for making layoff decisions, as measured by subsequent test scores: https://www.the74million.org/article/new-study-ignoring-teacher-performance-in-layoffs-hurts-kids. I’m guessing you’ll reject this evidence out of hand as circular. (I don’t agree with that assessment, but that’s another discussion.) But fine. So my question is: what is the study that Kraft should (and could) have run that would have convinced you of his hypothesis?

    Matt

    • Your discussion of what you think my argument is, which is not quite right, suffers from the same problem as your discussion of layoff policy: it is far too narrow in its scope. It deals only with a teacher evaluation system’s implications for personnel decisions, not with its implications for teacher improvement, which, as I wrote in my piece, is the commonly-agreed-upon primary purpose of teacher evaluation. We must evaluate systems and policies comprehensively.

      As I explained in this post, using VAM as a component of the personnel decision – rather than using student performance in the alternative ways I’ve consistently recommended – does not help teachers get better (and may in fact undermine their improvement). It’s hard to see a scenario in which it would, which is why there’s a gigantic strike against this approach from the get-go and the falsifiability criteria are deservedly high.

      Given that reality, my argument is as follows:

      1) If there are a lot of concerns about whether or not VAM is a valid and stable measure of teaching effectiveness – which is, in my view, the case today – VAM provides little information for a personnel decision beyond that provided by a system that uses multiple other measures (especially if those measures have more face validity, which we both agree is true). In the context of VAM’s implications for teacher development and the way people perceive information about teaching quality, there is no justification for the proposed use of VAM in personnel decisions.

      2) If your wildest dreams come true and VAM becomes a valid and stable measure of teaching effectiveness – a better one than our other measures – then it could help inform personnel decisions when tough calls have to be made. But its use would still undermine teacher improvement and have the negative externalities I mentioned in my first comment, and its utility would also be confined to relatively few cases. There would still be a very high bar for justifying the use you recommend.

      Also, for the record, I’m opposed to high-stakes decision-making using test scores for students as well (a position that I did not always hold but that has developed, like my position on teacher evaluation, in response to sound theory, a large body of empirical evidence in multiple fields, and my experience in the classroom), even though they typically have more control over their scores than teachers do over VAM results. I’d like to see our education system focus more on helping students develop and learn than on ranking and categorizing them. I recognize that standardized test scores could boost the “grades” of some students and teachers, but while that’s less worrisome from the perspective of the student or teacher who is positively affected, it eliminates very few of the negative externalities I’ve laid out.

      On layoffs, I think the terms on which you want to have this discussion again embody a typical problem in education policy conversations. Layoffs occur not in response to teacher ineffectiveness but in response to inadequate funding. The primary focus for anyone concerned about layoffs, therefore, should be on adequately funding schools and making sure we don’t need to dismiss teachers we’d otherwise keep. (I support neither LIFO nor the use of VAM results in the unfortunate circumstance in which layoffs happen, as I’ve written in other pieces; I’d therefore say I agree with part of Kraft’s hypothesis – the idea that LIFO is suboptimal. If you want to know what it would take for me to support including your proposed use of VAM in teacher evaluation practices, I’ve already explained that in my previous comment.)

      One final note about your summary of Kraft’s study: I was bummed to see you use the bogus “months of learning” conversion in your article (something Kraft fortunately did not do in his writeup). Kraft’s results are not “fairly large,” as you assert, but practically insignificant – there’s a big difference between statistical and practical significance, especially with large data sets (otherwise, I agree that it doesn’t make sense to get into the weeds of his study in this conversation).
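The statistical-versus-practical distinction is easy to demonstrate: with administrative-scale samples, even a trivially small difference clears conventional significance thresholds. A toy example (invented numbers, standardized scores):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000  # administrative-data-sized groups

control = rng.normal(0.00, 1.0, n)   # standardized test scores
treated = rng.normal(0.02, 1.0, n)   # a 0.02 SD advantage: practically negligible

diff = treated.mean() - control.mean()
se = np.sqrt(control.var(ddof=1) / n + treated.var(ddof=1) / n)
z = diff / se  # two-sample z statistic
print(f"effect: {diff:.3f} SD, z = {z:.1f}")
```

The z statistic lands far beyond any conventional cutoff, yet an effect of a couple hundredths of a standard deviation is a rounding error in practical terms – which is why statistical significance alone says little about whether a result is “fairly large.”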

  3. Matt Barnum

    For now I’ll just answer the question you posed on Twitter: what it would take for me to change my views on VAM.

    1) Strong evidence that VAM is not predictive of important outcomes; or alternatively that it does not add any predictive value when used in combination with other measures such as teacher observations
    2) Relatedly, strong evidence that the validity of VAM breaks down to a point that it’s no longer useful when applied in high-stakes circumstances
    3) Strong evidence of the negative externalities on teacher improvement that you’ve discussed, which outweighs the benefits to its personnel use as I see them (I know of no such empirical evidence currently, but do feel free to direct me to anything I’m missing)

    I think it’s plausible that new evidence will come in and I’ll change my mind about VAM. But right now the existing literature convinces me that it provides useful data that can improve personnel decisions. In contrast, I think you’ve set a falsification standard that will not and cannot be met (which is fine! it just means you’re basing your view on something other than the empirical evidence).

    I think from my end I’ll leave it at that. Thanks, as always, for the dialogue, Ben.

  4. Great post. Have you read Rethinking Value-Added Models in Education? Even if you have, you can review it easily at http://DrDougGreen.com. Here is the specific link to my summary: http://bit.ly/1zaiI6Y Keep up the good work. Doug
