
Is VAM a Sham? Depends on the Question You’re Asking.

  1. “Does data source X provide useful information?” and
  2. “Should data source X be used for purpose Y?”

are two very different questions.  Unfortunately, conflation of these questions by education researchers, writers, and advocates far too frequently results in bad policy recommendations.

This problem surfaces especially often in debates about value added modeling (VAM), a statistical method aimed at capturing a teacher’s effectiveness in the classroom.  Based on a new paper from economists Raj Chetty, John Friedman, and Jonah Rockoff, Andrew Flowers writes, in response to question 1 above, that we’re pretty good at “the science of grading teachers” with VAM results.  Flowers weighs in on question 2 as well, arguing that Chetty et al.’s work means that “administrators can legitimately use value-added scores to hire, fire and otherwise evaluate teacher performance.”

In terms of question 1, the idea that VAM research indicates that we’re pretty good at “grading teachers” is itself debatable.  Flowers doesn’t conduct an extensive survey of researchers or research, but focuses on six well-known veterans of VAM debates, including several of the more outspoken defenders of the metric (Chetty and Thomas Kane specifically; Friedman, Rockoff, and Douglas Staiger are also longtime VAM supporters).  While many respected academics caution about VAM’s limitations and/or have more nuanced positions on its use, Jesse Rothstein is the only one Flowers cites.

In fact, whether VAM estimates are systematically biased (Rothstein’s argument) or not (Chetty et al.’s contention), there are legitimate questions about whether VAM results are valid (whether or not they are really capturing “teacher effectiveness” in the way that most people think about it).  VAM estimates correlate surprisingly little with other measures aimed at capturing effective teaching (like experts’ assessments of classroom instruction).  They’re also notoriously unstable, meaning that a teacher’s scores bounce around a lot depending on the year and test studied.  While other methods of evaluating teacher effectiveness have similar issues and there are certain approaches to VAM (not commonly used) that are more useful than others, it’s perfectly reasonable to argue that we’re still pretty bad at “grading teachers.”

More importantly, however, debates about bias, validity, and stability in VAM actually have much less to do with the answer to question 2 – should we use VAM to evaluate teachers in the way its proponents recommend? – than many people think.  To understand why, we need look no farther than two of the core purposes of teacher evaluation, purposes which everyone from teachers unions to education reform organizations generally agree about (at least rhetorically).

1) One core purpose of teacher evaluation is helping teachers improve. Making VAM results a defined percentage of a teacher’s evaluation is not useful for this purpose even if we assume VAM results are unbiased, valid, and stable.  Such a policy may actually undermine teacher improvement, and hence the quality of instruction that students receive.

For starters, a VAM score is opaque.  Teachers cannot match their VAM results back to specific questions on a test or use them to figure out what their students did or didn’t know.  VAM may be able to tell a teacher if her students did well or poorly on a specific test, but not why students did well or poorly.  In other words, a VAM score provides no actionable feedback. It does not indicate anything about what a teacher can do to help her students learn.

In addition, VAM results are outcomes over which a teacher has very limited control – research typically finds that teachers account for less than a fifth of the variation in student test scores (the rest is mostly random error and outside-of-school factors).  If a teacher’s VAM results look good, that might be because the teacher did something well, but it also might be because the teacher got lucky, or because some other factor contributed to her students’ success.  The tendency to view VAM results as indicative of whether or not a teacher did a good job – a common side effect of making VAM results a defined percentage of a teacher’s evaluation – is thus misguided (and a potential recipe for the reinforcement of unhelpful behaviors).  This concern is especially germane because VAM results are often viewed as “grades” by the teachers receiving them – even if they are only a small percentage of a teacher’s evaluation “score” – and thus threaten to overwhelm other, potentially productive elements of an evaluation conversation.

A better evaluation system would focus on actionable feedback about things over which a teacher has direct control.  Student performance should absolutely be included in the teacher evaluation process, but instead of making VAM a defined percentage of a teacher’s evaluation (part of a “grade”), evaluators should give teachers feedback on how well they use information about student performance to analyze their teaching practices and adapt their instruction accordingly.  This approach, unlike the approach favored by many VAM proponents, would help a teacher improve over time.

2) A second core purpose of teacher evaluation is to help evaluators make personnel decisions. Relative to the evaluation system described above – one that focuses on actions over which a teacher has control – making VAM results a defined percentage of teacher evaluations does not help us with this purpose, either.  Suppose a teacher gets a bad VAM result.  If that result is consistent with classroom observation data, the quality of assigned work, and various other elements of the teacher’s practice, an evaluator shouldn’t need it to conclude that the teacher is ineffective.

If there is a discrepancy between the VAM result and the other measures, on the other hand, there are a few possibilities.  The VAM results might have been unlucky.  The teaching practices the teacher employed might not be as useful as the teacher or evaluator thought they would be.  Or perhaps VAM isn’t a very good indicator of teacher quality (there’s also a possibility that the various other measures aren’t good indicators of teacher quality, but the measures suggested all have more face validity – meaning that they’re more intuitively likely to reflect relevant information – than do VAM results).  Under any of these alternative scenarios, using VAM results as a defined percentage of a teacher’s evaluation makes us more likely both to fire teachers who might actually be good at their jobs and to reward teachers who might not be.

When we evaluate schools on student outcomes, we reward (and punish) them for factors they don’t directly control. A more intelligent and fair approach would evaluate the actions schools take in pursuit of better student outcomes, not the outcomes themselves.

Relative to teacher evaluation systems that focus on things over which a teacher has direct control, using VAM results as a defined percentage of a teacher’s evaluation makes us more likely both to fire teachers who might actually be good at their jobs and to reward teachers who might not be.

To be fair, question 1 could have some relevance for this purpose of teacher evaluation; if VAM results were an excellent indicator of teaching quality (again, they aren’t, but let’s suspend disbelief for a moment), that would negate one of the concerns above and make us more confident in using VAM for reward and punishment.  Yet even in this case the defined-percentage approach would hold little if any advantage over the properly-designed evaluation system described above in helping administrators make personnel decisions, and it would be significantly more likely both to feel unfair to teachers and to result in a variety of negative consequences.

I’ve had many conversations with proponents of making VAM a defined percentage of teacher evaluations, and not a single one has been able to explain why their approach to VAM is better than an alternative approach that focuses on aspects of teaching practice – like creating a safe classroom environment, lesson planning, analyzing student data, and delivering high-quality instruction – over which teachers have more control.

So while the answer to question 1 in the case of VAM is that, despite its shortcomings, it may provide useful information, the answer to question 2 – should VAM results be used as a defined percentage of teacher evaluations? – is a resounding “no.”  And those who understand the crucial distinction between the two questions know that no amount of papers, articles, or researcher opinions, however interesting or useful for other purposes they may be, is ever going to change that fact.


Filed under Education

Eric Lerum and I Debate Teacher Evaluation and the Role of Anti-Poverty Work (Part 2)

StudentsFirst Vice President Eric Lerum and I recently began debating the use of standardized test scores in high stakes decision-making.  I argued in a recent blog post that we should instead evaluate teachers on what they directly control – their actions.  Our conversation, which began to touch on additional interesting topics, is continued below.

Click here to read Part 1 of the conversation.

Lerum: To finish the outcomes discussion – measuring teachers by the actions they take is itself measuring an input. What do we learn from evaluating how hard a teacher tries? And is that enough to evaluate teacher performance? Shouldn’t performance be at least somewhat related to the results the teacher gets, independent of how hard she tries? If I put in lots of hours learning how to cook, assembling the perfect recipes, buying the best ingredients, and then even more hours in the kitchen – but the meal I prepare doesn’t taste good and nobody likes it, am I a good cook?

Regarding your use of probability theory and VAM – the problem I have with your analysis there is that VAM is not used to raise student achievement. So using it – even improperly – should not have a direct effect on student achievement. What VAM is used for is determining a teacher’s impact on student achievement, and thereby identifying which teachers are more likely to raise student achievement based on their past ability to do so. So even if you want to apply probability theory and even if you’re right, at best what you’re saying is that we’re unlikely to be able to use it to identify those teachers accurately on an ongoing basis. The larger point that is made repeatedly is that because outside factors play a larger overall role in impacting student achievement, we should not focus on teacher effectiveness and instead solve for these other factors. This is a key disconnect in the education reform debate. Reformers believe that focusing on things like teacher quality and focusing on improving circumstances for children outside of school need not be mutually exclusive. Teacher quality is still very important, as Shankerblog notes. Improving teacher quality and then doing everything we can to ensure students have access to great teachers does not conflict at all with efforts to eliminate poverty. In fact, I would view them as complementary. But critics of these reforms use this argument to say that one should come before the other – that because these other things play larger roles, we should focus our efforts there. That is misguided, I think – we can do both simultaneously. And as importantly in terms of the debate, no reformer that I know suggests that we should only focus on teacher quality or choice or whatever at the expense or exclusion of something else, like poverty reduction or improving health care.

If you’re interested in catching up on class size research, I highly recommend the paper published by Matt Chingos at Brookings, found here with follow-up here. To be clear about my position on class size, however: I’m not against smaller class sizes. If school leaders determine that is an effective way to improve instruction and student achievement in their school, they should utilize that approach. But it’s not the best approach for every school, every class, every teacher, or every child. And thus, state policy should reflect that. Mandating class size limits or restrictions makes no sense. It ties the hands of administrators who may choose to staff their schools differently and use their resources differently. It hinders innovation for educators who may want to teach larger classes in order to configure their classrooms differently, leverage technology or team teaching, etc. Why not instead leave decisions about staffing to school leaders and their educators?

The performance framework for San Jose seems pretty straightforward. I’m curious how you measure #2 (whether teachers know the subjects) – are those through rigorous content exams or some other kind of check?

I think a solid evaluation system would include measures using indicators like these. But you would also need actual student learning/growth data to validate whether those things are working – as you say, “student outcome results should take care of themselves.” You need a measure to confirm that.

I honestly think my short response to all of this would be that there’s nothing in the policies we advocate for that prevents what you’re talking about. And we advocate for meaningful evaluations being used for feedback and professional development – those are critical elements of bills we try to move in states. But as a state-level policy advocacy organization, we don’t advocate for specific models or types of evaluations. We believe certain elements need to be there, but we wouldn’t be advocating for states to adopt the San Jose model or any other specifically – that’s just not what policy advocacy is. So I think there’s just general confusion about that – that simply because you don’t hear us saying to build a model with the components you’re looking for, that must mean we don’t support it. In fact, we’re focused on policy at a level higher than the district level, and design and implementation of programs isn’t in our wheelhouse.

Spielberg: I believe you discuss three very important questions, each one of which deserves some attention:

1) Given that student outcomes are primarily determined by factors unrelated to teaching quality, can and should people still work on improving teacher effectiveness?

Yes!  While teaching quality accounts for, at most, a small percentage of the opportunity gap, teacher effectiveness is still very important.  Your characterization of reform critics is a common misconception; everyone I’ve ever spoken with believes we can work on addressing poverty and improving schools simultaneously.  Especially since we decided to have this conversation to talk about how to measure teacher performance, I’m not sure why you think I’d argue that “we should not focus on teacher effectiveness.”  I am critiquing the quality of some of StudentsFirst’s recommendations – they are unlikely to improve teacher effectiveness and have serious negative consequences – not the topic of reform itself.  I recommend we pursue policy solutions more likely to improve our schools.

Critics of reform do have a legitimate issue with the way education reformers discuss poverty, however.  Education research’s clearest conclusion is that poverty explains inequality significantly better than school-related factors.  Reformers often pay lip-service to the importance of poverty and then erroneously imply an equivalence between the impact of anti-poverty initiatives and education reforms.  They suggest that there’s far more class mobility in the United States than actually exists.  This suggestion harms low-income students.

As an example, consider the controversy that surrounded New York mayor Bill de Blasio several months ago.  De Blasio was a huge proponent of measures to reduce income inequality, helped reform stop-and-frisk laws that unfairly targeted minorities, had fought to institute universal pre-K, and had shown himself in nearly every other arena to fight for underprivileged populations.  While it would have been perfectly reasonable for StudentsFirst to disagree with him about the three charter co-locations (out of seventeen) that he rejected, StudentsFirst’s insinuation that de Blasio’s position was “down with good schools” was dishonest, especially since a comprehensive assessment of de Blasio’s policies would have indisputably given him high marks on helping low-income students.  At the same time, StudentsFirst aligns itself with corporate philanthropists and politicians, like the Waltons and Chris Christie, who actively exploit the poor and undermine anti-poverty efforts.  This alignment allows wealthy interests to masquerade as advocates for low-income students while they work behind the scenes to deprive poor students of basic services.  Critics argue that organizations like StudentsFirst have chosen the wrong allies and enemies.

I wholeheartedly agree that anti-poverty initiatives and smart education reforms are complementary.  I’d just like to see StudentsFirst speak honestly about the relative impact of both.  I’d also love to see you hold donors and politicians accountable for their overall impact on students in low-income communities.  Then reformers and critics of reform alike could stop accusing each other of pursuing “adult interests” and focus instead on the important work of improving our schools.

2) How can we use student outcome data to evaluate whether an input-based teacher evaluation system has identified the right teaching inputs?

This concept was the one we originally set out to discuss.  I’d love to focus on it in subsequent posts if that works for you (though I’d love to revisit the other topics in a different conversation if you’re interested).

I’m glad we agree that “a solid evaluation system would include [teacher input-based] measures…like [the ones used in San Jose Unified].”  I also completely agree with you that we need to use student outcome data “to validate whether those things are working.”  That’s exactly the use of student outcome data I recommend.  Though cooks probably have a lot more control over outcomes than teachers, we can use your cooking analogy to discuss how Bayesian analysis works.

We’d need to first estimate the probability that a given input – let’s say, following a specific recipe – is the best path to a desired outcome (a meal that tastes delicious).  This probability is called our “prior.”  Let’s then assume that the situation you describe occurs – a cook follows the recipe perfectly and the food turns out poorly.  We’d need to estimate two additional probabilities. First, we’d need to know the probability the food would have turned out badly if our original prediction was correct and the recipe was a good one.  Second, we’d need the probability that the food would have turned out poorly if our original prediction was incorrect and the recipe was actually a bad one.  Once we had those estimates, there’s a very simple formula we could use to give us an updated probability that the input – the recipe – is a good one.  Were this probability sufficiently low, we would throw out the recipe and pick a new one for the next meal.  We would, however, identify the cook as an excellent recipe-follower.
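The “very simple formula” mentioned above is Bayes’ rule.  Here is a minimal sketch of it in Python.  The function name and all three input probabilities are invented for illustration; none of the numbers come from this conversation or from any study.

```python
def posterior_recipe_is_good(prior_good, p_bad_meal_if_good, p_bad_meal_if_bad):
    """Bayes' rule: P(recipe is good | the meal turned out badly)."""
    # Total probability of a bad meal, whether or not the recipe is good.
    p_bad_meal = (prior_good * p_bad_meal_if_good
                  + (1 - prior_good) * p_bad_meal_if_bad)
    return prior_good * p_bad_meal_if_good / p_bad_meal


# Hypothetical inputs: we start out 80% confident in the recipe (the prior),
# a good recipe still yields a bad meal 10% of the time, and a bad recipe
# yields a bad meal 70% of the time.
updated = posterior_recipe_is_good(0.8, 0.1, 0.7)
print(round(updated, 3))  # 0.364
```

With these made-up inputs, one bad meal drops our confidence in the recipe from 80% to about 36%.  Whether that is “sufficiently low” to throw the recipe out is a separate judgment call, and the update says nothing negative about the cook.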

This approach has several advantages over the alternative (evaluating the cook primarily on the taste of the food).  Most obviously, it accurately captures the cook’s performance.  The cook clearly did an excellent job doing what both you and he thought was a good idea – following this specific recipe – and can therefore be expected to do a good job following other recipes in the future.  If we punished him, we’d be sending the message that his actual performance matters less than having good luck, and if we fired him, we’d be depriving ourselves of a potentially great cook.  Additionally, it’s not the cook’s fault that we picked the wrong cooking strategy, so it’s unethical to punish him for doing everything we asked him to do.

Just as importantly, this approach would help us identify the strategies most likely to lead to better meals in the long run.  We might not catch the problem with the recipe if we incorrectly attribute the meal’s taste to the cook’s performance – we might end up continuously hiring and firing a bunch of great cooks before we realize that the recipe is bad.  If we instead focus on the cook’s locus of control – following the recipe – and use Bayesian analysis, we will more quickly discover the best recipes and retain more cooks with recipe-following skills.  Judging cooks on their ability to execute inputs and using outcomes to evaluate the validity of the inputs would, over time, increase the quality of our meals.

Let’s now imagine the analogous situation for teachers.  Suppose a school adopts blended learning as its instructional framework, and suppose a teacher executes the school’s blended learning model perfectly.  However, the teacher’s value added (VAM) results aren’t particularly high.  Should we punish the teacher?  The answer, quite clearly, is no; unless the teacher was bad at something we forgot to identify as an effective teaching practice, none of the explanations for the low scores have anything to do with the teacher’s performance.  Just as with cooking, we might not catch a real problem with a given teaching approach if we incorrectly attribute outcome data to a teacher’s performance – we might end up continuously hiring and firing a bunch of great teachers based on random error, a problem with an instructional framework, or a problem with VAM methodology.

The improper use of student outcome data in high-stakes decision-making has negative consequences for students precisely because of this incorrect attribution.  Making VAM a defined percentage of teacher evaluations leads to employment decisions based on inaccurate perceptions of teacher quality.  Typical VAM usage also makes it harder for us to identify successful teaching practices.  If we instead focus on teachers’ locus of control – effective execution of teacher practices – and use Bayesian analysis, we will more quickly discover the best teaching strategies and retain more teachers who can execute teaching strategies effectively.  Judging teachers on their ability to execute inputs and using outcomes to evaluate the validity of the inputs would, over time, increase the likelihood of student success.
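To make the idea of using outcomes to evaluate the inputs a bit more concrete, here is a rough sketch of how repeated Bayesian updates might work when several teachers all execute the same framework well.  Everything in it is an assumption made for illustration: the starting belief, the two likelihoods, the list of results, and the simplification that classrooms are independent of one another.

```python
def update(prior_good, outcome_is_low, p_low_if_good, p_low_if_bad):
    """One Bayes' rule update of P(framework is good) from one classroom's result."""
    if outcome_is_low:
        like_good, like_bad = p_low_if_good, p_low_if_bad
    else:
        like_good, like_bad = 1 - p_low_if_good, 1 - p_low_if_bad
    evidence = prior_good * like_good + (1 - prior_good) * like_bad
    return prior_good * like_good / evidence


# Hypothetical data: five teachers executed the framework faithfully;
# True means that classroom's results came back low.
low_results = [True, False, True, True, True]

belief = 0.8  # assumed starting confidence that the framework is a good one
for is_low in low_results:
    # Assumed likelihoods: results are noisy, so low scores are only
    # somewhat more likely under a bad framework (60%) than a good one (40%).
    belief = update(belief, is_low, p_low_if_good=0.4, p_low_if_bad=0.6)
print(round(belief, 3))  # 0.542
```

Because the assumed likelihoods are close together, any single classroom’s result moves the belief only a little.  It is the pattern across teachers who all executed the framework well that casts doubt on the framework, and none of those teachers is penalized along the way.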

3) As “a state-level policy advocacy organization,” what is the scope of StudentsFirst’s work?

You wrote that StudentsFirst “[doesn’t] advocate for specific models or types of evaluations” but believes “certain elements need to be there.”  One of the elements you recommend is “evaluating teachers based on evidence of student results.”  This recommendation has translated into your support for the use of standardized test scores as a defined percentage of teacher evaluations.  I was not recommending that you ask states to adopt San Jose Unified’s evaluation framework (as an aside, the component you ask about deals mostly with planning and, among other things, uses lesson plans, teacher-created materials, and assessments as evidence) or that you recommend across-the-board class size reduction (thanks for clarifying your position on that, by the way – I look forward to reading the pieces you linked).  Instead, since probability theory and research suggest it isn’t likely to improve teacher performance, I recommend that StudentsFirst discontinue its push to make standardized test scores a percentage of evaluations.  You could instead advocate for evaluation systems that clearly define good teacher practices, hold teachers accountable for implementing good practices, and use student outcomes in Bayesian analysis to evaluate the validity of the defined practices.  This approach would increase the likelihood of achieving your stated organizational goals.

Thanks again for engaging in such an in-depth conversation.  I think more superficial correspondence often misses the nuance in these issues, and I am excited that you and I are getting the opportunity to both identify common ground and discuss our concerns.

Click here to read Part 3a of the conversation, which focuses back on the evaluation debate.

Click here to read Part 3b of the conversation, which focuses on how reformers and other educators talk about poverty.



StudentsFirst Vice President Eric Lerum and I Debate Accountability Measures (Part 1)

After my blog post on the problem with outcome-oriented teacher evaluations and school accountability measures, StudentsFirst Vice President Eric Lerum and I exchanged a few tweets about student outcomes and school inputs and decided to debate teacher and school accountability more thoroughly.  We had a lengthy email conversation we agreed to share, the first part of which is below.

Spielberg: In my last post, I highlighted why both probability theory and empirical research suggest we should stop using student outcome data to evaluate teachers and schools.  Using value added modeling (VAM) as a percentage of an evaluation actually reduces the likelihood of better future student outcomes because VAM results have more to do with random error and outside-of-school factors than they have to do with teaching effectiveness.

I agree with some of your arguments about evaluation; for example, evaluations should definitely use multiple measures of performance.  I also appreciate your opposition to making student test score results the sole determinant of a teacher’s evaluation.  However, you insist that measures like VAM constitute a fairly large percentage of teacher evaluations despite several clear drawbacks; not only do they fail to reliably capture a teacher’s contribution to student performance, but they also narrow our conception of what teachers and schools should do and distract policymakers and educators from conversations about specific practices they might adopt.  Why don’t you instead focus on defining and implementing best practices effectively?  Most educators have similar ideas about what good schools and effective teaching look like, and a focus on the successful implementation of appropriately-defined inputs is the most likely path to better student outcomes in the long run.

Lerum: There’s nothing in the research or the link you cite above that supports a conclusion that using VAM “actually reduces the likelihood of better future student outcomes” – that’s simply an incorrect conclusion to come to. Numerous researchers have concluded that using VAM is reasonable and a helpful component of better teacher evaluations (also see MET). Even Shankerblog doesn’t go so far as to suggest using VAM could reduce chances of greater student success.

Some of your concerns with VAM deal with the uncertainty built within it. But that’s true for any measure. Yet VAM is one of the few measures (if not the only one) that has actually been shown to allow one to control for many of the outside factors you suggest could unfairly prejudice a teacher’s rating.

What VAM does tell us – with greater reliability than other measures – is whether a teacher is likely to get higher student achievement with a particular group of students. I would argue that’s a valuable piece of information to have if the goal is to identify which teachers are getting results and which teachers need development.

To suggest that districts & schools that are focusing on implementing new evaluation systems like those we support are not focusing on “defining and implementing best practices effectively” misses a whole lot of evidence to the contrary. What we’re seeing in DC, Tennessee, Harrison County, CO, and countless other places is that these conversations are happening, and with a renewed vigor because educators are working with more data and a stronger framework than ever before.

Back to your original post and my issues with it, however – focusing on inputs is not a new approach. It’s the one we have tried for decades. More pay for earning a master’s degree. Class size restrictions and staffing ratios. Providing funding that can only be used for certain programs. The list goes on and on.

Spielberg: I don’t think anyone thinks we should evaluate teachers on the number and type of degrees they hold, or that we should evaluate schools on how much specialized funding they allocate – I can see why you were concerned if you thought that’s what I recommended.  My proposal is to evaluate teachers on the actions they take in pursuit of student outcomes and is something I’m excited to discuss with you.

However, I think it’s important first to discuss my statement about VAM usage more thoroughly because the sound bites and conclusions drawn in and from many of the pieces you link are inconsistent with the actual research findings.  For example, if you read the entirety of the report that spawned the first article you link, you’ll notice that there’s a very low correlation between teacher value added scores in consecutive years.  I’m passionate about accurate statistical analyses – my background is in mathematical and computational sciences – and I try to read the full text of education research instead of press releases because, as I’ve written before, “our students…depend on us to [ensure] that sound data and accurate statistical analyses drive decision-making. They rely on us to…continuously ask questions, keep an open mind about potential answers, and conduct thorough statistical analyses to better understand reality.  They rely on us to distinguish statistical significance from real-world relevance.”  When we implement evaluation systems based on misunderstandings of research, we not only alienate people who do their jobs well, but we also make bad employment decisions.

My original statement, which you only quoted part of in your response, was the following: “Using value added modeling (VAM) as a percentage of an evaluation actually reduces the likelihood of better future student outcomes because VAM results have more to do with random error and outside-of-school factors than they have to do with teaching effectiveness.”  This statement is, in fact, accurate.  The following are well-established facts in support of this claim:

– As I explained in my post, probability theory is extremely clear that decision-making based on results yields lower probabilities of future positive results when compared to decision-making based on factors people completely control.

– In-school factors have never been shown to explain more than about one-third of the opportunity gap.  As mentioned in the Shanker Blog post I linked above, estimates of teacher impact on the differences in student test scores are generally in the ballpark of 10% to 15% (the American Statistical Association says it ranges from 1% to 14%).  Teachers have an appreciable impact, but teachers do not have even majority control over VAM scores.

Research on both student and teacher incentives is consistent with what we’d expect from the bullet points above – researchers agree that systems that judge performance based on factors over which people have only limited control (in nearly any field) fail to reliably improve performance and future outcomes.
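The probability claim in the first bullet point can be illustrated with a toy simulation.  This is only a sketch under stated assumptions, not a model of any real evaluation system.  It assumes that true teaching quality explains roughly a sixth of the variation in the outcome signal (in line with the “less than a fifth” range cited above), and it assumes, without demonstrating it, that observations of teaching practice are a much less noisy signal of quality.  The function name and every parameter value are hypothetical.

```python
import random
import statistics


def mean_quality_of_retained(signal_noise_sd, n_teachers=20000,
                             quality_sd=0.4, keep_fraction=0.8, seed=1):
    """Average true quality of the teachers retained by ranking on a noisy signal."""
    rng = random.Random(seed)
    teachers = []
    for _ in range(n_teachers):
        quality = rng.gauss(0, quality_sd)                # unobserved true quality
        signal = quality + rng.gauss(0, signal_noise_sd)  # what the evaluator sees
        teachers.append((signal, quality))
    teachers.sort(reverse=True)                           # highest signal first
    kept = teachers[:int(n_teachers * keep_fraction)]
    return statistics.mean(quality for _, quality in kept)


# Assumed noise levels: the outcome signal is mostly noise, while the
# practice-based signal is assumed (not shown) to be far less noisy.
outcome_based = mean_quality_of_retained(signal_noise_sd=0.9)
practice_based = mean_quality_of_retained(signal_noise_sd=0.2)
print(outcome_based < practice_based)
```

Under these assumptions, retaining teachers based on the noisier outcome signal keeps a group with lower average true quality than retaining them based on the less noisy signal.  The result follows directly from the assumed noise levels, so the simulation shows what the claim means rather than proving that practice-based measures are in fact less noisy.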

Those two bullet points, the strong research that corroborates the theory, and the existence of an alternative evaluation framework that judges teachers on factors they completely control (which I will talk more about below) would essentially prove my statement even if recent studies hadn’t also indicated that VAM scores correlate poorly with other measures of teacher effectiveness.  In addition, principal Ted Appel astutely notes that, “even when school systems use test scores as ‘only a part’ of a holistic evaluation, it infects the entire process as it becomes the piece [that] is most easily and simplistically viewed by the public and media. The result is a perverse incentive to find the easiest route to better outcome scores, often at the expense of the students most in need of great teaching input.”

I also think it’s important to mention that the research on the efficacy of class size reduction, which you seem to oppose, is at worst comparable to the research on the accuracy of VAM results.  I haven’t read many of the class size studies conducted in the last few years yet (this one is on my reading list) and thus can’t speak at this time to whether the benefits they find are legitimate, but even Eric Hanushek acknowledges that “there are likely to be situations…where small classes could be very beneficial for student achievement” in his argument that class size reduction isn’t worth the cost.  It’s intellectually inconsistent to argue simultaneously that class size reduction doesn’t help students and that making VAM a percentage of evaluations does, especially when (as the writeup you linked on Tennessee reminds us) a large number of teachers in some systems that use VAM have been getting evaluated on the test scores of students they don’t even teach.

None of that is to say that the pieces you link are devoid of value.  There’s some research that indicates VAM could be a useful tool, and I’ve actually defended VAM when people confuse VAM as a concept with the specific usage of VAM you recommend.  Though student outcome data shouldn’t be used as a percentage of evaluations, there’s a strong theoretical and research basis for using student outcomes in two other ways in an input-based evaluation process.  The new teacher evaluation system that San Jose Unified School District (SJUSD) and the San Jose Teachers Association (SJTA) have begun to implement can illustrate what I mean by an input-based evaluation system that uses student outcome data differently and that is more likely to lead to improved student outcomes in the long run.

The Teacher Quality Panel in SJUSD has defined the following five standards of teacher practice:

1) Teachers create and maintain effective environments for student learning.

2) Teachers know the subjects they teach and how to organize the subject matter for student learning.

3) Teachers design high-quality learning experiences and present them effectively.

4) Teachers continually assess student progress, analyze the results, and adapt instruction to promote student achievement.

5) Teachers continuously improve and develop as professional educators.

Note that the fourth standard gives us one of the two important uses of student outcome data – it should drive reflection during a cycle of inquiry.  These standards are based on observable teacher inputs, and there’s plenty of direct evidence evaluators can gather about whether teachers are executing these tasks effectively.  The beautiful thing about a system like this is that, if we have defined the elements of each standard correctly, the student outcome results should take care of themselves in the long run.

However, there is still the possibility that we haven’t defined the elements of each standard correctly.  As a concrete example, SJTA and SJUSD believe Explicit Direct Instruction (EDI) has value as an instructional framework, and someone who executes EDI effectively would certainly do well on standard 3.  However, the idea that successful implementation of EDI will lead to better student outcomes in the long run is a prediction, not a fact.  That’s where the second usage of student outcome data comes in – as I mentioned in my previous post, we should use student outcome results to conduct Bayesian analysis and figure out if our inputs are actually the correct ones.  Let me know if you want me to go into detail about how that process works.  Bayesian analysis is really cool (probability is my favorite branch of mathematics, if you haven’t guessed), and it will help us decide, over time, which practices to continue and which ones to reconsider.

I certainly want to acknowledge that many components of systems like IMPACT are excellent ones; increasing the frequency and validity of classroom observations is a really important step, for instance, in executing an input-based model effectively.  We definitely need well-trained evaluators and calibration on what great execution of given best practices looks like.  When I wrote that I’d like to see StudentsFirst “focus on defining and implementing best practices effectively,” I meant that I’d like to see you make these ideas your emphasis.  Conducting evaluations on these sorts of input-based criteria would make professional development and support significantly more relevant.  It would help reverse the teach-to-the-test phenomenon and focus on real learning.  It would make feedback more actionable.  It would also help make teachers and unions feel supported and respected instead of attacked, and it would enable us to collaboratively identify both great teaching and classrooms that need support.  Most importantly, using these kinds of input-based metrics is more likely than the current approach to achieve long-run positive outcomes for our students.

Part 2 of the conversation, posted on August 11, can be found here.


Filed under Education

The Problem with Outcome-Oriented Evaluations

Imagine I observe two poker players playing two tournaments each. During their first tournaments, Player A makes $1200 and Player B loses $800. During her second tournament, Player A pockets another $1000. Player B, on the other hand, loses $1100 more during her second tournament. Would it be a good decision for me to sit down at a table and model my play after Player A?

For many people the answer to this question – no – is counterintuitive. I watched Player A and Player B play two tournaments each and their results were very different – haven’t I seen enough to conclude that Player A is the better poker player? Yet poker involves a considerable amount of luck and there are numerous possible short- and longer-term outcomes for skilled and unskilled players. As Nate Silver writes in The Signal and the Noise, I could monitor each player’s winnings during a year of their full-time play and still not know whether either of them was any good at poker. It would be fully plausible for a “very good limit hold ‘em player” to “have lost $35,000” during that time. Instead of focusing on the desired outcome of their play – making money – I should mimic the player who uses strategies that will, over time, increase the likelihood of future winnings. As Silver writes,

When we play poker, we control our decision-making process but not how the cards come down. If you correctly detect an opponent’s bluff, but he gets a lucky card and wins the hand anyway, you should be pleased rather than angry, because you played the hand as well as you could. The irony is that by being less focused on your results, you may achieve better ones.
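To see how little two tournaments can tell us, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the function names are hypothetical, winnings are modeled as draws from a normal distribution, and the averages (+$100 per tournament for the stronger player, -$100 for the weaker one) and the $1,000 standard deviation are invented, not taken from Silver's book or from real poker data.

```python
import random


def total_winnings(rng, mean_per_tournament, std_dev, tournaments):
    """Sum of simulated winnings over a number of tournaments."""
    return sum(rng.gauss(mean_per_tournament, std_dev) for _ in range(tournaments))


def share_where_weaker_player_wins(trials=20000, tournaments=2, seed=0):
    """Fraction of simulated matchups in which the weaker player finishes ahead.

    Assumed (illustrative) skill levels: the stronger player averages +$100
    per tournament, the weaker player averages -$100, and both have a
    $1,000 standard deviation per tournament.
    """
    rng = random.Random(seed)
    weaker_ahead = 0
    for _ in range(trials):
        strong = total_winnings(rng, 100.0, 1000.0, tournaments)
        weak = total_winnings(rng, -100.0, 1000.0, tournaments)
        if weak > strong:
            weaker_ahead += 1
    return weaker_ahead / trials


if __name__ == "__main__":
    # More observed tournaments make results a better guide to skill.
    for n in (2, 20, 200):
        share = share_where_weaker_player_wins(trials=5000, tournaments=n)
        print(f"{n} tournaments: weaker player ahead in {share:.0%} of simulations")
```

Under these assumed numbers, the weaker player finishes ahead after two tournaments in a large minority of simulations (on the order of 40%), and that share shrinks only as the number of observed tournaments grows. The exact figures depend entirely on the assumed means and variance; the point is only that a short run of results is a noisy guide to skill.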

As Silver recommends for poker and Teach For America recommends to corps members, we should always focus on our “locus of control.” For example, I have frequently criticized Barack Obama for his approach to the Affordable Care Act. While I am unhappy that the health care bill did not include a public option, I couldn’t blame Obama if he had actually tried to pass such a bill and failed because of an obstinate Congress. My critique lies instead with the President’s deceptive work against a more progressive bill – while politicians don’t always control policy outcomes, they do control their actions. As another example, college applicants should not judge their success on whether or not colleges accept them. They should evaluate themselves on what they control – the work they put into high school and their applications. Likewise, great football coaches recognize that they should judge their teams not on their won-loss records, but on each player’s successful execution of assigned responsibilities. Smart decisions and strong performance do not always beget good results; the more factors that lie between our actions and the desired outcome, the less that outcome tells us about the quality of our actions.

Most education reformers and policymakers, unfortunately, still fail to recognize this basic tenet of probabilistic reasoning, a fact underscored in recent conversations between Jack Schneider (a current professor and one of the best high school teachers I’ve ever had) and Michelle Rhee. We implement teacher and school accountability metrics that focus heavily on student outcomes without realizing that this approach is invalid. As the American Statistical Association’s (ASA’s) recent statement on value-added modeling (VAM) clearly states, “teachers account for about 1% to 14% of the variability in [student] test scores” and “[e]ffects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.” Paul Bruno astutely notes that the ASA’s statement is an indictment of the way VAM is used, not the idea of VAM itself, yet little correlation currently exists between VAM results and effective teaching. As I’ve mentioned before, research on both student and teacher incentives suggests that rewards and consequences based on outcomes don’t work. When we use student outcome data to assign credit or blame to educators, we may close good schools, demoralize and dismiss good teachers, and ultimately undermine the likelihood of achieving the student outcomes we want.

Better policy would focus on school and teacher inputs. For example, we should agree on a set of clear and specific best teaching practices (with the caveat that they’d have to be sufficiently flexible to allow for different teaching styles) on which to base teacher evaluations. Similarly, college counselors should provide college applicants with guidance about the components of good applications. Football coaches should likewise focus on their players’ decision-making and execution of blocking, tackling, route-running, and other techniques.

[Figure: Input Output Graphic]

When we evaluate schools on student outcomes, we reward (and punish) them for factors they don’t directly control.  A more intelligent and fair approach would evaluate the actions schools take in pursuit of better student outcomes, not the outcomes themselves.

Outcomes are incredibly important to monitor and consider when selecting effective inputs, of course. In a process called Bayesian analysis, we can use outcomes to constantly update our assessments of whether or not our strategies are working. If we observe little correlation between successful implementation of our identified best teaching practices and student growth for five consecutive years, for instance, we may want to revisit our definition of best practices. A college counselor whose top students are consistently rejected from Ivy League schools should begin to reconsider the advice he gives his students on their applications. Relatedly, if a football team suffers through losing season after losing season despite players’ successful completion of their assigned responsibilities, the team should probably overhaul its strategy.
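The five-year example can be sketched as a simple Bayesian update in Python. This is a minimal illustration under stated assumptions, not a description of how any district actually does it: the function names are hypothetical, and the prior (0.8) and the two likelihoods (0.3 and 0.7) are invented numbers chosen only to show the mechanics.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(hypothesis | observation) using Bayes' rule."""
    numerator = prior * likelihood_if_true
    evidence = numerator + (1.0 - prior) * likelihood_if_false
    return numerator / evidence


# All three numbers are illustrative assumptions, not estimates from any study.
PRIOR_PRACTICES_WORK = 0.8  # initial belief that our chosen practices help students
P_NO_LINK_IF_WORK = 0.3     # chance a year shows no practice-growth link if they do help
P_NO_LINK_IF_NOT = 0.7      # chance a year shows no link if they do not help


def belief_after_years_without_link(years):
    """Belief that the practices work after `years` consecutive years with no link."""
    belief = PRIOR_PRACTICES_WORK
    for _ in range(years):
        belief = bayes_update(belief, P_NO_LINK_IF_WORK, P_NO_LINK_IF_NOT)
    return belief


if __name__ == "__main__":
    for years in range(6):
        print(years, round(belief_after_years_without_link(years), 3))
```

With these assumed inputs, one disappointing year lowers the belief from 0.8 to about 0.63, which is hardly grounds for an overhaul, while five consecutive years push it down to roughly 0.05. That is the sense in which outcomes inform the choice of inputs over time without any single year's results being treated as a verdict.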

The current use of student outcome data to make high-stakes decisions in education, however, flies in the face of these principles. Until we shift our measures of school and teacher performance from student outputs to school and teacher inputs, we will unfortunately continue to make bad policy decisions that simultaneously alienate educators and undermine the very outcomes we are trying to achieve.

Update: A version of this piece appeared in Valerie Strauss’s column in The Washington Post on Sunday, May 25.


Filed under Education, Philosophy