
Cooks, Chefs, and Teachers: A Long-Form Debate on Evaluation (Part 3a)

StudentsFirst Vice President Eric Lerum and I have been debating teacher evaluation approaches since my blog post about why evaluating teachers based on student test scores is misguided and counterproductive.  Our conversation began to touch on the relationship between anti-poverty activism and education reform, a topic we plan to continue discussing.  First, however, we wanted to focus back on our evaluation debate.  Eric originally compared teachers to cooks, and while I noted that cooks have considerably more control over the outcomes of their work than teachers do, we fleshed out that analogy and continue discussing its applicability to teaching below.

Click here to read Part 1 of the conversation.

Click here to read Part 2 of the conversation.

Lerum: I love the analogy you use for this simple reason – I don’t think we’re as interested in figuring out whether the cook is an “excellent recipe-follower” as we are in whether the cook makes food that tastes delicious. And since we’re talking about the evaluation systems themselves – and not the consequences attached (which, by and large, most jurisdictions are not using) – this really matters. The evaluation instrument may reveal that the cook is not an “excellent recipe follower,” which you gloss over. But that’s an important point. It could certainly identify those cooks who need to work on their recipe-following skills. That’s helpful in creating better cooks.

But take your hypothetical in which the evaluation identifies someone who follows a recipe well and executes our strategies, yet the outcome is still bad – that is also important information. It could cause us to re-evaluate the recipe, the meal choice, certain techniques, even the assessment instrument itself (do the people tasting the food know what good food tastes like?). But all of those would be useful and significant pieces of information that we would not get if we weren’t starting with an evaluation framework that includes outcome measures.

You clearly make the assumption that nobody would question the evaluation instrument or anything else – if we had this result for multiple cooks, we would just keep going with it and assume it’s the cooks and nothing else. But that’s an unreasonable assumption that I think is founded on a lack of trust and respect for the intentions underlying the evaluation. What we’re focused on is identifying, improving, rewarding, and making decisions based on performance. And we want accurate measures for doing so – nobody is interested in models that do not work. That’s why you constantly see the earliest adopters of such models making improvements as they go.

Also, to clarify, we do not advocate for the “use of standardized test scores as a defined percentage of teacher evaluations.” I assume you probably didn’t mean that literally, but I think it’s important for readers to understand the difference as it’s a common and oft-repeated misconception among critics of reform. We advocate for use of measures of student growth – big difference from just using the scores alone. It doesn’t make any sense to evaluate teachers based on the test scores themselves – there needs to be some measure (such as VAM) of how much students learn over time (their growth), but that is not a single snapshot based on any one test.

I appreciate your recommendation regarding the use of even growth data based on assessments, but again, your recommendation is based on your opinion and I respectfully disagree, as do many researchers and respected analysts (also see here and here – getting at some of the issues you raise as concerns, but proposing different solutions). To go back to your analogy, nobody is interested in going to a restaurant run by really good recipe-followers. They want to go where the food tastes good. Period. Likewise, no parent wants to send her child to a classroom taught by a teacher who creates and executes the best lesson-planning. They want to send their child to a classroom in which she will learn. Outcomes are always part of the equation. Figuring out the best way to measure them may always have some inherent issues with subjectivity or variability, but I believe removing outcomes from the overall evaluation itself betrays to some degree the initial purpose.

Spielberg: I think there’s some confusion here about what I’m advocating for and critiquing.  I’d like to reiterate what I have consistently argued in this exchange – that student outcomes should be a part of the teacher evaluation process in two ways:

1) We should evaluate how well teachers gather data on student achievement, analyze the data, and use the data to reflect on and improve their future instruction.

2) We should examine the correlation between the effective execution of teacher practices and student outcome results.  We should then use the results of this examination to revise our instructional practices as needed.

I have never critiqued the fact that you care about student outcomes and believe they should factor heavily into our thinking – on this point we agree (I’ve never met anyone who works in education who doesn’t).  We also agree that it is better to measure student growth on standardized test scores, as value added modeling (VAM) attempts to do, than to look at absolute scores on standardized tests (I apologize if my earlier wording about StudentsFirst’s position was unclear – I haven’t heard anyone speak in favor of the use of absolute scores in quite some time and assumed everyone reading this exchange would know what I meant).  Furthermore, the “useful and significant pieces of information” you talk about above are all captured in the evaluation framework I recommend.

My issue has always been with the specific way you want to factor student outcomes into evaluation systems.  StudentsFirst supports making teachers’ VAM results a defined percentage of a teacher’s “score” during the evaluation process, do you not?  You highlight places, like DC and Tennessee, that use VAM results in this fashion.  Whether or not this practice is likely to achieve its desired effect is not really a matter of opinion; it’s a matter of mathematical theory and empirical research.  I’ve laid out why StudentsFirst’s approach is inconsistent with the theory and research in earlier parts of our conversation and none of the work you link above refutes that argument.  As you mention, both Matt Di Carlo and Douglas Harris, the authors of the four pieces you linked, identify issues with the typical uses of VAM similar to the ones I discuss.  Their main defense of VAM is only to suggest that other methods of evaluation are similarly problematic; Harris discusses a “lack of reliability in essentially all measures” and Di Carlo notes that “alternative measures are also noisy.”  There is, however, more recent evidence from MET that multiple, full-period classroom observations by multiple evaluators are significantly more reliable than VAM results.  While Di Carlo and Harris do have slightly different opinions than me about the role of value added, Di Carlo’s writing and Harris’s suggestion for evaluation on the whole seem far closer to what I’m advocating than to StudentsFirst’s recommendations, and I’d be very interested to hear their thoughts on this conversation.

That said, I like your focus above on what parents want, and I think it’s a worthwhile exercise to look at the purposes of evaluation systems and how our respective proposals meet the desires and needs of different stakeholders.  I believe evaluation systems have three primary purposes: providing information, facilitating support, and creating incentives.

1) Providing Information – You wrote the following:

…nobody is interested in going to a restaurant run by really good recipe-followers. They want to go where the food tastes good. Period. Likewise, no parent wants to send her child to a classroom taught by a teacher who creates and executes the best lesson-planning. They want to send their child to a classroom in which she will learn.

The first thing I’d note is that this juxtaposition doesn’t make very much sense; students taught by teachers who create and execute the best lesson-planning will most likely learn quite a bit (assuming that the teachers who are great lesson planners are at least decent at other aspects of good teaching). In addition, restaurants run by really good recipe-followers, if the recipes are good, will probably produce good-tasting food.  Good outputs are expected when inputs are well-chosen and executed effectively.

The cooking analogy is a bit problematic here because, in the example you give, the taste of the food is both the ultimately desired outcome and the metric by which you propose to assess the cook’s output.  In the educational setting, the metric – VAM, in the case of our debate – is not the same as the desired output.  In fact, VAM results are a relatively weak proxy for only a subset of the outcomes we care about for kids (those related to academic growth).  To construct a more appropriate analogy for judging a teacher on VAM results, let’s consider a chef who works in a restaurant where we want to eat dinner.  We are interested, ultimately, in the overall dining experience we will have at the restaurant. A measurement tool parallel to VAM, one that gives us a potentially useful but very limited picture of only one aspect of the experience other diners had, could be other diners’ assessments of the smell of the chef’s previous meals.

This analogy is more appropriate because the degree to which different diners value different aspects of a dining experience is highly variable.  All diners likely care to some extent about a combination of the food selection, the sustainability of their meal, the food’s taste, the atmosphere, the service, and the price.  Some, however, might value a beautiful, romantic environment over the taste of their entrees, while others may care about service above all else.  Likewise, some parents may care most about a classroom that fosters kindness, some may prioritize the development of critical thinking skills, and others may hold content knowledge in the highest esteem.

Were I to eat at a restaurant, I’d certainly get some information from knowing other diners’ assessments of previous meals’ smells.  Smell and taste are definitely correlated and I tend to value taste above other considerations when I’m considering a restaurant.  Yet it’s possible that other diners like different kinds of food than me, or that their senses of smell were affected by the weather or allergies when they dined there.  Some food, even though it smells bad, tastes quite good (and vice versa).  If I didn’t look deeper and really analyze what caused the smell ratings, I could very easily choose a sub-optimal restaurant.

What I’d really want to know would be answers to the following questions: what kind of food does the chef plan to make?  Does he source it sustainably?  Is it prepared to order?  Is the wait-staff attentive?  What’s the decor like?  The lighting?  Does the chef accommodate special requests?  How does the chef solicit feedback from his guests, and does he, when necessary, modify his practices in response to the feedback?  If diners could get information on the execution in each of these areas, they would be much better positioned to figure out whether they would enjoy the dining experience than if they focused on other diners’ smell ratings.  A chef who did all of these things well and who used Bayesian analysis to add, drop, and refine menu items and restaurant practices over time would almost certainly maximize the likelihood that future guests would leave satisfied.  A chef with great smell ratings might maximize that probability, but he also might not.

The exact same reasoning applies to the classroom experience.  Good VAM results might indicate a classroom that would provide a learning experience appropriate for a given student, but they might not.  Though I will again note that you don’t advocate for judging teachers solely on VAM, VAM scores tend to be what people focus on when they’re a defined percentage of evaluations.  That focus, again, does not provide very good information.  Whether parents value character development, inspiration, skill building, content mastery, or any other aspect of their children’s educational experience, they would get the best information by concentrating on teacher actions. If a parent knows a teacher’s skill – at establishing a positive classroom environment, at lesson planning, at lesson delivery, at using formative assessment to monitor student progress and adapt instruction, at helping students outside of class, etc. – that parent will be much more informed about the likelihood that a child will learn in a teacher’s class than if that parent focuses attention on the teacher’s VAM results.

2) Facilitating support – A chef with bad smell ratings might not be a very good chef.  But if that’s the case, any system that addressed the questions above – that assessed the chef’s skill at choosing recipes, sourcing great ingredients, making food to order, training his wait-staff, decorating his restaurant, responding to guest feedback, etc. – should also give him poor marks.  Bad results that truly signify bad performance, as opposed to reflecting bad luck or circumstances outside of the chef’s control, trace back to badly chosen or badly executed inputs.  The key idea here is that, if we judge chefs on input execution but monitor outputs to make sure the inputs are comprehensive and accurate, smell ratings won’t give us any additional information about which chefs need support.

More importantly, making smell ratings a defined percentage of a chef’s evaluation would not help a struggling chef improve his performance.  No matter the other components of his evaluation, he is likely to concentrate primarily on the smell ratings, feel like a failure, and have difficulty focusing on areas in which he can improve.  If we instead show the chef that, despite training the waitstaff well, he is having trouble selecting the best ingredients, we give him an actionable item to consider.  “Try these approaches to selecting new ingredients” is much easier to follow and much less demoralizing a directive than “raise your smell ratings.”

I think the parallel here is pretty clear – if we define and measure appropriate teaching inputs and use outcomes in Bayesian analysis to constantly revise those inputs, making VAM a defined percentage of an evaluation provides no new information about which teachers need support.  Especially because VAM formulas are complex statistical models that aren’t easily understood, the defined-percentage approach also focuses the evaluation away from actionable improvement items and towards the assignment of credit and blame.

3) Creating Incentives – Finally, a third goal of evaluation systems is related to workforce incentives.  First, we often wish to reward and retain high-performers and, in the instances in which support fails, exit consistently low-performers.  For retention and dismissal to improve overall workforce quality, we must base these decisions on accurate performance measures.

I don’t think the incomplete information provided by VAM results and smell ratings needs rehashing here; the argument is the same as above.  We are going to retain a higher percentage of chefs and teachers who are actually excellent if our evaluation systems focus on what they control than if our incentives focus on outputs over which they have limited impact.
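
To make that concrete, here is a toy simulation – every number in it is invented rather than drawn from any real VAM study.  The only assumption is that a practice-based measure tracks a teacher’s true skill more closely than an outcome measure that mixes skill with factors outside the teacher’s control; the question is what share of genuinely excellent teachers each retention rule ends up keeping.

```python
import random

# Toy model: each teacher has an unobserved "true skill."  An input-based
# measure (observed practice) tracks skill with modest noise; an
# outcome-based measure (a VAM-style score) adds much more noise from
# factors the teacher doesn't control.  All parameters are illustrative.
random.seed(0)

N = 10_000
true_skill = [random.gauss(0, 1) for _ in range(N)]
input_score = [s + random.gauss(0, 0.4) for s in true_skill]    # low noise
outcome_score = [s + random.gauss(0, 2.0) for s in true_skill]  # high noise

def share_excellent_retained(scores, skills, keep_fraction=0.5):
    """Retain the top keep_fraction by `scores`; return the share of that
    retained group whose true skill is also in the top keep_fraction."""
    k = int(keep_fraction * len(scores))
    score_cut = sorted(scores, reverse=True)[k - 1]
    skill_cut = sorted(skills, reverse=True)[k - 1]
    retained = [sk for sc, sk in zip(scores, skills) if sc >= score_cut]
    return sum(sk >= skill_cut for sk in retained) / len(retained)

print(f"Retain on the input measure:   {share_excellent_retained(input_score, true_skill):.0%} truly excellent")
print(f"Retain on the outcome measure: {share_excellent_retained(outcome_score, true_skill):.0%} truly excellent")
```

With these made-up noise levels, retaining on the input measure keeps a group that is overwhelmingly made up of genuinely strong teachers, while retaining on the outcome measure keeps a noticeably weaker group – and the gap widens as the share of the outcome that sits outside the teacher’s control grows.
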

Of particular concern to me, however, are the incentives teachers have for working with the highest-need populations.  Even efforts that take great pains to “level the playing field” between teachers with different student populations result in significantly better VAM results for teachers and schools that work with more privileged students.  Research strongly suggests that teachers who work in low-income communities could substantially improve their VAM scores by moving to classrooms with more affluent populations (and keeping their teaching quality constant).  When we make VAM results a defined percentage of an evaluation, we provide incentives for teachers who work with the highest-need populations to leave.  The type of evaluation I’m proposing, if we execute it properly, would eliminate this perverse incentive.

Again, I want to reiterate that I support constantly monitoring student outcomes; we should evaluate teachers on their ability to modify instruction in response to student outcomes, and we should also use outcomes to continuously refine our list of great teaching inputs.  But we rely on evaluation systems to provide accurate and comprehensive information, to help struggling employees improve, and to provide appropriate incentives.  VAM can help us think about good teaching practices, but StudentsFirst’s proposed use of VAM does not help us accomplish the goals of teacher evaluation.

Part 3b – in which we return to our discussion about the relationship between anti-poverty work and education reform – will follow soon!

Update (8/21/14) – Matt Barnum alerted me to the fact that the article I linked above about efforts to “level the playing field” when looking at VAM results actually does provide evidence that “two-step VAM” can eliminate the bias against low-income schools.  That’s exciting because, assuming the results are replicable and accurate, this particular VAM method would eliminate one of the incentive concerns I discussed.  However, while Educators 4 Excellence (Barnum’s organization) advocates for the use of this method, I don’t believe states currently use it (if you know of a state that does, please feel free to let me know).  The significant other issues with VAM would also still exist even with the use of the two-step version.


StudentsFirst Vice President Eric Lerum and I Debate Accountability Measures (Part 1)

After my blog post on the problem with outcome-oriented teacher evaluations and school accountability measures, StudentsFirst Vice President Eric Lerum and I exchanged a few tweets about student outcomes and school inputs and decided to debate teacher and school accountability more thoroughly.  We had a lengthy email conversation we agreed to share, the first part of which is below.

Spielberg: In my last post, I highlighted why both probability theory and empirical research suggest we should stop using student outcome data to evaluate teachers and schools.  Using value added modeling (VAM) as a percentage of an evaluation actually reduces the likelihood of better future student outcomes because VAM results have more to do with random error and outside-of-school factors than they have to do with teaching effectiveness.

I agree with some of your arguments about evaluation; for example, evaluations should definitely use multiple measures of performance.  I also appreciate your opposition to making student test score results the sole determinant of a teacher’s evaluation.  However, you insist that measures like VAM constitute a fairly large percentage of teacher evaluations despite several clear drawbacks; not only do they fail to reliably capture a teacher’s contribution to student performance, but they also narrow our conception of what teachers and schools should do and distract policymakers and educators from conversations about specific practices they might adopt.  Why don’t you instead focus on defining and implementing best practices effectively?  Most educators have similar ideas about what good schools and effective teaching look like, and a focus on the successful implementation of appropriately-defined inputs is the most likely path to better student outcomes in the long run.

Lerum: There’s nothing in the research or the link you cite above that supports a conclusion that using VAM “actually reduces the likelihood of better future student outcomes” – that’s simply an incorrect conclusion to come to. Numerous researchers have concluded that using VAM is reasonable and a helpful component of better teacher evaluations (also see MET). Even Shanker Blog doesn’t go so far as to suggest using VAM could reduce chances of greater student success.

Some of your concerns with VAM deal with the uncertainty built within it. But that’s true for any measure. Yet VAM is one of the few measures (if not the only one) that has actually been shown to allow one to control for many of the outside factors you suggest could unfairly prejudice a teacher’s rating.

What VAM does tell us – with greater reliability than other measures – is whether a teacher is likely to get higher student achievement with a particular group of students. I would argue that’s a valuable piece of information to have if the goal is to identify which teachers are getting results and which teachers need development.

To suggest that districts & schools that are focusing on implementing new evaluation systems like those we support are not focusing on “defining and implementing best practices effectively” misses a whole lot of evidence to the contrary. What we’re seeing in DC, Tennessee, Harrison County, CO, and countless other places is that these conversations are happening, and with a renewed vigor because educators are working with more data and a stronger framework than ever before.

Back to your original post and my issues with it, however – focusing on inputs is not a new approach. It’s the one we have tried for decades. More pay for earning a master’s degree. Class size restrictions and staffing ratios. Providing funding that can only be used for certain programs. The list goes on and on.

Spielberg: I don’t think anyone thinks we should evaluate teachers on the number and type of degrees they hold, or that we should evaluate schools on how much specialized funding they allocate – I can see why you were concerned if you thought that’s what I recommended.  My proposal is to evaluate teachers on the actions they take in pursuit of student outcomes and is something I’m excited to discuss with you.

However, I think it’s important first to discuss my statement about VAM usage more thoroughly because the sound bites and conclusions drawn in and from many of the pieces you link are inconsistent with the actual research findings.  For example, if you read the entirety of the report that spawned the first article you link, you’ll notice that there’s a very low correlation between teacher value added scores in consecutive years.  I’m passionate about accurate statistical analyses – my background is in mathematical and computational sciences – and I try to read the full text of education research instead of press releases because, as I’ve written before, “our students…depend on us to [ensure] that sound data and accurate statistical analyses drive decision-making. They rely on us to…continuously ask questions, keep an open mind about potential answers, and conduct thorough statistical analyses to better understand reality.  They rely on us to distinguish statistical significance from real-world relevance.”  When we implement evaluation systems based on misunderstandings of research, we not only alienate people who do their jobs well, but we also make bad employment decisions.

My original statement, which you only quoted part of in your response, was the following: “Using value added modeling (VAM) as a percentage of an evaluation actually reduces the likelihood of better future student outcomes because VAM results have more to do with random error and outside-of-school factors than they have to do with teaching effectiveness.”  This statement is, in fact, accurate.  The following are well-established facts in support of this claim:

– As I explained in my post, probability theory is extremely clear that decision-making based on results yields lower probabilities of future positive results when compared to decision-making based on factors people completely control.

– In-school factors have never been shown to explain more than about one-third of the opportunity gap.  As mentioned in the Shanker Blog post I linked above, estimates of teacher impact on the differences in student test scores are generally in the ballpark of 10% to 15% (the American Statistical Association says it ranges from 1% to 14%).  Teachers have an appreciable impact, but teachers do not have even majority control over VAM scores.

Research on both student and teacher incentives is consistent with what we’d expect from the bullet points above – researchers agree that systems that judge performance based on factors over which people have only limited control (in nearly any field) fail to reliably improve performance and future outcomes.

Those two bullet points, the strong research that corroborates the theory, and the existence of an alternative evaluation framework that judges teachers on factors they completely control (which I will talk more about below) would essentially prove my statement even if recent studies hadn’t also indicated that VAM scores correlate poorly with other measures of teacher effectiveness.  In addition, principal Ted Appel astutely notes that, “even when school systems use test scores as ‘only a part’ of a holistic evaluation, it infects the entire process as it becomes the piece [that] is most easily and simplistically viewed by the public and media. The result is a perverse incentive to find the easiest route to better outcome scores, often at the expense of the students most in need of great teaching input.”

I also think it’s important to mention that the research on the efficacy of class size reduction, which you seem to oppose, is at worst comparable to the research on the accuracy of VAM results.  I haven’t read many of the class size studies conducted in the last few years yet (this one is on my reading list) and thus can’t speak at this time to whether the benefits they find are legitimate, but even Eric Hanushek acknowledges that “there are likely to be situations…where small classes could be very beneficial for student achievement” in his argument that class size reduction isn’t worth the cost.  It’s intellectually inconsistent to argue simultaneously that class size reduction doesn’t help students and that making VAM a percentage of evaluations does, especially when (as the writeup you linked on Tennessee reminds us) a large number of teachers in some systems that use VAM have been getting evaluated on the test scores of students they don’t even teach.

None of that is to say that the pieces you link are devoid of value.  There’s some research that indicates VAM could be a useful tool, and I’ve actually defended VAM when people confuse VAM as a concept with the specific usage of VAM you recommend.  Though student outcome data shouldn’t be used as a percentage of evaluations, there’s a strong theoretical and research basis for using student outcomes in two other ways in an input-based evaluation process.  The new teacher evaluation system that San Jose Unified School District (SJUSD) and the San Jose Teachers Association (SJTA) have begun to implement can illustrate what I mean by an input-based evaluation system that uses student outcome data differently and that is more likely to lead to improved student outcomes in the long run.

The Teacher Quality Panel in SJUSD has defined the following five standards of teacher practice:

1) Teachers create and maintain effective environments for student learning.

2) Teachers know the subjects they teach and how to organize the subject matter for student learning.

3) Teachers design high-quality learning experiences and present them effectively.

4) Teachers continually assess student progress, analyze the results, and adapt instruction to promote student achievement.

5) Teachers continuously improve and develop as professional educators.

Note that the fourth standard gives us one of the two important uses of student outcome data – it should drive reflection during a cycle of inquiry.  These standards are based on observable teacher inputs, and there’s plenty of direct evidence evaluators can gather about whether teachers are executing these tasks effectively.  The beautiful thing about a system like this is that, if we have defined the elements of each standard correctly, the student outcome results should take care of themselves in the long run.

However, there is still the possibility that we haven’t defined the elements of each standard correctly.  As a concrete example, SJTA and SJUSD believe Explicit Direct Instruction (EDI) has value as an instructional framework, and someone who executes EDI effectively would certainly do well on standard 3.  However, the idea that successful implementation of EDI will lead to better student outcomes in the long run is a prediction, not a fact.  That’s where the second usage of student outcome data comes in – as I mentioned in my previous post, we should use student outcome results to conduct Bayesian analysis and figure out if our inputs are actually the correct ones.  Let me know if you want me to go into detail about how that process works.  Bayesian analysis is really cool (probability is my favorite branch of mathematics, if you haven’t guessed), and it will help us decide, over time, which practices to continue and which ones to reconsider.
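
(For a rough picture of the mechanics in the meantime, here is a toy sketch of that updating process.  The hypothesis, likelihoods, and observations are all invented for illustration – this is not SJUSD data – but it shows how outcome results shift our confidence in an input like EDI over time rather than scoring any individual teacher.)

```python
# Toy Bayesian update.  Hypothesis H: "well-executed EDI raises the chance
# that a class meets its growth target."  We start with an open-minded prior
# and revise it as outcome data arrive.  Every number here is illustrative.

prior = 0.5                  # initial P(H is true)
p_meet_if_true = 0.70        # assumed P(class meets target | H true)
p_meet_if_false = 0.50       # assumed P(class meets target | H false)

def bayes_update(belief: float, met_target: bool) -> float:
    """Update P(H) after observing one strong-EDI classroom's result."""
    like_true = p_meet_if_true if met_target else 1 - p_meet_if_true
    like_false = p_meet_if_false if met_target else 1 - p_meet_if_false
    return like_true * belief / (like_true * belief + like_false * (1 - belief))

# Hypothetical results from classrooms where EDI was executed well:
results = [True, True, False, True, True, True, False, True]

belief = prior
for met in results:
    belief = bayes_update(belief, met)
    print(f"class met growth target: {str(met):<5}  P(EDI helps) = {belief:.2f}")
```

If that posterior drifted steadily downward instead, the right response would be to reconsider EDI’s place on the list of inputs – not to penalize the teachers who executed it faithfully.
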

I certainly want to acknowledge that many components of systems like IMPACT are excellent ones; increasing the frequency and validity of classroom observations is a really important step, for instance, in executing an input-based model effectively.  We definitely need well-trained evaluators and calibration on what great execution of given best practices looks like.  When I wrote that I’d like to see StudentsFirst “focus on defining and implementing best practices effectively,” I meant that I’d like to see you make these ideas your emphasis.  Conducting evaluations on these kinds of input-based criteria would make professional development and support significantly more relevant.  It would help reverse the teach-to-the-test phenomenon and focus on real learning.  It would make feedback more actionable. It would also help make teachers and unions feel supported and respected instead of attacked, and it would enable us to collaboratively identify both great teaching and classrooms that need support.  Most importantly, using these kinds of input-based metrics is more likely than the current approach to achieve long-run positive outcomes for our students.

Part 2 of the conversation, posted on August 11, can be found here.


The Problem with Outcome-Oriented Evaluations

Imagine I observe two poker players playing two tournaments each. During their first tournaments, Player A makes $1200 and Player B loses $800. During her second tournament, Player A pockets another $1000. Player B, on the other hand, loses $1100 more during her second tournament. Would it be a good decision for me to sit down at a table and model my play after Player A?

For many people the answer to this question – no – is counterintuitive. I watched Player A and Player B play two tournaments each and their results were very different – haven’t I seen enough to conclude that Player A is the better poker player? Yet poker involves a considerable amount of luck and there are numerous possible short- and longer-term outcomes for skilled and unskilled players. As Nate Silver writes in The Signal and the Noise, I could monitor each player’s winnings during a year of their full-time play and still not know whether either of them was any good at poker. It would be fully plausible for a “very good limit hold ‘em player” to “have lost $35,000” during that time. Instead of focusing on the desired outcome of their play – making money – I should mimic the player who uses strategies that will, over time, increase the likelihood of future winnings. As Silver writes,

When we play poker, we control our decision-making process but not how the cards come down. If you correctly detect an opponent’s bluff, but he gets a lucky card and wins the hand anyway, you should be pleased rather than angry, because you played the hand as well as you could. The irony is that by being less focused on your results, you may achieve better ones.
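
A quick simulation shows just how much luck is baked into those results.  The per-tournament edge and variance below are invented for illustration – they are not Silver’s figures – but they make the point that a genuinely skilled player’s outcomes can look bad over a surprisingly long stretch.

```python
import random

# Illustrative numbers only: a skilled player whose expected value is +$150
# per tournament, with a $2,000 standard deviation in per-tournament results.
random.seed(1)

TOURNAMENTS_PER_YEAR = 50

def one_year_winnings() -> float:
    """Total winnings over one simulated year of full-time tournament play."""
    return sum(random.gauss(150, 2000) for _ in range(TOURNAMENTS_PER_YEAR))

years = [one_year_winnings() for _ in range(10_000)]
losing_years = sum(total < 0 for total in years)

print(f"Expected winnings per year: ${150 * TOURNAMENTS_PER_YEAR:,}")
print(f"Share of simulated years in which this skilled player loses money: {losing_years / len(years):.0%}")
```

Under these assumptions the player’s expected yearly profit is solidly positive, yet roughly three simulated years in ten still end in the red – which is exactly why two tournaments’ worth of results tells us almost nothing about who the better player is.
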

As Silver recommends for poker and Teach For America recommends to corps members, we should always focus on our “locus of control.” For example, I have frequently criticized Barack Obama for his approach to the Affordable Care Act. While I am unhappy that the health care bill did not include a public option, I couldn’t blame Obama if he had actually tried to pass such a bill and failed because of an obstinate Congress. My critique lies instead with the President’s deceptive work against a more progressive bill – while politicians don’t always control policy outcomes, they do control their actions. As another example, college applicants should not judge their success on whether or not colleges accept them. They should evaluate themselves on what they control – the work they put into high school and their applications. Likewise, great football coaches recognize that they should judge their teams not on their won-loss records, but on each player’s successful execution of assigned responsibilities. Smart decisions and strong performance do not always beget good results; the more factors in-between our actions and the desired outcome, the less predictive power the outcome can give us.

Most education reformers and policymakers, unfortunately, still fail to recognize this basic tenet of probabilistic reasoning, a fact underscored in recent conversations between Jack Schneider (a current professor and one of the best high school teachers I’ve ever had) and Michelle Rhee. We implement teacher and school accountability metrics that focus heavily on student outcomes without realizing that this approach is invalid. As the American Statistical Association’s (ASA’s) recent statement on value-added modeling (VAM) clearly states, “teachers account for about 1% to 14% of the variability in [student] test scores” and “[e]ffects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.” Paul Bruno astutely notes that the ASA’s statement is an indictment of the way VAM is used, not the idea of VAM itself, yet little correlation currently exists between VAM results and effective teaching. As I’ve mentioned before, research on both student and teacher incentives suggests that rewards and consequences based on outcomes don’t work. When we use student outcome data to assign credit or blame to educators, we may close good schools, demoralize and dismiss good teachers, and ultimately undermine the likelihood of achieving the student outcomes we want.

Better policy would focus on school and teacher inputs. For example, we should agree on a set of clear and specific best teaching practices (with the caveat that they’d have to be sufficiently flexible to allow for different teaching styles) on which to base teacher evaluations. Similarly, college counselors should provide college applicants with guidance about the components of good applications. Football coaches should likewise focus on their players’ decision-making and execution of blocking, tackling, route-running, and other techniques.

[Graphic: inputs vs. outputs]

When we evaluate schools on student outcomes, we reward (and punish) them for factors they don’t directly control.  A more intelligent and fair approach would evaluate the actions schools take in pursuit of better student outcomes, not the outcomes themselves.

Outcomes are incredibly important to monitor and consider when selecting effective inputs, of course. We can use outcomes, through a process mathematicians call Bayesian analysis, to constantly update our assessments of whether or not our strategies are working. If we observe little correlation between successful implementation of our identified best teaching practices and student growth for five consecutive years, for instance, we may want to revisit our definition of best practices. A college counselor whose top students are consistently rejected from Ivy League schools should begin to reconsider the advice he gives his students on their applications. Relatedly, if a football team suffers through losing season after losing season despite players’ successful completion of their assigned responsibilities, the team should probably overhaul its strategy.
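
As a rough sketch of what that monitoring could look like, the snippet below tracks the year-over-year correlation between how well teachers executed the agreed-upon practices and how much their students grew, and flags the practice list for review only after a sustained run of weak correlations.  The ratings, growth numbers, and “weak correlation” threshold are all invented for illustration.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical yearly data: paired (practice-execution ratings, average
# student growth) for a sample of classrooms.  Entirely made-up numbers.
yearly_data = {
    2010: ([3.1, 2.4, 3.8, 2.9, 3.5], [0.9, 0.4, 1.2, 0.8, 1.1]),
    2011: ([2.8, 3.6, 3.0, 2.5, 3.9], [0.6, 1.1, 0.7, 0.5, 1.3]),
    2012: ([3.3, 2.7, 3.1, 3.7, 2.6], [1.0, 0.6, 0.8, 1.2, 0.5]),
}

weak_streak = 0
for year in sorted(yearly_data):
    ratings, growth = yearly_data[year]
    r = pearson(ratings, growth)
    print(f"{year}: correlation between practice execution and growth = {r:.2f}")
    weak_streak = weak_streak + 1 if abs(r) < 0.2 else 0
    if weak_streak >= 5:
        print("Five straight weak years -- time to revisit the list of best practices.")
```

With the invented numbers above the correlation stays strong, so nothing is flagged; a district whose data looked flat year after year would instead see the warning and know that the practice list itself – not the teachers executing it – needed rethinking.
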

The current use of student outcome data to make high-stakes decisions in education, however, flies in the face of these principles. Until we shift our measures of school and teacher performance from student outputs to school and teacher inputs, we will unfortunately continue to make bad policy decisions that simultaneously alienate educators and undermine the very outcomes we are trying to achieve.

Update: A version of this piece appeared in Valerie Strauss’s column in The Washington Post on Sunday, May 25.
