
Cooks, Chefs, and Teachers: A Long-Form Debate on Evaluation (Part 3a)

StudentsFirst Vice President Eric Lerum and I have been debating teacher evaluation approaches since my blog post about why evaluating teachers based on student test scores is misguided and counterproductive.  Our conversation began to touch on the relationship between anti-poverty activism and education reform conversations, a topic we plan to continue discussing.  First, however, we wanted to focus back on our evaluation debate.  Eric originally compared teachers to cooks, and while I noted that cooks have considerably more control over the outcomes of their work than do teachers, we fleshed that analogy out and continue discussing its applicability to teaching below.

Click here to read Part 1 of the conversation.

Click here to read Part 2 of the conversation.

Lerum: I love the analogy you use for this simple reason – I don’t think we’re as interested in figuring out whether the cook is an “excellent recipe-follower” as we are about whether the cook makes food that tastes delicious. And since we’re talking about the evaluation systems themselves – and not the consequences attached (which by and large, most jurisdictions are not using) – then this really matters. The evaluation instrument may reveal that the cook is not an “excellent recipe follower,” which you gloss over. But that’s an important point. It could certainly identify those cooks that need to work on their recipe-following skills. That’s helpful in creating better cooks.

But taking your hypothetical that it identifies someone who can follow a recipe well and executes our strategies, but then the outcome is still bad – that is also important information. It could cause us to re-evaluate the recipe, the meal choice, certain techniques, even the assessment instrument itself (do the people tasting the food know what good food tastes like?). But all of those would be useful and significant pieces of information that we would not get if we weren’t starting with an evaluation framework that includes outcomes measures.

You clearly make the assumption that nobody would question the evaluation instrument or anything else – if we had this result for multiple cooks, we would just keep going with it and assume it’s the cooks and nothing else. But that’s an unreasonable assumption that I think is founded on a lack of trust and respect for the intentions underlying the evaluation. What we’re focused on is identifying, improving, rewarding, and making decisions based on performance. And we want accurate measures for doing so – nobody is interested in models that do not work. That’s why you constantly see the earliest adopters of such models making improvements as they go.

Also, to clarify, we do not advocate for the “use of standardized test scores as a defined percentage of teacher evaluations.” I assume you probably didn’t mean that literally, but I think it’s important for readers to understand the difference as it’s a common and oft-repeated misconception among critics of reform. We advocate for use of measures of student growth – big difference from just using the scores alone. It doesn’t make any sense to evaluate teachers based on the test scores themselves – there needs to be some measure (such as VAM) of how much students learn over time (their growth), but that is not a single snapshot based on any one test.

I appreciate your recommendation regarding the use of even growth data based on assessments, but again, your recommendation is based on your opinion and I respectfully disagree, as do many researchers and respected analysts (also see here and here – getting at some of the issues you raise as concerns, but proposing different solutions). To go back to your analogy, nobody is interested in going to a restaurant run by really good recipe-followers. They want to go where the food tastes good. Period. Likewise, no parent wants to send her child to a classroom taught by a teacher who creates and executes the best lesson-planning. They want to send their child to a classroom in which she will learn. Outcomes are always part of the equation. Figuring out the best way to measure them may always have some inherent issues with subjectivity or variability, but I believe removing outcomes from the overall evaluation itself betrays to some degree the initial purpose.

Spielberg: I think there’s some confusion here about what I’m advocating for and critiquing.  I’d like to reiterate what I have consistently argued in this exchange – that student outcomes should be a part of the teacher evaluation process in two ways:

1) We should evaluate how well teachers gather data on student achievement, analyze the data, and use the data to reflect on and improve their future instruction.

2) We should examine the correlation between the effective execution of teacher practices and student outcome results.  We should then use the results of this examination to revise our instructional practices as needed.

I have never critiqued the fact that you care about student outcomes and believe they should factor heavily into our thinking – on this point we agree (I’ve never met anyone who works in education who doesn’t).  We also agree that it is better to measure student growth on standardized test scores, as value added modeling (VAM) attempts to do, than to look at absolute scores on standardized tests (I apologize if my earlier wording about StudentsFirst’s position was unclear – I haven’t heard anyone speak in favor of the use of absolute scores in quite some time and assumed everyone reading this exchange would know what I meant).  Furthermore, the “useful and significant pieces of information” you talk about above are all captured in the evaluation framework I recommend.

My issue has always been with the specific way you want to factor student outcomes into evaluation systems.  StudentsFirst supports making a teacher’s VAM results a defined percentage of that teacher’s “score” during the evaluation process, does it not?  You highlight places, like DC and Tennessee, that use VAM results in this fashion.  Whether or not this practice is likely to achieve its desired effect is not really a matter of opinion; it’s a matter of mathematical theory and empirical research.  I’ve laid out why StudentsFirst’s approach is inconsistent with the theory and research in earlier parts of our conversation, and none of the work you link above refutes that argument.  As you mention, both Matt Di Carlo and Douglas Harris, the authors of the four pieces you linked, identify issues with the typical uses of VAM similar to the ones I discuss.  Their main defense of VAM is only to suggest that other methods of evaluation are similarly problematic; Harris discusses a “lack of reliability in essentially all measures” and Di Carlo notes that “alternative measures are also noisy.”  There is, however, more recent evidence from MET that multiple, full-period classroom observations by multiple evaluators are significantly more reliable than VAM results.  While Di Carlo and Harris do have slightly different opinions than I do about the role of value added, Di Carlo’s writing and Harris’s suggestion for evaluation on the whole seem far closer to what I’m advocating than to StudentsFirst’s recommendations, and I’d be very interested to hear their thoughts on this conversation.

That said, I like your focus above on what parents want, and I think it’s a worthwhile exercise to look at the purposes of evaluation systems and how our respective proposals meet the desires and needs of different stakeholders.  I believe evaluation systems have three primary purposes: providing information, facilitating support, and creating incentives.

1) Providing Information – You wrote the following:

…nobody is interested in going to a restaurant run by really good recipe-followers. They want to go where the food tastes good. Period. Likewise, no parent wants to send her child to a classroom taught by a teacher who creates and executes the best lesson-planning. They want to send their child to a classroom in which she will learn.

The first thing I’d note is that this juxtaposition doesn’t make very much sense; students taught by teachers who create and execute the best lesson-planning will most likely learn quite a bit (assuming that the teachers who are great lesson planners are at least decent at other aspects of good teaching). In addition, restaurants run by really good recipe-followers, if the recipes are good, will probably produce good-tasting food.  Good outputs are expected when inputs are well-chosen and executed effectively.

The cooking analogy is a bit problematic here because, in the example you give, the taste of the food is both the ultimately desired outcome and the metric by which you propose to assess the cook’s output.  In the educational setting, the metric – VAM, in the case of our debate – is not the same as the desired output.  In fact, VAM results are a relatively weak proxy for only a subset of the outcomes we care about for kids (those related to academic growth).  To construct a more appropriate analogy for judging a teacher on VAM results, let’s consider a chef who works in a restaurant where we want to eat dinner.  We are interested, ultimately, in the overall dining experience we will have at the restaurant. A measurement tool parallel to VAM, one that gives us a potentially useful but very limited picture of only one aspect of the experience other diners had, could be other diners’ assessments of the smell of the chef’s previous meals.

This analogy is more appropriate because the degree to which different diners value different aspects of a dining experience is highly variable.  All diners likely care to some extent about a combination of the food selection, the sustainability of their meal, the food’s taste, the atmosphere, the service, and the price.  Some, however, might value a beautiful, romantic environment over the taste of their entrees, while others may care about service above all else.  Likewise, some parents may care most about a classroom that fosters kindness, some may prioritize the development of critical thinking skills, and others may hold content knowledge in the highest esteem.

Were I to eat at a restaurant, I’d certainly get some information from knowing other diners’ assessments of previous meals’ smells.  Smell and taste are definitely correlated and I tend to value taste above other considerations when I’m considering a restaurant.  Yet it’s possible that other diners like different kinds of food than I do, or that their senses of smell were affected by the weather or allergies when they dined there.  Some food, even though it smells bad, tastes quite good (and vice versa).  If I didn’t look deeper and really analyze what caused the smell ratings, I could very easily choose a sub-optimal restaurant.

What I’d really want to know would be answers to the following questions: what kind of food does the chef plan to make?  Does he source it sustainably?  Is it prepared to order?  Is the wait-staff attentive?  What’s the decor like?  The lighting?  Does the chef accommodate special requests?  How does the chef solicit feedback from his guests, and does he, when necessary, modify his practices in response to the feedback?  If diners could get information on the execution in each of these areas, they would be much better positioned to figure out whether they would enjoy the dining experience than if they focused on other diners’ smell ratings.  A chef who did all of these things well and who used Bayesian analysis to add, drop, and refine menu items and restaurant practices over time would almost certainly maximize the likelihood that future guests would leave satisfied.  A chef with great smell ratings might maximize that probability, but he also might not.

The exact same reasoning applies to the classroom experience.  Good VAM results might indicate a classroom that would provide a learning experience appropriate for a given student, but they might not.  Though I will again note that you don’t advocate for judging teachers solely on VAM, VAM scores tend to be what people focus on when they’re a defined percentage of evaluations.  That focus, again, does not provide very good information.  Whether parents value character development, inspiration, skill building, content mastery, or any other aspect of their children’s educational experience, they would get the best information by concentrating on teacher actions. If a parent knows a teacher’s skill – at establishing a positive classroom environment, at lesson planning, at lesson delivery, at using formative assessment to monitor student progress and adapt instruction, at helping students outside of class, etc. – that parent will be much more informed about the likelihood that a child will learn in a teacher’s class than if that parent focuses attention on the teacher’s VAM results.

2) Facilitating support – A chef with bad smell ratings might not be a very good chef.  But if that’s the case, any system that addressed the questions above – that assessed the chef’s skill at choosing recipes, sourcing great ingredients, making food to order, training his wait-staff, decorating his restaurant, responding to guest feedback, etc. – should also give him poor marks.  Bad results that truly signify bad performance, as opposed to reflecting bad luck or circumstances outside of the chef’s control, are the result of a bad input.  The key idea here is that, if we judge chefs on input execution but monitor outputs to make sure the inputs are comprehensive and accurate, judging chefs on their smell ratings won’t give us any additional information about which chefs need support.

More importantly, making smell ratings a defined percentage of a chef’s evaluation would not help a struggling chef improve his performance.  No matter the other components of his evaluation, he is likely to concentrate primarily on the smell ratings, feel like a failure, and have difficulty focusing on areas in which he can improve.  If we instead show the chef that, despite training the waitstaff well, he is having trouble selecting the best ingredients, we give him an actionable item to consider.  “Try these approaches to selecting new ingredients” is much easier to follow and much less demoralizing a directive than “raise your smell ratings.”

I think the parallel here is pretty clear – if we define and measure appropriate teaching inputs and use outcomes in Bayesian analysis to constantly revise those inputs, making VAM a defined percentage of an evaluation provides no new information about which teachers need support.  Especially because VAM formulas are complex statistical models that aren’t easily understood, the defined-percentage approach also focuses the evaluation away from actionable improvement items and towards the assignment of credit and blame.

3) Creating Incentives – Finally, a third goal of evaluation systems is related to workforce incentives.  First, we often wish to reward and retain high performers and, in the instances in which support fails, exit consistent low performers.  For retention and dismissal to improve overall workforce quality, we must base these decisions on accurate performance measures.

I don’t think the incomplete information provided by VAM results and smell ratings needs rehashing here; the argument is the same as above.  We are going to retain a higher percentage of chefs and teachers who are actually excellent if our evaluation systems focus on what they control than if our incentives focus on outputs over which they have limited impact.

Of particular concern to me, however, are the incentives teachers have for working with the highest-need populations.  Even efforts that take great pains to “level the playing field” between teachers with different student populations result in significantly better VAM results for teachers and schools that work with more privileged students.  Research strongly suggests that teachers who work in low-income communities could substantially improve their VAM scores by moving to classrooms with more affluent populations (and keeping their teaching quality constant).  When we make VAM results a defined percentage of an evaluation, we provide incentives for teachers who work with the highest-need populations to leave.  The type of evaluation I’m proposing, if we execute it properly, would eliminate this perverse incentive.

Again, I want to reiterate that I support constantly monitoring student outcomes; we should evaluate teachers on their ability to modify instruction in response to student outcomes, and we should also use outcomes to continuously refine our list of great teaching inputs.  But we rely on evaluation systems to provide accurate and comprehensive information, to help struggling employees improve, and to provide appropriate incentives.  VAM can help us think about good teaching practices, but StudentsFirst’s proposed use of VAM does not help us accomplish the goals of teacher evaluation.

Part 3b – in which we return to our discussion about the relationship between anti-poverty work and education reform – will follow soon!

Update (8/21/14) – Matt Barnum alerted me to the fact that the article I linked above about efforts to “level the playing field” when looking at VAM results actually does provide evidence that “two-step VAM” can eliminate the bias against low-income schools.  That’s exciting because, assuming the results are replicable and accurate, this particular VAM method would eliminate one of the incentive concerns I discussed.  However, while Educators 4 Excellence (Barnum’s organization) advocates for the use of this method, I don’t believe states currently use it (if you know of a state that does, please feel free to let me know).  The significant other issues with VAM would also still exist even with the use of the two-step version.


The Problem with Outcome-Oriented Evaluations

Imagine I observe two poker players playing two tournaments each. During their first tournaments, Player A makes $1200 and Player B loses $800. During her second tournament, Player A pockets another $1000. Player B, on the other hand, loses $1100 more during her second tournament. Would it be a good decision for me to sit down at a table and model my play after Player A?

For many people the answer to this question – no – is counterintuitive. I watched Player A and Player B play two tournaments each and their results were very different – haven’t I seen enough to conclude that Player A is the better poker player? Yet poker involves a considerable amount of luck and there are numerous possible short- and longer-term outcomes for skilled and unskilled players. As Nate Silver writes in The Signal and the Noise, I could monitor each player’s winnings during a year of their full-time play and still not know whether either of them was any good at poker. It would be fully plausible for a “very good limit hold ’em player” to “have lost $35,000” during that time. Instead of focusing on the desired outcome of their play – making money – I should mimic the player who uses strategies that will, over time, increase the likelihood of future winnings. As Silver writes,

When we play poker, we control our decision-making process but not how the cards come down. If you correctly detect an opponent’s bluff, but he gets a lucky card and wins the hand anyway, you should be pleased rather than angry, because you played the hand as well as you could. The irony is that by being less focused on your results, you may achieve better ones.
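
The tournament example can be made concrete with a small simulation. The sketch below is only an illustration under assumptions that are mine, not Silver's: each tournament result is drawn from a normal distribution, the skilled player averages +$100 per tournament, the weaker player averages -$100, and both have a $1,000 standard deviation. The function names are hypothetical.

```python
import random


def total_profit(mean_per_tournament, stdev, n_tournaments, rng):
    """Sum of n simulated tournament results, each drawn from Normal(mean, stdev)."""
    return sum(rng.gauss(mean_per_tournament, stdev) for _ in range(n_tournaments))


def prob_weaker_finishes_ahead(skilled_mean, weak_mean, stdev, n_tournaments,
                               trials=2000, seed=0):
    """Estimate how often the weaker player ends up with more money than the
    skilled player after each has played n_tournaments.

    All inputs are made-up illustration values, not real poker data.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    upsets = 0
    for _ in range(trials):
        skilled = total_profit(skilled_mean, stdev, n_tournaments, rng)
        weak = total_profit(weak_mean, stdev, n_tournaments, rng)
        if weak > skilled:
            upsets += 1
    return upsets / trials


if __name__ == "__main__":
    for n in (2, 200):
        p = prob_weaker_finishes_ahead(100, -100, 1000, n)
        print(f"{n} tournaments each: weaker player ahead in {p:.0%} of simulations")
```

Under these made-up numbers, the weaker player finishes ahead of the skilled one in roughly four of every ten two-tournament comparisons, and only rarely once each has played 200 tournaments. Two tournaments say very little about who plays better.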

As Silver recommends for poker and Teach For America recommends to corps members, we should always focus on our “locus of control.” For example, I have frequently criticized Barack Obama for his approach to the Affordable Care Act. While I am unhappy that the health care bill did not include a public option, I couldn’t blame Obama if he had actually tried to pass such a bill and failed because of an obstinate Congress. My critique lies instead with the President’s deceptive work against a more progressive bill – while politicians don’t always control policy outcomes, they do control their actions. As another example, college applicants should not judge their success on whether or not colleges accept them. They should evaluate themselves on what they control – the work they put into high school and their applications. Likewise, great football coaches recognize that they should judge their teams not on their won-loss records, but on each player’s successful execution of assigned responsibilities. Smart decisions and strong performance do not always beget good results; the more factors in-between our actions and the desired outcome, the less predictive power the outcome can give us.

Most education reformers and policymakers, unfortunately, still fail to recognize this basic tenet of probabilistic reasoning, a fact underscored in recent conversations between Jack Schneider (a current professor and one of the best high school teachers I’ve ever had) and Michelle Rhee. We implement teacher and school accountability metrics that focus heavily on student outcomes without realizing that this approach is invalid. As the American Statistical Association’s (ASA’s) recent statement on value-added modeling (VAM) clearly states, “teachers account for about 1% to 14% of the variability in [student] test scores” and “[e]ffects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.” Paul Bruno astutely notes that the ASA’s statement is an indictment of the way VAM is used, not the idea of VAM itself, yet little correlation currently exists between VAM results and effective teaching. As I’ve mentioned before, research on both student and teacher incentives suggests that rewards and consequences based on outcomes don’t work. When we use student outcome data to assign credit or blame to educators, we may close good schools, demoralize and dismiss good teachers, and ultimately undermine the likelihood of achieving the student outcomes we want.

Better policy would focus on school and teacher inputs. For example, we should agree on a set of clear and specific best teaching practices (with the caveat that they’d have to be sufficiently flexible to allow for different teaching styles) on which to base teacher evaluations. Similarly, college counselors should provide college applicants with guidance about the components of good applications. Football coaches should likewise focus on their players’ decision-making and execution of blocking, tackling, route-running, and other techniques.

[Figure: Input Output Graphic]

When we evaluate schools on student outcomes, we reward (and punish) them for factors they don’t directly control.  A more intelligent and fair approach would evaluate the actions schools take in pursuit of better student outcomes, not the outcomes themselves.

Outcomes are incredibly important to monitor and consider when selecting effective inputs, of course. Mathematicians use outcomes in a process called Bayesian analysis to constantly update their assessments of whether or not their strategies are working. If we observe little correlation between successful implementation of our identified best teaching practices and student growth for five consecutive years, for instance, we may want to revisit our definition of best practices. A college counselor whose top students are consistently rejected from Ivy League schools should begin to reconsider the advice he gives his students on their applications. Relatedly, if a football team suffers through losing season after losing season despite players’ successful completion of their assigned responsibilities, the team should probably overhaul its strategy.
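
The updating process described above can be sketched with a standard Beta-Binomial model. This is a toy illustration, not a description of any real evaluation system: the prior, the yearly counts, and the definition of a "success" (a classroom that executed the identified practices well and also showed strong student growth) are all assumptions invented for the example, and the function names are hypothetical.

```python
def update_beta(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing new binary outcomes."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution: the current estimated success rate."""
    return alpha / (alpha + beta)


def track_belief(prior, yearly_counts):
    """Apply one Bayesian update per year and record the estimate after each year.

    prior: (alpha, beta) encoding the initial belief that the practices work.
    yearly_counts: list of (successes, failures) pairs, one pair per year.
    """
    alpha, beta = prior
    means = []
    for successes, failures in yearly_counts:
        alpha, beta = update_beta(alpha, beta, successes, failures)
        means.append(beta_mean(alpha, beta))
    return (alpha, beta), means


if __name__ == "__main__":
    # Hypothetical prior: fairly confident the practices work (mean 0.8).
    prior = (8, 2)
    # Hypothetical data: each year, 4 of 10 well-executing classrooms show strong growth.
    five_years = [(4, 6)] * 5
    posterior, means = track_belief(prior, five_years)
    for year, estimate in enumerate(means, start=1):
        print(f"after year {year}: estimated success rate {estimate:.2f}")
```

With these invented counts, the estimated success rate falls from the prior's 0.80 to about 0.47 after five years, which is the kind of signal that would prompt revisiting the definition of best practices rather than blaming the individual teachers who executed them.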

The current use of student outcome data to make high-stakes decisions in education, however, flies in the face of these principles. Until we shift our measures of school and teacher performance from student outputs to school and teacher inputs, we will unfortunately continue to make bad policy decisions that simultaneously alienate educators and undermine the very outcomes we are trying to achieve.

Update: A version of this piece appeared in Valerie Strauss’s column in The Washington Post on Sunday, May 25.
