StudentsFirst Vice President Eric Lerum and I recently began debating the use of standardized test scores in high stakes decision-making. I argued in a recent blog post that we should instead evaluate teachers on what they directly control – their actions. Our conversation, which began to touch on additional interesting topics, is continued below.
Click here to read Part 1 of the conversation.
Lerum: To finish the outcomes discussion – measuring teachers by the actions they take is itself measuring an input. What do we learn from evaluating how hard a teacher tries? And is that enough to evaluate teacher performance? Shouldn’t performance be at least somewhat related to the results the teacher gets, independent of how hard she tries? If I put in lots of hours learning how to cook, assembling the perfect recipes, buying the best ingredients, and then even more hours in the kitchen – but the meal I prepare doesn’t taste good and nobody likes it, am I a good cook?
Regarding your use of probability theory and VAM – the problem I have with your analysis there is that VAM is not used to raise student achievement. So using it – even improperly – should not have a direct effect on student achievement. What VAM is used for is determining a teacher’s impact on student achievement, and thereby identifying which teachers are more likely to raise student achievement based on their past ability to do so. So even if you want to apply probability theory and even if you’re right, at best what you’re saying is that we’re unlikely to be able to use it to identify those teachers accurately on an ongoing basis. The larger point that is made repeatedly is that because outside factors play a larger overall role in impacting student achievement, we should not focus on teacher effectiveness and instead solve for these other factors. This is a key disconnect in the education reform debate. Reformers believe that focusing on things like teacher quality and focusing on improving circumstances for children outside of school need not be mutually exclusive. Teacher quality is still very important, as Shankerblog notes. Improving teacher quality and then doing everything we can to ensure students have access to great teachers does not conflict at all with efforts to eliminate poverty. In fact, I would view them as complementary. But critics of these reforms use this argument to say that one should come before the other – that because these other things play larger roles, we should focus our efforts there. That is misguided, I think – we can do both simultaneously. And as importantly in terms of the debate, no reformer that I know suggests that we should only focus on teacher quality or choice or whatever at the expense or exclusion of something else, like poverty reduction or improving health care.
If you’re interested in catching up on class size research, I highly recommend the paper published by Matt Chingos at Brookings, found here with follow-up here. To be clear about my position on class size, however; I’m not against smaller class sizes. If school leaders determine that is an effective way for improving instruction and student achievement in their school, they should utilize that approach. But it’s not the best approach for every school, every class, every teacher, or every child. And thus, state policy should reflect that. Mandating class size limits or restrictions makes no sense. It ties the hands of administrators who may choose to staff their schools differently and use their resources differently. It hinders innovation for educators who may want to teach larger classes in order to configure their classrooms differently, leverage technology or team teaching, etc. Why not instead leave decisions about staffing to school leaders and their educators?
The performance framework for San Jose seems pretty straightforward. I’m curious how you measure #2 (whether teachers know the subjects) – are those through rigorous content exams or some other kind of check?
I think a solid evaluation system would include measures using indicators like these. But you would also need actual student learning/growth data to validate whether those things are working – as you say, “student outcome results should take care of themselves.” You need a measure to confirm that.
I honestly think my short response to all of this would be that there’s nothing in the policies we advocate for that prevent what you’re talking about. And we advocate for meaningful evaluations being used for feedback and professional development – those are critical elements of bills we try to move in states. But as a state-level policy advocacy organization, we don’t advocate for specific models or types of evaluations. We believe certain elements need to be there, but we wouldn’t be advocating for states to adopt the San Jose model or any other specifically – that’s just not what policy advocacy is. So I think there’s just general confusion about that – that simply because you don’t hear us saying to build a model with the components you’re looking for, that must mean we don’t support it. In fact, we’re focused on policy at a level higher than the district level, and design and implementation of programs isn’t in our wheelhouse.
Spielberg: I believe you discuss three very important questions, each one of which deserves some attention:
1) Given that student outcomes are primarily determined by factors unrelated to teaching quality, can and should people still work on improving teacher effectiveness?
Yes! While teaching quality accounts for, at most, a small percentage of the opportunity gap, teacher effectiveness is still very important. Your characterization of reform critics is a common misconception; everyone I’ve ever spoken with believes we can work on addressing poverty and improving schools simultaneously. Especially since we decided to have this conversation to talk about how to measure teacher performance, I’m not sure why you think I’d argue that “we should not focus on teacher effectiveness.” I am critiquing the quality of some of StudentsFirst’s recommendations – they are unlikely to improve teacher effectiveness and have serious negative consequences – not the topic of reform itself. I recommend we pursue policy solutions more likely to improve our schools.
Critics of reform do have a legitimate issue with the way education reformers discuss poverty, however. Education research’s clearest conclusion is that poverty explains inequality significantly better than school-related factors. Reformers often pay lip-service to the importance of poverty and then erroneously imply an equivalence between the impact of anti-poverty initiatives and education reforms. They suggest that there’s far more class mobility in the United States than actually exists. This suggestion harms low-income students.
As an example, consider the controversy that surrounded New York mayor Bill de Blasio several months ago. De Blasio was a huge proponent of measures to reduce income inequality, helped reform stop-and-frisk laws that unfairly targeted minorities, had fought to institute universal pre-K, and had shown himself in nearly every other arena to fight for underprivileged populations. While it would have been perfectly reasonable for StudentsFirst to disagree with him about the three charter co-locations (out of seventeen) that he rejected, StudentsFirst’s insinuation that de Blasio’s position was “down with good schools” was dishonest, especially since a comprehensive assessment of de Blasio’s policies would have indisputably given him high marks on helping low-income students. At the same time, StudentsFirst aligns itself with corporate philanthropists and politicians, like the Waltons and Chris Christie, who actively exploit the poor and undermine anti-poverty efforts. This alignment allows wealthy interests to masquerade as advocates for low-income students while they work behind the scenes to deprive poor students of basic services. Critics argue that organizations like StudentsFirst have chosen the wrong allies and enemies.
I wholeheartedly agree that anti-poverty initiatives and smart education reforms are complementary. I’d just like to see StudentsFirst speak honestly about the relative impact of both. I’d also love to see you hold donors and politicians accountable for their overall impact on students in low-income communities. Then reformers and critics of reform alike could stop accusing each other of pursuing “adult interests” and focus instead on the important work of improving our schools.
2) How can we use student outcome data to evaluate whether an input-based teacher evaluation system has identified the right teaching inputs?
This concept was the one we originally set out to discuss. I’d love to focus on it in subsequent posts if that works for you (though I’d love to revisit the other topics in a different conversation if you’re interested).
I’m glad we agree that “a solid evaluation system would include [teacher input-based] measures…like [the ones used in San Jose Unified].” I also completely agree with you that we need to use student outcome data “to validate whether those things are working.” That’s exactly the use of student outcome data I recommend. Though cooks probably have a lot more control over outcomes than teachers, we can use your cooking analogy to discuss how Bayesian analysis works.
We’d need to first estimate the probability that a given input – let’s say, following a specific recipe – is the best path to a desired outcome (a meal that tastes delicious). This probability is called our “prior.” Let’s then assume that the situation you describe occurs – a cook follows the recipe perfectly and the food turns out poorly. We’d need to estimate two additional probabilities. First, we’d need to know the probability the food would have turned out badly if our original prediction was correct and the recipe was a good one. Second, we’d need the probability that the food would have turned out poorly if our original prediction was incorrect and the recipe was actually a bad one. Once we had those estimates, there’s a very simple formula we could use to give us an updated probability that the input – the recipe – is a good one. Were this probability sufficiently low, we would throw out the recipe and pick a new one for the next meal. We would, however, identify the cook as an excellent recipe-follower.
This approach has several advantages over the alternative (evaluating the cook primarily on the taste of the food). Most obviously, it accurately captures the cook’s performance. The cook clearly did an excellent job doing what both you and he thought was a good idea – following this specific recipe – and can therefore be expected to do a good job following other recipes in the future. If we punished him, we’d be sending the message that his actual performance matters less than having good luck, and if we fired him, we’d be depriving ourselves of a potentially great cook. Additionally, it’s not the cook’s fault that we picked the wrong cooking strategy, so it’s unethical to punish him for doing everything we asked him to do.
Just as importantly, this approach would help us identify the strategies most likely to lead to better meals in the long run. We might not catch the problem with the recipe if we incorrectly attribute the meal’s taste to the cook’s performance – we might end up continuously hiring and firing a bunch of great cooks before we realize that the recipe is bad. If we instead focus on the cook’s locus of control – following the recipe – and use Bayesian analysis, we will more quickly discover the best recipes and retain more cooks with recipe-following skills. Judging cooks on their ability to execute inputs and using outcomes to evaluate the validity of the inputs would, over time, increase the quality of our meals.
Let’s now imagine the analogous situation for teachers. Suppose a school adopts blended learning as its instructional framework, and suppose a teacher executes the school’s blended learning model perfectly. However, the teacher’s value added (VAM) results aren’t particularly high. Should we punish the teacher? The answer, quite clearly, is no; unless the teacher was bad at something we forgot to identify as an effective teaching practice, none of the explanations for the low scores have anything to do with the teacher’s performance. Just as with cooking, we might not catch a real problem with a given teaching approach if we incorrectly attribute outcome data to a teacher’s performance – we might end up continuously hiring and firing a bunch of great teachers based on random error, a problem with an instructional framework, or a problem with VAM methodology.
The improper use of student outcome data in high-stakes decision-making has negative consequences for students precisely because of this incorrect attribution. Making VAM a defined percentage of teacher evaluations leads to employment decisions based on inaccurate perceptions of teacher quality. Typical VAM usage also makes it harder for us to identify successful teaching practices. If we instead focus on teachers’ locus of control – effective execution of teacher practices – and use Bayesian analysis, we will more quickly discover the best teaching strategies and retain more teachers who can execute teaching strategies effectively. Judging teachers on their ability to execute inputs and using outcomes to evaluate the validity of the inputs would, over time, increase the likelihood of student success.
3) As “a state-level policy advocacy organization,” what is the scope of StudentsFirst’s work?
You wrote that StudentsFirst “[doesn’t] advocate for specific models or types of evaluations” but believes “certain elements need to be there.” One of the elements you recommend is “evaluating teachers based on evidence of student results.” This recommendation has translated into your support for the use of standardized test scores as a defined percentage of teacher evaluations. I was not recommending that you ask states to adopt San Jose Unified’s evaluation framework (as an aside, the component you ask about deals mostly with planning and, among other things, uses lesson plans, teacher-created materials, and assessments as evidence) or that you recommend across-the-board class size reduction (thanks for clarifying your position on that, by the way – I look forward to reading the pieces you linked). Instead, since probability theory and research suggest it isn’t likely to improve teacher performance, I recommend that StudentsFirst discontinue its push to make standardized test scores a percentage of evaluations. You could instead advocate for evaluation systems that clearly define good teacher practices, hold teachers accountable for implementing good practices, and use student outcomes in Bayesian analysis to evaluate the validity of the defined practices. This approach would increase the likelihood of achieving your stated organizational goals.
Thanks again for engaging in such an in-depth conversation. I think more superficial correspondence often misses the nuance in these issues, and I am excited that you and I are getting the opportunity to both identify common ground and discuss our concerns.
Click here to read Part 3a of the conversation, which focuses back on the evaluation debate.
Click here to read Part 3b of the conversation, which focuses on how reformers and other educators talk about poverty.