StudentsFirst Vice President Eric Lerum and I Debate Accountability Measures (Part 1)

Published by

Ben Spielberg

August 4, 2014

After my blog post on the problem with outcome-oriented teacher evaluations and school accountability measures, StudentsFirst Vice President Eric Lerum and I exchanged a few tweets about student outcomes and school inputs and decided to debate teacher and school accountability more thoroughly. We had a lengthy email conversation we agreed to share, the first part of which is below.

Spielberg: In my last post, I highlighted why both probability theory and empirical research suggest we should stop using student outcome data to evaluate teachers and schools. Using value added modeling (VAM) as a percentage of an evaluation actually reduces the likelihood of better future student outcomes because VAM results have more to do with random error and outside-of-school factors than they have to do with teaching effectiveness.

I agree with some of your arguments about evaluation; for example, evaluations should definitely use multiple measures of performance. I also appreciate your opposition to making student test score results the sole determinant of a teacher’s evaluation. However, you insist that measures like VAM constitute a fairly large percentage of teacher evaluations despite several clear drawbacks; not only do they fail to reliably capture a teacher’s contribution to student performance, but they also narrow our conception of what teachers and schools should do and distract policymakers and educators from conversations about specific practices they might adopt. Why don’t you instead focus on defining and implementing best practices effectively? Most educators have similar ideas about what good schools and effective teaching look like, and a focus on the successful implementation of appropriately-defined inputs is the most likely path to better student outcomes in the long run.

Lerum: There’s nothing in the research or the link you cite above that supports a conclusion that using VAM “actually reduces the likelihood of better future student outcomes” – that’s simply an incorrect conclusion to come to. Numerous researchers have concluded that using VAM is reasonable and a helpful component of better teacher evaluations (also see MET). Even Shankerblog doesn’t go so far as to suggest using VAM could reduce chances of greater student success.

Some of your concerns with VAM deal with the uncertainty built within it. But that’s true for any measure. Yet VAM is one of the few measures (if not the only one) that has actually been shown to allow one to control for many of the outside factors you suggest could unfairly prejudice a teacher’s rating.

What VAM does tell us – with greater reliability than other measures – is whether a teacher is likely to get higher student achievement with a particular group of students. I would argue that’s a valuable piece of information to have if the goal is to identify which teachers are getting results and which teachers need development.

To suggest that districts & schools that are focusing on implementing new evaluation systems like those we support are not focusing on “defining and implementing best practices effectively” misses a whole lot of evidence to the contrary. What we’re seeing in DC, Tennessee, Harrison County, CO, and countless other places is that these conversations are happening, and with a renewed vigor because educators are working with more data and a stronger framework than ever before.

Back to your original post and my issues with it, however – focusing on inputs is not a new approach. It’s the one we have tried for decades. More pay for earning a master’s degree. Class size restrictions and staffing ratios. Providing funding that can only be used for certain programs. The list goes on and on.

Spielberg: I don’t think anyone thinks we should evaluate teachers on the number and type of degrees they hold, or that we should evaluate schools on how much specialized funding they allocate – I can see why you were concerned if you thought that’s what I recommended. My proposal is to evaluate teachers on the actions they take in pursuit of student outcomes and is something I’m excited to discuss with you.

However, I think it’s important first to discuss my statement about VAM usage more thoroughly because the sound bites and conclusions drawn in and from many of the pieces you link are inconsistent with the actual research findings. For example, if you read the entirety of the report that spawned the first article you link, you’ll notice that there’s a very low correlation between teacher value added scores in consecutive years. I’m passionate about accurate statistical analyses – my background is in mathematical and computational sciences – and I try to read the full text of education research instead of press releases because, as I’ve written before, “our students…depend on us to [ensure] that sound data and accurate statistical analyses drive decision-making. They rely on us to…continuously ask questions, keep an open mind about potential answers, and conduct thorough statistical analyses to better understand reality. They rely on us to distinguish statistical significance from real-world relevance.” When we implement evaluation systems based on misunderstandings of research, we not only alienate people who do their jobs well, but we also make bad employment decisions.

My original statement, which you only quoted part of in your response, was the following: “Using value added modeling (VAM) as a percentage of an evaluation actually reduces the likelihood of better future student outcomes because VAM results have more to do with random error and outside-of-school factors than they have to do with teaching effectiveness.” This statement is, in fact, accurate. The following are well-established facts in support of this claim:

– As I explained in my post, probability theory is extremely clear that decision-making based on results yields lower probabilities of future positive results when compared to decision-making based on factors people completely control.

– In-school factors have never been shown to explain more than about one-third of the opportunity gap. As mentioned in the Shanker Blog post I linked above, estimates of teacher impact on the differences in student test scores are generally in the ballpark of 10% to 15% (the American Statistical Association says it ranges from 1% to 14%). Teachers have an appreciable impact, but teachers do not have even majority control over VAM scores.

Research on both student and teacher incentives is consistent with what we’d expect from the bullet points above – researchers agree that systems that judge performance based on factors over which people have only limited control (in nearly any field) fail to reliably improve performance and future outcomes.

Those two bullet points, the strong research that corroborates the theory, and the existence of an alternative evaluation framework that judges teachers on factors they completely control (which I will talk more about below) would essentially prove my statement even if recent studies hadn’t also indicated that VAM scores correlate poorly with other measures of teacher effectiveness. In addition, principal Ted Appel astutely notes that, “even when school systems use test scores as ‘only a part’ of a holistic evaluation, it infects the entire process as it becomes the piece [that] is most easily and simplistically viewed by the public and media. The result is a perverse incentive to find the easiest route to better outcome scores, often at the expense of the students most in need of great teaching input.”

I also think it’s important to mention that the research on the efficacy of class size reduction, which you seem to oppose, is at worst comparable to the research on the accuracy of VAM results. I haven’t read many of the class size studies conducted in the last few years yet (this one is on my reading list) and thus can’t speak at this time to whether the benefits they find are legitimate, but even Eric Hanushek acknowledges that “there are likely to be situations…where small classes could be very beneficial for student achievement” in his argument that class size reduction isn’t worth the cost. It’s intellectually inconsistent to argue simultaneously that class size reduction doesn’t help students and that making VAM a percentage of evaluations does, especially when (as the writeup you linked on Tennessee reminds us) a large number of teachers in some systems that use VAM have been getting evaluated on the test scores of students they don’t even teach.

None of that is to say that the pieces you link are devoid of value. There’s some research that indicates VAM could be a useful tool, and I’ve actually defended VAM when people confuse VAM as a concept with the specific usage of VAM you recommend. Though student outcome data shouldn’t be used as a percentage of evaluations, there’s a strong theoretical and research basis for using student outcomes in two other ways in an input-based evaluation process. The new teacher evaluation system that San Jose Unified School District (SJUSD) and the San Jose Teachers Association (SJTA) have begun to implement can illustrate what I mean by an input-based evaluation system that uses student outcome data differently and that is more likely to lead to improved student outcomes in the long run.

The Teacher Quality Panel in SJUSD has defined the following five standards of teacher practice:

1) Teachers create and maintain effective environments for student learning.

2) Teachers know the subjects they teach and how to organize the subject matter for student learning.

3) Teachers design high-quality learning experiences and present them effectively.

4) Teachers continually assess student progress, analyze the results, and adapt instruction to promote student achievement.

5) Teachers continuously improve and develop as professional educators.

Note that the fourth standard gives us one of the two important uses of student outcome data – it should drive reflection during a cycle of inquiry. These standards are based on observable teacher inputs, and there’s plenty of direct evidence evaluators can gather about whether teachers are executing these tasks effectively. The beautiful thing about a system like this is that, if we have defined the elements of each standard correctly, the student outcome results should take care of themselves in the long run.

However, there is still the possibility that we haven’t defined the elements of each standard correctly. As a concrete example, SJTA and SJUSD believe Explicit Direct Instruction (EDI) has value as an instructional framework, and someone who executes EDI effectively would certainly do well on standard 3. However, the idea that successful implementation of EDI will lead to better student outcomes in the long run is a prediction, not a fact. That’s where the second usage of student outcome data comes in – as I mentioned in my previous post, we should use student outcome results to conduct Bayesian analysis and figure out if our inputs are actually the correct ones. Let me know if you want me to go into detail about how that process works. Bayesian analysis is really cool (probability is my favorite branch of mathematics, if you haven’t guessed), and it will help us decide, over time, which practices to continue and which ones to reconsider.
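As a rough illustration of the kind of Bayesian updating described above, here is a minimal sketch. It assumes a simple Beta-Binomial model, and every number in it is made up for illustration; the function names and the classroom counts are hypothetical and are not drawn from the SJUSD/SJTA process or from any study.

```python
from fractions import Fraction


def update_beta(alpha, beta, successes, failures):
    """Return posterior Beta parameters after observing new outcomes.

    With a Beta(alpha, beta) prior on an unknown success rate and
    binomial data, the posterior is Beta(alpha + successes, beta + failures).
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, as an exact fraction."""
    return Fraction(alpha, alpha + beta)


# Assumption: we start with no strong opinion about how often a given
# practice is followed by improved outcomes, i.e. a uniform Beta(1, 1) prior.
prior = (1, 1)

# Hypothetical data: of 20 classrooms that executed the practice well,
# 14 later showed improved student outcomes and 6 did not.
posterior = update_beta(*prior, successes=14, failures=6)

print("prior mean:", float(beta_mean(*prior)))          # 1/2
print("posterior mean:", float(beta_mean(*posterior)))  # 15/22, about 0.68
```

Each new round of outcome data can be fed through the same update, with the previous posterior serving as the next prior. In this sketch the outcome data never scores an individual teacher; it only shifts our confidence in whether the practice itself belongs in the standards.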

I certainly want to acknowledge that many components of systems like IMPACT are excellent ones; increasing the frequency and validity of classroom observations is a really important step, for instance, in executing an input-based model effectively. We definitely need well-trained evaluators and calibration on what great execution of given best practices looks like. When I wrote that I’d like to see StudentsFirst “focus on defining and implementing best practices effectively,” I meant that I’d like to see you make these ideas your emphasis. Conducting evaluations on these sorts of input-based criteria would make professional development and support significantly more relevant. It would help reverse the teach-to-the-test phenomenon and focus on real learning. It would make feedback more actionable. It would also help make teachers and unions feel supported and respected instead of attacked, and it would enable us to collaboratively identify both great teaching and classrooms that need support. Most importantly, using these kinds of input-based metrics is more likely than the current approach to achieve long-run positive outcomes for our students.

Part 2 of the conversation, posted on August 11, can be found here.

6 responses to “StudentsFirst Vice President Eric Lerum and I Debate Accountability Measures (Part 1)”

Carl Herman

August 4, 2014

Nice, Ben! The data always reveals what happened, and it’s often the greatest challenge to reveal objective realities that most comprehensively show what we’re working with.

Honest people will get closer and closer to the best data with each attempt. I appreciate your work and commitment to shine light to reveal our best data for effective education.

1. Ben Spielberg
  
  August 4, 2014
  
  Thanks, Carl! Hope the end of summer is treating you well!
  
Darren Battaglia (@darrenbattaglia)

August 5, 2014

Ben, I would so like to agree with you about VAM – and I’ll leave the citation search up to you. But as I recall, the very definition of the value-added education function that Hanushek and others use takes into account previous knowledge, socioeconomics, other factors *and error*. So any variation from these other sources would be negated and you’re left with teacher effects.

That’s not to say that VAM is without problems. Not all VAM models are the same. The lack of random assignment of students to classrooms can cause previous scores to predict future teachers. Teachers need several years of scores and many teach in areas that are not traditionally assessed. Also – let’s face it, nobody really understands it except for statisticians. How can you evaluate someone on something they don’t understand? It may or may not be good statistics but it isn’t good policy.

1. Ben Spielberg
  
  August 5, 2014
  
  Thanks so much for reading and for the thoughtful comment, Darren. You’re absolutely right that the goal of most value added systems is to isolate teacher effects, but a large part of the problem is that no statistical model has been shown to accomplish that goal particularly well (and certainly not well enough to justify VAM’s typical uses). I think trying to quantify teacher impact on student learning is a worthwhile endeavor, but it’s a very hard problem and one that we’re still a long way from solving.
  
  As you mention, VAM usage in evaluations would be problematic even if the statistical models worked as intended. I completely agree that we shouldn’t evaluate people on things they don’t understand, and not just for the sake of transparency. If we believe the purpose of evaluation is to help teachers improve – and I think most people do – then evaluations need to provide actionable feedback on how to get better, something VAM scores cannot possibly do.
  
Will

August 7, 2014

What a great discussion. I wanted to share a blog post about poverty that has challenged my thinking:

http://gatsbyinla.wordpress.com/2014/07/18/lesson-3-we-need-a-better-language-to-describe-poverty/

I’m not sure what goes into VAM formulas (I know it’s based off of socioeconomic factors, but I’m not sure what factors those are: parent salary? free/reduced lunch? parent education levels? homeowner/renter? receives food stamps? one adult at home vs. two? incarceration rates of family members? ELL? SPED?). But the struggle is how does a system accurately “equalize” different students so that it is fair to evaluate a teacher? How do we “equalize” for the 3 different types of poverty the Gatsby in LA author writes about?

1. Ben Spielberg
  
  August 7, 2014
  
  That blog post is great – thanks for sending it. I think you raise a lot of really good questions about how to acknowledge the distinct differences that often exist between student populations that share similar socioeconomic characteristics. We need measures of the stress and hardship associated with growing up in a particular environment.
  
  I don’t know what those measures are, but I’ll definitely start brainstorming. Thanks for the comment!
  

34justice