Cooks, Chefs, and Teachers: A Long-Form Debate on Evaluation (Part 3a)

StudentsFirst Vice President Eric Lerum and I have been debating teacher evaluation approaches since my blog post about why evaluating teachers based on student test scores is misguided and counterproductive.  Our conversation began to touch on the relationship between anti-poverty activism and education reform conversations, a topic we plan to continue discussing.  First, however, we wanted to focus back on our evaluation debate.  Eric originally compared teachers to cooks, and while I noted that cooks have considerably more control over the outcomes of their work than do teachers, we fleshed that analogy out and continue discussing its applicability to teaching below.

Click here to read Part 1 of the conversation.

Click here to read Part 2 of the conversation.

Lerum: I love the analogy you use for this simple reason – I don’t think we’re as interested in figuring out whether the cook is an “excellent recipe-follower” as we are in whether the cook makes food that tastes delicious. And since we’re talking about the evaluation systems themselves – and not the consequences attached (which, by and large, most jurisdictions are not using) – this really matters. The evaluation instrument may reveal that the cook is not an “excellent recipe follower,” a point you gloss over. But that’s an important point: it could certainly identify those cooks who need to work on their recipe-following skills. That’s helpful in creating better cooks.

But take your hypothetical in which the evaluation identifies someone who follows a recipe well and executes our strategies, yet the outcome is still bad – that is also important information. It could cause us to re-evaluate the recipe, the meal choice, certain techniques, even the assessment instrument itself (do the people tasting the food know what good food tastes like?). All of those would be useful and significant pieces of information that we would not get if we weren’t starting with an evaluation framework that includes outcome measures.

You clearly make the assumption that nobody would question the evaluation instrument or anything else – if we had this result for multiple cooks, we would just keep going with it and assume it’s the cooks and nothing else. But that’s an unreasonable assumption that I think is founded on a lack of trust and respect for the intentions underlying the evaluation. What we’re focused on is identifying, improving, rewarding, and making decisions based on performance. And we want accurate measures for doing so – nobody is interested in models that do not work. That’s why you constantly see the earliest adopters of such models making improvements as they go.

Also, to clarify, we do not advocate for the “use of standardized test scores as a defined percentage of teacher evaluations.” I assume you probably didn’t mean that literally, but I think it’s important for readers to understand the difference as it’s a common and oft-repeated misconception among critics of reform. We advocate for use of measures of student growth – big difference from just using the scores alone. It doesn’t make any sense to evaluate teachers based on the test scores themselves – there needs to be some measure (such as VAM) of how much students learn over time (their growth), but that is not a single snapshot based on any one test.

I appreciate your recommendation regarding the use of even assessment-based growth data, but again, your recommendation is based on your opinion, and I respectfully disagree, as do many researchers and respected analysts (also see here and here – getting at some of the issues you raise as concerns, but proposing different solutions). To go back to your analogy, nobody is interested in going to a restaurant run by really good recipe-followers. They want to go where the food tastes good. Period. Likewise, no parent wants to send her child to a classroom taught by a teacher who creates and executes the best lesson-planning. They want to send their child to a classroom in which she will learn. Outcomes are always part of the equation. Figuring out the best way to measure them may always involve some inherent subjectivity or variability, but I believe that removing outcomes from the overall evaluation itself betrays the initial purpose to some degree.

Spielberg: I think there’s some confusion here about what I’m advocating for and critiquing.  I’d like to reiterate what I have consistently argued in this exchange – that student outcomes should be a part of the teacher evaluation process in two ways:

1) We should evaluate how well teachers gather data on student achievement, analyze the data, and use the data to reflect on and improve their future instruction.

2) We should examine the correlation between the effective execution of teacher practices and student outcome results.  We should then use the results of this examination to revise our instructional practices as needed.

I have never critiqued the fact that you care about student outcomes and believe they should factor heavily into our thinking – on this point we agree (I’ve never met anyone who works in education who doesn’t).  We also agree that it is better to measure student growth on standardized test scores, as value added modeling (VAM) attempts to do, than to look at absolute scores on standardized tests (I apologize if my earlier wording about StudentsFirst’s position was unclear – I haven’t heard anyone speak in favor of the use of absolute scores in quite some time and assumed everyone reading this exchange would know what I meant).  Furthermore, the “useful and significant pieces of information” you talk about above are all captured in the evaluation framework I recommend.

My issue has always been with the specific way you want to factor student outcomes into evaluation systems.  StudentsFirst supports making teachers’ VAM results a defined percentage of a teacher’s “score” during the evaluation process, do you not?  You highlight places, like DC and Tennessee, that use VAM results in this fashion.  Whether or not this practice is likely to achieve its desired effect is not really a matter of opinion; it’s a matter of mathematical theory and empirical research.  I’ve laid out why StudentsFirst’s approach is inconsistent with the theory and research in earlier parts of our conversation and none of the work you link above refutes that argument.  As you mention, both Matt Di Carlo and Douglas Harris, the authors of the four pieces you linked, identify issues with the typical uses of VAM similar to the ones I discuss.  Their main defense of VAM is only to suggest that other methods of evaluation are similarly problematic; Harris discusses a “lack of reliability in essentially all measures” and Di Carlo notes that “alternative measures are also noisy.”  There is, however, more recent evidence from MET that multiple, full-period classroom observations by multiple evaluators are significantly more reliable than VAM results.  While Di Carlo and Harris do have slightly different opinions than me about the role of value added, Di Carlo’s writing and Harris’s suggestion for evaluation on the whole seem far closer to what I’m advocating than to StudentsFirst’s recommendations, and I’d be very interested to hear their thoughts on this conversation.

That said, I like your focus above on what parents want, and I think it’s a worthwhile exercise to look at the purposes of evaluation systems and how our respective proposals meet the desires and needs of different stakeholders.  I believe evaluation systems have three primary purposes: providing information, facilitating support, and creating incentives.

1) Providing Information – You wrote the following:

…nobody is interested in going to a restaurant run by really good recipe-followers. They want to go where the food tastes good. Period. Likewise, no parent wants to send her child to a classroom taught by a teacher who creates and executes the best lesson-planning. They want to send their child to a classroom in which she will learn.

The first thing I’d note is that this juxtaposition doesn’t make very much sense; students taught by teachers who create and execute the best lesson-planning will most likely learn quite a bit (assuming that the teachers who are great lesson planners are at least decent at other aspects of good teaching). In addition, restaurants run by really good recipe-followers, if the recipes are good, will probably produce good-tasting food.  Good outputs are expected when inputs are well-chosen and executed effectively.

The cooking analogy is a bit problematic here because, in the example you give, the taste of the food is both the ultimately desired outcome and the metric by which you propose to assess the cook’s output.  In the educational setting, the metric – VAM, in the case of our debate – is not the same as the desired output.  In fact, VAM results are a relatively weak proxy for only a subset of the outcomes we care about for kids (those related to academic growth).  To construct a more appropriate analogy for judging a teacher on VAM results, let’s consider a chef who works in a restaurant where we want to eat dinner.  We are interested, ultimately, in the overall dining experience we will have at the restaurant. A measurement tool parallel to VAM, one that gives us a potentially useful but very limited picture of only one aspect of the experience other diners had, could be other diners’ assessments of the smell of the chef’s previous meals.

This analogy is more appropriate because the degree to which different diners value different aspects of a dining experience is highly variable.  All diners likely care to some extent about a combination of the food selection, the sustainability of their meal, the food’s taste, the atmosphere, the service, and the price.  Some, however, might value a beautiful, romantic environment over the taste of their entrees, while others may care about service above all else.  Likewise, some parents may care most about a classroom that fosters kindness, some may prioritize the development of critical thinking skills, and others may hold content knowledge in the highest esteem.

Were I to eat at a restaurant, I’d certainly get some information from knowing other diners’ assessments of previous meals’ smells.  Smell and taste are definitely correlated and I tend to value taste above other considerations when I’m considering a restaurant.  Yet it’s possible that other diners like different kinds of food than me, or that their senses of smell were affected by the weather or allergies when they dined there.  Some food, even though it smells bad, tastes quite good (and vice versa).  If I didn’t look deeper and really analyze what caused the smell ratings, I could very easily choose a sub-optimal restaurant.

What I’d really want to know would be answers to the following questions: what kind of food does the chef plan to make?  Does he source it sustainably?  Is it prepared to order?  Is the wait-staff attentive?  What’s the decor like?  The lighting?  Does the chef accommodate special requests?  How does the chef solicit feedback from his guests, and does he, when necessary, modify his practices in response to the feedback?  If diners could get information on the execution in each of these areas, they would be much better positioned to figure out whether they would enjoy the dining experience than if they focused on other diners’ smell ratings.  A chef who did all of these things well and who used Bayesian analysis to add, drop, and refine menu items and restaurant practices over time would almost certainly maximize the likelihood that future guests would leave satisfied.  A chef with great smell ratings might maximize that probability, but he also might not.

The exact same reasoning applies to the classroom experience.  Good VAM results might indicate a classroom that would provide a learning experience appropriate for a given student, but they might not.  Though I will again note that you don’t advocate for judging teachers solely on VAM, VAM scores tend to be what people focus on when they’re a defined percentage of evaluations.  That focus, again, does not provide very good information.  Whether parents value character development, inspiration, skill building, content mastery, or any other aspect of their children’s educational experience, they would get the best information by concentrating on teacher actions. If a parent knows a teacher’s skill – at establishing a positive classroom environment, at lesson planning, at lesson delivery, at using formative assessment to monitor student progress and adapt instruction, at helping students outside of class, etc. – that parent will be much more informed about the likelihood that a child will learn in a teacher’s class than if that parent focuses attention on the teacher’s VAM results.

2) Facilitating support – A chef with bad smell ratings might not be a very good chef.  But if that’s the case, any system that addressed the questions above – that assessed the chef’s skill at choosing recipes, sourcing great ingredients, making food to order, training his wait-staff, decorating his restaurant, responding to guest feedback, etc. – should also give him poor marks.  Bad results that truly signify bad performance, as opposed to reflecting bad luck or circumstances outside of the chef’s control, are the result of a bad input.  The key idea here is that, if we judge chefs on input execution but monitor outputs to make sure the inputs are comprehensive and accurate, judging chefs on their smell ratings won’t give us any additional information about which chefs need support.

More importantly, making smell ratings a defined percentage of a chef’s evaluation would not help a struggling chef improve his performance.  No matter the other components of his evaluation, he is likely to concentrate primarily on the smell ratings, feel like a failure, and have difficulty focusing on areas in which he can improve.  If we instead show the chef that, despite training the waitstaff well, he is having trouble selecting the best ingredients, we give him an actionable item to consider.  “Try these approaches to selecting new ingredients” is much easier to follow and much less demoralizing a directive than “raise your smell ratings.”

I think the parallel here is pretty clear – if we define and measure appropriate teaching inputs and use outcomes in Bayesian analysis to constantly revise those inputs, making VAM a defined percentage of an evaluation provides no new information about which teachers need support.  Especially because VAM formulas are complex statistical models that aren’t easily understood, the defined-percentage approach also focuses the evaluation away from actionable improvement items and towards the assignment of credit and blame.

3) Creating Incentives – Finally, a third goal of evaluation systems is related to workforce incentives.  First, we often wish to reward and retain high-performers and, in the instances in which support fails, exit consistently low-performers.  For retention and dismissal to improve overall workforce quality, we must base these decisions on accurate performance measures.

I don’t think the incomplete information provided by VAM results and smell ratings needs rehashing here; the argument is the same as above.  We are going to retain a higher percentage of chefs and teachers who are actually excellent if our evaluation systems focus on what they control than if our incentives focus on outputs over which they have limited impact.

Of particular concern to me, however, are the incentives teachers have for working with the highest-need populations.  Even efforts that take great pains to “level the playing field” between teachers with different student populations result in significantly better VAM results for teachers and schools that work with more privileged students.  Research strongly suggests that teachers who work in low-income communities could substantially improve their VAM scores by moving to classrooms with more affluent populations (and keeping their teaching quality constant).  When we make VAM results a defined percentage of an evaluation, we provide incentives for teachers who work with the highest-need populations to leave.  The type of evaluation I’m proposing, if we execute it properly, would eliminate this perverse incentive.

Again, I want to reiterate that I support constantly monitoring student outcomes; we should evaluate teachers on their ability to modify instruction in response to student outcomes, and we should also use outcomes to continuously refine our list of great teaching inputs.  But we rely on evaluation systems to provide accurate and comprehensive information, to help struggling employees improve, and to provide appropriate incentives.  VAM can help us think about good teaching practices, but StudentsFirst’s proposed use of VAM does not help us accomplish the goals of teacher evaluation.

Part 3b – in which we return to our discussion about the relationship between anti-poverty work and education reform – will follow soon!

Update (8/21/14) – Matt Barnum alerted me to the fact that the article I linked above about efforts to “level the playing field” when looking at VAM results actually does provide evidence that “two-step VAM” can eliminate the bias against low-income schools.  That’s exciting because, assuming the results are replicable and accurate, this particular VAM method would eliminate one of the incentive concerns I discussed.  However, while Educators 4 Excellence (Barnum’s organization) advocates for the use of this method, I don’t believe states currently use it (if you know of a state that does, please feel free to let me know).  The significant other issues with VAM would also still exist even with the use of the two-step version.

6 responses to “Cooks, Chefs, and Teachers: A Long-Form Debate on Evaluation (Part 3a)”

  1. limyanko

    I’m really enjoying this debate. Thanks for hosting it, Ben, and I’m glad to see both you and Eric engage each other on the ideas in such a clear and thoughtful way. If only all education policy discussions were so…

    To name my own biases up front: I’m a big fan of VAM philosophically, i.e., I want to understand the impact that teachers have on student academic growth (controlling for other variables). But the more I learn about our current ability to deliver on this aspiration, the more skeptical I become that we ought to be using it today.

    That said, I’d love to ask you two questions:

    1. You state that we ought to use Bayesian reasoning to determine what inputs are effective, and coach/evaluate teachers based on their proficiency with these inputs. Could you go into greater detail about how exactly you would apply Bayes’ theorem here? Specifically, how would you determine your priors, e.g. estimate the probability that direct instruction will lead to student growth? And how would you measure said growth?

    2. I can easily imagine a world in which we implement an evaluation framework based on proficiency with inputs, and 99%+ of all teachers in District P are evaluated as proficient or exemplary on every input every year. (In this case, of course, the system would fail to provide meaningful information or facilitate much support.)

    I think many reformers see this happening now — the vast majority of all teachers rated proficient or exemplary — and that is what drives the desire to have VAM make up a specified portion of the evaluation. They think, “When low value-added scores come in for some teachers, the principal will HAVE to do something about it!”

    I think this approach is doomed to fail. If teachers aren’t being held to high standards or receiving meaningful feedback in their evaluations, it isn’t because that’s impossible under our current evaluation systems (or at least the ones I’ve seen). It’s because there’s a problem in the culture of how we approach teacher evaluation in that school or district. Now, whether that cultural problem stems from principal disengagement, lack of trust in the system, too-strong unions, etc. depends on who you ask. But no matter what the cause, I doubt that a technocratic solution (“Q% of the evaluation must be determined by VAM data!”) will ever solve a cultural problem.

    I’m curious to hear your thoughts. Agree? Disagree?

  2. Thanks, Darin, for the awesome comment. I too am interested in VAM as a concept. I actually created a statistic that tried to do something similar for college basketball players when I was a sophomore in college, and I think it’s an interesting exercise (though, as you know, I am probably at least as concerned as you are about our current use of VAM results).

    On your second question, I think there’s one other possible explanation for high pass rates that you haven’t considered: that most teachers are pretty good at their jobs. 99% does seem too high, and anecdotal cases of bad teaching abound, but I’m curious about what percentage of teachers reformers actually think don’t deserve good ratings and what their basis is for making that assessment. The issues I’ve seen with inaccurate evaluations seem to stem from overworked and/or untrained administrators, and while I’d again say that I agree that 99% is too high, I’ve seen as many incorrectly bad evaluations as incorrectly good ones. So I guess I kind of agree, but I think this topic invites a much longer discussion, one that includes a conversation about the issues with grading and rating people in general (something I grappled with a lot as a teacher – I believe that assigning grades to students often causes more harm than good).

    On your first question, determining the initial priors is difficult. Suppose we want to study Teaching Input A (TIA for short) and whether it is an effective practice. Suppose also that we have a 5-point scale for ranking teacher performance on TIA, 5 being the best score and 1 being the worst score. To some extent our prior will be arbitrarily determined based on our feeling about how good a teaching practice TIA happens to be. However, we could potentially use existing VAM research as our baseline. I haven’t fully thought this part through – I think it’s a hard problem that would take a fair amount of work – but I envision something along the lines of looking at the distribution of individual students’ growth on test scores for teachers in each quintile of “performance” (as measured by VAM, with the caveat that I don’t think we are currently very good at measuring true performance) and assigning probabilities based on that distribution that students would, on average, fall within a given band of “performance.” These probabilities would obviously be imprecise, but they wouldn’t be used for high stakes purposes and would only be the starting point, so I don’t think the imprecision would be a huge problem.
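
    To make the mechanics of that prior-setting step a little more concrete, here is a minimal sketch in Python. Everything in it – the simulated growth scores, the quintile grouping, the five performance bands – is hypothetical and is there only to illustrate the shape of the calculation, not any actual VAM dataset or district procedure.

    ```python
    # Hypothetical sketch: turn growth-score distributions (grouped by VAM quintile)
    # into starting-point probabilities that a teacher's students land in a given
    # performance band. All data below is simulated.
    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated average student growth for 200 teachers in each VAM quintile
    # (quintile 5 = highest measured "performance").
    growth_by_vam_quintile = {q: rng.normal(loc=q, scale=1.5, size=200) for q in range(1, 6)}

    # Cut the pooled distribution into five equal-frequency "bands" (band 5 = most growth).
    pooled = np.concatenate(list(growth_by_vam_quintile.values()))
    band_edges = np.quantile(pooled, [0.2, 0.4, 0.6, 0.8])

    def band_probabilities(growth_scores):
        """Empirical probability that average student growth lands in each band."""
        bands = np.digitize(growth_scores, band_edges) + 1  # values 1 through 5
        return {b: float(np.mean(bands == b)) for b in range(1, 6)}

    # Starting-point priors: P(students land in band b | teacher is in VAM quintile q).
    priors = {q: band_probabilities(scores) for q, scores in growth_by_vam_quintile.items()}
    print(priors[5])  # rough prior band probabilities for a top-quintile teacher
    ```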

    We’d then look at teachers’ scores on TIA and the resultant band in which their students’ results fell. The process is fairly complicated for a large number of teachers, so I’ll try and illustrate the idea by imagining that there is only one teacher to consider. Suppose the teacher gets a 5 on the TIA and students’ scores end up in band 3. We’d need the following probabilities:

    1) The probability we would have assigned, beforehand, to the situation in which a teacher with a 5 on TIA would have students, on average, in band 3. Let’s imagine we used the process above and got 10% (x = .1).

    2) The probability that, assuming TIA is the right practice, students would end up in band 3. Let’s assume, again for simplicity’s sake, that TIA is the only teaching practice that affects student learning. To simplify the calculation even further, let’s assume that student background characteristics don’t impact the results at all, and that 58% of the student results are due to TIA and 42% due to uniformly random variation (I’m choosing the numbers to make subsequent calculation easier). In that scenario, we’d assign this probability at approximately 5% (y = .05).

    3) The probability that, assuming that we’re wrong about TIA, students would end up in band 3. For simplicity, let’s assume in this case that neither background characteristics nor teaching skill impacted student scores – that uniformly random variation is the only factor at work. That probability would be 20% in this case (z = .2).

    Bayes’ theorem would then give an updated probability that TIA is the right practice of approximately xy/[xy+z(1-x)], or (.1)(.05)/[(.1)(.05)+(.2)(.9)] ≈ .027, or 2.7%. This low probability would then become our new prior, and if something similar happened in the next year, we’d have good reason to believe that TIA might not be as great as we think it is.
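
    To make the arithmetic easy to check, here is that update as a minimal Python sketch, reading x as the prior probability that TIA is the right practice, y as the chance of a band-3 result if it is, and z as the chance of a band-3 result if it isn’t – the reading under which the formula above is just Bayes’ theorem. The numbers are the made-up ones from this example.

    ```python
    # Minimal sketch of the Bayesian update described above, using the hypothetical
    # numbers from this example (nothing here reflects real evaluation data).
    def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
        """Posterior probability of the hypothesis after observing the evidence."""
        numerator = prior * p_evidence_if_true
        return numerator / (numerator + (1 - prior) * p_evidence_if_false)

    x, y, z = 0.10, 0.05, 0.20              # prior, P(band 3 | TIA right), P(band 3 | TIA wrong)
    posterior = bayes_update(x, y, z)
    print(round(posterior, 3))              # 0.027, i.e. roughly 2.7%

    # The posterior becomes next year's prior; a second similar result pushes it lower still.
    print(round(bayes_update(posterior, y, z), 4))  # about 0.0069
    ```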

    This exercise is a gross oversimplification and relies on several assumptions that are divorced from reality, but I hope it clarifies the general idea a bit. I think a decent working model would be very hard but also fun to develop.

  3. The edreform fetish around numeric data and testing is a huge obstacle to productive advances in evaluation. Look how much time and energy we put into trying to massage the math when we’re dealing with remarkably complex human beings in remarkably volatile settings across significant lengths of time with disturbingly narrow and mediocre measurement tools!
    In a blog post I wrote a few years ago ( http://goo.gl/pGJGMz ), I cited studies and reports documenting the effects of many factors on student performance – none of which are figured into any VAM model I’ve ever heard of. Claims about effects generally rest on assumptions about certain influences being held constant, or controlled for. But none of these studies attempts to account for the interaction of all these factors when none of them are held constant or controlled for – that would be a nightmare, an impossibility. Okay – welcome to my classroom! (Or any classroom).
    Where’s the VAM model that accounts for any of the following: changes in administration, schedules, counseling, technology, facilities, curriculum, teaching assignments, professional development, and student tutoring? And what about the VAM model that accounts for the interaction of all of the above, and the fact that they do not affect all teachers equally?
    The teacher is not a constant, and the school is not a constant, and the tests are not a constant, and VAM isn’t good enough to account for everything that goes into my work and into my students’ LIVES.
    I will give edreformers credit for pushing the conversation forward about putting learning outcomes into the evaluation discussion. But I agree with Ben’s formulation of the idea: “evaluate how well teachers gather data on student achievement, analyze the data, and use the data to reflect on and improve their future instruction.” If we can trust teachers and school/district staff to make professional judgments about the appropriate ways to do that, we can end one of the most divisive debates out there and move forward, focused on the most important work students and teachers do, in the area where reformers and their critics likely agree. Bending over backwards to find a way to mandate the incorporation of VAM is diminishing our professional capital. No successful education reform in the nation or the world was ever advanced over the objection of professional practitioners, especially in a manner that reduced their authority and autonomy. I’ll stand with James Popham on this one. He’s one of the foremost experts on assessment, and he concluded simply and elegantly that the most logical and productive way to evaluate teachers is the same as in other professions – the informed judgment of expert practitioners.

  4. Hi Ben,

    Apologies for taking so long to respond to your (very thoughtful) reply.

    On using Bayesian reasoning in teacher coaching/evaluation — thanks for outlining the example, although I’m afraid I did a poor job of communicating my original questions. I’m familiar with how to apply Bayes’ theorem, generally. It’s how to account for all the variables that your example assumes away (for the sake of simplicity) that I think is so tricky and important to figure out. It seems totally right to say, at a high level, that we should use Bayes’ theorem to update our beliefs about what inputs are effective over time. But it also seems totally right to say that we should use value-added modeling to isolate the teacher’s contribution in student outcomes. The devil is in the details.

    (Not to say that I expected you to outline, in your reply, a full model of how to isolate the impact of individual best practices, how student background impacts achievement, etc.! I agree 100% that this would be incredibly challenging to do. But it’s what I would want to see, to move from the space of “theoretical agreement” to endorsing a Bayesian system in practice.)

    Onto the second topic: You said you agree that 99% proficient/exemplary is too high, but “most teachers are pretty good at their jobs.” That suggests that some undefined (but still fairly high) number of teachers should receive a proficient/exemplary rating.

    I have no idea what percentage of teachers should be ranked unsatisfactory, or as you put it, “don’t deserve [their] good ratings.” I would imagine that it varies enormously across the country, based on things such as the strength of local teacher prep programs, quality of coaching and professional development at the district/school, etc. That said, I’ll hazard a guess that you’re overall more sanguine about the quality of instruction nationwide in public schools than me. (Although perhaps not by much! “Most teachers are pretty good at their jobs” implies a pretty broad range.)

    That said, here are two thoughts on why — even if we both agree that 99% proficient/exemplary is too high — I likely would rate more teachers unsatisfactory than you:

    1. It’s well-documented that students of color receive disproportionate rates of suspensions and expulsions, compared to white students with similar infractions and discipline histories. (Example: http://www2.ed.gov/about/reports/annual/ocr/report-to-president-2009-12.pdf) This is a major issue and I think it should be reflected in teachers’ evaluations.

    Let me take a moment to be clear: I’m not advocating that every teacher who disproportionately disciplines students of color should be fired. Systemic racism is pervasive in the United States, and given where we are as a country, I understand why (probably) most teachers unintentionally stereotype their students of color, why White teachers might fail to recognize their privilege or unfairly expect kids of color to conform to White cultural norms, etc. It’s the same reason that almost ALL of us (including me) stereotype others based on race/ethnicity, have trouble recognizing our own privilege, etc. We live immersed in the smog of racism.

    That said, I DO think it’s critical for these disparities in discipline to be brought to light, formally documented in evaluations, and for that to drive coaching.

    (I want to make this point explicitly because I think it’s so easy — especially now — to conflate teacher evaluation and teacher discipline/firing. I believe the primary purpose of evaluations is to align teachers and administrators around teachers’ strengths and weaknesses, and drive professional development. When and how to fire teachers is a separate, albeit related, conversation.)

    2. In my work, I’ve had the privilege of observing instruction at many schools — district, charter, and BIE — across many states. I’ve observed many TFA corps members and alumni, but also many traditionally-certified educators. Overall, I agree with Grant Wiggins: the majority have been really boring. (He’s written about this a lot. Here’s one example: http://grantwiggins.wordpress.com/2011/07/26/bor-ing/)

    In my opinion, proficient (let alone exemplary) instruction is not frequently boring.

    (I’ll note one possible source of bias here: most of my observations have occurred in the context of high-poverty schools. Perhaps things are different in the suburbs or wealthy neighborhoods. That said, I believe you have previously argued that there is not significant variation in teaching quality between high-poverty and low-poverty areas, and differences in outcomes are primarily the result of out-of-school factors. Moreover, Grant Wiggins’ work does not exclusively occur in high-poverty schools such as the ones where TFA places corps members.)

    Finally, let’s assume for the moment that I’m totally wrong about both of the above points. I’d still argue that any evaluation system that clumps together the vast majority of teachers as proficient/exemplary is poorly-designed, because the categories do not provide much information (to either the teacher or administrators).

    Imagine a district where the evaluation system has three categories: unacceptable, proficient, and exemplary. Let’s say the distribution is something like 10%-60%-30%. (I chose 90% proficient/exemplary just because it’s lower than 99%, which we agree is too high, but still fulfills “most teachers are pretty good at their jobs.”) Let’s say that teachers are evaluated based only on inputs, as determined by observations, examining plans, and reflections on their own practice.

    I’d argue that in this case, the “proficient” and “exemplary” categories don’t tell you very much. There’s a huge difference in efficacy between (say) someone at the bottom of the “proficient” distribution and someone at the top. If I were a teacher, I’d want more fine-grained information, e.g. because I want to improve my practice over time, and one way to determine whether that’s occurring is to see improvement in my evaluations. If I were a principal, I’d want more fine-grained information, too, e.g. if I were comparing two internal transfer candidates who are each rated “proficient.”

    Wouldn’t all parties, in this case, be better served by an evaluation rubric with more categories, which provided more differentiating information?

    • Hey Darin,

      No worries at all about the delay – thanks a lot for the comments!

      I wholeheartedly agree with you that, to endorse a Bayesian system in practice, we’d need significantly more information about how it would work. The devil is definitely in the details. But there are a lot of people currently working on VAM research and on systems that implement VAM for different (and theoretically worse) purposes; I think these people could develop those details if they were to switch their focus. The nice thing about experimentation with VAM (or other outcome measures) as part of a Bayesian process on inputs, as opposed to our current experimentation with VAM, is that the only downside of not getting things right immediately is opportunity cost.

      In terms of evaluation and the ratings of teachers, I think this paragraph that you wrote is key:

      “(I want to make this point explicitly because I think it’s so easy — especially now — to conflate teacher evaluation and teacher discipline/firing. I believe the primary purpose of evaluations is to align teachers and administrators around teachers’ strengths and weaknesses, and drive professional development. When and how to fire teachers is a separate, albeit related, conversation.)”

      While I would argue that boring classes are due far more to institutional problems than teacher quality, I completely agree that all teachers could grow quite a bit, and some more than others. I also completely agree that racial discrepancies in discipline should be attended to during the evaluative process, as should anything else that relates to students’ classroom experience. However, I would go back to the quoted paragraph above when thinking about how to incorporate these ideas into teacher evaluation.

      To me, categories for evaluation are kind of like grades – they don’t seem to serve a growth purpose. Instead, they seem to rate/rank people. The rating is a fast way for somebody unfamiliar with a person’s performance to judge it, but for the person receiving it, the rating mainly seems to make the person happy (if the rating is good) or unhappy (if the rating is bad). People with good ratings might get rewards and people with bad ratings might get punishments as well. I don’t have a clearly defined position on grades and rankings, and I like them in some contexts (I rank baseball players all the time for my fantasy drafts, for example), but I currently lean towards the viewpoint that they do more harm than good in most contexts. I gave grades when I taught, of course, but I think I would have preferred not to do so.

      In the new teacher evaluation system SJTA and SJUSD negotiated, there are only two final categories: “meets standard” and “does not meet standard.” These categories get at the secondary purpose you mentioned – helping us to make employment decisions. However, the focus of the evaluation system is on extensive narrative feedback about the evidence evaluators have gathered from classroom observations, lesson plans, conversations, work samples, and other sources. This feedback, in my opinion, is much more useful as a formative tool than a tiered ranking system (I think it provides the “fine-grained information” you’re talking about). I’m not 100% sure that the SJUSD system is the best possible way to approach evaluation, but I really like that it focuses the evaluation more on growth than on rankings, and that it only uses categories when absolutely necessary. More delineated categories and useful formative feedback aren’t mutually exclusive, but I currently see little that more categories would add and a lot of problems that they might cause.

      Do you have any thoughts on the SJUSD system and/or is there anything you think I’m missing?

      Thanks again for the great comments.

      Ben

  5. Thank you for this
