GRADING THE TEACHERS
The Little-Known Statistician Who Taught Us to Measure Teachers
By Kevin Carey
May 19, 2017
Students enroll in a teacher’s classroom. Nine months later, they take a test. How much did the first event, the teaching, cause the second event, the test scores? Students have vastly different abilities and backgrounds. A great teacher could see lower test scores after being assigned unusually hard-to-teach kids. A mediocre teacher could see higher scores after getting a class of geniuses.
Thirty-five years ago, a statistician, William S. Sanders, offered an answer to that puzzle. It relied, unexpectedly, on statistical methods that were developed to understand animal breeding patterns.
Mr. Sanders died in March in his home state, Tennessee, at age 74, with his name little known outside education circles. But the teacher-assessment method he developed attracted a host of reformers and powerful lawmakers, leading to some of the most bitter conflicts in American education.
“In 1945, the United States government set off an atomic bomb.”
That’s how Mr. Sanders began telling me the story of his life, when we met several years ago.
He was raised on a small dairy farm and earned a doctorate in statistics and quantitative genetics from the University of Tennessee. At the time, Oak Ridge National Laboratory, near Knoxville, was studying the effects of radiation on living things.
Nuclear weapons tests had released clouds of radiation that had drifted with the weather. Sometime later, farm animals downwind began to die. Did the first event, a mushroom cloud, cause the second event, dead sheep? Or did one merely follow the other coincidentally? Solving this problem required expertise in both statistical probability and livestock biology. Oak Ridge hired Bill Sanders.
Then, in 1982, Mr. Sanders chanced upon a newspaper article about the latest controversy in K-12 education.
Tennessee’s governor, Lamar Alexander, who is now chairman of the Senate education committee, wanted to give more status and money to the best schoolteachers. That raised a thorny question: What, exactly, does “best” mean?
Mr. Sanders and a colleague sent Mr. Alexander a letter offering to help. Mr. Alexander ultimately chose not to use Mr. Sanders’s method, but eight years later, Mr. Sanders was summoned by Gov. Ned McWherter to make his case.
Tennessee, an early adopter in standardized testing, administered annual exams in five subjects. Those scores, Mr. Sanders said, could gauge the quality of the students’ teachers. Yet, he cautioned, a simple comparison of a student’s test scores with her scores a year before wasn’t good enough.
William S. Sanders, who sought to measure the “value-added” contributions of individual teachers.
Courtesy of SAS
Imagine two students. Both start the year at the same level in math, and both improve by 15 percent. But in previous years, the first student had been improving slowly, by 5 percent annually. For him, 15 percent is a big gain. But the second student had been improving by 30 percent per year. For her, 15 percent is a troublesome slowing down.
To fairly evaluate teachers, Mr. Sanders argued, the state needed to calculate an expected growth trajectory for each student in each subject, based on past test performance, then compare those predictions with their actual growth. Outside-of-school factors like talent, wealth and home life were thus baked into each student’s expected growth. Teachers whose students’ scores consistently grew more than expected were achieving unusually high levels of “value-added.” Those, Mr. Sanders declared, were the best teachers.
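The comparison of expected and actual growth can be illustrated with a toy calculation. This is a minimal sketch, not Mr. Sanders's actual statistical model: it assumes, purely for illustration, that a student's expected gain is the average of his or her past annual gains, and that a teacher's value-added is the average gap between actual and expected gains. The function names and the sample numbers (taken from the two-student example above) are hypothetical.

```python
from statistics import mean


def expected_gain(prior_gains):
    """Assumed predictor: the average of the student's past annual gains."""
    return mean(prior_gains)


def value_added(students):
    """Average of (actual gain - expected gain) across a teacher's students."""
    return mean(s["actual_gain"] - expected_gain(s["prior_gains"]) for s in students)


# The two students from the example: both gain 15 percent this year,
# but their histories (hypothetical three-year records) differ.
slow_starter = {"prior_gains": [5, 5, 5], "actual_gain": 15}
fast_starter = {"prior_gains": [30, 30, 30], "actual_gain": 15}

print(value_added([slow_starter]))  # 10: well above this student's trajectory
print(value_added([fast_starter]))  # -15: well below this student's trajectory
```

The same 15 percent gain counts in a teacher's favor for the first student and against the teacher for the second, which is the point of measuring growth against each student's own history rather than against a single benchmark.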
Crunching the numbers for millions of scores would require high-powered computers and a small team of statisticians. To his surprise, Mr. Sanders got all that from the state. From that point, Bill Sanders’s professional life was defined by teachers, tests and the increasingly fraught politics between them.
When he began calculating value-added scores en masse, he immediately saw that the ratings fell into a “normal” distribution, or bell curve. A small number of teachers had unusually bad results, a small number had unusually good results, and most were somewhere in the middle.
Then, as now, the vast majority of teacher salary schedules used only two factors: years of service and the number of advanced degrees. Personnel evaluation systems were essentially nonexistent, with nearly all teachers being rated “satisfactory” after a perfunctory review.
The value-added bell curve told a different story. First, it was wide. The effective teachers on one side were achieving much better results than the ineffective teachers on the other. Second, it didn’t support the tenure and credentials system. Other researchers began using methods similar to Mr. Sanders’s to compare different kinds of teachers.
Schools were collectively spending billions to give teachers with master’s degrees extra pay. Yet their value-added bell curve looked little different from the curve for teachers without those degrees. Nor did effectiveness grow in lock step with years of service.
People had always known there were great and not-so-great teachers. But they had never been able to quantify the difference. The Sanders idea opened up new vistas of public policy — and created some of the most hard-fought political battles of the age.
Education reformers looked at the left-hand side of the bell curve, where the ineffective teachers were, and thought, “What if we could take them out of the system?” They pushed to change tenure systems that made teachers hard to fire.
Reformers also looked at the right-hand side of the bell curve, where the effective teachers were, and thought, “What if we could have a lot more of those?” They pushed for merit pay systems that would give raises to teachers with good value-added scores, to aid retention and recruitment.
The release of value-added data, as well as policies based upon them, was fiercely opposed by teachers’ unions. When Michelle Rhee, then superintendent of schools in Washington, D.C., decided to base teacher tenure and salaries in part on value-added scores, the American Federation of Teachers spent over a million dollars to unseat Ms. Rhee’s boss, Mayor Adrian Fenty. In New York, the United Federation of Teachers used the scores as a rallying cry against Mayor Michael Bloomberg.
In January 2010, President Obama visited Graham Road Elementary School in Falls Church, Va., to discuss the “Race to the Top” education initiative.
Stephen Crowley/The New York Times
Mr. Sanders generally stayed out of these arguments — he opposed releasing individual ratings publicly — but he was still scorned as a mysterious guru without proper education credentials. It didn’t help that he made no apologies for the fact that his methods were too complex for most of the teachers whose jobs depended on them to understand.
Controversies also erupted on the national stage. Teacher-centered reforms had tended to revolve around class-size ratios, broad-based salary increases and other policies that, implicitly, saw teachers as interchangeable.
Value-added results suggested that individual teachers could be the primary driver of student improvement — but only the good teachers. The research convinced Bill Gates to spend hundreds of millions of dollars on measuring and improving teacher effectiveness. After Barack Obama’s election in 2008, key advisers used the research to make teacher evaluation a cornerstone of the “Race to the Top” program that gave states economic stimulus funds in exchange for adopting a menu of education reforms.
The policy quickly became a flash point. The Obama administration wanted a substantial portion of each teacher’s rating to be based on “student growth,” which everyone understood to mean some form of value-added results. The unions wanted test scores to matter much less. The Common Core standardized tests, already disliked by opponents of federal power on the right, also gained critics on the left, who objected to their use in evaluating teachers.
The controversies put value-added methods under intense scrutiny. Critics rightly pointed out that the ratings were only as good as the tests themselves, which varied widely in quality. Many educators teach in subjects or grades in which annual testing isn’t required, making value-added scores impossible.
That’s why evaluation systems in Washington, D.C., and elsewhere ultimately leaned more heavily on structured, in-person observations of teacher practice. Unlike value-added ratings, observations can provide diagnosis along with evaluation, showing teachers not just how they’re doing, but how to improve.
The American Statistical Association issued a statement urging caution in using value-added measures for “high-stakes” decisions, in part because scores for individual teachers can change significantly from year to year. But this variance exists in part because teachers are sometimes much more effective with one group of students in one year than with another group the next.
Up until his death, Mr. Sanders never tired of pointing out that none of the critiques refuted the central insight of the value-added bell curve: Some teachers are much better than others, for reasons that conventional measures can’t explain. His system is still used in Tennessee today. In the last dozen years or so, the state’s scores on federal N.A.E.P. exams have improved faster than those of the average state.
His data were, he believed, inherently pro-teacher. Kati Haycock, founder of the education civil rights group the Education Trust, says that Mr. Sanders’s work revealed that teacher effectiveness “makes a huge difference in the trajectories and life chances of different kids.”
While the use of value-added ratings to hire, fire and pay teachers may have been limited by political pressure, the importance of the value-added bell curve itself continues to grow — less like a sudden explosion than a chime whose resonance gains in power over time. The questions that occupy lawmakers and administrators today are not whether to identify the most and least effective teachers, but how.
Because a Tennessee farmer turned statistician decided to write a letter to his governor, nobody will ever see the American teaching profession the same way again.
Kevin Carey directs the education policy program at New America. You can follow him on Twitter at @kevincarey1.
A version of this article appears in print on May 21, 2017, on Page SR5 of the New York edition with the headline: The Man Who Measured Teachers.