Tests seem so reasonable at first -- teachers teach, students learn, and demonstrate mastery by passing a test. But as Daniel Koretz says at the start of his 2008 book, Measuring Up: What Educational Testing Really Tells Us, “Achievement testing is a very complex enterprise, and as a result, test scores are widely misunderstood and misused.” Now that is what I call an understatement. Furthermore, despite Common Core claims that better standards and tests mean fewer reasons for concern about their misuse, as Vito Perrone of Harvard University pointed out, “Most items on these various standardized tests remain well within the longstanding technology of testing, primarily to support the mechanical scoring procedures. They still seem to be limited instruments with too much influence” (1999, p. 152).
The testing “enterprise” is poised for a warp drive record-breaker of misuse insanity. In a nutshell, here’s how they plan to connect the dots.
A tiny fraction of what a student knows and can do is hypothetically captured, with some modicum of so-called scientific accuracy, by converting the number of correct answers out of the total number of questions on a standardized test into a raw score. Keep in mind that this single raw score is still prone to error as a measure of what the student knows: the student may have guessed, correctly or not, or the performance may reflect other contextual factors such as illness, distraction, or nerves. The test itself is also imperfect by design and is likely biased in some ways.
That raw score then goes through some psychometric process: it may be normed to a scale that compares it to other test scores, ranked somewhere between unacceptable and excellent based on someone's judgment of what students should know and be able to do, or both. This is where all hell breaks loose, as that converted score gets used.
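To make the mechanics concrete, here is a minimal sketch of the conversion chain described above: correct answers become a raw score, which then gets a percentile rank against a norming sample and a judgment-based label. Every number here (the answer key, the norming distribution, the cut score) is invented for illustration; real testing programs use far more elaborate scaling, which is exactly why the resulting single number carries more authority than it deserves.

```python
# Hypothetical sketch: how one raw score becomes a reported result.
# All data below are invented for illustration only.
from bisect import bisect_left

def raw_score(answers, key):
    """Count correct answers -- the only thing the test directly observes."""
    return sum(a == k for a, k in zip(answers, key))

def percentile_rank(score, norming_sample):
    """Place one score against a (hypothetical) norming sample."""
    ordered = sorted(norming_sample)
    below = bisect_left(ordered, score)        # scores strictly below this one
    return round(100 * below / len(ordered))

def performance_label(score, cut_score=30):
    """Someone's judgment of what 'meets standard' means, reduced to one cut score."""
    return "meets standard" if score >= cut_score else "below standard"

key = ["b", "c", "a", "d"] * 10                # a 40-item test (invented)
answers = ["b", "c", "a", "d"] * 9 + ["z"] * 4 # 36 right, 4 wrong
score = raw_score(answers, key)
sample = list(range(10, 41))                   # invented norming distribution
print(score, percentile_rank(score, sample), performance_label(score))
```

Note that nothing in this chain records *why* an answer was wrong -- a guess, a nerve-wracked morning, and a genuine gap in knowledge all collapse into the same number.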
How might it get used? For one, to tell students and their parents or guardians how "well" they did, which can involve labeling the converted score with a percentile rank, a grade-level equivalent, or simply a descriptive label such as "meets standard." More likely, though, it will be used in a "high stakes" way: to assign students to special education, to hold them back a year, or to track them into homogeneous groups.
The most pernicious use is to group the scores to make claims about the quality of individual teachers. From there, it’s easy to see how tempting it is to make a claim about the quality of a school, and then a whole district. While we’re at it, let’s compare counties, states, regions, countries.
The cold hard truth, in Koretz’s words, is this:
Scores on a single test are now routinely used as if they were a comprehensive summary of what students know or what schools produce (pp. 44-45).
He goes on later to add:
Simply attributing differences in scores to school quality or, similarly, simply assuming that scores themselves are sufficient to reveal educational effectiveness, is unrealistic. And more generally, simple explanations of performance differences are usually naïve. All of this is established science (p. 142).
Things get really tricky when hierarchical linear modeling kicks in to provide a "value-added" way to compare actual scores to a prediction and to use the difference to rate teachers' effectiveness. Ignoring warnings from experts, policymakers have misused these value-added models, or VAMs, making them weigh heavily in the annual evaluation of teachers. Carol Burris, an outspoken principal who opposes this misuse of standardized test scores, recently wrote of a teacher's lawsuit filed in New York State by my friend, Sheri Lederman, who hopes her case can become "a tipping point" in bringing this damaging, unreliable practice to a grinding halt.
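The value-added idea can be sketched in a few lines: fit a prediction of this year's score from last year's across a large group, then treat the gap between each class's actual and predicted scores as the "teacher effect." The sketch below uses invented numbers and a single-predictor least-squares fit; real VAMs are far more elaborate hierarchical models, but the core move, and the core problem, is the same: everything the model cannot explain gets attributed to the teacher.

```python
# Hypothetical sketch of the value-added logic. All scores are invented,
# and real VAMs use hierarchical models with many predictors -- but the
# residual-becomes-"teacher effect" step is the heart of the method.
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for one predictor: y is approximated by a + b*x."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

# Step 1: fit the prediction on a wide (invented) sample of students.
district_prior  = [40, 50, 55, 60, 65, 70, 75, 80]   # last year's scores
district_actual = [45, 52, 54, 63, 64, 73, 74, 83]   # this year's scores
a, b = fit_line(district_prior, district_actual)

# Step 2: predict one class's scores and average the leftover gaps.
class_prior  = [50, 60, 70, 80]
class_actual = [55, 58, 75, 82]
predicted = [a + b * p for p in class_prior]
value_added = mean(act - pred for act, pred in zip(class_actual, predicted))
print(round(value_added, 2))   # the residual, now rebranded as a teacher rating
```

Notice that `value_added` absorbs everything the simple prediction misses -- guessing, illness, class composition, measurement error -- yet it is reported as if it isolated one teacher's contribution.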
That may be wishful thinking, because now the dots are being connected to the colleges and universities that educate teachers. They too are to be evaluated and ranked based on their candidates' performance on the standardized tests required for teacher certification, of which there can be more than four in some cases. New federal regulations, currently open for public comment until February 2nd, would require these institutions of higher education to also track their teacher graduates and collect their annual evaluation ratings, including the VAM measure, in order to be considered eligible for the TEACH grant program. (I have previously written of how similar perverse incentives plague the new CAEP accreditation standards for these institutions.)
Here’s a test question for Arne Duncan, our Secretary of Education:
TRUE OR FALSE?
“A program’s ability to train future teachers who produce positive results in student learning [as measured by standardized testing] is a clear and important standard of teacher preparation program quality.” (from p. 63 in proposed regulations document)
Here’s a hint, provided by Benjamin Campbell of Richmond, Virginia, on the federal register of comments: “Current research indicates that no more than 14% -- and often far less -- of a student’s learning as measured by standard tests -- the only standardized measure -- can be attributed to the teacher.”
The bad news is that Arne Duncan, and a whole slew of politicians and policymakers in line behind him, think the correct answer to this question is TRUE. They actually believe harsh punitive consequences work and lead to improvement. They think closing schools and teacher education programs is a good idea. They don’t care if any of their plans are based on faulty data, junk science, or illogical statistics. They blithely ignore extant research, recommendations from experts, and, to put it bluntly, common sense. The question remains -- what are we going to do about it?
As Captain Jean-Luc Picard would say, “Engage.”