By Jonathan Supovitz
Test-based accountability — the use of tests to hold individuals or institutions responsible for performance through the application of rewards and/or sanctions — has become the cornerstone of U.S. federal education policy, and the past decade has witnessed a widespread adoption of test-based accountability systems in states throughout the country.
Consider just one material manifestation of this burgeoning trend: test sales have grown in constant dollars from approximately $260 million annually in 1997 to approximately $700 million today — nearly a threefold increase. What influence has our substantial investment in testing and test-based accountability policy had on the behavior and performance of American educators and students? Is high-stakes testing a substantive reform, or an intervention that merely reveals shortcomings in the system but does little to actually improve teaching and learning? This article summarizes a review of the evidence from the past decade of test-based reform (Supovitz, 2009).
Why We Test
Four major theories underlie our current reliance on high-stakes tests: motivational theory, which argues that test-based accountability can motivate improvement; the theory of alignment, which contends that test-based accountability can spur alignment of major components of the educational system; information theory, which posits that such systems provide information that can be used to guide improvement; and symbolism, which maintains that testing systems signal important values to stakeholders.
Motivational theory is the predominant theory underlying test-based accountability. According to this concept, the extrinsic rewards and sanctions associated with the high-stakes test serve to motivate educators across the system to improve their performance. This presumes that lack of will, not lack of capacity, stands in the way of system-wide improvement.
The theory of alignment holds that system-wide improvement is most likely to occur if educational leaders align the major components of the educational system that surrounds schools (standards, curriculum, and assessments) so that they mutually reinforce each other. Alignment is usually thought of in terms of synchronizing the surrounding system, but it can also be thought of as coherence between the external accountability system for schools and schools' own sense of internal accountability (Ablemann et al. 2004).

Information theory maintains that student performance data are useful to teachers and administrators in making decisions about students and programs, and that providing such data to local educators, along with incentives to improve their performance, will guide classroom and organizational decision-making.

Symbolism theory has also contributed to the growth and prevalence of high-stakes testing. In this model, the accountability system is seen to signal important values to stakeholders and, in particular, the public (Airasian 1988). This theory is manifested in the notion of "public answerability" — the idea that the public has a right to expect that its resources will be used responsibly and that public institutions are accountable to the public. High-stakes assessment results thus serve as evidence that public education is, in essence, responsible and rigorous.
The Movement Toward Measuring Outcomes
Over the past 20 years, the nation has shifted from the tracking of educational inputs (e.g., per pupil expenditures, teacher salaries, class size, required courses, seat time) as indicators of educational performance to an increased emphasis on testing as a means to hold schools accountable for educational outcomes. Standardized test results have become the primary indicator of school and student performance, with public reporting, monetary or nonmonetary rewards, a range of interventions for low-performing schools, and even state takeover as consequences for the quality of performance (Elmore et al. 1996; Fuhrman and Elmore 2004).

By the early 1990s, standardized, multiple-choice high-stakes testing came under siege from many constituencies for containing gender bias, ethnic prejudice, and socioeconomic favoritism. Critics bemoaned the narrowing of curriculum and instruction and the perverse incentives inherent within high-stakes testing to retain and reclassify students. Many maintained that multiple-choice testing, with its emphasis on recall of isolated bits of knowledge, represented an outdated behaviorist view of learning. Moreover, research confirmed many of these critiques (see, for example, Garcia and Pearson 1994; Supovitz and Brennan 1997; Hamilton 2003).

To address these problems, educators introduced a bevy of alternative forms of assessment (e.g., portfolios, performance assessments, and open-ended tasks). Advocates saw these as more valid measures of student performance and as potential catalysts for school reform. As several states and national organizations began incorporating alternative forms of assessment into their test-based accountability systems, researchers examined the influence of these new forms of assessment on policy and practice. Findings on the potential for alternative assessments to deliver richer, less biased measures of student performance, however, were mixed.
Scoring reliability was found to be high in science performance assessments (Baxter et al. 1992) but low in portfolio assessments (Koretz et al. 1994). Portfolio assessments were found to reduce racial/ethnic gaps in performance but to exacerbate gender differences (Supovitz and Brennan 1997), and performance task content continued to produce gender-related biases (Jovanovic et al. 1994). Alternative assessments were also found to be cost prohibitive. For example, the cost of large-scale science performance assessments in California was estimated to be 20 to 60 times higher than standardized multiple-choice assessments for an equally reliable score (Stecher and Klein 1997). Research also indicated that teachers continued to organize instruction around the timing of high-stakes assessments, regardless of their format (Borko and Elliott 1999). Performance assessments influenced curricular activities and assessment practices (Lane et al. 1999), yet teachers still prepared students for the test rather than for the larger learning goals of the curriculum (Stecher and Barron 1999). While this collective research dampened the enthusiasm for alternative assessments, some elements were incorporated into the dominant forms of high-stakes testing, including open-ended writing and performance tasks.
The NCLB Era
In 2001, test-based accountability was incorporated into the No Child Left Behind Act (NCLB), a major federal reform intended to bring about widespread improvements in student performance and reduce inequities between ethnic groups and other traditionally under-served populations. NCLB required states to adopt test-based accountability systems, testing annually in reading, math, and eventually science from grades 3 through 8 and one year of high school. States were to define proficiency and adequate yearly progress (AYP) to get all students to proficiency in 12 years. Schools that failed to make AYP for two consecutive years would be identified for improvement, and students from those schools would have the right to transfer to another public school. The legislation also required measurable objectives for sub-groups of students and for states to certify teachers as highly qualified.

Studies and analyses of NCLB have begun to emerge. A four-year analysis conducted by the Center on Education Policy (Renter et al. 2006) surveyed state policymakers, district administrators, and schools. Many of the study's respondents credited NCLB for rising student performance. At the same time, they indicated a narrowing of curriculum with a focus on reading and math that has reduced instructional time for other subjects, a shift reflecting efforts to align curriculum and instruction with assessments.
Likewise, a synthesis of pre- and post-NCLB literature (Herman 2004) concluded that accountability attracts teachers' attention to the extent that teachers are more influenced by testing than by the standards themselves. In practice, Herman found, test preparation merges with instruction, with a concurrent de-emphasis of non-tested content.

Some evidence suggests improvements in national performance associated with test-based accountability (Hanushek and Raymond 2004; Kober et al. 2008). But credit for these gains has also been given to school district policies and programs (Renter et al. 2006). Thus, while performance is improving, the contribution of high-stakes testing remains unclear.
Still other research has explored ways to use data from high-stakes tests to improve instruction—and again the verdict on the value of these assessments is mixed. These studies have typically found that the data provided by high-stakes exams contain general information about student performance but lack the nuance to provide fine-grained instructional guidance (Supovitz and Klein 2003). In response to such analyses, many districts have moved to more frequent quarterly or benchmark assessments (Goertz, Nabors Oláh, and Riggan 2009), but these instruments may suffer from the same problems.
What Have We Learned?
So, what has the past decade of testing policy taught us about the likelihood of improving the education system through high-stakes testing? Let’s review the effects according to the four theories of testing policy described earlier:
- High-stakes testing does motivate educators, but responses are often superficial. In the best cases, high-stakes testing has focused instruction toward important and developmentally appropriate literacy and numeracy skills—but at the expense of a narrower curricular experience for students and a steadier diet of test preparation activities in classrooms, particularly in low-performing schools, which are the targets of test-based accountability.
- Test-based accountability fosters alignment of the central components of the educational system. The evidence does indicate that high-stakes testing encourages educators to align curriculum, standards, and assessments. However, it is uncertain whether narrower exams are being aligned to more comprehensive standards, or vice versa.
- High-stakes testing regimes have limits as information tools. The data from high-stakes tests are useful to policymakers for assessing school- and system-level performance but insufficient for individual-level accountability, and they provide meager information to teachers for instructional guidance.
- Test-based accountability is an appealing political strategy. High-stakes testing answers a real need for policymakers to demonstrate to the public that they are spending tax dollars judiciously.
In sum, the evidence indicates that high-stakes assessments foster alignment in the system and provide an important representation of policymakers' commitment to improving the public education system. At the classroom level, high-stakes tests motivate teachers to change their practices, but these changes tend to be superficial adjustments in content coverage and test preparation rather than deeper improvements in instruction. In addition, the data provided by annual assessments are of limited utility to teachers.
Where Do We Go From Here?
Over the past couple of decades, test-based reform in the United States has gone through two major cycles: first, a widespread exploration into a variety of alternative assessment forms, then an increased emphasis on annual testing and state test performance as the authoritative indicator of the quality of schools and districts. Despite their substantive differences, these two cycles were each born of the best intentions: a desire to raise the performance of students and to redress inequalities in the educational system. But these disparities are driven by our social priorities, not our educational system, and they will not be remedied by rewriting the tests.
Our experiences with high-stakes testing should teach us about its serious limitations. Rather than investing in substantial efforts to improve teaching and learning, we have developed a system that relies on summative testing as the cure for what ails us. While these tests certainly have a place in our efforts to improve our schools, their inflated role is indicative of our lack of will to enact deeper, more substantive reforms. In short, we have been effective in setting goals and motivating educators through high-stakes testing, but we have done so without building their capacity to achieve the goals for which we are holding them accountable.
In the next decade, developments in two test-related areas may well contribute to how much substantive progress we make. First, teachers need better testing tools and the capacity to utilize them effectively. Increasingly, we have the expertise to embed valuable information about patterns of student understanding into assessments that can be used by teachers for real instructional guidance. With better information about student subject matter (mis)conceptions and problem-solving strategies, and the skills to act on that information, teachers can hone their instructional responses to improve students’ understanding.
Second, we need to reform the reform. We must find a way to assimilate short-, medium-, and long-cycle assessments into a more coherent system that takes advantage of the strengths of each and ameliorates the undue influence that a single high-stakes assessment carries. A more robust assessment system might begin in the schools with more formative assessments, continue with a set of curriculum-specific interim assessments that act like guideposts, and culminate in a summative annual assessment. Such an aligned assessment system would reduce our emphasis on, and attention to, annual high-stakes assessments.
While technological advances make the integration and standardization of such concepts more feasible than ever before, the political challenges are considerable. But if we are serious about improving the education our youth receive, we must relegate high-stakes accountability to its proper place as a measurement and incentive companion to deeper instructional reforms.
Jonathan Supovitz is an associate professor at Penn GSE and a senior researcher for the Consortium for Policy Research in Education.
Ablemann, C.H., Elmore, R.F., Even, J., Kenyon, S., & Marshall, J. (2004). When accountability knocks, will anyone answer? In R.F. Elmore (Ed.), School Reform from the Inside Out, 133-199. Cambridge, MA: Harvard Education Press.
Airasian, P. W. (1988). Symbolic validation: The case of state-mandated, high stakes testing. Educational Evaluation and Policy Analysis, 10(4), 301-313.
Baxter, G.P., Shavelson, R.J., Goldman, S.R., & Pine, J. (1992). Evaluation of procedure-based scoring on hands-on science assessment. Journal of Educational Measurement, 29(1), 1-17.
Borko, H., & Elliott, R. (1999). Hands-on pedagogy versus hands-off accountability: Tensions between competing commitments for exemplary mathematics teachers in Kentucky. Phi Delta Kappan, 80(5), 394-400.
Elmore, R.F., Ablemann, C.H., & Fuhrman, S.H. (1996). The new accountability in state education reform: From process to performance. In H.F. Ladd (Ed.), Holding Schools Accountable: Performance-based Reform in Education, 65-98. Washington, DC: Brookings Institution.
Fuhrman, S.H., & Elmore, R.F. (2004). Introduction. In S.H. Fuhrman and R.F. Elmore (Eds.), Redesigning Accountability Systems for Education, 3-14. New York: Teachers College Press.
Garcia, G.E., & Pearson, P.D. (1994). Assessment and diversity. In L. Darling-Hammond (Ed.), Review of Research in Education, 20, 337-391. Washington DC: American Educational Research Association.
Goertz, M.E., Nabors Oláh, L., & Riggan, M. (2009). Can Interim Assessments Be Used for Instructional Change? (CPRE Policy Brief RB-51). Philadelphia, PA: Consortium for Policy Research in Education.
Hamilton, L. (2003). Assessment as a policy tool. Review of Research in Education, 27, 25-68.
Hanushek, E. A., & Raymond, M. E. (2004). Does school accountability lead to improved student performance? National Bureau of Economic Research Working Paper No. W10591. Cambridge, MA: Author.
Herman, J. L. (2004). The effects of testing on instruction. In S.H. Fuhrman and R.F. Elmore (Eds.), Redesigning Accountability Systems for Education, 141-166. New York: Teachers College Press.
Jovanovic, J., Solano-Flores, G., & Shavelson, R. J. (1994). Performance-based assessments: Will gender differences in science be eliminated? Education and Urban Society, 26(4), 352-366.
Kober, N., Chudowsky, N., & Chudowsky, V. (2008). Has Student Achievement Increased Since 2002? State Test Score Trends through 2006–07. Washington, DC: Center on Education Policy.
Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (1994). The Vermont portfolio assessment program: Findings and implications. Educational Measurement: Issues and Practice, 13(3), 5-16.
Lane, S., Ventrice, J., Cerillo, T.L., Parke, C.S., & Stone, C.A. (1999). Impact of the Maryland School Performance Assessment Program (MSPAP): Evidence from the principal, teacher, and student questionnaires (reading, writing, and science). Paper presented at the Annual Meeting of the National Council on Measurement in Education, Montreal, Canada.
Renter, D.S., Scott, C., Kober, N., Chudowsky, N., Chudowsky, V., Joftus, S., & Zabala, D. (2006). From the Capital to the Classroom: Year 4 of the No Child Left Behind Act. Washington, DC: Center on Education Policy.
Stecher, B., & Barron, S. (1999). Test-based accountability: The perverse consequences of milepost testing. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.
Stecher, B.M., & Klein, S.P. (1997). The cost of science performance assessments in large-scale testing programs. Educational Evaluation and Policy Analysis, 19(1), 1-14.
Supovitz, J.A. (2009). Can high stakes testing leverage educational improvement? Prospects from the last decade of testing and accountability reform. Journal of Educational Change, 10(2), 211-227.
Supovitz, J.A., & Brennan, R.T. (1997). Mirror, mirror on the wall, which is the fairest test of all? An examination of the equitability of portfolio assessment relative to standardized tests. Harvard Educational Review, 67(3), 472-506.
Supovitz, J.A., & Klein, V. (2003). Mapping a course for improved student learning: How innovative schools systematically use student performance data to guide improvement. Philadelphia, PA: Consortium for Policy Research in Education.