Roughly two decades ago, education policy makers in the United States began to rely more heavily on standardized test scores as performance metrics for teachers and schools. During the late 1980s and through the 1990s, many states adopted test-based accountability systems that spelled out rewards and sanctions for teachers and principals as a function of the performance of their students on standardized tests, and when the federal government adopted the No Child Left Behind Act (NCLB) of 2001, test-based accountability became a nation-wide policy. The proponents of this development cite the need to bring "business practices" into public schools and the need to make schools "data driven" in ways that mirror the practices of private-sector companies.
It is ironic that, during this same period, economists who study the design of jobs and the structure of incentives within private firms began to take more seriously the task of explaining why firms rarely attach performance pay to objective measures of output. Most incentive pay takes the form of raises, promotions, or bonuses related to subjective evaluations of a broad range of qualitative and quantitative information, and Holmstrom and Milgrom (1991) argue that, in many instances, firms find it optimal to pay workers a fixed base wage and monitor their allocation of effort among tasks even if the firm has access to "good" performance measures in a statistical sense. Their key insight is that jobs may involve multiple tasks, and as a result, incentive pay based on any given performance measure can easily lead to undesirable distortions in the amount of effort allocated to various tasks, even if the performance measure is highly correlated with total output. They discuss "teaching to the test" in response to test-based accountability systems as an example of such a distortion, and in recent years, a significant literature has explored the extent to which test-based accountability systems actually create increases in subject mastery or only increases in measured performance on a specific type of exam.
In recent work,¹ we explore a different effect of test-based accountability systems on the allocation of teacher effort, and we find evidence consistent with the hypothesis that test-based accountability systems not only shape decisions of teachers concerning what to teach but also whom to teach. We show that even though advocates of NCLB offered it as a remedy for disadvantaged children who receive poor service from their public schools, the design of NCLB almost guarantees that the most academically disadvantaged children will not benefit from its implementation and may actually be harmed.
Under NCLB, schools are judged against a standard called Adequate Yearly Progress (AYP). NCLB requires all students in certain grade levels to take standardized tests, and each state education agency selects or constructs its own tests and defines the scores required for proficiency in various subjects. The law requires states to set yearly targets concerning the fraction of students in a school that must be proficient, and these AYP targets must increase at regular intervals, ultimately reaching 100% by the year 2014. The current AYP targets for proficiency rates are far below 100% in most states, and the 2014 goal of universal proficiency appears to be one of the least credible parts of the legislation in terms of the expectations of teachers and principals. Thus, the most expedient strategy for many schools is to devote extra attention to students whose recent test results place them near the proficiency standards in their state.
NCLB provides almost no incentive to devote extra attention to students who are far below grade level. Existing work on educational production functions suggests that it is somewhere between extremely costly and impossible to bring a child up several grade levels in a short period of time, and educators know that many of their students will soon be in another school, either because of geographic mobility or the natural progressions from elementary school to middle school or middle school to high school. Further, NCLB provides weak incentives to devote extra resources to the most advanced students because they are going to be proficient regardless.
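The incentive described above can be illustrated with a small Python sketch. It is only a stylized example, not the AYP formula of any state: the scores, the cutoff of 60, and the 5-point boost are hypothetical values chosen for illustration, and the function names are our own. The sketch computes how much a school's proficiency rate rises when one student's score improves by a fixed amount.

```python
def proficiency_rate(scores, cutoff):
    """Fraction of students scoring at or above the proficiency cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)


def rate_gain_from_boost(scores, cutoff, index, boost):
    """Change in the proficiency rate if the student at `index` gains `boost` points."""
    boosted = list(scores)
    boosted[index] += boost
    return proficiency_rate(boosted, cutoff) - proficiency_rate(scores, cutoff)


# Hypothetical data: five students, from far below the cutoff to far above it.
scores = [35, 48, 58, 62, 90]
cutoff = 60   # assumed proficiency cutoff
boost = 5     # assumed score gain from extra instruction

gains = [rate_gain_from_boost(scores, cutoff, i, boost) for i in range(len(scores))]
# Only the student just below the cutoff changes the school's measured performance.
moves_the_index = [g > 0 for g in gains]
print(moves_the_index)
```

In this stylized setting, the same 5-point gain raises the proficiency rate only when it goes to the student who starts just below the cutoff; identical gains for the lowest and highest scorers leave the index unchanged.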
To measure the impact of NCLB on student test scores, we compare the achievement of fifth graders in Chicago who took exams following NCLB with the achievement of comparable fifth graders who took exams just prior to NCLB. For both cohorts, we are able to condition on a set of third-grade scores from tests administered prior to NCLB. We find noteworthy increases in reading and math scores among students in the middle of the achievement distribution, but we also find that the least academically advantaged students in Chicago scored the same or worse following NCLB than one would have expected given the pre-NCLB relationships between third- and fifth-grade scores. We find weak evidence of systematic gains from accountability among high achievers. While our empirical results provide strong circumstantial evidence that teachers respond to proficiency count systems by shifting attention to students near the proficiency standard, a growing ethnographic literature provides more direct evidence. Teachers report that school principals explicitly tell them to concentrate their efforts on the so-called “bubble” students near the proficiency threshold, and some schools use rather sophisticated software to identify groups of students whose proficiency status is most likely to improve given extra instruction.
Our results highlight the breadth of concerns that education policy makers must confront when designing accountability systems built around test scores. Accountability proponents responded to previous concerns about schools tailoring instruction to specific assessments by arguing that policy makers simply need to develop tests that are "worth teaching to," but our results show that, even with perfect assessments, the task of mapping a matrix of scores for a set of students into an overall performance index for the teachers is a daunting design challenge. The current NCLB system shifts effort to students near their state's proficiency level for rather transparent reasons. However, it is difficult to imagine how one could design a set of exams, scales measuring exam performance, and rules for aggregating scores into a performance index that would not provide incentives ex post to allocate relatively more attention to some types of students than others.
In addition, test-based accountability systems are likely to shape the equilibrium assignment of teachers to students. Because states make AYP calculations based on the level of student performance rather than improvements in student performance, teachers in schools that educate primarily advantaged students in homogeneous communities often face relatively little pressure under NCLB while schools that educate large numbers of academically disadvantaged students face the constant threat of failure even if their students are performing well given their baseline skill levels at school entry. There is only suggestive evidence at this point, but it seems quite likely that this feature of NCLB adversely affects the willingness of teachers to teach in disadvantaged schools. NCLB includes the placement of highly qualified teachers in every classroom as an explicit objective, and the AYP system does create pressure to hire and retain teachers based on performance. Nonetheless, AYP's reliance on levels of student performance and not improvements in student performance reduces the number of quality teachers who are willing to work in disadvantaged schools.
Recently, the federal Department of Education allowed a few states to calculate AYP based, in part, on growth in student achievement during a school year rather than levels only. This is a step in the right direction, but without careful design work, these systems may simply create a different set of unintended effort distortions among teachers. Knowledge does not come on a natural scale, and given any particular scale, gains of a given size may be easier to achieve at some points on the scale than others. Thus, something as apparently pedestrian as the scaling of exams could have significant and unintended consequences for the allocation of teacher effort to different types of students if states do not carefully design value-added versions of AYP.
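The point about scaling can be made concrete with a short Python sketch. It is a stylized illustration under assumed numbers, not a description of any state's growth model: the raw scores, the two rescalings, and the function name are hypothetical. Two students make the same 10-point gain on a raw scale, and the sketch measures those gains after two different order-preserving rescalings of the exam.

```python
import math


def gain(before, after, scale):
    """Measured gain after applying a monotone rescaling `scale` to both scores."""
    return scale(after) - scale(before)


# Hypothetical raw scores: each student improves by 10 raw points.
low_before, low_after = 20, 30     # low-achieving student
high_before, high_after = 70, 80   # high-achieving student

# Two assumed rescalings; both preserve the ranking of students.
concave = math.sqrt                # compresses the top of the scale


def convex(x):
    """Stretches the top of the scale."""
    return x ** 2 / 100


low_gain_concave = gain(low_before, low_after, concave)
high_gain_concave = gain(high_before, high_after, concave)
low_gain_convex = gain(low_before, low_after, convex)
high_gain_convex = gain(high_before, high_after, convex)

print(low_gain_concave > high_gain_concave)  # low achiever shows the larger gain
print(low_gain_convex < high_gain_convex)    # high achiever shows the larger gain
```

Both rescalings rank students identically, yet a growth-based index rewards gains among low achievers more under the first and gains among high achievers more under the second, which is one way the choice of scale could tilt teacher effort.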
Because teachers are charged with fostering knowledge, character, and other things that are hard to measure, it is not obvious that incentive systems built around objective performance measures are even desirable strategies for monitoring teachers. Test-based accountability systems have nonetheless enjoyed strong support because school principals and others who monitor the performance of teachers in public schools are seen as agents of large bureaucracies that, especially in cities, have a long record of disappointing results. Even so, our empirical results and the insights gained from research on the economics of organizations suggest that policy makers must tackle difficult design questions in order to construct accountability systems that deliver quality instruction for all students regardless of their aptitude and prior achievement. Policy makers should either take these design issues more seriously or follow the lead of many private-sector firms and look for other ways to monitor and motivate teachers.