Dissecting Common Core Assessment Myths and Realities

[From Fairtest.org]


A new fact sheet shows that the Common Core Assessments, which are being rolled out for widespread implementation in the 2014-2015 school year, are not significantly different from the standardized exams currently administered in many states. At the same time, plans call for more high-stakes tests with even greater costs.

“Despite proponents’ claims that the Common Core would lead to a new breed of assessments that focus on higher-order, critical thinking skills, the planned tests are predominantly the same-old multiple-choice questions,” explained Dr. Monty Neill, Executive Director of the National Center for Fair & Open Testing (FairTest).

Dr. Neill continued, “Rather than ending ‘No Child Left Behind’ testing overkill, the Common Core will flood classrooms with even more standardized exams. Their scores will continue to be misused to make high-stakes educational decisions, including high school graduation. They will also end up costing taxpayers millions more for new tests and the computer systems required to deliver them.”

The FairTest fact sheet also challenges the notion that harder tests are automatically better. It states, “If a child struggles to clear the high bar at five feet, she will not become a ‘world class’ jumper because someone raised the bar to six feet and yelled ‘jump higher,’ or if her ‘poor’ performance is used to punish her coach.” Scores recently plummeted in New York State and Kentucky where Common Core tests were initially administered.

Based on its analysis, FairTest is calling for an indefinite moratorium on the Common Core tests. “As the prestigious Gordon Commission of educational experts recently concluded, these exams are not the better assessments our schools need,” Dr. Neill said. “Instead, a system of classroom-based performance assessments, evaluations of student work portfolios, and school quality reviews will help improve learning and teaching.”

Assessment in California Teacher Education

(This column is adapted from a talk I gave at the University of Kyoto in January, 2012)

Those in the field of assessment often refer to two important standards that assessments are expected to meet: reliability and validity. Reliability means that the same results would be obtained if the assessment were given again, or if a different person were scoring the assessment.

Validity means that the assessment actually measures, assesses, what it claims to be measuring/assessing—and whether it predicts how one will perform in the future (Ormrod, 2005).

One type of validity is “face validity”: it is accepted that the assessment actually does measure what it claims to measure, without needing statistical proof that it does. The road test portion of the driving test might be an example of that. We can easily agree that if we want to know whether someone knows how to drive, we can sit in a car with them and watch them drive. Now, what constitutes good enough driving to pass the test is where things might get more difficult to agree on. Both how good is good enough, and which things matter most (e.g., how well the student parked, used turn signals, and obeyed signs, and how much each should count), can be controversial.

Other tests need to have their validity demonstrated. The paper-and-pencil portion of the driving test might be one of those. Do we have any evidence that those who do better on the written portion are actually better drivers?

However, while we accept that the road test has more face validity, we might wonder about its reliability, given its possibly subjective nature. The written portion is more reliable: you either filled in the correct bubble or answer, or you did not. On the driving portion, however, maybe the traffic conditions were more difficult when you took it than when your friend did, or maybe one instructor is a tougher grader than another. Maybe he had a fight with his spouse that morning! Despite these shortcomings, we accept the trade-offs as worth the advantages of such an authentic assessment. A built-in safeguard is the opportunity to retake the test a second time, a third time, or as many times as needed.

It is easy to create paper and pencil assessments that are reliable and easy to administer. However, how one’s score correlates to real life application of the knowledge or skill that the assessment is designed to measure is harder to determine. Some, such as myself, argue that there is a built-in tension between reliability and authenticity. Real life tasks and situations are by their nature not standardizable: Conditions vary, there is ambiguity, and there is more than one right way to approach a situation or problem. Creativity, a very important human trait, cannot be measured, and one’s ability to act effectively in novel situations is also by its nature not standardizable. Therefore any assessment of one’s ability to use one’s knowledge and skills in real life situations is likely to have a degree of unreliability and unpredictability.

Furthermore, what one person views as good enough, as quality, in most real life applications also varies. A movie I thought was well acted and crafted, my best friend thought was poorly acted and rang false. And that is in movies made by highly paid seasoned professionals! Multiple publishers initially turned down a number of best selling classics in literature.

Compulsory public schooling in the United States was instituted at a particular point in history, alongside other changes and advances. Part of that was the belief in scientific experts and in the new field of psychology as a science rather than philosophy, along with the invention of standardized intelligence tests. Americans often want to find the one right way (Smith, 1988). Americans are known for their obsession with measuring everything, and putting numbers to everything. This has played into schools in the form of tests that can be reduced to numerical scores, and a belief that if everyone takes the same test at the same time in the same way, and the test is designed by outside experts, it is therefore objective.

Critics of the standardized tests of today point out the shortcomings of such tests: they don’t really have reliability at the individual level, they are culturally biased, and they are inauthentic, lacking actual validity in terms of measuring any important, useful skill, ability or knowledge beyond the schoolhouse walls. They also object to the indirect influence of these tests in encouraging the teaching of discrete skills and rote knowledge that is quickly forgotten once the test is over (Hursh, 2005; Kohn, 2000; Meier, 2002; Ohanian, 1999).

However, it must be remembered that standardized tests were put in place in part as a seemingly fairer alternative to an aristocratic system, where social position and money were what decided who got into the best schools and got the best jobs. Standardized tests were seen as scientifically objective, and therefore as giving an equal chance to all. One could rise by one’s merit, not relying on family name or wealth (Smith, 1988).

What authentic assessment proposes to do is to let people show what they know and can do based on merit, but also to reflect the skills and abilities the person should have more accurately than standardized tests do, by seeing how they apply that knowledge in a realistic situation.

Of course even “authentic assessment” is always a matter of degree. Authentic assessments are generally applied in somewhat contrived or hypothetical situations. In school situations it is rarely practical or even possible to have students demonstrate in the real life situation, and even authentic assessments give us just a sample of the full skill being assessed. To go back to the driving test example, even on the road test, not nearly every possible driving situation is encountered. The driver is asked to carry out a predetermined set of maneuvers at the direction of the tester over a relatively short period.

A large challenge for authentic assessment is to overcome “bias,” which is really an issue of reliability: would a different scorer give that person the same score? One way to address this is through multiple assessors. For instance, some high schools that use portfolios or exhibitions for graduation, following the approach developed at Central Park East Secondary School, use multiple assessors, while also having outside experts examine their system and watch it in practice to help them improve and refine it (Gold, 1993; Meier, 1995; Meier, 2002).

Another common way to obtain more reliability in authentic or performance based assessment systems is to have scorers be calibrated. A set of benchmarks is set up, consisting of examples of the performance assessment carried out at different levels. The scorers are first trained on what qualities to look for, and then they are asked to score these benchmark examples to see if they give them the expected scores. In theory, only when they can consistently give the expected scores are they considered calibrated, and therefore the scores are considered reliable.
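The calibration check described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used by any particular system: the function names, the 1 to 4 score scale, and the agreement thresholds (60 percent exact, 90 percent within one point) are all assumptions made for the example.

```python
from typing import Sequence


def agreement_rates(scorer: Sequence[int], benchmark: Sequence[int]) -> tuple[float, float]:
    """Return (exact, adjacent) agreement between a scorer and the benchmark scores.

    exact    = share of items where the scorer matched the benchmark score
    adjacent = share of items where the scorer was within one point of it
    """
    if len(scorer) != len(benchmark) or not scorer:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scorer)
    exact = sum(s == b for s, b in zip(scorer, benchmark)) / n
    adjacent = sum(abs(s - b) <= 1 for s, b in zip(scorer, benchmark)) / n
    return exact, adjacent


def is_calibrated(scorer: Sequence[int], benchmark: Sequence[int],
                  min_exact: float = 0.6, min_adjacent: float = 0.9) -> bool:
    """Hypothetical rule: a scorer is calibrated if both agreement rates clear the thresholds."""
    exact, adjacent = agreement_rates(scorer, benchmark)
    return exact >= min_exact and adjacent >= min_adjacent


# Made-up benchmark scores for six sample performances, and one trainee's scores.
benchmark_scores = [1, 2, 3, 4, 2, 3]
trainee_scores = [1, 2, 3, 3, 2, 4]
```

With these made-up numbers the trainee matches the benchmark on four of six items and is within one point on all six, so `is_calibrated(trainee_scores, benchmark_scores)` returns `True` under the assumed thresholds. Where exactly to set such thresholds is itself a judgment call.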

I will now discuss efforts in California to bring a more authentic, yet standardized, assessment in a systematic way to credential teachers.

California teachers are given their credential based on a variety of factors. Some have been (and still are) standardized paper and pencil tests. However, as we have discussed, there is a sense that these are not good indicators of how well candidates would actually teach. These tests are used as measures of minimum knowledge of basic skills. On the more authentic side, candidates are placed in classrooms to learn to teach alongside practicing teachers. In most teacher education programs in California, this is a semester-long placement. In some, such as where I currently teach, we require two semesters of student teaching. However, some worry about the standards of those assessing that experience. Were they tough enough? Are they consistent? There is no standard set of measures for that experience. The same could be said of another criterion, that candidates pass their college courses to become a teacher. Were the standards from one program to another, even one class to another, consistent (Chung, 2005)?

The legislature of the State of California decided to institute a performance based assessment system on top of the other criteria, to provide an authentic yet valid and reliable way to measure whether a candidate was ready to become a teacher.

Linda Darling-Hammond of Stanford University led a consortium of universities, with foundation support, to develop such a system, called the Performance Assessment for California Teachers (PACT). (Another similar system, the CalTPA, was developed by the Educational Testing Service.) In the PACT assessment teacher candidates develop a 3-5 day lesson plan in mathematics or reading, carry out the lessons in their placement, and videotape those lessons. They document all of this, providing a detailed description of the context where they taught the lesson, describing the school, the classroom and what they know about the students. They provide the lesson plans, and some discussion about those lesson plans. They reflect on what happened when they gave the lessons, what changes they made along the way, and what changes they might make if they were to give these lessons again. They select a 20-minute portion of the video for the portfolio, and discuss what is in that portion. They also provide examples of the assessment used in the lesson from three students of varying abilities. They discuss what they saw overall in reviewing the student assessment, and what they learned about the three students in particular.

This portfolio is then read and scored on a set of 12 rubrics. Several rubrics address issues of planning, several look at the execution of the lessons, and several others look at the issue of assessment. How the lessons helped students access and learn “academic language” is also assessed by two of the rubrics. The people who score these assessments go through a two-day scoring and calibration training, and must re-calibrate every year.

In practice, despite the training and calibration, there are still sometimes disagreements. (If a candidate fails, the portfolio is automatically scored by a second scorer. In addition, a random ten percent of portfolios get two scorers to check reliability.) While in the large majority of cases we probably score the candidates similarly, there are cases where we have scored them quite differently. In such a system, there is room for interpretation. If the rubric asks us whether the lesson was appropriate for the students, or whether the teacher gave clear feedback, what one of us interprets as appropriate or clear may not be the same as what another does.
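As a rough illustration of this double-scoring arrangement, the Python sketch below picks which portfolios go to a second scorer and then summarizes how closely two scorers agree. It is not the PACT procedure itself. The function names are hypothetical, the ten percent sample is assumed to be drawn from all portfolios, and Cohen's kappa is a standard agreement statistic brought in for the example, not one the system is known to use.

```python
import random
from collections import Counter


def select_for_second_scoring(passed_by_id: dict[str, bool],
                              audit_rate: float = 0.10,
                              seed: int = 0) -> set[str]:
    """Return ids of portfolios to double score: every failure plus a random sample.

    Assumption: the random sample is drawn from all portfolios, not only passing ones.
    """
    rng = random.Random(seed)
    ids = sorted(passed_by_id)
    failed = {pid for pid in ids if not passed_by_id[pid]}
    sample_size = round(audit_rate * len(ids))
    audited = set(rng.sample(ids, sample_size))
    return failed | audited


def cohens_kappa(first: list[int], second: list[int]) -> float:
    """Agreement between two scorers, corrected for agreement expected by chance."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first, counts_second = Counter(first), Counter(second)
    expected = sum(counts_first[c] * counts_second[c] for c in counts_first) / (n * n)
    if expected == 1.0:
        return 1.0  # both scorers gave one identical score to everything
    return (observed - expected) / (1 - expected)


# Made-up data: twenty portfolios, two of which failed on the first scoring.
first_pass = {f"p{i:02d}": i not in (3, 11) for i in range(20)}
to_double_score = select_for_second_scoring(first_pass)
```

A kappa of 1.0 means the two scorers agreed on every portfolio, and a kappa near 0 means they agreed no more often than chance would predict. A number like this can flag that scorers are reading a rubric differently, but it cannot say whose reading of “appropriate” or “clear” is the right one.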

These are the trade-offs for a more authentic system. For everything we add, something is also lost. On the positive side, in my institution it has meant that we have had dialog among the faculty about creating a more cohesive experience for the student. However, as with many high stakes assessment systems, preparing our students for the assessment itself has taken significant university class time, time that used to be spent on content. In that way students may be losing out. Some also wonder to what extent the ability to write well and to theorize is being assessed, rather than the actual ability to teach. Though assessors are told that the writing itself is not being assessed, it is for the most part a written assessment, albeit of a performance (along with the short video clip).

It is certainly a system that is more uniform than what was in place before. From my experience with the system, it does appear that the stakes have been raised for student teachers. Are the teachers who have now gone through this system better prepared? Are we better at keeping out unprepared teachers, while not excluding prepared ones, through this system? Those are much more difficult questions to answer, and there are no solid “facts” to settle them.

The problem in the United States is that people are looking for a foolproof “fair” system. The attempt is to avoid human judgment, which is by its nature full of biases and, well, judgment! Standardized paper and pencil tests offer us the illusion of avoiding judgment, but they just move such judgment to the creator of the test. They offer reliability, often at the cost of meaningfulness.

In the United States we rely on human judgment in our criminal justice system and our courts, where very important high stakes decisions are made, and while mistakes are made, maybe even often, it is seen as better than the alternative. Authentic assessment systems at heart require the same faith: a faith that the trade-off of allowing for human judgment is better than the reductionism required to assess in a standardized form. I believe we need to bring more of such human judgment back to our educational system.


Chung, R. R. (2005). The performance assessment for California teachers (PACT) and beginning teacher development: Can a performance assessment promote expert teaching practice? (Unpublished doctoral dissertation, 598 pp.). Stanford University. Retrieved from ProQuest Dissertations and Theses: http://search.Proquest.Com/docview/305434959?Accountid=10355

Gold, J. (Producer & Director), & Lanzoni, M. (Ed). (1993). Graduation by portfolio: Central Park East Secondary School [Videotape]. New York: Post Production, 29th Street Video Inc. http://vimeo.com/13992931

Hursh, D. (2005). The growth of high-stakes testing in the USA: Accountability, markets and the decline in educational equality. British Educational Research Journal, 31(5), 605-622.

Kohn, A. (2000). The case against standardized testing: Raising the scores, ruining the schools. Portsmouth, NH: Heinemann.

Meier, D. (1995). The power of their ideas: Lessons for America from a small school in Harlem. Boston: Beacon Press.

Meier, D. (2002). In schools we trust: Creating communities of learning in an era of testing and standardization. Boston: Beacon Press.

Ohanian, S. (1999). One size fits few: The folly of educational standards. Portsmouth, NH: Heinemann.

Ormrod, J. E. (2005). Educational psychology: Developing learners (4th ed.). Prentice Hall.

Smith, F. (1988). Joining the literacy club: Further essays into education. Portsmouth, NH: Heinemann.

Merit Pay

As the idea of merit pay sweeps the nation, and the federal government is pushing the idea down the throats of the states using the old carrot/stick approach, I have been thinking much about this topic. Florida is about to vote on such a bill, tying teacher pay to test scores.

Merit pay is popular in part because on the surface it has such a ring of fairness. Shouldn’t better teachers get rewarded for it? However, in reality, it is fraught with many complications and difficulties.

The issue also gets further confused as there are really two issues. One is teacher evaluation and the other is teacher compensation. Without a fair way to evaluate teachers, merit pay cannot be fair.

Some people complain that current teacher evaluation systems are poor. Usually a principal announces they will come in and observe. The principal makes notes and bases the teacher’s evaluation in large part on this single observation. Often this happens only once every other year for experienced teachers. I would agree that this method is lacking, but that makes the idea of merit pay more, not less, problematic. People also complain that bad teachers are allowed to keep teaching and are impossible to fire. That is mostly a gross exaggeration. The complaint rests in part on the fact that few teachers are “fired” in the technical sense of the word that would show up on public records. That is because at least 9 out of 10 times, the teacher resigns before being fired. That is typical in any field. Certainly in any professional field I have ever heard of, the employee in danger of being fired is generally encouraged to resign, sparing the employer the legal steps of actually firing the person, and sparing the employee from having it on their record. All the principals that I admire tell me that they can and do get rid of the teachers they think are not serving the students. While it is not easy, why should it be? If a principal could easily fire any teacher, it would make teaching a risky profession, especially for those with interesting ideas. Fear is never a good long term motivator. Teacher “tenure” (it is not actually technically “tenure”) just means that due process must be observed. Is due process a good thing or not?

But back to merit pay. Shouldn’t teachers get paid more for being better? First off, who gets to decide who is better, and how? Test scores seem to be the idea in vogue. That is what is being proposed in Florida, and is already used in various places. However, our current testing system tests only a tiny fraction of what is important for children to know (and does so in a poor way). In elementary schools it is rote math and reading skills. That is it. Basing pay on just that would encourage teachers, even more than they already are, to focus only on what is likely to be on the test, at the expense of everything else (many elementary schools, due to NCLB, have already reduced the curriculum to almost only these two areas). There is an axiom in the social sciences known as Campbell’s Law that says that the higher the stakes on a particular social indicator (e.g., a single test score), the more the use of that indicator corrupts the original intent, as it encourages people to manipulate the system to look good on that indicator regardless of other effects. We see that happening already: retaining students so they take the easier test; pushing kids to disappear from the system. There is the focus on the kids that show the most promise of moving from one category to the next, while ignoring others. Not to mention the examples of out and out cheating, such as changing test answers. Teachers start to resent the “low” students, the “slow” students, as they put the teachers’ pay or jobs in danger, rather than seeing them as a challenge, as the place to make a real difference.

There is also the issue of motivation. Merit pay is seen as a way to motivate teachers to work harder. When most of us think of motivation, we often think of rewards. However, the most effective motivation is actually not extrinsic rewards. The most effective motivation is the enjoyment or intrinsic reward of the activity itself. Virtually all teachers go into teaching because they want to make a difference in their students’ lives and to be successful teachers, not for the great pay! What psychological research has demonstrated again and again is that the more you externally reward someone for what they find intrinsically motivating, the less motivated they become for the thing itself, as the reward replaces their intrinsic motivation. They no longer care if the results are real, as long as they get the reward. Recent studies have suggested that bonuses in business are actually likely to make workers less, not more, productive. Extrinsic rewards actually lead to less intrinsic interest in a job well done, not more.

School reform research has shown that the most effective schools are those where teachers work together closely and have a shared vision. But merit pay is likely to increase competition among teachers, discouraging collaboration. In today’s climate of limited resources, if one teacher gets a bonus, it comes from the pool that everyone gets paid from, pitting teachers against each other for these limited resources. It becomes in my self-interest to sabotage the other teachers to increase my chances of getting that money, or at least not to help them.

It is a truism that teachers are underpaid. Despite that, there is no compelling evidence that teachers leave the field over issues of pay, or that more pay gets them to work harder. It is possible we might attract a higher quality pool of candidates if teacher pay were significantly higher. However, in studies of what makes teachers satisfied or dissatisfied with their jobs, other working conditions are much higher on the list. How they are treated, what types of autonomy they have, what types of support they receive, resources, class sizes, and leadership all rate higher than issues of pay.

Mostly, merit pay is a sideshow, a distraction from any real answer to the difficult problems of educational reform. It is another quick fix that can be used to undermine teachers and the unions that represent them in the move to privatize schooling.