(This column is adapted from a talk I gave at the University of Kyoto in January 2012.)
Those in the field of assessment often refer to two important standards that assessments are expected to meet: reliability and validity. Reliability means that the same results would be obtained if the assessment were given again, or if a different person were scoring it.
Validity means that the assessment actually measures what it claims to be measuring, and that it predicts how one will perform in the future (Ormrod, 2005).
One type of validity is “face validity”: it is accepted that the assessment actually measures what it claims to measure, without needing statistical proof that it does. The road test portion of the driving test might be an example: we can easily agree that if we want to know whether someone knows how to drive, we can sit in a car with them and watch them drive. What constitutes good enough driving to pass the test, however, is where agreement gets more difficult. Both how good is good enough and which things matter most (how well the student parked, used turn signals, obeyed signs, and how much each should count) can be controversial.
Other tests need to have their validity demonstrated. The paper-and-pencil portion of the driving test might be one of those. Do we have any evidence that those who do better on the written portion are actually better drivers?
However, while we accept that the road test has more face validity, we might wonder about its reliability, given its possibly subjective nature. The written portion is more reliable: you either filled in the correct bubble or you did not. On the driving portion, maybe the traffic conditions were more difficult when you took it than when your friend did; maybe one examiner is a tougher grader than another. Maybe he had a fight with his spouse that morning! Despite these shortcomings, we accept the trade-offs as worth the advantages of such an authentic assessment. A built-in safeguard is the opportunity for a second, a third, and as many retakes of the test as needed.
It is easy to create paper-and-pencil assessments that are reliable and easy to administer. However, how one’s score correlates to real-life application of the knowledge or skill the assessment is designed to measure is harder to determine. Some, myself included, argue that there is a built-in tension between reliability and authenticity. Real-life tasks and situations are by their nature not standardizable: conditions vary, there is ambiguity, and there is more than one right way to approach a situation or problem. Creativity, a very important human trait, cannot be measured, and one’s ability to act effectively in novel situations is likewise by its nature not standardizable. Therefore, assessing one’s ability to use one’s knowledge and skills in real-life situations is likely to involve a degree of unreliability, of unpredictability.
Furthermore, what one person views as good enough, as quality, in most real-life applications also varies. A movie I thought was well acted and well crafted, my best friend thought was poorly acted and rang false. And that is in movies made by highly paid, seasoned professionals! Multiple publishers initially turned down a number of best-selling classics of literature.
Compulsory public schooling in the United States was instituted at a particular point in history, amid other changes and advances. Among these were a belief in scientific experts, the emergence of psychology as a science rather than a branch of philosophy, and the invention of standardized intelligence tests. Americans often want to find the one right way (Smith, 1988), and they are known for their obsession with measuring everything and putting numbers to everything. This has played out in schools in the form of tests that can be reduced to numerical scores, and in a belief that if everyone takes the same test at the same time in the same way, and the test is designed by outside experts, it is therefore objective.
Critics of the standardized tests of today point out the shortcomings of such tests: they do not really have reliability at the individual level, they are culturally biased, and they are inauthentic, lacking actual validity in terms of measuring any important, useful skill, ability, or knowledge beyond the schoolhouse walls. Critics also object to the indirect influence of these tests in encouraging the teaching of discrete skills and rote knowledge that is quickly forgotten once the test is over (Hursh, 2005; Kohn, 2000; Meier, 2002; Ohanian, 1999).
However, it must be remembered that standardized tests were put in place in part as a seemingly fairer alternative to an aristocratic system, in which social position and money decided who got into the best schools and got the best jobs. Standardized tests were seen as scientifically objective, and therefore as giving an equal chance to all: one could rise by one’s merit, not by relying on family name or wealth (Smith, 1988).
What authentic assessment proposes is to let people show what they know and can do based on merit, while reflecting the skills and abilities a person should have more accurately than standardized tests do, by seeing how they apply that knowledge in a realistic situation.
Of course, even “authentic assessment” is always a matter of degree. Authentic assessments are generally applied in somewhat contrived or hypothetical situations. In school it is rarely practical, or even possible, to have students demonstrate in the real-life situation, and even authentic assessments give us just a sample of the full skill being assessed. To go back to the driving test example: even on the road test, nowhere near every possible driving situation is encountered. The driver is asked to carry out a predetermined set of maneuvers at the direction of the tester over a relatively short period.
A large issue for authentic assessment is overcoming the issue of “bias,” which is really an issue of reliability: would a different scorer give that person the same score? One way to address this is through multiple assessors. For instance, some high schools that use portfolios or exhibitions for graduation, such as the system developed at Central Park East Secondary School, use multiple assessors while also having outside experts examine their system and watch it in practice to help them improve and refine it (Gold, 1993; Meier, 1995; Meier, 2002).
Another common way to obtain more reliability in authentic or performance-based assessment systems is to calibrate the scorers. A set of benchmarks is established: examples of the performance assessment carried out at different levels. The scorers are first trained on what qualities to look for, and then asked to score these benchmark examples to see whether they give the expected scores. In theory, only when they can consistently give the expected scores are they considered calibrated, and only then are their scores considered reliable.
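To make the calibration idea concrete, here is a minimal sketch, in Python, of how a trainee scorer’s agreement with benchmark scores might be checked. Everything in it is an assumption of mine for illustration: the 1-4 rubric scale, the invented scores, and the pass rule of at least 80% exact agreement with no score more than one level off. It is not the actual criterion used by any particular program.

    # Hypothetical calibration check: compare a trainee scorer's ratings of
    # benchmark examples against the expected benchmark scores. The rubric
    # scale, data, and pass rule are illustrative assumptions only.
    def calibration_agreement(expected, trainee):
        """Return the fraction scored exactly right and the fraction
        scored within one rubric level of the expected score."""
        n = len(expected)
        exact = sum(e == t for e, t in zip(expected, trainee)) / n
        adjacent = sum(abs(e - t) <= 1 for e, t in zip(expected, trainee)) / n
        return exact, adjacent

    # Expected scores for ten benchmark examples on a 1-4 rubric,
    # and one trainee's attempt at scoring the same examples.
    expected_scores = [2, 3, 1, 4, 3, 2, 2, 3, 4, 1]
    trainee_scores = [2, 3, 2, 4, 3, 2, 1, 3, 4, 1]

    exact, adjacent = calibration_agreement(expected_scores, trainee_scores)
    calibrated = exact >= 0.8 and adjacent == 1.0  # assumed pass rule
    print(f"exact: {exact:.0%}, adjacent: {adjacent:.0%}, calibrated: {calibrated}")

In a real system the benchmarks, scale, and thresholds would be set by the program’s designers; the point is simply that “calibrated” can be given an operational definition.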
I will now discuss efforts in California to bring a more authentic, yet standardized, assessment to the credentialing of teachers in a systematic way.
California teachers are given their credential based on a variety of factors. Some have been (and still are) standardized paper-and-pencil tests. However, as discussed above, there is a sense that these are not good indicators of how well candidates will actually teach; they are used as measures of minimum knowledge of basic skills. On the more authentic side, candidates are placed in classrooms to learn to teach alongside practicing teachers. In most teacher education programs in California this is a semester-long placement; in some, such as where I currently teach, we require two semesters of student teaching. However, some worry about the standards of those assessing that experience. Were they tough enough? Are they consistent? There is no standard set of measures for that experience. The same could be said of the other criterion, that candidates pass their college courses to become a teacher: were the standards consistent from one program to another, even from one class to another (Chung, 2005)?
The legislature of the State of California decided to institute a performance-based assessment system, on top of the other criteria, to provide an authentic, yet valid and reliable, way to measure whether a candidate was ready to become a teacher.
Linda Darling-Hammond of Stanford University led a consortium of universities, with foundation support, to develop such a system, called the Performance Assessment for California Teachers (PACT). (A similar system, the CalTPA, was developed by the Educational Testing Service.) In the PACT assessment, teacher candidates develop a 3-5 day sequence of lessons in mathematics or reading, carry out the lessons in their placement, and videotape them. They document all of this, providing a detailed description of the context in which they taught: the school, the classroom, and what they know about the students. They provide the lesson plans, along with some discussion of them. They reflect on what happened when they gave the lessons, what changes they made along the way, and what changes they might make if they were to give the lessons again. They select a 20-minute portion of the video for the portfolio and discuss what is in that portion. They also provide examples of the assessment used in the lesson from three students of varying abilities, discussing what they saw overall in reviewing the student assessments and what they learned about the three students in particular.
This portfolio is then read and scored on a set of 12 rubrics. Several rubrics address planning, several look at the execution of the lessons, and several others look at assessment. How the lessons helped students access and learn “academic language” is also assessed, by two of the rubrics. The people who score these assessments go through a two-day scoring and calibration training, and must re-calibrate every year.
In practice, despite the training and calibration, there are still sometimes disagreements. (If a candidate fails, the portfolio is automatically scored by a second scorer, and a random ten percent of portfolios get two scorers as a check on reliability.) While in the large majority of cases we probably score candidates similarly, there are cases where we have scored them quite differently. In such a system there is room for interpretation: if the rubric asks whether the lesson was appropriate for the students, or whether the teacher gave clear feedback, what one of us interprets as appropriate or clear may not match what another does.
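That reliability check can itself be quantified. Below is a minimal sketch, with invented scores rather than real PACT data, of how agreement between two scorers on the double-scored portfolios might be summarized using Cohen’s kappa, a standard statistic that corrects raw agreement for the agreement two raters would reach by chance.

    # Hypothetical inter-rater reliability check on double-scored portfolios,
    # using Cohen's kappa. The scores below are invented for illustration;
    # they are not real PACT data.
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """Chance-corrected agreement between two raters on the same items."""
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Agreement expected if each rater assigned scores independently
        # at their own observed rates.
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        expected = sum(freq_a[s] * freq_b[s] for s in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # Invented 1-4 rubric scores from two scorers on ten portfolios.
    scorer_1 = [2, 3, 3, 1, 4, 2, 3, 2, 4, 3]
    scorer_2 = [2, 3, 2, 1, 4, 2, 3, 3, 4, 3]
    print(f"kappa = {cohens_kappa(scorer_1, scorer_2):.2f}")  # about 0.71

A kappa near 1 would mean the scorers almost always agree beyond chance; a kappa near 0 would suggest the rubric leaves exactly the kind of room for interpretation described above.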
These are the trade-offs of a more authentic system. For everything we do, everything we add, something is also lost, traded away. On the positive side, in my institution it has meant dialog among the faculty about creating a more cohesive experience for the students. However, as with many high-stakes assessment systems, preparing our students for the assessment itself has taken significant university class time, time that used to be spent on content. In that way students may be losing out. Some also wonder to what extent the ability to write well and to theorize is being assessed, rather than the actual ability to teach. Though assessors are told that the writing itself is not being assessed, it is for the most part a written assessment, albeit of a performance (along with the short video clip).
It is certainly a system more uniform than what was in place before, and from my experience with it, the stakes do appear to have been raised for student teachers. Are the teachers who have now gone through this system better prepared? Are we better at keeping out unprepared teachers while not excluding prepared ones? That is a much more difficult question to answer, one for which there are no solid “facts.”
The problem in the United States is that people are looking for a foolproof, “fair” system. The attempt is to avoid human judgment, which is by its nature full of biases and, well, judgment! Standardized, paper-and-pencil tests offer us the illusion of avoiding judgment, but they just move that judgment to the creator of the test. They offer reliability, often at the cost of meaningfulness.
In the United States we rely on human judgment in our criminal justice system, our courts, for very important, high-stakes decisions, and while mistakes are made, maybe even often, this is seen as better than the alternative. Authentic assessment systems at heart require the same faith: a faith that the trade-off of allowing for human judgment is better than the reductionism required to assess in a standardized form. I believe we need to bring more of such human judgment back to our educational system.
References:
Chung, R. R. (2005). The performance assessment for California teachers (PACT) and beginning teacher development: Can a performance assessment promote expert teaching practice? (Unpublished doctoral dissertation). Stanford University. Retrieved from http://search.proquest.com/docview/305434959?accountid=10355
Gold, J. (Producer & Director), & Lanzoni, M. (Ed.). (1993). Graduation by portfolio: Central Park East Secondary School [Videotape]. New York: Post Production, 29th Street Video Inc. http://vimeo.com/13992931
Hursh, D. (2005). The growth of high-stakes testing in the USA: Accountability, markets and the decline in educational equality. British Educational Research Journal, 31(5), 605-622.
Kohn, A. (2000). The case against standardized testing: Raising the scores, ruining the schools. Portsmouth, NH: Heinemann.
Meier, D. (1995). The power of their ideas: Lessons for America from a small school in Harlem. Boston: Beacon Press.
Meier, D. (2002). In schools we trust: Creating communities of learning in an era of testing and standardization. Boston: Beacon Press.
Ohanian, S. (1999). One size fits few: The folly of educational standards. Portsmouth, NH: Heinemann.
Ormrod, J. E. (2005). Educational psychology: Developing learners (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Smith, F. (1988). Joining the literacy club: Further essays into education. Portsmouth, NH: Heinemann.