Bilingual Education: The Research

For those who have any doubts about the efficacy of bilingual education, below is a summary of the evidence from more than 20 years of research. I will follow up in a future blog post with my summary of why it works.


National Literacy Panel on Language-Minority Children and Youth (U.S.), August, D., & Shanahan, T. (2006). Executive summary: Developing literacy in second-language learners: Report of the national literacy panel on language minority children and youth. Mahwah, N.J.: Lawrence Erlbaum Associates. [meta-analysis]

“The research indicates that instructional programs work when they provide opportunities for students to develop proficiency in their first language. Studies that compare bilingual instruction with English-only instruction demonstrate that language-minority students instructed in their native language as well as in English perform better, on average, on measures of English reading proficiency than language-minority students instructed only in English. This is the case at both the elementary and secondary levels” (p.11).


Rolstad, K., Mahoney, K., & Glass, G. V. (2005). The big picture: A meta-analysis of program effectiveness research on English language learners. Educational Policy, 19(4), 572-594. [meta-analysis]

“Empirical evidence considered here indicates that bilingual education is more beneficial for ELL [English language learner] students than all-English approaches such as ESL [English as a second language] and SI [Structured immersion]. Moreover, students in long-term DBE [Developmental bilingual education] programs performed better than students in short-term TBE [transitional bilingual education] programs.”  (p.19)


Rolstad, K., Mahoney, K., & Glass, G. (2005). Weighing the evidence: A meta-analysis of bilingual education in Arizona. Bilingual Research Journal, 29(1).

Abstract: This article reviews the current policy context in the state of Arizona for program options for English language learners and produces a meta-analysis of studies on the effectiveness of bilingual education that have been conducted in the state in or after 1985. The study presents an analysis of a sample of evaluation studies (N = 4), which demonstrates a positive effect for bilingual education on all measures, both in English and the native language of English language learners, when compared to English-only instructional alternatives. We conclude that current state policy is at odds with the best synthesis of the empirical evidence, and we recommend that current policy mandating English-only and forbidding bilingual education be abandoned in favor of program choices made at the level of the local community.


Hofstetter, C. H. (2004). Effects of a transitional bilingual education program: Findings, issues, and next steps. Bilingual Research Journal, 28(3), 355-377. [primary research]

“After 4 years in their respective programs, students in ALA [Academic Language Acquisition, a form of transitional bilingual education] and SEI [Structured English Immersion] classes displayed only nominal differences, at best, in their performance on various achievement indicators. ALA and SEI students… were comparable on English-language SAT–9 tests in reading, mathematics, and language arts, as well as the reading and listening and speaking portions of the CELDT, an English-proficiency test. The only significant difference among groups occurred in writing, where students in… ALA … scored lower than their peers.” (p.16)


Howard, E. R., Sugarman, J., & Christian, D. (2003). Trends in two-way immersion education: A review of the literature (Report No. 63): Center for Applied Linguistics. [research summary]

“On aggregate, the research summarized in this section indicates that both native Spanish speakers and native English speakers in TWI [two-way immersion] programs perform as well or better than their peers educated in other types of programs, both on English standardized achievement tests and Spanish standardized achievement tests.” (p.30)


Thomas, W., & Collier, V. (2002). Executive summary: A national study of school effectiveness for language minority students’ long-term academic achievement. Washington, DC: Center for Research on Education, Diversity & Excellence. [primary research]

“Enrichment 90-10 and 50-50 one-way and two-way developmental bilingual education (DBE) programs (or dual language, bilingual immersion) are the only programs we have found to date that assist students to fully reach the 50th percentile in both L1 and L2 in all subjects and to maintain that level of high achievement, or reach even higher levels through the end of schooling. The fewest dropouts come from these programs.” (p.7)


Snow, C., Burns, S., & Griffin, P. (1998). Preventing reading difficulties in young children [electronic version] . Washington, DC: National Academy Press. Retrieved  March 24, 2007 from [Research summary]

“The accumulated wisdom of research in the field of bilingualism and literacy tends to converge on the conclusion that initial literacy instruction in a second language … carries with it a higher risk of reading problems and of lower ultimate literacy attainment than initial literacy instruction in a first language.”


Greene, J. (1997). A meta-analysis of the Rossell and Baker review of bilingual education research. Bilingual Research Journal, 21(2-3), 103-122.

“Despite the relatively small number of studies, the strength and consistency of these results, especially from the highest quality randomized experiments, increases confidence in the conclusion that bilingual programs are effective at increasing standardized test scores measured in English.”


Thomas, W., & Collier, V. (1997). School effectiveness for language minority students. Washington, DC: National Clearinghouse for Bilingual Education. [primary research]

“The first predictor of long-term school success is cognitively complex on-grade-level academic instruction through students’ first language for as long as possible (at least through Grade 5 or 6) and cognitively complex on-grade-level academic instruction through the second language (English) for part of the school day, in each succeeding grade throughout students’ schooling…. The second predictor of long-term school success is the use of current approaches to teaching the academic curriculum through two languages.” (p.16)


Ramirez, J. D. (1992). Executive summary: Longitudinal study of structured English immersion strategy, early-exit and late-exit transitional bilingual education programs for language-minority children. Bilingual Research Journal, 16, 1-62. [primary research]

“Providing substantial instruction in the child’s primary language does not impede the learning of English language or reading skills.” (p.44)


Willig, A. C. (1985) A Meta-Analysis of Selected Studies on the Effectiveness of Bilingual Education. Review of Educational Research

“Meta analysis results were compared with a traditional review of bilingual education program effectiveness. When controlled for methodological inadequacies, participation in bilingual education programs consistently produced differences favoring bilingual education.”

What is the Evidence?

Deborah Meier, in collaboration with her faculty at Central Park East Secondary School, developed five habits of mind that were at the heart of their school. One of those habits of mind was to ask “What is the evidence?”

I was rereading an article on Direct Instruction(1) that I have my teaching credential students read. The article ends with the claim that Direct Instruction, unlike discovery approaches to learning, has research evidence demonstrating its effectiveness. However, as educational reformer Deborah Meier keeps reminding us about such claims, we have to always ask what counts as evidence? How is achievement defined? Effective at what?

In educational research, test scores almost always constitute the evidence, and increasingly these are the scores on the standardized tests that each state mandates to meet the rules of the No Child Left Behind legislation.

However, we must look at all the assumptions that are built into using such test scores as evidence of learning. The assumption that test scores are meaningful and accurate has been questioned by many educational experts (see, for example, Alfie Kohn’s The Case against Standardized Testing(2), or the FairTest website for more in-depth information on this topic).

One assumption is that such tests actually test what they claim to test. If what we really want to know is how people can use a skill in an authentic situation, how close to that performance are their results on a multiple-choice paper-and-pencil test? Can you imagine if we used only the written test to decide whether someone could drive? When researchers have looked at how people do at using math algorithms in school, and then at how they try to solve real problems that require the same math in their daily lives, they see little connection between the two.

Even in something that seems as basic as reading, where one does read in the test and then answer questions about it, researchers have found that often the reason students get the answer right or wrong has as much to do with their prior knowledge and cultural assumptions about the content as it does with being able to read the passage(3). And often, in the case of so-called reading tests, it is not reading at all that is tested, but what are called reading subskills, which are believed by some to be precursors to skilled reading, such as recognizing certain sound or spelling patterns. However, doing well on such subskills has not been shown to be connected to comprehension of what one reads (see my article on Reading First for more on this(4)). Typical standardized reading tests also test other aspects of knowledge of language, such as recognizing synonyms and homonyms. While these and others may be good terms to understand, does knowing the terms make one a better reader, or just more knowledgeable about linguistics?

The next major assumption I want to challenge is that short term results on such tests predict long term results. This is often not the case. If early learning is speeded up in order to improve short term test results, it can result in leaving students with a shaky foundation, therefore actually leading to poorer long term results. There is a parallel in business. When financial institutions and businesses go for short term profits to please stockholders, it is often at the risk of the long term stability and interest of the company, as we have seen with our recent economic collapse. In math, teaching the rote memorization of algorithms may help students pass the next test, where each problem is presented just as you taught it, but then in the following years, without a foundation in the concepts that underlie those algorithms, such students’ abilities to understand more complex concepts and solve the more complex problems that go with those concepts will not be there, and their scores will collapse like a house of cards. This sort of short-sightedness exists in many areas of the curriculum, especially when there are large pressures to get those short term results.

Another aspect I want to challenge is whether the possible side effects have been looked at. When pharmaceutical companies test new drugs, they are required to look not just at whether the drug cures the ailment, but also at the possible side effects on other aspects of health. This never seems to be done in educational research. In the pursuit of raising test scores, might the new methods create other problems? We act as if the child is made up of discrete skills and knowledge, each of which can be taught and measured separately, without an effect on anything else, rather than looking at the child as a whole being. For instance, are we increasing obesity, as schools cut out recess and other activities in which students are more active to spend more time studying the tested subjects?

Even in terms of the activity we are testing, might the way we teach have an effect not just on how well one does it, but on whether one wants to do it? Stephen Krashen pointed out in his book on whole language(5) that studies comparing free reading time to direct instruction of reading found the test scores were similar. However, which is more likely to lead to a love of reading: students who get to choose what they read, or those who read decontextualized texts over which they have no say, and then get tested regularly on those passages? Yet this love and desire to read is not assessed.

The last assumption I want to examine is that what we are testing is what matters most. No one questions that students should be able to read, write and do arithmetic. But if you ask parents and teachers what they mean by a well-educated person, and what they want their children to get out of school, these are generally not the first things they mention. How do the students treat others? How motivated are they for further learning? Do they like school? Do they have empathy for others? Are they likely to be civic minded and civically active?

Other questions we might ask are: how persistent is a student in the face of difficult tasks? What is their ability to put together knowledge and abilities from a variety of areas and use them in novel ways? Can they express their ideas effectively? Do they listen to the ideas of others? How and what we teach can and does have effects on these as well. There are many other questions each of us might think are equally or more important. Yet these almost never get asked or taken seriously in educational research, particularly not in the research that is used for policy. The very question of what is most important to assess is not even asked.

There have been a few exceptions to this trend. In the area of progressive education, for instance, I can name several. In the 1930s, there was the Eight Year Study(6) which matched students who went to high schools implementing progressive methodologies to those in traditional high schools, and then followed them through college. This study looked at a wide variety of definitions of success, finding that those who attended the more progressive schools showed better results.

David Bensman did a study of the progressive Central Park East schools (a group of public schools in New York City serving predominantly low-income African-American and Latino students) that looked not just at the graduates’ test scores, but also at college, employment, civic involvement and their impressions of the impact of the school on their lives(7). He also found that these students did much better than their counterparts who went to neighboring schools.

A friend just sent me a recent master’s thesis on the Peninsula School, a progressive independent K-6 school. It compared the graduates’ high school achievement to that of a random sample of their high school classmates who had gone to other elementary schools, and found that the students from the progressive school did better academically. Not only that, but the study also found they had better attitudes toward school and their learning experiences(8).

A study of types of programs for second language learners, while not going beyond test scores, was at least longitudinal, using a very large sample and following students throughout the grades. It found that programs that used more of the primary language, and those that used methodologies where language was taught in context-embedded ways, had better results(9). This was despite the fact that in the early grades the students with more English instruction and less primary language did better. Short-term results were negatively correlated with long-term results in this case.

Whenever someone says that the evidence proved that a certain method is better, one must ask, what is that evidence? Did the assessment really match your definition of what it means to be able to do or know that? Were the results short or long term, and if short term, what is the evidence that these short term results will add up to long term success? Also, it is important to ask what are the effects on other aspects of learning or the life of the student. And most importantly, are they assessing what really matters?


1. Tarver, Sarah G. “Direct Instruction: Teaching for Generalization, Application and Integration of Knowledge.” Learning Disabilities 10, no. 4 (2000): 201-07.

2. Kohn, Alfie. The Case against Standardized Testing: Raising the Scores, Ruining the Schools. Portsmouth, NH: Heinemann, 2000.

3. Meier, Deborah. “Why Reading Tests Don’t Test Reading.” Dissent, Fall 1981, 457-66; and Meier, Deborah. “The Fatal Defects of Reading Tests.” In The Open Classroom Reader, edited by Charles Silberman. New York: Random House, 1973.

4. Meier, Nicholas. “Reading First.” Critical Literacy 3, no. 2 (2009): 69-83.

5. Krashen, Stephen D. Three Arguments against Whole Language & Why They Are Wrong. Portsmouth, NH: Heinemann, 1999.

6. Aiken, Wilford M. The Story of the Eight-Year Study. New York: Harper and Row, 1942.

7. Bensman, David. Central Park East and Its Graduates: Learning by Heart. New York: Teachers College Press, 2000.

8. Dinwiddie, James, and Anne M. Young. “Comparative Outcomes for Progressive School and Non-Progressives School Students.” Master’s thesis, San Jose State University, 2010.

9. Thomas, Wayne, and Virginia Collier. “School Effectiveness for Language Minority Students.” Washington, DC: National Clearinghouse for Bilingual Education, 1997.

Best Practices

The term “best practices” has become popular over the last decade. For me the term is problematic in a number of ways. First, it leaves off the essential question: “best” for what? Despite statewide standards and the current move toward national standards, we do not all agree on the aims and purposes of public education. Far from it, as I have discovered every time I teach a new group of teacher candidates.

The other problematic assumption is that there is a best method for whatever our purpose is. While there are practices that are generally more effective than others, human beings and the teacher/student relationship, not to mention all the other contextual variables, are so complex that no one practice is likely to always be the best, if even effective at all.

Let us consider an analogy. Let us say I want to find the “best” shoe size, so I can provide all my students with the right shoes. I do a controlled study, and find that when I give size 10 shoes, more students have shoes that fit them than with any other size. Now I can mandate that everyone be given size 10 shoes. But men’s and women’s feet are different, you complain. Okay, I may need to do some differentiation. Women get a women’s size 8 1/2. How about ethnic groups? Mexican-Americans tend to be smaller. Okay, Mexican-American men get a size 9…

We can all see the utter absurdity of this. But this is what we are doing to our school children, especially to the most needy and disadvantaged school children. I spend a lot of time in a lot of different schools as a researcher and as a supervisor of student teachers. In schools that are considered “Program Improvement” under No Child Left Behind, I see teachers mandated to give lessons where every child is on the same page at the same time doing the same exercises, often not just in the one classroom, but in every class at that grade level. There is a pacing guide to keep up with. The students must move on, whether they got it or not (and do it whether they already know it or not). A few “differentiated” students may be allowed to get special help (by missing out on some other activity, or after school). Extensive data is kept on how the students are doing, with unit tests every few weeks that are diagnosed, often through sophisticated computer programs. Students’ scores get displayed in staff lounges (part of the data-driven philosophy). Yet the teacher really cannot make much use of the data, since no matter what it says, they must keep to the pacing guide. This is seen as “equity” under NCLB. All children are afforded the same curriculum, the same instruction. After all, this curriculum has been designated as “research based,” since it uses strategies that the Reading Panel found to be most effective. We must have equally high expectations for all! Most of you probably think I am exaggerating. I assure you I am not. If you think so, find a school that has been designated as a “Reading First” school and serves predominantly low-income students. Maybe it is different in your state, but here in California, I have seen what I described above over and over again.

The problem is that educational experts are being asked the wrong question: Which is the best method? Such a question was asked of the recent federal National Reading Panel, which was charged with coming up with the best method for teaching reading. Textbook publishers created the materials that are used by “Reading First” schools, supposedly based on the recommendations of this Reading Panel. However, such one-size-fits-all thinking is as absurd for teaching as it is for shoe size. Instead we need to be asking: what is the best way to support teachers and classrooms so that each child can learn in the most effective way? No two children are the same, and even the same child may need something different from day to day.

The best schools, schools that succeed with large percentages of students, are ones where teachers work together collaboratively getting to know the students. In these schools they devise curriculum that allows all students to find ways into it, no matter what their learning differences, styles and abilities are. These schools honor these differences, while expecting, cajoling and pushing all students to do their best.

Reading First

[Click here for the full version of this article as published in Critical Literacy Vol. 3, No.2]

A front-page article in Education Week (May 7, 2008) proclaims that “Reading First Doesn’t Help Pupils ‘Get It.’” This assessment is based on the U.S. Department of Education’s Reading First Impact Report. For those of you who are not aware, Reading First is a federally funded grant program for “failing” school districts that use textbooks approved by the federal government as being based on “scientifically based reading instruction.” What makes it scientifically based? That it presumably follows the advice of the National Reading Panel. The question becomes, why haven’t such programs shown effectiveness if they are scientifically based?

It turns out that where these reading programs are failing is in the area of “reading comprehension.” The report documents that schools using the program are increasing their use of the recommended practices. These programs do appear to help with so-called decoding skills. However, the use of these recommended practices and these gains in decoding skills do not appear to translate into improvements in actual reading, that is, making meaning of text. Those who actually read the Reading Panel’s report should not be surprised. The fact is that the report did not have any evidence that the recommended strategies would help in reading comprehension. The only “scientifically based evidence” the panel found was that a limited amount of systematic phonics and phonemic awareness instruction would raise scores on tests of phonics and phonemic awareness for “regular” beginner readers(1). Just as in the evidence from the programs used in the field, there was no evidence in the Reading Panel’s report that such practices improved reading comprehension.

That advancement of such skills would raise actual reading ability is based on a theory of reading that is in fact quite controversial among reading researchers and specialists. Many of the foremost reading researchers, theorists and specialists have always contended that only a minimal amount of “skills-based” teaching is helpful, and that reading is most effectively learned through… reading(2)! (With support and help from those who already know how.) Moreover, the Reading Panel report found evidence for even this limited effectiveness of the skills-based approach only for students who were not shown to be problem readers, to have learning difficulties, or to be second language learners. Yet these Reading First programs are often used for students of all types. In California, teachers are often mandated to use these programs in schools serving overwhelmingly Latino students whose dominant language is Spanish, all in the name of scientifically based curriculum. These skills-based strategies are applied in these programs for a much larger proportion of the teaching day than the research supports (more than a minimal amount is overkill; it’s like trying to pour more water into an already full glass). They are also applied at grades for which the research has no evidence of effectiveness (the research on these approaches looked only at first or second grade). Students at all grade levels, whether they are already reading fluently for meaning or not, are spending hours every week on decoding and phonics skills in these programs.

A friend of mine teaches kindergarten at one of these Reading First schools. The school is made up of over 90% Latino students. On the phone with her just the other day, she was telling me how they are constantly advised to examine the data on the students (another educational buzzword currently popular is “data-driven instruction”). She tells me that she is all for examining and basing instruction on data about her students. In fact the school spends considerable staff development time doing just that: examining the scores of the students so they know just where each student is. She can tell you exactly where each of her students measures up on all of the assessments, which are carried out at the end of each six-week unit. Yet then she is told to keep all the students on the same page at the same time, and that she should not deviate from the script in the textbook (see, we’re not leaving them behind, they are on the same page as all the other students!). So what good does it do her to have this data? This practice ignores the research on the uselessness of teaching above students’ level of understanding(3). If you move on when students don’t get it, they certainly aren’t going to get the next lesson, which builds on knowledge from the previous lessons, especially in a skills-based approach(4)! Her story of being mandated to use a one-size-fits-all approach is one I see and hear repeatedly from many of the student teachers and the experienced teachers I work with as a professor of teacher education, particularly those working with low-income minority children.

One of the worst problems of such programs is that they not only ignore the expertise that teachers bring to teaching their actual students, they try to prohibit it! Good teachers have always known that different children learn in different ways. Anyone who has taught knows that. Any parent with more than one child knows that. Good teaching is about figuring out that way for each student. If we really want to “Leave No Child Behind,” we have to stop tying teachers’ hands with scripted one-size-fits-all programs. We must allow them to do what they are trained to do, and spend a career getting better at: figuring out how the actual students sitting in front of them learn, and adapting their teaching to the students, not the other way around! (Which is part of the argument for small class sizes, but that’s another topic.)



1. Elaine M. Garan, “What Does the Report of the National Reading Panel Really Tell Us About Teaching Phonics.” Language Arts 79, no.1 (2001): 61-71.

2. Edward A. Chittenden, Terry S. Salinger, and Ann M. Bussis, Inquiry into Meaning: An Investigation of Learning to Read (New York: Teachers College Press, 2001). Gerald Coles, Misreading Reading: The Bad Science That Hurts Children (Heinemann, 2000); Kenneth S. Goodman, In Defense Of Good Teaching: What Teachers Need to Know about the “Reading Wars” (York, Me: Stenhouse Publishers, 1998); Stephen D. Krashen, Three Arguments against Whole Language & Why They Are Wrong (Heinemann, 1999); Jeff McQuillan, Literacy Crisis: False Claims, Real Solutions (Heinemann, 1998); and Frank Smith, Understanding Reading: A Psycholinguistic Analysis of Reading and Learning to Read. 6th ed. (Lawrence Erlbaum Associates, 2004).

3. John D. Bransford, Ann L. Brown, and Rodney L. Cocking, How People Learn: Brain, Mind, Experience, and School (Washington, DC: National Academy Press, 2000); and Linda Darling-Hammond, Barbara Low, Bob Rossbach, and Jay Nelson. The Learning Classroom: Theory into Practice (Burlington, VT: Annenberg/CPB, 2003).

4. James H. Block & Robert B. Burns, “Mastery Learning.” Review of the Research in Education, 4 (1976) 3-49; and J. Ronald Gentile & James P. Lalley, Standards and Mastery Learning (Corwin Press, 2003).

Educational Research

There are a variety of important issues with regard to educational research these days. One hot topic right now is that our current federal administration has restricted the definition of acceptable research to only one type of research design. This design is known as the experimental design. Qualitative research, which allows us to look at what actually goes on in classrooms and schools, and with children, as well as at the process of how education is working, is not deemed acceptable. Neither are other designs of quantitative research, which might examine a particular school or setting or situation without having a matched control group. This decision to allow only this type of research does not come from any consensus within the scientific or educational research community as to what counts as research(1). It is a political decision by the current federal administration. This policy has important implications for our schools. One implication is that it highly influences what research gets done. It does so directly, through the fact that the government sponsors research. Federal dollars will sponsor only research that fits the administration’s definition. It affects schools secondarily through what research they cite and use for their policy decisions. Researchers who want their research to influence these policies are likely to adhere to those protocols. Third, outside researchers and universities may decide to fund only research that follows that research paradigm, again restricting what research gets done. It also affects in some cases what practices schools may use, as the federal government insists that it fund only “scientifically proven” methods. Federal monies for curriculum and instruction are therefore funneled to areas that are supported by this one particular type of research.

I raise the above issue of what counts as research to make a point about educational research in general. This point is about the limitations of much of the research that is done and has been done in education even before the current policies. The above policies will only exacerbate the ones I will address below.

Two difficulties that I will address here in regards to interpreting educational research are, one: what was used to measure the effects; and two: over what period were the effects measured.

Most educational research uses standardized tests to measure the success or failure of a particular program, or method or other variable of interest(2). However, the validity and reliability of these tests as actual measures of what they purport to measure is highly controversial(3). I will use the example of reading. I recently went to a talk about the research on learning to read. The presenter argued that the research showed that phonemic awareness was required to learn to read. However, the research cited actually showed that the explicit teaching of phonemic awareness and phonics helped students to score higher on tests of phonemes and phonics! This has been part of the trouble with the debate between whole language versus phonics and “phonemic awareness” advocates. Whole language theorists tend to use measures such as comprehension, reading for pleasure, and quantity of reading as their measures of success. Phonics and phonemic awareness advocates tend to use standardized tests that focus on phonics and phonemic awareness skills as their measures of success. How they define “reading” and how they measure reading end up predicting the outcomes they are looking for! According to Elaine Garan(4), a member of the National Reading Panel, the panel made this error in its recommendations—in limiting its analysis to studies using the experimental design, and focusing on experiments that looked at reading sub-skills, it biased its own conclusions.

Similar scenarios occur across many areas of educational research. It is not that research can say anything, but that one must be careful to examine how the researcher defined and measured success of the variable they claim to be examining. The reader and user of the research must be able to decide if they agree with the researcher’s definition, and whether the tool used to measure it is valid according to that definition.

The second problem is the short-term nature of most educational research. Most research is done over one school year or less. There is an assumption that if gains are shown, they will persist over time. However, much of what we know from experience and other research contradicts that assumption. I refer us here to the “Three Little Pigs” analogy. Let us say we decide to study what materials are best for building houses. We have three identical pigs, all building houses. We notice one is building his house from straw, another from sticks, and a third from bricks. First, what is our measure of success? It is going to be how far each pig has gotten in building his house. After day one, we look to see how far each has gotten, and we notice the pig who is building his house from straw is already done. The one using sticks has his walls mostly up. The pig building with bricks is just getting his foundation done. We conclude straw must be the best material for house building, and mandate straw, based on research!

As most of us are aware, we often forget what we learned in a class or course soon after the class is over, or even during the class, right after the test! Short-term gains often do not correlate with long-term gains. Sometimes it is just due to lack of use: the knowledge or skills learned are not used again, and therefore we don’t remember them. Sometimes it may be that a strong foundation was not built, and so, like the straw house, our understanding collapses easily when it needs to support more complex use or understanding. Researchers Wayne Thomas and Virginia Collier(5) have shown evidence of this particularly in language learning, where English-only methods show slight gains in early language learning for English language learners, but students in bilingual classes overtake them in later years, due to, according to language theory, a stronger foundation in their primary language. Research on developmental versus skills-based approaches to early childhood education has shown similar patterns. Early academic advantages for skills-based approaches are lost over the years to longer-term advantages for the developmental approaches(6).

A main reason for this problem is summed up in an old joke I will repeat here:

 It is late evening and a woman sees a man on the street by a lamppost who looks like he is searching for something. She asks him if she can help.

He responds, “Yes, I dropped my keys.” Together they continue to look for a while. Finally, as they are having no luck, she asks him if he can remember when and where he last had them, so they might narrow their search.

He tells her, “Oh, yes,” points across the street, and says, “I dropped them over there somewhere.”

“Then why are we looking here?”

“The light is better.”

Short-term designs and standardized test measurement are the lamppost. It is very difficult to carry out long-term research. It is expensive, so funding is difficult. The researcher must commit to the long haul and may need a team who can also commit this time. The “subjects” are hard to keep track of as years go by. And the variables get more complex as time passes. By contrast, at the end of a school term, or of our test of the method, we can be fairly sure that the large majority of our subjects will be right there in the same place for us to administer our tests.

Standardized tests are given to virtually all students, can easily be compared across students, classes, schools, even districts or possibly states. Even if the standardized tests are particular to the study, they tend to be quicker, easier and less expensive to administer than other measures. They are also easier to run statistical analyses on.

However, what good does it do for me to know that “such and such” a reading series or teaching method led to higher test scores for these second graders, if there is no evidence that these higher test scores actually lead to an adult who reads, understands what they read, and knows how to use what they read to better their life and their society?

As they say, “Garbage in, garbage out.” All of the advantages of time and money and statistical reliability do not matter if they will not really answer the questions we want answers to. If what I want to know is whether what I am studying will lead to a better educated citizen, then I had better make sure that the tools I use to measure that really do measure it.

Now I come back to my original discussion of what counts as research by the government. The federal government defines research only as the experimental design. This design lends itself well to short-term research using quantifiable scores, such as those of standardized tests. The second issue, what counts as evidence, is also more restricted. It is especially difficult to get long-term research to fit the experimental design, as following exactly matched groups over years becomes more and more difficult as time passes. Many questions cannot be studied using matched samples, as in many instances it would be unethical to randomly assign students to different groups. Should we randomly retain some students and not others to see the effects of this policy? In other cases, it is impossible. For instance, we cannot clone a school or district and recreate the exact same situation if we want to understand policy or curriculum decisions made on that scale. What makes for an educated and successful citizen is not always easily quantifiable, and definitions vary. Therefore, the narrow type of research the government allows also restricts what types of questions even get asked by the research.

It is my contention that although the experimental design in research is commendable and valuable where practical, it can never be the only model of research to answer the complex questions about human learning and behavior. To answer such questions we must use the broader definition of research that virtually all scientific disciplines understand.

If we want to answer important questions in education, we are going to have to find a way to fund long-term research, and use more complex measures of success that are more closely aligned with the actual skills and knowledge that successful members of society need and use.


1. Debra Viadero, “AERA Stresses Value of Alternatives to ‘Gold Standard’,” Education Week, April 18, 2007.

2. Deborah W Meier, “Needed: Thoughtful Research for Thoughtful Schools,” in Issues in Education Research, ed. Ellen Condliffe Lagemann and Lee Shulman (San Francisco: Jossey-Bass, 1999).

3. Alfie Kohn, The Case against Standardized Testing: Raising the Scores, Ruining the Schools (Portsmouth, NH: Heinemann, 2000), Deborah W Meier, In Schools We Trust: Creating Communities of Learning in an Era of Testing and Standardization (Boston: Beacon Press, 2002), Susan Ohanian, One Size Fits Few: The Folly of Educational Standards (Portsmouth, NH: Heinemann, 1999).

4. Elaine M. Garan, “What Does the Report of the National Reading Panel Really Tell Us About Teaching Phonics,” Language Arts 79, no. 1 (2001).

5. Wayne Thomas and Virginia Collier, “School Effectiveness for Language Minority Students,” (Washington, DC: National Clearinghouse for Bilingual Education, 1997), Wayne Thomas and Virginia Collier, “A National Study of School Effectiveness for Language Minority Students’ Long-Term Academic Achievement: Executive Summary,” (Washington, DC: Center for Research on Education, Diversity & Excellence, 2002).

6. Rebecca A. Marcon, “Moving up the Grades: Relationship between Preschool Model and Later School Success,” Early Childhood Research & Practice 4, no. 1 (2002), Jeanne E. Montie, Zongping Xiang, and Lawrence J. Schweinhart, “Preschool Experience in 10 Countries: Cognitive and Language Performance at Age 7,” Early Childhood Research Quarterly 21, no. 3 (2006): 313-31.

School Reform: Where is the Evidence?

Under the Bush administration, the rhetoric is that the decisions we make in schools should be based on “scientific” evidence. Not only must it be “scientific,” according to the administration, but it must be based on the controlled experimental design, which is actually just one acceptable form of evidence within the scientific paradigm. No actual scientific field relies exclusively on this one form. However, putting that aside, even accepting a broader range of scientific evidence, the basic tenets mandated by the No Child Left Behind Act (NCLB) are not based on any empirical evidence, controlled experiment or not(1), and many, as I will outline, are in contradiction to accepted educational and organizational theory.

What are some of these mandates that I am referring to? High-stakes testing, external tutoring programs, and state takeover or charter school reform for “failing schools” are the ones I will discuss in this column.

High-stakes testing has been around for a long time, and each time it is used it tends to show gains in test scores in the early years which quickly flatten out. Long-term educational improvement of any sort has never been demonstrated. NCLB is different in that the stakes are considerably higher than in previous reforms, so many argue that previous evidence(2) is not valid. However, the best that that leaves us with is an untested experiment on a massive scale, affecting nearly every public school child in the nation. I won’t even discuss here the massive amounts of money going to the corporations that make these tests. They get money for developing the tests, then for selling the tests to the schools, and then for scoring the tests. Then they develop curriculum to help students prepare for these very tests that they design, so schools can boost their test scores.

Another major feature of NCLB is that schools that do not reach the required test score goals must offer children tutoring that is done by an outside agency. The theory is that if the school failed the children, it is obviously not qualified to help these children. There is some logic to that theory. However, again, there is no evidence that outside agencies, as a generic category, are better equipped to help failing students than the public schools themselves(3). The administration did not first pilot this approach in some places and test it against in-house support to demonstrate that it was more effective. Therefore, this mandate is another massive untested experiment, moving enormous federal dollars from the public to the private sector.

If schools continue to fail to reach mandated test score goals (with rising, moving targets: every year a larger percentage of students is required to “pass” the test), then they can be taken over by the state or turned over to private charter agencies. What is the record on this? School districts have been taken over by city or state governments in the past. In California, Compton and, more recently, Oakland have been the targets of such takeovers. In neither case have there been any significant changes in the education students receive. State governments, not surprisingly, have shown no more capability for creating positive educational changes than the local bodies they replaced. In fact, one would be hard pressed to find a theory to support why one would expect them to(4).

Charter schools, which began as an experiment in the early 1990s and quickly spread across states and cities nationwide, were based on a theory that more freedom from state regulations, together with forcing local public schools to compete for students, would create educational innovations and improvements. This is based on market theory. It is a reasonable theory, especially in a country whose economy is based on such a theory. In fact, many charter schools are exciting places, with innovative pedagogy showing successful results. However, after extensive research, charter schools as a class have shown no higher test scores than their public school counterparts(5).

Another possibility in some states is the use of vouchers to send children to private schools. However, again, if you hold demographic variables constant, even private schools show no better results on standardized test scores than do public schools(6). If we are supposedly doing this reform in the name of accountability, private schools have no accountability either to state governments or to their local constituencies. There are no public school boards or open meetings laws required of private schools, nor are their financial records open to public or government scrutiny. Once more, this aspect of NCLB is based on a theory which current evidence does not support. In most states private schools do not have to administer the same standardized tests that NCLB holds public schools accountable to. While public schools, which are answerable to the public directly, are not trusted without such tests, for some reason private schools do not need to demonstrate any such accountability.

Is there evidence for other ideas? There is something that schools that have made a significant and dramatic difference for students have in common: local control. Some of the most effective schools are those where the people closest to the kids (the teachers, parents, and community) are actively involved in deciding the mission and curriculum of the school. It appears to matter less what that curriculum and vision is than that it was made by those closest to the kids. Virtually all of the reforms being called for at the state and national level are based on a profound mistrust of those very people. Yet it should be obvious that when people feel coerced they are less likely to work efficiently. When people feel empowered, they are most effective. The evidence bears this out. Find a school that has significantly beaten the odds with low-income and minority students, and I’ll bet it did not happen based on external mandates! Progressive examples such as the Central Park East schools in New York, as well as models based on more conservative ideas, such as the KIPP academy and Core Knowledge, demonstrate this. Not only that, but local control honors our democratic ideals. Democracy is based on the absurd idea that all citizens are capable of making the important decisions in the public sphere and should do so on an equal basis. While it is absurd, no one yet has devised a better alternative.

In terms of a particular approach to learning, there have been a number of longitudinal studies showing the success of progressive and developmental approaches to teaching and learning. These are forms of teaching and learning that are the opposite of the scripted teacher-centered approaches mandated in schools that fail to meet the standardized test score targets required under NCLB. The famous Eight-Year Study, started in the 1930s, which followed students from their freshman year in high school to four years after graduation, found that those in the progressive schools did better on all significant measures, both in high school and in college, than their matched counterparts(7). The more extensive the reforms, the more impressive the results. Despite these dramatic findings, the public mood had shifted away from such innovations, and the results were ignored after they were published. Another more recent example is the Central Park East schools (both elementary and secondary schools) in New York City, and their resounding success in working with poor minority students in East Harlem, with 80 to 90% of the graduates getting into and being successful in four-year colleges. Yet these schools are under constant pressure to discontinue their innovative approaches(8). A couple of recent studies of preschool practices, comparing developmental child-centered approaches against academic skills-based approaches, have shown better academic and social outcomes in later elementary grades for those in the child-centered developmental programs. One of these was done in Florida(9), and the other was an international study covering over 5,000 students in 1,800 preschool settings in 10 different countries(10).

The reforms of NCLB are based on a premise that those closest to the children should not be trusted to make the important decisions about their education. The teachers should not be trusted to make the important decisions about how to teach the children, and the parents should not be trusted to govern the schools locally. It is based on a theory that unless coerced, these parties will not act in the best interest of their own children. It is based on a theory that unless coerced, children are not interested in learning. This is in direct contradiction to the basis of democracy. Democracy is based on the theory that no one is better positioned or has more of a right to make decisions over their own lives than those most directly affected.

As children spend twelve or more years incarcerated in these institutions called schools, which are becoming more and more anti-democratic, our children are losing the one public place where they might learn what it means to be citizens in a democracy, where they might experience democracy in practice.

If you believe, as I do, that NCLB is counter to the educational needs of our children and the democratic needs of our society, at a minimum let your state and federal representatives know, as NCLB is up for reauthorization very soon. Unless they hear otherwise, these legislators will take the politically safe course and not make any significant changes. If you would like to be more involved, see my links page for some organizations that are working actively on this issue, such as The Forum for Education and Democracy, and FairTest.


1. Gerald W. Bracey, “Things Fall Apart: NCLB Self-Destructs,” Phi Delta Kappan, February 2007.

2. A.L. Amrein and David C. Berliner, “High-Stakes Testing, Uncertainty, and Student Learning,” Education Policy Analysis Archives 10, no. 18 (2002).

3. Bracey, “Things Fall Apart: NCLB Self-Destructs.”

4. Kenneth K. Wong and Francis X. Shen, “Measuring the Effectiveness of City and State Takeover as a School Reform Strategy,” Peabody Journal of Education 78, no. 4 (2003).

5. Katrina Bulkley and Jennifer Fisler, “A Decade of Charter Schools: From Theory to Practice,” Educational Policy 17, no. 3 (2003).

6. Sarah Theule Lubienski and Christopher Lubienski, “A New Look at Public and Private Schools: Student Background and Mathematics Achievement,” Phi Delta Kappan, May 2005.

7. Wilford M. Aikin, The Story of the Eight-Year Study (New York: Harper and Row, 1942).

8. David Bensman, Central Park East and Its Graduates: Learning by Heart (New York: Teachers College Press, 2000).

9. Rebecca A. Marcon, “Moving up the Grades: Relationship between Preschool Model and Later School Success,” Early Childhood Research & Practice 4, no. 1 (2002).

10. Jeanne E. Montie, Zongping Xiang, and Lawrence J. Schweinhart, “Preschool Experience in 10 Countries: Cognitive and Language Performance at Age 7,” Early Childhood Research Quarterly 21, no. 3 (2006).