Prologue

On December 7, 2007, Jean Stockard wrote the first of a series of communications to What Works Clearinghouse about its unreasonable evaluation of Reading Mastery. WWC had asserted that Reading Mastery had no studies that documented its effectiveness as a "beginning reading" program. Jean listed several Reading Mastery studies that should be used as evidential support. She also pointed out some serious problems with WWC's acceptance of studies that purportedly provided evidence of effectiveness for Reading Recovery.

On June 25, 2008, Jean sent a more detailed critique to WWC and Mathematica, the organization that operates WWC under a federal contract. This communication was followed by what is best described as hemming and hawing. We became cautiously optimistic when we received a letter from Mathematica indicating that it was very concerned with the allegations Jean had made and was providing a careful appraisal of WWC's practices. Jean didn't receive a response that addressed the various issues she had raised until September 8.

The response consisted of a detailed account from WWC and a letter from Mathematica stating that it found no irregularities in any WWC practice or judgment. The response from WWC appears as Appendix A of "Machinations of What Works Clearinghouse." The justifications toggle between sophomoric and casuistic. Some are contradictory.

The response that I wrote was not addressed to WWC or Mathematica. It deals with the issues in the Sept. 8 letter from a logical but less empirical standpoint than Jean's analysis provides. The basic argument in my paper is that what WWC did could not have occurred by chance and was not the product of applying rigorous standards. Instead the outcome was the product of rubber standards, logically unsound arguments, and possibly confusion.

I consider WWC a very dangerous organization. It is not fulfilling its role of providing the field with honest information about what works, but rather seems bent on finding evidence for programs it would like to believe are effective (like Reading Recovery and Everyday Mathematics).

Machinations of What Works Clearinghouse
by Siegfried Engelmann

The following critique of What Works Clearinghouse (WWC) is based largely on a letter sent to Jean Stockard from Mathematica on September 8, 2008. The letter appears as Appendix A.

The conclusion of this critique is that What Works Clearinghouse is so irreparably biased that it would have to be thoroughly reoriented and reorganized under different management rules to perform the function of providing reliable, accurate information about what works.

This conclusion derives from two facts:

1. There are over 90 studies that examine the effectiveness of Reading Mastery (and its predecessor, DISTAR Reading). Most of these studies have appeared in refereed journals.

2. The WWC has concluded that "No studies of Reading Mastery that fell within the scope of the Beginning Reading review meet WWC evidence standards" (http://ies.ed.gov/ncee/wwc/reports/beginning_reading/rdgmastery/).

In other words, there is complete discontinuity between two groups that are in the domain of evaluating whether studies document effectiveness. The Reading Mastery group is composed of over 150 professional researchers who conducted the studies and at least the same number of reviewers for refereed journals who judged that the studies provide evidence of effectiveness for Reading Mastery.

This group also includes authors of several meta-analyses that summarized studies and those who reported the extensive research base of Reading Mastery, such as the American Institutes for Research. The second group consists of those who judge effectiveness of studies for WWC.

The discontinuity in judgments between these two groups provides prima facie evidence that WWC reviewers use procedures, standards, and evaluation criteria that are not in agreement with criteria used by any of the professionals who judged that Reading Mastery studies provide evidence of effectiveness. If WWC had found and reported on all the studies, the likelihood that the rejection of all would have happened by chance is 1 in 1,237,940,000,000,000,000,000,000,000 trials.

In fact, however, WWC used a questionable ploy to reduce the number of studies it deemed to be reviewable, and it failed to locate a fairly large percentage of remaining studies. The ploy was to disallow any studies that were reported earlier than 1985. The ploy eliminated at least 38 of the studies, dropping the total of reviewable studies to 54. If all of these studies had been reviewed by WWC, the odds of rejecting all of them by chance would be 1 in 18,014,398,000,000,000 trials.

WWC reported that it located 61 of the studies, but all but 15 of these were not really studies worthy of review. (They were "success stories" and other types of anecdotal material.) So WWC located only about 27 percent of the reviewable studies. The probability that the rejection of all 15 legitimate studies was a chance occurrence is 1 in 32,768, but this number, coupled with the rationales that WWC used to reject these studies, leaves little doubt that the stripping of Reading Mastery's evidence base was the result of intent (possibly tainted by ineptitude).

Search Procedures
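The three odds quoted above (for 90, 54, and 15 studies) are consistent with a simple model in which each study is reviewed independently and has an even chance of being rejected, so that rejecting all n studies by chance has probability (1/2)^n. That model is an inference from the figures, not something the text states. The following is a minimal Python sketch under that assumption; the function name odds_all_rejected and the parameter p_reject are hypothetical and introduced only for illustration.

```python
from fractions import Fraction


def odds_all_rejected(n_studies, p_reject=Fraction(1, 2)):
    """Return N such that the chance of rejecting every study is 1 in N.

    Assumption (not stated in the text): reviews are independent and each
    study has the same rejection probability, p_reject.
    """
    return 1 / (p_reject ** n_studies)


# Study counts taken from the text: all studies, post-1985 studies,
# and the legitimate studies WWC located.
for n in (90, 54, 15):
    print(f"{n} studies: 1 in {int(odds_all_rejected(n)):,}")

# Share of the 54 post-1985 studies that WWC located (15 of 54).
coverage = Fraction(15, 54)
print(f"located: {float(coverage):.1%}")
```

Under the even-chance assumption the exact values are 2^90, 2^54, and 2^15, which agree with the rounded figures quoted in the text. Choosing a different value for p_reject would change all three odds, so the size of the numbers depends heavily on that assumption.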

How could a serious search of the literature reveal only 27 percent of the studies? The WWC protocol for Beginning Reading (WWC Evidence Review Protocol for Beginning Reading Interventions, http://ies.ed.gov/ncee/WWC/PDF/BR_protocol.pdf) provides an elaborate description of its search procedures (pp. 11–16). It lists sources such as the ERIC thesaurus, the PsycINFO thesaurus, and dissertation abstracts. It has a list of 29 "hand searched" journals and a list of gray-area sources, including associations, such as the American Educational Research Association. One of the gray areas searched was "prior reviews and research syntheses (i.e., using the reference lists of prior reviews and research syntheses to make sure we have not omitted key studies)" (p. 16). Possibly the discrepancy between the number of studies conducted and those found by WWC hinges on the WWC's definition of "key studies." Using this reference-list search technique, Jean Stockard identified 54 studies that occurred no earlier than 1985, and 38 earlier studies. (See appendices.) At best, there seems to be a striking contrast between what the Beginning Reading protocol indicates WWC does and the performance results of the search for studies involving Reading Mastery.

The Analysis of Legitimate Studies WWC Found

In addition to the procedural inadequacy, there is the discrepancy between the judgments of WWC and those of the authors and reviewers of the 15 studies that WWC found. Can the discrepancy in judgment be explained as a conspiracy, or is it the effect of scrupulously applying WWC's rigorous standards? The outcome is not the product of rigor. Some of the rejected studies have raw scores that show huge outcome differences between matched controls and the experimental group. These studies adhere to basic experimental-design-and-reporting procedures that have been in place since long before 1985. This paper discusses one of the studies in some detail.

The numbers and the discontinuity between WWC's judgment and those of others who evaluated the studies favorably strongly suggest that WWC intentionally applied selection criteria that were specifically designed to reject Reading Mastery studies.

Distortion Techniques

WWC describes its criteria for accepting studies in its protocol for Beginning Reading. Although this protocol does not have total concordance with reasonable scientific standards, rigorous application of its standards would result in at least some of the rejected studies being accepted. Therefore, the ultimate cause of at least some rejections has to be that WWC created distortions where they were necessary to achieve rejection of specific studies. Following is a list of "distortion techniques" that were used by WWC.

1. The use of various standards and criteria that are not commonly recognized by the scientific community.

2. The use of justifications that are largely argumentative (based on correlations, not data about causations) and that have limited or no empirical data to support the position argued.

3. The use of floating standards so that experiments with similar design and results could be viewed variously as evidential support for an approach or as lack of evidence. A subtype of this category would be a specific cut point (for example, 1985) that could be ignored according to the "discretion" of the project manager or person who decides whether a given study is a nay or a yea.

WWC's letter of September 8 provides evidence of the three techniques.

Cut Date of 1985

An example of 1, 2, and 3 (uncommon standards, argumentative justifications, and floating standards) is the WWC limitation that no studies reported earlier than 1985 are accepted unless the WWC principal investigator deems the study important enough to report. This criterion uniquely affects Reading Mastery, which had at least 38 studies that had been generally recognized as providing evidence of effectiveness (Appendix B). No other extant program had more than one or two studies. So given that this cut date affects only one model, but affects it in a serious way, why was the date established in the first place?

This kind of cut date has no precedent in science. Studies are recognized by their quality and the extent to which the conclusions drawn are consistent with current evidence. Since 1985, at least 54 studies have examined the effectiveness of Reading Mastery (Appendix C). These outcomes are perfectly consistent with the outcomes of the 38 earlier studies. Therefore, there is no scientific basis for applying the cut date of 1985 to Reading Mastery.

The justification that WWC provides for this cut date in its letter of September 8 is strictly correlational and is presented as modal conditionals (this may happen and that may happen), but it provides not one bit of evidence about whether the premise WWC espouses is based in fact.

. . . the fact that preschool enrollment has increased, combined with the fact that more preschool and kindergarten programs run full-day, means that students in the early grades may be better prepared to receive reading instruction today than students 25 years ago. Moreover, it is possible that any changes in reading readiness over this period may not have been evenly distributed, since differences in reading ability by socioeconomic status and race are apparent at the kindergarten level . . . Any of these changes could have implications for the effectiveness of an intervention. If school readiness has increased, then an intervention that was effective 25 years ago may not be effective in more recent years. (p. 2, Appendix A)

This type of argument is categorized in logic as an argument from ignorance. Its basic form is: We don't know if the true condition is A or B. Therefore, we conclude that the true condition is A.

In this case, we don't know how many of these postulated possible causal relations are true; therefore, we'll assume that all of them are true (or could be true). In contrast, the logical conclusion to this situation would be either to state "We don't know so let's not change it," or "Let's do some pointed research to obtain information about these counterfactual conditions."

The recitation of "possibilities" provides evidence of the instructional naiveté of the author. The assertion that the children are better prepared now and therefore what was effective 25 years ago might not be effective now is logically impossible. Lower performers make all the mistakes that higher performers make. They make additional mistakes that higher performers don't make, and their mistakes are more persistent and more difficult to correct.

Therefore, if the program is easier for them now because of their higher degree of undefined "readiness," they will make fewer mistakes and progress through the program sequence faster.

The justification WWC provides for the cut date in its letter of September 8 is specious:

. . . [The date] is used for two reasons. First, by limiting the reviews to research to this time period, WWC reviews reflect reasonably current research. . . . Second, the timeframe ensures that the research reviewed is examining versions of interventions that are most likely to be available to practitioners today. (p. 2, Appendix A)

Unless there is current data to show that Reading Mastery was effective but is not effective now, there would be no need to remove its strong data base established before 1985. From an argumentative perspective, consider the difficulties that would have been created in other fields if the history of what works was erased every 25 years or so and had to be reestablished.

Nature of Beginning Reading

WWC's cut date is also highly insensitive to the nature of beginning reading. Unlike history, paleontology, biology, and other areas that are subject to change as the world and knowledge of the world change, beginning reading for grades K–3 is stable because nothing of significance has changed in the last 40 years. The instructional goal is the same: to teach children strategies and information that would permit them to read material that could be easily covered with a vocabulary of 4,000 words. The frequency of these words has not changed. The syntax of the language has not changed significantly. For these reasons, the content of the first four levels of Reading Mastery has not changed over the years. If the 1972 edition of DISTAR were used with the training that is used today, the results would logically have to correlate .9 or better with the performance of children who went through the current edition of Reading Mastery. Amazon.com provides evidence of this strong correlation.
The program Teach Your Child to Read in 100 Easy Lessons is an abridged version of Reading Mastery, designed for parents and based on the 1972 edition of Reading Mastery. It has more than 450 reviews by parents. Two points about the reviews are important:

1. No other beginning reading program has anything approximating the number of positive reviews that Teach Your Child has.

2. There is no tendency that documents a greater percentage of negative reports in more current years.

The program continues to have an average sales ranking of around 400th of all books sold by Amazon.com and more than 90% of the reviews rate the program with the highest positive rating. The book is ranked #1 in Family Activities; #3 in School-Age Children; and #8 in Education.

Project Follow Through

A floating standard associated with this cut date is expressed in the September 8 letter.

WWC principal investigators have the option to expand the period for which studies can be reviewed, if they believe that important research will be excluded. (p. 2, Appendix A)

The principal investigator did not reinstate any Reading Mastery studies even though one of these studies, Project Follow Through, was the largest educational experiment ever conducted, involving 200,000 children, 22 models of instruction (many of which are around today), and 180 communities that spanned the full demographic range of at-risk students in grades K through 3.

WWC presents a great deal of rhetoric about causal validity. A strong argument can be presented that Follow Through's procedures and design achieved more causal validity than any other effectiveness study of what works ever conducted. As pharmaceutical effectiveness trials show, one of the greatest threats to internal validity is whether subjects take the medication on schedule and whether they provide accurate reports. The same problem occurs in education. Few effectiveness studies provide for reasonable monitoring; however, Follow Through had two levels of monitoring to assure that the participating sites implemented the adopted model according to the sponsors' specification. Most of the studies that WWC endorses as showing evidence of effectiveness have no provision for monitoring classrooms to determine the extent to which reports are accurate.

Other details of the Follow Through design and evaluation are as sophisticated as those of current studies. For each Follow Through model there were two comparison groups, one local and one national. The outcome differences of models were measured by units of "educational significance," which were defined as outcomes that were at least

Tuesday, October 14th, 2008

Siegfried (Zig) Engelmann
