JMDE

Journal of MultiDisciplinary Evaluation

Number 3, October 2005

ISSN 1556-8180

 

Editors

E. Jane Davidson & Michael Scriven

 

Associate Editors

Chris L. S. Coryn & Daniela C. Schröter

 

Assistant Editors

Thomaz Chianca

Nadini Persaud

Lori Wingate

Ryo Sasaki

Brandon W. Youker

 

Webmaster

Joe Fee

 

Mission

The news and thinking

of the profession and discipline of evaluation

in the world, for the world

 

A peer-reviewed journal published in association with

 The Interdisciplinary Doctoral Program in Evaluation

The Evaluation Center, Western Michigan University

 

Editorial Board

Katrina Bledsoe

Shawn Kana'iaupuni

Nicole Bowman

Ana Carolina Letichevsky

Robert Brinkerhoff

Mel Mark

Tina Christie

Masafumi Nagao

J. Bradley Cousins

Michael Quinn Patton

Lois-Ellen Datta

Patricia Rogers

Stewart Donaldson

Nick Smith

Gene Glass

Robert Stake

Richard Hake

James Stronge

John Hattie

Dan Stufflebeam

Rodney Hopson

Helen Timperley

Iraj Imam

Bob Williams

 


Table of Contents

PART I

In This Issue

Michael Scriven

Editorial

Michael Scriven

Articles

Using Test Standard-Setting Methods in Educational Program Evaluation: Addressing the Issue of How Good is Good Enough

Paul R. Brandon

The Value of Evaluation Standards: A Comparative Assessment

Robert Picciotto

The 2004 Claremont Debate: Lipsey vs. Scriven

Stewart I. Donaldson and Christina A. Christie

Evaluation Capacity Building and Humanitarian Organization

Ridde Valéry and Sahibullah Shakir

Ideas to Consider

Ethnography and Evaluation: Their Relationship and Three Anthropological Models of Evaluation

Brandon W. Youker

Book Reviews

Revisiting Realistic Evaluation

Chris L. S. Coryn

 

 


 

In This Issue

Michael Scriven

This is a particularly interesting issue, which is just as well since it’s also our longest to date—over 220 pages, and I doubt you can find a way to shorten it without a hundred readers feeling seriously deprived!

Remember that you can arrange to be notified when a new issue comes out by registering at our website (http://evaluation.wmich.edu/jmde/subscribe.html); the next issue will be out in a month or so, with some heavy coverage of the ‘causal wars’. And we are now officially registered with an ISSN number—can’t be done without two issues on record—so that we’re in the scientific journal databases, which gives us more status in scholarly circles. In popular circles, we have over 11,000 hits on the two issues that came out before this one, which suggests (but does not prove) that more people look at our pages (perhaps briefly) than all other evaluation journals put together. Keep that in mind as you’re thinking about where to publish!

As usual, we continue our coverage of the international evaluation world, with no fewer than two reports on evaluation in China, a very interesting one on evaluation in Japan, a new correspondent writing about the scene in Germany, and one on New Zealand (where my co-editor runs a consulting business), plus an update on Canada. Our coverage of journals and events of note includes a report on the First International Congress on Qualitative Inquiry, which almost burst the seams at the University of Illinois last (northern) spring; and a complete list of all international associations from the International Organization for Cooperation in Evaluation, about which we expect to have an article in the next issue.

The major articles are by major authors: the architect of evaluation at the World Bank, Robert Picciotto, writes on “The Value of Evaluation Standards”; Paul Brandon, the standards guru, addresses the great problem of high-stakes testing—how do you set the lines between the grades—and there’s a study of evaluation capacity-building in Afghanistan by two who did it there. That paper illustrates our policy of ‘naturalistic editing’—editing that leaves the flavor of the writing intact, at some cost to the grammar of Standard English—and the description of conditions in Afghanistan will bring tears to many eyes.

A serious paper on ethnography for evaluation by Brandon Youker looks at three anthropological models of evaluation, and Chris Coryn, one of our associate editors who did more than anyone to pull this issue together, reviews Realistic Evaluation. The latest issues of the major journals are also reported on by our best reporters.

Next issue we switch over to the Canadian software for online free journals, a very nice package paid for by the Canadian government, to whom our thanks. It will improve our operations considerably. And don’t forget: this is an evaluation journal, run by evaluators, so we like to hear criticism. Tell us how to improve!

 


Editorial

The Evaluation of Disasters

Michael Scriven

In the last few years, we have seen some mighty catastrophes on the face of the earth, some wrought by human hands directly and others from great natural disasters. Of the latter, the losses from the great tsunami of the Indian Ocean make the others look minor, but to many communities they were a whole world lost. These included huge earthquakes, floods, and wildfires worldwide, and in the U.S. most recently, the hurricanes Katrina and Rita. Where humans were the direct causes, the acts of warmongers and terrorists alike, not too easily distinguished in their impact on the innocent, have altered not just cities but countries forever, and for the worse, usually in the name of improvement. And lurking in the wings are worse possibilities still, widely thought by experts to be inevitable: for example, new epidemics, perhaps as bird flu crosses the species boundary en masse, and mimics or surpasses previous flu epidemics that have killed millions before, perhaps tens or hundreds of millions next time around (because the fast transportation of people, foodstuffs, and other goods makes us all neighbors). We are all well aware that global warming, meteor impacts, and black market hydrogen bombs pose great risks of even greater disaster. We must ask, what has evaluation contributed to helping humankind cope with these events, and what could it contribute that it has not so far provided?

It’s clear that these events pose new challenges for most evaluators, since the usual work of the program evaluator covers only parts of great disasters. We know how to evaluate the relief programs, the health services, the educational makeshift arrangements. But evaluation of the conditions that led to, or exacerbated the impact of these events; evaluation of the developments from them that are aimed to reduce the impact of their inevitable successors: these are a different kind of beast. These call for multidisciplinary effort of considerable novelty, and this journal will try to serve its mission of keeping its readers abreast of efforts to develop good methods and tools for doing this kind of evaluation. Meanwhile, there are a few interesting developments that may inspire us to develop improved models for this new task. Perhaps the time has come to develop what might be called the Failure Case Method?

To take one example of developments that are possibly relevant to disaster evaluation, many of us feel that one of the most interesting emerging trends in evaluation in recent years has been the emphasis on a systems approach, and surely that is one emphasis that disaster evaluation requires when we start looking evaluatively at the precursor conditions in preparedness studies. Relatedly, one must view epidemiology, a fast-developing science in its own right, as a model worth considering for its focus on finding and fixing causes of trouble, past and future. The same is true of ecobiology, another of the recent additions to the scientific Pantheon. Television has made us increasingly aware of a third player that values the systems approach: forensic pathology, portrayed on the tube as a science far more sophisticated than its actual embodiment in real labs, where DNA matching still takes a matter of weeks, not hours. And engineering has contributed a similar discipline in the form of the applied research behind the accident investigations of the National Transportation Safety Board. In all of these cases, as with natural disasters and terrorist strikes, one great methodological lesson stands out: they are all primarily cause-hunting sciences, and none of them has ever felt unable to go to work even though they’ve never seen a randomized controlled experiment. So, to pick up a theme that recurs briefly in this issue, there are some important issues in evaluation methodology where we may be able to learn something from a study of the existing disaster-hunting and disaster-prevention disciplines. Our nearest approach to date, and a worthy one it is, though low-profile so far, is evaluation of peace-maintenance efforts, with a small appearance at AEA last year.

But perhaps the most important element in disaster evaluation that is familiar to most evaluators is the ‘blame game,’ the search for responsibility. It’s an integral part of aircraft and rail crash investigations, and it poses no insuperable barrier to reliable conclusions there, or in its courts. We must take it in our stride, though of course it helps to arm oneself with the basic tools of ethical and legal analysis. For the bottom line in all of this is simple enough: a good proportion of the disastrous events themselves, and a larger proportion of their terrible consequences, are avoidable by human action. If we take on disaster evaluation and don’t step up to do the ethical analysis, and do it rigorously, the job won’t be completely done. Evaluators need to grow into this new aspect of a new task as they have so often grown before. It may be the greatest challenge we’ll ever face.    


Articles

Using Test Standard-Setting Methods in Educational Program Evaluation: Addressing the Issue of How Good is Good Enough

Paul R. Brandon

School districts in the United States and elsewhere commonly use standard setting to assign value to student test and assessment scores. That is, they set standards to show “how good is good enough.” This paper presents a summary of the empirical findings on the most widely-studied test standard-setting method and describes what the conclusions of the summary suggest about the use of test standard-setting in educational program evaluations.

The purpose of setting test or assessment standards is to establish judgmentally the cutscores that show the dividing points between levels of student performance such as pass and fail, basic and proficient, proficient and advanced, and so forth. Cutscores are established with methods such as the modified Angoff method, the contrasting-groups method, the bookmark method, and several others (Cizek, 2001). As part of student and school accountability efforts, districts report to students the performance levels at which their scores fall and report to policymakers and to the public the percentages of students achieving at the various performance levels. The U.S. No Child Left Behind Act has enshrined the use of cutscores, in that schools are required to identify and report student proficiency levels and to raise the performance of students who score below proficiency.

Cutscores are set either by making judgments about test items or about examinees’ performance on tests or assessments. Methods for making judgments about test items are known as test-centered methods, and methods for making judgments about examinee performance are known as examinee-centered methods (Jaeger, 1989). The test-centered method that for years was the most frequently used and that remains the most widely studied method is the modified Angoff method (Angoff, 1971), and probably the most frequently studied examinee-centered method is the contrasting-groups method. In preparation for studying how and when to use test standard-setting methods in educational program evaluations, I conducted exhaustive reviews of the literature on these two methods (Brandon, 2002, 2004).

Before districts or states set cutscores, they first must develop performance standards. A performance standard is a statement defining and describing the knowledge or skills that students must show at a particular performance level; cutscores are the operationalized versions of performance standards. Sometimes policymakers specify performance standards, and sometimes the panels of judges that set cutscores develop them.

Under what conditions and for what purposes might it be appropriate to conduct standard setting in program evaluations? This topic has been discussed sketchily by some (e.g., Cook, Leviton, & Shadish, 1985; Rossi & Freeman, 1993; Shadish, Cook, & Leviton, 1991; Worthen, Sanders, & Fitzpatrick, 1997) and somewhat more thoroughly by a few others (e.g., Fink, Kosecoff, & Brook, 1986; Henry, McTaggart, & McMillan, 1992; Patton, 1997; Wholey, 1979). The inattention given to the topic is unfortunate, because the appropriateness of using standard-setting methods in program evaluation has not been thoroughly discussed, and the types of evaluation instances in which using cutscores would be helpful and appropriate have not been well-established.

This article examines the use of test standard setting in educational program evaluations. It begins with a recounting of the primary findings of my review of the literature on the modified Angoff method (Brandon, 2004). I focus on this method because it has been examined empirically more than any other method. However, despite the relative abundance of research on the method, the empirical literature does not provide strong support for the validity of modified Angoff cutscores. Therefore, in this article, I am cautious about applying the method in program evaluation. I argue that it is appropriate under certain testing conditions in formative evaluation studies or when conducting preliminary summative studies of program outcomes. Studies of these types require a lesser degree of validity than summative evaluations used by policymakers to make go/no-go program decisions. Based on the results of the literature review, I discuss flaws in the methods of modified Angoff studies. I then discuss

1. the types of decisions that might be made when interpreting evaluation results in light of cutscores and the strengths of the conclusions made based on test standard setting in evaluations,

2. the program evaluation scenarios in which it is appropriate to use cutscores for interpreting evaluation results, with a focus on the stage of evaluation and the types of evaluation designs, and

3. four criteria that evaluators should address when using cutscores to help interpret evaluation results.

This article is limited by my decision to base conclusions primarily on empirical findings about the modified Angoff research. Some evaluators might wish to know what standard-setting methods other than the modified Angoff method can be used in program evaluations. Psychometricians and researchers are continually developing new standard-setting methods (Cizek, 2001); many, such as the bookmark method, are proving promising, and evaluators might wish to learn from the research on them. However, the intent of this article is to base conclusions on empirical research, and little sound research has been conducted on methods other than the modified Angoff. For example, considerable attention has been paid to the contrasting-groups method, which for years probably was used more than any other examinee-centered approach, but little research has been conducted on it (Brandon, 2002). I base my conclusions solely on the research on the modified Angoff method because I have adopted a conservative approach to applying the standard-setting literature to program evaluation. I limit myself to the best research available; the body of modified-Angoff research may be less comprehensive than desirable, but it is broader and goes deeper than the research on other methods.

The article also is limited because it does not suggest how to apply standard setting methods for purposes other than test standard setting in program evaluation. Other than brief comments in the final paragraph of the article, I do not speculate about using the method for other purposes. Very little program evaluation research has been conducted on using standard-setting methods for purposes other than testing. (I have experimented in two evaluations with applying standard-setting methods to judging how well the evaluated programs were implemented, but the success of the efforts was mixed.) There was no research on test standard-setting methods when they were first put into wide use; I do not intend to repeat that scenario by making recommendations about using standard setting in program evaluation for purposes other than tests without an empirical basis for my suggestions. The place for extensive speculation about other uses of standard setting in program evaluation is elsewhere.

The Methodological Soundness of the Modified Angoff Method

To learn about the soundness of test standard-setting, it is useful to discuss the modified Angoff method, not only because it is an exemplar of one of the two primary types of test standard setting, but also because more empirical research has been conducted on it than any other standard-setting method. As this section shows, the evidence for the effectiveness and validity of the method is less convincing than desirable, the literature is narrow, and many of the studies of the standard-setting method are unsound or incomplete.

The modified Angoff method includes three primary steps. The method is called modified because some aspects of it were developed after Angoff (1971) first proposed it. The first step is to select and train judges. The second step is to define and describe the performance level that examinees must meet, that is, to establish the performance standard. Judges can conduct this step, but often policymakers or others provide judges with the performance standard. The third step is to make item estimates, that is, to establish estimates of the probabilities that examinees will correctly answer the items on the test or assessment at the level of the performance standard. Usually judges conduct two or three rounds of item estimation. Between rounds, the judges review empirical information such as the difficulty level of each item and have discussions about their item estimates; then, if they wish, they revise their estimates in the next round. After the three steps are conducted, the cutscore is calculated by summing the item estimates for each judge and averaging the sums across judges.
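The cutscore calculation described above can be illustrated with a minimal sketch. The function name `angoff_cutscore` and the ratings are hypothetical and invented for illustration; they are not data from any study cited here. The sketch assumes each judge supplies one probability estimate per item from the final round.

```python
from statistics import mean


def angoff_cutscore(item_estimates):
    """Return a cutscore from final-round item estimates.

    item_estimates holds one list per judge; each entry is that judge's
    estimated probability (0 to 1) that an examinee at the performance
    standard answers the item correctly.
    """
    if not item_estimates:
        raise ValueError("at least one judge is required")
    n_items = len(item_estimates[0])
    for row in item_estimates:
        if len(row) != n_items:
            raise ValueError("every judge must rate every item")
        if any(not 0.0 <= p <= 1.0 for p in row):
            raise ValueError("estimates must be probabilities in [0, 1]")
    # Sum the item estimates for each judge, then average the sums.
    judge_sums = [sum(row) for row in item_estimates]
    return mean(judge_sums)


# Hypothetical ratings: three judges, four items.
ratings = [
    [0.50, 0.75, 0.25, 1.00],
    [0.50, 0.50, 0.50, 0.50],
    [0.75, 0.75, 0.25, 0.25],
]
cutscore = angoff_cutscore(ratings)
print(f"cutscore on a 4-item test: {cutscore:.2f}")
```

The result is expressed in raw-score units, so it can be compared directly with examinees' number-correct scores on the same items.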

Researchers and practitioners have studied the modified Angoff method more than any other, but some of the findings on the steps are inconclusive:

Selecting and training judges. Some of the research on selecting and training judges provides conclusive findings, but other research does not. Studies suggest that the appropriate number of judges for modified Angoff studies is 10–20. The conclusions of the small number of empirical studies on this topic (Brandon, 2004) generally were within this range.

Selecting judges for their subject-matter expertise can enhance item estimation, but not all judges need have high levels of expertise. Research on this topic is inconclusive because some of the studies that I identified had methodological flaws and because other studies examined incomplete versions of modified Angoff standard setting.

Very little research has been conducted on training judges, and no results bear summarizing here.

Defining and describing the performance standard. The findings of a small body of studies support the conclusion that definitions and descriptions of performance standards should be made using a set of prescribed steps and that performance standards should be fully explicated. Research on the topic is inconclusive because about half of the studies on it were simulations of standard-setting that did not include or fully implement all the modified Angoff steps (Brandon, 2004).

Defining and describing performance standards is a difficult step to carry out fully and validly. Developing statements of performance standards for high school graduation tests requires judges to have a full understanding of the knowledge and skills that teenagers must have upon entering the workforce or post-secondary education, and developing performance standards for earlier school grades requires judges to estimate the level of students’ knowledge and skills necessary for success in the following grades. In both these standard-setting instances, judges must know what they are setting proficiency scores for. That is, they must understand the purpose of the standard setting and the context that students will be in when the students use the knowledge and skills that are addressed in the examination. “To say that adequacy must be defined for some purpose has important implications for validating passing scores as well as validating performance standards. This condition is much more stringent than requiring the passing score to be consistent with the description of performance standards” (Camilli, Cizek, & Lugg, 2001, p. 459). Understanding what scores are set for is not a trivial endeavor; indeed, some would say it is impossible: “Performance standards simply cannot help us decide whether Johnny or PS 19 or Colorado has enough reading skill, because there is no sensible answer to the question, ‘Enough reading skill for what?’ beyond the trivial level of ‘Enough reading skill to answer test question 36 correctly’” (Burton, 1978, p. 270).

There are no well-established developmental theories to guide methods for estimating what students’ necessary levels of performance should be upon graduation. What students need to know and be able to do depends upon the educational or vocational paths they will follow upon graduation. The proficiency level necessary for someone to go directly into the workforce is different from the level necessary for someone to enter a community college, which in turn varies from the level necessary for someone entering a competitive four-year post-secondary educational institution. The minimum levels of knowledge and skills necessary to succeed in these settings, as well as the highest levels of proficiency that can be expected, vary among these settings. Similar issues apply to setting cutscores for elementary and middle school tests and assessments. Kane (2001, pp. 58, 82–83) said,

There are generally no accepted performance standards for life after high school and no empirical base of information relating performance in history or science in eighth or twelfth grade to success in life (however that might be defined)… Standards seem most arbitrary when the contingencies they are designed to address are very vague and open-ended. The standards set on a high school graduation test are likely to be judgmental, because the level of skill that a graduate will need for work or life will depend on where they work and how they choose to live, and therefore there is no clear focal activity or contingency that can serve as a guide in standard setting. Standard-setting judges must know what students must be proficient for.

A comparison with standard setting in the military is informative. In military settings, training standards are established and applied in personnel decision making. Military training standards address clear external criteria such as the knowledge and skills necessary to operate equipment or perform specialized tasks. This is also more or less the case in standard setting for licensure or certification—a topic addressed in much of the standard-setting literature. It is not the case in K–12 education, where “it is highly unlikely that a teacher will have had experience in the career that his or her students eventually choose to enter. . . . Schools are relatively isolated from the world of work and the consequences of the quality of education they provide, whereas military training centers and operating units are tightly integrated” (Hanser, 1998, p. 82). If traditional K–12 standard-setting methods were used in the military, “the trainers who set the training standards could be quite divorced from field experience” (Hanser, p. 92)—a clearly unacceptable state of affairs. “Standards that are relatively context free are difficult to set and accept” (Hanser, p. 93).

Making item estimates. More research has been conducted on making item estimates than on any other modified-Angoff step. Some of the findings of this research support the conclusion that cutscores are valid, but other findings make us question the strength of that conclusion.

The findings of research on the extent to which item estimates are correlated with item difficulty levels—a relatively common thread of research in the empirical standard-setting literature—suggest that the estimates moderately mirror item difficulty. This finding is an indication of the validity of the estimates.
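One way such a check might be run is sketched below, assuming the analyst has the judges' mean estimate for each item and the empirical proportion of examinees answering each item correctly. The function name `pearson_r` and all values are hypothetical and chosen only for illustration; the studies summarized here may have used other correlation statistics.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Hypothetical data for five items.
mean_estimates = [0.80, 0.65, 0.55, 0.40, 0.35]  # judges' mean item estimates
p_values = [0.90, 0.60, 0.70, 0.30, 0.45]        # empirical proportion correct
r = pearson_r(mean_estimates, p_values)
print(f"correlation between estimates and item difficulty: {r:.2f}")
```

A positive correlation indicates only that judges rank items in roughly the same order as the empirical data do; it says nothing about whether the estimates are at the right absolute level.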

Other studies have examined the effects of activities between standard-setting rounds, when judges review empirical information about items and discuss this information and their item estimates. The results of these studies suggest that judges’ between-round activities affect the magnitude of cutscores. However, these results are tentative because about a third of the studies on the topic have not confirmed these findings (Brandon, 2004).

Other results suggest that judges’ between-round activities decrease item estimates’ variability and increase their reliability from round to round (desirable results). However, the results about decreasing variability are inconclusive because of large standard deviations, and the results about increasing reliability are inconclusive because the number of studies is small and the methods for calculating reliability varied among studies. Hurtz and Auerbach (2003) found that judges’ discussions among themselves reduced the variability of cutscores but that reviewing empirical information did not.
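As an illustration of what a decrease in variability between rounds looks like, the sketch below compares the standard deviation of judges' individual cutscores in two rounds. The function name `judge_cutscore_sd` and the ratings are hypothetical, and the standard deviation of judge-level sums is only one of several possible variability measures; the cited studies did not necessarily use it.

```python
from statistics import stdev


def judge_cutscore_sd(round_estimates):
    """Sample standard deviation of judges' summed item estimates."""
    if len(round_estimates) < 2:
        raise ValueError("at least two judges are required")
    judge_sums = [sum(row) for row in round_estimates]
    return stdev(judge_sums)


# Hypothetical ratings: three judges, four items, two rounds.
round_1 = [
    [0.50, 0.50, 0.50, 0.50],
    [0.75, 0.75, 0.75, 0.75],
    [0.25, 0.25, 0.25, 0.25],
]
round_2 = [
    [0.50, 0.50, 0.50, 0.50],
    [0.50, 0.75, 0.50, 0.75],
    [0.50, 0.25, 0.50, 0.25],
]
sd_1 = judge_cutscore_sd(round_1)
sd_2 = judge_cutscore_sd(round_2)
print(f"round 1 SD: {sd_1:.2f}, round 2 SD: {sd_2:.2f}")
```

In this invented example the judges converge after the between-round activities, so the second-round standard deviation is smaller than the first.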

Researchers also have examined the absolute value of the differences between item estimates and empirical p-values. Their studies address item accuracy. The rationale behind the studies is that there should be small differences between item estimates and the empirical p-values of examinees whose scores are deemed to be close to the cutscore. Although some evidence has been found that judges are able to make estimates accurately, the results of several studies suggest that item estimation might be less valid than desirable because judges tend to underestimate the difficulty of hard items and overestimate the difficulty of easy items. Of all the findings about item estimates, these are the most troubling for the validity of modified Angoff cutscores. Indeed, Shepard (1995, p. 151) concluded that findings such as these showed that “judges were unable to maintain a consistent view of the performance they expected” and thus made judgments that were “internally inconsistent and contradictory.”
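The accuracy comparison described above can be sketched as follows. The function name `accuracy_summary`, the 0.5 threshold separating hard from easy items, and the data are all assumptions made for illustration. The p-values are taken to be the proportions correct among examinees scoring near the cutscore.

```python
def accuracy_summary(estimates, p_values, hard_below=0.5):
    """Compare item estimates with empirical p-values.

    Returns the mean absolute difference plus the mean signed error
    (estimate minus p-value) for hard and for easy items. Items with
    a p-value below hard_below are treated as hard; that threshold is
    an illustrative assumption.
    """
    if len(estimates) != len(p_values) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    signed = [e - p for e, p in zip(estimates, p_values)]
    hard = [d for d, p in zip(signed, p_values) if p < hard_below]
    easy = [d for d, p in zip(signed, p_values) if p >= hard_below]

    def mean_or_none(values):
        return sum(values) / len(values) if values else None

    return {
        "mean_abs_diff": sum(abs(d) for d in signed) / len(signed),
        "hard_bias": mean_or_none(hard),
        "easy_bias": mean_or_none(easy),
    }


# Hypothetical data: two hard items followed by two easy items.
estimates = [0.50, 0.45, 0.70, 0.75]
p_values = [0.30, 0.35, 0.90, 0.85]
summary = accuracy_summary(estimates, p_values)
print(summary)
```

A positive `hard_bias` means the judges expected more correct answers on hard items than were observed (difficulty underestimated), and a negative `easy_bias` means they expected fewer correct answers on easy items than were observed (difficulty overestimated), which is the pattern the studies report.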

Conclusions About the Modified Angoff Method and Its Literature

The findings about item accuracy and the findings about the “proficiency for what” issue lead us to be concerned about using cutscores for a wide variety of program evaluation purposes. These are not the only reasons to be cautious about using the method in program evaluations, however. There also are three flaws in the literature that throw doubt on using the method for a broad array of evaluation scenarios.

The first flaw has to do with the breadth of the literature: It is broader than the research on other standard-setting methods, but it is still narrower than desirable. Insufficient empirical research has been conducted on some steps of the modified Angoff method, particularly on selecting judges, the need for judge subject-matter expertise, judge training, and defining and describing the performance standard.

More research has been conducted on the modified Angoff method than any other standard-setting method,