JMDE
Journal of MultiDisciplinary Evaluation
Number 3,
October 2005
ISSN 1556-8180
Editors
E. Jane Davidson &
Michael Scriven
Associate Editors
Chris L. S. Coryn & Daniela
C. Schröter
Assistant Editors
Thomaz Chianca
Nadini Persaud
Lori Wingate
Ryo Sasaki
Brandon W. Youker
Webmaster
Joe Fee
—The news and thinking
of
the profession and discipline of evaluation
in the world, for the world—
A peer-reviewed journal published in association with
The Interdisciplinary Doctoral Program in Evaluation
The Editorial Board
Katrina Bledsoe | Shawn Kana'iaupuni
Nicole Bowman | Ana Carolina Letichevsky
Robert Brinkerhoff | Mel Mark
Tina Christie | Masafumi Nagao
J. Bradley Cousins | Michael Quinn Patton
Lois-Ellen Datta | Patricia Rogers
Stewart Donaldson | Nick Smith
Gene Glass | Robert Stake
Richard Hake | James Stronge
John Hattie | Dan Stufflebeam
Rodney Hopson | Helen Timperley
Iraj Imam | Bob Williams
Table of Contents
PART I
The Value of Evaluation Standards: A Comparative Assessment
The 2004 Claremont Debate: Lipsey vs. Scriven
Stewart I. Donaldson and Christina A. Christie
Evaluation Capacity Building and Humanitarian Organization
Ridde Valéry and Sahibullah Shakir
Ethnography and Evaluation: Their Relationship and Three Anthropological Models of Evaluation
Revisiting Realistic Evaluation
This is a particularly interesting issue, which is just as well since it’s also our longest to date—over 220 pages, and I doubt you can find a way to shorten it without a hundred readers feeling seriously deprived!
Remember that you can arrange to be notified when a new issue comes out by registering at our website (http://evaluation.wmich.edu/jmde/subscribe.html); the next issue will be out in a month or so, with some heavy coverage of the ‘causal wars’. And we are now officially registered with an ISSN number—can’t be done without two issues on record—so that we’re in the scientific journal databases, which gives us more status in scholarly circles. In popular circles, we have over 11,000 hits on the two issues that came out before this one, which suggests (but does not prove) that more people look at our pages (perhaps briefly) than all other evaluation journals put together. Keep that in mind as you’re thinking about where to publish!
As usual, we continue our coverage of the international evaluation world, with no fewer than two reports on evaluation in China, a very interesting one on evaluation in Japan, a new correspondent writing about the scene in Germany, and one on New Zealand (where my co-editor runs a consulting business), plus an update on Canada. Our coverage of journals and events of note includes a report on the First International Congress on Qualitative Inquiry, which almost burst the seams at the
The major articles are by major authors: the architect of evaluation at the World Bank, Robert Picciotto, writes on “The Value of Evaluation Standards”; Paul Brandon, the standards guru, addresses the great problem of high-stakes testing (how do you set the lines between the grades?); and there’s a study of evaluation capacity-building in Afghanistan by two who did it there. That paper illustrates our policy of ‘naturalistic editing’ (editing that leaves the flavor of the writing intact, at some cost to the grammar of Standard English), and the description of conditions in
A serious paper on ethnography for evaluation by Brandon Youker looks at three anthropological models of evaluation, and Chris Coryn, one of our associate editors who did more than anyone to pull this issue together, reviews Realistic Evaluation. The latest issues of the major journals are also reported on by our best reporters.
Next issue we switch over to the Canadian software for online free journals, a very nice package paid for by the Canadian government, to whom our thanks. It will improve our operations considerably. And don’t forget: this is an evaluation journal, run by evaluators, so we like to hear criticism. Tell us how to improve!
The Evaluation of Disasters
In the last few years, we have seen some mighty catastrophes on the face of the earth, some wrought by human hands directly and others from great natural disasters. Of the latter, the losses from the great tsunami of the
It’s clear that these events pose new challenges for most evaluators, since the usual work of the program evaluator covers only parts of great disasters. We know how to evaluate the relief programs, the health services, the educational makeshift arrangements. But evaluation of the conditions that led to, or exacerbated the impact of these events; evaluation of the developments from them that are aimed to reduce the impact of their inevitable successors: these are a different kind of beast. These call for multidisciplinary effort of considerable novelty, and this journal will try to serve its mission of keeping its readers abreast of efforts to develop good methods and tools for doing this kind of evaluation. Meanwhile, there are a few interesting developments that may inspire us to develop improved models for this new task. Perhaps the time has come to develop what might be called the Failure Case Method?
To take one example of developments that are possibly relevant to disaster evaluation, many of us feel that one of the most interesting emerging trends in evaluation in recent years has been the emphasis on a systems approach, and surely that is one emphasis that disaster evaluation requires when we start looking evaluatively at the precursor conditions in preparedness studies. Relatedly, one must view epidemiology, a fast-developing science in its own right, as a model worth considering for its focus on finding and fixing causes of trouble, past and future. The same is true of ecobiology, another of the recent additions to the scientific Pantheon. Television has made us increasingly aware of a third player that values the systems approach: forensic pathology, portrayed on the tube as a science far more sophisticated than its actual embodiment in real labs, where DNA matching still takes a matter of weeks, not hours. And engineering has contributed a similar discipline in the form of the applied research behind the accident investigations of the National Transportation Safety Board. In all of these cases, as with natural disasters and terrorist strikes, one great methodological lesson stands out: they are all primarily cause-hunting sciences, and none of them has ever felt unable to go to work even though they’ve never seen a randomized controlled experiment. So, to pick up a theme that recurs briefly in this issue, there are some important issues in evaluation methodology where we may be able to learn something from a study of the existing disaster-hunting and disaster-prevention disciplines. Our nearest approach to date, and a worthy one it is, though low-profile so far, is evaluation of peace-maintenance efforts, with a small appearance at AEA last year.
But perhaps the most important element in disaster evaluation that is familiar to most evaluators is the ‘blame game,’ the search for responsibility. It’s an integral part of aircraft and rail crash investigations, and it poses no insuperable barrier to reliable conclusions there, or in its courts. We must take it in our stride, though of course it helps to arm oneself with the basic tools of ethical and legal analysis. For the bottom line in all of this is simple enough: a good proportion of the disastrous events themselves, and a larger proportion of their terrible consequences, are avoidable by human action. If we take on disaster evaluation and don’t step up to do the ethical analysis, and do it rigorously, the job won’t be completely done. Evaluators need to grow into this new aspect of a new task as they have so often grown before. It may be the greatest challenge we’ll ever face.
School
districts in the
The purpose of setting test or assessment standards is to establish judgmentally the cutscores that show the dividing points between levels of student performance such as pass and fail, basic and proficient, proficient and advanced, and so forth. Cutscores are established with methods such as the modified Angoff method, the contrasting-groups method, the bookmark method, and several others (Cizek, 2001). As part of student and school accountability efforts, districts report to students the performance levels at which their scores fall and report to policymakers and to the public the percentages of students achieving at the various performance levels. The U. S. No Child Left Behind Act has enshrined the use of cutscores, in that schools are required to identify and report student proficiency levels and to increase the levels of students who score below proficiency.
Cutscores are set by making judgments either about test items or about examinees’ performance on tests or assessments. Methods for making judgments about test items are known as test-centered methods, and methods for making judgments about examinee performance are known as examinee-centered methods (Jaeger, 1989). The test-centered method that for years was the most frequently used, and that remains the most widely studied, is the modified Angoff method (Angoff, 1971), and probably the most frequently studied examinee-centered method is the contrasting-groups method. In preparation for studying how and when to use test standard-setting methods in educational program evaluations, I conducted exhaustive reviews of the literature on these two methods (
Before districts or states set cutscores, they first must develop performance standards. A performance standard is a statement defining and describing the knowledge or skills that students must show at a particular performance level. Performance standards are developed before cutscores are set; cutscores are the operationalized versions of performance standards. Sometimes policy makers specify performance standards and sometimes the panels of judges that set cutscores develop them.
Under what conditions and for what purposes might it be appropriate to conduct standard setting in program evaluations? This topic has been discussed sketchily by some (e.g., Cook, Leviton, & Shadish, 1985; Rossi & Freeman, 1993; Shadish, Cook, & Leviton, 1991; Worthen, Sanders, & Fitzpatrick, 1997) and somewhat more thoroughly by a few others (e.g., Fink, Kosecoff, & Brook, 1986; Henry, McTaggart, & McMillan, 1992; Patton, 1997; Wholey, 1979). The inattention given to the topic is unfortunate, because the appropriateness of using standard-setting methods in program evaluation has not been thoroughly discussed, and the types of evaluation instances in which using cutscores would be helpful and appropriate have not been well-established.
This article examines the use of test standard setting in educational program evaluations. It begins with a recounting of the primary findings of my review of the literature on the modified Angoff method (
1. the types of decisions that might be made when interpreting evaluation results in light of cutscores and the strengths of the conclusions made based on test standard setting in evaluations,
2. the program evaluation scenarios in which it is appropriate to use cutscores for interpreting evaluation results, with a focus on the stage of evaluation and the types of evaluation designs, and
3. four criteria that evaluators should address when using cutscores to help interpret evaluation results.
This article is limited by my decision to base conclusions primarily on empirical findings from research on the modified Angoff method. Some evaluators might wish to know what standard-setting methods other than the modified Angoff method can be used in program evaluations. Psychometricians and researchers are continually developing new standard-setting methods (Cizek, 2001); many, such as the bookmark method, are proving promising, and evaluators might wish to learn from the research on them. However, the intent of this article is to base conclusions on empirical research, and little sound research has been conducted on methods other than the modified Angoff. For example, considerable attention has been paid to the contrasting-groups method, which for years probably was used more than any other examinee-centered approach, but little research has been conducted on it (
The article also is limited because it does not suggest how to apply standard setting methods for purposes other than test standard setting in program evaluation. Other than brief comments in the final paragraph of the article, I do not speculate about using the method for other purposes. Very little program evaluation research has been conducted on using standard-setting methods for purposes other than testing. (I have experimented in two evaluations with applying standard-setting methods to judging how well the evaluated programs were implemented, but the success of the efforts was mixed.) There was no research on test standard-setting methods when they were first put into wide use; I do not intend to repeat that scenario by making recommendations about using standard setting in program evaluation for purposes other than tests without an empirical basis for my suggestions. The place for extensive speculation about other uses of standard setting in program evaluation is elsewhere.
The Methodological Soundness of the Modified Angoff Method
To learn about the soundness of test standard-setting, it is useful to discuss the modified Angoff method, not only because it is an exemplar of one of the two primary types of test standard setting, but also because more empirical research has been conducted on it than any other standard-setting method. As this section shows, the evidence for the effectiveness and validity of the method is less convincing than desirable, the literature is narrow, and many of the studies of the standard-setting method are unsound or incomplete.
The modified Angoff method includes three primary steps. The method is called modified because some aspects of it were developed after Angoff (1971) first proposed it. The first step is to select and train judges. The second step is to define and describe the performance level that examinees must meet—that is, to establish the performance standard. Judges can conduct this step, but often policymakers or others provide judges with the performance standard. The third step is to make item estimates—that is, to establish estimates of the probabilities that examinees will correctly answer the items on the test or assessment at the level of the performance standard. Usually judges conduct two or three rounds of item estimation. Between rounds, the judges review empirical information such as the difficulty level of each item and have discussions about their item estimates; then, if they wish, they revise their estimates in the next round. After the three steps are conducted the cutscore is calculated by summing the item estimates for each judge and averaging the sums across judges.
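The final calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (sum each judge's item estimates, then average the sums across judges); the function name `angoff_cutscore` and the ratings are hypothetical and do not come from any study discussed here.

```python
from statistics import mean


def angoff_cutscore(item_estimates):
    """Return a modified Angoff cutscore on the raw-score scale.

    item_estimates: one list per judge; each list holds that judge's
    final-round probability estimates (0.0 to 1.0), one per test item.
    """
    if not item_estimates:
        raise ValueError("at least one judge is required")
    n_items = len(item_estimates[0])
    if any(len(row) != n_items for row in item_estimates):
        raise ValueError("every judge must rate the same set of items")
    # Sum each judge's item estimates: that judge's implied cutscore.
    judge_sums = [sum(row) for row in item_estimates]
    # Average the per-judge sums across judges.
    return mean(judge_sums)


# Hypothetical final-round ratings: 3 judges, 4 items.
ratings = [
    [0.50, 0.75, 0.25, 0.50],  # judge A, sum 2.0
    [0.75, 0.75, 0.50, 0.50],  # judge B, sum 2.5
    [0.25, 0.75, 0.25, 0.25],  # judge C, sum 1.5
]
cutscore = angoff_cutscore(ratings)  # 2.0 raw-score points out of 4
```

In practice the number of judges and items would be far larger (the studies summarized below suggest 10–20 judges), and only the last round of estimates would feed the calculation.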
Researchers and practitioners have studied the modified Angoff method more than any other, but some of the findings on the steps are inconclusive:
Selecting and training judges. Some of the research on selecting and training judges provides conclusive findings, but other research does not. Studies suggest that the appropriate number of judges for modified Angoff studies is 10–20. The conclusions of the small number of empirical studies on this topic (
Selecting judges for their subject-matter expertise can enhance item estimation, but not all judges need have high levels of expertise. Research on this topic is inconclusive because some of the studies that I identified had methodological flaws and because other studies examined incomplete versions of modified Angoff standard setting.
Very little research has been conducted on training judges, and no results bear summarizing here.
Defining and describing the performance standard. The findings of a small body of studies support the conclusion that definitions and descriptions of performance standards should be made using a set of prescribed steps and that performance standards should be fully explicated. Research on the topic is inconclusive because about half of the studies on it were simulations of standard-setting that did not include or fully implement all the modified Angoff steps (
Defining and describing performance standards is a difficult step to carry out fully and validly. Developing statements of performance standards for high school graduation tests requires judges to have a full understanding of the knowledge and skills that teenagers must have upon entering the workforce or post-secondary education, and developing performance standards for earlier school grades requires judges to estimate the level of students’ knowledge and skills necessary for success in the following grades. In both these standard-setting instances, judges must know what they are setting proficiency scores for. That is, they must understand the purpose of the standard setting and the context that students will be in when the students use the knowledge and skills that are addressed in the examination. “To say that adequacy must be defined for some purpose has important implications for validating passing scores as well as validating performance standards. This condition is much more stringent than requiring the passing score to be consistent with the description of performance standards” (Camilli, Cizek, & Lugg, 2001, p. 459). Understanding what scores are set for is not a trivial endeavor; indeed, some would say it is impossible: “Performance standards simply cannot help us decide whether Johnny or PS 19 or Colorado has enough reading skill, because there is no sensible answer to the question, ‘Enough reading skill for what?’ beyond the trivial level of ‘Enough reading skill to answer test question 36 correctly’” (Burton, 1978, p. 270).
There are no well-established developmental theories to guide methods for estimating what students’ necessary levels of performance should be upon graduation. What students need to know and be able to do depends upon the educational or vocational paths they will follow upon graduation. The proficiency level necessary for someone to go directly into the workforce is different from the level necessary for someone to enter a community college, which in turn differs from the level necessary for someone entering a competitive four-year post-secondary educational institution. The minimum levels of knowledge and skills necessary to succeed in these settings, as well as the highest levels of proficiency that can be expected, vary among these settings. Similar issues apply to setting cutscores for elementary and middle school tests and assessments. Kane (2001, pp. 58, 82–83) said,
There are generally no accepted performance standards for life after high school and no empirical base of information relating performance in history or science in eighth or twelfth grade to success in life (however that might be defined)… Standards seem most arbitrary when the contingencies they are designed to address are very vague and open-ended. The standards set on a high school graduation test are likely to be judgmental, because the level of skill that a graduate will need for work or life will depend on where they work and how they choose to live, and therefore there is no clear focal activity or contingency that can serve as a guide in standard setting. Standard-setting judges must know what students must be proficient for.
A comparison with standard setting in the military is informative. In military settings, training standards are established and applied in personnel decision making. Military training standards address clear external criteria such as the knowledge and skills necessary to operate equipment or perform specialized tasks. This is also more or less the case in standard setting for licensure or certification—a topic addressed in much of the standard-setting literature. It is not the case in K–12 education, where “it is highly unlikely that a teacher will have had experience in the career that his or her students eventually choose to enter. . . . Schools are relatively isolated from the world of work and the consequences of the quality of education they provide, whereas military training centers and operating units are tightly integrated” (Hanser, 1998, p. 82). If traditional K–12 standard-setting methods were used in the military, “the trainers who set the training standards could be quite divorced from field experience” (Hanser, p. 92)—a clearly unacceptable state of affairs. “Standards that are relatively context free are difficult to set and accept” (Hanser, p. 93).
Making item estimates. More research has been conducted on making item estimates than on any other modified-Angoff step. Some of the findings of this research support the conclusion that cutscores are valid, but other findings make us question the strength of that conclusion.
The findings of research on the extent to which item estimates are correlated with item difficulty levels—a relatively common thread of research in the empirical standard-setting literature—suggest that the estimates moderately mirror item difficulty. This finding is an indication of the validity of the estimates.
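One way to picture this kind of check is a simple correlation between judges' mean item estimates and the items' empirical p-values (the proportion of examinees answering each item correctly). The sketch below is an assumption about how such an analysis might look, not the procedure of any particular study; `pearson_r` is a hypothetical helper and the data are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Cross-products and squared deviations around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)


# Hypothetical data for five items: judges' mean estimate per item and
# the empirical p-value (proportion correct) for the same item.
mean_estimates = [0.80, 0.65, 0.55, 0.45, 0.40]
p_values = [0.90, 0.70, 0.50, 0.35, 0.20]

r = pearson_r(mean_estimates, p_values)
```

A high positive `r` would indicate that judges rank items by difficulty much as examinees' performance does; it says nothing about whether the estimates sit at the right absolute level, which is the accuracy question taken up later in this section.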
Other studies have examined the effects of activities between standard-setting rounds, when judges review empirical information about items and discuss this information and their item estimates. The results of these studies suggest that judges’ between-round activities affect the magnitude of cutscores. However, these results are tentative because about a third of the studies on the topic have not confirmed these findings (
Other results suggest that judges’ between-round activities decrease item estimates’ variability and increase their reliability from round to round (desirable results). However, the results about decreasing variability are inconclusive because of large standard deviations, and the results about increasing reliability are inconclusive because the number of studies is small and the methods for calculating reliability varied among studies. Hurtz and Auerbach (2003) found that judges’ discussions among themselves reduced the variability of cutscores but that reviewing empirical information did not.
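As a rough illustration of what "decreasing variability" means here, the sketch below compares the standard deviation of judges' individual cutscores in two rounds. The choice of the between-judge standard deviation as the variability index is an assumption made for the example, since the studies cited used varying methods; the function names and ratings are hypothetical.

```python
from statistics import stdev


def judge_cutscores(item_estimates):
    """Sum each judge's item estimates to get that judge's cutscore."""
    return [sum(row) for row in item_estimates]


def between_judge_sd(item_estimates):
    """Sample standard deviation of the judges' individual cutscores."""
    return stdev(judge_cutscores(item_estimates))


# Hypothetical ratings from the same 3 judges on the same 3 items.
round_1 = [
    [0.2, 0.4, 0.6],  # judge A, sum 1.2
    [0.5, 0.5, 0.5],  # judge B, sum 1.5
    [0.8, 0.6, 0.4],  # judge C, sum 1.8
]
round_2 = [
    [0.4, 0.5, 0.5],  # judge A, sum 1.4
    [0.5, 0.5, 0.5],  # judge B, sum 1.5
    [0.6, 0.5, 0.5],  # judge C, sum 1.6
]

sd_round_1 = between_judge_sd(round_1)
sd_round_2 = between_judge_sd(round_2)
converged = sd_round_2 < sd_round_1  # judges moved closer together
```

With so few judges the standard deviations themselves are imprecise, which echoes the caution above about drawing firm conclusions from such comparisons.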
Researchers also have examined the absolute value of the differences between item estimates and empirical p-values. Their studies address item accuracy. The rationale behind the studies is that there should be small differences between item estimates and the empirical p-values of examinees whose scores are deemed to be close to the cutscore. Although some evidence has been found that judges are able to make estimates accurately, the results of several studies suggest that item estimation might be less valid than desirable because judges tend to underestimate the difficulty of hard items and overestimate the difficulty of easy items. Of all the findings about item estimates, these are the most troubling for the validity of modified Angoff cutscores. Indeed, Shepard (1995, p. 151) concluded that findings such as these showed that “judges were unable to maintain a consistent view of the performance they expected” and thus made judgments that were “internally inconsistent and contradictory.”
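The accuracy index described above, and the pattern of errors that troubles its interpretation, can be sketched as follows. This is a minimal illustration with invented numbers; `mean_absolute_difference` and `signed_errors` are hypothetical names, and the p-values are assumed to come from examinees scoring near the cutscore.

```python
def signed_errors(estimates, p_values):
    """Estimate minus empirical p-value for each item.

    A positive value means the judges expected more borderline examinees
    to answer correctly than actually did (difficulty underestimated);
    a negative value means difficulty was overestimated.
    """
    if len(estimates) != len(p_values) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    return [e - p for e, p in zip(estimates, p_values)]


def mean_absolute_difference(estimates, p_values):
    """Average absolute gap between estimates and empirical p-values."""
    errors = signed_errors(estimates, p_values)
    return sum(abs(err) for err in errors) / len(errors)


# Hypothetical items ordered from easy to hard (p-values for borderline
# examinees), with judges' mean estimates pulled toward the middle.
p_values = [0.90, 0.60, 0.30]
estimates = [0.70, 0.60, 0.50]

errors = signed_errors(estimates, p_values)
accuracy = mean_absolute_difference(estimates, p_values)
```

In this invented case the error is negative for the easy item and positive for the hard item, which is the pattern reported in the studies: estimates that are compressed toward the middle of the probability scale.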
Conclusions About the Modified Angoff Method and Its Literature
The findings about item accuracy and the findings about the “proficiency for what” issue lead us to be concerned about using cutscores for a wide variety of program evaluation purposes. These are not the only reasons to be cautious about using the method in program evaluations, however. There also are three flaws in the literature that throw doubt on using the method for a broad array of evaluation scenarios.
The first flaw has to do with the breadth of the literature: It is broader than the research on other standard-setting methods, but it is still narrower than desirable. Insufficient empirical research has been conducted on some steps of the modified Angoff method, particularly on selecting judges, the need for judge subject-matter expertise, judge training, and defining and describing the performance standard.
More research has been conducted on the modified Angoff method than any other standard-setting method,