JMDE
Journal of MultiDisciplinary Evaluation
Number 3, October 2005
ISSN 1556-8180
Editors
E. Jane Davidson &
Michael Scriven
Associate Editors
Chris L. S. Coryn & Daniela C. Schröter
Assistant Editors
Thomaz Chianca
Nadini Persaud
Lori Wingate
Ryo Sasaki
Brandon W. Youker
Webmaster
Joe Fee
—The news and thinking of the profession and discipline of evaluation in the world, for the world—
A peer-reviewed journal published in association with The Interdisciplinary Doctoral Program in Evaluation
The
Editorial Board
Katrina Bledsoe | Shawn Kana'iaupuni
Nicole Bowman | Ana Carolina Letichevsky
Robert Brinkerhoff | Mel Mark
Tina Christie | Masafumi Nagao
J. Bradley Cousins | Michael Quinn Patton
Lois-Ellen Datta | Patricia Rogers
Stewart Donaldson | Nick Smith
Gene Glass | Robert Stake
Richard Hake | James Stronge
John Hattie | Dan Stufflebeam
Rodney Hopson | Helen Timperley
Iraj Imam | Bob Williams
Table of Contents
PART I
The Value of Evaluation Standards: A Comparative Assessment
The 2004 Claremont Debate: Lipsey vs. Scriven
Stewart I. Donaldson and Christina A. Christie
Evaluation Capacity Building and Humanitarian Organization
Ridde Valéry and Sahibullah Shakir
Ethnography and Evaluation: Their Relationship and Three Anthropological Models of Evaluation
Revisiting Realistic Evaluation
This is a particularly interesting issue, which is just as well since it’s also our longest to date—over 220 pages, and I doubt you can find a way to shorten it without a hundred readers feeling seriously deprived!
Remember that you can arrange to be notified when a new issue comes out by registering at our website (http://evaluation.wmich.edu/jmde/subscribe.html); the next issue will be out in a month or so, with some heavy coverage of the ‘causal wars’. And we are now officially registered with an ISSN number—can’t be done without two issues on record—so that we’re in the scientific journal databases, which gives us more status in scholarly circles. In popular circles, we have over 11,000 hits on the two issues that came out before this one, which suggests (but does not prove) that more people look at our pages (perhaps briefly) than all other evaluation journals put together. Keep that in mind as you’re thinking about where to publish!
As usual, we continue our coverage of the international evaluation world, with no less than two reports on evaluation in China, a very interesting one on evaluation in Japan, a new correspondent writing about the scene in Germany, and one on New Zealand (where my co-editor runs a consulting business), plus an update on Canada. Our coverage of journals and events of note includes a report on the First International Congress on Qualitative Inquiry, which almost burst the seams at the
The major articles are by major authors: the architect of evaluation at the World Bank, Robert Picciotto, writes on “The Value of Evaluation Standards”; Paul Brandon, the standards guru, addresses the great problem of high-stakes testing—how do you set the lines between the grades—and there’s a study of evaluation capacity-building in Afghanistan by two who did it there. That paper illustrates our policy of ‘naturalistic editing’—editing that leaves the flavor of the writing intact, at some cost to the grammar of Standard English—and the description of conditions in
A serious paper on ethnography for evaluation by Brandon Youker looks at three anthropological models of evaluation, and Chris Coryn, one of our associate editors who did more than anyone to pull this issue together, reviews Realistic Evaluation. The latest issues of the major journals are also reported on by our best reporters.
Next issue we switch over to the Canadian software for online free journals, a very nice package paid for by the Canadian government, to whom our thanks. It will improve our operations considerably. And don’t forget: this is an evaluation journal, run by evaluators, so we like to hear criticism. Tell us how to improve!
The Evaluation of Disasters
In the last few years, we have seen some mighty catastrophes on the face of the earth, some wrought by human hands directly and others from great natural disasters. Of the latter, the losses from the great tsunami of the
It’s clear that these events pose new challenges for most evaluators, since the usual work of the program evaluator covers only parts of great disasters. We know how to evaluate the relief programs, the health services, the educational makeshift arrangements. But evaluation of the conditions that led to, or exacerbated the impact of these events; evaluation of the developments from them that are aimed to reduce the impact of their inevitable successors: these are a different kind of beast. These call for multidisciplinary effort of considerable novelty, and this journal will try to serve its mission of keeping its readers abreast of efforts to develop good methods and tools for doing this kind of evaluation. Meanwhile, there are a few interesting developments that may inspire us to develop improved models for this new task. Perhaps the time has come to develop what might be called the Failure Case Method?
To take one example of developments that are possibly relevant to disaster evaluation, there are many of us who feel that one of the most interesting emerging trends in evaluation in recent years has been the emphasis on a systems approach, and surely that is one emphasis that disaster evaluation requires, when we start looking evaluatively at the precursor conditions in preparedness studies. Relatedly, one must view epidemiology, a fast-developing science in its own right, as a model worth considering for its focus on finding and fixing causes of trouble, past and future. The same is true of ecobiology, another of the recent additions to the scientific Pantheon. Television has made us increasingly aware of a third player that values the systems approach—forensic pathology, portrayed on the tube as a science far more sophisticated than its actual embodiment in real labs, where DNA matching still takes a matter of weeks, not hours. And engineering has contributed a similar discipline in the form of the applied research work of the accident investigations of the National Transportation Safety Board. In all of these cases, as with natural disasters and terrorist strikes, one great methodological lesson stands out: they are all primary cause-hunting sciences and none of them has ever felt unable to go to work even though they’ve never seen a randomized controlled experiment. So, to pick up a theme that recurs briefly in this issue, there are some important issues in evaluation methodology where we may be able to learn something from a study of the existing disaster-hunting and disaster-prevention disciplines. Our nearest approach to date, and a worthy one it is, though low-profile so far, is evaluation of peace-maintenance efforts, with a small appearance at AEA last year.
But perhaps the most important element in disaster evaluation that is familiar to most evaluators is the ‘blame game,’ the search for responsibility. It’s an integral part of aircraft and rail crash investigations, and it poses no insuperable barrier to reliable conclusions there, or in its courts. We must take it in our stride, though of course it helps to arm oneself with the basic tools of ethical and legal analysis. For the bottom line in all of this is simple enough: a good proportion of the disastrous events themselves, and a larger proportion of their terrible consequences, are avoidable by human action. If we take on disaster evaluation and don’t step up to do the ethical analysis, and do it rigorously, the job won’t be completely done. Evaluators need to grow into this new aspect of a new task as they have so often grown before. It may be the greatest challenge we’ll ever face.
School districts in the
The purpose of setting test or assessment standards is to establish judgmentally the cutscores that show the dividing points between levels of student performance such as pass and fail, basic and proficient, proficient and advanced, and so forth. Cutscores are established with methods such as the modified Angoff method, the contrasting-groups method, the bookmark method, and several others (Cizek, 2001). As part of student and school accountability efforts, districts report to students the performance levels at which their scores fall and report to policymakers and to the public the percentages of students achieving at the various performance levels. The U.S. No Child Left Behind Act has enshrined the use of cutscores, in that schools are required to identify and report student proficiency levels and to raise the performance of students who score below proficiency.
Cutscores are set either by making judgments about test items or about examinees’ performance on tests or assessments. Methods for making judgments about test items are known as test-centered methods, and methods for making judgments about examinee performance are known as examinee-centered methods (Jaeger, 1989). The test-centered method that for years was the most frequently used and that remains the most widely studied method is the modified Angoff method (Angoff, 1971), and probably the most frequently studied examinee-centered method is the contrasting-groups method. In preparation for studying how and when to use test standard-setting methods in educational program evaluations, I conducted exhaustive reviews of the literature on these two methods (
Before districts or states set cutscores, they first must develop performance standards. A performance standard is a statement defining and describing the knowledge or skills that students must show at a particular performance level. Performance standards are developed before cutscores are set; cutscores are the operationalized versions of performance standards. Sometimes policy makers specify performance standards and sometimes the panels of judges that set cutscores develop them.
Under what conditions and for what purposes might it be appropriate to conduct standard setting in program evaluations? This topic has been discussed sketchily by some (e.g., Cook, Leviton, & Shadish, 1985; Rossi & Freeman, 1993; Shadish, Cook, & Leviton, 1991; Worthen, Sanders, & Fitzpatrick, 1997) and somewhat more thoroughly by a few others (e.g., Fink, Kosecoff, & Brook, 1986; Henry, McTaggart, & McMillan, 1992; Patton, 1997; Wholey, 1979). The inattention given to the topic is unfortunate, because the appropriateness of using standard-setting methods in program evaluation has not been thoroughly discussed, and the types of evaluation instances in which using cutscores would be helpful and appropriate have not been well-established.
This article examines the use of test standard setting in educational program evaluations. It begins with a recounting of the primary findings of my review of the literature on the modified Angoff method (
1. the types of decisions that might be made when interpreting evaluation results in light of cutscores and the strengths of the conclusions made based on test standard setting in evaluations,
2. the program evaluation scenarios in which it is appropriate to use cutscores for interpreting evaluation results, with a focus on the stage of evaluation and the types of evaluation designs, and
3. four criteria that evaluators should address when using cutscores to help interpret evaluation results.
This article is limited by my decision to base conclusions primarily on the empirical findings of research on the modified Angoff method. Some evaluators might wish to know what standard-setting methods other than the modified Angoff method can be used in program evaluations. Psychometricians and researchers are continually developing new standard-setting methods (Cizek, 2001); many, such as the bookmark method, are proving promising, and evaluators might wish to learn from the research on them. However, the intent of this article is to base conclusions on empirical research, and little sound research has been conducted on methods other than the modified Angoff. For example, considerable attention has been paid to the contrasting-groups method, which for years probably was used more than any other examinee-centered approach, but little research has been conducted on it (
The article also is limited because it does not suggest how to apply standard setting methods for purposes other than test standard setting in program evaluation. Other than brief comments in the final paragraph of the article, I do not speculate about using the method for other purposes. Very little program evaluation research has been conducted on using standard-setting methods for purposes other than testing. (I have experimented in two evaluations with applying standard-setting methods to judging how well the evaluated programs were implemented, but the success of the efforts was mixed.) There was no research on test standard-setting methods when they were first put into wide use; I do not intend to repeat that scenario by making recommendations about using standard setting in program evaluation for purposes other than tests without an empirical basis for my suggestions. The place for extensive speculation about other uses of standard setting in program evaluation is elsewhere.
The Methodological Soundness of the Modified Angoff Method
To learn about the soundness of test standard-setting, it is useful to discuss the modified Angoff method, not only because it is an exemplar of one of the two primary types of test standard setting, but also because more empirical research has been conducted on it than any other standard-setting method. As this section shows, the evidence for the effectiveness and validity of the method is less convincing than desirable, the literature is narrow, and many of the studies of the standard-setting method are unsound or incomplete.
The modified Angoff method includes three primary steps. The method is called modified because some aspects of it were developed after Angoff (1971) first proposed it. The first step is to select and train judges. The second step is to define and describe the performance level that examinees must meet—that is, to establish the performance standard. Judges can conduct this step, but often policymakers or others provide judges with the performance standard. The third step is to make item estimates—that is, to establish estimates of the probabilities that examinees will correctly answer the items on the test or assessment at the level of the performance standard. Usually judges conduct two or three rounds of item estimation. Between rounds, the judges review empirical information such as the difficulty level of each item and have discussions about their item estimates; then, if they wish, they revise their estimates in the next round. After the three steps are conducted the cutscore is calculated by summing the item estimates for each judge and averaging the sums across judges.
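The calculation described at the end of the third step can be illustrated with a minimal sketch. The function name and the ratings below are hypothetical and are not drawn from any study discussed here; the sketch assumes each judge has supplied one final-round probability estimate per item.

```python
from statistics import mean

def angoff_cutscore(estimates_by_judge):
    """Sum each judge's item estimates, then average those sums across judges."""
    judge_sums = [sum(item_estimates) for item_estimates in estimates_by_judge]
    return mean(judge_sums)

# Hypothetical final-round estimates: three judges, four items.
# Each value is a judge's estimate of the probability that an examinee
# at the performance standard answers the item correctly.
ratings = [
    [0.50, 0.75, 0.25, 0.50],
    [0.50, 0.50, 0.25, 0.75],
    [0.75, 0.75, 0.50, 0.50],
]

cutscore = angoff_cutscore(ratings)
print(f"Cutscore on a 4-item test: {cutscore:.2f} raw-score points")
```

The result is expressed in raw-score points, so on an operational test it would be compared with examinees' number-correct scores.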
Researchers and practitioners have studied the modified Angoff method more than any other, but some of the findings on the steps are inconclusive:
Selecting and training judges. Some of the research on selecting and training judges provides conclusive findings, but other research does not. Studies suggest that the appropriate number of judges for modified Angoff studies is 10–20. The conclusions of the small number of empirical studies on this topic (
Selecting judges for their subject-matter expertise can enhance item estimation, but not all judges need have high levels of expertise. Research on this topic is inconclusive because some of the studies that I identified had methodological flaws and because other studies examined incomplete versions of modified Angoff standard setting.
Very little research has been conducted on training judges, and no results bear summarizing here.
Defining and describing the performance standard. The findings of a small body of studies support the conclusion that definitions and descriptions of performance standards should be made using a set of prescribed steps and that performance standards should be fully explicated. Research on the topic is inconclusive because about half of the studies on it were simulations of standard-setting that did not include or fully implement all the modified Angoff steps (
Defining and describing performance standards is a difficult step to carry out fully and validly. Developing statements of performance standards for high school graduation tests requires judges to have a full understanding of the knowledge and skills that teenagers must have upon entering the workforce or post-secondary education, and developing performance standards for earlier school grades requires judges to estimate the level of students’ knowledge and skills necessary for success in the following grades. In both these standard-setting instances, judges must know what they are setting proficiency scores for. That is, they must understand the purpose of the standard setting and the context that students will be in when the students use the knowledge and skills that are addressed in the examination. “To say that adequacy must be defined for some purpose has important implications for validating passing scores as well as validating performance standards. This condition is much more stringent than requiring the passing score to be consistent with the description of performance standards” (Camilli, Cizek, & Lugg, 2001, p. 459). Understanding what scores are set for is not a trivial endeavor; indeed, some would say it is impossible: “Performance standards simply cannot help us decide whether Johnny or PS 19 or Colorado has enough reading skill, because there is no sensible answer to the question, ‘Enough reading skill for what?’ beyond the trivial level of ‘Enough reading skill to answer test question 36 correctly’” (Burton, 1978, p. 270).
There are no well-established developmental theories to guide methods for estimating what students’ necessary levels of performance should be upon graduation. What students need to know and be able to do depends upon the educational or vocational paths they will follow upon graduation. The proficiency level necessary for someone to go directly into the workforce is different from the level necessary for someone to enter a community college, which in turn differs from the level necessary for someone entering a competitive four-year post-secondary educational institution. The minimum levels of knowledge and skills necessary to succeed in these settings, as well as the highest levels of proficiency that can be expected, vary among these settings. Similar issues apply to setting cutscores for elementary and middle school tests and assessments. Kane (2001, pp. 58, 82–83) said,
There are generally no accepted performance standards for life after high school and no empirical base of information relating performance in history or science in eighth or twelfth grade to success in life (however that might be defined)… Standards seem most arbitrary when the contingencies they are designed to address are very vague and open-ended. The standards set on a high school graduation test are likely to be judgmental, because the level of skill that a graduate will need for work or life will depend on where they work and how they choose to live, and therefore there is no clear focal activity or contingency that can serve as a guide in standard setting. Standard-setting judges must know what students must be proficient for.
A comparison with standard setting in the military is informative. In military settings, training standards are established and applied in personnel decision making. Military training standards address clear external criteria such as the knowledge and skills necessary to operate equipment or perform specialized tasks. This is also more or less the case in standard setting for licensure or certification—a topic addressed in much of the standard-setting literature. It is not the case in K–12 education, where “it is highly unlikely that a teacher will have had experience in the career that his or her students eventually choose to enter. . . . Schools are relatively isolated from the world of work and the consequences of the quality of education they provide, whereas military training centers and operating units are tightly integrated” (Hanser, 1998, p. 82). If traditional K–12 standard-setting methods were used in the military, “the trainers who set the training standards could be quite divorced from field experience” (Hanser, p. 92)—a clearly unacceptable state of affairs. “Standards that are relatively context free are difficult to set and accept” (Hanser, p. 93).
Making item estimates. More research has been conducted on making item estimates than on any other modified-Angoff step. Some of the findings of this research support the conclusion that cutscores are valid, but other findings make us question the strength of that conclusion.
The findings of research on the extent to which item estimates are correlated with item difficulty levels—a relatively common thread of research in the empirical standard-setting literature—suggest that the estimates moderately mirror item difficulty. This finding is an indication of the validity of the estimates.
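As a hedged illustration of this kind of analysis, the sketch below correlates judges' mean item estimates with empirical item p-values (the proportion of examinees answering each item correctly). The data are invented, and the choice of a Pearson coefficient is an assumption; the studies summarized here may have used other indices.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for five items.
mean_estimates = [0.80, 0.65, 0.60, 0.45, 0.40]  # averaged across judges
p_values = [0.90, 0.70, 0.55, 0.50, 0.25]        # proportion answering correctly

r = pearson_r(mean_estimates, p_values)
print(f"Correlation between estimates and item p-values: r = {r:.2f}")
```

A high positive coefficient would indicate that judges rank the items by difficulty in roughly the same order as examinees' actual performance does; it says nothing about whether the estimates are at the right absolute level.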
Other studies have examined the effects of activities between standard-setting rounds, when judges review empirical information about items and discuss this information and their item estimates. The results of these studies suggest that judges’ between-round activities affect the magnitude of cutscores. However, these results are tentative because about a third of the studies on the topic have not confirmed these findings (
Other results suggest that judges’ between-round activities decrease item estimates’ variability and increase their reliability from round to round (desirable results). However, the results about decreasing variability are inconclusive because of large standard deviations, and the results about increasing reliability are inconclusive because the number of studies is small and the methods for calculating reliability varied among studies. Hurtz and Auerbach (2003) found that judges’ discussions among themselves reduced the variability of cutscores but that reviewing empirical information did not.
Researchers also have examined the absolute value of the differences between item estimates and empirical p-values. Their studies address item accuracy. The rationale behind the studies is that there should be small differences between item estimates and the empirical p-values of examinees whose scores are deemed to be close to the cutscore. Although some evidence has been found that judges are able to make estimates accurately, the results of several studies suggest that item estimation might be less valid than desirable because judges tend to underestimate the difficulty of hard items and overestimate the difficulty of easy items. Of all the findings about item estimates, these are the most troubling for the validity of modified Angoff cutscores. Indeed, Shepard (1995, p. 151) concluded that findings such as these showed that “judges were unable to maintain a consistent view of the performance they expected” and thus made judgments that were “internally inconsistent and contradictory.”
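The accuracy index described above can be sketched as follows. The values are hypothetical, and the sign convention (estimate minus p-value) is an assumption made for illustration: a positive difference means the judges expected more borderline examinees to succeed than actually did, that is, they underestimated the item's difficulty.

```python
def item_accuracy(estimates, p_values):
    """Return signed differences (estimate - p-value) and their mean absolute value."""
    signed = [e - p for e, p in zip(estimates, p_values)]
    mean_abs = sum(abs(d) for d in signed) / len(signed)
    return signed, mean_abs

# Hypothetical data for three items: a hard item, an easy item, and a middling one.
estimates = [0.50, 0.75, 0.50]      # judges' mean item estimates
borderline_p = [0.25, 0.875, 0.50]  # p-values for examinees scoring near the cutscore

signed, mean_abs = item_accuracy(estimates, borderline_p)
for diff, p in zip(signed, borderline_p):
    print(f"p-value {p:.3f}: estimate - p = {diff:+.3f}")
print(f"Mean absolute difference: {mean_abs:.3f}")
```

In this invented example the pattern matches the one reported in the literature: the estimate for the hard item is too high and the estimate for the easy item is too low, so the estimates are pulled toward the middle of the scale.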
Conclusions About the Modified Angoff Method and Its Literature
The findings about item accuracy and the findings about the “proficiency for what” issue lead us to be concerned about using cutscores for a wide variety of program evaluation purposes. These are not the only reasons to be cautious about using the method in program evaluations, however. There also are three flaws in the literature that throw doubt on using the method for a broad array of evaluation scenarios.
The first flaw has to do with the breadth of the literature: It is broader than the research on other standard-setting methods, but it is still narrower than desirable. Insufficient empirical research has been conducted on some steps of the modified Angoff method, particularly on selecting judges, the need for judge subject-matter expertise, judge training, and defining and describing the performance standard.
More research has been conducted on the modified Angoff method than any other standard-setting method, but the findings of the extant research provide only the first few layers of an empirical foundation for making decisions about how to set cutscores. These layers alone cannot serve as the sole basis for deciding about how to go about setting modified Angoff cutscores; clinical guidance by experienced practitioners is also necessary.
(Brandon, 2004, p. 80)
The second flaw has to do with the reporting of studies. Many empirical modified Angoff studies have not reported full descriptions of the standard-setting methods that were used:
The dearth of complete descriptions obfuscates the interpretation of the body of modified Angoff standard-setting literature. If the studies were described more carefully and thoroughly, patterns of interactions among the variations in methods might be discernible. As the research stands now, these patterns cannot be seen.
(Brandon, 2004, pp. 79–80)
The third flaw is methodological. Many of the findings reported in the empirical standard-setting research are from simulations in which only some of the standard-setting steps have been conducted. Research on the method that omits some of the modified Angoff steps is flawed because it does not examine all the key aspects of standard-setting; such research is akin to studying performance assessments in which students are not given instructions for conducting the assessments. Because of the omission of key steps, the findings of some studies are less generalizable than desirable to the fully implemented modified Angoff method.
The primary effect of these three flaws is that we do not have a full understanding of all of the steps of the modified Angoff method. There are not enough empirical studies to adequately examine all facets of the method, too many of the empirical studies that have been published do not explain how they conducted the steps or else do not conduct some of the steps, and too many studies are analog studies. These flaws, combined with the findings about difficulties in knowing “proficiency for what” and the findings about the difficulty in making estimates for the hardest and for the easiest items, lead me to conclude that it is questionable whether modified-Angoff cutscores are uniformly valid for making summative, high-stakes decisions in program evaluations. Placing great weight on modified Angoff cutscores in high-stakes decisions, as occurs in K–12 education, might be more than their methodological foundation can bear, in part because some of the findings about the method are troubling and in part because the methods and reporting of many modified Angoff studies are flawed.
Evaluation Scenarios Appropriate for Developing and Using Cutscores
Program evaluators might correctly hesitate to use modified Angoff cutscores for high-stakes, summative purposes, but the findings on the validity of cutscores are not so troubling that evaluators should refrain from using them in all program evaluations. Evaluators can use them to help interpret student scores for formative-evaluation purposes or to help interpret scores for suggesting summative program-evaluation decisions. Cutscores do not have to be interpreted as definitive demarcations of success; “gray areas” around the cutscores can be calculated using the standard error of the mean, resulting in cutbands instead of cutscores. This calculation would show a band around the cutscore that would provide an accommodation to the inexactitude of standard setting. Using standard errors in this way, evaluators would have three score bands—one for students who we could reasonably state are below the desired level of performance, one for those who are more or less at the desired level of performance, and one for those who are clearly above the desired level of performance. Using this analysis, evaluators could report with a reasonable level of assurance the percentages of student scores above and below proficiency. Such descriptive reports could help evaluators understand how well programs are helping students achieve program goals without placing undue emphasis on the cutscore itself. The reports could provide program personnel with general guidance about their programs. Formative evaluation findings and findings that are only suggestive of summative conclusions are not used to make go/no-go decisions about programs. When cutscores are used in ways such as these, their precision and validity are less critical than when they are used for making conclusive summative decisions about students or schools.
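One way the cutband idea might be operationalized is sketched below. Two assumptions are made that the text does not specify: the standard error of the mean is computed over the individual judges' cutscores, and the band extends one standard error on either side of the mean cutscore. The function names and all data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def cutband(judge_cutscores, width=1.0):
    """Return (low, high): the mean cutscore plus or minus `width` standard errors."""
    center = mean(judge_cutscores)
    se = stdev(judge_cutscores) / sqrt(len(judge_cutscores))
    return center - width * se, center + width * se

def band_percentages(scores, low, high):
    """Percentage of scores below, within, and above the cutband."""
    n = len(scores)
    below = sum(s < low for s in scores)
    above = sum(s > high for s in scores)
    within = n - below - above
    return {
        "below": 100 * below / n,
        "within": 100 * within / n,
        "above": 100 * above / n,
    }

# Hypothetical cutscores from four judges and raw scores for eight students.
judge_cutscores = [28, 30, 32, 30]
student_scores = [22, 27, 29, 30, 31, 35, 38, 40]

low, high = cutband(judge_cutscores)
pct = band_percentages(student_scores, low, high)
print(f"Cutband: {low:.2f} to {high:.2f}")
print(pct)
```

A wider band (for example, `width=2.0`) places more students in the middle group and makes the "below" and "above" classifications more conservative; the choice is a judgment call for the evaluator.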
However, because of the limitations in the research and because of concerns about invalidity, I conclude that the modified Angoff method should be used primarily when other approaches are unavailable for interpreting student scores. That is, cutscores should be developed and used only with some kinds of evaluation designs and only in some evaluation stages. Evaluators should consider using test cutscores to help interpret test or assessment program outcome scores when no comparison or control groups are available. This scenario occurs when educational programs are implemented at all program sites, when administrators and faculty at non-program sites are unwilling to let evaluators use their sites for comparison or control groups, or, in the evaluations of small programs, when evaluation funding is too limited to have comparison or control groups. Cutscores developed when no comparison or control groups are available could help evaluators decide the extent to which children are performing at or near the desired level of performance. Cutscores might particularly be useful during the first year of an evaluation, when no year-to-year effect sizes can be calculated. Effect sizes showing annual growth are valuable for year-to-year comparisons, because they can be compared with published effect sizes for similar programs (Lipsey, 1990; Lynch, 1987), and because they probably are more defensible than cutscores. The two analyses together might also be useful, of course; cutscores used over several years of an evaluation can show how high or low program students are performing, irrespective of the size of year-to-year effect sizes.
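For readers who want a concrete picture of a year-to-year effect size, the sketch below computes a standardized mean difference using a pooled standard deviation. This particular formula is an assumption chosen for illustration; the text does not say which effect-size index is intended, and the scores are invented.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(year1, year2):
    """(mean of year 2 - mean of year 1) divided by the pooled standard deviation."""
    n1, n2 = len(year1), len(year2)
    pooled_var = ((n1 - 1) * variance(year1) + (n2 - 1) * variance(year2)) / (n1 + n2 - 2)
    return (mean(year2) - mean(year1)) / sqrt(pooled_var)

# Hypothetical test scores for program students in two successive years.
year1_scores = [20, 24, 26, 30]
year2_scores = [24, 28, 30, 34]

effect_size = standardized_mean_difference(year1_scores, year2_scores)
print(f"Year-to-year effect size: {effect_size:.2f}")
```

Such a value can be set beside the percentage of students at or above the cutscore in each year, so that growth and absolute level of performance are reported together.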
As long as they are interpreted with caution, cutscores might also be helpful even when comparison groups are used. They can help interpret mean scores when the differences between program and comparison groups are not statistically significant. Comparing average scores to a cutscore could help evaluators know the general levels of performance of both the program and comparison groups. Furthermore, using cutscores could help evaluators tie the interpretation of evaluation results directly to program goals. If a program’s goal is, say, to have students achieve proficiency in reading knowledge or skills, evaluators could use cutscores to show the extent to which the proficiency goal had been achieved. The same kind of analysis could be conducted for other levels of student performance. Such reports are rhetorically more powerful than simply reporting whether the program group out-achieved a comparison group or surpassed a specified percentile of a norm group, because comparisons of average scores with cutscores tie evaluation results directly to descriptions of desired levels of student performance.
Criteria for Using Standard Setting in Program Evaluations
There are at least four criteria that should be addressed if evaluators use the modified Angoff method in program evaluations:
1. Standards should be set for reliable and valid tests.
2. The program for which standards are to be set should be well defined with concrete objectives that clearly show what is expected of program recipients upon completion.
3. The standard-setting judges should understand the program objectives well, know the socioeconomic and educational context of the program, and understand the context in which program recipients will study or work after completing the program.
4. The standard setting should be feasible. The standard-setting method should not require more time and resources than the program can afford.
The necessity of the first condition should go without saying; cutscores cannot be used validly to make decisions about program success unless the test for which they are set adequately measures subject matter and produces sufficiently precise scores to make decisions about programs. The other three conditions, however, need some elaboration.
Well-defined programs. When using standard setting in program evaluations, the programs should have clear sets of concrete objectives. Clear objectives are necessary if well-defined and well-described performance standards are to be developed. Although the empirical literature on setting performance standards is not extensive, a small body of studies strongly suggests that performance standards must be thoroughly described and well understood by judges if cutscores are to be valid. Indeed, it is commonsensical that performance standards must be thoroughly explicated, because judges need to understand what students must be proficient for.
The “proficiency for what” issue need not be as deleterious in program evaluation standard setting as it is in K–12 accountability standard setting. K–12 public education provides a wide smorgasbord of educational services to all children. In contrast, many educational programs provide narrow, well-defined services to clearly demarcated populations. Educational programs typically address a single subject such as reading or science or a narrow topic such as safety, drug abuse, and so forth. Programs are designed for a single grade level or perhaps two or three grades. They often serve subgroups of students with well-described demographic characteristics. If programs are well-designed, it is likely that their objectives will be clear and their goals more clearly defined than the goals typically addressed in K–12 standard setting (i.e., advancing students to the next grade or graduating them from high school). Furthermore, judges in program evaluation standard setting can consider the social and demographic context of the schools that a program serves. Programs often serve smaller populations than entire districts. Judges can define performance standards and set cutscores while keeping in mind the population that the program serves, the wealth and the physical condition of the schools that are served, the typical longevity of teachers serving in the district, and other district demographics that evaluators can gather for judges to consider.
Judges who know the program and its context. Standard-setting judges are more likely to have reasonable expectations about student outcomes in a program if they are intimate with the program’s history, aspirations, administration, line personnel, operations, and so forth. The better they know a program, the more reasonable their expectations about program outcomes will be, and the more likely it will be that they will know the answers to a number of questions. Quoting Smith (1981, p. 266), these questions are:
· Has what the program is trying to do ever been done before by anyone? (If not, do not expect too much.)
· Has it ever been done the way the program is trying to do it? (Reasonable expectations are lower for innovations.)
· Is the logic which explains why this program will achieve its desired ends compelling? (The stronger the logic, the more warranted high expectations are.)
· Does the scope of this effort, in terms of time and resources, match the level of effect expected? (Real change usually requires a lot of time and effort.)
· Do contextual factors suggest that this effort might be more or less successful than previous efforts? (Higher expectations are warranted if this program is free of previous contextual constraints.)
It certainly would not be impossible to provide standard-setting judges selected from outside the program with the answers to these questions, but the standard-setting training required to address the questions fully would be onerously lengthy and expensive.
Judges are more likely to develop reasonable expectations if they are familiar with the socioeconomic and educational contexts of a program. Programs in economically disadvantaged communities or in schools lacking good equipment and facilities are less likely to show acceptable levels of performance than are programs in less-disadvantaged communities. Judges should know these contexts because of their effects on student outcomes in the program. Judges can take socioeconomic status and school conditions into account when developing performance standards and setting cutscores. Keeping in mind the mix of schools of varying socioeconomic status and of facilities with varying degrees of maintenance will help ensure that judges’ standards are well-informed and reasonable.
The need for familiarity with programs and their social and demographic contexts means that standard-setting judges should be program personnel such as developers or teachers. Others might be insufficiently familiar with the program. For example, parents might not understand program expectations. Also, outside educators such as university personnel might be insufficiently familiar with the conditions of the schools in the program. Program evaluators who are not subject to political pressures can select judges on the basis of how well they know the program and understand the school context, including both the schools themselves and the community in which they reside. It is unlikely that evaluators will find qualified personnel of this sort outside of the program setting.
Having to hire program personnel might mean selecting judges who would be inclined to set lenient program performance standards and low cutscores. Judges might establish erroneously easy performance standards and cutscores because they are loyal to the program, do not wish to see it fail, or believe that they might be under pressure to be easy on the program. This is a source of bias that evaluators should consider when developing program standards. Judges should be trained to establish performance standards that reflect the intent of the program and to set cutscores at levels that match the performance standards.
A colleague and I had teachers serve as standard-setting judges for a state-developed writing assessment that we administered during an elementary-school writing program evaluation (Brandon & Higa, 1998). After pilot-testing the standard setting in another school, all seven fourth-grade teachers in the program school set standards for their students. The teachers addressed the question, “If you instructed your students last year as well as possible, what was the best they could have done?” They answered this question for each of five dimensions of writing—meaning, voice, design, clarity, and conventions (grammar, punctuation, and so forth).
The seven teachers were deemed the only appropriate group to develop standards because other groups had insufficient knowledge about students’ achievement and educational background, writing skills, and the context within which they were taught. The school principal did not participate because he might not have known the capabilities of the cohort of assessed students sufficiently well to have set fair standards, and parents did not participate because they knew too little about content-area knowledge or skills or about program context to arrive at fair judgments.
We were concerned that the seven teachers’ estimates of how well students could perform might be lenient because they would not want the effects of their instruction to look poor. To address this concern, we examined the differences between the mean estimates for each of the five writing dimensions and the actual performance of students for which the standards were set (Brandon & Higa, 1998). If the cutscores that the teachers set had been far below student averages, it would have suggested that inappropriate methods were used or that teachers had a self-serving bias. The differences between the cutscores and the performance of the program students showed, however, that the cutscores were somewhat above students’ performance, suggesting that teachers did not show a self-serving bias. Furthermore, the cutscores were not so high as to suggest inappropriate expectations. These results helped rule out claims of invalid standards.
Feasibility. Program evaluations must be feasible (Joint Committee on Standards for Educational Evaluation, 1994). Sufficient time and resources are necessary for program evaluation standard setting because good standard setting can be a labor-intensive, lengthy activity. Evaluation theoreticians and methodologists often overlook feasibility issues, but these must be addressed if practitioners are to use the methods.
In standard setting, both the development of the description of the performance standard and the setting of cutscores require sufficient time and resources. Developing performance standards for a moderately long single-subject test can take half a day (Mills, Melican, & Ahluwalia, 1991; Livingston & Zieky, 1989). Furthermore, setting cutscores is clearly not a brief task, as should be apparent from the description presented earlier of the steps of the modified Angoff method. In modified Angoff standard setting, judges review items, make initial estimates, review empirical information about the items, hold discussions about their initial estimates, revise their estimates, and perhaps repeat the review/discussion/estimation activities for another iteration. These activities can easily last for a full day; in some instances, such as standard setting for the National Assessment of Educational Progress, they take two days or more.
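The arithmetic behind these rounds of estimation can be sketched briefly in Python. The sketch assumes the conventional Angoff computation, in which each judge's item estimates (as proportions) are summed to give that judge's cutscore and the panel cutscore is the mean across judges; this section does not spell the computation out, so it is an assumption here. The judges, items, estimates, and function name are all hypothetical.

```python
from statistics import mean


def angoff_cutscore(item_estimates):
    """Return (panel cutscore, per-judge cutscores) in raw-score points.

    item_estimates[j][i] is judge j's estimate of the percentage of
    students at the performance standard who would answer item i correctly.
    """
    judge_cutscores = [sum(pct / 100.0 for pct in judge)
                       for judge in item_estimates]
    return mean(judge_cutscores), judge_cutscores


# Hypothetical estimates from three judges on a four-item test.
initial_round = [
    [60, 70, 50, 80],
    [50, 60, 40, 70],
    [70, 80, 60, 90],
]
# Hypothetical revised estimates after reviewing item data and discussion.
revised_round = [
    [60, 70, 50, 80],
    [55, 65, 45, 75],
    [65, 75, 55, 85],
]

initial_cut, initial_by_judge = angoff_cutscore(initial_round)
revised_cut, revised_by_judge = angoff_cutscore(revised_round)

for label, cut, by_judge in [("Initial", initial_cut, initial_by_judge),
                             ("Revised", revised_cut, revised_by_judge)]:
    spread = max(by_judge) - min(by_judge)
    print(f"{label} round: cutscore {cut:.2f}, spread across judges {spread:.2f}")
```

In this made-up example the second round leaves the panel cutscore unchanged but narrows the spread among judges, which is one way that discussion and a further iteration can matter even when the average does not move.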
When setting standards for the elementary-school writing program (Brandon & Higa, 1998), we eliminated the step of having teachers prepare written descriptions of performance standards; instead, we asked them to estimate the best performance that they reasonably thought children could achieve. We eliminated the step because the rating-scale rubrics described the target level of performance for each rating-scale point. Teachers knew the rubrics well because they had used them to score student papers; they were asked to use the rubrics to substitute for performance standards. When trained in the standard-setting procedures, they simply had to review some of the materials that they had used when doing the assessments. This efficiency contributed to the feasibility of the standard setting. The standard setting method was implemented in a reasonable period of time (less than half a day). The teachers’ comments, made during and immediately following the standard setting, suggested that they understood and fully used the standard-setting methods. Some teachers commented that they were unsure about the percentages to estimate for the scale points, but none resisted participation. None of the comments suggested that teachers found it difficult to apply knowledge of the assessment to the standard-setting task.
Summary and Conclusions
Standard setting, which is widely used by school districts and states to hold students and schools accountable for their educational performance, has not been widely used by program evaluators as a means for helping decide whether a program has performed sufficiently well. Furthermore, the topic has been covered minimally in the program evaluation literature. This is unfortunate, because evaluators could use cutscores to help interpret program outcomes during the first year of an evaluation in which there are no comparison groups. They might even be useful when comparison groups are used, for they help show how high program and comparison groups are performing, irrespective of which group is performing the best.
Standard-setting consists of establishing performance standards, which are statements describing the knowledge and skills that students must attain if they are to perform at a specified performance level (basic, proficient, advanced, and so forth), and it consists of setting cutscores. The modified Angoff method is the most widely studied standard-setting method. As used in the test and assessment standard setting that schools, districts, and states conduct for accountability purposes, the modified Angoff method has three steps. Very little research has been conducted on the first step, which is to select and train the panels of judges who establish performance standards and set cutscores. Other than showing that 10–20 is an adequate range of the number of standard-setting judges, the empirical research literature is of little assistance in identifying the best mix of procedures for this step.
More research has been conducted on the second step, which is to define and describe the performance standard (i.e., the statements describing the level of knowledge and skills that students should attain). The findings are inconclusive but commonsensically suggest that the better that performance standards are defined and explicated, the more valid cutscores are likely to be. Performance standards for educational accountability purposes are murky by nature, however, because it is impossible to know what comprises an adequate level of performance. If a performance standard is defined for graduation, should it be set for students who are going to trade schools, community colleges, state colleges, or private elite universities? What should the performance standard be for students who do not participate in any post-secondary education? If a performance standard for a particular school subject is defined for an elementary- or middle-school grade, what is the developmental or pedagogical basis for deciding what constitutes adequate performance? These questions have not been adequately addressed in the literature, and because of the epistemological complexity of the topic, are unlikely ever to be.
More research has been conducted on the third step of the modified Angoff method than on the other two steps. In this step, judges set estimates of the percentages of students who should pass each item at the level of the performance standard. During this step, judges are given empirical item p-values so that they know the difficulty levels of the items they are judging. The empirical research suggests that judges’ discussions make a difference, but the research is not conclusive. Probably the most conclusive research about the third step has to do with the accuracy of item estimates, which is established by examining the absolute value of the differences between judges’ item estimates and item p-values. This research suggests that judges tend to underestimate the difficulty of hard items and overestimate the difficulty of easy items. That is, the range of judges’ item estimates is less than the range of empirical p-values.
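The accuracy index and the range compression described above can be shown with a small Python sketch. The p-values and the judge's estimates are hypothetical numbers chosen only to reproduce the reported pattern, and the function names are illustrative; nothing here is taken from the studies reviewed.

```python
def mean_absolute_difference(estimates, p_values):
    """Average absolute gap between a judge's item estimates and item p-values."""
    return sum(abs(e - p) for e, p in zip(estimates, p_values)) / len(p_values)


def value_range(values):
    """Distance between the largest and smallest value."""
    return max(values) - min(values)


# Hypothetical empirical p-values (proportion of students answering correctly),
# ordered from the hardest item to the easiest.
p_values = [0.20, 0.40, 0.60, 0.80, 0.90]
# Hypothetical estimates from one judge for the same five items.
judge_estimates = [0.35, 0.45, 0.60, 0.70, 0.75]

accuracy = mean_absolute_difference(judge_estimates, p_values)
print(f"Mean absolute difference: {accuracy:.2f}")
print(f"Range of p-values:        {value_range(p_values):.2f}")
print(f"Range of estimates:       {value_range(judge_estimates):.2f}")
```

Here the judge's estimates for the hardest items exceed their p-values (difficulty is underestimated) and the estimates for the easiest items fall below them (difficulty is overestimated), so the estimates span a narrower range than the p-values do.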
The research on the three steps of the modified Angoff method has not been conclusive in part because (a) the literature is more narrow than desirable, (b) some of the literature is not reported fully, and (c) the methods of the research have been of low quality. Because of problems with the methods and findings of the empirical research on standard setting, as exemplified by the research on the modified Angoff method—the most-studied of all test and assessment standard-setting methods—it might be concluded that program evaluators should avoid using the method to help make judgments about program success. However, the methods are not so unsound as to preclude their use for formative program evaluation purposes or for making suggestive (rather than conclusive) summative evaluation decisions. If cutscores are interpreted with caution and are considered to be suggestive of the success (or lack thereof) of a program, they can help evaluators make conclusions in evaluations that lack comparison groups.
Even though the empirical test and assessment standard-setting literature does not provide convincing evidence about the strength of standard-setting methods, it nevertheless is sufficiently thorough to help us know the conditions that should be present if evaluators use the method in program evaluations. There are at least four of these conditions. The first is that standards should be set only for valid and reliable tests. Evaluators are best advised to set standards for commercially published tests or assessments or for other carefully crafted instruments. Second, cutscores should be set only if program objectives are clearly stated. Otherwise, performance standards will be difficult to develop. Third, judges should be familiar with the program and the context within which it is taught. The task of setting performance standards for a program is conceptually less complex than the task of setting standards for a school district, because programs (at least those that are well-developed and well-run) have clear sets of methods and objectives that standard-setting judges can keep in mind when setting cutscores. This assumes that the judges know the program well and eliminates the possibility of having people outside the program serve as judges. Of course, the charge might be made that program faculty, developers, or administrators who serve as standard-setting judges might set lenient standards. However, in a trial application of standard setting in a program evaluation, it was shown that this need not be the case (Brandon & Higa, 1998). The fourth condition is that the standard setting should be feasible. Evaluators should not assume that they can set standards without proper preparation and full understanding of the mechanics and theory of the procedures. In our trial application of standard setting in a program evaluation (Brandon & Higa, 1998), we showed that it was feasible in a small school-level evaluation.
This article shows that standard setting methods have value in evaluations. They can help evaluators make decisions about program success in the first year of an evaluation that has no comparison groups. In this scenario, other means for deciding about program success are unavailable; therefore, standard setting helps address an empty slot in evaluators’ methodological toolbox. The fact that there are weaknesses in the argument for using methods such as the modified Angoff method to make high-stakes decisions need not deter evaluators from using the method during programs’ early years, when summative decisions are infrequent. Standard-setting methods also can help evaluators make decisions about program success in later years of evaluations that do have comparison groups. In this scenario, cutscores can help determine the extent to which both the program group and the comparison group have achieved at sufficiently high levels. In both these scenarios, cutscores should not be interpreted rigidly; they should be used to arrive at suggestions about program success. This use of cutscores helps make up for the procedural weaknesses of the method. As long as (a) cutscores are set for valid and reliable tests, (b) program objectives are clear, (c) program personnel serve as standard-setting judges, and (d) there are sufficient resources to conduct the standard setting well, standard setting can contribute to evaluators’ decisions.
As stated at the beginning of this article, standard-setting is a means of answering the question, How good is good enough? The conclusions about standard setting given in this article can serve as suggestions about other methods for addressing the question in evaluation studies. First, the stage of the evaluation should be considered. In the case of developing cutscores in program evaluations, the argument for using standard setting to help make evaluation decisions is the strongest in the first year of an evaluation. Other methods for deciding the quality of a program are appropriate in other phases. By way of contrast, experimental and quasi-experimental methods are appropriate when programs are mature. Second, the method for answering the question depends on the use of evaluation findings. Standard-setting methods used for deciding about program success need not be free of flaws when the decisions are formative or when the findings are used to make suggestions, as opposed to conclusive statements, about program success. Experimental and quasi-experimental approaches to evaluation are appropriate for providing conclusive findings about the quality and effectiveness of a program. Third, the context of the program should be taken into account (Smith, 1999). Evaluators using standard setting methods need to find judges who understand the context of the program, or else cutscores will not be well-informed. The importance of knowledge about context applies to all discussions about how good is good enough. Fourth, the method for answering the question must be feasible. It will not do to require, for example, that all studies use experimental or quasi-experimental designs when the setting or the resources of the evaluation do not allow them. 
The current push by federal educational research funding agencies to require these designs ignores the feasibility issue—particularly since these same officials do not back up their call for experimental and quasi-experimental designs with funding for expensive evaluations. These four aspects of evaluation should be considered when developing a minimal set of guidelines that evaluators should take into account when establishing the level of performance that a program should show if it is to be considered good enough.
References
Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600).
Brandon, P. R. (2002). Two versions of the contrasting-groups standard-setting method: A review. Measurement and Evaluation in Counseling and Development, 35, 167–181.
Brandon, P. R. (2004). Conclusions about frequently studied modified Angoff standard-setting topics. Applied Measurement in Education, 17, 59–88.
Brandon, P. R., & Higa, T. F. (1998, April). Setting standards to use when judging program performance in stakeholder-assisted evaluations of small educational programs. Paper presented at the meeting of the American Educational Research Association.
Camilli, G., Cizek, G. J., & Lugg, C. A. (2001). Psychometric theory and the validation of performance standards: History and future perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 445–475).
Cizek, G. J. (Ed.). (2001). Setting performance standards: Concepts, methods, and perspectives.
Cook, T. D., Leviton, L. C., & Shadish, W. R., Jr. (1985). Program evaluation. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (3rd ed.).
Fink, A., Kosecoff, J., & Brook, R. H. (1986). Setting standards of performance for program evaluations: The case of the teaching hospital general medicine group practice program. Evaluation and Program Planning, 9, 143–151.
Hanser, L. M. (1998). Lessons for the National Assessment of Educational Progress from military standard setting. Applied Measurement in Education, 11, 81–95.
Henry, G. T., McTaggart, M. J., & McMillan, J. H. (1992). Establishing benchmarks for outcome indicators: A statistical approach to developing performance standards. Evaluation Review, 16, 131–150.
Hurtz, G. M., & Auerbach, M. A. (2003). A meta-analysis of the effects of modifications to the Angoff method on cutoff scores and judgment consensus. Educational and Psychological Measurement, 63, 584–601.
Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 485–514).
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards (2nd ed.).
Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 53–88).
Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research.
Livingston, S. A., & Zieky, M. J. (1989). A comparative study of standard-setting methods. Applied Measurement in Education, 2, 121–141.
Lynch, K. B. (1987). The size of education effects: An analysis of programs reviewed by the Joint Dissemination Review panel. Educational Evaluation and Policy Analysis, 9, 55–61.
Mills, C. N., Melican, G. J., & Ahluwalia, N. T. (1991). Defining minimal competence. Educational Measurement: Issues and Practice, 10(2), 7–10.
Patton, M. Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.).
Rossi, P. H., & Freeman, H. E. (1993). Evaluation: A systematic approach (5th ed.).
Shadish, W. R., Cook, T. D., & Leviton, L. C. (1991). Foundations of program evaluation: Theories of practice.
Shepard, L. A. (1995). Implications for standard setting of the National Academy of Education Evaluation of the National Assessment of Educational Progress Achievement Levels. In Joint conference on standard setting for large-scale assessments. Vol. 2. Proceedings (pp. 143–160).
Smith, N. L. (1981). Constructing reasonable expectations in evaluation. Evaluation News, 2, 265–267.
Smith, N. L. (1999). A framework for characterizing the practice of evaluation, with application to empowerment evaluation. Canadian Journal of Program Evaluation, Special Issue, 39–68.
Wholey, J. S. (1979). Evaluation: Promise and performance.
Worthen, B. R., Sanders, J. R., & Fitzpatrick, J. L. (1997). Program evaluation: Alternative approaches and practical guidelines (2nd ed.).
Following an exposition of the ethical dimension, professional role and democratic rationale of standards in the evaluation community, this paper proposes an assessment framework for rating evaluation standards, illustrates its use on a sample of published norms[1] and offers lessons for the participatory elaboration of global evaluation standards.
The Meaning of Standards
Dictionaries do not draw sharp distinctions between principles, guidelines and standards. According to the Oxford English Dictionary, a principle is a proposition serving as the foundation of belief or action; a guideline is a general rule or piece of advice; and a standard means a thing serving as recognized example or principle to which others conform or should conform or by which the accuracy or quality of others is judged.
Thus, the words tend to be used interchangeably although the notion of principles is commonly perceived as aspirational; guidelines are frequently intended as recommendations that do not take precedence over the judgment of experienced practitioners[2] while standards is the preferred term for mandatory norms, accompanied by enforcement or certification mechanisms.
Since this paper evaluates the intrinsic value of the norms rather than their application it makes no distinction between principles, guidelines or standards. In any event, since no enforcement or certification mechanism exists within the fledgling evaluation profession, all published evaluation principles, guidelines or standards are predicated on voluntary rather than mandatory compliance[3] so that the difference between the terms is largely stylistic.
The Ethics of Standards
In industry, standards are used to impose uniformity in design characteristics or processes. They are needed to meet the demands of mass production and/or international commerce for goods and services. As a social practice on the other hand, standard making is designed to shape human behavior and interaction[4]. They help to achieve explicit or implicit policy goals. Intendedly or not, they promote the interests of particular groups and can restrain competition and creativity.
Hence, standard setting is legitimate only if it provides for lack of coercion, equal treatment and the informed consent of participants in an open process. By clarifying expectations and setting rules of conduct, professional standards promote accountability, facilitate comparability and enhance the reliability and quality of services provided. They imply shared values, dedication to professional excellence and voluntary compliance with ethical guidelines. In democracies, standards are set in the public sphere and usually involve civil society.
According to Jürgen Habermas, rational discourse among principled individuals is the only way to generate sound standards for knowledge creation: “Representations and descriptions are never independent of standards. And the choice of these standards is based on attitudes that require critical consideration by means of arguments, because they cannot be either logically deduced or empirically demonstrated.[5]” This means that standards are context dependent and dependent on the outcome of deliberative processes that are shaped by specific cultural environments.
The Professional Dimension
Whatever their label, all existing evaluation norms have been socially constructed through rational deliberation and context dependent processes. No consensus has yet been reached within the global evaluation profession as to the desirability of complying with internationally accepted norms. Thus, this paper is only meant as a contribution to an on-going debate about the future of the evaluation profession.
In most societies, principles, guidelines and standards are what distinguish a profession from a mere occupation. For some occupations, formal barriers to entry (e.g. academic degrees; certifications or licenses) help to protect the integrity of the profession. For others, informal criteria (e.g. a period of apprenticeship or a record of competitive achievement) suffice. But invariably the franchise enjoyed by a professional group is grounded on the presumption that its members are committed to live up to rules of conduct that protect the public interest[6].
Such rules underlie the social contract that allows professionals (and the organizations that employ them) to enjoy public trust, practice their craft without undue interference and charge for services rendered. On the supply side, standards enhance the professional stature of those who operate in conformity with them and promote good practices. On the demand side, they facilitate comparisons among providers of services, thus helping customers secure value for money.
Even if the case for evaluation standards is accepted in principle, there are differences of views on their desirable range and scope. Evaluators are still debating whether it is appropriate to set uniform standards to guide or control how evaluation professionals, commissioners, participants and users should behave (ethical norms), what concepts and practices evaluators should use (methods), the benchmarks their products should meet (quality), the outcomes they should achieve (utilization) or the instruments needed to ensure that agreed standards are met and results achieved in the public interest (verification).
Standards as a Democratic Imperative
According to David Marquand[7], democracy is characterized by a public domain where “citizens collectively define what the public interest is through struggle, argument, debate and negotiation.” Central to this process is an ethic of public service that “puts public duty and the public interest before market rewards and private interests”. In turn, the effective functioning of a pluralistic society requires professionals who operate autonomously, according to ethical standards. In such a system, professionals are motivated not by “patronage, clientism, connection” but by “professional pride, professional duty, professional authority and …professional career paths”.
The mandate of evaluation is to assess the merit and worth of public policies and programs on behalf of citizens and with their participation. The credibility of evaluation hinges on proper assurances of quality, objectivity and independence.
Public trust is the critical ingredient. Absent certification and accreditation, evaluators must take it upon themselves to “internalize a set of norms precluding them from abusing their monopoly position and exploiting their clients, and enjoining them to promote the public good”.
Evaluation standards in and by themselves do not generate good policy outcomes. Indeed, they may have unintended consequences depending on how they are generated and used. If they are centrally imposed and coercively implemented, they may have a chilling effect on creativity and innovation. They may also be viewed as redundant where the value of the evaluation services provided can be reliably gauged in terms of the impact on the quality of decisions reached (ascertained as an integral part of the evaluation process).
On the other hand, just as one does not judge auditors by the profitability of the companies they serve, it is inappropriate to judge evaluators by the effectiveness of the programs and policies being evaluated. While a byproduct of independent evaluation is to help policy makers and program managers do a better job (the advisory dimension of the profession), the primary responsibility of the evaluator in a democratic society is to enhance accountability, tell truth to power, illuminate policy options, promote public involvement and contribute to the transparency of decisions taken in the public interest.
Equally, asking the clients of evaluations to rule on their usefulness involves moral hazard. It may lead to “feel-good” evaluations tailored to what program managers want to hear rather than forthright assessments that protect the public interest. Evaluation needs to be responsive to the public interest and to operate without fear or favor. As in the accounting profession, the legitimacy of evaluations carried out by (or on behalf of) program managers requires independent validation against agreed standards.
In other words, for evaluators just as for accountants, the client is not always right. Evaluation needs to be independent of vested interests, including those of sponsors[8]. Irrespective of the funding source, evaluators are ultimately accountable to the public.
While evaluators should give close attention to issues raised by stakeholders, their credibility and integrity hinge on their objectivity and impartiality. This is why evaluation consultants, as well as public officials who carry out self-evaluations, need independent oversight that attests to the professionalism of their behavior and the validity of their findings.
Do as I Say – Not as I Do?
Ultimately, the case for evaluation standards rests on the golden rule: evaluators should practice what they preach. It would be perverse for evaluators to reject the discipline that they impose on others.
In order to determine the merit, worth and value of an activity, evaluators routinely identify relevant criteria of merit, i.e. they use standards to assess the results of programs and the performance of public officials. Thus, Carol Weiss[9] refers to standards in her definition of evaluation as “the systematic assessment of the operations and/or the outcomes of a program or policy compared to a set of explicit or implicit standards.”
For Evert Vedung[10], “the value component of evaluation presupposes at least one criterion of merit against which public interventions are judged”. In turn, Michael Scriven[11] observes that: “evaluation has two arms, only one of which is engaged in data gathering. The other arm collects, clarifies and verifies relevant values and standards”. With implacable logic, he adds: “anything can be evaluated, including evaluation”.
The design and adoption of commonly agreed evaluation standards would help to resolve the dilemma the profession currently faces in managing an exploding demand for evaluation services within an operating environment characterized by widespread unease about the uneven quality of evaluation products and services and the limited utilization of evaluation results. In setting standards, the emerging discipline of evaluation would emulate the experience of its forebears in the social sciences and the accounting profession.
But in order to make progress along the road of common evaluation standards, a paradox must be explored: if the case for evaluation standards is so strong, why has progress in formulating and endorsing evaluation standards been so slow, halting and partial? What are the limits of standards and what risks must be managed while designing and using them?
The Limits of Evaluation Standards
Evaluation determines the merit, worth and value of things[12]. It consists in collecting relevant evidence, identifying suitable evaluative standards and using methods of analysis that are valid and fair. Assuming a stable and predictable operating environment and provided the causal links that make up a results chain are known (and all evaluation actors are willing and able to comply with the reciprocal obligations that the agreed rules of the game imply) it should be enough to control the quality of outputs or to verify the ultimate impacts of an intervention to create the incentives needed to achieve the desired results.
In other words, the notion of standards is often associated with a linear conception of society predicated on rational behavior and predictable consequences of public policy actions. But in the real world, unintended consequences prevail, the operating context is unstable and the behavior of social actors is influenced by vested interests. The causal links between policy actions and impacts are subject to a wide range of external influences. Lack of precise knowledge regarding the connections between inputs, outputs, outcomes and impacts distorts decisions. This means that evaluation standards must take account of the volatility, complexity and contingency of public service tasks.
Like other social rules and protocols, standards are justified only if they generate social benefits in excess of their costs. Inappropriate standards can cause substantial harm by providing unwarranted assurances. Thus, crude performance indicators, simplistic league tables and performance assessments connected to inappropriate goals can destroy trust and erode the public service ethic. In particular, standards focused on only one stage of the results chain and rigidly applied (e.g. budget controls; quality assurance; inspection; auditing or evaluation) can create perverse incentives.
Conversely, applying and verifying standards at all stages of the results chain can lead to excessive rigidity and prohibitive transaction costs, especially where standards are mandatory and controls are tight. The dogmatic use of standards is evoked by the alternative dictionary definition of a standard as “a document accepted by a church as the authoritative statement of its creed”. Concerns about its restrictive implications can also be traced to the original meaning of the term: “a flag or figurehead attached to the upper part of a pole and raised to indicate a rallying point”; the “distinctive ensign of a sovereign, commander, nation”; the “flag of a cavalry regiment as distinct from the colors of an infantry regiment, etc.”
Such martial images have threatening connotations for evaluators committed to freedom of thought, diversity of perspectives and creativity in methods. They evoke the specter of coerced uniformity, mindless orthodoxy, methodological rigidity and bureaucratic interference. Hence the strong resistance to mandatory norms among “free thinking” professional evaluators who treasure the integrity of their craft and the freedom to select the methods most relevant to the evaluation challenges they face. This also explains the predilection of most professional associations for terms that are less threatening than the word standards (e.g. charter, guidelines, principles).
Diverse Doctrines
The potential chilling effect of standards justifies a gradual and participatory approach to their design and adoption. A pluralistic approach, sensitive to cultural differences, is fundamental. Special care is needed to avoid favoring one evaluation school over another. Not all evaluators endorse the notion that evaluators have a mandate to judge the performance of public policies and programs[13]. Some conceive of evaluation as a tool for understanding social phenomena. Others hold the view that evaluators are not entitled to question the framework of values or objectives pursued by program managers. Still others view evaluation as a tool for facilitating the achievement of consensus among groups.
Given this diversity, it is entirely legitimate for distinct evaluation schools to elicit different criteria of merit. This is why universal agreement on methodological norms has proven elusive. But all evaluation doctrines endorse judicious rules of conduct with respect to the ethical collection and interpretation of evidence, the validity of findings, etc. Thus, no ontological rationale exists for rejecting evaluation standards based on the notion that evaluation doctrines are manifold. On the other hand, due care should be taken to preserve the space that evaluation professionals need to practice their craft in line with their distinctive values and principles.
Beyond a central “core” of evaluation principles endorsed by all, each of the schools that make up the multi-faceted evaluation profession may choose to produce its own principles and methods. Indeed, transparency about the methodological standards used in evaluations ought to be encouraged: clarity about the purposes and roles that evaluation methodologies are designed to serve would help users make a reasoned choice among evaluation suppliers, in line with the “truth in labeling” principle.
Lessons from Accounting and Auditing
Standards are “rules of the game”. They define roles as well as desirable outcomes. They set the voluntary restraints that govern the behavioral relations among individuals or groups. These must be meaningful but not so strict as to paralyze action or hinder innovation. They can be used to assess the performance of all parties to an evaluation. Whether evaluations are carried out by individuals or by evaluation organizations, standards guide the design of evaluation processes.
Since evaluation is to the public sector what accounting and auditing is to the private sector, the lessons gained in the process of developing accounting and auditing standards are instructive. In corporate finance, reporting standards combined with verification and enforcement guarantee consistency and comparability of accounts. The vigorous debate about financial reporting standards between the
Professional associations of accountants and auditors devote considerable resources to standard setting and certification. They endorse the concept of international accounting standards. Such standards are meant to overcome the problems faced by multinational companies that operate in diverse national jurisdictions. The preparation of reports based on different national principles undermined public trust in corporate accounts, since inconsistencies in national accounting standards produced different judgments of financial performance for the same multinational company. Thus, the pressure for uniformity in accounting rules rose to protect the credibility, comparability and efficiency of business transactions and to facilitate cross border investments.
Similarly, with the internationalization of evaluation under the pressures of globalization, national policies and programs can no longer be held to account without a clear set of benchmarks or without reference to their cross-border implications. A global evaluation community is emerging, peer reviews of national policy performance are becoming routine and a growing international consensus behind new public management principles is fueling a demand for cross border consistency and transparency in public policy and program evaluations.
Thus, the lessons that have emerged in the process of generating universal accounting standards may have relevance for the design of evaluation standards[14]:
· Ownership: for legitimacy, standards should gain broad acceptance by professional associations and public authorities at all levels and this in turn requires that they be transparent, enforceable and useful.
· Tradeoffs between uniformity and relevance: the advantages of credibility, comparability and efficiency that uniformity delivers may conflict with the quality of the rules and their adaptability to different operating contexts. Principled compromises are needed and, where necessary, second best solutions should be adopted.
· Need for restraint in prescriptive rules: Diminishing returns result from efforts to make standards ever more precise and rigorous. Standards should be clear, broad and indicative rather than obscure, detailed and mandatory. They should be as simple as possible but not simpler[15].
The Genesis of National Evaluation Standards
For evaluation then, just as for accounting and auditing, standards are part of the social context of the profession. But to a far greater extent, the evaluation profession should adapt its methods to the unique features of individual evaluation assignments. This makes uniform standards for evaluation of public policies and programs far harder to develop than for accounting and auditing. On the other hand, most evaluation practitioners agree that good and bad practices can be distinguished. They accept the judgment of their peers about the quality of their work and they appreciate guidance about the conduct of their work.
In 1994, the Joint Committee on Standards for Educational Evaluation in the
The principles are very general and cannot be relied upon to provide pointed advice in individual cases. But this does not detract from their usefulness when supplemented by case studies, training and guidance by experienced evaluation managers. Within their limits, they have provided the evaluation profession with a serviceable framework for learning and sharing of experience about the ethical conduct of evaluations.
Similarly, the Joint Committee (JC) on Standards for Educational Evaluation published influential standards for the conduct of program evaluations. They deal with ethical, contractual and methodological aspects. The standards were generated through debate among leading exponents of different evaluation persuasions. They were framed in consultation with policy makers and users of evaluation in the education profession. A third edition is under preparation. It is expected to improve attention to cultural diversity issues.
The Swiss Evaluation Society, the German Evaluation Association and the African Evaluation Association have published official standards inspired by the US Joint Committee’s work and subsequently streamlined, refined or adapted[18]. By contrast, the UK Evaluation Society’s good practice guidelines issued in 2003 address explicitly the distinctive roles of evaluation commissioners, evaluators and participants and they also provide guidance for institutional self-evaluations.
The concise charter issued by the French Evaluation Society in the same year stresses the commonality of obligations of evaluators and evaluands while on-going work by the Australasian Evaluation Society is expected to reach well beyond the rights and responsibilities of commissioners, evaluators and other stakeholders codified in its 2002 Guidelines for the Ethical Conduct of Evaluations in order to address more explicitly issues of utilization and integration of evaluation within the policy process.
Criteria of Value for Evaluation Standards
Standard setting in evaluation needs to address social learning as well as individual learning. As Oscar Wilde famously observed, “experience is the name everyone gives to one’s mistakes”. But evaluation helps individuals, groups and organizations learn both from their own mistakes and from the mistakes of others. This is far cheaper and less painful than trial and error. At the level of the individual decision maker, accountability for decisions taken provides incentives for learning while learning improves the quality of decisions and broad based participation helps to protect the public interest[19].
Accordingly, the three main functions of evaluation are (i) to reduce errors in decision-making through knowledge, (ii) to make authority responsible through increased accountability, and (iii) to promote public involvement in public affairs. These three roles (accountability, learning and participation) are inextricably linked. Thus defined they help to determine how the profession should be judged. Specifically, evaluation standards should address three distinct dimensions: (i) collective decision-making; (ii) enhanced accountability; (iii) informed participation.
First, evaluation contributes to effective decision-making by nurturing principled solutions to complex public policy problems. Since neither the state nor private agents have the power to unilaterally define their actions, social decision-making involves bargaining. Evaluation improves the social rewards of the game by providing impartial evidence to all parties and facilitating progress towards agreed solutions. This is where standards of propriety come in. They ensure that evaluation is conducted with regard for the welfare of all those involved in the evaluation and affected by its results. Resolving conflicts of value in a constructive way and achieving shared objectives among group members facilitates collective action. Through transparency, standards ensure that all parties and the public are adequately informed about evaluation findings.
Second, evaluation levels the playing field of the collective action game by helping to increase the responsibility of authority. By providing objective validation (or censure) of the actions taken by those in authority, it encourages the powerful to behave responsibly and fairly. Evaluation standards add value by discouraging the capture of the process by vested interests. By dispassionately examining the record of past interventions and putting social science disciplines to work, evaluation serves as a counterweight to the prejudices and self-serving opinions of decision makers. This is why independence and integrity standards are needed to help protect the value of evaluations.
Third, evaluation contributes to public involvement in decisions by reducing information asymmetries and providing advice to the public and to decision makers that can be used to enhance the outcomes of policies and programs. Through participation, evaluation amplifies the influence of those who may not have direct access to decision makers–employees, clients, the poor, etc. Access to relevant information is often too costly to be secured by individual actors. Effective collection and interpretation of the evidence requires competence. In this context, standards of quality and comprehensiveness of guidelines (to cover all stages of the evaluation cycle) provide comfort about the validity, accuracy and objectivity of evaluation findings.
Beyond the seven criteria implied by the three main functions of evaluation (propriety, transparency, independence, integrity, competence, quality and comprehensiveness), the net value added by evaluation is dependent on the efficiency of the evaluation process and the utilization of evaluation findings. These are only partially under the control of evaluators. Evaluation governance factors also intervene.
Getting Results
In learning organizations, evaluation helps to keep transaction costs low. In rigid, poorly managed, unaccountable organizations, evaluation leads to tense interactions, “cover ups” and even intimidation, so that evaluation ends up raising transaction costs with limited benefits in terms of improved organizational effectiveness. In learning organizations, objective evaluations are used to improve the relevance and impact of interventions and, as a result, unlock enormous benefits at modest cost. Not so in poorly managed organizations where the evaluation process tends to be captured for the personal benefit of those in authority.
This means that, beyond evaluation supply factors (addressing the right issues, conducting evaluations with efficiency, ensuring fairness and professionalism, etc.), the economics of evaluation hinge on demand factors. How commissioners and program managers behave before, during and after the evaluation process is critical to the derivation of social benefits out of the evaluation process. This is why, if evaluation standards are to improve the functioning of society, they must deal explicitly with the distinctive accountabilities and the reciprocal obligations of evaluators, evaluation commissioners and program managers.
In particular, the roles of the independent evaluator in the construction of results based management systems and other real time monitoring and evaluation processes (that are an integral part of public sector management) would benefit from codification. Not enough attention has been given to this interface in traditional evaluation standards. Nor have the responsibilities of program managers been explicitly considered in the definition of evaluation standards. Conceptually and operationally, this gap has been filled by the “evaluability” doctrine[20]. Looking ahead, it should have a place in evaluation standards geared to the achievement of results.
Ideally, evaluation standards should relate good practices to the institutional conditions under which evaluation actually takes place. This means that evaluation governance as well as evaluation practice would benefit from well-conceived standards. Without organizational ethics, the moral hazards of evaluation rise. In poorly managed organizations, evaluation is used punitively to name and shame. In such operating environments, managers use evaluation to censure those who are low on the totem pole. This is why evaluation standards should enhance accountability of authority as well as assist authority through learning from experience.
In other words, evaluation standards do not deliver optimum results unless they focus on the reciprocal obligations of all those involved in an evaluation. Institutions as well as individual evaluators should be guided by evaluation standards. Ethical considerations and technical soundness of quality standards matter but they should be embedded within suitable governance frameworks[21]. Furthermore, it is desirable that the standards be “owned” by the individuals, groups and organizations that use them. Only then are they likely to devote the resources and the skills needed to abide by the standards and make use of evaluation findings.
It follows that the very process of formulating and implementing standards should be viewed as a social learning mechanism (that is subject to evaluation). Experience from accounting and other professions suggests that the usefulness of standards hinges on their actual use and, in due course, their independent verification. Effective utilization of standards is facilitated by authoritative accumulation of evidence from adjudicated cases, especially those where the application of agreed standards has been contested.
Finally, tacit norms tend to spread from organizations that are recognized as leaders in their field to the rest of the profession. Thus, sharing of good practice and training programs act as transmission belts between standards and their effective utilization. This is one reason why professional associations have a comparative advantage in the formulation and verification of standards.
A Comparative Assessment
A cursory comparison between national evaluation standards brings out the following common features:
· Brevity: the standards in the sample are invariably stated in concise and non-technical terms; they contain 3-6 principles and 11-44 guidelines. The listing of standards varies between one and six pages[22].
· Scope: most guidelines focus on the ethical conduct of public program and policy evaluations while the
· Multiple audience: all standards in the sample are directed to the individuals and organizations that commission, prepare, conduct and use evaluations as well as to stakeholders affected by the evaluation or who have an interest in the results.
· Process orientation: the standards tend to eschew methodological aspects; instead they concentrate on behavioral, contractual and ethical considerations.
· Asymmetry: most standards give far greater weight to the responsibilities of evaluators and the rights of other stakeholders than to the obligations of evaluation commissioners and program managers with the notable exception of the
That said, there are substantial differences among the published standards. To carry out a comparative assessment among them, based on the considerations elaborated in the sections above, the author used the ten criteria of merit identified above as an evaluation template:
· Propriety: preservation of the dignity, security and privacy of people; protection of confidential information; prior informed consent of participants.
· Transparency: agreed expectations about objectives and methods are disseminated to stakeholders; evaluation reports disclosed to stakeholders and the general public without modification.
· Independence
· Integrity: disclosure and avoidance of actual or potential conflicts of interest; contestability of evaluation judgments; evaluator’s access to relevant information.
· Competence: requirements about the knowledge, skills and experience expected of evaluators.
· Quality: guidance about the practices needed to achieve evaluation relevance, credibility, rigor and objectivity; norms for achieving fair and valid evaluation findings and recommendations; practices that generate constructive relationships among participants.
· Comprehensiveness: coverage of all phases of the evaluation cycle – from commissioning to dissemination and utilization.
· Efficiency: economy in use of skills, funds and administrative resources; limited burden on participants.
· Utilization: responsiveness to the public interest and to users’ needs; participation of stakeholders in the evaluation; timely delivery; clarity of presentation.
· Governance: clarity of roles between evaluation commissioners, evaluators and participants; appropriate contractual relationships; oversight of self-evaluation by independent evaluation; “evaluability” norms for program and policy design.
Table 1 presents the summary results based on the admittedly subjective judgment of the author. Equal weights were ascribed to each category. Based on a textual analysis of their content, each of the national standards was rated from 1 to 6, where 1 represents a highly unsatisfactory treatment and 6 a highly satisfactory treatment. The exercise was carried out for illustrative purposes only, i.e. to demonstrate that evaluation standards can be valued. No claim is made for the accuracy of the ratings. And it goes without saying that the process followed does not comply with sound evaluation process norms. Validation of the criteria by an expert panel combined with independent scoring by qualified evaluators would be required to attest to the reliability of the individual ratings.
Table 1. Ratings of National Evaluation Standards
|  |  |  |  |  | Switz. |  |  | Average |
|---|---|---|---|---|---|---|---|---|
| Propriety | 6 | 5 | 5 | 4 | 5 | 6 | 5 | 5.1 |
| Transparency | 4 | 4 | 4 | 4 | 4 | 6 | 4 | 4.3 |
| Independence | 3 | 1 | 3 | 1 | 1 | 4 | 1 | 2.0 |
| Integrity | 2 | 2 | 2 | 2 | 4 | 4 | 2 | 2.6 |
| Competence | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5.0 |
| Quality | 5 | 3 | 3 | 6 | 6 | 4 | 6 | 4.7 |
| Comprehensiveness | 6 | 2 | 3 | 5 | 5 | 5 | 5 | 4.4 |
| Efficiency | 1 | 4 | 1 | 5 | 5 | 1 | 5 | 3.1 |
| Utilization | 2 | 2 | 2 | 5 | 5 | 6 | 5 | 3.8 |
| Governance | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 3.3 |
| Average | 3.8 | 3.1 | 3.1 | 4.0 | 4.3 | 4.5 | 4.1 |  |
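The equal-weight averaging described for Table 1 can be reproduced with a few lines of code. The sketch below is illustrative only: it transcribes the ratings as they appear in the table, uses placeholder positions rather than names for the seven national standards, and assumes that the unlabeled third row is the Independence criterion. The function names are hypothetical.

```python
# Ratings transcribed from Table 1 (1 = highly unsatisfactory, 6 = highly satisfactory).
# Each list holds the seven national standards in table order; the column
# names are not reproduced here, so positions stand in for them.
# Assumption: the unlabeled third row of the table is "Independence".
RATINGS = {
    "Propriety":         [6, 5, 5, 4, 5, 6, 5],
    "Transparency":      [4, 4, 4, 4, 4, 6, 4],
    "Independence":      [3, 1, 3, 1, 1, 4, 1],
    "Integrity":         [2, 2, 2, 2, 4, 4, 2],
    "Competence":        [5, 5, 5, 5, 5, 5, 5],
    "Quality":           [5, 3, 3, 6, 6, 4, 6],
    "Comprehensiveness": [6, 2, 3, 5, 5, 5, 5],
    "Efficiency":        [1, 4, 1, 5, 5, 1, 5],
    "Utilization":       [2, 2, 2, 5, 5, 6, 5],
    "Governance":        [4, 3, 3, 3, 3, 4, 3],
}


def criterion_means(ratings):
    """Mean rating per criterion across all standards (row averages)."""
    return {name: round(sum(vals) / len(vals), 1) for name, vals in ratings.items()}


def standard_means(ratings):
    """Equal-weight mean rating per standard across all criteria (column averages)."""
    columns = zip(*ratings.values())
    return [round(sum(col) / len(col), 1) for col in columns]


if __name__ == "__main__":
    for name, value in criterion_means(RATINGS).items():
        print(f"{name:<18} {value}")
    print("Per-standard averages:", standard_means(RATINGS))
```

Run on the transcribed figures, the per-standard averages match the bottom row of the table. The Utilization row sums to 27 over seven standards, which rounds to 3.9 rather than the 3.8 printed; this may reflect truncation in the original.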
Most national standards give considerable emphasis to the imperative of doing no harm and stress the rights of evaluation participants and the protection of confidentiality. Some mandate a right of prior informed consent for evaluation participants preferably in writing. In general, the standards give salience to the necessity of ensuring that evaluators have appropriate knowledge and skills. The critical importance of quality standards is also stressed, except for the charters of
The lowest ratings are those related to the independence and integrity of the function. Only the
None of the other national standards address the risks inevitably associated with cases where evaluation commissioners have major executive responsibilities for the programs being evaluated. Integrity ratings are relatively low because conflict of interest problems tend to be treated lightly if at all and contestability procedures are not provided for. Where conflicts of interest are treated, it is in terms of requiring their disclosure rather than automatic disqualification from taking part in the evaluation.
Nevertheless,
On the other hand, the unimpeded access to relevant information (an acid test of independence for evaluators) while encouraged in some cases is not guaranteed by any of the national standards. Nor under the rubric of transparency does the inalienable right of the public to access uncensored evaluation reports figure explicitly in any of the national standards although the
The Canadian guidelines do not address the disclosure of evaluation reports. Instead they emphasize the responsibility of evaluators to their clients with respect to confidentiality, privacy and ownership of findings and reports. The US guidelines (along with the German and Swiss versions that they have inspired) take a middle and somewhat ambiguous course by requiring that the “results” of the evaluation be made available to all potentially affected persons as well as to all others who have a legitimate claim to receive them.
Similarly, the French charter opines that public dissemination of evaluation results is desirable but reserves actual disclosure modalities to a negotiated outcome at the time of evaluation commissioning. By contrast, the Australasian guidelines are explicit in requiring the consent of the evaluator for any amendments to the final report but they do not compel the commissioners to disclose evaluation reports to the public. Instead, they enjoin commissioners not to breach the integrity of the reports in their pronouncements.
The Road Ahead
The above analysis brings out the following conclusions that may help trace a road map for future work on evaluation standards. Utilization ranks highest for the
Most national standards give far more weight to the obligations of evaluators than to those of policy makers, program managers and evaluation commissioners. They do not address criteria of program “evaluability” or the measures needed to ensure effective utilization of evaluation results. They do not instruct evaluation commissioners to support evaluators in their evaluative work; provide them with unencumbered access to data; protect their independence; and avoid retribution, intimidation and other means of evaluation capture.
None of the standards makes public officials accountable for the effective use of evaluation results in the public interest. This would require the formulation of standards that address explicitly the institutional prerequisites of organizational learning, e.g. based on the accountability principles of the new public management movement. This would involve codification of the distinctive roles of independent evaluation, self-evaluation, inspection and auditing in various administrative environments.
For evaluation standards to be fully relevant, evaluators, evaluation commissioners and program managers would need specific guidance with respect to results-based management systems, quality assurance processes, results-based scorecards and the selection and use of performance indicators in public service delivery. This is where the current frontier of program evaluation activities lies. Similarly, the regulatory interface between citizens, government, voluntary organizations and the private sector would benefit from judicious guidelines. Other promising areas for standard setting include the design of appropriate linkages between independent evaluation, budget allocation processes and personnel evaluation practices.
National audit offices have often trespassed profitably into the evaluation domain through value for money and comprehensive audits. Conversely, systematic evaluations of the effectiveness of public auditing and inspection processes would have considerable merit and should be encouraged. In most industrial democracies, public officials feel victimized by “inspection overload”, taxpayers feel powerless to influence the quality of public services and performance indicators are widely criticized because they fail to encourage beneficiary involvement and genuine quality assurance. This suggests a need for more systematic evaluations of control functions, including of inspection and auditing…and of evaluation itself.
Towards Global Standards
Finally, future work on evaluation standards should take account of the transnational features that now characterize the profession. Increasingly, evaluators are called upon to assess public policies and programs that extend beyond national borders. As a result, a global evaluation community is in the making. It is seeking a common language in order to facilitate evaluation assignments across national boundaries. Given this new context, harmonization of evaluation standards across national boundaries would be desirable. Demands for cross-border consistency and transparency in evaluation have become more pressing.
But a global approach not grounded in national and regional experience would involve risks of coercion, rigidity and misplaced homogeneity. To achieve credibility and legitimacy, global standards should be grounded in the initiatives of national evaluation associations. Consistent evaluation standards that would respect universal ideals of peace and justice would promote high quality work in evaluation, foster trust in the profession and contribute to the advent of an international evaluation community. There is now a wealth of experience in the design and implementation of national standards. It would be appropriate for such standards to be used as building blocks for a global initiative.
The global standards should be inclusive, embrace new stakeholders, accommodate all evaluation doctrines and focus on institutions rather than the individual evaluator. A comprehensive approach to standards (capturing the ethics of evaluation, its governance, its methods and its linkages to policy making and resource allocation processes) would be desirable, so that the sterile debate between principles-based and rules-based standards that has plagued the development of universal accounting standards is not repeated and the results-based approach that is the hallmark of the evaluation profession is given a chance.
National evaluation societies should take the lead in the design of global evaluation standards. Gradual, organic progress is more likely to yield ownership than hasty standardization. To provide credibility to the formulation of evaluation standards, policy makers and representatives of the private and voluntary sectors should have their say, and due process, including broad-based public consultations, will have to be observed. Last but not least, in order to ensure legitimacy, special efforts should be made to involve evaluators of the developing world, where 85% of the world’s peoples live.
About the Author
Robert Picciotto
is Visiting Professor at King's College,
Determining Causality in Program Evaluation and Applied Research: Should
Experimental Evidence Be the Gold Standard?
While there is
little disagreement about the need for, and value of, program evaluation, there
remain major disagreements in the field about best practices (Donaldson &
Lipsey, in press). For example, Donaldson and Scriven (2003) invited a diverse
group of evaluators to
This uneasy peace
seemed to revert to overt conflict in late 2003, when the U.S. Department
of Education’s
AEA STATEMENT
November 24, 2003
Dear Colleagues,
We encourage AEA members to share
their views on Scientifically Based Evaluation Methods with the U.S. Department
of Education. Up to now a number of members have shared their views with other
members on EvalTalk. This discussion has been helpful in clarifying our
thoughts and in presenting potential arguments, but NOW it is time for AEA
members to share their views directly with the Department of Education.
A statement has been prepared by a
team of distinguished evaluators including: Randall Davies, Ernest House, Cheri
Levenson, Linda Mabry (chair), Sandra Mathison and Michael Scriven. This team
received valuable assistance from: Lois-ellin Datta, Burt Perrin, Katherine
Ryan and Bob Williams. We are grateful to this team for their rapid response to
this proposal. This statement has been approved by the current and future
Executive Committees of the Board of the American Evaluation Association,
including:
Molly Engle, 2002 President
Richard Krueger, 2003 President
Nick Smith, 2004 President
Sharon Rallis, 2005 President
Nanette Keiser, 2002-2003 Treasurer
Kathleen Bolland, 2004 Treasurer
We encourage AEA members to share
their thoughts directly to the U.S. Department of Education and possibly with
legislative leaders. If you agree with the AEA statement, you might indicate
your support of the AEA statement.
OR
If you wish to offer other arguments
or points of views, please submit those as well.
Responses are to be sent to:
Margo K. Anderson, U.S. Department of
Education, 400 Maryland Avenue, SW., Room 4W333, Washington, DC 20202-5910
Or by internet to: comments@ed.gov
and include the term "Evaluation" in the subject line of your electronic
message. Comments must be received on or before December 4th.
Sincerely
Richard Krueger, President
American Evaluation Association
* * * * *
American Evaluation Association Response
To
Notice of proposed priority, Federal Register RIN 1890-ZA00,
November 4, 2003
"Scientifically Based Evaluation Methods"
The American Evaluation Association
applauds the effort to promote high quality in the U.S. Secretary of
Education's proposed priority for evaluating educational programs using
scientifically based methods. We, too, have worked to encourage competent
practice through our Guiding Principles for Evaluators (1994), Standards for
Program Evaluation (1994), professional training, and annual conferences.
However, we believe the proposed priority manifests fundamental
misunderstandings about (1) the types of studies capable of determining
causality, (2) the methods capable of achieving scientific rigor, and (3) the
types of studies that support policy and program decisions. We would like to
help avoid the political, ethical, and financial disaster that could well
attend implementation of the proposed priority.
(1) Studies capable of determining
causality. Randomized control group trials (RCTs) are not the only studies
capable of generating understandings of causality. In medicine, causality has
been conclusively shown in some instances without RCTs, for example, in linking
smoking to lung cancer and infested rats to bubonic plague. The secretary's
proposal would elevate experimental over quasi-experimental, observational,
single-subject, and other designs which are sometimes more feasible and equally
valid.
RCTs are not always best for
determining causality and can be misleading. RCTs examine a limited number of
isolated factors that are neither limited nor isolated in natural settings. The
complex nature of causality and the multitude of actual influences on outcomes
render RCTs less capable of discovering causality than designs sensitive to
local culture and conditions and open to unanticipated causal factors.
RCTs should sometimes be ruled out
for reasons of ethics. For example, assigning experimental subjects to educationally
inferior or medically unproven treatments, or denying control group subjects
access to important instructional opportunities or critical medical
intervention, is not ethically acceptable even when RCT results might be
enlightening. Such studies would not be approved by Institutional Review Boards
overseeing the protection of human subjects in accordance with federal statute.
In some cases, data sources are
insufficient for RCTs. Pilot, experimental, and exploratory education, health,
and social programs are often small enough in scale to preclude use of RCTs as
an evaluation methodology, however important it may be to examine causality
prior to wider implementation.
(2) Methods capable of demonstrating
scientific rigor. For at least a decade, evaluators publicly debated whether
newer inquiry methods were sufficiently rigorous. This issue was settled long
ago. Actual practice and many published examples demonstrate that alternative
and mixed methods are rigorous and scientific. To discourage a repertoire of
methods would force evaluators backward. We strongly disagree that the
methodological "benefits of the proposed priority justify the costs."
(3) Studies capable of supporting
appropriate policy and program decisions. We also strongly disagree that
"this regulatory action does not unduly interfere with State, local, and
tribal governments in the exercise of their governmental functions." As
provision and support of programs are governmental functions so, too, is
determining program effectiveness. Sound policy decisions benefit from data
illustrating not only causality but also conditionality. Fettering evaluators
with unnecessary and unreasonable constraints would deny information needed by
policy-makers.
While we agree with the intent of ensuring
that federally sponsored programs be "evaluated using scientifically based
research . . . to determine the effectiveness of a project intervention,"
we do not agree that "evaluation methods using an experimental design are
best for determining project effectiveness." We believe that the
constraints in the proposed priority would deny use of other needed, proven,
and scientifically credible evaluation methods, resulting in fruitless
expenditures on some large contracts while leaving other public programs
unevaluated entirely.
Statement prepared by: Randall Davies, Ernest House,
Cheri Levenson, Linda Mabry (chair), Sandra Mathison and Michael Scriven. This
team received valuable assistance from: Lois-ellin Datta, Burt Perrin,
Katherine Ryan, and Bob Williams.
Opposition to the AEA Statement
An influential group of senior members of the American Evaluation Association opposed the AEA Statement and felt they had not been appropriately consulted as active, long-term members of AEA. In response to President Krueger’s call for members to share their individual views on this matter, a new statement, now referred to as the “NOT AEA STATEMENT” (as seen on Evaltalk), was submitted to the U.S. Department of Education:
NOT THE AEA STATEMENT
Posted on Evaltalk on: 12-3-2003
AEA members:
The statement below has been sent to
the Department of Education in response to its proposal that
"scientifically based evaluation methods" for assessing the
effectiveness of educational interventions be defined as randomized experiments
when they are feasible and as quasi-experimental or single-subject designs when
they are not.
This statement is intended to support
the Department's definition and associated preference for the use of such
designs for outcome evaluation when they are applicable. It is also intended to
provide a counterpoint to the statement submitted by the AEA leadership as the
Association's position on this matter. The generalized opposition to use of
experimental and quasi-experimental methods evinced in the AEA statement is
unjustified, speciously argued, and represents neither the methodological norms
in the evaluation field nor the views of the large segment of the AEA
membership with significant experience conducting experimental and
quasi-experimental evaluations of program effects.
We encourage all AEA members to
communicate their views on this matter to the Department of Education and
invite you to endorse the statement below in that communication if it is more
representative of your views than the official AEA statement. [Comments can be
sent to the Dept of Ed through Dec. 4 at comments@ed.gov with
"Evaluation" in the subject line of the message].
************************************
This statement is in response to the
Secretary's request for comment on the proposed priority on Scientifically
Based Evaluation Methods. We offer the following observations in support of
this priority.
The proposed priority identifies
random assignment experimental designs as the methodological standard for what
constitutes scientifically based evaluation methods for determining whether an
intervention produces meaningful effects on students, teachers, parents, and
others. The priority also recognizes that there are cases when random
assignment is not feasible and, in such cases, identifies quasi-experimental
designs and single-subject designs as alternatives that may be justified by the
circumstances of particular evaluations.
This interpretation of what
constitutes scientifically based evaluation strategies for assessing program effects
is consistent with the presentations in the major textbooks in evaluation and
with widely recognized methodological standards in the social and medical
sciences. Randomized controlled trials have been essential to understanding
what works, what does not work, and what is harmful among interventions in many
other areas of public policy including health and medicine, mental health,
criminal justice, employment, and welfare. Furthermore, attempts to draw
conclusions about intervention effects based on nonrandomized trials have often
led to misleading results in these fields and there is no reason to expect this
to be untrue in the social and education fields. This is demonstrated, for
example, by the results of randomized trials of facilitated communication for
autistic children and prison visits for juvenile offenders, which reversed the
conclusions of nonexperimental studies of these interventions.
Randomized trials in the social
sector are more frequent and feasible than many critics acknowledge and their
number is increasing. The Campbell Collaboration of Social, Psychological,
Educational, and Criminological Trials Register includes nearly 13,000 such
trials, and the development of this register is still in its youth.
At the same time, we recognize that
randomized trials are not feasible or ethical at times. In such circumstances,
quasi-experimental or other designs may be appropriate alternatives, as the
proposed priority allows. However, it has been possible to configure practical
and ethical experimental designs in such complex and sensitive areas of study
as pregnancy prevention programs, police handling of domestic violence, and
prevention of substance abuse. It is similarly possible to design randomized
trials or strong quasi-experiments to be ethical and feasible for many
educational programs. In such cases, we believe the Secretary's proposed
priority gives proper guidance for attaining high methodological standards and
we believe the nation's children deserve to have educational programs of demonstrated
effectiveness as determined by the most scientifically credible methods
available.
The individuals who have signed below
in support of this statement are current or former members of the American
Evaluation Association (AEA). Included among us are individuals who have been
closely associated with that organization since its inception and who have
served as AEA presidents, Board members, and journal editors. We wish to make
clear that the statement submitted by AEA in response to this proposed priority
does not represent our views and we regret that a statement representing the
organization was proffered without prior review and comment by its members. We
believe that the proposed priority will dramatically increase the amount of
valid information for guiding the improvement of education throughout the
nation. We appreciate the opportunity to comment on a matter of this importance
and support the Department's initiative.
Signed by:
Leonard Bickman
Professor of Psychology, Psychiatry,
and Public Policy at Vanderbilt University, Associate Dean, and Director of The
Center for Mental Health Policy at the Vanderbilt Institute for Public Policy
Studies; Coeditor of the Sage Publications Applied
Social Research Methods Series and the Handbook
of Applied Research Methods and the editor of the Journal, Mental Health Services Research;
recipient of the American Psychological Association's Public Interest Award for
Distinguished Contribution to Research in Public Policy and the American
Evaluation Association Outstanding Evaluation award; past president of the
American Evaluation Association.
Robert F. Boruch
Professor in the Graduate School of
Education, Fels Institute for Government, and the Statistics Department of the
Wharton School of Business at the University of Pennsylvania; Fellow of the
American Statistical Association and the American Academy of Arts and Sciences;
recipient of the American Evaluation Association Myrdal Award for Evaluation
Practice and the Policy Studies Organization's Donald T. Campbell Award;
founder of the Evaluation Research Society, a parent to the current American
Evaluation Association.
Thomas D. Cook
Joan and Serepta Harrison Chair in
Ethics and Justice and Professor of Sociology, Psychology, Education and Social
Policy at Northwestern University; Coauthor of Experimental and Quasi-Experimental Designs for Generalized Causal
Inference, Quasi-Experimentation:
Design and Analysis Issues for Field Settings, and Foundations of Program Evaluation: Theories of Practice; Fellow of
the American Academy of Arts and Sciences and the American Academy of Political
and Social Science; recipient of the American Evaluation Association Myrdal
Award for Evaluation Science, the Donald Campbell Award for Innovative
Methodology from the Policy Studies Organization, and the Distinguished
Scientist Award of Division 5 of the American Psychological Association.
David S. Cordray
Professor of Public Policy and
Psychology at Vanderbilt University; Coauthor, Evaluation methods for social
intervention, Annual Review of Psychology; past President and Board Member of
the American Evaluation Association.
Gary Henry
Professor of Public Administration
and Urban Studies, Political Science and Educational Policy Studies at the
Andrew Young School of Policy Studies, Georgia State University; Coauthor of Evaluation: An Integrated Framework for
Understanding, Guiding, and Improving Policies and Programs; former
Editor-in-chief of New Directions for
Evaluation; recipient of the American Evaluation Association Outstanding
Evaluation award and the American Society for Public Administration and Center
for Accountability and Performance Joseph Wholey Distinguished Scholarship
Award; Board Member of the American Evaluation Association.
Mark W. Lipsey
Director of the Center for Evaluation
Research and Methodology and Senior Research Associate at the Vanderbilt
Institute for Public Policy Studies; Coauthor of Evaluation: A Systematic Approach; former Editor in Chief of New Directions for Program Evaluation;
recipient of the American Evaluation Association Lazarsfeld Award for
Evaluation Theory.
Peter H. Rossi
Stuart A. Rice Professor of Sociology
and Professor Emeritus at the University of Massachusetts at Amherst; Fellow of
the American Academy of Arts and Sciences and the American Association for the
Advancement of Science; Coauthor of Evaluation:
A Systematic Approach, Thinking About
Program Evaluation, and Program
Evaluation in Education, When? How? To What Ends?; recipient of the
American Sociological Association Commonwealth Award and the American
Evaluation Association Myrdal Science Award.
Lee Sechrest
Professor Emeritus of Psychology at
the University of Arizona and founder of the Evaluation Group for Analysis of
Data; recipient of the American Evaluation Association Myrdal Award for
Evaluation Practice and the Distinguished Scientific Contribution Award from
the Division of Evaluation, Measurement, and Statistics, of the American
Psychological Association; past president of the American Evaluation Association
and the Division of Evaluation, Measurement, and Statistics of the American
Psychological Association.
************************
The 2004 Claremont Debate
The exchange above about the role of randomized control trials in program evaluation practice in educational settings set the stage for the 2004 Claremont Debate.
The apparent
resurgence of issues reminiscent of the well-known quantitative-qualitative
paradigm wars in evaluation has the potential to be destructive and to stunt
the healthy development of the discipline and profession. In an effort to seek
a deeper understanding of the current dispute, and to possibly discover a
middle ground or productive resolution,
Selected Excerpts from Mark Lipsey’s Opening Comments
“In this context, it seems to me that there are at least three topics that we might discuss.”
“One has to do with the way randomized trials appear in government agencies and the legislation and so on, some of which is simplistic and inept, as uncharacteristic as that is of government activity.”
“Another thing we might talk about is the little flap in the American Evaluation Association (AEA) that involves the stance that was taken last year opposing an obscure division of the Department of Education to try to bring in some randomized evaluations to some of the projects it was funding. Since this event is being sponsored by an AEA Affiliate, that is a possibility. I’d be happy to explain to you why I think the AEA now has the same relationship to the Field of Evaluation as the Flat Earth Society has to the Field of Geology.”
“The third thing we might talk about is the methodological issue and what is actually at stake in these methodological critiques. That is actually what I want to talk about, but if anyone, maybe the audience, or Michael wants to talk about the others, then I’d be happy to do that.”
“We really are poorly served by this gold standard terminology. I think that when you use randomized experiments, which I am basically going to defend in this context, they are much like what Winston Churchill once said about democracy. He said, ‘It’s the worst form of government except for all the others that have been tried from time to time.’ I do not think this is the gold standard. I think that for impact assessment randomized experiments are the worst methodology except for some of the others that have been tried from time to time. That is pretty much my theme here.”
“Experimental and quasi-experimental designs have been around a long time and have well known properties. What’s really new is this broadside against them from certain research communities.”
“This issue has evoked mostly a yawn in areas where intervention research and program evaluation is done broadly. So, in mental health, public health, drug prevention, medicine, chronic delinquency evaluations, and a whole range of areas this is not a particularly exciting topic where randomized field trials are well respected, well known, widely used, and understood to be something of the state of the art for doing impact assessments. The reactions I’ve seen have come predominantly from the education research culture and to a certain extent from one wing of economists that work in this field that have an interesting take on it. I will get to that later on.”
“Let me turn now to the non-experimental approaches. This is an area that has fascinated me. Back when the flap was going on, methodological pluralism was all over the Evaltalk. I kept asking respondents and finally gave up on what these other methods were that were supposed to be equally valid, and the most interesting list came out: epidemiological methods, observational correlation modeling, realist methods, case studies, qualitative, ethnographic, Glaser and Strauss’ grounded theory, and from Michael Scriven the modus operandi technique, forensic analysis, direct observation, all put forth in establishing the effects of programs.”
“I have in recent years, every time I see somebody putting forward the argument that qualitative methods could be used to assess program effects, I’ve been writing them for some examples. Show me a case where this was done convincingly.”
“Why is the education research culture so riled up about randomized experiments? Here are a couple of possibilities. In all the politics this year, the Bush Administration, the Department of Education, the No Child Left Behind Act, there’s a lot not to like there, okay? They have been pushing for randomized designs, so we may as well not like those too. The biggest factor I think is ideological. The education research culture bought into constructivism and postmodernist epistemologies and so on really big time and there is a lot of ideological opposition. Tom Cook calls it science phobia to quantitative methods and experimentation and so on. Third, I think that there is a considerable amount of ignorance, not stupidity, not stupidity, but ignorance.”
Selected Excerpts from Michael Scriven’s Response
“Well, apart from the character assassination at the end, which I can tell you in the education community there may be people in it about which those things can be said, but the greatest attacks on constructivism are from people within the education community. So, there are plenty of others like us who absolutely reject all of that crap and so, it is certainly not true. Some of my friends are also on the side of the angels over there, like Tom Cook, for the new move. So, no, I don’t think that is really a very plausible account of the story.”
“I think that if
you want to look at reasons why people objected, the three big ones are these.
One, the objections were not at all against randomized control trials (RCT),
they were against the decision to take all $500 million of their research
money and pull it out of anything except randomized control trials. Now, it is
quite clear the previous speaker is not identifying himself with this extreme
wing, but who is the leader of the extreme wing? It is the guy who is the head
of the
“Read Tom Cook on problems in practice of running RCTs. So, this is a very tricky procedure. While it has theoretical advantages, the theoretical advantages in validity aspects of it are undeniable. That is not the issue. The issue is not whether or not there is an alternative that has the same theoretical bulletproof-ness. The question is whether there is an alternative that can get you results beyond reasonable doubt, and that is another story all together. Very often, you can get results beyond reasonable doubt in other ways.”
“First, the concessions. We have not used RCTs when we should have many, many times. There have been many occasions when we could have pulled off RCTs, when we could have staffed them with competent people, and this is still the case in the present, and that was the best design around. The arguments around are sloppy arguments including a number of arguments that Professor Lipsey ran into at the Evaltalk discussion. There was a lot of whistling in the dark going on there and ideological crap going on. You have to get down to the logic of the cases and you can’t just pull this off by waving things like constructivism, observational, or etc. So, this is a situation where there is no doubt at all. This is a very powerful tool, and sometimes much the best tool, but it has the same value as the torque wrench in a good mechanic’s toolbox. For certain tasks, you can’t beat it. After all, this is a quantitative instrument. The torque wrench reads out in inches and meters and so on, so this is very important if you are interested in matching the specs that you are supposed to be matching…a very good instrument. Nothing can match it, but it has a very narrow range of uses. Now, that doesn’t matter if the alternative approaches aren’t very good, but of course there is a lot of them and some of them are very good indeed.”
“Well, there’s a lot more I’d like to say, but perhaps I can just leave it by saying I think I agree strongly with him. A lot of the attacks have been empty and they have lacked specific examples that will work. A lot of the attacks are based on ideological positions, which are logically unsound. All of this is true, but nevertheless, given the difficulties facing RCTs, one has to be very cautious going to any sort of wholesale commitment to them. I hope in the future we can develop a better kind of existence than what we have at the moment.”
Conclusion
Somewhat surprisingly, Lipsey and Scriven agreed that randomized control trials (RCTs) are the best method currently available for assessing program impact (causal effects of a program), and that determining program impact is a main requirement of contemporary program evaluation. However, Scriven argued that there are very few situations where RCTs can be successfully implemented in educational program evaluation, and that there are now good alternative designs for determining program effects. Lipsey disagreed and remained very skeptical of Scriven’s claim that sound alternative methods exist for determining program effects, and challenged Scriven to provide specific examples. Streaming video of the entire Claremont Debate can be viewed at: http://www.cgu.edu/pages/465.asp.
About the Authors
Stewart I.
Donaldson is Dean and Professor of Psychology,
Christina A. Christie
is an Assistant Professor, Director of the Masters of Arts Program in
Psychology and Evaluation, and Associate Director of the
References
Donaldson, S. I.,
& Lipsey, M. W. (in press). Roles for theory in contemporary evaluation
practice: Developing practical knowledge. In I. Shaw, J. Greene, & M. Mark
(Eds.), Handbook of evaluation.
Donaldson, S.
Guiding Principles
for Evaluators (1994). New Directions for Program Evaluation (No.66).
Joint Committee on
Standards for Educational Evaluation (1994). The program evaluation standards:
How to assess evaluations of educational programs.
Mark, M. M. (2003).
Toward an integrative view of the theory and practice of program and policy
evaluation. In S. I. Donaldson & M. Scriven (Eds.) Evaluating social
programs and problems: Visions for the new millennium (pp. 183-204).
Abstract
This paper documents
a process of evaluation capacity building in a humanitarian organization in
Résumé
Cet article vise à
expliciter un processus de renforcement des capacités en évaluation de
programme d’une organisation humanitaire en
Introduction
The capacity
building of humanitarian organizations relates to the multiple functions and
activities carried out by these organizations. The literature is rich with articles
and chapters depicting poor capacity building practices in these types of
organizations; “capacity development has been largely unsuccessful,” said ALNAP
in its 2003 review of humanitarian action1. These accounts are often
written in a very negative way. In this paper we wish to provide a more
constructive perspective, as Morgan et al.2 have done regarding
training and education, on the way in which the capacity building activities of
humanitarian organizations are carried out, while remaining critical and
rigorous at the same time. For this purpose, we will present the case of a
Non-Governmental Organization (NGO) implementing community health programs in
This is why this
paper will only focus on the evaluation capacity building (ECB) activities of
this organization in
A thorough description of the context is important as ECB practices are highly “context-dependent”7 according to the most cited definition. After more than 20 years of conflict and severe economic decline11, chances for development in Afghanistan are impaired by the worsening health condition of the population. Indeed, health indicators, especially maternal and infant mortality rates, are among the worst in the world and some of them are increasing: UNICEF reports a rise from 600 maternal deaths in 1981 to 1,700 deaths in mid-1990. A recent women’s mortality survey, conducted in four provinces of
According to a recent report done for the Afghanistan Research and Evaluation Unit (AREU)15, the health system is adversely affected by major problems: a grossly deficient, and even absent, infrastructure; a top-heavy health system; poorly distributed resources; health care delivered on a ‘project’ basis by many distinct, relatively uncoordinated service providers; and the absence of a practical, useful, coordinated information system for management decision-making. In addition, the pre-war human resource capacity has been eroded and there is a scarcity of personnel with managerial and technical skills throughout the country. There is also a lack of training and of public health expertise among all health staff, and doctors are generally not able to deal with the most urgent problems at a community level. Indeed, medical facilities and personnel are very few in number and are primarily found in
In 2004,
A French medical NGO founded in 1979, Aide Médicale Internationale (AMI), has been working in Afghanistan since the early eighties, undertaking different kinds of activities that focus on the rehabilitation of health care structures and on medical training for health care workers. Initially, all missions were secret, taking place during the Soviet occupation of the Afghan territory. From 1985 to 1993 AMI ran a training program (Medical Training for Afghans) in Peshawar (Pakistan), and provided the 115 graduates with medical kits to start their activities inside Afghanistan19. This was a huge project in terms of medical capacity building for
Since 1996, AMI has run a multi-disciplinary health program funded by the European Union and implemented in partnership with the British NGO “Sandy Gall Appeal for
Medical and Administrative
As we can see, most of the past and current programs run and supported by AMI have a capacity building component, mostly on the medical and administrative side, like those of many other organisations in international health development2. The training of 115 graduates during the Mujahideen era is an early example, but in recent years some Afghan employees have had the opportunity to reinforce their capacities thanks to three strategies: on-the-job training, formal workshops and courses at the headquarters, and formal training abroad. The Afghan staff member responsible for the biology programme spent two months in different hospitals in
ECB Framework
During the last
three years AMI commissioned one program evaluation a year in
In one of the most recent publications on ECB, experts from the Centers for Disease Control and Prevention (CDC) said, “One problem is that the evaluation profession as a whole still lacks a well-developed theory and associated indicators for understanding evaluation capacity at an organizational level, particularly its inherent change over time and ‘ongoingness.’” This is why, first, this paper does not claim to provide extensive data on ECB outcomes, and second, we will use a broad framework to make the way in which the ECB activities were held in
Between 2001 and 2003, three evaluations in
We have argued elsewhere22 that in an international situation of humanitarian aid where the context of the evaluation is an essential element, but impossible to manage, it is best to use a participative approach and to minimize the distance between the evaluator and the participants. This evaluation model could significantly increase the probability of appropriation of the evaluation results and the application/adaptation of the recommendations. Thus, NGOs wishing to organize an evaluation in such a context may find it very useful to collaborate with expert-facilitators (as evaluators) who use the participative approach, and who at the same time know well the specific situation and the organization that implements the program. Expertise in evaluation is not sufficient on its own. For all these reasons, we believe that this specific approach is, in this particular context, one of the most appropriate evaluation models to improve and build the evaluation capacity of NGOs. We also argue that this holds true not only for development projects, as we have known for a long time23, but also, as is the case in this paper, for humanitarian projects run by NGOs in complex settings.
Having said this, we must add that the extent of participation was not the same during the three above-mentioned evaluations. Implicitly, we decided to use an evaluation model which employed approaches progressively closer to the ideal type of the participative model (practical type and not empowerment type24). The goal was to gradually reinforce the competences and knowledge of the NGO stakeholders in terms of evaluation and the institutionalization of those activities. Although in the context of international development NGOs have mainly been the first to apply this type of pluralist approach22,23, AMI was not truly accustomed to such a process in Afghanistan. The context of permanent war for more than 20 years, which obliged the NGO to work in substitution for the State and without much participation of the communities in decision making, is one of the explanations for the lack of use of such an approach. It should be noted that the implementation of the participative approach for the first time in 2001, during the first evaluation, proceeded in parallel with the will of the NGO to give a wider role to the local populations in the management of health centres. It was at this time that the first attempts to establish Health Management Communities were made. Also, the gradual approach with regard to participation is justified by the gradual evolution of the context, passing from a situation of war with the presence of the Taliban (2001) to a situation of post-conflict and rebuilding of the State (2003).
Table 1. The Three Evaluations from 2001 to 2003
In another article where we analyze in depth the participative process of the 2001 evaluation22 we proposed, following and adapting Patton25, to define participative evaluation according to nine criteria gathered in three categories. We also distinguish three categories of participants, ranked according to their capacity to intervene in the use of the evaluation results, since we are using a utilization-focused evaluation approach25. Table 2 illustrates the depth of the participation in the three processes and how, gradually, we used the appropriate evaluation model according to the context and the wishes of the NGO. We will, in the next section, explain in more detail how this progressive practice allowed us to build the evaluation skills of the local staff in order to improve their participation in the process.
The detailed analysis of the elements in Table 2 is beyond the scope of this paper. However, we think that it is useful to give some empirical elements. For that purpose, we are using this table to show how the degree of participation gradually increased from 2001 to 2003. The use of this approach peaked with the 2003 evaluation which, adapting a method proposed by Aubel26, allowed the utilization of a model close to the ideal type of the practical participative evaluation model. The details of this last evaluation are presented elsewhere27. We just want to add that, to overcome the problem of integrating lessons learned into the program and of appropriating recommendations, it was proposed that the evaluation exercise include a final one-day workshop in which a draft action plan regarding the implementation of recommendations was developed, based on the evaluation findings and lessons learned. Then, it was decided to establish an evaluation steering committee to organize a participative process for finalizing the action plan document by topic and implementing it.
One of the arguments in favour of using the appropriate evaluation model to improve capacity building activities is that a wrong model will not only be unable to answer the evaluation question asked by the NGO, but will also decrease the understanding and the trust of stakeholders regarding evaluation practice. In other words, with an appropriate model, as Bozzo21 said, “the efforts undertaken will be sustainable over the long term”. Table 2 is of special interest with regard to this point, and it demonstrates that the participative approach, in its ideal-type sense, is perhaps not the most appropriate model for an effective evaluation. In fact, if the depth of participation gradually increased, it was also due to a pragmatic objective: to increase the appropriation of the evaluation model. In other words, if AMI had wanted an efficiency evaluation in 2003, the depth of participation would certainly not have been what it was for the process evaluation. This observation is not new for evaluation theorists, but with these empirical data we confirm it and show that this was certainly one of the elements of the capacity building process.
Table 2.

| Criterion | Field: head of mission and medical coordinator. Headquarters: persons in charge of program and medical | Local department heads, expatriates in the field, and directors and staff of clinics/hospitals | Population and patients |
| --- | --- | --- | --- |
| Content | | | |
| The evaluation process involves participants in learning evaluation logic and skills | +/-, +, + | +, ++, ++ | -, +/-, - |
| Participants focus the evaluation on process and outcomes they consider important and to which they are committed | ++, ++, ++ | +/-, +/-, ++ | +/-, +/-, ++ |
| All aspects of the evaluation, including data, are understandable and meaningful to participants | ++, ++, ++ | +, ++, ++ | -, +, + |
| Process | | | |
| Participants in the process own the evaluation. They make the major focus and design decisions; they draw and apply conclusions | +, +, ++ | ++, +/-, ++ | +/-, +/-, +/- |
| Participants work together as a group and the evaluation facilitator supports group cohesion and collective inquiry | -, ++, - | +/-, ++, ++ | -, +, +/- |
| The evaluator is a facilitator, collaborator, and learning resource; participants are decision makers and evaluators | +, ++, + | ++, ++, ++ | +/-, +/-, +/- |
| Status differences between the evaluation facilitator and participants are minimized | ++, ++, ++ | ++, ++, ++ | -, +/-, +/- |
| Finalities | | | |
| Internal, self-accountability is highly valued | ++, +, + | +, +, ++ | -, +/-, - |
| The evaluator facilitator recognizes and values participants’ perspectives and expertise | ++, ++, + | +/-, ++, ++ | +, +, ++ |

Note. Each cell gives the degree of participation in 2001, 2002, and 2003, in that order. ++ = very intense; + = intense; +/- = average; - = absent.
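As a rough illustration only (this tally is not part of the original analysis), the ratings in Table 2 can be converted to ordinal scores and summed by year for each category of participants. The symbols are transcribed from the table; the numeric weights (++ = 3, + = 2, +/- = 1, - = 0), the group labels, and the function name are assumptions made for this sketch, and treating ordinal ratings as additive is itself a simplification.

```python
# Ratings transcribed from Table 2: one string per criterion (nine criteria),
# giving the 2001, 2002 and 2003 ratings in that order.
# The group labels below are hypothetical shorthand, not the authors' terms.
RATINGS = {
    "decision_makers": [
        "+/- + +", "++ ++ ++", "++ ++ ++",                # Content
        "+ + ++", "- ++ -", "+ ++ +", "++ ++ ++",         # Process
        "++ + +", "++ ++ +",                              # Finalities
    ],
    "local_staff": [
        "+ ++ ++", "+/- +/- ++", "+ ++ ++",
        "++ +/- ++", "+/- ++ ++", "++ ++ ++", "++ ++ ++",
        "+ + ++", "+/- ++ ++",
    ],
    "population": [
        "- +/- -", "+/- +/- ++", "- + +",
        "+/- +/- +/-", "- + +/-", "+/- +/- +/-", "- +/- +/-",
        "- +/- -", "+ + ++",
    ],
}

# ASSUMPTION: ordinal weights chosen for illustration only.
SCORE = {"++": 3, "+": 2, "+/-": 1, "-": 0}
YEARS = (2001, 2002, 2003)


def yearly_totals(rows):
    """Sum the assumed ordinal scores per year over all criteria."""
    totals = dict.fromkeys(YEARS, 0)
    for row in rows:
        # Each row holds three whitespace-separated symbols, one per year.
        for year, symbol in zip(YEARS, row.split()):
            totals[year] += SCORE[symbol]
    return totals


if __name__ == "__main__":
    for group, rows in RATINGS.items():
        print(group, yearly_totals(rows))
```

Running the sketch prints one dictionary of yearly totals per group; a different choice of weights would change the totals, so they should be read only as a summary of the table.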
Since 2001 and throughout the three evaluations, we used every favourable moment to develop the program evaluation skills of the stakeholders engaged in the evaluated projects. Two particular strategies were retained: on-the-job training and workshop training.
On-the-Job Training During the Evaluation Process
Because the Afghan medical coordinator of the NGO remained in his position throughout the three years, his presence contributed greatly to the NGO’s capacity building in evaluation. Admittedly, these evaluation exercises were not the only capacity building opportunities, and his work throughout the year with expatriates was just as much an occasion to improve his general knowledge and skills in public health and project management. In the same vein, the three evaluations were particular opportunities for him to learn and use concepts in program evaluation. We used the recommended strategy for adult learners: “learning by doing”2. The first evaluation was less participative than the others and more technical; it was also more research oriented. This enabled us to address subjects such as the construction of a questionnaire, the constitution of a sample, statistical tests, and concepts like ethics or external validity. The medical coordinator was also responsible for the administration of the questionnaires in villages, aided by a team of investigators. This enabled him to become aware of the difficulties on the ground and to assume responsibilities and make decisions which could affect the validity of the evaluation. Since not all investigators spoke English (and we know that ECB is language-dependent28), he had to transmit a certain amount of knowledge to his colleagues, which certainly contributed to reinforcing it. As an outcome of the ECB process, the medical coordinator was able at the end of 2001 to design and administer a quick survey when a huge number of displaced people reached the
This being said, we should mention that the most significant moment in terms of capacity building for him and one other colleague, who is no longer with the NGO, was the second evaluation in 2002. The method employed for this evaluation consisted of drawing up a list of indicators through three regional workshops with all project stakeholders. The medical coordinator acted as a translator for the foreign consultant, but the translation of certain concepts required a real understanding of the training contents. How does one explain, for example, the difference between output and outcome, or between objectivity and subjectivity? We thus worked together to find useful examples. It was necessary to adapt examples and exercises to the Afghan public, all the more so since the group members had very diverse backgrounds (which we take pride in), with some illiterate members. Having doctors and farmers (or teachers and community health workers) work on the same project is not customary, in
This medical coordinator was also part of the third evaluation (2003), but most of his evaluation (and facilitation) skills were developed through collective action, in large part during the second evaluation (2002).
Workshop Training During the Evaluation Process
In 2002, three training/action
workshops were carried out over three days in Mazar-e-sharif, Gulbahar and
In terms of the outcome of the ECB process, and in line with the shortened cascade approach in training2, the medical coordinator was able, a few weeks after those three workshops, to organize on his own the same workshop in another province (Laghman) with 24 participants. Knowing the logic model approach, he was also in a better position to interact with expatriates and contribute to the formulation of new AMI projects and proposals sent to donors. During the obligatory annual presentation of NGO program results at the Ministry of Public Health (MoPH), held before many stakeholders at the National Technical Coordination Committee, it was easier to explain the logic of the programmes, the performance indicators and the results-based management activities. There were also outcomes for provincial MoPH staff, notably regarding their skills in writing proposals and planning programs according to the new health policy (Basic Package of Health Services).
In 2003, the
participatory evaluation process started
with an evaluation planning workshop held in
We used different sources of data collected through quantitative as well as qualitative methods. The following methods were used: interview (22), focus group (16), observation (6), document analysis (2), and questionnaire (3). In addition to the people observed, 205 people (51% of them women) had the opportunity to express their views on the implementation of the AMI program in
So, we can say that this whole evaluation process, carried out by an evaluation team from the organization, was an excellent way to develop their evaluation skills in all areas of evaluation, from the evaluability assessment to the data analysis and action plan formulation phase. It is also clear that the skills needed to participate in the whole process had increased, for some, partly thanks to the capacity building process of the previous two years. Some of them were able in 2004 to use some evaluation techniques (focus group and bed census) during an assessment of the NGO cost-recovery schemes.
Follow-Up of the Baseline Survey in 2004
In addition to
those individual and collective training sessions during the last three
evaluations, we had another opportunity to develop the evaluation skills of the
NGO staff in 2004. During this year, the European Union grant given to AMI
covered four clusters of districts spread out among three provinces of
The expatriate in charge of the survey asked us to follow the whole process from a distance, in a voluntary and informal capacity. She also solicited our advice and guidance during the evaluation process. As a result, many methodological discussions were carried out through e-mail and phone. She decided to adapt the questionnaire that we used in the 2001 evaluation. For some parts of the baseline survey, she asked us for scientific literature (e.g., how to evaluate the quality of health care services) or statistical advice. We also reviewed part of the final report. This 2004 window was not only an opportunity to develop the staff’s skills in program evaluation but also to start building an infrastructure for data collection, analysis, and presentation that would support program evaluation, in addition to the routine health information system (HIS), which focuses more on input and output than on outcome indicators. This infrastructure is now in place and the Afghan collaborators are still with the NGO after the expatriate’s departure. It should be noted that, even though the expatriate was involved in the design, coordination and analysis of the survey, she was in the field in only one of the four provinces. Therefore, in the three other settings, the process was in the hands of the local staff. The medical coordinator delivered 80% of the training for the surveyors in three provinces and 100% in the other. The ECB of the last three years was surely responsible for this outcome.
The third element that helped us to meet the ECB objective for this NGO consists of myriad activities of advocacy in favour of program evaluation practice and of dissemination of the results of various evaluations. As we said earlier, the final aim of those advocacy/dissemination activities is to increase the probability of results utilization per se, following the Patton25 approach.
In terms of advocacy, and in addition to our continual personal interaction in favour of an evaluation culture, we produced different papers in order to increase the awareness of the NGO staff regarding different topics related to evaluation. These papers, in addition to the evaluation reports, targeted NGO staff directly and the humanitarian community more generally. All these papers discuss evaluation in accessible language. Some of them were published in peer-reviewed journals and others in professional reviews or books. The following topics were discussed:
Table 3. Publications in French (F) and English (E) Following the Three Evaluations

· Book chapter on Canadian humanitarian aid (F)
· Poster and proceedings of an international health care financing conference in France (F)

· Article in Humanitarian Affairs Review on health financing in a complex emergency context (F, E)
· Article in the Canadian Journal of Program Evaluation on the usefulness of a participatory evaluation model in an emergency context (F)
· Article in The Journal of Afghanistan Studies on the results and on the usefulness of a participatory process to explain the changes implemented, with results shown 2 years after the evaluation (E)

· Book chapter in the Encyclopedia of Evaluation on participatory determination of performance indicators and the utilization-focused evaluation model (E)
· Article in the internal newsletter (Tam-Tami) for AMI staff on ethics (F)

· Article in Développement et Santé on basic concepts in evaluation and the usefulness of a participatory evaluation model (F)
· Article in Revue Humanitaire on the usefulness of a participatory evaluation model and lessons-learned workshop (F)
· Article in the AMI newsletter (La Chronique) for donors: advocacy for humanitarian program evaluation (F)
· Book chapter in the 25th anniversary book on AMI on the basic concepts in evaluation and the usefulness of a participatory evaluation model (F)
We are well aware that, given the different stages of knowledge utilization (from transmission to application), dissemination of results does not mean their utilization. But we can also say that these dissemination activities, through all these papers published for various audiences and in different forms, could contribute to the establishment of an evaluation culture in the organization. Moreover, some articles were specifically written, in the readers’ own languages, to train them and explain the logic of evaluation and the importance of practicing it (e.g., 33). We tried to translate one of these articles into the local language and publish it in the Salamati magazine published by this NGO. But unfortunately, the expatriate in charge of this publication in
In addition to these publications, during the past four years we gave various oral presentations to share some evaluation results and to raise the awareness of the NGO staff about evaluation practice. In
Last but not least, we took the opportunity of a Canadian bursary program to invite the Afghan medical coordinator, who had taken part in all the evaluations since 2001, to the 2nd International Conference on Local and Regional Health Programmes held in
The descriptive elements presented previously clearly show that the implicit process of capacity building was gradual and effective, as demonstrated by some of the partial outcomes. Contrary to our definition of ECB, which claims that the process needs to be intentional, the case shows that a non-intentional process (from the organization’s point of view) could also have some impact in terms of capacity building. The “evaluation capacity building practitioner considers how each study is connected to the development of the organization and to meeting the organization’s goals and mission”4. For this reason, and to counter the non-intentional process, we (as individuals and not as an organization) decided to use all windows of opportunity, or “teachable moments”3, to act in favour of ECB for the NGO and its staff. One of the recommendations by Gibbs et al.35, after their study of ECB in 61 NGOs in the USA, was to “take advantage of every available opportunity to use existing evaluation data as a resource for program improvement”. We have tried to implement this recommendation, and more. This strategy was based on three particular components which, concomitantly, allowed us to reach this goal, as shown in Figure 1.

Figure 1.
The implementation and the quality of the baseline survey planned for 2006 will be a good test for these capacity building activities. For the moment, this paper has highlighted some of the ECB outcomes, mostly at the individual level for the
Now, it remains for
the NGO to pass from a process of non-intentional ECB program level (
· Designating organizational (independent) evaluation leaders at the headquarters and in the field
· Locating those leaders in the organizational hierarchy
· Formulating and adopting an evaluation policy (stating, for example, the preferred evaluation model, the choice between internal and external evaluation, the means of disseminating results and building capacity, etc.)
· Producing internal material
· Developing an evaluation consultants network
· Coordinating evaluation activities across project countries
· Training expatriate and national staff
· Sustaining leadership
References
1. ALNAP. ALNAP Review of Humanitarian Action in 2003. Field Level Learning. London: Overseas Development Institute, 2004.
2. Morgan CJ, Deutschmann PW. An evolving model for training and education in resource-poor settings: teaching health workers to fish. Med J Aust 2003;178(1):21-5.
3. Milstein B, Chapel TJ, Wetterhall SF, Cotton DA. Building capacity for program evaluation at the Centers for Disease Control and Prevention. In: Stockdill SH, Baizerman M, Compton D, eds. The Art, Craft, and Science of Evaluation Capacity Building. New Directions for Evaluation, n°93, spring 2002: Wiley Periodicals, Inc., 2002: 27-46.
4. Baizerman M, Compton D, Stockdill SH. Capacity Building. In: Mathison S, ed. Encyclopedia of Evaluation. Thousand Oaks: Sage Publication, 2004: 38-39.
5. Sanders RJ. Mainstreaming Evaluation. New Directions for Evaluation 2003;99(Fall 2003):3-6.
6. Lusthaus C, Adrien MH, Perstinger M. Capacity development: Definitions, issues, and implications for planning, monitoring, and evaluation. Montreal: Universalia Occasional Paper n°35, 1999: 21.
7. Stockdill SH, Baizerman M, Compton D, eds. The art, craft and science of evaluation capacity building: Wiley Periodicals, Inc., 2002.
8. Milstein B, Cotton D. Defining Concepts for the Presidential Strand on Building Evaluation Capacity. American Evaluation Association. Available at www.eval.org/eval2000/public/presstrand.pdf. 2000.
9. Schaumburg-Müller H. Evaluation Capacity Building. Donor Support and Experiences. Report for the DAC Expert Group on Aid Evaluation, OECD. Copenhagen: Danida, 1996: 30.
10. Cousins JB, Goh SC, Clark S, Lee LE. Integrating evaluative inquiry into the organizational culture : a review and synthesis of the knowledge base. Canadian Journal of Program Evaluation 2004;19(2):99-141.
11. Marsden P, Samman E. Afghanistan: the economic and social impact of conflict. In: Fitzgerald V, ed. War and Underdevelopment. Queen Elizabeth House: Oxford University Press, 2000.
12. UNICEF. Maternal Mortality in Afghanistan : Magnitude, Causes, Risk Factors and Preventability, Summary Findings. Kabul: UNICEF, CDC, MoPH, 2002: 7.
13. van Egmond K, Naeem AJ, Verstraelen H, Bosmans M, Claeys P, Temmerman M. Reproductive health in Afghanistan: results of a knowledge, attitudes and practices survey among Afghan women in Kabul. Disasters 2004;28(3):269-82.
14. World Bank. Joint donor mission to Afghanistan on the health, nutrition, and population sector. Aide-Memoire, 2002: 19.
15. Waldman R, Hanif H. The public health system in Afghanistan: Current issues. Kabul: Afghanistan Research and Evaluation Unit, 2002.
16. Ridde, V. (2005). "Performance-based Partnership Agreements for the reconstruction of the health system in Afghanistan." Development in Practice 15(1): 4-15.
17. Ministry of Health. A Basic Package of Health Services For Afghanistan. Kabul: Transitional Islamic Government of Afghanistan-Ministry of Health, 2003: 51.
18. Ridde, V. (2002). L'aide humanitaire et la santé de la population afghane sous le régime des Tâlebân. L'action humanitaire du Canada. Histoire, concepts, politiques et pratiques de terrain. Y. Conoir and G. Vera. Québec, Presses de l'Université Laval: 545-566.
19. AMI. Aide Médicale Internationale : 25 ans d'ingérence médicale (provisional title). In press. Paris, 2005.
20. Duignan P. Mainstreaming Evaluation or Building Evaluation Capability ? Three Key Elements. New Directions for Evaluation 2003;99(Fall 2003):7-21.
21. Bozzo SL. Evaluation capacity building in the voluntary/nonprofit sector. Canadian Journal of Program Evaluation 2002;17(3):75-92.
22. Ridde, V. (2003). "L'expérience d'une démarche pluraliste dans un pays en guerre : l'Afghanistan." Canadian Journal of Program Evaluation 18(1): 25-48.
23. Cracknell BE. Evaluating Development Aid. Issues, Problems and Solutions. New Delhi.Thousand Oaks.London: Sage Publications, 1996.
24. Cousins JB, Whitmore E. Framing Participatory Evaluation. In: Whitmore E, ed. Understanding and Practicing Participatory Evaluation: Jossey-Bass Publishers, 1998: 5-23.
25. Patton MQ. Utilization-Focused Evaluation. 3rd ed. Thousand Oaks-London-New Delhi: Sage Publications, 1997.
26. Aubel J. Participatory Program Evaluation Manual. Involving Program Stakeholders in the Evaluation Process. Calverton, Maryland: Child Survival Technical Support Project and Catholic Relief Services, 1999: 86.
27. Ridde, V. (2004). "L'évaluation de programme en santé internationale : qu'est-ce que c'est, comment la planifier et utiliser une approche participative ?" Développement et Santé 169: 23-29.
28. Toulemonde J, Bjornkilde T. Building Evaluation Capacity: Experience and Lessons in Member States and Acceding Countries. Budapest: Fifth European Conference on Evaluation of the Structural Funds, 26-27 June 2003: 13.
29. Patton MQ. Creative evaluation. 2nd ed. Newbury Park, Beverly Hills, London, New Delhi: Sage Publications, 1987.
30. Ridde, V. (2004). Seeds against malnutrition in Afghanistan: an experience in participative performance evaluation training. Encyclopedia of Evaluation. S. Mathison. Thousand Oaks, Sage Publication: 433-434.
31. Thurston WE, Potvin L. Evaluability Assessment: A Tool for Incorporating Evaluation in Social Change Programmes. Evaluation 2003;9(4):453–469.
32. Sartorius RH. The logical framework approach to project design and management. Evaluation practice 1991;12(2):139-147.
33. Ridde, V. (2004). "L’utilité d’un processus participatif lors des évaluations de programmes humanitaires." Revue Humanitaire 11: 59-65.
34. Ridde, V., P. Bonhoure, et al. (2004). "User fees and hospital health care utilization in Afghanistan : lessons learned from a participative evaluation." Journal of Afghanistan Studies 2: 93-109.
35. Gibbs D, Napp D, Jolly D, Westover B, Uhl G. Increasing evaluation capacity within community-based HIV prevention programs. Evaluation and Program Planning 2002;25(3):261-269.
36. Boyle R. Building effective evaluation capacity: some lessons from international practice. Presentation to IDEA Seminar, Rome, October 24th, 2003.
About the Authors
Valéry Ridde is a
postdoctoral researcher at Université de Montréal (GRIS/USI). He works and
studies in global health, community health and evaluation. His PhD in community
health thesis (
Shakir Sahibullah
is a medical doctor living in
A special thanks to
all AMI staff in
Abstract
This paper examines the relationship between ethnographic research methods and evaluation theory and methodology. It is divided into two main sections: (a) ethnography in evaluation and (b) anthropological models of evaluation. Three of the leading anthropological models of evaluation are summarized: responsive evaluation, goal-free evaluation, and constructivist evaluation. In conclusion, (a) there is no consensual definition of ethnography; (b) in many circumstances, ethnographic evaluation models may be beneficial; and (c) ethnography can be used in evaluation but requires a high level of analysis to transform ethnographic data into useful information for eliciting an evaluative conclusion.
*The author would like to thank
Daniela C. Schröter, Chris L. S. Coryn, and Elizabeth K. Caldwell for editing
this paper and for their extremely useful comments and suggestions.
Ethnography[25], an applied qualitative social science research method, is often employed in program evaluation. Ethnography alone, according to its pure anthropological definition, is not capable of being the sole method implemented in an evaluation. It may prove advantageous to evaluators as an additional method to be employed or considered. However, sound evaluation typically requires multiple data collection methods and a higher level of analysis than ethnography alone can provide. Evaluation synthesizes the narrative and develops an evaluative conclusion. There are various instances when the implementation of an evaluation model that relies heavily on qualitative methods based in the tradition of anthropological research is beneficial. At a minimum, an evaluator should be familiar with these models.
The paper is divided into two main sections: (1) Ethnography and Evaluation and (2) Anthropological Models of Evaluation. The first section presents a summary definition of ethnography, its theories, concepts, and benefits, and the difference between ethnography and anthropology. The author then provides a brief definition of evaluation and discusses the relationship between ethnography and evaluation. The second section summarizes three anthropological models of evaluation, discusses their strengths and limitations, and reflects on their relationship with ethnography. The paper concludes with a synopsis of the author’s main impressions and key points.
Ethnography[26] is an applied research method most often associated with anthropology, where it was developed to study cultural interpretation. Ethnography, also called field research, is the process of describing a culture or way of life from a folk peoples’ perspective. Anthropologist Clifford Geertz described the ethnographic method as “thick description.” It provides detailed notes and descriptions of everything that occurs without attempting to summarize, generalize, or hypothesize. In fact, with traditional ethnography, as a rule of thumb, for every half hour of observation a researcher writes for two hours. The researcher focuses on factual description to allow for multiple interpretations to later infer cultural meaning. To obtain this description of a population’s perception, the principle of ‘naturalism’[27] is assumed. Thus, trust and rapport are essential between the researcher and the population being studied.
Ethnographers, if following the constructivist[28] philosophy, believe that pure objectivity is impossible as: (A) ethnography is an interpretive endeavor by fallible human beings; (B) not all field sites are “foreign” for ethnographers in the same way; (C) ethnography is not replicable; and (D) ethnography is not based on a large number of cases. The epistemology of ethnography is typically a model based on a phenomenologically oriented paradigm, which focuses on multiple perspectives and multiple realities of a phenomenon. Phenomenological inquiry seeks to answer the question: “What is the structure and essence of experience of this phenomenon for these people?” (Patton, 1990, p. 69). Constructivists take a heuristic[29] approach to answering this phenomenological question. According to Fetterman (1989), most ethnographers subscribe to ideational theory, which suggests that change is the result of mental activity—thoughts or ideas—versus materialists who believe that “material conditions—ecological resources, money modes of production-are the prime movers” (Fetterman, 1989, p. 16). The most popular ideational theory is cognitive theory, which assumes we can infer peoples’ thoughts from hearing what they tell us.
While many theories, concepts, and methods (e.g., in-depth, open-ended interviews, direct observation, written documents, triangulation) resulting in narrative description commonly recur in the literature, consensus on any one set of fundamental principles of ethnography cannot be found (Genzuk, 2001; Patton, 1990; Payne, 1994). For example, ethnographic theories, concepts, and data collection techniques are also used in non-ethnographic qualitative research, and distinctions between ethnography and other qualitative theories, concepts, principles, and methods are not clearly evident. Instead, there are copious combinations of varying concepts considered fundamental to ethnography from researchers and anthropologists alike.
The key to understanding the differences between ethnography and other qualitative social science research methods is to understand the multiple combinations of techniques, concepts, and data collection methodologies encompassed under the term “ethnography.” As with all research methodologies, each philosophical and theoretical decision is located on a spectrum or continuum. Thus, the definition of ethnography and what it entails is idiosyncratic to the ethnographer or researcher depending on her degree of commitment to a hodgepodge of “fundamental” concepts. Past and current literature presents definitions and concepts of ethnography differing by technique, values emphasized, time allotted, data analysis procedures, and commitment to the purist practice of anthropological ethnography. Therefore, ethnographic techniques are qualitative in nature but distinct. Below are a few of the recurring concepts specific to ethnography (Fetterman in Bickman and Rog, 1998; Genzuk, 2001; Hall, n.d.):
· The focus is on culture and cultural interpretation.
· There is an emphasis on an emic[30] perspective.
· The holistic perspective is often of greater depth than other qualitative research methods.
· Sampling measures are conducted over a longer period of time.
· The researcher herself is the primary tool for data collection.
In contrast to ethnographic methodology, evaluation methodology commonly comprises the use of both qualitative and quantitative techniques. Evaluation is commonly defined as the systematic determination of something’s merit, worth, or significance (Davidson, 2005; Sanders, 1994; Scriven, 1991). Scriven (1991) claims that evaluation is not only a methodology, but a distinct multi- and transdisciplinary field of study not to be regarded merely as a branch of applied social science. As an independent discipline, evaluation may utilize applied social science research methodology, but it is distinct by its unique purpose and methodology (e.g., ranking, grading, and scoring). The determination of merit, worth, and significance of an evaluand requires evaluators to consider the relevant values and to make judgments based on those values. Autonomous subspheres of evaluation are program, policy, product, personnel, performance, and proposal evaluations as well as metaevaluation (i.e., the evaluation of evaluation) and intradisciplinary evaluation. Often but not always, evaluations are based in social science research methods including both qualitative and quantitative data collection procedures. Subcomponents of program evaluations, for example, may include the assessment of context, resources, processes, immediate outcomes (outputs), and intermediate and long-term outcomes and impacts, considering costs and comparisons to best and worst practices of other programs (Davidson, 2005; Scriven, 1991). Moreover, evaluation may be formative, summative, or ascriptive (Scriven, 2005).
Multiple factors may guide evaluators and researchers alike toward choosing quantitative or qualitative evaluation methodology. In the following, qualitative ethnographic evaluation models will be introduced.
Ethnographic Evaluation Models
Ethnographic evaluation methodologies have been discussed for over thirty-five years and came about as a response to the more traditional evaluation approaches which were overly committed to the scientific paradigm of inquiry. According to Guba and Lincoln (1989), an extreme dependence on the methods of science demonstrated some negative results. For example, reliant on primarily quantitative measurement, evaluands were stripped of their context as if they were not entwined in a highly specific one, resulting in irrelevant or non-useful findings (cf. Seafield Research & Development Services). Moreover, scientific truth is non-negotiable, thus all alternative explanations must be in error.
Ethnographic evaluation methods, in
contrast, utilize stakeholders’ claims and concerns. For example, Guba and
Lincoln (1989) insist upon ethnographic methods for determining what
information is necessary in an evaluation and provide five reasons:
1. Stakeholders are placed at risk by an evaluation.
2. Evaluation exposes stakeholders to exploitation, disempowerment, and disenfranchisement.
3. Stakeholders represent an “untapped market” for the use of evaluations that are responsive to self-defined needs and interests.
4. Stakeholder input expands the scope and meaningfulness of the evaluation, in addition to contributing to the dialectic process that is necessary in conducting a sound evaluation.
5. All individuals and parties can be mutually educated toward more sophisticated personal constructions and they may gain enhanced appreciation of the constructs of other individuals or parties.
Other strengths of implementing ethnographic methods in evaluation are exemplified in connecting quantitative data to observed actual outcomes; the flexibility of design; the ‘thick description’ of program impactees; the clarification of processes; the study of participation; and the identification of unintended positive and negative side effects. The weaknesses in utilizing ethnographic methods in evaluation consist of such problems as introducing complex threats to validity; increasing the time and cost demands compared to other methods; raising the potential for impactee reactivity to the evaluator; and limiting the ability to compare the data from different measurement instruments.
There are many considerations that will need resolution before deciding if an ethnographic method is an appropriate method for an evaluation. Considerations include the purpose of the evaluation; whether the evaluation is formative or summative; the amount of time allocated for the evaluation; the financial and other resources available; and the level of expertise and competence of the evaluation team. Prior to adopting a specific methodology or model, all the typical issues regarding methodology, conceptual context, validity, ethics, etc. must be discussed.
Relationship Between Ethnography and Evaluation
In evaluation, ethnography should be viewed on a spectrum. One end of the spectrum consists of the pure anthropologically-defined ethnography and on the opposite end are various ethnographic techniques of data collection and methodologies loosely defined, combined, and flexibly implemented. Many researchers and evaluators implement one or a few qualitative data collection methods and then claim their research to be ethnographic. However, most agree that ethnography is defined by the rigor of the data collection procedures. Fetterman (1982) identified a study that called itself ethnographic although the researchers were on site for only five days. Deneberg (1969) and Fetterman (1982) claim that these researchers are fickle to scholastic fads and refer to them as “Zeitgeister-Shysters.” Zeitgeister-Shysters become involved in research that is a hot topic or trendy and the result is superficial research. Such researchers contribute minimally to the field and often tarnish the reputation and credibility of ethnography. In describing the Zeitgeister-Shysters, Fetterman stated, “rather than conducting ethnographies, they are simply using ethnographic techniques” (Fetterman, 1982, p.2). Wolcott (1980) concluded that “much of what goes on today as educational ethnography is either out and out program evaluation, or, at best, lopsided and undisciplined documentation” (p.39). Fetterman warns that the adoption of random elements of ethnography without emphasis on the whole, results in “the loss of the built-in safeguards of reliability and validity in data collection and analysis” (Fetterman, 1982, p.2). Researchers often use anthropological tools (ethnography) without understanding the values and cosmology underlying the ethnographic techniques. Wolcott (1980) reminds the reader that the purpose of ethnography is cultural interpretation and this requires the researcher to examine the whole trait complex rather than a few single traits. 
Still, many evaluators study single traits and call their evaluation ‘ethnographic’.
The importance of ethnographic data sources in the evaluation of social programs and policies is rarely argued (Agar, 2000; Fetterman, 1982; Fetterman, 1984; Guba & Lincoln, 1989; Hopson, 2002; Patton, 1997; Posavac & Carey, 1997; Scriven, 1991; Swartzman, 1983; Shadish, Cook, and Leviton, 1995; Stake, 1975; Wholey, Hatry, & Newcomer, 2004; Wolcott, 1982; and Worthen, Sanders, & Fitzpatrick, 1997). Hopson (2002), for example, cites a report by Nastasi and Berg (1999), who urge evaluators to “capture views of program participants about their experience of a program, its acceptability, and whether or not they were influenced to modify behavior or thinking” (p. 45). This has always been a consideration for evaluators, as it pertains to, or affects, the program's quality, significance, or merit. Experienced evaluators typically employ several qualitative data collection methods in an evaluation in hopes of understanding some of these cultural issues, albeit in less depth than with pure ethnography.
Focusing on context is crucial in all evaluations and the utilization of qualitative methods is fundamental to any good program or educational evaluation; however, the title 'ethnographic evaluator' may be problematic or misleading. Many readers may assume that the term “ethnographic evaluator” implies the use of ethnography in conducting an evaluation. This is false and arguably not possible. Ethnography is a social science research method that emphasizes cultural interpretation. The product of ethnography is a non-judgmental description of context and then a cultural interpretation of the program.
Evaluation is the systematic process of determining the merit, worth, significance or importance of the evaluand. To evaluate something, the relevant values are determined and used to make judgments regarding the overall quality of the program. Ethnography and other qualitative research methods are instrumental in collecting data for determining the most important values to use as criteria for success. Ethnography may uncover unanticipated costs, processes, and outcomes; however, other qualitative methods may reveal similar side-effects but take much less time. There is a point of saturation when a researcher gets the sense that it is unlikely that further study will uncover significant new information that will be important to include in the evaluation. Extended time in the field may not be necessary or feasible for many evaluations.
To summarize, ethnography is a research method, and evaluation uses multiple research methods to collect information for determining the merit or worth of a program. As Fetterman (1984) points out, the distinction between ethnography and evaluation lies in the level of analysis and the objective. Evaluators take the ethnographic data to a higher level of analysis by extracting data which is relevant to some standard; comparing it with data from other methods and sources; and judging the program accordingly. Therefore, I conclude that “ethnographic evaluation” is a misnomer or false label for what some evaluators do. Moreover, many evaluations claim to use ethnographic methods while in reality they simply employ varying degrees of qualitative methods. Anthropologically pure ethnography may prove useful when analyzed further by an evaluator to examine actual processes and outcomes. Anthropological evaluation techniques may be best when conducted independently of the more quantitative research methods, similar to Scriven's (1991) goal-free evaluation. Therefore, in an evaluation which uses multiple research methods, ethnography serves as a way of triangulating these methods. Furthermore, ethnographic data is useful in triangulating data sources, adhering to the principle of critical multiplism (see Shadish, 1994). An examination of three evaluation models which are based in anthropology will further illustrate the relationship between ethnography and evaluation.
Anthropological Models of Evaluation
Payne (1994) categorizes four evaluation models, the fourth of which contains anthropological approaches (see Figure 1).


Figure 1. Models of Evaluation (adapted from
Payne, 1994)
The anthropological models of
evaluation—responsive evaluation, goal-free evaluation, and constructivist
evaluation—have many similarities. They tend to be qualitative, exploratory,
highly descriptive, and take an inductive approach to understanding the program
under evaluation. Each model was created in the post-positivist value-pluralist
perspective, focusing on the question: whose
values and methods should shape or have shaped the evaluation?
The anthropological models guard against the evaluator’s personal opinions being used to determine the values and methods emphasized in the evaluation. However, Scriven separates goal-free evaluation from the other two anthropological models by contending that the stated goals of the client should also not be known or utilized by the evaluator. The three models re-examine the ontology[31] of evaluative interpretations. In both responsive evaluation and constructivist evaluation, the selection of relevant values and the determination of the merit of outcome measures are decided by the program impactees and stakeholders. Evaluators are partners with the stakeholders in the creation of data and they orchestrate the consensus building process. By contrast, in goal-free evaluation, program success is decided by examining change relative to the needs identified through a comprehensive needs assessment. Lastly, all three models rely on an evaluator with significant commitment to and experience with ethnographic and qualitative methods.
The remainder of the paper will discuss each anthropological evaluation model and illustrate its relationship to ethnography and the qualitative research paradigm of evaluation.
Stake (1975) called his approach to evaluation responsive evaluation to stress flexibility and responsiveness to the concerns and issues of the program stakeholders. Responsive evaluation is less reliant on formal communication such as the statement of goals, objective tests, standards of program personnel, and research-type reports. Rather, it focuses on gathering the observations and reactions of the program stakeholders, which as Stake claims, is the way people naturally evaluate things. Stake believes this and other qualitative methods are not frequently employed in evaluation due to “subjectivity.” Responsive evaluation is poorly suited for evaluating formal contracts, and there lies potential to uncover negative side effects or raise embarrassing questions.
Stake suggests examining a program by
organizing the evaluation into four components: environment, workspace, output,
and support (see Figure 2).
| Environment | Workspace | Output | Support |
Figure 2. Four Components of Evaluation: Environment,
Workspace, Output, and Support (adapted from Stake, 1975)
Exemplifying educational evaluation, Stake states,
[A]n educational evaluation is a responsive
evaluation if it orients more directly to audience requirements for
information; and if the different value-perspectives present are referred to in
reporting the success and failure of the program.
(Stake, 1975, p. 14)
It is not critical to be explicit about purpose, scope, or causation in determining worth, according to Stake. In conducting responsive evaluation, the evaluator observes the program to gather narrative and descriptive information from program stakeholders, and negotiates the values by which to judge the program. An evaluator should not presume that only the measurable outcomes provide evidence of the program's worth. Outcome evaluations tend to negate the idiosyncratic and unique ways people benefit from involvement with the program and among each other; furthermore, they are not sensitive to changes in program purpose. Stake cites Scriven (1967) and suggests that it may be preferable to evaluate the “intrinsic merit of the experience rather than the more elusive payoff” (p. 27). Stake feels that less emphasis on preconceived notions of success will allow for increased stakeholder flexibility in determining the purposes of the evaluation and the criteria by which to measure success. In a responsive evaluation, the evaluator has the ability to respond to emerging issues, rather than sticking to a strict evaluation plan or structure. This ultimately leads to an increase in the evaluation's utility to the program stakeholders. Recurring events in responsive evaluation (Stake, 1975):
1. Talking with clients, program staff, and audiences.
2. Identifying program scope.
3. Providing an overview of program activities.
4. Discovering purposes, concerns.
5. Conceptualizing issues, problems.
6. Identifying data needs regarding issues.
7. Selecting observers, judges, and instruments if any.
8. Observing designated antecedents, transactions, and outcomes.
9. Providing a theme; preparing portrayals, case studies.
10. Winnowing, matching issues to audiences.
11. Formatting for audience use.
12. Assembling formal reports, if any.
Data is collected through direct personal experience or the second best option, vicarious experience. Observations are not only conducted by the evaluator, but the evaluator enlists program stakeholders according to the issues being studied and the audience being served. Having multiple observations and observers increases data reliability; observations continue to be subjective but through replication random error is reduced. The bias of direct or vicarious experience decreases as repeated observation and diverse points of view are attained. The evaluator produces portrayals typically featuring descriptions of persons, such as a five-minute script, a log, scrapbook, multi-media or audience role-plays. The small number of case studies is often criticized for sampling error, but Stake attests that the error may be minimal and that it is a small price to pay for potentially substantial improvements in communication. Moreover, Stake assumes that case studies of several students are more interesting and representative of a program than a few measurements on all program participants. Therefore, the reader benefits by a more comprehensive understanding of the program.
The evaluation encounters two pluralisms of values: (1) in context, or in Stake’s terms the “antecedent condition in which the program is found” (p. 23) and (2) the personal outcomes or outcomes of the program. The evaluation team should not impose its values on the “actors,” “spectators,” and/or “critics” of the program during the consensus building process. Stake identifies two measures of the value of evaluation: its increment of added experience and its enhancement of responsive alternatives.
Strengths of responsive evaluation include its flexibility, its adaptability, and its capacity to provide cultural explanation and recognize diversity. It may be particularly useful in evaluating programs where the stakeholders generally agree on the intrinsic value rather than the instrumental value of the program. For example, many people will discuss the importance of music and art “because they're good things to do” (ibid, p. 16). Furthermore, in formative evaluations, responsive evaluation is useful in monitoring the program and identifying positive and/or negative side effects. It is helpful in summative evaluations by giving the reader an understanding of the program's activities, its strengths and weaknesses, and by providing a vicarious experience in the evaluation.
Limitations of responsive evaluation include the difficulty in making comparisons to standards; it serves the immediate audience and may not fulfill distant or future needs. In today's world, funding constraints on arts education programs, for example, have led to an increased demand for quantifiable outcomes and results, which are not emphasized in responsive evaluation. Moreover, responsive evaluations may be less objective, reliable, and generalizable as compared to traditional evaluations, or, as Stake calls them, preordinate evaluations. Responsive evaluation is not useful when it is important to measure goal attainment, whether promises were kept, or in cases where predetermined hypotheses are to be examined.
Ethnography, in the more traditional sense, has compatibility problems with responsive evaluation but there may be potential for combining them. A primary distinction is that with responsive evaluation, the evaluator solicits the observation of stakeholders, thus making the stakeholders part of the evaluation team and adding them as additional data collection instruments. Nevertheless, with some concessions on both sides, the two may be combined.
The evaluator, in a goal-free evaluation (Scriven, 1991), intentionally enters the field without being aware of the specific stated goals and objectives of the program. The evaluator learns about the program and its outcomes inductively. This means that all program materials are screened either by a non goal-free evaluator on the evaluation team, an administrative assistant, or by the client to ensure that none of the stated goals or objectives are described to the goal-free evaluator. The purpose of this is:
…finding out what the program is actually doing without
being cued as to what it is trying to do. If the program is achieving
its stated goals and objectives, then these achievements should show up; if
not, it is argued, they are irrelevant.
(Scriven, 1991, p. 180)
Goal-free evaluations can assist in determining whether the objectives are worthwhile, rather than asking only “are the objectives being achieved?” It is similar to the double-blind pharmaceutical study; just like the drug evaluator, the goal-free evaluator does not have to know the direction of the intended effect or the intended extent of the outcomes (Scriven, 1973). The evaluator intends to find the program’s actual outcomes and then works backward to determine if the effects were caused by the program. The goal-free evaluator is like the crime scene investigator who tries to eliminate rival explanations which may have led to the outcome under investigation. Information regarding the stated goals of the program is withheld from the evaluator. However, the evaluator is able to review some program documents, budgets, schedules, recorded observations, profiles of participants and staff, etc. as long as there is no implication of any stated goal.
A comprehensive, fair, and accurate needs assessment is essential in conducting a goal-free evaluation. Merit is determined by comparing the actual program outcomes to the relevant needs of those impacted, instead of to the program goals or consumer wants or desires. The program is evaluated according to the level of fulfillment of the consumers’ needs. Scriven believes that by keeping the goals vague, a less pure goal-free evaluation still makes finding outcomes difficult and encourages the evaluator to connect program effects to recipients’ needs instead of the stated goals of the program. Altschuld and Witkin (2000) state that the needs at the primary level (i.e., recipients of the program) are the most critical concern, and from there the needs assessment can consider the needs of the service deliverers and the program delivery system. They argue that the primary needs are the “raison d’être” or the “rationale for the existence” of the service deliverers and delivery systems (Altschuld & Witkin, 2000, p. 10).
There are also relative degrees to which an evaluation may be goal-free. Goal-free evaluations may be combined, in full or in part with other evaluation methods (e.g., “qualitative versus quantitative, survey versus experiment, multiple perspectives versus one right answer, etc.”, Scriven, 1991, p. 182). Additionally, an evaluation may begin goal-free and then become goal-based; the reverse is not possible. It is also suggested that goal-free evaluation can be used as a supplement to a traditional outcomes evaluation conducted by a separate evaluator. The evaluator implementing the goal-free evaluation collects exploratory data to supplement and provide context to another evaluator's goal-oriented data. Goal-free evaluators observe the program in an attempt to understand the culture, meanwhile considering needs, processes, and outcomes. Below, the author provides a simplified illustration of a goal-free evaluation using a physical education and training program.
The evaluator of a physical education and training program enters into the evaluation without any prior knowledge of the program's goals. She would likely be capable of directly observing changes in health-related knowledge, strength, and endurance, which are the program's stated goals. However, the goal-free evaluator might also discover changes in endurance, flexibility, physique, changes in behavior, social status, networking with other students, finding new supportive workout partners, sharing of dietary and nutrition tips, increased self-esteem, etc., none of which were original goals of the program and all of which would be considered positive, unintended side-effects. They would likely have been missed if the evaluation solely examined the stated or preordained goals.
Arguments for the utilization of goal-free evaluation include (Scriven, 1991):
· It may identify unintended positive and negative side-effects and other context specific information.
· As a supplement to a traditional evaluation, it serves as a form of triangulating both data collection methods and data sources.
· It circumvents the traditional outcome evaluation and the difficulty of identifying true current goals and true original goals, and then defining and weighing them.
· It is less intrusive to the program and potentially less costly to the client.
· It is adaptable to changes in needs or goals.
· By reducing interaction with program staff, it is less susceptible to social, perceptual, and cognitive biases.
· It is reversible; an evaluation may begin goal-free and later become goal-based using the goal-free data for preliminary investigative purposes.
· It is less subject to bias introduced by intentionally or unintentionally trying to satisfy the client because it is not explicit in what the client is attempting to do; it offers fewer opportunities for evaluator bias or corruption because the evaluator is unable to clearly determine ways of cheating.
· For the evaluator, it requires increased effort, identifies incompetence, and enhances the balance of power among the evaluator, the evaluee and client.
Scriven and other users of goal-free evaluations have provided minimal direction regarding operational methodology in conducting the model. The only known attempt to develop an operational methodology for goal-free evaluation was by Evers (1980) in a doctoral dissertation. Evers outlined a goal-free evaluation methodology consisting of six components, each comprising several sub-categories. The six main components were: (1) Conceptualization of Evaluation; (2) Socio-Political Factors; (3) Contractual/Legal Arrangements; (4) The Technical Design; (5) Management Plan; and (6) Moral/Ethical/Utility Questions. The success of a goal-free evaluation is dependent upon the quality of the needs assessment. If there is not an accurate comprehension of the program participants' needs, then the entire evaluation may be in jeopardy.
A goal-free evaluation could feasibly be ethnographic. However, goal-free evaluation focuses on using observation to connect needs to actual program activities, rather than for thick description. Furthermore, traditional ethnography focuses on culture which is always goal-free in nature.
Guba and
British scholars call it
‘human inquiry’ (inquiry conducted in human ways for humane ends); American
scholars call it ‘action research’ (research which aims to produce action on or
through it[s] findings), and third world or developmental evaluators call it
‘developmental evaluation’ (evaluation which develops the understanding, and
resources to respond, of those evaluated). A common generic term for it is
‘collaborative inquiry’ (which simply describes what goes on when you use the
method).
In constructivist evaluation, evaluation is:
a. A process that combines data collection and data valuing (interpretation) into one inseparable.
b. A local process.
c. A sociopolitical process.
d. A teaching and learning process.
e. A continuous, recursive and divergent process.
f. An emergent process.
g. A process for sharing accountability.
h. A hermeneutic dialectic relationship.
In collaborative inquiry the people being evaluated participate as informed collaborators rather than research subjects. The purpose of a constructivist evaluation is to attain a deeper comprehension of all the issues encountered by all the stakeholders and consumers, while the goals comprise mutual education, improved awareness, and increased motivation to utilize the evaluation results. Most constructivist evaluators are relativists and implement qualitative methods; however, the evaluation is conducted in a disciplined manner and it produces an audit trail to ensure transparency and credibility of its findings. The realities discovered by the constructivist inquiry are the constructions of the reality proposed by the evaluees themselves. They develop into co-constructions and subsequently reconstructions, as both evaluators and evaluees mold them. The constructivist evaluation assumes that evaluators are unable to maintain distance from the evaluees. Therefore, it accepts a hermeneutic dialectic. Guba and Lincoln continue by rejecting the positivist assumptions, which they claim are embedded in most evaluation methodology. They use “validity” as an example of a term whose positivist definition evaluators are socialized into accepting. Furthermore, they feel that the relationship between the evaluator and the program managers is often characterized by disenfranchisement and disempowerment. The constructivist evaluation, in the same vein as Scriven's goal-free evaluation, aims to restore the balance of power.
The process of Constructivist Evaluation can be illustrated in nine steps:
1. Identify all relevant stakeholders.
2. Elicit from each stakeholder group their construction and concerns regarding the issue at hand.
3. Provide a context and methodology that allow multiple constructions, claims, concerns, and issues to be comprehended, critiqued, and factored into the evaluation as appropriate.
· Conduct this methodology within each stakeholder group.
· Cross-fertilize each group with the constructions, claims, concerns, and issues identified by other stakeholder groups, or issues drawn from the literature or other sites. All viewpoints are taken into account as long as they are open to critique and criticism.
4. Generate consensus.
5. Prepare an agenda for negotiation on items where there is little or no consensus.
6. Collect and provide the information requested in the agenda for negotiation.
7. Establish and facilitate a forum of stakeholder representatives where negotiation can occur.
8. Develop a report, or several reports, that communicate any consensus on constructions and resolutions. Additionally, the report should communicate the pertinent issues raised by other stakeholder groups.
9. Recycle the evaluation to continue working on unresolved constructions.
The main limitation of Guba and Lincoln's model is that it minimally acknowledges the fundamental role of evaluation in determining the merit, worth, significance, value, quality, or importance of the program, which are core elements within the definition of evaluation (see, for example, Scriven, 1991; Davidson, 2005; and Sanders, 1994). Guba and Lincoln claim to offer a formative evaluation model placing little emphasis on making an evaluative conclusion and more on program improvement through consensus-building; however, to suggest improvement, they must determine deficits in the evaluand, and thus they do actually evaluate. A second weakness of the constructivist model is its assumption that stakeholders will always offer reliable, valid, and honest information. There may be many factors contributing to a stakeholder's knowledge, ability, and candor that must be weighed relative to the observed program impacts in providing a valid evaluative conclusion.
Ethnography is an applied social science
research method, while evaluation incorporates various research methods, one of
which may be ethnography. The purpose of ethnography is thick description and
cultural interpretation; evaluation's aim is to systematically judge a
program's merit and develop an evaluative conclusion. The qualitative
evaluation approach has demonstrated benefits for evaluators, and three of
these approaches are epitomized in the anthropological models of evaluation.
Responsive evaluation, goal-free evaluation, and constructivist evaluation have
conceptual and methodological similarities. An evaluator should be able to
recognize when one of these ethnographic or anthropological models may be
feasible and appropriate in evaluating a program. The evaluator should then present
the model and its strengths and limitations to the program stakeholders to be
considered when selecting the most appropriate evaluation methodology. Sound
evaluation typically requires the employment of both quantitative and
qualitative research methods. Ethnography and the anthropological models of
evaluation may be best suited as a supplement to the quantitative components of
an evaluation and serve as a way of triangulating data collection methods and
data sources. A competent evaluator should be informed of these ethnographic
techniques and the anthropological models of evaluation.
Altschuld, J. W. and Witkin, B. R. (2000). From needs assessments to action: Transforming needs into solution strategies.
Beebe, J. (n.d.). Rapid assessment process.
Beebe, J. (1995). Basic concepts and techniques of rapid appraisal. Human Organization, 54(1), 42-51.
Carney, T. (1991). Fourth generation evaluation. Canadian Journal of Communication, 16(2).
Davidson, E. J. (2005). Evaluation methodology basics: The nuts and bolts of sound evaluation.
Denzin, N. K. & Lincoln, Y. S. (Eds.). (2000). Handbook of qualitative research (2nd ed.).
Deneberg, V. (1969). Prolizityies A. Zeitgeister. Psychology Today, 311(50).
Evers, J. W. (1980). A field study of goal-based and goal-free evaluation techniques. Unpublished doctoral dissertation,
Fetterman, D. M. (1982). Ethnography in educational research: The dynamics of diffusion. Educational Researcher, 11(3), 17-29.
Fetterman, D. M. (1986a). The ethnographic evaluator.
Fetterman, D. M. (1986b). Conceptual crossroads: Methods and ethics in ethnographic evaluation. In D. D. Williams (Ed.), Naturalistic evaluation: New directions for program evaluation.
Fetterman, D. M. (1998). Ethnography. In L. Bickman & D. J. Rog (Eds.), Handbook of applied social research methods (pp. 473-504).
Genzuk, M. (2004). A synthesis of ethnographic research. University of Southern California-Center for Multilingual, Multicultural Research. Retrieved February 3, 2005 from http://www-ref.usc.edu/~genzuk/Ethnographic_Research.pdf
Guba, E. G. & Lincoln, Y. S. (1989). Fourth generation evaluation.
Hall, B. (n.d.). Methods: What is ethnography? Center for Public Interest Anthropology at
Hopson, R. K. (2002). Making (more) room at the evaluation table for ethnography: Contributions to the responsive constructivist generation. Exploring evaluation role and identity (pp. 37-56). Information Age Publishing.
Jessor, R., Colby, A., & Shweder, R. A. (1996). Ethnography and human development: Context and meaning in social inquiry.
Maxwell, J. A. (1998). Designing a qualitative study. In L. Bickman & D. J. Rog (Eds.), Handbook of applied social research methods (pp. 69-100).
McLean, L. D. (1975). Judging the quality of a school as a place where the arts might thrive. In R. Stake (Ed.), Evaluating the arts in education: A responsive approach (pp. 41-58).
Nastasi, B. K. & Berg, M. J. (1999). Using ethnography to strengthen and evaluate intervention programs. In J. J. Schensul, M. D. LeCompte, G. A. Hess, B. K. Nastasi, M. J. Berg, L. Williamson, J. Brecher, & R. Glassner (Eds.), Ethnographer's toolkit.
Patton, M. Q. (1990). Qualitative evaluation and research methods (2nd ed.).
Payne, D. A. (1994). Designing educational project and program evaluations: A practical overview based on research and experience.
Sanders, J. (1994). The program evaluation standards (2nd ed.).
Scriven, M. (2004). Zen and art of everyday evaluation. Journal of MultiDisciplinary Evaluation, 1. Retrieved July 20, 2005 from http://www.wmich.edu/evalctr/jmde/content/JMDE_Num_001_Part_I.htm
Scriven, M. (1991). Evaluation thesaurus (4th ed.).
Scriven, M. (1973). Goal-free evaluation. In E. R. House (Ed.), School evaluation: The politics and process.
Scriven, M. (1967). The methodology of evaluation. AERA Monograph Series on Curriculum Evaluation, Vol. 1 (pp. 39-83).
Seafield Research & Development Services. (n.d.). Fourth generation evaluation. Retrieved January 27, 2005 from http://www.srds.ndirect.co.uk/4th.htm
Shadish, W. R. (1994). The guiding principles of evaluation.
Stake, R. (1975). Evaluating the arts in education: A responsive approach.
United States General Accounting Office. (2003). Ethnographic studies can inform agencies’ actions. GAO-03-455.
Wholey, J. S., Hatry, H. P., & Newcomer, K. E. (Eds.) (2004). Handbook of practical program evaluation (2nd ed.).
Wikipedia (n.d.). Hermeneutics. Retrieved March 22, 2005 from http://en.wikipedia.org/wiki/Hermeneutics.htm
Wolcott, H. F. (1980). How to look like an anthropologist without really being one. Practicing Anthropology, 3(2), 56-59.
Wolcott, H. F. (1982). Ethnographers sans ethnography: The evaluation compromise.
Worthen, B. R., Sanders, J. R., & Fitzpatrick, J. L. (1997). Program evaluation: Alternative approaches and practical guidelines (2nd ed.).
About the Author
Brandon W. Youker
obtained his Bachelor of Arts in Social Work from
Although some might argue that this review is a little late in coming, it is worth revisiting Pawson and Tilley’s 1997 book, Realistic Evaluation (reprinted in 1998, 2000, 2001, and 2002), as causation and evidence-based research and evaluation continue to be topics of debate and concern in the evaluation and research communities (see A Call to Action: The First International Congress of Qualitative Inquiry and The Claremont Debate, in this issue of JMDE). Realistic Evaluation is rooted in the tradition of scientific realism, which is said to be one of the “dominant axes in modern European thinking” (p. 55). In the most general of terms, scientific realism concerns “the nature and operation of causal forces” (p. 55). The essential ingredients for assessing these causal forces are C-M-O configurations, where C represents context, M represents mechanisms, and O represents outcomes. Context refers to the “spatial and institutional locations of social situations, together, crucially, with the norms, values, and interrelationships found in them” (p. 216). Mechanisms are the “choices and capacities which lead to regular patterns of social behavior” and the causal mechanisms which generate these patterns of behavior are “deemed ‘social problems’ and which are the rationale for a program” (p. 216). Outcomes “provide the key evidence for the realist evaluator in any recommendation to mount, monitor, modify, or mothball a program” (p. 217). Building on the C-M-O configuration, the authors set out the way in which causation in the “social world should be constructed” and argue that the “basic realist formula” is “mechanism + context = outcome” (p. xv).
Chapter 1, A History
of Evaluation in 28 ½ Pages, presents Pawson and Tilley’s version of the
history of evaluation. The authors begin this history with the experimental
evaluations of the 1960s of the “great social programs of the ‘great society’
[the
Chapter 2, Weaknesses in Experimental Evaluation, presents Pawson and Tilley’s exposé of the inherent problems with the experimental tradition; namely, the experimentalists’ “epistemological assumptions about causation and their lack of fit with the nature of social programs” (p. 30). Essentially, the authors argue that, more often than not, change cannot be captured in OXO terminology. All in all, it is asserted that “by its very logic, experimental evaluation either ignores these underlying process [causal mechanisms], or treats them incorrectly as inputs, outputs or confounding variables, or deals with them in a post hoc and thus arbitrary fashion” (p. 54).
In Chapter 3, In With the New: Scientific Realism, the authors present the principles and practice of scientific realism. As previously mentioned, the realist (generative) view of causation can be described as follows (as illustrated by the explosion of gunpowder):
Our basic concern is still, of
course, the outcome (the spark
causing the explosion). But what does the explanatory work is first of all the mechanism (the chemical composition of
the substance which allows the reaction), and secondly the context (the physical conditions which allow the mechanism to come
into operation). This proposition—causal outcomes follow mechanisms acting in
contexts—is the axiomatic base upon which all realist explanations build.
(Pawson & Tilley, 1997, p. 58)

Source: Pawson, R. and Tilley, N. (1997). Realistic evaluation.
Figure 1. Generative Causation
Chapter 4, How to Design a Realistic Evaluation, presents the realist evaluation cycle (see Figure 2) and three case studies which apply realist evaluation principles to varying degrees.

Source: Pawson, R. and Tilley, N. (1997). Realistic evaluation.
Figure 2. The Realist Evaluation Cycle
The first case study presented is an evaluation of property marking and described by the authors as testing theory, the second is an evaluation of a housing project and described as theory formation and development, and the third is an evaluation of a prison-delivered higher education program also described as theory formation and development. These case studies are described in some detail and are intended to represent exemplars of realistic evaluation.
I have opted to exclude a review of the remaining chapters (5-9) as these merely focus on collecting realist data and the methodological procedures involved in conducting evaluation as prescribed by Pawson and Tilley.
Despite the book’s title, the true underlying premise of Pawson and Tilley’s Realistic Evaluation is not merely a proposition of how to conduct evaluation, but rather a treatise on the nature of causation and science. While the authors’ notion of causation (scientific realism) is compelling, I am not entirely convinced that it is the “final solution” to the causation debate. Neither is it a dramatic improvement over the successionist or other traditions. In their haste to prescribe generative explanations, they fail to recognize or acknowledge that numerous experimentalists (and non-experimentalists) give considerable attention to context in their accounts of causation (e.g., moderators, mediators, interaction effects), often to a greater degree than the examples provided throughout the book suggest. Moreover, these causal accounts (i.e., realist accounts) seem little more than explanations of program effectiveness for different groups or consumers, which can be accomplished without the use of realist principles.
Prior reviews
(Patton, 1999; Rogers, 1999) of Realistic
Evaluation have been mixed. For example,
I rarely respond to attacks on or
distortions of my views, especially when they’re based on the twenty-year-old first
edition of the book (Patton, 1978) and don’t take into account subsequent
revisions and elaborations (Patton, 1986, 1997) that I hope have corrected at
least some earlier weaknesses, and have benefited from well-deserved and
well-meaning critiques. I have learned that responding to a distortion risks
reinforcing the very thing I want to correct by calling attention to it.
However, the distortions in the opening chapter of Pawson and Tilley, in which
they sarcastically and disparagingly review (and bemoan) the history of
evaluation, are anything but innocent or trivial. The irony is that, in the
introduction, the authors claim the mantle of “detachment,” “objectivity,” and
“scientific evaluation” (p. xiii). Their mocking review of evaluation’s history
has one primary purpose: positioning themselves as saviors of the profession by
redirecting us to be scientists first and foremost.
(Patton, 1999, p. 387)
While Realistic Evaluation has spurred serious
interest and debate, and even spawned an issue of New Directions for Evaluation (Henry, Julnes, & Mark, 1998),
the approach has not quite received the attention in North America that it has
in the
References
Henry, G. T., Julnes, G. & Mark, M. M. (Eds.) (1998). Realist evaluation: An emerging theory in support of practice. New Directions for Evaluation, 78.
Patton, M. Q. (1999). Realistic evaluation [review of the book Realistic evaluation]. American Journal of Evaluation, 20(2), 385-388.
Pawson, R. and Tilley, N. (1997). Realistic evaluation.
Rogers, P. J. (1999). Realistic evaluation [review of the book Realistic evaluation]. American Journal of Evaluation, 20(2), 381-383.
[1] The sample reviewed in
this paper includes Australia/New Zealand,
[2] For more precise
definitions see: American Psychological Association, Board of Educational
Affairs, Developing and Evaluating
Standards and Guidelines Related to Education and Training in Psychology,
Context, Procedures, Criteria and Format, Approved by the APA Council on
[3] Principles and guidelines
can be made mandatory by including them in contractual agreements between
commissioners and evaluators.
[4] Using the taxonomy of
Marie-Louise Bemelmans-Videc, Ray C. Rist and Evert Vedung, Carrots, Sticks & Sermons: Policy
Instruments and their Evaluation, Transaction Publishers,
[6] According to the Wikipedia Encyclopedia, to conduct oneself as a professional is to act in accordance with specific rules, written or unwritten, pertaining to the standards of a profession. Evaluation being a young profession, it has yet to develop internationally agreed standards.
[7] David Marquand, Decline of the Public, Polity, 2004
[8] This is why the notion of
collective responsibility in evaluation is inappropriate.
[9] Carol H. Weiss, Evaluation, Second Edition, Prentice
Hall,
[10] Evert Vedung, Public Policy and Program Evaluation,
Transaction Publishers, 1999
[11] Michael Scriven, Evaluation Thesaurus: Fourth Edition. Sage
Publications.
[12] Michael Scriven, opus
cit.
[13] See Michael Scriven, Hard Won Lessons in Program Evaluation,
Sage, New Directions Publication No 58.
[14] John Flower with Gabi
Ebbers, Global Financial Reporting,
[15] From this perspective,
Alexander Hamilton’s wise words about constitutions are relevant: “(they)
should consist only of general provisions: the reason is that they must
necessarily be permanent and that they cannot calculate for the possible change
of things”.
[16] The Joint Committee was created in 1974. It published standards for evaluation of educational programs, projects and materials in 1981 and personnel evaluation standards in 1988. The Joint Committee was accredited by the American National Standards Institute (ANSI) to work on program evaluation standards in 1989. Student Evaluation Standards were published in 2003. The Joint Committee Program Evaluation Standards were approved by ANSI in 1994.
[17] Peter H. Rossi, Howard E.
Freeman, Mark W. Lipsey, Evaluation: A
Systematic Approach, Sixth Edition, Sage Publishers.
[18] The African Evaluation
Association guidelines (not reviewed in this paper) include modifications in
thirteen out of thirty US PES standards.
[19] The notion that one needs
to trade off accountability for learning is mistaken. It reflects a common
confusion between the distinctive roles of evaluation consultants (who are
loath in their capacity to criticize the performance of their employers and the
policies they pursue) and independent evaluators (who face no such
constraints).
[20] Joseph Wholey, Handbook
of Practical Program Evaluation (Chapter 2), Jossey-Bass Publishers,
[21] According to Donald T.
Campbell, “while all nations are engaged in trying out innovative reforms, none
of them are yet organized to adequately evaluate the outcome of these
innovations”. This observation led him to pioneer the concept of the
“experimenting society” in which “policy relevant knowledge is created,
critically assessed and communicated in real life or natural (not laboratory)
settings, with the aim of discovering through policy experimentation, new forms
of public action which signify a gain in the problem-solving capacities of
society”. See William N. Dunn, Ed. The
Experimenting Society, Essays in Honor of Donald T. Campbell, Policy
Studies Review Annual, Volume 11, Transaction Publishers, New Brunswick, 1998
[22] Additional space is often
devoted to commentaries about the guidelines.
[23] The first author has known this NGO well since 1996. He has served as its Head of Mission in
[24] Note that if the 2000
Annual American Evaluation Association National Conference was on “
[25] AUTHOR'S NOTE: The author of this
paper uses the terms “ethnography,” “ethnographic techniques,” and often
“qualitative research methods” interchangeably. Additionally, the term
“program” is used generically, to refer to the evaluand*. Ethnography in
the context of this paper is primarily in regards to program and policy
evaluations. Ethnography may also be used in product, personnel, and
performance evaluations.—* “Evaluand: That which is being evaluated
(e.g., program, policy, project, product, service, organization)” (Davidson,
2005, p. 240).
[26] Alternative definitions: Ethnography
is “a descriptive study of an intact cultural or social group or an individual
or individuals within the group based primarily on participant observation
and open-ended interviews. Ethnography is based on learning from people as
opposed to studying people” (Beebe, n.d.). Ethnographic research
“involves the study of groups and people as they go about their everyday
lives” (Emerson, Fretz, & Shaw, 1995). “Ethnography is the art and science
of describing a group or culture” (Fetterman, 1989, p. 11).
[27] Naturalism: Leave natural phenomena alone.
[28] Constructivist philosophy maintains
that the researcher manufactures knowledge through her interaction in the field
and that there is no objective truth to be uncovered (ontological relativism)
(Maxwell, 1998 in Bickman & Rog, 1998).
[29] Heuristics is a form of phenomenological inquiry focusing on the personal experiences and insights of the researcher; it considers the researcher’s experience in addition to that of other observers who experience the phenomenon.
[30] Emic perspective is that
of the insider and includes the acceptance of multiple realities.
[31] Ontology: The nature of
the real.
[32] The new meaning of constructivist methodology: Truth is determined by consensus building among informed constructors, not by correspondence with an objective reality. Facts are meaningless without a value framework; therefore, no proposition can be objectively assessed. Causes and effects do not exist; accountability is relative and implicates all interacting parties equally (Guba & Lincoln, 1989).