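Preparatory Activities

In any evaluation there are activities which must precede judging merit. This section describes the following preparatory activities: staffing the evaluation, defining the purpose of the evaluation, identifying strategies for data collection, describing the assessment program, and modifying the checklist.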
Staffing the Evaluation

Evaluators ought to be knowledgeable about all aspects of an assessment. Glancing ahead to the checklist suggests many of the areas in which evaluators must be competent to make judgments. Evaluators should understand issues of test construction and sampling. Equally important is the ability of evaluators to judge social utility. Sixteen courses in measurement and statistics will not guarantee such insights.

Experts who possess all of the necessary knowledge are hard to find. A team approach may be used to ensure that various kinds of expertise are represented. I was a member of the 1975 site-visit team appointed by the National Center for Education Statistics to evaluate National Assessment. The membership of that team was strikingly diverse. Five of the nine members were not educationists. I was the only member of the team who could be identified with the traditional measurement-assessment community. Other members of the team contributed perspectives and knowledge that I did not have. David Wallace from the University of Chicago and David Farrell of Harvard were experts in data analysis and computer utilization. Martin Frankel from the National Opinion Research Center played this same role but also pored over the details of NAEP's national sample and research operations. Frederick Hays from the Fund for the City of New York knew more about the management of large-scale bureaucracies than most academicians could ever hope to know. Maureen Webster from Syracuse Educational Policy Research Center and Norman Johnson of Carnegie-Mellon University had remarkable perspectives on national politics and could comment on which NAEP activities were likely to affect public policy and how.

Diversity has some drawbacks, of course. It's expensive to buy. In the NAEP example cited, some diversity might well have been traded for a deeper study by a less varied team. Also, not having shared assumptions may make cooperation difficult.
But this liability is also a strength since it protects against bias or narrow vision. Though evaluators must be technically competent, there is a danger in selecting assessment experts to evaluate assessment. Assessors are likely to be philosophically committed to assessment. Others in the evaluation-measurement community have decided that testing is fundamentally harmful. It would be difficult to find many who were technically trained who did not already have leanings or biases in one direction or the other. These judges may be so committed to their beliefs about the value of assessments that they can be counted on only to discriminate degrees of goodness and badness. Overall judgments of the social utility of assessment programs may more appropriately be left to judges who are less professionally and philosophically committed to a particular view on assessment. A method of evaluation has little validity if the conclusions it generates depend on the personal bias of the evaluators conducting it. Care must be taken from the outset to ensure both the validity and credibility of the evaluation. A first step is to identify biases of potential evaluators. Ideally, we should be able to identify those who favor assessment and those who do not in the same way that Republicans are distinguished from the Democrats or that the Republicanism of Goldwater is different from Rockefeller's. Some affiliations for or against assessment are public knowledge, some are subtle or unconscious and cannot be identified, but some are ones that the evaluators themselves would accede to. At the very least, candidates should be asked to rate themselves on the pro-assessment anti-assessment continuum. Then, an evaluation team should be composed to balance points of view. Evaluators of different persuasions might work together to reach consensus or might function as adversaries using a judicial or adversary approach. 
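The self-rating and balancing step described above can be illustrated with a small sketch. It is only one possible selection rule, not a procedure proposed in this paper: it assumes a hypothetical self-rating scale running from -3 (strongly anti-assessment) to +3 (strongly pro-assessment) and simply alternates between the two ends of the continuum. The candidate labels and the function name are invented for illustration.

```python
from statistics import mean


def compose_balanced_team(self_ratings, team_size):
    """Pick candidates alternately from the anti- and pro-assessment ends.

    self_ratings: dict mapping candidate -> self-rating on a hypothetical
    scale from -3 (anti-assessment) to +3 (pro-assessment).
    """
    ranked = sorted(self_ratings, key=self_ratings.get)  # most skeptical first
    low, high = 0, len(ranked) - 1
    team = []
    take_skeptic = True
    while len(team) < team_size and low <= high:
        if take_skeptic:
            team.append(ranked[low])
            low += 1
        else:
            team.append(ranked[high])
            high -= 1
        take_skeptic = not take_skeptic
    return team


# Hypothetical candidates and self-ratings (assumed data, not from the paper).
candidates = {"A": -3, "B": -1, "C": 0, "D": 2, "E": 3}
team = compose_balanced_team(candidates, team_size=4)
team_mean = mean(candidates[name] for name in team)
print(team, team_mean)
```

A team mean near zero suggests the two persuasions roughly offset one another. Self-ratings capture only the affiliations that candidates are willing to acknowledge, so a rule like this addresses acknowledged bias and nothing more.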
In his paper on Evaluation Bias and Its Control, Scriven (1975) recommended a "double-teaming" approach whereby two evaluators (or teams) work independently and then critique each other's reports. There are many public and private complaints about the biases of the blue-ribbon panel of experts who evaluated the Michigan Accountability System (House, et al., 1974). Murphy and Cohen (1974) characterized the evaluators as follows:
Although they might disapprove of these one-line summaries, House and Rivers would probably locate themselves at the "skeptical-about-testing" end of the continuum. The characterization of Stufflebeam is most off the mark; he would probably say he was the pro-testing voice on the panel. One way to reduce these complaints, however, would have been to identify these philosophical differences in advance. The evaluators had a contract with NEA and MEA to prohibit censorship of their reports. But, they had no defense against the claim that they were hand-picked by the NEA to represent its point of view. One good protection would have been to invite the assessment staff or some other pro-assessment group (which the NEA is not) to nominate half the evaluation team. My guess is that they would have selected Stufflebeam or someone very much like him, but such post hoc speculation is not as good as providing the guarantees at the outset.

Similar problems might arise if the Oregon State Department tried to quiet hostile anti-testers with the recent formative evaluation of their assessment. The evaluation was conducted by Frank Womer and Irv Lehmann, who are inveterate assessors. Their pro-assessment affiliation--as well as the fact that they conducted a formative evaluation and never meant to question the existence of the assessment--would hardly be palatable to those who believe the assessment does more harm than good.

Detecting and controlling bias is not simple. Perhaps we need an inventory to administer to potential evaluators. It could test knowledge and attitudes including questions such as, "Are you more like Bob Ebel or Jim Popham?" It's the kind of inventory that would have to be validated using known groups. Items would be retained if they helped identify clusters such as Bob Ebel and Frank Womer, Ernie House and Bob Stake, or Wendell Rivers and Jane Mercer.

Another possibility is to select evaluators for their disinterest and
train them in the technical issues. Scriven
(1975) pointed out that the federal General Accounting Office is
independent enough to be above suspicion though they may currently lack
the expertise for conducting evaluation. In a similar context, he mentioned
Alan Post, California's non-partisan Legislative Analyst. My colleague,
Gene Glass, was recently an expert witness in a trial involving the
interpretation of test results. He was impressed with the ability of
two bright lawyers to grasp the technical issues and use the information
after brief but intensive instruction. These examples make the idea
of selecting evaluators who are not technically trained seem promising.
It should certainly be considered for a large-scale evaluation.

Defining the Purpose of the Evaluation

Specifying the purpose of an evaluation makes it more likely that the purpose will be accomplished. This point has been discussed at length in the evaluation literature and will not be belabored here. The reader should beware, of course, that we now have to keep straight the goals of the evaluation, the goals of the assessment being evaluated, and the goals of the educational system being assessed.

The purpose of the evaluation will determine how the evaluation is conducted, what information is collected, and how it will be reported. Stake (1969) wrote that identifying goals of the evaluation involves recognizing audiences and the kinds of questions to be answered. One technique for accomplishing this is to simulate some possible conclusions and try them out on intended audiences. Would legislators want to know about recommended changes in test scoring or that other assessment programs obtain twice as much information for half as much money? Do classroom teachers want to know if the tests really measure what they are purported to measure or if the assessment should be made every five years instead of three?

One of the most important clarifications of purpose is the formative-summative
distinction. Scriven (1967) originally coined the terms to distinguish
between evaluations designed primarily to identify strengths and weaknesses
for improvement (formative evaluation) and those intended to pass an
overall judgment on a program (summative evaluation). Although Scriven
(1974) subsequently stated that "good formative evaluation involves
giving the best possible simulation of a summative evaluation" (p.9),
the intention of the two types of evaluation is fundamentally different.
Evaluators engaged in a summative evaluation are more likely to call
into question the existence of the whole enterprise. Formative evaluators
are more likely to assume the project's continuance and look for ways
to correct weaknesses.

Identifying Strategies for Data Collection

Having decided on the evaluation's purpose, the evaluators must plan their work. Hopefully, the checklist explicated in this paper will serve to determine what kinds of information ought to be collected. There are, however, some over-arching issues about how information is gathered.

The well-known evaluation of the Michigan Accountability System (House, et al., 1974) was based on the examination of written documents and on testimony of various representatives of the educational system. Hearings were also used in New York by Webster, Millman, and Gordon (1974) to study the effects of statewide pupil testing. Their findings were combined with results of a survey of college admissions officers regarding the usefulness of the regents' examinations and a thorough analysis of testing costs. Two evaluations of National Assessment have used site-visits. Greenbaum (in press) studied NAEP by reviewing transcripts of the founding conferences and by interviewing NAEP staff. In Florida, staff for the House Committee on Education ("Staff report," 1976) interviewed teachers and administrators from a random sample of ten school districts to identify issues in education. Accountability and assessment was one of the areas probed.

Instructional programs are hard to evaluate, but there is still the possibility that with the right combination of tests and observations it will be possible to document program effects. The effects of assessment are more elusive. Evaluations of assessment will necessarily draw heavily on opinion surveys. This is risky. Opinions may be fairly accurate reflections of feelings about an assessment but may not be very good indicators of other kinds of effects. Somewhere in an educational research text was the observation that survey research is good for many things, but finding out which teaching method is best is not something one answers with a survey.
Evaluators will have to plan strategies so as not to miss effects. Some suggestions are given in the effects section of the checklist. Scriven's (1974) modus operandi method may be called for, searching for clues as Sherlock Holmes would have. Perhaps one could juxtapose release of assessment results and government spending or study the relationship between sending home pupil test scores and the incidence of unrequested parent visits to school. In the case of the Coleman report (1966), one would try to establish its effects on the funding of compensatory education by first interviewing politicians. Ask what facts they believe are "proven" or are "common knowledge" regarding equality of educational opportunity. Ask for sources they know of that substantiate these facts or ask their legislative aides. Go to the materials they identify and seek their sources in turn. Although it oversimplifies the method, we could say that the more frequently Coleman appears in these bibliographies the greater the impact of that report on compensatory funding. A similar "search for connections" might be used to link the decline in SAT scores with the reinstatement of freshman remedial writing courses. This issue was not saved for the effects section of the paper since some decisions have to be made early on. Because opinions are likely to be an important source of information, evaluators will have to decide whom to talk to and when. Scriven (1974) argued convincingly that the evaluator ought to search for effects without knowing what the effects were supposed to be; thereby, his search would be less biased. Such goal-free evaluation might be particularly difficult in the assessment case since a number of persons will want to talk about goals, though this could be saved for last. Examining test materials and reports will not suffice since judgments about the adequacy of materials will depend on purposes. This point is given considerable attention later in the paper. 
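The "search for connections" described above for the Coleman report amounts to following sources back through their bibliographies and counting how often the report turns up. A minimal sketch follows, under the same oversimplification the text admits to. The document names, the bibliography data, and the function name are all hypothetical; in practice the bibliographies would come from interviews and archival work.

```python
from collections import deque


def trace_citations(bibliographies, starting_sources, target):
    """Follow cited sources breadth-first from the starting documents.

    Returns (hits, visited): how many of the visited bibliographies list
    the target report, and how many documents were examined in total.
    """
    seen = set(starting_sources)
    queue = deque(starting_sources)
    hits = 0
    visited = 0
    while queue:
        document = queue.popleft()
        visited += 1
        cited = bibliographies.get(document, [])
        if target in cited:
            hits += 1
        for reference in cited:
            # Seek the sources of each source in turn, but stop at the target.
            if reference != target and reference not in seen:
                seen.add(reference)
                queue.append(reference)
    return hits, visited


# Hypothetical bibliographies: each document maps to the sources it cites.
bibliographies = {
    "memo_a": ["report_x", "coleman_1966"],
    "memo_b": ["report_x"],
    "report_x": ["coleman_1966", "study_y"],
    "study_y": [],
}
hits, visited = trace_citations(bibliographies, ["memo_a", "memo_b"], "coleman_1966")
print(hits, visited)
```

A higher share of hits among the documents visited would be read, cautiously, as a sign of greater influence. The count says nothing about how the report was used, so it is a clue to follow up and falls well short of proof.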
A compromise which Scriven agrees to is that evaluators may begin goal-free and later switch to a goal-based approach. This would allow detection of unintended as well as intended effects. Then the evaluator could pursue program purposes and intents. To this end, the Assessment Checklist could be modified to consider effects first.

Each datum collected about an assessment program is fallible. The best procedure to ensure a fair evaluation is to cross-check each opinion with other data sources. Our strategies should be much like those of investigative reporting now made famous by Woodward and Bernstein (1976). If a teacher testifies that all her children cried when they took the test, verify the effect. Don't just ask another teacher if her children cried--though a random sample of teachers might be appropriate. Check the story with representatives from other levels in the system. Interview children or watch while a test is given.

Evaluators should collect information by more than one method to balance the errors of a single method. In the now classic volume on Unobtrusive Measures, Webb et al. (1966) referred to this safeguard as "triangulation":
In addition, the evaluation design should be flexible enough to allow
for the identification of additional data sources to corroborate the
data collected in the first round.

Describing the Assessment Program

The Assessment Checklist is a set of categories for judging assessment programs. But, evaluation is not only judging. It has a large descriptive component. Some evaluation theorists have, at some times, argued that evaluation is only description--providing information to the decision-maker; they have argued that the application of standards and values ought to be left to the reader (Stufflebeam, 1971; Stake, 1973). I have taken the larger view that the evaluator must provide judgments as well as facts. But, description is essential in either case. Evaluation reports should include a narration about the following aspects of assessment: history, rationale and purpose, development, procedures, and dissemination of results.
These descriptions will be needed by many readers as a context for the judgments that follow. Narration will also be important for better informed readers (like assessment staff) to establish that the evaluators knew what it was they were judging. Some disagreements about judgments may be traced to misinformation about operations or purpose. If the evaluators have decided to begin their study goal-free, most of the descriptive information except procedures should be held until after the study of effects. But, in the final report, context information should be presented first.

History includes the testing programs and other educational indicators available before the assessment; preliminary committee work to create the assessment; legislation; implementation schedule; and the current stage of the assessment in the total chronology.

Rationale and purpose should include an account of why the assessment was instituted. An example of assessment purpose is this excerpt from Florida Statutes (Section 229.57(2)) describing legislative intent:

Another example is taken from the Teacher's Manual for the California Second and Third Grade Reading Test:
Of course, there are purposes that are not acknowledged in assessment documents. Unofficial purposes will emerge as evaluators contact various assessment constituents.

Development of the assessment is the history of implementation: hiring assessment staff, identifying educational goals, specifying objectives, selecting subcontractors, test construction, field testing, and review panels. Many of the development activities will be judged by technical criteria in the checklist. Most evaluators will find it clearer to combine descriptions and judgments in their reports. In the process of their evaluation, however, thorough descriptions will precede judgment.

Procedures are the most observable aspects of the assessment. Procedures refer to the distribution of tests or proctors, children taking practice tests, teachers coding biographical data, children taking tests or performing exercises, and machine scoring of instruments.

Dissemination of results
will be an important part in judging assessment usefulness. In the descriptive
phase, "Dissemination of results" means identifying which kind of information
is released to which audiences and in what form. What are the typical
channels for dissemination of results after they leave the assessment
office? Do principals pass information on to teachers? What is the frequency
and tone of newspaper stories? (Incidentally, I recommend that assessment
staff institute a clipping service. Select a representative sample
of 10 newspapers in the state and save any articles pertaining to assessment.
Do not save only those articles that are examples of journalistic abuses.
If such a file is not available, evaluators will have to do some archival
work.)

Modifying the Checklist

In the introduction to this paper, I argued that evaluation models had to be "translated" to determine what they implied in a specific evaluation context. A method proposed specifically for evaluating assessment requires less translation. But, there are still unique evaluation purposes and features of the assessment being evaluated which require some special tailoring of the plan.

The best way to use the Assessment Checklist is to review the categories and specify what each means in a specific assessment context. Some subcategories may be omitted if they do not apply. However, these same categories may be reinstated if early data collection suggests they are relevant after all.

The categories of my checklist are not discrete. They interact and give each other meaning. This problem is present in the Scriven and Stufflebeam frameworks, but it is exacerbated by the effort here to be more specific. In the 1975 evaluation of NAEP, Maury Johnson ("An evaluation of the NAEP," 1975) outlined three sets of evaluation questions appropriate for evaluating any assessment program:
These categories are comprehensive and distinct. The level of conceptualization is so synthesized that goals and attainments appear as one category. Still, these grand sets of questions imply most of the finer entries in the Assessment Checklist. The difficulty is that the more specific one becomes, the harder it is to separate the categories--order them, keep them from overlapping and interacting.

The most serious and pervasive interaction in the checklist is that of purpose with technical adequacy. Questions of sampling or test construction have to be considered in light of purposes or uses. Measurement texts always carry the adage that validity is not a property of a test, but of a use of a test. It is for this reason that goal-free evaluators may study effects without knowing purposes but must be informed about goals before they judge technical adequacy.

The checklist is arranged in an apparently sensible order. An evaluator may proceed in a different order or do several things at once. The best way to use the checklist is to become familiar with the major categories so as to decide how these issues or combinations of issues ought to be addressed in a particular evaluation. It will serve its purpose well if an important area is probed that might otherwise be omitted.