Sources for the Checklist

Although the usefulness of this paper will depend on the unique insights it provides for evaluating assessments, it would be foolish to start from scratch. Existing evaluation models are not immediately applicable to evaluating assessments; but, if properly translated, they already contain many of the important elements that one would wish to include in an evaluation of assessment. Almost any model, if studied in the assessment context, would add to our evaluation schema for assessment.

The purpose of this section of the paper is to review three existing evaluation frameworks: Stufflebeam's Meta-Evaluation Criteria (1974a), Scriven's Checklist for the Evaluation of Products, Procedures, and Proposals (1974), and Stake's Table of Contents for a Final Evaluation Report (1969). The review is meant to acknowledge how much of the groundwork has already been laid for evaluating public service programs (see also Sanders and Nafzinger, 1976) and to illustrate the additional steps needed to apply these schemata to assessment.

Before discussing the applicability of these frameworks for the evaluation of assessment programs, one point should be made. These three sets of evaluation categories in Tables 1, 2, and 3 were devised by their authors for different purposes. None were intended for evaluation of assessment programs. There is some awkwardness in applying each to the evaluation of an assessment, since assessment is both the object of an evaluation and an evaluation activity itself. The Scriven and Stake outlines should be read as if assessment were the educational activity being evaluated. Stufflebeam's Meta-Evaluation criteria are for the evaluation of evaluation.

Stufflebeam's framework is unlike the other two in that it was not intended for judging ordinary educational endeavors. It was meant to be used to evaluate evaluations. Though it would be a grave mistake to equate evaluation and assessment, assessment is nonetheless an evaluation activity. Hence, the Stufflebeam criteria are generally applicable to the evaluation of assessments. These criteria are summarized in Table 1.

Table 1
Stufflebeam's Meta-Evaluation Criteria
I. Technical Adequacy Criteria
  1. Internal Validity:   Does the assessment design unequivocally answer the question it was intended to answer?
  2. External Validity:  Do the assessment results have the desired generalizability? Can the necessary extrapolations to other populations, other program conditions, and other times be safely made?
  3. Reliability:  Are the assessment data accurate and consistent?
  4. Objectivity:  Would other competent assessors agree on the conclusion of the assessment?
II. Utility Criteria
 
  1. Relevance:  Are the findings relevant to the audiences of the assessment?
  2. Importance:  Have the most important and significant of the potentially relevant data been included in the assessment?
  3. Scope:  Does the assessment information have adequate scope?
  4. Credibility:  Do the audiences view the assessment as valid and unbiased?
  5. Timeliness:  Are the results provided to the audiences when they are needed?
  6. Pervasiveness:  Are the results disseminated to all of the intended audiences?
III. Efficiency Criterion
  1. Is the assessment cost-effective in achieving the assessment results?

Note the substitution of assessment as the object of the Meta-Evaluation. Assessment is not evaluation, but it may be considered an evaluation activity.

Note:  Table 1 derived from Stufflebeam, 1974a.

A creditable job of evaluating assessment could be done by following Stufflebeam's outline. Some of Stufflebeam's criteria could be used as they are. For example, Timeliness and Scope are sub-categories in the Assessment Checklist and have essentially the same meaning as Stufflebeam gave them. The Internal Validity criterion, however, would have to include a number of assessment issues that are not immediately obvious, such as the content validity of assessment materials. A more serious difficulty in using only the Stufflebeam criteria is that they have some apparent omissions. For example, the "utility" criteria neglect negative side-effects. This is not to suggest that Stufflebeam would overlook them in his own work; indeed, he has not when conducting actual evaluations (House, Rivers, and Stufflebeam, 1974). But a novice might make this error if he did not understand the full implication of what was meant by "utility." Hopefully, then, reducing such risks justifies developing a new plan for evaluation when other schemes already exist.

The Scriven Checklist (1974) in Table 2 has greatly influenced the content of this paper as well as the decision to adopt a checklist format. If conscientiously applied, the Scriven Checklist would result in good evaluation of an assessment program. Once again, however, many of the important considerations are implicit rather than explicit. Scriven's Need category refers to the social importance of a program. The evaluator must determine real need, not just audience receptivity. After all, "snake oil is salable" (Scriven, 1974, p. 13). Scriven's elaboration of the need criterion also includes the requirement of uniqueness. A need diminishes in importance if it is already served by other programs. Checkpoints for importance and uniqueness are subsumed in the Assessment Checklist by evaluation of goals.

Table 2
Scriven's Checklist for Evaluating Products, Procedures, and Proposals
1.  Need (Justification) Need must be important and not served by other programs.
2.  Market (Disseminability) Many needed products are unsalable. Must be mechanisms for reaching intended market.
3.  Performance (True Field Trials) Try out the final product in typical setting with real users.
4.  Performance (True Consumer) Who are the real recipients? (Don't focus on teachers, legislators, and state department and miss parents, taxpayers, employers.)
5.  Performance (Critical Comparisons) Would another program be more successful or less costly (e.g., district-initiated testing programs)?
6.  Performance (Long-Term) Long-lasting effects.
7.  Performance (Side-Effects) Unintended effects.
8.  Performance (Process) What goes on when the product is implemented (e.g., teacher anxiety, student enjoyment, parent involvement)?
9.  Performance (Causation) Are the observed effects really due to the product? (Would the legislature have made those same appropriations anyway?)
10.  Performance (Statistical Significance) Are the effects considered in the above criteria real or due to sampling error?
11.  Performance (Educational Significance) Synthesis of items 1-10.
12.  Costs and Cost-Effectiveness Psychic as well as dollar costs.
13.  Extended Support Systematic continuing procedure for upgrading product.
Note:  Table 2 derived from Scriven, 1974.

Scriven's Market checkpoint is distinguished from Need. Some things, like safety belts, are needed but are not salable. Much of what is implied by Scriven's Market criterion corresponds to Stufflebeam's Pervasiveness criterion. Elements of this issue emerge on the Assessment Checklist in both the Technical and Effects categories. Reporting is a technical matter but is also related to whether assessment results are used by their intended audiences.

Scriven suggests eight performance categories that, when combined with Market and Need, are the basis for judging the most important checkpoint, Educational Significance. In general, Scriven has focused on effects. There appears to be little concern for what Scriven (1967) earlier referred to as intrinsic or secondary evaluation, that is, judging the design of a program. Intrinsic evaluation in this context would include many of the technical criteria and some of the management considerations proposed in the Assessment Checklist. Instead, Scriven has focused on questions of program or product "payoff." Of course, if technical criteria are failed, effects will be near zero or negative. Certainly Scriven would not miss technical errors or management inefficiency. Nonetheless, there is some merit to making these requirements explicit so that others will not miss them.

Stake's (1969) Table of Contents for a Final Evaluation Report, Table 3, provides a larger perspective than is suggested by the other two schemata. Rather than just proposing what kinds of information should be collected and judged, Stake has suggested some preliminary activities that ought to occupy the evaluators, e.g., determining the goals of the evaluation. One is reminded that if one were applying the Scriven Checklist or the Stufflebeam Criteria, there would be some additional activities required for beginning and ending the evaluation.

Table 3
Stake's Table of Contents for a Final Evaluation Report
I.  Objectives of the Evaluation
  1. Audiences to be served by the evaluation (assessment staff, legislators, public).
  2. Decisions about the program, anticipated.
  3. Rationale, bias of evaluators.
II.  Specification of Program (Assessment)
  1. Educational philosophy behind the program.
  2. Subject matter.
  3. Learning objectives, staff aims (assessment objectives).
  4. Instructional procedures, tactics, media.
  5. Students (students, teachers, district personnel who participate in the assessment).
  6. Instructional and community setting.
  7. Standards, bases for judging quality (in the assessment context those would be criteria for evaluating the assessment rather than performance standards in the instruments themselves).
III.  Program Outcomes
  1. Opportunities, experiences provided.
  2. Student gains and losses (effects on the educational system).
  3. Side effects and bonuses.
  4. Costs of all kinds.
IV.  Relationships and Indicators
  1. Congruencies, real and intended (a report of all congruence between what was intended and what was actually observed).
  2. Contingencies, causes and effects.
  3. Trend lines, indicators, and comparisons.
V.  Judgments of Worth
  1. Value of outcomes.
  2. Relevance of objectives to needs.
  3. Usefulness of evaluation information gathered (significance and implications of the findings).
Note:  Table 3 derived from Stake, 1969.

Much of the Stake outline is redundant with what has already been discussed. An evaluator using Scriven's performance checkpoints or Stake's outcome categories would focus on the same considerations. Stake does make explicit the identification of standards, although this can be inferred from Scriven as well. Stake's specification of the program section includes some descriptive elements which are not in the other frameworks and which have likewise been excluded from the Assessment Checklist and placed in a preparatory section. In the Relationships and Indicators portion, Stake goes further than Scriven in suggesting that effects be linked to program transactions.

In addition to the three general frameworks reviewed, existing evaluations of assessment programs were studied to glean specific criteria for judging assessments. The following reports were read, some in rough draft:

  • The 1974 site visit evaluation of National Assessment ("An evaluation of the NAEP," 1974)
  • The 1975 site visit evaluation of National Assessment ("An evaluation of the NAEP," 1975)
  • Greenbaum's study of NAEP (in press)
  • House, Rivers, and Stufflebeam's Assessment of the Michigan Accountability System (1974)
  • The Michigan State Department Response to the House et al. evaluation ("Staff response," 1974)
  • Stufflebeam's response to the response (1974b)
  • Murphy and Cohen's article on the Michigan experience (1974)
  • A legislative evaluation of Minnesota's Assessment ("Minnesota," 1975)
  • The response of the Minnesota Assessment Staff ("Response to the senate," 1975)
  • A staff report to the Florida House Committee on Education ("Staff report to the committee," 1976)
  • Blue Ribbon Panel report on Statewide Pupil Testing in New York State (in press)
  • Womer and Lehmann's evaluation of Oregon's Assessment (1974)

Complete references for these works are given in the bibliography.

Individual reports are referenced in the text when they provide substantiation or elaboration of a particular topic.