|
||||||||||||||||||||||||||||||||||||||||||||||||
|
Although the usefulness of this paper will depend on the unique insights it provides for evaluating assessments, it would be foolish to start from scratch. Existing evaluation models are not immediately applicable to evaluating assessments; but, if properly translated, they already contain many of the important elements that one would wish to include in an evaluation of assessment. Almost any model, if studied in the assessment context, would add to our evaluation schema for assessment. The purpose of this section of the paper is to review three existing evaluation frameworks: Stufflebeam's Meta-Evaluation Criteria (1974a), Scriven's Checklist for the Evaluation of Products, Procedures, and Proposals (1974), and Stake's Table of Contents for a Final Evaluation Report (1969). The purpose of this review is to acknowledge how much of the groundwork has already been laid for evaluating public service programs (see also Sanders and Nafzinger, 1976) and to illustrate the additional steps needed to apply these schemata to assessment.
Before discussing the applicability of these frameworks to the evaluation of assessment programs, one point should be made. The three sets of evaluation categories in Tables 1, 2, and 3 were devised by their authors for different purposes. None was intended for the evaluation of assessment programs. There is some awkwardness in applying each to the evaluation of an assessment, since assessment is both the object of an evaluation and an evaluation activity itself. The Scriven and Stake outlines should be read as if assessment were the educational activity being evaluated.
Stufflebeam's Meta-Evaluation Criteria are unlike the other two frameworks in that they were not intended for judging ordinary educational endeavors; they were meant to be used to evaluate evaluations. Though it would be a grave mistake to equate evaluation and assessment, assessment is nonetheless an evaluation activity. Hence, the Stufflebeam criteria are generally applicable to the evaluation of assessments. These criteria are summarized in Table 1.
A creditable job of evaluating assessment could be done by following Stufflebeam's outline. Some of Stufflebeam's criteria could be used as they are. For example, Timeliness and Scope are sub-categories in the Assessment Checklist and have essentially the same meaning as Stufflebeam gave them. The Internal Validity criteria, however, would have to include a number of assessment issues that are not immediately obvious, such as the content validity of assessment materials. A more serious difficulty in using only the Stufflebeam criteria is some apparent omissions. For example, the "utility" criteria neglect negative side-effects. This is not to suggest that Stufflebeam would overlook them in his own work; indeed, he has not when conducting actual evaluations (House, Rivers, and Stufflebeam, 1974). But a novice might make this error if he did not understand the full implication of what was meant by "utility." Reducing such risks, it is hoped, justifies developing a new plan for evaluation when other schemes already exist.
The Scriven Checklist (1974) in Table 2 has greatly influenced the content of this paper as well as the decision to adopt a checklist format. If conscientiously applied, the Scriven Checklist would result in a good evaluation of an assessment program. Once again, however, many of the important considerations are implicit rather than explicit. Scriven's Need category refers to the social importance of a program. The evaluator must determine real need, not just audience receptivity. After all, "snake oil is salable" (Scriven, 1974, p. 13). Scriven's elaboration of the need criterion also includes the requirement of uniqueness. A need diminishes in importance if it is already served by other programs. Checkpoints for importance and uniqueness are subsumed in the Assessment Checklist under the evaluation of goals.
Scriven's Market checkpoint is distinguished from Need. Some things, like safety belts, are needed but are not salable. Much of what is implied by Scriven's Market criterion corresponds to Stufflebeam's Disseminability. Elements of this issue emerge on the Assessment Checklist in both the Technical and Effects categories. Reporting is a technical matter but is also related to whether assessment results are used by their intended audiences.
Scriven suggests eight performance categories that, when combined with Market and Need, are the basis for judging the most important checkpoint, Educational Significance. In general, Scriven has focused on effects. There appears to be little concern for what Scriven (1967) earlier referred to as intrinsic or secondary evaluation, that is, judging the design of a program. Intrinsic evaluation in this context would include many of the technical criteria and some of the management considerations proposed in the Assessment Checklist. Instead, Scriven has focused on questions of program or product "payoff." Of course, if technical criteria are failed, effects will be near zero or negative. Certainly Scriven would not miss technical errors or management inefficiency. Nonetheless, there is some merit in making these requirements explicit so that others will not miss them.
Stake's (1969) Table of Contents for a Final Evaluation Report, Table 3, provides a larger perspective than is suggested by the other two schemata. Rather than just proposing what kinds of information should be collected and judged, Stake has suggested some preliminary activities that ought to occupy the evaluators, e.g., determining the goals of the evaluation. One is reminded that if one were applying the Scriven Checklist or the Stufflebeam Criteria, there would be some additional activities required for beginning and ending the evaluation.
Table 3
Stake's Table of Contents for a Final Evaluation Report
Much of the Stake outline is redundant with what has already been discussed. An evaluator using Scriven's performance checkpoints or Stake's outcome categories would focus on the same considerations. Stake does make explicit the identification of standards, although this can be inferred from Scriven as well. Stake's specification of the program section includes some descriptive elements which are not in the other frameworks and which have likewise been excluded from the Assessment Checklist and placed in a preparatory section. In the Relationships and Indicators portion, Stake goes further than Scriven in suggesting that effects be linked to program transactions.
In addition to the three general frameworks reviewed, existing evaluations of assessment programs were studied to glean specific criteria for judging assessments. The following reports were read, some in rough draft:
Individual reports are referenced in the text when they provide substantiation or elaboration of a particular topic. |
||||||||||||||||||||||||||||||||||||||||||||||||