A Basis for Determining the Adequacy of Evaluation Designs
by
James R. Sanders
Western Michigan University
Dean N. Nafziger
Northwest Regional Educational Laboratory
April 1976
Paper #6
Occasional Paper Series
This paper was prepared under contract support from the
Alaska Department of Education to the Northwest Regional
Educational Laboratory.
TABLE OF CONTENTS
Introduction
Basic Questions Regarding Evaluation
A Checklist for Judging the Adequacy of Evaluation Designs
Example Application of the Checklist to an Evaluation Design
A Review of Previous Work as a Basis for Determining the Adequacy of an Evaluation Design
References
INTRODUCTION
In recent years, the educational community has widely acknowledged the usefulness of evaluation in providing information about educational programs, policies, and curricula; as a result, evaluation studies are presently an expected--and often mandated--part of most educational programs. At the same time, many evaluation studies fail dismally in their mission of providing helpful and critical decision-making information. Too often such failure is attributable to poor prior planning.
The purpose of this paper is to provide a basis for judging the adequacy of evaluation plans or, as they are commonly called, evaluation designs. The authors assume that using the procedures suggested in this paper to determine the adequacy of evaluation designs in advance of actually conducting evaluations will lead to better evaluation designs, better evaluations, and more useful evaluative information.
To assist the reader, the paper has been divided into four general sections. Readers are encouraged to concentrate on those sections that seem more appropriate for their needs.
First, some basic questions are considered: Why evaluate? Why do we need evaluation designs? Why do we need a basis for judging the adequacy of an evaluation design? Answers to these questions should serve to underscore the importance of providing a consistent basis for judging evaluation designs.
Second, a checklist (1) of basic considerations important in judging evaluation designs is presented. Each component of that checklist is briefly discussed within this section.
Third, a sample design is presented, together with an example of how the checklist can be used in judging an evaluation design.
Fourth, the thoughts of noted professional educators about judging the adequacy of evaluation designs are presented. This fourth section is intended especially for the reader who would like additional background based upon current literature in the field.
We anticipate that the primary audience for this paper will be educators and educational administrators--particularly project directors and evaluators--who have to deal with evaluations frequently. The paper is not written for a highly technical audience; the authors recognize that many educators have not had time to devote to the detailed study of measurement and statistics. Therefore, in the interest of making the paper useful to the widest possible readership, the criteria presented for judging designs rely on concepts that are easily communicated or commonly known to educators. Technical or otherwise esoteric concepts are deliberately omitted.
Information contained in this paper can be used in two ways. First, it can be used by evaluators as a guide in preparing--and later reviewing and improving--their own evaluation designs. Second, project directors can use the checklist to judge the adequacy of evaluation designs submitted to them. Special communication needs often arise between an evaluator and project director; evaluation designs can facilitate clear communication and can serve as a standard to assure quality evaluation. An evaluation design provides a written record of decisions about the evaluation to which both the evaluator and project director can refer.
BASIC QUESTIONS REGARDING EVALUATION
As a preface to the checklist of criteria for determining the adequacy of evaluation designs, a few basic questions relating to evaluation are briefly addressed in this section. Answers to these questions amplify the assumptions and rationale underlying this paper. (2)
Why Evaluate?
Evaluation gives information about the quality of educational programs. Without it, we could not know whether a curriculum was effective, whether a student was performing satisfactorily, or whether the dollars earmarked for education were being spent well.
Proper evaluation is an essential part of all education. Its benefits may include the following:
1. Identification of strengths and weaknesses--a first step toward improvement.
2. Detection of problems before correction becomes difficult or impossible.
3. Identification of needs that should be addressed through educational action.
4. Identification of human and other resources that can be used effectively in education.
5. Documentation of desired outcomes of education.
6. Information useful in educational planning and decision making.
7. Cost information that can ultimately reduce educational expense.
Why Do We Need Evaluation Designs?
Everyone implicitly engages in evaluation virtually every day of his life. When buying a new coat or choosing a restaurant we make decisions based on our evaluation of the quality of the available choices. These evaluations are often informal and are seldom planned in terms of procedures and outcomes. Given time constraints and the relatively low penalties for making errors, such informal evaluations are entirely appropriate. However, when the choices or courses of action affect students, result in expenditures of scarce public funds, or involve long-term commitments or benefits, the situation is different.
Carefully planned evaluation procedures, which are referred to in this document as designs, help both the project director and the evaluator understand the process through which a program or project will be judged. The design also provides for the organization of resources and activities that are required for an evaluation study.
Preparation and use of an evaluation design has benefits for both the evaluator and the project director. Presenting an evaluation design gives the evaluator an opportunity to communicate with project staff concerning proposed evaluation procedures in order to ensure their clear understanding of the process. At this point changes can be made without disrupting the evaluation. For the project director and staff, an evaluation design provides an opportunity to review the type of information to be obtained by the evaluation so that additional or alternative types of data collection can be suggested if necessary to provide complete information to all users of the evaluation results. Also, evaluation procedures can be reviewed in order to ensure that no unexpected disruptions of the program will occur. Many misunderstandings have occurred between evaluator and project staff, and many an evaluation has shifted in focus, because a clear, systematic evaluation design was not prepared early in the evaluation.
The advantages to completing an evaluation design early include the following:
1. Assuring clear and accurate direction for the study by establishing the uses for evaluation results.
2. Assuring completeness of procedures by giving others an opportunity to make suggestions.
3. Identifying inconsistencies in perceptions by the evaluator and project director of evaluation plans so that these can be resolved prior to actual evaluation.
4. Providing a clearly defined set of tasks for the evaluation so that attention is maintained on important outcomes.
5. Assuring efficiency in the evaluation by organizing resources and activities. (Like any substantial educational undertaking, evaluation requires good management and accounting.)
In short, evaluation design helps the evaluator and project director communicate clearly about the project. Because of the importance of the design, it is critical that it be closely scrutinized and all details discussed. Specific criteria or guidelines are particularly helpful to clients in critically reviewing a design.
Why Do We Need a Basis for Judging the Adequacy
of an Evaluation Design?
Most school administrators have few, if any, persons on their staffs with sufficient training and experience in evaluation to judge the adequacy of evaluation designs solely on the basis of their own knowledge. Furthermore, qualified persons are in such demand that they are often unable to spend the time necessary to personally review all evaluation designs used in the system. Therefore, administrators and other educators are often left with little or no help in determining whether designs proposed for evaluations of their programs are sound and capable of providing useful information about those programs. Given this situation, there is a need for written guidelines that might serve as a basis for judging an evaluation design. Several benefits are expected to accrue from the use of such guidelines:
1. The guidelines should improve the quality of evaluation. Established guidelines should represent what is known about producing useful, technically correct evaluation, and their use should therefore preclude many errors common to evaluation studies.
2. The guidelines should provide a framework for developing evaluation designs. Established guidelines clarify and make public the expectations about what a good evaluation design ought to include. Because they aid communication in this way, guidelines can be used as a basis for designing evaluations.
3. The guidelines should assist administrators in monitoring evaluation work. The use of guidelines ensures that important aspects of an evaluation will be described in the design, and that descriptions will be specific enough to assist in monitoring the evaluation study.
4. The guidelines can help address ethical considerations in contract evaluation work. Established guidelines help guarantee that aspects of the evaluation that are subject to questions of ethics--such as reporting procedures, information release and dissemination policies--will be considered, and relevant issues resolved prior to the evaluation study. This in turn helps prevent inappropriate use of the evaluation results.
Ethical conduct in educational evaluation is a critical issue that pervades much of the current literature on evaluation. Unfortunately the scope of this paper does not permit an adequate discussion of the topic. A comprehensive treatment of ethical standards and conduct, while in order, must await another document devoted specifically to that issue.
A CHECKLIST FOR JUDGING
THE ADEQUACY OF EVALUATION DESIGNS
Virtually everyone involved in any way with evaluation is concerned with the quality of the evaluation effort. The checklist presented on the following pages provides a basis for judging the adequacy of evaluation designs. The checklist is divided into four general sections, each of which covers several criteria regarding evaluation designs. Those criteria are addressed through a set of related questions. All criteria are more thoroughly discussed following the presentation of the checklist.
Briefly, the four general sections are as follows. The first section includes Criteria concerning the adequacy of evaluation planning, which covers such issues as whether the proposed evaluation addresses all important aspects of the program, and whether the evaluation can be completed within existing constraints.
The second section includes Criteria concerning the adequacy of the collection and processing of information. These questions cover the reliability, objectivity, and representativeness of the information obtained.
The third section, Criteria concerning the adequacy of the presentation and reporting of information, deals with the usefulness and completeness of the anticipated reports.
The fourth section includes General Criteria, those that deal with ethical considerations and protocol.
Use of the Checklist
The checklist should be used like any other set of guidelines. Once the design has been read thoroughly, each item on the checklist should be considered with respect to the design. For each question related to the criteria, one of the four available options--Yes, No, ?, or Not Applicable (NA)--should be circled, depending on whether the criterion was adequately met.
Each question should be clearly and fully addressed by the evaluation design. If that is the case and if the requirements of the question are met, the reviewer should circle "Yes." For any question that is not discussed or whose requirements are not met, the reviewer should circle "No." If for some reason--such as inadequate information--it cannot be determined whether the question is appropriately answered, the reviewer should circle "?" If a question is not applicable to a particular evaluation, the reviewer should circle "NA."
In the space marked "Elaboration" the reviewer should note any additional comments that should be transmitted to the author of the evaluation design. In particular, if a criterion was not met or if there was some question about its being met, elaboration would be warranted. Further, ambiguous intentions or plans seeming to require revision should all be noted in the "Elaboration" section. When the checklist is completed, it should be given to the evaluator and to others affected by the evaluation so that it can be used to revise the evaluation design.
There will likely be instances in which the reviewer will want to obtain advice from another person about whether a question has been appropriately answered. For example, this might occur when judging information about the validity of a test or about the appropriateness of a data collection design. The user of the checklist should always seek and obtain advice when the content of an evaluation design or items on the checklist prevent him from making a judgment.
It is important to remember that an evaluation design is a vehicle for communication between an evaluator and those whose role calls for reviewing the evaluation plan. The checklist helps organize that communication. In cases where an evaluation is conducted by a contractor, the design becomes a contract between the evaluator and the client. In such cases the checklist assists a client in judging the adequacy of the design, and provides a basis for giving feedback to the evaluator. If the evaluator is involved in the program being evaluated, the guidelines provide a basis for the evaluator and his or her colleagues to check the design.
Each major point of consideration noted in the checklist is reviewed in the next few pages, along with information that should be covered in an evaluation design.
Criteria Concerning the Adequacy of Evaluation Planning
A. Scope. The evaluation design should include plans to collect information about all significant aspects of the program, product, or process being evaluated. If a student's performance is being evaluated, and the evaluation design does not call for collecting information about conditions that might adversely affect his or her performance, that oversight should be noted. The primary concern of this criterion is whether the focus of the evaluator's attention is too narrow.
B. Relevance. The design should include plans to collect information that addresses the concerns of those who requested the evaluation. For example, if a compensatory education project is being evaluated and the project director is concerned about upgrading the reading skills of children in the program, the evaluation design should call for collecting information about improvement in children's reading skills. To make the design relevant to the needs of the evaluation audiences, the evaluator should indicate the various audiences that need information and give the expected uses of the information. Any suggestions or changes concerning the information to be collected should be noted.
C. Flexibility. The evaluation design should be open enough to allow for the addition of new information gathering and processing activities. This is especially important in complex, long-term program evaluations where changes in program plans are likely. If a new program directed toward changing the attitudes of minority children toward school is just getting underway, and the evaluation design does not allow for changes in instrumentation resulting from changes in program objectives, it should be noted that the criterion is not met; and suggested means of allowing for such change should be given.
D. Feasibility. The evaluation design should provide enough information so that the feasibility of carrying out the study can be determined. Many evaluation designs fail to meet this criterion. Feasibility can be determined on the basis of schedules, budget, personnel assigned to conduct specific activities, proposed procedures in data collection, and reporting plans. An evaluation design is not useful unless it can actually be implemented.
Criteria Concerning the Adequacy of the Collection
and Processing of Information
A. Replicability. The evaluation design should include procedures for assuring that the information being collected is accurate and that if the evaluation were replicated the same results would occur. Statistical reliability indices should be provided for standardized instruments, and procedures for determining the reliability of information collected by nonstandardized instruments should be included in the evaluation design. The reviewer should check the design to see whether such information is provided. If the design provides no way to check the accuracy or replicability of information being collected, those concerns should be described.
B. Objectivity. The evaluation design should incorporate procedures to control for biases. Those biases that may affect an evaluator's collection or interpretation of information should be clearly labeled and minimized. Methods for maintaining fairness and objectivity--such as the use of external data collectors, objective and unbiased instrumentation, or interpretation panels for reporting findings--should be incorporated into an evaluation design whenever possible. If the reviewer has concerns about inherent bias in the evaluation design, those concerns should be noted and discussed with the evaluator.
C. Representativeness. The information to be collected should accurately represent the program or project being evaluated. Data collection instruments should be valid, and they should obtain information that bears upon all the evaluation questions. Information about all significant aspects of the program should be reported. Sampling procedures are often used when the amount of information needed for a complete picture becomes too unwieldy. When this is done, representative samples should be selected.
Criteria Concerning the Adequacy of the Presentation
and Reporting of Information
A. Timeliness. The evaluation design should describe how reports and other presentations fit into the schedule for decision making. Report deadlines should reflect the informational needs of the persons to whom the presentations are directed. The design should contain a reporting schedule and content descriptions of reports or other presentations, and show the relationship to the decision-making schedule.
B. Pervasiveness. The evaluation design should call for the delivery of reports or presentations to all relevant audiences. These include any persons or groups that affect or are affected by the evaluation itself or the object of the evaluation. Suggestions about the distribution of evaluation information should be recorded under "Elaboration."
General Criteria
A. Ethical Considerations. The evaluation design should cover whatever ethical considerations may be of concern. In some cases certain information obtained through the evaluation may be confidential, and steps to protect confidentiality should be included in the design. An evaluator should also be aware that some data collection procedures--such as use of peer informers--may be threatening to subjects, and such practices should be avoided. Additional ethical considerations not addressed within the design should be noted under "Elaboration."
B. Protocol. The evaluation design should include some consideration of protocol. For example, it is often necessary to obtain a superintendent's permission to talk to a building principal or teacher before actually contacting that person. In many cases, it is professional courtesy to request permission to use the work of others before referencing it. In all phases of information collection and reporting, strict protocol should be observed.
Summarizing the Information Contained in the Checklist
After considering each question on the checklist, a reviewer will have circled responses in one column and a number of comments in the other. "No" or "?" responses indicate a need for additional information. Comments in the "Elaboration" section will provide a basis for making improvements in the design. In short, the information from the checklist summarizes for the evaluator what changes are needed to make the evaluation design acceptable.
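For reviewers who keep the completed checklist in electronic form, the summary step described above can be sketched as follows. This is only an illustration: the `ChecklistItem` record, its field names, and the sample questions and comments are hypothetical and are not taken from the checklist itself.

```python
from dataclasses import dataclass

VALID_RESPONSES = {"Yes", "No", "?", "NA"}


@dataclass
class ChecklistItem:
    """Hypothetical record of one answered checklist question."""
    criterion: str         # e.g. "Scope" or "Feasibility"
    question: str          # the question the reviewer answered
    response: str          # one of "Yes", "No", "?", "NA"
    elaboration: str = ""  # comments for the author of the design


def items_needing_revision(items):
    """Return the items answered "No" or "?", in checklist order."""
    for item in items:
        if item.response not in VALID_RESPONSES:
            raise ValueError(f"unknown response: {item.response!r}")
    return [item for item in items if item.response in {"No", "?"}]


# Illustrative review; the questions and comments are invented.
review = [
    ChecklistItem("Scope", "Are all significant aspects covered?", "Yes"),
    ChecklistItem("Feasibility", "Is a budget given?", "No",
                  "No budget is included."),
    ChecklistItem("Replicability", "Are reliability indices reported?", "?",
                  "Not stated for one of the instruments."),
    ChecklistItem("Protocol", "Are permissions addressed?", "NA"),
]

for item in items_needing_revision(review):
    print(f"{item.criterion} [{item.response}]: {item.elaboration}")
```

The resulting list pairs each "No" or "?" response with its "Elaboration" comment, which is the information the evaluator needs in order to revise the design.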
Whenever evaluation is conducted under contract, the evaluation design becomes an important focus of communication among the evaluator, his staff, and the client. Modifying the design to make it acceptable to both sides can aid that communication process. Should irreconcilable differences arise between evaluator and client, one alternative is to terminate the relationship; another is to bring in an objective outsider to negotiate changes. In most cases, however, differences can be resolved through design modification.
The following section of the paper provides a sample application of the checklist; that sample application is intended to clarify concepts described in this section. The reader is encouraged to gain experience in using the checklist by first applying it to the design, and then comparing his results with those of the authors.
EXAMPLE APPLICATION
OF THE CHECKLIST TO AN EVALUATION DESIGN
The checklist for judging evaluation designs that is given in the previous section is to be used as a tool to help identify strengths and weaknesses in an evaluation design. The design can then be improved before the evaluation begins.
In this section, the checklist is applied to a fictitious evaluation design not intended to represent any actual evaluation study; any resemblance to an existing evaluation study is purely coincidental. Rather, the design represents the type of evaluation design frequently encountered by project directors and other administrators. The design is neither all good nor all bad. As will be seen, it contains some components that are entirely adequate and others that require improvement.
The second part of this section of the paper is the actual application of the checklist. Each question in the checklist is answered for the fictitious design, and an explanation of each answer is given.
Evaluation Design for the Hartman Reading Program
for Five Boroughs in Alaska
Introduction
In recent years reading instruction has become a major target area for education not only in Alaska but throughout the United States. As a result of this emphasis, several new reading programs, textbooks, and instructional materials have been developed.
Recently, one of these new programs, the Hartman Reading Program, was adopted jointly by five Alaskan boroughs: Elk Mountain, Donelly, Banks, Karnaska, and Port. The Hartman Reading Program is appropriate for students in grades one through six. It was selected because it had been developed for use in a variety of cultural settings, and because it purported to improve the self-concept of students from minority cultural groups. The expense involved in adopting the Hartman Reading Program was too much to be borne by any one borough alone, but a joint effort made adoption feasible.
The purpose of this evaluation is to determine whether the Hartman Program is fulfilling the goals that the five boroughs have set for new reading programs.
Program Goals and Evaluation Questions
The five-borough Planning Committee that selected the Hartman Reading Program has established four goals that any new reading program within those boroughs is expected to attain. These four goals are listed below, along with several associated evaluation questions.
Goal 1: Children in the program will achieve in all reading subjects at a rate commensurate with their own age, ability, and grade level.
Question 1.1: How does the performance of children in the new program, as measured on a standard reading achievement test, compare to that of other children in the United States at the same grade level?
Question 1.2: How does the performance of children in the new program, as measured on a standard reading achievement test, compare to the performance of children in the district in past years?
Question 1.3: How does the performance of children in the new program compare to that of children in the old reading program?
Goal 2: Children in the new program will demonstrate growth in self-esteem and improvement in self-concept.
Question 2.1: How do children in the new program compare with children in the old program on measures of self-esteem and self-concept?
Goal 3: All teachers and staff members of participating classrooms will be involved in a comprehensive inservice training program.
Question 3.1: What percentage of teachers and staff members from participating classrooms have taken the voluntary training program?
Question 3.2: To what extent do teachers and staff members express satisfaction with the training program?
Goal 4: Parents will be involved in the implementation of the new program.
Question 4.1: What percentage of parents of students in participating classrooms became involved in the classroom activities designed for parents?
Audiences for the Evaluation
The primary audience for the evaluation is the Planning Committee for the five boroughs. Based upon the results of the evaluation, the Planning Committee will decide to adopt the Hartman Reading Program throughout the five boroughs, or to eliminate use of the program. That decision will be made in July.
One secondary audience for the evaluation is teachers throughout the boroughs. Data collected during the pretest can be used by teachers to diagnose students' reading difficulties and poor self-concepts.
Another secondary audience consists of project directors, evaluators, and other educators throughout the state who would like information about the Hartman Reading Program or about the evaluation procedures used in this study.
Data Collection Design for the Hartman Reading Program
In order to allow for classroom differences while making necessary comparisons, a pre-post test, treatment-control group design was developed. Students in the new program are designated the treatment or experimental group, and those in the regular school program are considered the control group. Three alternative methods for gathering comparative data have been designed. Each of these designs depends on random assignment of students or classrooms to treatment and comparison groups at the beginning of the school year. The alternatives are listed below in order of desirability. Because more desirable designs may also be more difficult to implement, the most desirable alternative that can be implemented within the constraints imposed by the school situation will be chosen.
Alternative I: Random Assignment of Students Within Classrooms.
This experimental design allows for random assignment of students to program and control groups within classrooms. This design is based on the assumption that such assignments are acceptable to teachers, and that the two reading programs can be implemented in each classroom.
Student Selection Procedure.
1. Determine, by grade and classroom, the number of students who would participate in the program.
2. Make an alphabetical list, by classroom, of students who may be selected to fill program to capacity. (This list should contain twice the number of students needed to fill program quota.)
3. Alternately assign names to program and control groups in each classroom, as follows: first name on list to program; second name to comparison group; third name to program; fourth name to comparison group, etc.
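The three steps above can be sketched in a few lines of code. This is a minimal illustration of the procedure exactly as the design states it; the function name and the roster are hypothetical. Note that, as written, the assignment is fully determined by the alphabetical order of the list.

```python
def alternate_assignment(names):
    """Split one classroom's list into program and comparison groups.

    Follows the stated procedure: alphabetize the list, then assign the
    first name to the program, the second to the comparison group, the
    third to the program, and so on.
    """
    ordered = sorted(names)      # step 2: alphabetical list for the classroom
    program = ordered[0::2]      # 1st, 3rd, 5th, ... names
    comparison = ordered[1::2]   # 2nd, 4th, 6th, ... names
    return program, comparison


# Hypothetical roster for one classroom (twice the program quota of three).
roster = ["Kim", "Abel", "Marsh", "Dunn", "Ortiz", "Baker"]
program, comparison = alternate_assignment(roster)
print("Program:   ", program)
print("Comparison:", comparison)
```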
Alternative II: Random Assignment of Classes. The second alternative involves the random assignment of entire classes to treatment and control groups. It assumes that several classes of students at each grade level can adopt the new program or remain with the old one.
Classroom Selection Procedure.
1. Determine, by grade, the number of students who would participate in the new program.
2. Prepare a list, by grade, of classes that would participate in the program. Assign a number to each classroom on the list.
3. Use a random number table to select classes to participate in the treatment group, and choose half of the classes for that purpose. The remainder will constitute the control group.
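The classroom selection procedure can be sketched as follows. The design calls for a random number table; this sketch substitutes a seeded pseudo-random number generator for the table, which is an assumption made for illustration. The function name, the class labels, and the handling of an odd number of classes (the extra class goes to the control group) are also hypothetical.

```python
import random


def select_treatment_classes(classes, seed=None):
    """Randomly choose half of the listed classes for the treatment group.

    The remaining classes form the control group. A seed may be supplied
    so that the selection can be reproduced and audited.
    """
    rng = random.Random(seed)
    # Step 2: assign a number to each classroom on the list.
    numbered = dict(enumerate(classes, start=1))
    # Step 3: draw half of the numbers at random (stand-in for the table).
    chosen = set(rng.sample(sorted(numbered), k=len(numbered) // 2))
    treatment = [numbered[n] for n in sorted(numbered) if n in chosen]
    control = [numbered[n] for n in sorted(numbered) if n not in chosen]
    return treatment, control


# Hypothetical list of classes at one grade level.
classes = ["Room 1", "Room 2", "Room 3", "Room 4", "Room 5", "Room 6"]
treatment, control = select_treatment_classes(classes, seed=1976)
print("Treatment:", treatment)
print("Control:  ", control)
```

Recording the seed alongside the list of classes serves the same purpose as noting the page and starting position used in a random number table: another person can repeat the draw and confirm the assignment.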
Alternative III: Teacher Selection of Program. This alternative allows teachers to choose whether they would like to participate in the new program or continue in the old one. The selection procedure simply involves allowing teachers to choose according to their preferences.
The comparison design will be used to determine the effects of the Hartman Reading Program in the areas of reading performance and self-concept. Statistical techniques appropriate for the design chosen will be used. Comparative analysis of differences in performance on all pre- and post-tests will be included in the design. The specific question answered here is whether children in the program are learning significantly more than comparable children not in the program.
Reporting Procedures
Three types of reports will be prepared: a Teacher Report for each teacher, an Administrative Report, and a Technical Report.
A Teacher Report will be compiled for each teacher's classroom to summarize pretest data for the classroom. The teacher feedback report will include:
--- Tables (two per class) showing scores, percentiles, and stanines for each pupil on each test.
--- Tables (two per class) of profiles showing graphically the percentile equivalents of the average score for each test and comparison of each child with his class, with children in other classes, and with students at the same grade level in other schools tested.
--- Local norms and standardization as given in the Administrative Report.
--- An interpretive guide for using the data provided.
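As an illustration of the quantities such tables contain, the sketch below computes a percentile rank and a stanine for each pupil within a single class. It is not the publisher's norming procedure: the percentile-rank formula (scores below plus half of the ties), the treatment of scores that fall exactly on a stanine boundary, and the pupil labels and scores are all assumptions. The stanine bands themselves are the conventional 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent divisions.

```python
from bisect import bisect_left

# Cumulative percentile bounds for stanines 1 through 8; above 96 is stanine 9.
STANINE_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]


def percentile_rank(score, all_scores):
    """Percent of scores below `score`, counting ties as half (an assumption)."""
    below = sum(s < score for s in all_scores)
    ties = sum(s == score for s in all_scores)
    return 100.0 * (below + 0.5 * ties) / len(all_scores)


def stanine(percentile):
    """Map a percentile rank to a stanine from 1 to 9.

    A percentile exactly on a cut-off is placed in the lower stanine
    (an assumption; conventions differ).
    """
    return bisect_left(STANINE_CUTOFFS, percentile) + 1


# Hypothetical raw scores for one class on one test.
scores = {"Pupil A": 12, "Pupil B": 18, "Pupil C": 18, "Pupil D": 25, "Pupil E": 31}
all_scores = list(scores.values())

for pupil, raw in scores.items():
    pr = percentile_rank(raw, all_scores)
    print(f"{pupil}: score {raw}, percentile {pr:.0f}, stanine {stanine(pr)}")
```

With a class this small the figures are only illustrative; percentiles and stanines in an actual report would normally be read from tables built on a much larger norming group.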
The Administrative Report will include a summary of the comparison study results. The effects of the Hartman Reading Program in comparison to the standard program will be summarized and interpretations will be given. The Technical Report will include:
--- Detailed description of data-collecting methods and procedures.
--- Detailed description of procedures used in data analysis throughout the project.
--- Summary tables as presented in the Administrative Report.
--- Item analysis of all tests used in project.
--- Norms on all tests used in project.
The Administrative Report and Technical Report will be reviewed by a panel of teachers, administrators, State Department of Education personnel, and university educators to determine the accuracy, fairness, and impartiality of the report. Reports will be revised on the basis of those reviews, and, if consensus is not reached, an addendum giving the opposing interpretations will be attached.
Description of Program and Comparison Treatment
Program groups will receive reading instruction as described in the Hartman Reading Program Guide for Instruction. The Guide gives a detailed account of materials to be used, involvement of parents, sequencing of concepts, and time required for each activity. The Guide also provides the philosophical underpinning of the program, general program objectives, and settings in which the program should be used. Because the Guide is readily available, the program description is not repeated in this design. The comparison group will receive instruction in the usual curriculum offered in the five boroughs. Because the same curriculum is used in each of the boroughs, no further standardization of treatment will be required. A detailed description of the standard curriculum and its implementation is provided in the Curriculum Guide.
Testing Instruments
Tests were chosen to measure important reading skills being taught in the reading programs of the boroughs. These skills encompass listening and writing as well as more typical reading skills. In addition, a test of self-esteem is included. The tests chosen--the Sequential Tests of Educational Progress, the Multicultural Reading Series, and the Self-Observation Scale--are described on the following pages.
The Sequential Tests of Educational Progress (STEP) are achievement-oriented tests. These instruments measure the broad outcomes of general education, focusing on the ability to solve new problems on the basis of information learned as opposed to ability to handle only "lesson material." The STEP instruments provide for continuous measurement of skills over nearly all of the years of general education; therefore, they measure more of the cumulative effect of instruction.
The STEP Listening tests were designed to measure a student's ability to understand, interpret, apply, and evaluate what he listens to. The listening skills are broken down into sub-abilities that are classified as follows: plain-sense comprehension, interpretation, evaluation, and application.
The STEP Listening tests include typical examples of what might actually be said to students in a school situation. Each test includes materials of the following types: direct and simple explanation, exposition, narration, argument and persuasion, and aesthetic material (both poetry and prose).
These tests are available for grade four to college sophomore level. They are subdivided into four levels of difficulty to provide for a wide range of abilities.
STEP Listening test interpretation begins with a score that is translated into percentiles through the use of normed tables. The publisher also provides norms from a nationwide sample of students at the same educational level. Directions for constructing local STEP norms are provided.
The STEP Writing test measures ability to think critically in writing, organize materials, choose appropriate materials to write effectively, and use appropriate, conventional punctuation and grammar.
The materials chosen were those from actual student writing excerpted from letters, newspapers, answers to test questions, reports, stories, notes, outlines, questionnaires, and directions.
The STEP Writing test is based on the same criteria as the listening test. Norms were formulated in the manner described in the listening section.
The tests of reading in the Multicultural Reading Series are designed to measure both vocabulary and comprehension. At grade levels beyond primary one, comprehension is measured by two subtests: speed of comprehension, and level of comprehension.
Scores on the tests of reading may be used not only as measures of achievement in reading itself, but also as bases for estimating ability to achieve. In grouping children and adjusting instruction to individual differences, a measure of reading ability is often useful as a measure of mental ability. After a child has learned to read, the use of both measures is much better than the use of either one alone.
The test was constructed by the Testing Research Associates (1962) especially for multicultural student populations. Administration time varies from 30 to 50 minutes. Given specific instructions, a teacher may administer the test successfully.
The technical report of the series presents an average parallel test reliability of .87 and an average correlation of .78 with the STEP; this indicates a relatively high concurrent validity.
The Self--Observation Scales (SOS) is a direct, self--report, group--administered instrument comprising 45 items (Forms A and B) designed to measure five dimensions of children's affective behavior: self--acceptance, social maturity, school affiliation, self--security, and achievement motivation. The SOS has been translated into various languages including Spanish, Italian, Chinese, Greek, Korean, Japanese, Tagalog, and Arabic.
The Technical Bulletin (No. 1) for the SOS reports the following split--half reliability values (N=4144):
| | Self--Acceptance | Social Maturity | School Affiliation | Self--Security | Achievement Motivation |
| Form A | .75 | .77 | .76 | .81 | Not Available (NA) |
| Form B | .79 | .79 | .79 | .81 | NA |
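The split-half coefficients above come from correlating scores on two halves of a scale and then adjusting the result to the full scale length. A minimal sketch of that computation follows, assuming an odd/even item split and the Spearman-Brown adjustment; the Technical Bulletin's actual procedure is not described in this design, and the function names and any sample responses are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation of two equal-length score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one row per student, one 0/1 (or numeric) score per item.

    Assumption: the halves are formed from odd- and even-numbered items.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson_r(odd_half, even_half))
```

A reviewer applying the Reliability criterion could use a calculation of this kind to check locally collected data against the values the publisher reports.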
Intersubscale correlations are reported as follows (N=4144):
| | Self--Acceptance | Social Maturity | School Affiliation | Self--Security | Achievement Motivation |
| Self--Acceptance | -- | .06 | .48 | .18 | NA |
| Social Maturity | -- | -- | -- | .58 | NA |
| School Affiliation | -- | -- | -- | .36 | NA |
| Self--Security | -- | -- | -- | -- | NA |
Content validity is assured by publishers at the Institute for Development of Educational Auditing.
The validation and norming sample includes students from 150 schools nationwide. In drawing the sample, particular attention was paid to the social, geographic, and socioeconomic characteristics of the participating schools. The norm group was composed of 9,030 students at K--3 levels.
According to the publishers, "The SOS differs from other similar instruments in (a) the extensive validation study that has accompanied the national norming effort, (b) the emphasis on the healthy and positive, rather than pathological and negative dimensions of children's affective behavior, and (c) the practical decision-making orientation rather than a research, theoretical orientation."
Other Data Collection Forms
Data about the participation of teachers and staff members in inservice programs will be collected from the records of inservice instructors. The satisfaction of teachers with the training will be measured using the Training Satisfaction Questionnaire (TSQ). The TSQ has been used frequently in the boroughs. It consists of 20 questions about the training, and has adequate reliability (KR-20 coefficient = .83) for this type of questionnaire.
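The KR-20 coefficient cited for the TSQ is the Kuder-Richardson Formula 20, which applies to items scored 0 or 1. The sketch below shows the standard formula; it assumes dichotomously scored items and the population variance of total scores, and the function name and data layout are hypothetical rather than taken from the TSQ documentation.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for dichotomous (0/1) items.

    responses: one row per respondent, one 0/1 entry per item.
    KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance of total scores)
    Assumes at least two items and total scores that are not all equal.
    """
    n = len(responses)
    k = len(responses[0])

    # Variance of respondents' total scores (population form, divide by n).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n

    # Sum over items of p * q, where p is the proportion scoring 1.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        pq_sum += p * (1 - p)

    return (k / (k - 1)) * (1 - pq_sum / var_total)
```

For a 20-item questionnaire such as the TSQ, `responses` would hold 20 entries per teacher, provided the items can reasonably be scored as 0 or 1.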
Participation of parents in classroom activities will be determined through the use of a form to be filled out by teachers and a questionnaire to be sent to parents. Information from these two instruments will be cross-checked and discrepancies resolved by the evaluation team with follow-up correspondence.
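The cross-check described above can be sketched as a comparison of the two sources, parent by parent. The example assumes each instrument yields a count of classroom visits keyed by a parent identifier; the identifiers, field meanings, and function name are hypothetical, since the design does not specify the form layouts.

```python
def find_discrepancies(teacher_form, parent_questionnaire):
    """Return {parent_id: (teacher_value, parent_value)} for every mismatch.

    A parent who appears in only one source is flagged with None for the
    missing value, so the evaluation team can follow up by correspondence.
    """
    flagged = {}
    all_ids = sorted(set(teacher_form) | set(parent_questionnaire))
    for parent_id in all_ids:
        teacher_value = teacher_form.get(parent_id)
        parent_value = parent_questionnaire.get(parent_id)
        if teacher_value != parent_value:
            flagged[parent_id] = (teacher_value, parent_value)
    return flagged


# Hypothetical visit counts reported by each source.
teacher_form = {"P01": 3, "P02": 0, "P03": 2}
parent_questionnaire = {"P01": 3, "P02": 1}
follow_up = find_discrepancies(teacher_form, parent_questionnaire)
```

Only the flagged entries would need follow-up correspondence; reports that agree are accepted as they stand.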
Procedure Clearance Steps
All data collection activities, teacher training workshops, evaluation questionnaires, and mass communication strategies will be submitted to the chief school officer in each borough for approval prior to use. Procedures for implementing any evaluation plans will be determined jointly with the chief school officer.
Evaluation Activities Time Line
September
Select treatment and control groups
Request student names and identification numbers
Deliver test materials to schools
Conduct pretest evaluation inservice
October
Submit completed student I.D. blanks to evaluation unit
Administer pretests
Pick up completed pretests from schools
Visit schools (evaluation team)
November
Administer listening tests
Mail student information blank to schools
Complete and deliver individual Teacher Reports
December
Begin classroom observations
Submit completed student information blank to evaluation unit
January
Monitor experimental--comparison groups and continue classroom observations
Conduct evaluation conference for parents/advisory council members
February
Participate in visits to schools
Classroom observations
March
Continue classroom observations and monitoring of experimental/comparison groups
Continue participation in visits to schools
April
Mail parent/teacher/administrator questionnaires
Conduct post--test inservice
Classroom observations
Questionnaires due in the evaluation unit by the end of the month
May
Deliver post--test materials to schools
Administer post--test
Pick up completed post--tests
June
Complete Technical Report and Administrative Report
July
Use reports for adoption or elimination of the use of the Hartman Reading Program
Use of the Checklist
with the Fictitious Evaluation Design
In this section the fictitious evaluation design is reviewed to demonstrate the use of the checklist in determining the adequacy of an evaluation design. The rationale for each response is provided immediately following each set of questions on the checklist. These elaborations are somewhat longer than would be provided by most users of the checklist.
| A. | Scope: Does the range of information to be provided include all the significant aspects of the program or product being evaluated? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
The criterion of Scope seems to be only partially met in the design. The first of the four questions can be answered with a "Yes." The design does include a description of the Hartman Reading Program, although it is done by referencing the Guide for Instruction for the program (see page 24) (3).
Also, the objectives of the program are given through a series of questions that relate to the general goals of the Planning Committee (see pages 19 and 20). However, there is no plan to evaluate the objectives of the program themselves to determine whether they have any value. This is a serious oversight in the design.
On the last two questions the evaluation design does not fare as well. No provision is made for any unintended effects that might occur from the use of the program. Neither is any information given about the cost of the program. In order for the criterion of Scope to be adequately met, the two types of missing information should be included.
| B. | Relevance: Does the information to be provided adequately serve the evaluation needs of the intended audiences? |
| identified? | Yes | No | ? | NA |
| explained? | Yes | No | ? | NA |
| congruent with the information needs of the intended audiences? | Yes | No | ? | NA |
| allow necessary decisions about the program or product to be made? | Yes | No | ? | NA |
The evaluation design has inadequately met the criterion of Relevance. Primary and secondary audiences were identified (page 21) but were not consulted in forming the evaluative questions. To be responsive to the people who are affected by the program as well as to those who control the program, evaluative questions should be solicited to supplement those projected by the evaluator. The objectives of the evaluation were delineated in a set of questions that followed from projected information needs of the primary audience. Further, although decisions about the program can be made on the basis of the answers to the evaluation questions, there is no assurance that these questions comprise the most important questions to be answered about the program. This assurance can only come from consultation with all important audiences.
| C. | Flexibility: Does the evaluation study allow for new information needs to be met as they arise? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
The evaluation design seems to be reasonably successful regarding the criterion of Flexibility. It seems that the proposed evaluation would be able to accommodate new information needs because several data collection procedures and instruments are to be employed. In general, an evaluation that uses several procedures is more flexible than an evaluation that relies heavily on one or two methods or instruments. Another strength of the design is that there is a set of alternatives for gathering comparative data. Selection of groups for a comparison study is typically an area in which some flexibility is needed.
A weakness regarding the Flexibility criterion is that there is no discussion of the constraints on the study. Nearly all evaluation studies are subject to constraints of various degrees of importance, and they should be explained in the design.
| D. | Feasibility: Can the evaluation be carried out as planned? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
The adequacy of the evaluation design as it relates to the Feasibility criterion is in question. The available resources to conduct the study are not given, and so no judgment can be made about their adequacy. There is no management plan that lists the major tasks, time required to complete tasks, or personnel. Also, there is little evidence that particularly difficult tasks are feasible. Clearly, more information relating to the feasibility of the study is needed.
| II. Criteria Concerning the Adequacy of the Collection and Processing of Information |
| A. | Reliability: Is the information to be collected in a manner such that findings are replicable? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
Adequate information supporting the Reliability criterion seems to be included. The tests and questionnaires to be used in the study are described in adequate detail, and their reliability is shown to be sufficiently high (pages 25ff). In the one instance where low reliability of data may occur--teacher and parent reports of parent involvement--the data are to be cross-checked (page 28).
| B. | Objectivity: Have attempts been made to control for bias in data collection and processing? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
The Objectivity criterion seems to have been met. It is clear from whom each type of data will be collected. Further, there do not seem to be any particular threats to the objectivity of the data, and so no special controls are required. Hence, the "NA" for the second question.
| C. | Representativeness: Do the information collection and processing procedures ensure that the results accurately portray the program or product? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
The Representativeness criterion has not been met satisfactorily in this design. The inadequacies with respect to this criterion are brought to light by the first two questions. First, the validity of the achievement tests is open to question. No information about the validity of the Sequential Tests of Educational Progress for the purpose of evaluating the new program is provided, although such information may well be available. Some validity information is given for the Multicultural Reading Series (page 26). For the Self-Observation Scales, only an ambiguous statement about validity is given (page 27).
| III. Criteria Concerning the Adequacy of the Presentation and Reporting of Information |
| A. | Timeliness: Is the information provided timely enough to be of use to the audiences for the evaluation? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
The evaluation design may meet the criterion of Timeliness. No visible attempt has been made to inquire about the information needs of important audiences. The needs of the audiences should be taken into account and a reporting schedule developed consistent with those needs.
| B. | Pervasiveness: Is information to be provided to all who need it? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
The Pervasiveness criterion is met partly in that the intended audiences for the evaluation are to receive adequate information. However, there are possible unintended audiences that have been largely ignored. The only report to be made available on a broad scale is the Technical Report. Other people who might benefit from information from the evaluation should be considered, and an appropriate report should be written for them. For example, a general summary of the major effects of the Hartman Reading Program would probably be useful information for many superintendents and principals.
| IV. General Criteria |
| A. | Ethical Considerations: Does the intended evaluation study strictly follow accepted ethical standards? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
The criterion of Ethical Considerations does not seem to have been completely met. There is nothing to suggest that the evaluator will engage in any unethical conduct, but neither is there information to suggest that the evaluator has considered all of the ethical problems that can arise during an evaluation study.
One way in which the evaluator has been responsive to potential ethical problems is by requiring that evaluation reports will be approved by a panel of educators before release (page 24). This panel will provide guidance on several ethical issues. However, the evaluator has not considered the two other issues treated by this criterion. The evaluator should provide evidence that he intends to comply with protection of human subjects guidelines as applicable in the study. Also, the evaluator should guarantee that the data collected during the study will not be released to unauthorized personnel or be used inappropriately.
| B. | Protocol: Are appropriate protocol steps planned? |
| | Yes | No | ? | NA |
| | Yes | No | ? | NA |
The evaluator has given adequate consideration to the Protocol criterion in the design. In this case, the evaluator plans to clear virtually everything through the chief school officers (page 28).
Although more specific protocol steps will evolve during the evaluation study, the evaluator has set a procedure to meet initial protocol needs.
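A reviewer who wishes to keep a running record of checklist responses might tabulate them as sketched below. The structure and function names are hypothetical, and the sample entries are limited to responses stated explicitly in the rationales above (Scope questions 1, 3, and 4; Objectivity question 2); the Objectivity question 1 entry is an assumption inferred from the statement that the criterion was met.

```python
from collections import Counter

VALID_ANSWERS = {"Yes", "No", "?", "NA"}


def summarize(responses):
    """responses: {criterion: {question_number: answer}}.

    Returns {criterion: Counter of answers}, rejecting any answer that is
    not one of the four checklist options.
    """
    summary = {}
    for criterion, answers in responses.items():
        for question, answer in answers.items():
            if answer not in VALID_ANSWERS:
                raise ValueError(f"{criterion} question {question}: {answer!r}")
        summary[criterion] = Counter(answers.values())
    return summary


def needs_follow_up(responses):
    """Criteria with at least one 'No' or '?' response, in input order."""
    return [
        criterion
        for criterion, answers in responses.items()
        if any(answer in ("No", "?") for answer in answers.values())
    ]


# Illustrative, partial record; see the lead-in for what is assumed.
responses = {
    "Scope": {1: "Yes", 3: "No", 4: "No"},
    "Objectivity": {1: "Yes", 2: "NA"},
}
weak_criteria = needs_follow_up(responses)
```

A list of this kind gives the reviewer a short agenda of weaknesses to raise with the evaluator before the study begins.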
Summary
As was noted earlier, the fictitious evaluation design of the Hartman Reading Program is neither all good nor all bad. The design has both strengths and weaknesses, and use of the checklist has helped identify them. However, simply using the checklist is not enough. Information about the evaluation design from the checklist should be provided to the evaluator so that weaknesses in the design can be discussed and corrected before the evaluation begins. By so doing, an important step toward producing a helpful evaluation study will have been taken.
A REVIEW OF PREVIOUS WORK AS A BASIS FOR DETERMINING THE ADEQUACY
OF AN EVALUATION DESIGN
Most educators who have ever been involved in evaluation have worried about determining the quality of the evaluation effort. Although implicit standards have long been used in determining the quality of evaluation plans, evaluation specialists have only recently begun to develop an explicit, well-defined basis for determining the adequacy of such designs.
Michael Scriven (1969) first coined the term "meta-evaluation" to refer to the evaluation of evaluation. Since then, several evaluators have proposed standards for determining the quality of evaluation designs.
Many specialists' proposed standards have evolved from their training background or from definitions of evaluation that they have adopted. Consideration of such proposals can help one understand the evolution of the checklist offered in the previous section. Because of the considerable effort that has recently gone into the development of a basis for evaluating evaluation designs, it is important to draw as much usable information as possible from these efforts.
Bases for judging evaluation designs have generally been presented in one of three ways: 1) as guidelines that provide a format for evaluation designs, 2) as essays describing elements of a good evaluation, or 3) as checklists that guide the application of standards to evaluation designs. Examples of each are included in this section.
Guidelines for Evaluation Designs
Worthen and Sanders (1973) suggested the following format for evaluation designs, a set of elements that could be considered common to all evaluation designs:
I. Rationale (Why is this evaluation being done?)
II. Objectives of the Evaluation Study
A. What will be the product(s) of the evaluation study?
B. What audiences will be served by the evaluation study?
III. Description of the Program Being Evaluated
A. Philosophy behind the program
B. Content of the program
C. Objectives of the program, implicit and explicit
D. Program procedures (e.g., strategies, media)
E. Students
F. Community (federal, state, local) and instructional context of program
IV. Evaluation Design
A. Constraints on evaluation design
B. General organizational plan (or model for program evaluation)
C. Evaluative questions
D. Information required to answer the questions
E. Sources of information; methods for collecting information
F. Data collection schedule
G. Techniques for analysis of collected information
H. Standards; bases for judging quality
I. Reporting procedures
J. Proposed budget
V. Description of Final Report
A. Outline of report(s) to be produced by evaluator
B. Usefulness of the products of the study
C. Conscious biases of evaluator that may be inadvertently injected into the final report.
A similar format was suggested by Stake (1969) in the following guide for a final evaluation report:
Section I - Objectives of the Evaluation
A. Audiences to be served by the evaluation
B. Decisions about the program, anticipated
C. Rationale, bias of evaluators
Section II - Specification of the Program
A. Educational philosophy behind the program
B. Subject matter
C. Learning objectives, staff aims
D. Instructional procedures, tactics, media
E. Students
F. Instructional and community setting
G. Standards, bases for judging quality
Section III - Program Outcomes
A. Opportunities, experiences provided
B. Student gains and losses
C. Side effects and bonuses
D. Costs of all kinds
Section IV - Relationships and Indicators
A. Congruences, real and intended
B. Contingencies, causes and effects
C. Trend lines, indicators, comparisons
Section V - Judgments of Worth
A. Value of outcomes
B. Relevance of objectives to needs
C. Usefulness of evaluation information gathered
Essays About Evaluation Quality
Essays on educational evaluation offer general statements about the elements of good evaluation and provide a second source of standards. One such essay, by Worthen (1973), "A Look at the Mosaic of Educational Evaluation and Accountability," covered the following considerations:
1. Conceptual Clarity
Conceptual clarity is an essential feature of any good evaluation plan. By "conceptual clarity" I refer to the evaluator's exhibiting a clear understanding of the particular evaluation he is proposing. Is he planning a formative or summative evaluation? Is it a comparative evaluation design or a single program evaluation? Is the evaluation to be goal--directed, with the design built around lists of evaluative questions generated independently of the goals? Answers to these questions should be apparent in any good evaluation plan; for without clarity on these points, proper evaluation could occur only by chance.
2. Characterization of Program
No evaluation is complete without a thorough, detailed description of the program or phenomenon being evaluated. Without such characterization, judgments may be drawn about a program which never really existed. For example, the concept of team teaching has fared poorly in several evaluations, resulting in a general impression that team teaching is ineffective. Closer inspection shows that the methods frequently labeled "team teaching" provide almost no real opportunities for staffs to plan together or work together in direct instruction. Obviously, a better description of the phenomenon would have avoided these misinterpretations completely. One simply cannot evaluate adequately that which he cannot describe accurately.
3. Recognition and Representation of Legitimate Audiences
Any evaluation will be adequate only to the extent to which it provides for obtaining input from and reporting to all legitimate evaluation audiences. An evaluation of a school program which answers only the questions of the school staff and ignores questions of parents, children and community groups is inadequate. Each legitimate audience must be identified and the objectives or evaluative questions of that audience considered in designing a plan for data collection. Obviously, some audiences will be more significant than others and some weighting of their input might be necessary. Correspondingly, the evaluation plan should provide for receipt of appropriate evaluation information by each audience which has a potential interest in the program.
4. Sensitivity to Political Problems in Evaluation
Many a good evaluation, unimpeachable in all of its technical details, has failed because of its political naivete. It is pointless to promise to collect sensitive data--e.g., principals' ratings of teachers--without first obtaining permission from the office or individual who controls those data. Procedures governing access to data and data sources, and safeguards against misuse of evaluation data must be agreed upon early in the project. Steps must be taken to guarantee that program staff have opportunities to correct factual errors in evaluation reports without compromising the evaluation and the more explicitly they are dealt with, the more likely the evaluation is to survive political pressures.
5. Specification of Information Needs and Sources
Good evaluators tend to develop and follow a blueprint which tells them precisely what information they must collect and through what sources that information is available. At the very least, they know how (as Scriven puts it) to lay snares at critical points in the game trails. Conversely, the novice evaluator goes about randomly turning over stones or beating the brush to see what he can find. No evaluation can depend on a random, scattered "here a little, there a little" approach to collecting data. An adequate evaluation plan specifies at the outset the information which must be collected. If the evaluation is goal-directed, the plan will specify information that will help to determine whether the objectives were attained. If the evaluation is built around evaluative questions (of the "What would you need to know to decide whether the program was a success or failure?" variety), the evaluation plan should specify information which, when collected, will answer those questions. And in every case, specifying needed information leads logically to identification of the sources from which that information can be obtained. Failure to attend to these seemingly pedestrian but truly critical steps is one of the greatest single reasons that many evaluations produce little useful information.
6. Comprehensiveness/Inclusiveness
This category is really an elaboration of the previous one. No evaluation can hope to collect all of the relevant data--nor would it be desirable to do so, since there will always be inconsequential and trivial data not worth the bother to collect. Collecting too much data is seldom the concern, however. The greater problem is collecting enough data--or more precisely, collecting data on enough important variables to be certain one has included in the evaluation all the major considerations which are relevant. A good evaluation includes all of the main effects, but also includes provisions for remaining alert to unanticipated side effects. A good comparative evaluation doesn't stop with comparing the experimental arithmetic program with a control group which receives no arithmetic instruction. It goes on to identify the critical competitors--SMSG math, Cuisenaire Rods, and so forth--and compares their new program with those for which costs are roughly comparable. In short, the weak evaluation is almost always characterized by a narrow range of variables and omission of several important variables. The wider the range and the more important the variables included in the evaluation, the better it generally is.
7. Technical Adequacy
More evaluations founder on this shoal than on almost any other, and this is due to the scarcity of educational evaluators who are even marginally competent in technical areas. Good evaluations are dependent on construction or selection of adequate instruments, the development of adequate sampling plans, and the correct choice and application of techniques for data reduction and analysis. Volumes have been written on educational measurement, sampling, and statistics and it would be pointless to try to review that knowledge here. Suffice it to say that competence in these areas is essential to most evaluations. Without knowledge and control of these tools of his trade, the evaluator has little hope of producing evaluation information which meets scientific criteria of validity, reliability and objectivity.
8. Consideration of Program Costs
Educators are not econometricians and should not be expected to be skilled in identifying all the financial, human or time costs associated with programs they operate. That bit of leniency cannot be extended to the evaluator, however, for it is his job to bring these factors to the attention of teachers and administrators who are responsible for the programs. Educators are often faulted for choosing the more expensive of two equally effective programs, just because the expensive one is packaged more attractively or has been more widely advertised. The real fault lies with the evaluations of those programs which fail to focus on cost factors as well as on other variables. As any insightful administrator knows, costs are not irrelevant, and it is important for him to know how much program X will accomplish and at what cost so he may know what he is gaining or giving up in looking at other options which vary in both cost and effectiveness.
9. Explicit Standards/Criteria
It is always a bit disconcerting to me to read through an evaluation report and be unable to find anywhere a statement of the criteria or standards which were used to determine the program's success or failure. The measurements and observations taken in an evaluation cannot be translated into judgments of worth without standards or criteria. Is an in-service program for teachers successful if 75% of the teachers attend 75% of the meetings? That all depends on the standard that is set for the program. What about a 60% attendance rate in a high school English class--is that good or bad? Again it depends on the standard. If it is a regular English class, with a standard of 95%, 60% looks pretty bad. But in an English class for rehabilitated dropouts who work part-time to support their parents, the standard might be 50% and the attendance rate of 60% might be quite acceptable. Every good evaluation will include a statement of standards and criteria.
10. Judgments and/or Recommendations
The only reason for insisting on explicit standards or criteria is that they are the stuff of which judgments and recommendations are made, and these judgments and recommendations are the sine qua non of evaluation. An evaluator's responsibility does not end with the collection, analysis, and reporting of data. The data do not speak for themselves. The evaluator who knows those data well is in the best position to apply standards for judging effectiveness. Making judgments and recommendations is an essential part of the evaluator's job. An evaluation without judgments is as much an indictment of its author's sophistication as one with recommendations that are not based on the data.
11. Reports Tailored to Audiences
I argued a few minutes ago that there are multiple audiences for most evaluations and these audiences have different informational needs. For example, when you complete an evaluation, your colleagues in evaluation will be interested in a complete, detailed report of your data collection procedures, analysis techniques, and the like. Not so for the school board, or the PTA or the little old lady in tennis sneakers who heads the local taxpayer group. These audiences do not share the evaluator's grasp of technical details or his interest in test reliability and validity or the appropriate choice of an error term in a randomized blocks design. The evaluator will have to tailor reports for these groups so that they depend on non-technical language, and he must avoid over-use of tabular presentation of data analyses. A typical evaluation might produce one omnibus technical evaluation report which self-consciously includes all the details and one or more non-technical evaluation report(s) aimed at the important audience(s).
Another notion should be inserted here as well--that of interim or even continual reporting of evaluation findings. Timeliness is an important concern in evaluation. Information that is presented too late to affect the decision for which it is relevant is useless. Good evaluations will not depend solely on the printed word, but will include a variety of report formats--including "hot-line" telephone reporting--so the information is reported whenever it is needed to make a particular decision.
Other general standards that have been widely used include the following, developed by Stufflebeam et al. (1971):
1. Internal validity. Does the evaluation design provide the information it is intended to provide? The results of the evaluation study should present an accurate and unequivocal representation of the object being evaluated.
2. External validity. To what extent are the results of the study generalizable across time, geographical environment and human involvement? In many small evaluation studies, the concept of external validity is irrelevant since the evaluator is interested in collecting and interpreting information about one specific program at one point in time. However, the concept may be quite important in large-scale evaluation studies where sampling is used and findings must be generalized back to the total population.
3. Reliability. How accurate and consistent is the information that is collected? The evaluator should be quite concerned about the adequacy of his measures since his results can only be as good as the information on which they are based.
4. Objectivity. How public is the information collected by the evaluator? The evaluator should strive to collect information and make judgments in such a way that the same interpretations and judgments would be made by any intelligent, rational person evaluating the program.
5. Relevance. How closely do the data relate to the objectives of the evaluation study? Defining objectives for an evaluation study enables the evaluator to check himself on the relevance of his activities.
6. Importance. Given a set of constraints on the design of an evaluation study, what priorities are placed on the information to be collected or program components to be evaluated? It is often tempting to study one relevant aspect of a program in depth and to collect much information which may subsequently prove to be less important at the conclusion of the study than less detailed information about another aspect might have been. It is the responsibility of the evaluator to set priorities on the data to be collected.
7. Scope. How comprehensive is the design of the evaluation study? There are a wide variety of considerations to explore, as emphasized in several papers presented in the previous chapter. The evaluator must consciously avoid the possibility of developing "tunnel vision" by taking a holistic approach to program evaluation.
8. Credibility. Is the evaluator believed by his audiences? Are his audiences predisposed to act on his recommendations? The evaluator-client relationship is an important one if the evaluator wants his efforts to have some impact on the program he is evaluating.
9. Timeliness. Will evaluation reports be available when they are needed? Many evaluators have missed the chance to influence action because they reported too much, too late. When decisions affecting a program are being made, any reliable information is better than none. The provision of interim, often informal, reports will help to avoid
10. Pervasiveness. How widely are the results of the evaluation study disseminated? It is true that, in many cases, only one audience needs to be addressed. However, the evaluator is responsible to provide the results of his study to all individuals or groups who should know about the results.
11. Efficiency. What are the cost/benefits of the study? Have resources been wasted when that waste could have been avoided? Operating under the constraints imposed on most evaluation studies, the evaluator is responsible for making the best possible use of material and human resources available to him.
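As an illustration of how these eleven criteria might be applied to a design in practice, the following is a minimal sketch. The criterion names are taken from the list above; the function `unaddressed_criteria`, the adequate/inadequate rating scheme, and the sample ratings are assumptions made here for illustration and are not part of the Stufflebeam et al. (1971) formulation.

```python
# Criterion names from the list above, in the order presented.
CRITERIA = [
    "internal validity", "external validity", "reliability", "objectivity",
    "relevance", "importance", "scope", "credibility", "timeliness",
    "pervasiveness", "efficiency",
]


def unaddressed_criteria(ratings):
    """Return criteria that were rated inadequate or not rated at all.

    `ratings` maps a criterion name to True (adequate) or False
    (inadequate). This two-valued scheme is a hypothetical simplification.
    """
    unknown = set(ratings) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unrecognized criteria: {sorted(unknown)}")
    return [name for name in CRITERIA if not ratings.get(name, False)]


# Hypothetical review of one design: timeliness judged inadequate,
# efficiency never considered by the reviewer.
sample_ratings = {name: True for name in CRITERIA}
sample_ratings["timeliness"] = False
del sample_ratings["efficiency"]

gaps = unaddressed_criteria(sample_ratings)
print(gaps)
```

Treating an unrated criterion the same as an inadequate one is a deliberate choice in this sketch: a design review that silently skips a criterion has not shown the design to be adequate on it.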
Checklists that Guide the Application of Standards to Evaluation Designs
Checklists that guide the application of standards to evaluation designs or reports are a third source of standards. These checklists cover many general concerns; the most useful checklists also include highly specific, comprehensive standards that can assist in determining the quality and completeness of evaluation designs.
Each existing checklist seems unique in form, content, and purpose; nevertheless, many share common characteristics. Generally, checklists for judging evaluation designs include considerations of the scientific or technical adequacy of the evaluation, the practicality and cost efficiency of the design, the usefulness of the data to be collected, and the responsiveness of the design to legal and ethical issues.
Four checklists for judging evaluation designs are described below. The first of the checklists, that written by Stake (1970), contains five general areas in which evaluation designs are to be judged: 1) the evaluation itself, 2) specifications of the program being evaluated, 3) program outcomes, 4) relationships and indicators, and 5) the program's overall worth. Each general area, in turn, covers specific considerations which, when relevant, are to be judged on their individual adequacy.
The checklist by Bracht (1973) includes six areas on which evaluation designs should be judged: 1) communication, 2) importance of the evaluation, 3) design for making judgments, 4) relationships and indicators, and 5) the program's overall worth. Each general area, in turn, covers specific considerations which, when relevant, are to be judged on their individual adequacy.
Stufflebeam's (1974) administrative checklist covers six aspects of the design: 1) conceptualization of the evaluation, 2) sociopolitical factors, 3) contractual/legal arrangements, 4) the technical design, 5) the management plan, and 6) moral/ethical/utility questions. Beyond questioning the adequacy of certain aspects of evaluation design, Stufflebeam seeks specific information about the context and implementation of the evaluation study.
The final checklist, compiled by Smith and Murray (1974), includes a number of questions from other checklists. Smith and Murray address three areas of evaluation design: 1) content descriptions, 2) evaluation activities/results, and 3) document characteristics. Each of these major areas is further divided into two subareas with appropriate exemplary questions designed to determine the adequacy of those subareas.
Guidelines for evaluating school practices provide another source of evaluation design standards. Directions for program audits produced by the federal government and directions for evaluation audits produced by auditing agencies contain examples of such criteria. Such guidelines are also available from the National Study of School Evaluation (NSSE) Evaluative Criteria for secondary schools, middle schools, elementary schools, and multicultural programs. These guidelines, used by accreditation teams throughout the country in evaluating school programs, contain a comprehensive list of school characteristics that may be useful in checking the completeness of a design for evaluating a school program.
Summary
The review provided in this section demonstrates the extensiveness of the work that has been done by educators in producing criteria for judging evaluation designs and reports. Because of this considerable effort, the practice of judging evaluation designs and reports is becoming more and more common among educators who are involved with producing or using evaluation studies on a daily basis. And, while there are many differences among the various sets of criteria presented in this section, many common threads of thought can be found. The criteria presented earlier in this paper represent an attempt to reflect those common elements.
REFERENCES
Astin, A. W., & Panos, R. J. The evaluation of education programs. In R. L. Thorndike (Ed.), Educational Measurement. Washington, D.C.: American Council on Education, 1971.
Bracht, G. H. Evaluation of the evaluation proposal. Unpublished manuscript, 1974.
National Study of Secondary School Evaluation. Evaluative Criteria. Washington, D.C.: Author, 1969.
Scriven, M. An introduction to meta-evaluation. Educational Products Report, 1969, 2, 36-38.
Smith, N. L., & Murray, S. J. Evaluation review checklist. Unpublished manuscript, 1974.
Stake, R. E. Evaluation design, instrumentation, data collection and analysis of data. Educational evaluation. Columbus, Ohio: State Superintendent of Public Instruction, 1969.
Stake, R. E. A checklist for rating an evaluation report. Unpublished manuscript, 1970.
Stufflebeam, D. L., Foley, W. J., Gephart, W. J., Guba, E. G., Hammond, R. L., Merriman, H. O., & Provus, M. Educational evaluation and decision making. Itasca, Illinois: Peacock, 1971.
Stufflebeam, D. L. An administrative checklist for reviewing evaluation plans. Unpublished manuscript, 1974.
Worthen, B. R. A look at the mosaic of educational evaluation and accountability. Research, Evaluation, and Development Paper Series. Portland, Oregon: Northwest Regional Educational Laboratory.
Worthen, B. R., & Sanders, J. R. Educational evaluation: Theory and practice. Belmont, California: Wadsworth, 1973.
Wright, W. J., & Worthen, B. R. Standards and procedures for development and implementation of an evaluation contract. Portland, Oregon: Northwest Regional Educational Laboratory, 1975.
1. The authors wish to acknowledge the work of a committee on meta-evaluation at the Northwest Regional Educational Laboratory that was responsible for the development of this checklist. Members of the committee included Richard Arends, Harry Fehrenbacher, Jerry Fletcher, James Sanders, and Nick Smith.