Using the Checklist

2. TECHNICAL ASPECTS
3. MANAGEMENT
4. INTENDED AND UNINTENDED EFFECTS
5. COSTS

1. Goals and Purposes

Kinds of Goals

Below are outlined possible purposes for an assessment program, with illustrations of ways in which the assessment design should be modified to serve each purpose.

Pupil diagnosis is the most molecular purpose of testing. It means
at least finding out what an individual child's strengths and weaknesses
are. In a more specialized sense, it means finding out what kinds of
errors are made so as to prescribe remediation. For this purpose, classroom
teachers should not give tests to find out what they already know. For
example, suppose we have an old-fashioned reading test of 50 items.
If a child's teacher can predict accurately that the child will miss
20 and get 20 correct but is unsure about the remaining 10, the best
use of assessment time (for this purpose) is to administer those 10
items--or better still 20 or 30 items of the same type. In this way,
the teacher will acquire new information about what this child is able
to do and may even have enough information to notice consistent errors.
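The 50-item example can be sketched in code. This is only an illustration under stated assumptions: the teacher's expectations are expressed as hypothetical probabilities of a correct answer, and binary entropy is used as one convenient measure of how uncertain the teacher is about each item. The function names and the probabilities are invented for the sketch and do not come from any assessment program.

```python
import math

def binary_entropy(p):
    """Uncertainty (in bits) about a right/wrong outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def most_informative_items(predicted, k):
    """Return the indices of the k items whose outcome is least predictable."""
    ranked = sorted(
        range(len(predicted)),
        key=lambda i: binary_entropy(predicted[i]),
        reverse=True,
    )
    return sorted(ranked[:k])

# Hypothetical teacher predictions for a 50-item reading test:
# 20 items the child will almost surely miss, 20 the child will almost
# surely get right, and 10 the teacher is unsure about.
predicted = [0.05] * 20 + [0.95] * 20 + [0.5] * 10

chosen = most_informative_items(predicted, 10)
print(chosen)  # the ten items the teacher could not predict
```

Items the teacher can already predict carry little new information, so the selection falls on the ten uncertain items. Administering more items of that same type, as the text suggests, would follow the same logic.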
Pupil certification is another general purpose for testing individual students. Tests used for this purpose must necessarily be the same for every individual and may produce results that are redundant to the classroom teacher. The redundancy is necessary to obtain the external verification of achievement. The new California Proficiency Test, whereby 16- and 17-year-olds may obtain high school diplomas early, is an example of an "assessment" with this purpose. The New York State Regents' Examinations and the state-administered bar exams are also examples.

Personnel evaluation is probably the most distrusted purpose for conducting a large-scale assessment. Judging teachers, principals, or superintendents on the basis of pupil test scores is invalid when there has been no correction for initial differences in pupil achievement, motivation, or ability to learn. Statistical adjustments for pupil differences are imperfect. While such adjustments may be acceptable for research purposes or for program evaluation, their potential for error must be taken into account when making individual decisions of merit. For a succinct refutation of the use of test scores to evaluate teachers, see Glass (1974).

When student performance data are to be used for personnel evaluation, class differences unrelated to the quality of instruction must be controlled either by random assignment of students to teachers or by statistically adjusting for differences in student ability. Even after this is done, confidence intervals should be constructed to reflect the accuracy of the statistical adjustments. One way to protect against over-interpreting chance differences is to use results from more than one year. If a teacher's student achievement (adjusted for ability) is very low year after year, and other teachers in the same circumstances have much greater achievement, the evidence is compelling that the teacher is not effective.
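A minimal sketch of the adjust-then-interval idea follows. It assumes a simple least-squares adjustment of class mean outcome on class mean prior achievement, and it treats a teacher's yearly adjusted scores as independent draws when forming a t-based interval. Both are simplifying assumptions made for illustration, and all names and numbers are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, indexed by degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def adjusted_scores(prior, outcome):
    """Residuals from a least-squares line of outcome on prior (one value per class)."""
    mean_x = statistics.fmean(prior)
    mean_y = statistics.fmean(outcome)
    sxx = sum((x - mean_x) ** 2 for x in prior)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(prior, outcome))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(prior, outcome)]

def mean_with_interval(yearly_residuals):
    """Mean adjusted score over 2 to 6 years with an approximate 95% interval."""
    n = len(yearly_residuals)
    mean = statistics.fmean(yearly_residuals)
    std_err = statistics.stdev(yearly_residuals) / math.sqrt(n)
    half_width = T_975[n - 1] * std_err
    return mean, mean - half_width, mean + half_width

# One year of hypothetical class means for four teachers.
one_year = adjusted_scores([40, 50, 60, 70], [45, 52, 66, 71])

# Hypothetical adjusted scores for two teachers across several years.
steady_low = mean_with_interval([-6.0, -4.5, -5.5, -7.0])
erratic = mean_with_interval([-6.0, 5.0, -1.0])

print(steady_low)  # whole interval lies below zero
print(erratic)     # interval spans zero
```

In this sketch only the first teacher's interval excludes zero. The second teacher's single bad year is indistinguishable from chance, which is the over-interpretation the text warns against.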
Even after proper adjustment, it should be remembered that test scores reflect attainment in only a few of the many areas for which teachers are responsible.

Program evaluation is an appropriate use of assessment results. It should be noted, however, that the assessment results are not themselves the evaluation but may be used as data in an evaluation. Using assessment results to discriminate among programs requires that test content be program-fair and that tests have sufficient sensitivity to find important differences when they exist. Program evaluation may be conducted at the national, state, district, or school building level. While there are many parallels in the requirements for conducting program evaluation at these different levels, they ought to be considered as separate purposes when judging adequacy or cost effectiveness.

Program evaluation includes both summative and formative purposes. For example, evaluation using assessment results may be used to select the best of three math curricula. It is more likely, however, that it will be used to identify strengths and weaknesses for use in modifying existing programs. Curriculum experts at both the state and local level may have the need for the most detailed reporting of assessment results. Developers of a reading program, for example, will need performance data which distinguish between literal and interpretive comprehension and between phonetic and structural analysis skills.

Resource allocation is an often-stated purpose of assessment. Whether assessment results are actually used for this purpose or whether they should be is a tangled problem. Some state and federal funds are awarded to needy districts which have been identified on the basis of low test scores. Some politicians have objected that this system "rewards failure." As an alternative, Michigan adopted its Chapter Three incentives program.
In the first year, funds were to be allocated on the basis of need, but in the succeeding years, money would be distributed on the basis of pupil progress. Districts with the highest percentage of children who reached 75 percent of the month-for-month criterion would get the highest funding. The plan was to reward success instead of failure. Murphy and Cohen (1974) point out the fallacies in the plan and document the fact that it has never been implemented. House et al. (1974) objected to it because it was demeaning to educators. I object to it because it ignores the political and moral justification for providing these funds in the first place. In order to "punish" the staff because of low pupil gains, funds are withheld; students who need the most help get the least assistance.

It is more defensible that assessment results would be used to establish state and national priorities: more dollars for education, fewer for highways; or more dollars for math, fewer for science. Some differential allocation to districts may be possible if the above dilemma is resolved. In addition to allocating money, states may also distribute curriculum consultants on the basis of test results.

Accountability is implied in some of the other purposes (pupil certification and program evaluation), but it is also an end in itself. One connotation of accountability is the publicness of information. Satisfying the public's right to know may be a purpose of assessment. This purpose requires that assessment results be both comprehensive and brief. The public and representatives of the public are impatient with mountains of data. They want simple answers.

Research
may be a purpose of assessment. It is different from program evaluation
in that it not only seeks the best instructional methods but tries to
discern why one
method is better than another. Research is often ridiculed as an assessment
purpose because it is not practical. It may be dysfunctional, but for
a different reason. Research may be an awkward purpose for assessment
because it requires tight controls that are not possible on so grand
a scale.

Criteria for Judging Goals

Importance. The question here is whether the assessment is attempting to accomplish "good goals." One could hardly quibble with the importance of such goals as "improving education" or even "firing incompetent teachers," though the latter would have to be faulted on feasibility and technical criteria. If assessments are hardly likely to aspire to bad goals, one might ask why this criterion is included. Well, relative judgments will have to be made in the ultimate analysis of cost and benefit. If a goal is good but not earthshaking and costs a lot, it will not justify the assessment.

Uniqueness. Importance and uniqueness are the two ingredients in Scriven's Need checkpoint. I have separated these criteria because many goals are likely to pass the first and fail the second. The distinction prevents importance from camouflaging redundancy. If an identified purpose is already served by other programs, it is less urgent. State assessment goals should receive low marks if they duplicate those of local district programs. NAEP ("An evaluation of the NAEP," 1975) received high marks for the goals which it served uniquely, adult assessment and administration performance exercises. Duplication of assessment efforts not only reduces the social need; it may also have direct deleterious effects in overtesting pupils.

Feasibility. Some assessment programs are guilty of inflated advertising. Many have promised that assessment will not only locate educational strengths and weaknesses, but will find solutions to ameliorate deficiencies. This criterion may sound like a judgment of technical adequacy. But it is not a matter of fine-tuning the assessment design. Some intended purposes are beyond the scope of any assessment and should never have been promised. For example, finding and disseminating exemplary programs requires much more than assessment data and would involve product development and research.

Scope.
Are the assessment goals too broad or too narrow? In the simplest sense, this may mean asking whether the assessment ought to include science and social studies as well as reading and math. More significantly, however, this criterion pertains to the question of how many purposes can be combined before they begin to detract from one another.

Curriculum experts and program evaluators need comprehensive information on content objectives. The California Reading Test for second and third graders includes almost 250 items (although each child answers only about 30 questions). Classroom teachers need information for every pupil. If they select different tests for each child, the results cannot be aggregated for accountability purposes or to answer the questions of the curriculum experts. If the comprehensive test were administered to every child, it would be extremely time consuming, would give the classroom teacher a great deal of redundant data along with the good stuff, and would have much more statistical power than is desired by either the curriculum experts or program evaluators. This dilemma is exemplified by Murphy and Cohen's (1974) distress at the development of a new test in Michigan:
Some purposes will have to be excluded so that those remaining can be served well.

2. Technical Aspects

Tests

Criterion-referenced vs. norm-referenced. This consideration might be relegated to a discussion of test content or to the question of interpretation. The subject emerges so frequently as a topic of debate, however, that it deserves special attention. Criterion-referenced tests are preferred by many because they yield results that tell what individual children can do, rather than rank-ordering the test takers. Unfortunately, criterion-referenced testing may produce voluminous data that are ill suited for some assessment purposes. Classroom teachers and curriculum makers are better served by specific information on 100 objectives, but state legislators require more condensed information.

Norm-referenced tests are probably preferable for placement purposes when the percentage of students who can be placed in certain situations is fixed. This is true of college admission. In other circumstances, knowing absolute, rather than relative, performance levels is more diagnostic both for individuals and programs. Most criterion-referenced tests are, however, missing the "criteria" necessary for aggregate, social-policy interpretation. Most so-called criterion-referenced tests are content-referenced or objective-referenced but are missing the implied standards. Elsewhere, I have discussed the difficulties inherent in setting standards and have proposed some mechanisms for establishing criteria (Shepard, 1975). Without standards or comparative norms to assist in interpretation, the assessment results will be judged useless for certain audiences.

Content validity. The largest issues in content validity overlap with the question of scope in goal-evaluation. The content of the assessment should match the educational goals. There are tradeoffs allowed by other categories in the checklist to justify assessing in some goal areas and not others. It is much less defensible, however, to narrow the definition of those subjects selected for assessment.
That the tests measure what they are supposed to measure is assured in part by proper development procedures. Deciding what to include or exclude in the definition of an assessment topic should not be left to closeted groups of subject matter experts. Parents and taxpayers also have something to say about what language or social studies tests ought to include. Greenbaum (in press) criticized NAEP for allowing the political clout of the school audience to disenfranchise lay representatives on objective-writing committees.

More detailed specification of test content is accomplished by writing objectives and items. Items should be reviewed to verify that they measure the intended objective and that, taken together, they provide a balanced representation of the content universe. There is a tendency to create many "easy to write" types of items and fewer addressing the objectives that are hard to measure. Someone has to keep an eye on the total picture.

Cultural bias. A test is culturally biased if certain children who have a skill are unable to demonstrate it because of extraneous elements clouding test questions. Webster, Millman, and Gordon (1974) helped clarify the distinctions between those aspects of culture that are relevant to achievement and those that are irrelevant and indicative of bias: "A test of ability to read road signs printed in English would have a strong and legitimate culture-specific content, while a test of reading comprehension might be biased by a culture-specific set of passages that comprise that test" (p. 16).

Tests are harmful if they are normed on majority groups (or groups in which the majority dominated) and are used to make irreversible decisions about individual children. Using test results to group children for instructional purposes is not harmful, however, if there is easy and frequent movement in and out of groups.
When tests are used to evaluate programs, there is less concern that individual children will be diagnosed incorrectly, but it is still essential that each child be able to "do their best."

Empirical validity. Tests are valid if they measure what they purport to measure. Content validity, discussed above, is established by logical analysis of test items referenced to the intended content universe. Additional evidence is needed, however, to verify the soundness of both the test-construction logic and the content analysis. Empirical evidence of test validity is typically obtained by correlating test scores with other measures of the same skills. Since these various criterion measures are also fallible, it is preferable to use more than one validation criterion. If a new test is developed for an assessment program, it may be advisable to study the agreement between student scores and teacher ratings of student skills and to examine correlations between subtests and other more lengthy tests available for assessing the same subskills. For an excellent summary of validation procedures see Cronbach (1971).

As was stated earlier, validity is not inherent in a test but depends on how a test is used. The uses that will be made of test results must, therefore, be taken into account when seeking evidence of validity. For example, when tests are to be used for program evaluation, evidence is needed of sensitivity to instruction or sensitivity to between-program differences (Airasian and Madaus, 1976; Rakow, Airasian, and Madaus, 1976). If tests are to be used to select individuals, predictive validity should be well established. The evaluator or evaluation team is not responsible for collecting such data but must certainly judge the adequacy of existing validity data.
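As a rough illustration of correlating a new test with more than one criterion measure, the sketch below computes Pearson correlations against two hypothetical criteria, teacher ratings and a longer test of the same subskills. The pupil scores and criterion names are invented for the example. A real validation study would need far larger samples and attention to the fallibility of each criterion.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for eight pupils on a newly developed subtest.
new_test = [12, 15, 9, 20, 17, 11, 14, 18]

# Two fallible criterion measures of the same skills (also hypothetical).
criteria = {
    "teacher_rating": [3, 4, 2, 5, 4, 2, 3, 5],
    "longer_test": [48, 55, 40, 71, 60, 45, 50, 66],
}

# One validity coefficient per criterion, so no single measure is decisive.
validity = {name: pearson(new_test, scores) for name, scores in criteria.items()}

for name, r in validity.items():
    print(f"{name}: r = {r:.2f}")
```

Reporting the coefficients side by side lets the evaluator see whether the evidence agrees across criteria rather than resting on any one of them.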
My consideration of validity here is necessarily brief because there are so many items in the checklist to cover and because the importance and method of test validation are known to both assessors and evaluators. I have introduced a few special issues to think about in the context of large-scale assessment. Such an abbreviated treatment should by no means be mistaken for an underrating of the importance of test validity. If the assessment instruments have serious flaws, the entire assessment effort is invalid. If the tests are not content valid, are culturally biased, or lack evidence of empirical validity, then the assessment results are questionable.

Reliability.
Reliability is a prerequisite for validity. In order to measure accurately,
an assessment instrument must measure dependably. Traditionally, reliability
is best operationalized by test-retest correlation coefficients. However,
when many children receive near-perfect scores, which may be the case
when tests are used to certify minimum competencies or to discriminate
among programs rather than children, correlations will be near zero
because of restriction of range. Additional information will be needed
to make a sensible interpretation. In some instances, it would be appropriate
to statistically correct for range restriction. Or, depending again
on the use of the tests, it may be appropriate to examine the stability
of program differences rather than pupil differences. Some evidence
must exist, however, that assessment results are not quixotic and that
conclusions would not be altered if children were tested on another
day or in another setting. Articles by Huynh (1976) and Subkoviak (1976)
indicate some of the progress being made at developing appropriate reliability
statistics for criterion-referenced or mastery tests.

Sampling

Sampling must be appraised in light of assessment purposes. If the Michigan first-grade test (cited earlier) were ever implemented statewide, it would be a disaster due to the divergent purposes of individual testing and state-level testing. To meet the state-level purposes efficiently, sampling is prescribed. For classroom teachers' use, every-pupil testing is required.

If sampling is appropriate, many more technical issues remain: adequate
specification of the population, practical sampling methods (e.g., cluster
sampling vs. random sampling), correct representation of sub-populations
for reporting purposes, efficient sampling (small standard errors),
and adequate follow-up of non-respondents.

Administration of Tests

Judging the administration of the assessment hinges on two major questions: do testing procedures yield the most accurate results possible, and is disruption kept to a minimum?

Test administration includes the selection and training of testers
and the clarity of instructions to teachers. In most instances, evaluators
should also inspect instructions to district superintendents and building
principals to see if they have been well informed about the assessment
being conducted in their jurisdiction. Most importantly, evaluators
will have to judge how appropriate test format and vocabulary are for
the children to whom the test is administered. An assessment would get low
ratings in this category if separate answer sheets were required for
children in the primary grades. The California Assessment Program would
get bonus points because every test is accompanied by a practice test
to make sure that all children know how to respond. At least some practice
questions are essential for all ages of respondents.

Reporting

Reporting is essential to make an assessment useful. Evaluation in this area is likely to require study of effects as well as study of assessment endeavors.

Data analysis. Routine data processing will have to be judged by the often incompatible criteria of accuracy and efficiency. Data analysis is more complicated. It requires judgments about statistical correctness but also prompts the larger question about appropriateness of selected variables. This may turn out to be a major issue in evaluating the utility of the assessment. Some critics of NAEP, for example, have insisted that results should be reported for Chicanos or for Spanish-speaking respondents. There are fairly large cost considerations, of course, that must be weighed against the gain in information.

Different reports for different audiences. Basically, what we are looking for are reports that contain the right kind of information and are understandable to their respective audiences. Recommendations for better reporting practices were made in Shepard (1975). Intrinsic evaluation of reporting documents involves consideration of length, medium selected for presentation, availability of personal explanations, journalistic style, use of visual displays, and avoidance of technical jargon. Empirical verification of the adequacy of reports should come from true field trials or from follow-up studies of reporting in the full-scale assessment.

Interpretations. Some assessment exercises have self-evident meaning. It is clearly bad, for example, that in the first citizenship cycle, NAEP found that 25 percent of the nation's 17-year-olds believed that they had to vote according to their party registration. But most assessment results require additional information to give them meaning. The extra ingredient must be either a performance standard or a norm against which the obtained results may be compared.
Some assessors are fearful of making interpretations because they believe it will undermine their neutrality, an attribute considered essential for working cooperatively with schools. However, assessors could take responsibility for seeing that interpretations are made without passing judgment themselves. A number of states have begun the practice of inviting subject-matter experts to review results and publish their interpretations. This practice will ultimately have to be judged by its effects, but, in general, should be positively rated by the evaluators. As a word of caution, such responsibility for interpretation should not be left only to the school people. Perhaps the League of Women Voters or the NAACP should be invited to make interpretations as well. Greenbaum (in press) proposed an additional group of reactors, political and social scientists. In a similar vein, I heard in a committee meeting a reference to the ultimate reporting vehicle: "Your results probably won't receive public notice until somebody famous writes a book about you."

3. Management

Checkpoint three calls for judgments of facilitating or enabling activities. Only two of the subpoints, formative evaluation and grievance procedures, are ends in themselves. The remaining topics are considered because they are likely to affect program outcomes.

Planning. I have a personal aversion to elaborate PERT charts, but it is essential that the assessment staff identify each step in development and implementation and plan who will work on each activity during which weeks and months. There is a tendency in all of us to work methodically at first and then run out of time for the last segments of a project. This paper gives more attention to the beginning than to the end. Perhaps its only salvation is that the total outline was decided before the detailed work began. Assessment staffs ought to be scrutinized particularly for how they deal with unreasonable expectations.
If they are being asked to do more than is possible, do they have good "coping behaviors"? Are they able to set priorities? Are they able to petition effectively for more resources or diminish expectations ("This will take two years instead of one.")?

Documentation of process. This is a record-keeping requirement. Work may be duplicated if good records are not kept. Compromises reached about test content or sampling plans should be well documented, specifying the arguments on each side and the details of the compromise. Faulty memory on these matters will require that consensus be constantly reestablished. Though constant debate would be dysfunctional for the operations of the assessment, new compromises will have to be reached from time to time. In these instances, an accurate history would make the deliberations more efficient. Documentation may be maintained by recording decision-making meetings or by keeping a file of working papers summarizing arguments and agreements.

Formative evaluation. Scriven's Extended Support checkpoint causes us to look for mechanisms whereby the product will be continually upgraded. Formative evaluation may be a comprehensive study of the assessment or it may be a combination of separate activities such as field testing instruments and reports. It should, in either case, address questions of social utility. Is the assessment being used? Are there unintended uses that can be expanded? Scriven (1974) asks the question, Are there others who are producing a better product to serve the same need? "One decision that should remain open is the decision to cease production, even if it is commercially profitable to continue, when the evidence clearly indicated the existence of a superior product that can reasonably be expected to take over a free market" (p. 21). The assessment program should be graded on the provision for and use of formative evaluation.

Personnel.
The next three categories, Personnel, Support Services, and Subcontractors, should be taken together to ensure that all the necessary competencies are available to conduct the assessment. Assessment requires political insight, technical expertise, management skills, writing ability, and a good sense of humor. Not every member of the assessment team will have all of these skills, but they must all be represented. An assessment run by experienced educators with lots of political savvy is just as imbalanced as one run by a dozen statisticians. If some of the necessary skills are missing in the assessment staff, assessment personnel have to be clever enough to know when to hire consultants or subcontractors.

Support services. At the simplest level, this item means asking whether work is being thwarted because six staff members share one secretary. Evaluators will not want to waste their time measuring office space and counting filing cabinets, but gross inadequacies or excesses in materials or services should be challenged. If assessment staff are housed on separate floors so that conferences are difficult to arrange, evaluators may give this ailment special attention. If computer facilities are grossly underused, time-sharing arrangements with other agencies might be recommended. Consultants should be used when needed to augment staff expertise.

Subcontractors. Writing RFPs (requests for proposals) is an art that assessment staff will be graded on. Assessment staff are responsible for deciding which tasks can most appropriately be assigned to a subcontractor, for selecting the subcontractor, and for carefully monitoring subcontract activities. For example, a minor criticism of NAEP ("An evaluation of the NAEP," 1975) was that they had "farmed-out" a research report on background variables. At issue was not the quality of the report produced by the subcontractor, but the opportunity lost to the NAEP staff.
It's bad for morale to relegate staff to routine tasks and give more inspired assignments to a subcontractor. In addition, NAEP staff might lack the familiarity with the intricacies of the report that they could have had if they had been its authors. More frequent sources of error are subcontractors who are generally competent, but who lack specific insight into the needs of a local assessment. The remedy for this is close cooperation between assessment staff and contractor.

Effective decision-making procedures. Ineffective decision making could be an amorphous malaise that is observable only in late reports or contradictory public statements. However, if evaluators can be specific about decision-making difficulties, there may be clear implications for program improvement. Here are some things to look for: Are all decisions reserved for a boss who's out of town 50 percent of the time? Are decisions made by default? Are key decisions made by uninformed sources outside of the assessment staff? Are members of the staff satisfied with the way decisions are made?

Equal opportunity. Assessments may have special problems as equal-opportunity employers. Assessment is a "math-related" activity. Women have been taught to avoid math and may remove themselves from the competition for assessment jobs. If some minority group members have poor elementary and secondary training in mathematics, they may have eschewed math-related subjects in college. Special recruitment may be necessary to interrupt this trend.

Redress of grievances. Regular procedures should be established for hearing and redressing complaints against the assessment. If particular school and district results are in error, new reports should be printed. Assessment staff are likely to receive a certain amount of hate mail and irate phone calls. In the interest of fairness, responses to complaints should make clear exactly what will be done to correct the problem.
In many instances, the appropriate answer is that no change will be made. Reasons for this staff stance should be clearly delineated. The assessment staff should only be faulted if they make false promises.

Timeliness. This criterion needs little elaboration. Good information is not helpful if it arrives after decisions have already been made. When assessment results are collected for long-term census monitoring, delays of six months in reporting are not crucial. When assessment results are meant to be used for program evaluation or state-level decision-making, delays may render the reports useless. This item certainly overlaps with the effects considered in the next section.

4. Intended and Unintended Effects

People and Groups Who May be Affected

Earlier, I suggested some strategies for seeking the effects of assessment. Effects may be subtle and indirect. The evaluator must be a bit of a sleuth to detect them. For example, the true effect of the New York State Regents' Exam may not be to provide evidence for admission to college but to increase the amount of studying done by high school juniors and seniors.

Possible effects might be discovered by surveying diverse audiences. Then, purported effects must be tracked down and verified. The evaluator must be particularly cautious not to identify consumers who are only in the educational community. Educators are the more visible consumers and are perhaps better informed about assessment, but important recipients exist in the larger community. The Assessment Checklist identifies individuals and groups who may be affected by assessment:
The evaluator may add to the list. Each audience must then be studied in light of the possible effects. This checkpoint incorporates process effects as well as outcomes. If children enjoy taking a test or if teachers feel threatened by the results of the test, these are effects of the assessment.

The discussion of effects subsumes Stufflebeam's utility criteria of Relevance, Importance, Credibility, and Pervasiveness. The impact of the assessment is measured in terms of who uses the results and how widespread the effect is. Suppose one or two teachers testify that they have used assessment results to modify their classroom instruction. If they are isolated examples of this practice, there is evidence of potential use but not of significant current use.

Scriven's discussion of Educational Significance suggests that at some point we go back to the original need. Is this program serving the purpose for which it was intended? Of course, we must also be on the lookout for unintended effects and canceled purposes.

Kinds of Effects

The second set of entries in the Effects category contains some possible consequences of assessment to consider.

Outcome and process effects. In other evaluation contexts, process variables are usually given separate consideration. In this situation, however, processes are those described in the management portion of this paper. Things that happen to teachers and children in the process of test administration are actually side effects of the assessment.

Immediate and long-term effects. Immediate consequences of the assessment should be observable in program changes, public reactions, resource allocations, and development of new curricula. Long-term effects can only be estimated by the evaluator. Educational assessment is in its infancy. Some uses which may emerge in the future have not been identified. Reliance on the on-going nature of the information has not been established.
The evaluator may have to be an advocate in trying to project the assessment's ultimate usefulness. What uses have emerged for data from analogous enterprises, the U.S. Census and the GNP, that were not thought of when the data collection began? Of course, the evaluator's advocacy should not subvert his responsibility to obtain confirming opinions of his projections or to report contrary views of the future as well.

Opinions. Attitudes toward testing and toward education may be influenced by an assessment program. For example, if teachers believe that an assessment program is unfair, they may lose confidence in all tests and convey this misgiving to their children. A positive outcome may occur when children have an opportunity to take "easy" tests. The California first-grade test is not meant to discriminate among children. It is used for baseline data to distinguish programs. Tests on which many children get a perfect score can still show differences between schools and programs. The California test consisted of questions that 80 or 90 percent of the children answered correctly. Even on the hardest question, half of the children got the right answer. Teachers reported that children thought the test was fun.

The affective component of accountability should also be considered under this heading. Public confidence in education may result simply because people have access to information.

Decisions, laws, resource allocation. Decisions are tangible, e.g., a new emphasis on math, government funding for consumer education, legislative requirements for minimum competency testing. The question is, which decisions are attributable to assessment results? One might begin by asking decision-makers, superintendents, and legislators. But many effects are more subtle. Scriven's modus operandi method, cited previously, may help us uncover more grass-roots influences. 
Are citizens voting against school bond issues because they won't pay higher taxes or have they lost confidence in the public schools? What evidence do citizens cite for or against public schools? How many editorials in selected newspapers refer to assessment results? How do interest groups use assessment to lobby for their purposes? The evaluator should review legislative debates on educational measures. Are assessment results quoted? Decisions about resource allocation should be the easiest to document. What are the formulae for distributing federal and state funds? Within districts, how are special resources apportioned to schools? There is probably no harm in assessment staffs collecting evidence that verifies the uses of assessment results as long as they do not throw out evidence of misuses. True, this may give a biased view since many isolated examples of use may not reflect pervasive use. However, it will be up to the evaluator to substantiate the extent of each type of use. In searching for examples of information use, evaluators should be wary of the "common knowledge" fallacy. In the State Senate evaluation of the Minnesota Assessment (1975), there is this complaint:
Have such things always been known? What these authors claim is no secret may not always have been common knowledge. We ought to find out how much of this was publicly acknowledged before Coleman. What kinds of evidence were used to argue for ESEA Title I and Headstart? The Senate researchers may have a valid point if there are countless assessments all producing duplicate information; but that is a separate issue from whether the information is useful in itself.

Effects of assessment technology. Assessment efforts may produce
expertise as well as data. Through numerous conferences with state and
local educators NAEP has disseminated its methodological expertise and
fostered sharing among other assessors. Shared expertise should improve
the overall quality of assessment. However, evaluators should watch
for instances where the imported model is inappropriate for local purposes.
State departments may, in turn, give assistance to local districts that
wish to conduct assessments for their own purposes. "Piggy-backing"
is the term that has been adopted in the trade to signify local assessments
that take advantage of the instrument development and sampling plan
of the larger assessment organization. By oversampling in districts
that ask for more information, state assessments can provide them with
test data much more economically than if they had tried to conduct the
assessment themselves.

5. Costs

The evaluator should collect cost data as carefully as he gathers effects data. Costs must include expenditures in dollars and time. It is a matter of style, however, whether one wishes to count emotional costs in this category or consider them as effects in checkpoint four.

Scriven (1974) made several requirements of cost data: "There should be some consideration of opportunity costs...What else could the district or state have done with the funds...Cost estimates and real costs...should be verified independently...[and], costs must, of course, be provided for the critical competitors" (pp. 20-21).

In an assessment context, costs should include obvious things like the salaries of assessment staff and the cost of their offices. The most obvious cost is that of consultants or subcontractors since they present a bill. Costs which are often overlooked are the salaries of teachers and curriculum experts who are released by their districts to consult on objectives development or item writing. Also forgotten is the cost in teacher and student time while the assessment is underway. How many total hours are spent by district personnel and principals distributing and collecting test materials? Be sure to count the time they spend on the telephone answering questions. |
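The bookkeeping implied above, dollars billed plus time diverted, might be sketched as follows. This is only an illustration: the record type `TimeCost`, the function `total_cost`, and every figure in the example are hypothetical placeholders, not data from any actual assessment. Student time is given an hourly rate of zero here simply because the text does not say how it should be priced; its hours are still counted.

```python
from dataclasses import dataclass


@dataclass
class TimeCost:
    """Time diverted to the assessment by one group (hypothetical record)."""
    role: str
    people: int
    hours_each: float
    hourly_rate: float  # assumed dollar value of one hour; 0.0 if not priced

    def hours(self) -> float:
        return self.people * self.hours_each

    def dollars(self) -> float:
        return self.hours() * self.hourly_rate


def total_cost(direct_dollars: dict, time_costs: list) -> dict:
    """Combine billed expenditures with the often-overlooked time costs."""
    direct = sum(direct_dollars.values())
    time_hours = sum(t.hours() for t in time_costs)
    time_dollars = sum(t.dollars() for t in time_costs)
    return {
        "direct_dollars": direct,
        "time_hours": time_hours,
        "time_dollars": time_dollars,
        "total_dollars": direct + time_dollars,
    }


# Placeholder figures chosen only to make the arithmetic easy to follow.
direct = {
    "assessment staff salaries": 100_000.0,
    "offices": 20_000.0,
    "consultants and subcontractors": 50_000.0,
}
time_costs = [
    TimeCost("released teachers (item writing)", 10, 40, 10.0),
    TimeCost("teachers (test administration)", 200, 2, 10.0),
    TimeCost("district personnel (materials, telephone)", 20, 5, 12.0),
    TimeCost("students", 5000, 1, 0.0),
]

summary = total_cost(direct, time_costs)
print(summary)
```

Reporting hours separately from dollars leaves the evaluator free to present unpriced time, such as student hours, alongside the budget rather than hiding it inside a single total.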