Section A

Basic Principles of Evaluation: An Overview

Harold G. Levine, M.S.

Importance of Testing and Evaluation
Although the process of testing and evaluation is often treated as something separate from instruction, evaluation is an essential part of the instructional process. One way of looking at instruction is to note that it has three elements. The first element is introducing the learners to instructional goals and objectives, i.e. new ideas, processes, methods, etc. The second element is practice, which provides the opportunity for learners to use the new ideas in appropriate contexts. The third element is feedback or evaluation, when the learners are informed about whether or not they have used the new ideas in an appropriate fashion. The feedback phase of instruction is often neglected, so that learners fail to discover that they have not mastered ideas or skills. Even worse, learners may receive information from the feedback phase which is not helpful. For example, junior medical students taking a pediatrics course may be introduced to the enormously important set of concepts relating to fluid and electrolyte balance. If the students do not get a chance to practice these ideas and get appropriate feedback about their mastery of the concepts, they may see children suffering from dehydration and have no idea how to manage the children’s problems.

Definition of Evaluation
Based upon the description of evaluation given above, testing and evaluation can be conceptualized as providing learners with information about their competency in acquiring knowledge and skills. The evidence that learning has taken place is based upon a sample of the behavior of learners, either directly observed behavior, such as watching learners examine patients, or behavior on tests, such as choosing, describing, etc. Behavior alone is not sufficient; the learners must be informed that the behaviors sampled have satisfied some criteria of effective behavior. Examinees must be informed about the results of the test; those observed must be told if they performed the task effectively. Thus, a test or evaluation is a sample of behavior which is used to make some value judgment. If the value judgment is not generalized beyond the particular context in which it is gathered, we call it feedback. Medical students who examine patients and are told by preceptors that they did an effective or ineffective job of examining the patients are getting feedback. If the preceptors decide on the basis of a number of observations that certain students need to repeat a course or receive pass or honors, the students have been evaluated, i.e. they have received grades based upon generalizations derived from their accomplishment of a number of tasks.

Effects of Evaluation
The process of testing and evaluation has a large number of effects on education and educational systems beyond the task of providing feedback to learners in practice situations. Evaluation tells the learner what to study and how hard to study. It provides information about which learners need additional education and which ones should be dropped from an educational course of study. It helps education managers to develop curricula and to select those who will be permitted to enter their programs. It helps learners to decide on their future careers. The effectiveness of decisions which are assisted by the various roles of testing and evaluation is considerably influenced by the strategies used by educational managers in the design of evaluation programs. Regardless of the particular approach taken in devising a testing instrument, it should simulate the work of a physician.

Evaluation Strategies
Instructors and course directors have a variety of strategies they can use to carry out their testing and evaluation functions. They can use analytic methods, which gather samples of behavior in which each sample is only a small part of a more complex idea or concept, e.g. results of objective tests, to provide feedback and evaluation; they can use synthetic methods, in which the sample of behavior gathered is large and complex, e.g. observing a medical student gathering historical information; or they can use methods which lie between these types, e.g. simulation tests. They can give tests which are comprehensive, i.e. cover material learned over large blocks of time, or scatter tests throughout the course of study. They can give grades which focus on pass–fail, or provide a series of grades such as A, B, C…. They can base their selection of tests on a huge reference work, or on a defined set of instructional materials. They can use tests and testing personnel from outside the local system in designing the testing program, or they can rely mainly on internal decision makers. They can have the learners choose among a defined set of possible answers (multiple choice tests) or generate responses (essays and oral tests). All of these strategies have advantages and disadvantages based on their effects on learning and the costs of the evaluation system.

Criteria for Selecting Evaluation Methods
Evaluation specialists have developed four important criteria for choosing evaluation methods. Two of these criteria, validity and reliability, are technical attributes of all measurement instruments which should be taken into account in using measurements. The third criterion, practicality and cost, is essentially a management criterion which depends greatly upon the resources and values of those developing and using tests and evaluation methods. The fourth criterion is the effect on learning. This criterion is the most important for reasons which have already been discussed.

A test or evaluation has two attributes: the gathering of samples of behavior, and a decision based upon the sample. Validity is the characteristic of a test or a testing program which relates to the decisions made. For example, students may know a large number of facts and do well on fact-based examinations. Based upon these data, it might be decided that these students can solve problems. However, this decision may or may not be valid because it goes beyond the data provided by the test. In many cases learners’ knowledge of facts is a necessary but not sufficient condition for deciding that the learners have mastered the information required to meet the standards desired by the program. In clinical situations, the mastery of facts is often given overwhelming importance in deciding whether students have satisfied program standards, while a number of attributes of effective performance in clinical situations, e.g. problem-solving skills, interpersonal skills, technical skills such as those required in performing a physical examination or obtaining materials for laboratory examinations, work habits, and attitudes, are not assessed. In this case the test of facts may be valid, but the evaluation system is not valid.

While validity is an attribute of the decisions made based on the results of evaluation methods, reliability is a technical attribute of the measurement method itself as used on a particular population of individuals who are assessed by the instrument. Since evaluation is ultimately generalizing about a sample of behavior, reliability is an estimate of the amount of error which exists in a particular measurement. Error may be conceptualized as the likelihood that the results would be similar if the measurement were repeated. Some types of error are typical of the evaluation method. For example, examinees with bad handwriting usually do less well on essay tests than they would on other types of tests. The most common and most pervasive type of error in tests is sampling error. We all know, intuitively, that there is a great deal of error in small samples. A learner might get one question right and another wrong simply because of the choice of questions. The more questions that are asked, the more likely it is that a test is reliable. For this reason, certifying examinations such as the Medical Licensing Examination contain hundreds of questions. Evaluation exercises which can be used for feedback purposes, e.g. observing a medical student with one patient, cannot be used for decisions about promotion because of the error in small samples of behavior. Another source of error in examinations such as essays, orals, or observations is rater error. Raters tend to value different attributes of performance, focus on different attributes of what is observed, and, if the evaluation is complex, weight elements of the sample of behavior differently. Raters also have different standards, even if they agree on what is observed. Two raters may rank a group of examinees the same, but one might give higher grades than the other.
Even though observer errors can be limited by examiner training, sampling error can still create great unreliability in any test which uses only a small number of exercises or observations.

Practicality and Cost
Some possible tests are impractical. It may be difficult to get enough oral examiners to conduct a certification examination, or a sufficient number of patients cannot be assembled to allow all the examinees to provide samples of behavior with patients for assessment purposes. Since it is essential that an educational program provide some assessment of clinical skills in order to develop a valid evaluation system, issues of cost in terms of faculty time, the hiring of simulated patients, the utilization of support personnel, etc., come back to the values of those running the program. If faculty members receive little in the way of rewards or recognition for teaching, they will be reluctant to spend energy in evaluating learner performance. Sometimes an imaginative use of resources can modify the cost-benefit ratio of effective evaluation techniques. Examinees can be screened to see if they are at risk of marginal performance, and only the weakest performers given the more expensive techniques. Observations which are quite expensive in faculty time can be made by non-faculty members such as students, nurses, physician assistants, etc.

Effect on Learners
The possible effect on learners of evaluation methods has already been mentioned, so this section is quite brief. It is particularly important for course directors to realize that learners will focus on what is evaluated. If important attributes of clinical performance are not assessed, then the learners will neglect those aspects in favor of those which are assessed. Regardless of what is written in a course outline or syllabus, learners perceive that the objectives of the faculty are what is assessed by the faculty.

Interactions Among Criteria
Unfortunately, it is difficult to use one evaluation method which adequately meets all the criteria described above. Therefore, it is necessary to use a variety of evaluation methods, and to use these methods to establish standards of performance in imaginative ways. The most reliable of all types of tests are objective tests, since they contain little rater error and they can sample large amounts of information. Objective tests can be valid for at least some of the objectives of medical education. Unfortunately, it is much easier to write trivial objective questions than searching ones. Furthermore, the objective format does not allow the sampling of important information relating to clinical skills, work habits, and attitudes.

Observations of clinical experience with actual or trained patients are highly valid in that the behaviors assessed are similar to those required in clinical practice. Unfortunately, these observations, while splendid for feedback purposes, require great amounts of observer time and are subject to observer error. Faculty are justifiably concerned about the reliability of such observations for grading or promotion. Even if numerous observations are made, the preceptors’ concerns about “subjective” grading make them reluctant to require learners to undergo repeat learning experiences on the basis of observations of performance. Ideally, clinical courses must increase the number of observations they make of clinical skills and be willing to require the students to undergo more intensive training if they are found to be deficient.

One still confronts a key question in the evaluation enterprise: what is to be done with the result or score? While it might appear self-evident that clerkship directors aim to provide students with an accurate evaluation of their effort (with the grade), compiling such an overall assessment can be challenging. Traditionally, most clerkships rely upon a mass of subjective data and hope that by enlarging the number of raters, the outcome assessment will become more accurate. Various instruments which will be described in this section aim for a more objective and quantitative student score. If we can achieve this objective, how should the information be treated? Should it constitute a fixed fraction of the overall student grade? Should the objective test score aim at identifying a particular subgroup of students? For example, there is a keen interest in clearly identifying the marginal student in an objective fashion. Similarly, we seek a tool which will discriminate those who perform at a level better than “pass/satisfactory” but fall short of the top (honors) group. The availability of an objective student score also raises interesting questions regarding its effect on the overall grade. For example, should a student score very poorly on the objective test, should he be disqualified from an honors grade? Would outstanding performance on the objective test cancel out a marginal clinical performance over the clerkship?

Posing these questions affords no answers; they are meant to provoke thoughtful analysis. Informal discussions with other clerkship directors (both in pediatrics and other specialties) bring out three common themes regarding the use of an objective score. Many favor using the score as a fixed percentage of the overall mark, typically 10–15%. Clerkship directors are particularly keen to devise a discriminating tool, first for the marginal student and second, for the better than average student who is not at an honors level.
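As a concrete illustration of the first theme, one might fold the objective score into the overall mark at a fixed weight. In the sketch below, the 15% weight, the 0–100 scales, and the grade-band cutoffs are all invented assumptions echoing the 10–15% range mentioned above, not a recommended policy.

```python
def combine_grade(clinical_score, objective_score, objective_weight=0.15):
    """Blend a preceptor's clinical rating with an objective test score,
    both assumed to be on a 0-100 scale. The weight and the band cutoffs
    below are illustrative assumptions, not an endorsed grading scheme."""
    overall = (1 - objective_weight) * clinical_score \
              + objective_weight * objective_score
    if overall < 65:
        band = "marginal"
    elif overall < 85:
        band = "pass/satisfactory"
    else:
        band = "honors"
    return round(overall, 1), band

# A strong clinical performance with a middling objective score:
print(combine_grade(90, 70))   # -> (87.0, 'honors')
```

Note that at a 15% weight the objective score can pull a borderline student across a band boundary but cannot by itself cancel out a marginal clinical performance, which is one way of answering the questions posed above.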

Challenges in the Ambulatory Setting
In addition to developing effective tools for students in traditional clerkship settings, we need to look to the future and plan for changes. Pressure to move medical student education into the ambulatory setting comes from fiscal, managed care, pedagogical, and training quarters. While the basic principles for student assessment remain unchanged, their application will be more complex. Inevitably, a larger faculty teaches in the ambulatory setting, students rotate through sites, and the likelihood of discontinuity increases. Regardless of whether a student participates in community or medical center sites, specialty or subspecialty ambulatory learning, all such teaching environments run the risk of being more fragmented than a ward-based team of faculty and house officers. Some can make the argument that the demands of an ambulatory setting actually increase the need for the kinds of testing instruments discussed in the following sections. Such instruments provide some consistency in evaluation and, when in place, will tend to drive a curriculum in a more coherent fashion. It is not the purpose of this monograph to discuss ambulatory teaching, nor its assessment. Nonetheless, we need to redouble our efforts to design better evaluation tools as students disperse into ambulatory sites with an enlarged faculty who devote a smaller fraction of their working day to teaching.

Tests and evaluations are basic requirements of educational programs. Measurement efforts must possess sufficient validity, reliability, and practicality, at acceptable cost, to provide effective methods of influencing learning. Present day evaluation methods in many clinical educational programs pay too little attention to clinical skills such as problem-solving skills, interpersonal skills, technical skills, work habits, and attitudes to be of optimum effectiveness as educational methods.