




Section A
Harold G. Levine, M.S.
Importance of Testing and Evaluation
Although the process of testing and evaluation is often treated as something separate
from instruction, evaluation is an essential part of the instructional process.
One way of looking at instruction is to note that it has three elements. The
first element is introducing the learners to instructional goals and
objectives, i.e. new ideas, processes, methods, etc. The second element is
practice which provides the opportunity for learners to use the new ideas in
appropriate contexts. The third element is feedback or evaluation, in which the
learners are informed about whether or not they have used the new ideas in an
appropriate fashion. The feedback phase of instruction is often neglected, so
that learners fail to discover that they have not mastered ideas or
skills. Even worse, learners may receive information from the feedback phase
which is not helpful. For example,
junior medical students taking a pediatrics course may be introduced to the
enormously important set of concepts relating to fluid and electrolyte balance.
If the students do not get a chance to practice these ideas and get appropriate
information or feedback about their mastery of the concepts, the students may
see children suffering from dehydration and have no idea about how to manage
the children's problems.
Definition of Evaluation
Based upon the description of evaluation given above, testing and evaluation can be
conceptualized as providing learners with information about their
competency in acquiring knowledge and skills. The evidence that learning has
taken place is based upon a sample of the behavior of learners, either directly
observed behavior such as watching learners examine patients, or behavior on
tests such as choosing, describing, etc. Behavior alone is not sufficient;
the learners must be informed that the behaviors sampled have satisfied some
criteria of effective behavior. Examinees must be informed about the results of
the test; those observed must be told if they performed the task effectively.
Thus, a test or evaluation is a sample of behavior which is used to make some
value judgment. If the value judgment is not generalized beyond the particular
context in which it is gathered, we call it
feedback. Medical students who examine patients and are told by preceptors that
they did an effective or ineffective job of examining the patients are getting
feedback. If the preceptors decide on the basis of a number of observations
that certain students need to repeat a course or receive pass or honors, the
students have been evaluated, i.e. they have received grades based upon
generalizations derived from their accomplishment of a number of tasks.
Effects of Evaluation
The process of testing and evaluation has a large number of effects on education
and educational systems beyond the task of providing feedback to learners in
practice situations. Evaluation tells the learner what to study and how hard to
study. It provides information about which learners need additional education
and which ones should be dropped from an educational course of study. It helps
education managers to develop curricula, and to select those who will be
permitted to enter their programs. It helps learners to decide on their future
careers. The effectiveness of decisions which are assisted by the various roles of
testing and evaluation is considerably influenced by the strategies used by
educational managers in the design of evaluation programs. Regardless of the
particular approach taken in devising a testing instrument, it should simulate
the work of a physician.
Evaluation Strategies
Instructors and course directors have a variety of strategies they can use to carry out
their testing and evaluation functions. They can use analytic methods which
gather samples of behavior in which each sample is only a small part of a more
complex idea or concept, e.g. results of objective tests, to provide feedback
and evaluation, or they can use synthetic
methods in which the sample of behavior gathered is large and complex, e.g.
observing a medical student gathering historical information, or they can use
methods which lie between these types, e.g. simulation tests. They can give
tests which are comprehensive, i.e. cover material learned over large blocks of
time, or scatter tests throughout the course of study. They can give grades
which focus on pass--fail, or provide a series of grades such as A, B, C....
They can base their selection of tests on a huge reference work, or on a
defined set of instructional materials. They can use tests and testing
personnel from outside the local system in designing the testing program or
they can rely mainly on internal decision makers. They can rely on the learners
choosing among a defined set of possible answers, as in multiple choice tests,
or generating responses, as in essays and oral tests. All of these various
strategies of approaching testing and
evaluation have advantages and disadvantages based on their effects on learning
and the costs of the evaluation system.
Criteria for Selecting Evaluation Methods
Evaluation specialists have developed four important criteria for choosing evaluation
methods. Two of these criteria, validity and
reliability, are technical attributes
of all measurement instruments which should be taken into account in using
measurements. The third criterion,
practicality and costs, is essentially a management criterion which depends
greatly upon the resources and values of those developing and using tests and
evaluation methods. The fourth criterion is the effect on learning. This criterion is the most important for
reasons which have already been discussed.
Validity
A test or evaluation has two attributes--the gathering of samples of behavior,
and a decision based upon the sample. Validity is the characteristic of a test
or a testing program which relates to the decisions made. For example, students
may know a large number of facts, and do well on fact based examinations. Based
upon these data, it might be decided that these students can solve
problems. However, this decision may or
may not be valid because it goes beyond the data provided by the
test. In many cases learners' knowledge
of facts is a necessary but not sufficient condition for deciding that the
learners have mastered the information required to meet the standards desired
by the program. In clinical situations,
the mastery of facts is often given overwhelming importance in deciding
whether students have satisfied program standards, while a number of attributes
of effective performance in clinical situations, e.g. problem-solving skills,
interpersonal skills, technical skills such as those required in performing a
physical examination or obtaining materials for laboratory examinations, work
habits and attitudes, are not assessed. In this case the test of facts may be
valid, but the evaluation system is not valid.
Reliability
While validity is an attribute of the
decisions made based on the results of evaluation methods, reliability is a
technical attribute of the measurement method itself as used on a particular
population of individuals who are assessed by the instrument. Since evaluation
ultimately generalizes from a sample of behavior, reliability is an estimate
of the amount of error which exists in a particular measurement. Reliability may be
conceptualized as the likelihood that the results would be similar if the
measurement were repeated; error is whatever makes repeated results differ.
Some types of error are typical of the evaluation
method. For example, examinees with bad handwriting usually do less well on
essay tests than they would on other types of tests. The most common and most
pervasive type of error in tests is sampling error. We all know, intuitively,
that there is a
great deal of error in small samples. A learner might get one question right
and another wrong simply because of the choice of questions. The more questions
that are asked, the more likely it is that a test is reliable. For this reason,
certifying examinations such as the Medical Licensing Examination contain
hundreds of questions. Evaluation exercises which can be used for feedback
purposes, e.g. observing a medical student with one patient, cannot be used for
decisions about promotion because of the error in small samples of behavior.
Another source of error in examinations such as essays, orals or observations
is rater error. Raters tend to value different attributes of performance, focus
on different attributes of what is observed and, if the evaluation is complex,
weight elements of the sample of behavior differently. Raters also have
different standards, even if they agree on what is observed. Two raters may
rank a group of examinees the same, but one might give higher grades than the
other. Even though observer errors can be limited by examiner training,
sampling error can still create great unreliability in any test which uses only
a small number of exercises or observations.
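The link between test length and sampling error can be illustrated with a short
calculation. The sketch below is not part of the original discussion; it assumes
a simple binomial model in which every question is independent and equally
difficult, and the learner's true mastery level (`P_TRUE`) and the test lengths
are hypothetical values chosen for illustration.

```python
import math


def standard_error(p_true, n_questions):
    """Standard error of an observed proportion-correct score.

    Assumes a binomial model: n_questions independent questions, each
    answered correctly with probability p_true.
    """
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(p_true * (1.0 - p_true) / n_questions)


# Hypothetical learner who has truly mastered 70% of the domain.
P_TRUE = 0.70

for n in (5, 20, 100, 400):
    se = standard_error(P_TRUE, n)
    print(f"{n:4d} questions: standard error of the observed score = {se:.3f}")
```

Under these assumptions the standard error is halved each time the number of
questions is quadrupled, which is one way to see why a single observed encounter
is too small a sample for promotion decisions.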
Practicality and Cost
Some possible tests are impractical. It may be difficult to get enough oral examiners
to conduct a certification examination, or a sufficient number of patients
cannot be assembled to allow all the examinees to provide samples of behavior
with patients for assessment purposes. Since a valid evaluation system requires
that an educational program provide some assessment of clinical skills, issues
of cost in terms of faculty time, the hiring of simulated patients, the
utilization of support personnel, etc., go back to the values of those running
the program. If faculty members receive little in the way of rewards or
recognition for teaching, they will be reluctant to spend energy in evaluating
learner performance. Sometimes an imaginative use of resources
can modify the cost-benefit ratio of effective evaluation techniques. Examinees
can be screened to see if they are at risk of marginal performance, and only
the weakest performers given the more expensive techniques. Observations which
are quite expensive in faculty time can be made by non-faculty members such as
students, nurses, physician assistants, etc.
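The screening idea above can be sketched in a few lines. This is an illustration
only: the student labels, screening scores, cutoff and hours per observation are
all hypothetical, and a real program would have to choose its own screening
instrument and threshold.

```python
def select_for_observation(screen_scores, cutoff):
    """Return, in alphabetical order, examinees scoring below the cutoff."""
    return sorted(name for name, score in screen_scores.items() if score < cutoff)


def faculty_hours(n_examinees, hours_per_observation):
    """Total observer time needed to observe n_examinees once each."""
    return n_examinees * hours_per_observation


# Hypothetical screening results on an inexpensive test (0-100 scale).
SCREEN_SCORES = {
    "student_a": 82.0,
    "student_b": 64.0,
    "student_c": 91.0,
    "student_d": 75.0,
    "student_e": 58.0,
    "student_f": 88.0,
}
CUTOFF = 70.0                 # hypothetical threshold for "at risk"
HOURS_PER_OBSERVATION = 1.5   # hypothetical observer time per examinee

at_risk = select_for_observation(SCREEN_SCORES, CUTOFF)
hours_all = faculty_hours(len(SCREEN_SCORES), HOURS_PER_OBSERVATION)
hours_screened = faculty_hours(len(at_risk), HOURS_PER_OBSERVATION)

print("Observe directly:", at_risk)
print(f"Observer hours if everyone is observed: {hours_all}")
print(f"Observer hours after screening: {hours_screened}")
```

The saving depends entirely on the quality of the screen: an examinee who is
wrongly passed by the inexpensive test never receives the more expensive
observation.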
Effect on Learners
The possible effect on learners of evaluation methods has already been mentioned,
so this section is quite brief. It is particularly important for course
directors to realize that learners will focus on what is evaluated. If
important attributes of clinical performance are not assessed, then the
learners will neglect those aspects in favor of those which are assessed.
Regardless of what is written in a course outline or syllabus, learners
perceive that the objectives of the faculty are what is assessed by the
faculty.
Interactions Among Criteria
Unfortunately, it is difficult to use one evaluation method which adequately meets all the
criteria described above. Therefore, it is necessary to use a variety of
evaluation methods, and to use these methods to establish standards of
performance in imaginative ways. The most reliable of all types of tests are
objective tests since they contain little rater error, and they can sample large
amounts of information. Objective tests can be valid for at least some of the
objectives of medical education. Unfortunately, it is much easier to write
trivial objective questions
than searching ones. Furthermore, the objective format does not allow the
sampling of important information relating to clinical skills, work habits and
attitudes.
Observations of clinical experience with actual or trained patients are highly valid in that
the behaviors assessed are similar to those required in clinical practice.
Unfortunately, these observations, while splendid for feedback purposes, require
great amounts of observer time and are subject to observer error. Faculty are
justifiably concerned about the
reliability of such observations for grading or promotion. Even if numerous
observations are made, the preceptors' concerns about
"subjective" grading make them reluctant to require learners to
undergo repeat learning experiences on the basis of observations of
performance. Ideally, clinical courses should increase the number of observations
they make of clinical skills and be willing to require the students to undergo
more intensive training if they are found to be deficient.
One still confronts a key question in the evaluation enterprise: What is to be done
with the result or score? While it might appear self evident that clerkship
directors aim to provide students with an accurate evaluation of their effort,
expressed in the grade, the compiling of such an overall assessment can be
challenging. Traditionally, most clerkships rely upon a mass of subjective data
and hope that by enlarging the number of raters, the outcome assessment will
become more accurate. Various instruments which will be described in this
section aim for a more objective and quantitative student score. If we can
achieve this objective, how should the information be treated? Should it
constitute a fixed fraction of the overall student grade? Should the objective
test score aim at identifying a particular subgroup of students? For example,
there is a keen interest in clearly identifying the marginal student in an
objective fashion. Similarly, we seek a tool which will discriminate those who
perform at a level better than "pass/satisfactory" but fall short of the top
(honors) group. The availability of an objective student score also raises
interesting questions regarding its effect on the overall grade. For example,
should a student who scores very poorly on the objective test be disqualified
from an honors grade? Would outstanding performance on the objective test
cancel out a marginal clinical performance over the clerkship?
Posing these questions affords no answers; they are meant to provoke thoughtful
analysis. Informal discussions with other clerkship directors (both in
pediatrics and other specialties) bring out three common themes regarding the
use of an objective score. Many favor using the score as a fixed percentage of
the overall mark, typically 10-15%. Clerkship directors are particularly keen
to devise a discriminating tool, first for the marginal student and second, for
the better than average student who is not at an honors level.
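To make the fixed-percentage option concrete, the sketch below computes an
overall mark as a weighted average of a clinical score and an objective test
score. The 0-100 scale, the function name `composite_grade` and the example
scores are assumptions made for illustration; only the 10-15% weight range comes
from the discussions described above.

```python
def composite_grade(clinical, objective, objective_weight=0.15):
    """Weighted average of a clinical score and an objective test score.

    Both scores are assumed to be on the same 0-100 scale; objective_weight
    is the fraction of the overall mark given to the objective test.
    """
    if not 0.0 <= objective_weight <= 1.0:
        raise ValueError("objective_weight must be between 0 and 1")
    return (1.0 - objective_weight) * clinical + objective_weight * objective


# Hypothetical student: marginal clinical performance, outstanding test score.
CLINICAL_SCORE = 62.0
OBJECTIVE_SCORE = 95.0

for weight in (0.10, 0.15):
    overall = composite_grade(CLINICAL_SCORE, OBJECTIVE_SCORE, weight)
    print(f"objective weight {weight:.0%}: overall mark = {overall:.2f}")
```

With a weight in this range, an outstanding objective score raises a marginal
clinical score by only a few points, so it cannot by itself cancel out marginal
clinical performance. Whether that is the desired behavior is a policy decision
for the clerkship, not something the arithmetic settles.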
Challenges in the Ambulatory Setting
In addition to developing effective tools for students in traditional clerkship
settings, we need to look to the future and plan for changes. Pressure to move
medical student education into the ambulatory setting comes from fiscal,
managed care, pedagogical and training quarters. While the basic principles
for student assessment remain unchanged, their application will be more
complex. Inevitably, a larger faculty teaches in the ambulatory setting,
students rotate through sites, and the likelihood of discontinuity increases.
Regardless of whether a student participates in community or medical center
sites, specialty or subspecialty ambulatory learning, all such teaching
environments run the risk of being more fragmented than a ward-based team of
faculty and house officers. Some can make the argument that the demands of
ambulatory settings actually increase the need for the kinds of testing
instruments discussed in the following sections. When in place, such
instruments provide some consistency in evaluation and will tend to drive a
curriculum in a more coherent fashion. It is not the purpose of this monograph
to discuss ambulatory teaching or its assessment. Nonetheless, we need to
redouble our efforts to design better evaluation tools as students disperse
into ambulatory sites with an enlarged faculty who devote a smaller fraction of
their working day to teaching.
Summary
Tests and evaluations are basic requirements of educational programs. Measurement
efforts must possess sufficient validity, reliability and practicality, at
acceptable cost, to provide effective methods of
influencing learning. Present day evaluation methods in many clinical
educational programs pay too little attention to clinical skills such as
problem solving skills, interpersonal skills, technical skills, work habits and
attitudes to be of optimum effectiveness as educational methods.