Norm Berman, Geisel School of Medicine, Hanover, NH; Sherilyn Smith, U Washington, Seattle, WA; Jennifer Kogan, U Penn, Philadelphia, PA; Michael Dell, Case Western Reserve University, Cleveland, OH; Michael Stein, Uniformed Services University, Bethesda, MD; Steven Durning, Uniformed Services University, Bethesda, MD
Clinical reasoning is a fundamental skill of physicians, but providing reliable and valid formative or summative assessment of it is difficult. A well-written summary statement is a manifestation of clinical reasoning, and a validated 5-component rubric for evaluating summary statements exists; scoring summary statements against this rubric may therefore be a reliable and valid method for assessing clinical reasoning. Reliable automated evaluation of summary statements within virtual patient cases using machine-learning techniques could be a valuable formative assessment and research tool.
To determine whether machine-learning techniques can reliably evaluate summary statements within MedU virtual patient cases.
Samples of student summary statement responses from 3 virtual patient cases were obtained from a database. Five hundred responses were randomly selected from each case, excluding responses with fewer than 10 words. Two physicians were trained in applying the 5-component rubric and achieved consensus on coding approach using training examples; they then coded the selected summary statements using the validated rubric. Machine-learning software (LightSIDE) developed case-level algorithms that attempt to find features reproducing the surface-level behavior of human graders. Inter-rater reliability across human and automated scoring was calculated using quadratic weighted kappa. Qualitative analysis of results was also performed.
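The quadratic weighted kappa used here penalizes rater disagreements in proportion to the squared distance between the two ordinal scores. A minimal pure-Python sketch of the metric is below; the function name, rating-range arguments, and implementation details are illustrative, not the study's actual code.

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two raters on an ordinal scale.

    rater_a, rater_b: equal-length lists of integer ratings.
    min_rating, max_rating: inclusive bounds of the rating scale.
    """
    n = max_rating - min_rating + 1
    num_items = len(rater_a)

    # Observed agreement matrix O[i][j]: count of items rated i by A, j by B.
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal rating histograms, used to build the chance-expected matrix.
    hist_a = Counter(a - min_rating for a in rater_a)
    hist_b = Counter(b - min_rating for b in rater_b)

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at the corners.
            weight = ((i - j) ** 2) / ((n - 1) ** 2)
            expected = hist_a[i] * hist_b[j] / num_items
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # 1 means perfect agreement; 0 means agreement no better than chance.
    return 1.0 - numerator / denominator
```

Perfect agreement yields 1.0, and scores below roughly 0.6 are conventionally treated as too unreliable for high-stakes use, which is the threshold applied in the results below.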
In 7 of 15 examples (5 rubric components across 3 cases), machine-learning performance was above the 0.6 kappa threshold recommended as a baseline for assessment for credit. Performance varied heavily by case and by rubric component: least accurate was the Accuracy component (kappa 0.29-0.47); most accurate was the Global Rating component (kappa 0.58-0.76), the highest performance across the three cases. Automated analysis tended toward lower scores than human scorers. Qualitative analysis revealed a machine-learning tendency to reward overly thorough writing.
This pilot study demonstrates that automated analysis of student summary statements using machine-learning techniques is feasible. The software appears reliable enough for formative assessment and, in aggregate across cases, is likely reliable enough for summative assessment.