December 2025

Hello COMSEP!

Attached and below is our December edition of the Journal Club, including a validation of the value of coaching, a look at tolerance for uncertainty in different medical specialties, and yet another look at AI tools in medical education—this time answering questions about a common pediatric disease.

Stay warm and warmhearted this winter season.

Enjoy,

Amit, Jon and Karen

Coaching works!

Parris RS, Dong Z, Clark A, et al. Effect of Coaching on Trainee Burnout, Professional Fulfillment, and Errors: A Randomized Controlled Trial. Acad Med. 2025 Aug 1;100(8):940-949. Epub 2025 Feb 14. doi:10.1097/ACM.0000000000005999.

Reviewed by Meghan Handley and Jaclyn B. Wiggins

What was the study question?

What is the effect of coaching on errors and burnout among graduate medical education (GME) trainees?

How was it done?

In a randomized controlled trial at a large, urban academic medical center, trainees (fellows and residents, n=184) and faculty (n=150) were randomized using a random number generator either to receive/perform coaching or to receive/give standard mentorship alone. Several clinical specialties were represented, and faculty in the coaching arm were formally trained in a novel coaching curriculum through a mandatory interactive workshop. Coaching dyads met up to 4 times for 60 minutes; each meeting had 3 parts: a check-in, a positive psychology-based exercise, and goal setting. Primary outcomes (burnout and medical errors) and secondary outcomes (self-valuation and growth mindset) were measured quantitatively using validated tools. The effects of coaching on trainees and faculty were also explored qualitatively through focus groups.

What were the results?

“Coachees” showed a reduction in burnout and improved professional fulfillment when compared with “mentees.” There were no differences in resilience or self-valuation between the two groups, nor in burnout, fulfillment, resilience, or self-valuation between coaches and control faculty at baseline or after the curriculum intervention. Coachees had higher odds of reporting no medical errors, though this difference was not statistically significant; they were also less likely than mentees to report being “unsure” of their involvement in an error.

How can this be applied to my work in education?

This study concluded that coaching reduced trainee burnout and improved professional fulfillment, demonstrating that an interdepartmental coaching program is an effective tool for both GME trainees and faculty. This is the first study of its kind to show a persistent burnout benefit after an intervention. Further research is warranted on the effect of a coaching curriculum on medical errors.

Reviewer’s Comments: A strength of this study is the use of validated tools to measure the quantitative outcomes, demonstrating measurable differences in burnout and professional fulfillment. What I appreciated most, however, was the interdepartmental nature of the coach/coachee pairings. Some of the exemplar quotes spoke to the psychological safety of working with someone outside one’s direct clinical area, along with other benefits, such as allowing coaches to problem solve, self-reflect, and be exposed to new perspectives. (KFO)


Career decision-making mythbusters: tolerance for uncertainty and medical students' specialty choices

Wegwarth O, Pfoch M, Spies C, et al. Tolerance for uncertainty and medical students' specialty choices: A myth revisited. Med Educ. 2025;59(8):833-841. https://dx.doi.org/10.1111/medu.15610

Reviewed by Dan Herchline

What was the study question?

Is there a link between specialty choice and uncertainty tolerance?

How was the study done?

The authors conducted a cross-sectional survey of German medical students in their final year, with follow-up one year later. Surveys gathered information on each student’s choice of specialty as well as their level of uncertainty tolerance, measured with three different validated tools: the modified tolerance for ambiguity scale, the physicians’ reaction to uncertainty scale, and the uncertainty intolerance scenario method.

What were the results?

The study included 263 students spanning 34 different medical schools. The authors note that the distribution of specialty choice within the study sample was descriptively similar to that of the general population of medical students. Uncertainty tolerance was not correlated with specialty choice either before or after students had chosen a medical specialty, and this finding was consistent across all 3 uncertainty tolerance tools. Interestingly, the authors noted low correlation among the 3 tools, even though all three purport to measure similar constructs.

How can I apply this to my work in education?

The notion that specialty choice and uncertainty tolerance are related has been touted for decades. This study casts doubt on that popular myth and also calls into question the utility of the selected tools for measuring uncertainty tolerance, given their low correlation with one another. The authors note that myths of this kind can serve multiple social functions, such as maintaining power structures, simplifying complex decision-making during interview processes, and perpetuating stereotypes about different specialties. Uncertainty is universally present in medicine across specialties, and uncertainty tolerance is a complex phenomenon that is difficult to characterize using existing tools.

Editor’s Note: So, not recommended as part of your pediatric interview screening. Interestingly, low uncertainty tolerance may be associated with poorer mental health, so the construct may be useful for other reasons. (JG)


Seek and ye shall find…inaccuracies

Aykac, K., Cubuk, O., Demir, O.O. et al. Comparing ChatGPT-3.5, Gemini 2.0, and DeepSeek V3 for pediatric pneumonia learning in medical students. Sci Rep 15, 40342 (2025). https://dx.doi.org/10.1038/s41598-025-27722-2

What was the study question?

Which large language model (LLM) provides the most comprehensive and reliable overview of pediatric community-acquired pneumonia (CAP) for use by students and educators?

How was the study done?

Study authors compared ChatGPT-3.5, Google Gemini 2.0, and DeepSeek V3. Each model was provided the full text of the “Community-Acquired Pneumonia” chapter from Nelson Textbook of Pediatrics. The LLMs were tasked with answering a standardized set of 27 open-ended questions, created by pediatric infectious disease (ID) specialists, on topics related to pediatric CAP. The questions were divided across five clinical domains: Diagnosis and Clinical Features; Etiology and Age-Specific Pathogens; Diagnostics and Imaging; Complications; and Management, Treatment, and Prevention. Two pediatric ID specialists evaluated the LLMs’ responses for accuracy (1-6 points), completeness (1-3 points), and clinical safety (0-1 point) using the Language Intelligence Certifier tool. The Friedman test was used for comparisons among the three models, with the Wilcoxon signed-rank test for pairwise comparisons (see the sketch below).
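
For readers curious about the statistics, here is a minimal sketch of how such an analysis could be run, written in Python with SciPy. The per-question score totals below are hypothetical placeholders for illustration, not the study’s data.

# Hedged sketch: Friedman omnibus test across three LLMs, followed by
# Wilcoxon signed-rank tests for pairwise comparisons. The scores are
# hypothetical per-question totals (range 2-10), NOT the study's data.
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

scores = {
    "DeepSeek V3": [10, 9, 10, 10, 9],  # one total score per question
    "ChatGPT-3.5": [8, 7, 8, 7, 8],
    "Gemini 2.0": [7, 8, 7, 8, 7],
}

# Omnibus test: do the three models' score distributions differ?
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Pairwise follow-up comparisons between each pair of models.
for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    w, p_pair = wilcoxon(a, b)
    print(f"{name_a} vs {name_b}: W = {w:.1f}, p = {p_pair:.4f}")

A nonparametric approach like this avoids assuming normally distributed scores, which matters with ordinal rubric ratings and only 27 paired observations per model.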

What were the results?

Across all questions, DeepSeek V3 had the highest mean score (9.9 out of 10), compared with ChatGPT-3.5 (7.7) and Google Gemini 2.0 (7.5). DeepSeek V3 also had the highest mean scores in every domain and statistically significantly higher scores for accuracy and completeness. Safety scores did not differ significantly between models; however, ChatGPT-3.5 was the only model to produce a response deemed clinically unsafe, owing to clinically inaccurate information.

How can this be applied to my work in education?

For educators, these results underscore the necessity of selecting high-performing LLMs when using artificial intelligence to develop learning materials. The study has several important limitations, particularly its narrow focus on a single clinical topic, but it nonetheless provides valuable insight into the role of LLMs in pediatric education for both educators and learners.

Editor’s Note: The study reinforces that while generative AI may be a helpful tool, its output should be validated, as incorrect and potentially unsafe clinical recommendations remain a real risk. (AP)