By Jennie McKee
“Clinical histories accompanying imaging orders are crucial for accurate radiological interpretation,” Koirala said. “For example, a simple notation of ‘check lungs’ provides far less context than ‘47-year-old smoker with recent weight loss and persistent cough for three weeks, concern for neoplasm.’”
While research clearly shows that complete, relevant clinical histories improve radiologists' accuracy in detecting and characterizing abnormalities, the histories accompanying imaging orders commonly remain incomplete, Koirala noted.
“Previous improvement efforts have relied on time-consuming manual assessments, with radiologists individually reviewing each history for completeness,” he said. “This manual process is tedious and not sustainable in busy clinical settings.”
Koirala and colleagues aimed to automate this “completeness assessment” using AI, making it possible to evaluate hundreds or thousands of clinical histories efficiently. The goal was to identify patterns of incomplete information and enable targeted improvements in how clinical histories are documented.
In the study, investigators compared the performance of open-source and proprietary LLMs using prompt engineering with in-context learning. They then used a training set and a validation set to further fine-tune the best-performing open-source LLM from that group, Mistral-7B.
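As a rough illustration of what prompt engineering with in-context learning involves, the Python sketch below assembles a few-shot prompt that asks a model to judge whether a clinical history is complete. It is a hypothetical example, not the study's actual pipeline; the example histories, labels and output format are assumptions added here.

# Minimal sketch of in-context learning for completeness assessment.
# The few-shot examples and label scheme are illustrative assumptions,
# not the prompts used in the study.
FEW_SHOT_EXAMPLES = [
    ("check lungs",
     "INCOMPLETE: no patient history, timing or clinical concern given."),
    ("47-year-old smoker with recent weight loss and persistent cough for "
     "three weeks, concern for neoplasm",
     "COMPLETE: history, symptom, duration and clinical concern all present."),
]

def build_prompt(clinical_history: str) -> str:
    """Assemble a few-shot prompt asking the model to judge completeness."""
    lines = [
        "You are reviewing clinical histories that accompany imaging orders.",
        "Label each history COMPLETE or INCOMPLETE and explain briefly.",
        "",
    ]
    for history, assessment in FEW_SHOT_EXAMPLES:
        lines += [f"History: {history}", f"Assessment: {assessment}", ""]
    lines += [f"History: {clinical_history}", "Assessment:"]
    return "\n".join(lines)

if __name__ == "__main__":
    # The assembled prompt could then be sent to an open-source model such as
    # Mistral-7B (for example, via the Hugging Face transformers
    # text-generation pipeline) or to a proprietary model's API for comparison.
    print(build_prompt("chest pain, rule out PE"))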
To assess model agreement, the researchers used Cohen’s kappa, a statistic that measures how much two raters agree on a classification task while accounting for agreement expected by chance, and BERTScore, a metric that measures how similar two pieces of text are based on the meaning of the words in context. Using Mistral-7B, they extracted five elements (medical history, what, when, where and clinical concern) and analyzed the quality of 48,492 clinical histories from the emergency department of a large academic medical center to establish a quality benchmark.
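For readers unfamiliar with these two metrics, the sketch below shows how they are typically computed in Python, using scikit-learn for Cohen’s kappa and the bert-score package for BERTScore. The labels and text snippets are invented for illustration and are not the study's data.

# Illustrative agreement metrics; the labels and snippets below are made up.
from sklearn.metrics import cohen_kappa_score
from bert_score import score

# Completeness labels assigned to the same histories by two raters
# (for example, two different models).
rater_a = ["complete", "incomplete", "incomplete", "complete", "complete"]
rater_b = ["complete", "incomplete", "complete", "complete", "complete"]
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # chance-corrected agreement

# BERTScore compares extracted text with a reference by meaning, so
# paraphrases such as "3 wk cough" and "cough for three weeks" still match.
candidates = ["cough for three weeks", "left lower quadrant pain"]
references = ["persistent cough, 3 weeks", "LLQ abdominal pain"]
precision, recall, f1 = score(candidates, references, lang="en")
print(f"Mean BERTScore F1: {f1.mean().item():.2f}")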
The team found a weighted mean inclusion rate of 73.8% across the relevant elements, and Mistral-7B and GPT-4 showed substantial agreement with each other.
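To make the weighting concrete, the short sketch below shows how a weighted mean inclusion rate can be computed from per-element inclusion rates. The rates and importance weights are hypothetical, not the study's values.

# Hypothetical per-element inclusion rates and importance weights,
# used only to illustrate how a weighted mean inclusion rate is formed.
inclusion_rates = {
    "medical history": 0.80,
    "what": 0.95,
    "when": 0.70,
    "where": 0.65,
    "clinical concern": 0.60,
}
weights = {
    "medical history": 2.0,
    "what": 3.0,
    "when": 1.0,
    "where": 1.0,
    "clinical concern": 3.0,
}
weighted_mean = (
    sum(inclusion_rates[e] * weights[e] for e in inclusion_rates)
    / sum(weights.values())
)
print(f"Weighted mean inclusion rate: {weighted_mean:.1%}")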
“The most unexpected result was that smaller, open-source AI models performed nearly as well as frontier AI models (ChatGPT) in evaluating clinical histories,” Koirala said. “This is significant because open-source models are freely available, and require fewer computing resources, while still maintaining high accuracy.”
Koirala noted that open-source models are also unaffected by unpublicized model updates and can be deployed fully locally, avoiding some clinical data privacy concerns, whereas proprietary models require external servers.
“As for the clinical histories themselves, our analysis of imaging orders from our emergency department showed that only 26% of orders contained all five elements. However, when we ran a weighted average based on fields that were more important, we found a completion rate of 74%,” Koirala said.
“We think this finding is probably better than what most radiologists are getting, and is probably due to a dedicated quality improvement project we completed with our emergency department colleagues several years ago,” he added. “We would love to compare these results with other sites. We are planning to make the tool freely available so that others can both measure their performance and use it in their quality improvement efforts.”
Access the presentation, “Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Open- and Closed-Source Large Language Models,” (M3-SSIN02-5) on demand at RSNA.org/MeetingCentral
© 2024 RSNA.