AI Can Help Determine the Completeness of Clinical Histories 

Tuesday, December 3, 2024

By Jennie McKee

Smaller, open-source large language models (LLMs) are a helpful tool for assessing the completeness of clinical histories that accompany imaging orders, according to the results of a study presented by Arogya Koirala, a machine learning engineer at the Stanford AI Development and Evaluation (AIDE) Lab in California.


“Clinical histories accompanying imaging orders are crucial for accurate radiological interpretation,” Koirala said. “For example, a simple notation of ‘check lungs’ provides far less context than ‘47-year-old smoker with recent weight loss and persistent cough for three weeks, concern for neoplasm.’”

While research clearly shows that complete, relevant clinical histories improve radiologists' accuracy in detecting and characterizing abnormalities, the histories that accompany imaging orders often remain incomplete, Koirala noted.

“Previous improvement efforts have relied on time-consuming manual assessments, with radiologists individually reviewing each history for completeness,” he said. “This manual process is tedious and not sustainable in busy clinical settings.” 

Koirala and colleagues aimed to automate this "completeness assessment" using AI, making it possible to evaluate hundreds or thousands of clinical histories far more quickly. The goal was to identify patterns of incomplete information and enable targeted improvements in how clinical histories are documented.

 
Summary of how each model was adapted and fine-tuned, and how model performance was evaluated. Courtesy of Arogya Koirala.

Conducting the Assessment

In the study, investigators compared the performance of open-source and proprietary LLMs using prompt engineering with in-context learning. They then used a training set and a validation set to further fine-tune the best-performing open-source LLM from that group, Mistral-7B.
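The study's actual prompts are not reproduced in the article. As a rough illustration only, the Python sketch below shows how an in-context (few-shot) prompt for completeness assessment might be assembled; the element names, example histories, and labels are hypothetical placeholders, not material from the study.

# Minimal sketch of few-shot (in-context) prompting for completeness assessment.
# The examples and labels below are invented illustrations, not study prompts.

ELEMENTS = ["medical history", "what", "when", "where", "clinical concern"]

FEW_SHOT_EXAMPLES = [
    {
        "history": "47-year-old smoker with recent weight loss and persistent "
                   "cough for three weeks, concern for neoplasm.",
        "labels": {"medical history": True, "what": True, "when": True,
                   "where": True, "clinical concern": True},
    },
    {
        "history": "check lungs",
        "labels": {"medical history": False, "what": False, "when": False,
                   "where": True, "clinical concern": False},
    },
]

def build_prompt(new_history: str) -> str:
    """Assemble a few-shot prompt asking which elements are present."""
    lines = [
        "You are reviewing clinical histories that accompany imaging orders.",
        "For each history, state whether each element is present: " + ", ".join(ELEMENTS) + ".",
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append("History: " + ex["history"])
        for element in ELEMENTS:
            lines.append(element + ": " + ("present" if ex["labels"][element] else "absent"))
        lines.append("")
    lines.append("History: " + new_history)
    return "\n".join(lines)

if __name__ == "__main__":
    # The assembled prompt would then be sent to an LLM (e.g., Mistral-7B) for completion.
    print(build_prompt("Fall from ladder yesterday, left wrist pain and swelling."))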

To assess model agreement, the researchers used Cohen’s kappa, a statistic that measures how much two raters agree on a classification task while accounting for agreement expected by chance, and BERTScore, a metric that measures how similar two pieces of text are based on the meaning of the words in context. Using Mistral-7B, they extracted five elements (medical history, what, when, where and clinical concern) and analyzed the quality of 48,492 clinical histories from the emergency department of one large academic medical center to establish a quality benchmark.
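For readers unfamiliar with these metrics, the snippet below is a minimal sketch showing how they could be computed with common open-source implementations (scikit-learn's cohen_kappa_score and the bert-score package). The labels and text pairs are invented placeholders, not data from the study.

# Hedged sketch: computing the two agreement metrics named above.
# The data here are invented placeholders.
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn
from bert_score import score                   # pip install bert-score

# Hypothetical per-history labels: does the history include a "clinical concern"?
radiologist_labels = [1, 1, 0, 1, 0, 0, 1, 1]
model_labels       = [1, 1, 0, 1, 1, 0, 1, 0]

# Cohen's kappa: agreement between two raters, corrected for chance agreement.
kappa = cohen_kappa_score(radiologist_labels, model_labels)
print(f"Cohen's kappa: {kappa:.2f}")

# BERTScore: semantic similarity between a model-extracted phrase and a reference.
candidates = ["persistent cough for three weeks"]
references = ["cough persisting for the past three weeks"]
precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.2f}")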

The team found a weighted mean inclusion rate of 73.8% across the relevant elements, and Mistral-7B and GPT-4 showed substantial agreement with each other.

“The most unexpected result was that smaller, open-source AI models performed nearly as well as frontier AI models (ChatGPT) in evaluating clinical histories,” Koirala said. “This is significant because open-source models are freely available and require fewer computing resources, while still maintaining high accuracy.”

Koirala noted that open-source models are also unaffected by unpublicized model updates and can be deployed entirely locally, avoiding some clinical data privacy concerns, whereas proprietary models require sending data to external servers.

“As for the clinical histories themselves, our analysis of imaging orders from our emergency department showed that only 26% of orders contained all five elements. However, when we ran a weighted average based on fields that were more important, we found a completion rate of 74%,” Koirala said.
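The element weights used in the study are not reported in the article. The short sketch below only illustrates, with invented numbers, how a weighted inclusion rate can come out much higher than the share of orders containing every element.

# Hypothetical illustration of a weighted inclusion rate vs. an "all elements
# present" rate. The rates and weights below are invented, not the study's values.

# Fraction of orders that included each element (made-up numbers).
inclusion_rates = {
    "medical history": 0.60,
    "what": 0.90,
    "when": 0.70,
    "where": 0.95,
    "clinical concern": 0.55,
}

# Importance weights assigned to each element (made-up numbers).
weights = {
    "medical history": 1.0,
    "what": 2.0,
    "when": 1.0,
    "where": 2.0,
    "clinical concern": 1.5,
}

# Elements deemed more important contribute more to the overall rate,
# so this figure can exceed the fraction of orders with all five elements.
weighted_rate = (
    sum(inclusion_rates[e] * weights[e] for e in inclusion_rates)
    / sum(weights.values())
)
print(f"Weighted inclusion rate: {weighted_rate:.1%}")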

“We think this finding is probably better than what most radiologists are getting, and is probably due to a dedicated quality improvement project we completed with our emergency department colleagues several years ago,” he added. “We would love to compare these results with other sites. We are planning to make the tool freely available so that others can both measure their performance and use it in their quality improvement efforts.”

Access the presentation, “Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Open- and Closed-Source Large Language Models,” (M3-SSIN02-5) on demand at RSNA.org/MeetingCentral