Daily Bulletin

Engineering Prompts, Extracting Diagnoses

Tuesday, December 2, 2025

By Nick Klenske


Mana Moassefi, MD

Unstructured reports are an ongoing challenge in radiology—a challenge that can limit the ability to extract standardized information for research, quality improvement and AI development. 

“The rapid evolution of large language models (LLMs) offers promising opportunities for radiology report annotation, particularly when it comes to identifying specific diagnostic findings,” said Mana Moassefi, MD, an incoming radiology resident at Mayo Clinic in Rochester, MN. Dr. Moassefi co-authored a recent study evaluating whether LLMs could identify the presence or absence of specific diagnoses or findings in radiology reports across multiple institutions.

Focusing on LLMs with strong natural language understanding and adaptability, the researchers examined whether these models, when optimized through prompt engineering, could overcome inter-institutional variability and thus serve as scalable tools for radiology report labeling and cohort generation.

“The idea was that if we can get reliable labels from the massive amount of existing radiology data, then medical data will no longer be seen as being too rare and too expensive,” said Dr. Moassefi, who made her remarks during a Monday session. “If we achieve that, then we can start building powerful and effective AI models using the data we already have.”  

A Uniquely Cross-Institutional Evaluation

The study is unique in that it consisted of a cross-institutional evaluation spanning six major academic centers, with each center collecting 500 radiology reports across five diagnostic categories (liver metastases, subarachnoid hemorrhage, pneumonia, cervical spine fracture and glioma progression).

“We purposely kept the dataset’s labels diverse to capture the unique characteristics of each label and to see how those differences might affect the results,” Dr. Moassefi explained.

A script written in a high-level programming language, paired with a human-optimized prompt, was developed and distributed to each site. The script instructed a locally hosted model to answer either ‘yes’ or ‘no’ regarding the presence of the target finding.
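The article describes this workflow only at a high level, so the following is a minimal sketch of how such a yes/no labeling script might look, assuming a locally hosted model served through an OpenAI-compatible chat endpoint. The endpoint URL, model name and prompt wording are illustrative assumptions, not the study’s actual script or prompt.

```python
# Minimal sketch of a yes/no report-labeling script.
# Assumes a locally hosted LLM behind an OpenAI-compatible chat endpoint
# (e.g., a vLLM or llama.cpp server). The URL, model name and prompt text
# are illustrative placeholders, not the study's actual materials.
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed local server
MODEL = "local-llm"  # placeholder model identifier

PROMPT_TEMPLATE = (
    "You are labeling radiology reports. Answer with exactly one word, "
    "'yes' or 'no': does the following report indicate the presence of "
    "{finding}?\n\nReport:\n{report}"
)

def label_report(report_text: str, finding: str) -> str:
    """Ask the locally hosted model whether the target finding is present."""
    payload = {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": PROMPT_TEMPLATE.format(finding=finding, report=report_text),
            }
        ],
        "temperature": 0,  # deterministic output for consistent labels
    }
    response = requests.post(ENDPOINT, json=payload, timeout=60)
    response.raise_for_status()
    answer = response.json()["choices"][0]["message"]["content"].strip().lower()
    return "yes" if answer.startswith("yes") else "no"

# Example: label one report for one of the five diagnostic categories
print(label_report("Multiple hepatic lesions consistent with metastatic disease.",
                   "liver metastases"))
```

In this sketch, setting the temperature to zero and constraining the model to a one-word answer are design choices intended to make labels reproducible across runs and sites; the study’s actual prompt-engineering details are not specified in the article.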

LLMs a Reliable Tool for Labeling Radiology Reports

The standardized human-optimized prompt proved highly adaptable across diverse institutional practices, illustrating the power of well-designed prompt engineering. At one site, where eight LLMs were systematically compared, the best-performing model reached approximately 95% accuracy.

The study further found that model performance correlated with report structure quality, with more structured reports yielding near-perfect accuracy. Diagnostic categories such as pneumonia, however, proved more challenging due to interpretive ambiguity in free-text reports.

“These findings demonstrate that LLMs can serve as reliable tools for labeling radiology reports, helping us scale data annotation, generate AI datasets and create retrospective research cohorts—all tasks that traditionally require extensive manual review,” Dr. Moassefi said.

More Accurate Diagnoses, Better Patient Outcomes

By showing cross-institutional reproducibility with only prompt-based customization, the study moves radiology closer to automated, standardized information extraction—an essential step toward achieving AI-ready data pipelines and structured reporting adoption.

“Larger, more diverse datasets make models more generalizable and reduce uncertainty, which in turn leads to more accurate diagnoses and better patient outcomes,” Dr. Moassefi concluded.

Access the presentation, “Engineering Prompts, Extracting Diagnoses: A Multi-Institutional Assessment of LLMs in Radiology” (M3-SSIN02-1), on demand at RSNA.org/MeetingCentral.