By Katherine Anderson
As multimodal generative AI models grow increasingly capable of producing draft radiology reports directly from imaging data, the need for strategies to assess the accuracy and reliability of those reports is becoming urgent.
Multimodal generative AI models integrate multiple data types, such as text and images, to produce outputs like draft radiology reports.
However, these reports often contain subtle, and sometimes not so subtle, inaccuracies, noted Scott J. Adams, MD, PhD, a radiologist in the Department of Medical Imaging at Royal University Hospital in Saskatchewan, Canada, who presented an education exhibit on this topic on behalf of co-lead author Samantha Leech, PhD, an MD candidate at the University of Saskatchewan’s College of Medicine.
Dr. Adams and his colleagues explored evaluation methods for AI-generated radiology reports, offering an overview of the current generative AI landscape—highlighting both the promise and the challenges of this rapidly evolving field.
“Traditional metrics for evaluating AI-generated reports often prioritize textual similarity or factual correctness against a limited set of findings,” Dr. Adams said. “Real progress requires radiology-specific metrics that can accurately capture clinically significant nuances in complex radiology reports.”
Without clinically grounded evaluation frameworks, it’s difficult to determine whether improvements in AI models actually translate into better diagnostic performance. The exhibit outlined six key categories of evaluation metrics used to assess AI-generated radiology reports:
• Textual similarity metrics
• Clinical concept and relation metrics
• Composite and correlation-based metrics
• Clinically grounded error and outcome metrics
• Generative and expert-aligned evaluation metrics
• Workflow and real-world performance metrics
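To make the first of these categories concrete, the sketch below, which is not drawn from the exhibit, computes a simple unigram-overlap F1, a ROUGE-1-style textual similarity score, between an invented generated report and an invented reference report. It illustrates the limitation Dr. Adams described: a report that flips a single clinically critical word can still score highly on textual similarity.

```python
# Minimal sketch of a unigram-overlap (ROUGE-1-style) textual similarity score.
# The two example reports below are invented for illustration; they are not from the exhibit.
from collections import Counter

def unigram_f1(generated: str, reference: str) -> float:
    """F1 over shared unigram counts between a generated and a reference report."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((gen_counts & ref_counts).values())  # word occurrences shared by both reports
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "Lines and tubes are in standard position. No pneumothorax. Lungs are clear."
generated = "Lines and tubes are in standard position. Small pneumothorax. Lungs are clear."

# High textual similarity despite a clinically significant disagreement
print(f"Unigram F1: {unigram_f1(generated, reference):.2f}")
```

In this invented example the two reports differ by only one word, so the similarity score is roughly 0.92 even though the generated report asserts a pneumothorax the reference rules out, which is exactly the kind of clinically significant nuance the exhibit argues purely textual metrics can miss.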
The team advocates for a more rigorous approach that includes domain-specific metrics, structured error analysis and expert radiologist review.
“Radiologists are essential to AI validation, providing expert oversight to assess errors, adjudicate inconsistencies, and evaluate clinical impact,” Dr. Adams said. “Their involvement ensures that AI systems perform safely and effectively in patient care.”
Radiologist-in-the-loop validation not only improves the accuracy of AI-generated reports but also ensures that these tools are safe and trustworthy for clinical use. By keeping radiologists engaged in the evaluation process, institutions can safeguard patient care while exploring the efficiency gains that generative AI may offer.
The exhibit also emphasized the importance of datasets and benchmarking resources, including public datasets, evaluation leaderboards and real-world monitoring tools that support continuous performance tracking.
“By recognizing the gaps in current assessment methods, radiologists can more confidently integrate AI tools into their workflows—enhancing productivity without compromising diagnostic quality,” Dr. Leech added.
As generative AI continues to evolve, the exhibit underscores the importance of developing evaluation strategies that reflect the realities of clinical practice. Radiologists, with their domain expertise and diagnostic insight, are essential to shaping the future of AI in radiology.
Access the education exhibit, “Evaluation Methods and Metrics for Automated Radiology Report Generation,” (INEE-24) on demand at RSNA.org/MeetingCentral.