DL Models Show Bias in Knee Osteoarthritis Diagnosis

Deep learning (DL) models are able to diagnose knee osteoarthritis with high accuracy, but can also exhibit biases based on sex and, to a lesser extent, race, according to a digital poster.


While AI has shown the potential to transform medical imaging, bias can be built into the models. Previous research has shown that DL models for chest X-ray diagnosis demonstrate biases against historically disadvantaged groups across sex and race, raising concerns about the equitable use of these tools.

It is unclear, however, whether similar biases exist in DL models applied to other body parts, such as the knee.

To find out more, researchers led by Bardia Khosravi, MD, MPH, from the Mayo Clinic in Rochester, MN, used the publicly available Osteoarthritis Initiative (OAI) dataset of knee radiographs to develop and test a DL model. They first trained a model to localize the right and left knees and then used it to test for knee osteoarthritis severity based on the Kellgren-Lawrence Grading (KLG) system, a common method that grades osteoarthritis severity on a scale of 0 to 4.

Overall, the DL osteoarthritis severity grading model performed at a state-of-the-art level. However, subgroup analysis showed biases favoring males in four of five KLG categories, echoing previous findings in DL models for chest X-ray diagnosis.

"Across all groups, we see that there was not much difference in the average performance, but when we dug into the subgroups, we found some huge differences," Dr. Khosravi said. "For example, the model showed significantly better performance in a subgroup of KLG 1 males. These models are preferring one group over the other, but this is not consistent."
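The kind of subgroup breakdown described above can be illustrated with a minimal sketch. It assumes plain accuracy as the performance measure and a hypothetical record layout of (true KLG grade, predicted KLG grade, sex); the article does not specify the metric or code the researchers used, and the records below are made-up toy values, not study data.

```python
from collections import defaultdict

def subgroup_accuracy(records):
    """Return accuracy per (true KLG grade, sex) subgroup.

    Each record is a hypothetical (true_grade, predicted_grade, sex) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_grade, predicted_grade, sex in records:
        key = (true_grade, sex)
        total[key] += 1
        if predicted_grade == true_grade:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}

def sex_gap_by_grade(accuracy):
    """Return male-minus-female accuracy for each grade that has both groups."""
    gaps = {}
    for grade, _sex in accuracy:
        if (grade, "M") in accuracy and (grade, "F") in accuracy:
            gaps[grade] = accuracy[(grade, "M")] - accuracy[(grade, "F")]
    return gaps

# Made-up toy records for illustration only (not data from the study).
records = [
    (1, 1, "M"), (1, 1, "M"), (1, 0, "F"), (1, 1, "F"),
    (2, 2, "M"), (2, 2, "F"),
]
accuracy = subgroup_accuracy(records)
gaps = sex_gap_by_grade(accuracy)
```

In this toy example the pooled accuracy looks high, while the per-grade gap shows a difference only for grade 1, which mirrors how an average can hide subgroup differences.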

Racial Bias Less Evident Than Sex Bias

The performance gap between racial groups was much smaller, with no difference between white and non-white patients for KLG 0 and 2, and only slight differences for KLG categories 1, 3, and 4. This finding suggests that demographic-based biases in DL models may vary between specific diagnostic use cases, Dr. Khosravi said.

The model's better overall performance for males than for females was not directly related to subgroup size; in fact, the datasets included considerably more females than males.

"The distribution of features, not the representation, is the main problem," Dr. Khosravi said. "When we plotted the distribution of the model features for different race and sex groups, we saw that those groups that have the wider distribution of features will have lower performance."
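One way to put a number on how "wide" a group's feature distribution is can be sketched as follows. This is only an illustration: the article does not say how the researchers measured or plotted the spread, so the total-variance measure, the function names, and the two-dimensional toy feature vectors are all assumptions.

```python
from statistics import pvariance

def feature_spread(features):
    """Total variance of a group: sum of per-dimension population variances.

    `features` is a list of equal-length feature vectors (assumed layout).
    """
    dimensions = zip(*features)  # regroup values by feature dimension
    return sum(pvariance(dim) for dim in dimensions)

def spread_by_group(features_by_group):
    """Map each group name to the spread of its feature vectors."""
    return {
        group: feature_spread(vectors)
        for group, vectors in features_by_group.items()
    }

# Made-up toy feature vectors for illustration only (not data from the study).
toy_features = {
    "group_a": [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],
    "group_b": [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)],
}
spreads = spread_by_group(toy_features)
```

Comparing such a spread value with each group's performance would be one simple way to check whether groups with wider feature distributions tend to score lower.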

Dr. Khosravi is presenting a similar study on Tuesday that examines the ability of generative DL models to aid in detecting previously unrecognized anatomical differences between races in medical imaging datasets. He plans to assess the data from the two studies to learn more about the different features that the DL models are recognizing.

"Ultimately, understanding the mechanisms behind these demographic biases will lead to the development of more transparent and unbiased AI models in radiology," Dr. Khosravi said.

Access the presentation, "Knee Osteoarthritis Deep Learning Models Demonstrate Greater Biases Based on Sex Than Race," (T5B-SPIN) on demand at Meeting.RSNA.org.