Validation methods for artificial intelligence imaging models


As artificial intelligence (AI) and machine learning (ML) models enter clinical practice, clinicians understandably ask how they can trust the conclusions and recommendations of these systems. Complicating matters, there have been healthcare data breaches, some including data from radiological image databases. Are AI models or their underlying training data at risk?

Physician trust in the accuracy and efficacy of clinical models is paramount. Without it, these tools have little chance of gaining traction. Only 20% of physicians say AI has changed the way they practice medicine, according to a recent survey of 1,500 doctors across Europe, Latin America, and the U.S. by Medscape.[1] Indeed, a majority of the surveyed physicians said they are anxious or uncomfortable with AI, though 70% of respondents said they believe it could make their decisions more accurate.

Establishing tools for validating and monitoring the performance of AI algorithms in clinical practice, in order to facilitate regulatory approval, was cited as one of four priorities for AI research in medical imaging in a recent report. The report, written by a team of authors led by Dr. Bibb Allen Jr. of the American College of Radiology Data Science Institute and published online May 28 in the Journal of the American College of Radiology, is a follow-up to an initial medical imaging artificial intelligence roadmap published April 16 in Radiology.[2]

A recent perspective on computer system validation (CSV) in the life sciences noted that current practices, which rely on manual approaches, are typically slow, cumbersome, and unreliable.

According to a Deloitte white paper: “Fragmented automation employs multiple systems across the validation life cycle, raising the need to painfully stitch the validation assets together to form a coherent, traceable, and chronological history of validation. These disjointed and incomplete approaches to CSV can be inefficient, costly, and error-prone.”[3]

Steps to validation

Model validation involves two tests:

  1. Does the model learn properly from the training data?
  2. Is the final model generalizable? That is, can it work with similar, but unseen, data?

As Willem Meints points out in his article on ML model validation, simply testing an ML model for proper functioning at the code level is insufficient: “Unit-tests for AI solutions don’t test the actual performance of the model that you’re using. It can only test whether data flows through the model correctly and that the model doesn’t raise any exceptions. To measure if the model is predicting the correct output, you need a different test approach. In fact, it is not so much a correctness test, but rather a performance test.”[4]
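To make the distinction concrete, here is a minimal sketch, assuming a scikit-learn classifier on synthetic data (the model, the dataset, and the 0.85 accuracy threshold are illustrative assumptions, not taken from Meints’ article). The first test only checks that data flows through the model without exceptions; the second checks how well the model actually predicts.

```python
# Minimal sketch: a code-level unit test vs. a performance test.
# The model, data, and 0.85 threshold are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def test_model_runs_without_exceptions():
    """Unit test: data flows through the model and the output is well-formed."""
    preds = model.predict(X_test)
    assert preds.shape == y_test.shape

def test_model_meets_accuracy_threshold():
    """Performance test: the predictions are good enough, not just well-formed."""
    assert accuracy_score(y_test, model.predict(X_test)) >= 0.85  # illustrative bar
```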

Encouragingly, there are now numerous efforts to create shareable, open-source training datasets. The availability of these datasets should help smooth the process of training and validating ML models. (See the article, “The Impact of Open-Source Image Datasets.”)


Hsi-Ming Chang, a senior data scientist at GE Healthcare, explains the three elements of creating a machine learning model: training, validation, and testing. “You typically split the data set into these three parts,” he said, “and after training, you have a lot of output. So, if you train 10 different models, the next step is to use validation data against these models, and then select the best one.”

This process, however, can create a problem called overfitting, which happens when a model learns the detail and noise in the training data to the extent that it negatively affects the performance of the model on new data.

“The last stage is to run the model against the test dataset, which is unseen data,” Chang said. “The final number you get from your test dataset more or less reflects your model’s capability.” Testing against the training set, for instance, wouldn’t accurately reflect the model’s performance, because the model ought to do well on data it has already seen.
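Here is a minimal sketch of the workflow Chang describes, assuming scikit-learn and synthetic data (the candidate models, the 60/20/20 split, and the accuracy metric are illustrative choices, not his): train several candidates, select the best one on the validation set, and touch the test set only once to report the final number.

```python
# Sketch of a train/validation/test workflow with model selection.
# The candidate models and 60/20/20 split are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf_small": RandomForestClassifier(n_estimators=50, random_state=0),
    "rf_large": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Train every candidate, then select the best one on the validation set.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

best_name = max(val_scores, key=val_scores.get)
best_model = candidates[best_name]

# The test set is used exactly once, to report the selected model's capability.
test_score = accuracy_score(y_test, best_model.predict(X_test))
print(f"selected {best_name}; validation={val_scores[best_name]:.3f}, test={test_score:.3f}")
```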

One exciting new technique in validation, according to Chang, is using datasets from a variety of sources, such as different hospitals. Because of slight institutional variations, such as different ways of capturing and annotating images, training a model this way can produce a more robust ML algorithm.

Typically, the two annotated datasets are combined before training. Another approach, called domain adaptation, addresses institutional variation without merging the datasets. Chang explains: “If you have a trained model from one source (institution A), and you have a target domain (institution B), you take unannotated data from the target domain and let the model conduct unsupervised learning.” The result, he said, is a model that learns the properties of the data in the second set and adjusts the original model accordingly. The advantages of domain adaptation are twofold: it is faster and less expensive than combining the datasets, and it does not require costly annotated data from the second institution.
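Chang does not name a specific algorithm, but one simple, common way to realize this idea is self-training with pseudo-labels. The sketch below is a hypothetical illustration under that assumption, with synthetic stand-ins for institution A (annotated) and institution B (unannotated, slightly shifted); the 0.9 confidence threshold is likewise an illustrative choice.

```python
# Hypothetical sketch of domain adaptation via self-training (pseudo-labeling).
# Synthetic data stands in for institution A (annotated) and B (unannotated).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1600, n_features=20, random_state=0)
X_a, y_a = X[:800], y[:800]   # institution A: annotated source data
X_b = X[800:] + 0.3           # institution B: unannotated data with a slight shift

# 1. Train on the annotated source data.
model = LogisticRegression(max_iter=1000).fit(X_a, y_a)

# 2. Predict pseudo-labels for the unannotated target data; keep confident ones.
proba = model.predict_proba(X_b)
confident = proba.max(axis=1) >= 0.9
pseudo_labels = model.classes_[proba.argmax(axis=1)][confident]

# 3. Refit on source data plus confidently pseudo-labeled target data, so the
#    model adjusts to the second institution without any new annotation.
X_combined = np.vstack([X_a, X_b[confident]])
y_combined = np.concatenate([y_a, pseudo_labels])
adapted_model = LogisticRegression(max_iter=1000).fit(X_combined, y_combined)
```

Self-training is only one family of adaptation techniques; feature-alignment methods pursue the same goal of adjusting a source-trained model using unlabeled target data.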

Other validation challenges

In an article published in the New England Journal of Medicine, researchers at the Stanford University School of Medicine noted that bias can creep into health data through humans, through design choices, or through the ways healthcare systems use the data.[5]


Differences between vendor AI algorithms, as well as validation tools, are an issue too. “It has been shown that images from different scanners react differently with the same AI model,” said Safwan S. Halabi, MD, a clinical assistant professor at the Stanford University School of Medicine and medical director of radiology informatics at Stanford Children’s Health, in the article.

Meints concludes his article with three recommendations:

  1. Pick the right validation strategy

Ask yourself: Will I be using model selection in my project? If so, use the appropriate data processing to get the right datasets for training, validation, and testing.

  2. Pick the right performance metrics

Ask yourself: What kind of model am I testing? Are false positives a bad thing? Or do I want to be as precise as possible? (See the metrics sketch after this list.)

  3. Ask the people who use the model

User feedback is the most important performance metric of all.
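As an illustration of the second recommendation, the short sketch below (with hypothetical labels and predictions) shows how precision and recall weigh false positives and false negatives differently, which is exactly the trade-off Meints asks you to think about.

```python
# Illustrative sketch for "pick the right performance metrics": the same
# predictions can look very different depending on which metric you care about.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and model predictions (1 = finding present).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false positives={fp}, false negatives={fn}")
print(f"precision={precision_score(y_true, y_pred):.2f}")  # how often a flagged case is real
print(f"recall={recall_score(y_true, y_pred):.2f}")        # how many real cases are caught
```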

Observing that deep learning for medical imaging is still young, Chang said he’s optimistic. As applications permeate healthcare, and time goes by, “radiologists will see more applications of deep learning technology, and they will gain more and more trust about this new technology,” he concluded.


[1] Marcia Frellick, “AI Use in Healthcare Increasing Slowly Worldwide,” Medscape Medical News, May 6, 2019.

[2] “A Roadmap for Foundational Research on Artificial Intelligence in Medical Imaging: From the 2018 NIH/RSNA/ACR/The Academy Workshop,” Radiology, April 16, 2019, https://pubs.rsna.org/doi/10.1148/radiol.2019190613.

[3] Srikanth Narayana Mangalam, Satyanarayana Patloori, Duraisamy Palani, Colleen Healy, “Life sciences computer system validation: An end-to-end solution is needed,” Deloitte, 2019, https://www2.deloitte.com/us/en/pages/risk/articles/computer-system-validation-in-life-sciences.html?nc=1.

[4] Willem Meints, “Adventures in AI part 3: How do I know my ML model is any good?” Fizzy Logic, January 22, 2018, https://fizzylogic.nl/2018/01/22/adventures-in-ai-part-3-how-do-i-know-my-ml-model-is-any-good/.

[5] Danton Char, “Researchers say use of artificial intelligence in medicine raises ethical questions,” Stanford Medicine, March 14, 2019, https://med.stanford.edu/news/all-news/2018/03/researchers-say-use-of-ai-in-medicine-raises-ethical-questions.html.