The impact of open-source image datasets


To train AI and ML models, large datasets are required. But this data is often inaccessible in healthcare settings, due to a variety of patient privacy laws and institutional policies. To address this obstacle and make data-sharing easier, several open-source projects have emerged.

“The ability to use that large amount of data to inform our clinical decision-making is huge,” said Joyce Sensmeier, vice president of informatics, technology, and innovation at HIMSS, about the open-source initiatives. The ability to leverage this already-captured data is a great next step, she said. “It’s really going to be a powerful, positive thing, and a wonderful opportunity.”

One of the most active hubs for this work is the Radiology Informatics Lab at Stanford Medicine.[1]

The lab, also known as the Langlotz Lab, is currently working with imaging datasets from within and outside of Stanford Medicine. These include:

  • 1,000 ICU chest radiographs;
  • 831 bone tumor radiographs annotated by an expert radiologist with 18 features and the pathologic diagnosis;
  • 4,000 digital mammograms annotated with 13 quality attributes;
  • 4,000 pediatric hand radiographs with radiologist bone age; and
  • soon: 4.4 million Stanford exams, each with a narrative report.

“We hope to increase open access to some of these datasets by way of novel infrastructure and sharing methodology,” according to the Lab’s website.

Or take the Open Access Series of Imaging Studies (OASIS),[2] a project, according to its homepage, “aimed at making neuroimaging datasets of the brain freely available to the scientific community.” OASIS-3, the third iteration of the project, is a longitudinal neuroimaging, clinical, cognitive, and biomarker dataset for normal aging and Alzheimer’s disease.

As with other open-source projects, users of OASIS data must abide by the Creative Commons Attribution 4.0 license, along with other terms.

At the federal level, the Centers for Medicare and Medicaid Services (CMS) has been pushing for better data interoperability. CMS has released a series of proposed rules and programs to drive open data, believing this will lower costs and improve outcomes.

And the American College of Radiology Data Science Institute, formed in 2017, is working on open-source software to promote the creation of clinical AI models, essentially making it easier for institutions to build such models by providing the necessary training data and computational power.

All this is good news for AI researchers, who can turn to an expanding number of image databases and participate in a growing number of medical image analysis challenges (see Grand Challenges in Medical Image Analysis, https://grand-challenge.org/).

Limitations, obstacles

To protect patient confidentiality, open-source healthcare projects rely on anonymized data. More than that, many of these projects explicitly limit use to academic studies, so their data cannot be used in commercial products by software and hardware vendors.

Moreover, patient privacy and data governance rules vary, institution by institution and country by country.

One way to work within these sharing restrictions is to use so-called federated learning techniques. In federated learning, clinical data stays where it currently resides, such as within the hospital. Instead of moving data from different repositories and devices to a central location (the traditional approach), federated learning lets each endpoint download the model, train it locally on its own data, and send back only a summary of the resulting changes to update the central model. The end result is the same: an AI model that improves iteratively over time. But because the underlying data never travels outside the clinical domain where it was captured, the technique does not run afoul of data governance or security procedures.
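To make that workflow concrete, here is a minimal federated-averaging sketch in Python using NumPy. The three “hospital sites,” their synthetic data, and the simple linear model are hypothetical stand-ins rather than any particular project’s implementation: each site trains the shared model on data that never leaves it, and only the updated weights are sent back and averaged into the central model.

```python
# Minimal federated-averaging (FedAvg) sketch. All sites, data, and the
# model are hypothetical stand-ins for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def local_training_round(weights, X, y, lr=0.01, epochs=5):
    """Train the shared model on one site's private data; return new weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

# Hypothetical private datasets held at three hospitals (never centralized).
n_features = 4
site_data = []
for _ in range(3):
    X = rng.normal(size=(100, n_features))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)
    site_data.append((X, y))

# Central model, initialized at the coordinating server.
global_weights = np.zeros(n_features)

for round_num in range(20):
    # 1. Each site downloads the current model and trains it locally.
    local_updates = [local_training_round(global_weights, X, y) for X, y in site_data]
    # 2. Only the updated weights travel back; the server averages them.
    global_weights = np.mean(local_updates, axis=0)

print("Federated model weights:", np.round(global_weights, 2))
```

In practice, the averaging step is usually weighted by each site's sample count and wrapped in secure aggregation, but the data-stays-local principle is the same.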

That’s not to say federated learning is without its own challenges. Architecting such systems at scale requires extensive collaboration among many technical and clinical stakeholders, all of whom need the components to interact correctly so that the resulting centralized model is valid and trustworthy.

Difficulties with sharing datasets may not be the biggest obstacle, however.

“Annotation [of medical images] is very expensive, because it needs expertise,” said Hsi-Ming Chang, a senior data scientist at GE Healthcare. This means it is difficult to create sufficiently large, annotated datasets.

The need for radiological expertise also means annotating these datasets is a slow process that can’t easily scale: the required annotation cannot be handled via crowdsourcing, as it often is in AI training efforts in other domains.

Despite all the obstacles, Chang is optimistic about the AI imaging field, thanks to rapidly improving technology.

For example, to get around the annotation issue, a model can be trained on an annotated subset of the dataset and then used to predict annotations for a second part of the set. Human annotators then need only fine-tune those predicted labels, the corrected data is folded back into the training set, and the process repeats.

“This leads to an iterative improvement of the model’s quality,” he said.
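As a rough illustration of that bootstrapping loop (not GE Healthcare’s actual pipeline), the Python sketch below uses scikit-learn and synthetic data as stand-ins for radiologist-annotated images: a model trained on a small labeled subset pre-labels the next batch, a reviewer only corrects the predictions, and the corrected batch is added back to the training set.

```python
# Sketch of a model-assisted annotation loop on synthetic data.
# The dataset and "reviewer corrections" are hypothetical stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=1000, n_features=20, random_state=0)

labeled_idx = np.arange(100)            # small, expert-annotated starting set
unlabeled_idx = np.arange(100, 1000)    # images still awaiting annotation
labels = y_true[labeled_idx].copy()

model = LogisticRegression(max_iter=1000)
batch_size = 100

while len(unlabeled_idx) > 0:
    # 1. Train on everything annotated so far.
    model.fit(X[labeled_idx], labels)

    # 2. Pre-label the next batch with the model.
    batch = unlabeled_idx[:batch_size]
    predicted = model.predict(X[batch])

    # 3. Human reviewers correct the predictions (simulated here by
    #    substituting the ground truth); only corrections are needed,
    #    not from-scratch annotation.
    corrected = y_true[batch]
    print(f"Reviewer corrections needed: {int((predicted != corrected).sum())} of {batch_size}")

    # 4. Fold the reviewed batch back in and repeat.
    labeled_idx = np.concatenate([labeled_idx, batch])
    labels = np.concatenate([labels, corrected])
    unlabeled_idx = unlabeled_idx[batch_size:]

print("Final model accuracy:", model.score(X, y_true))
```

As the model improves round over round, the number of corrections per batch shrinks, which is the iterative quality gain Chang describes.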