Data quality: New tactics in AI, ML training


Data scientists building artificial intelligence and machine learning models spend most of their time cleaning and normalizing training data. For instance, radiological images — say, a kidney ultrasound — vary significantly, depending on device, geography, and practitioner.

Nevertheless, it is crucial to build ML algorithms from diverse image sets.

“Well-generalized data leads to well-generalized AI,” said Matt Wickesberg, senior product manager of the Edison AI Workbench, the algorithm development component of the Edison platform, GE Healthcare's intelligence offering. Unveiled last November, Edison is composed of applications and smart devices, and offers seamless AI services on device, edge, and cloud. It is named for GE's co-founder, Thomas Edison.

Failure to use a sufficiently diverse training set can lead to image-recognition errors, as Apple learned with its first face-recognition product, which worked well on Western faces but not Asian ones.

“A best practice is to obtain data from many sites,” Wickesberg said. Likewise, it is important to make sure the dataset is varied within those sources, with different vendors, protocols, modalities, demographics, pathologies, and devices.
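A simple way to picture such a diversity check is to count how often each value appears in the metadata of a training set and flag any attribute dominated by one value. The sketch below is only an illustration under stated assumptions: the field names (`site`, `vendor`, `modality`), the sample records, and the 60 percent threshold are hypothetical and do not describe GE Healthcare's tooling.

```python
from collections import Counter


def dominant_share(records, field):
    """Return (most common value, its share of all records) for one metadata field."""
    counts = Counter(record[field] for record in records)
    value, count = counts.most_common(1)[0]
    return value, count / len(records)


def audit_diversity(records, fields, max_share=0.6):
    """Return the fields in which a single value exceeds max_share of the dataset."""
    flagged = {}
    for field in fields:
        value, share = dominant_share(records, field)
        if share > max_share:
            flagged[field] = (value, share)
    return flagged


# Hypothetical metadata for four images; real sets would hold thousands.
records = [
    {"site": "A", "vendor": "V1", "modality": "ultrasound"},
    {"site": "A", "vendor": "V2", "modality": "ultrasound"},
    {"site": "B", "vendor": "V1", "modality": "ultrasound"},
    {"site": "C", "vendor": "V3", "modality": "CT"},
]

report = audit_diversity(records, ["site", "vendor", "modality"])
print(report)
```

Here only `modality` is flagged, because three of the four sample images are ultrasound. An actual audit would also need to look at combinations of attributes, which this sketch does not attempt.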

Curation and more

Given the nuances in how the original images were classified, GE Healthcare establishes partnerships with hospital systems, where hundreds of subject experts annotate thousands of the images, using a template that constrains the terms that can be used. The result? An image set that uses a harmonized, common lexicon.
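The idea of a template that constrains annotation terms can be sketched as a check against a controlled vocabulary. The fields and terms below are invented for illustration; the article does not describe the actual template or lexicon.

```python
# Hypothetical lexicon: each annotation field maps to its permitted terms.
ALLOWED_TERMS = {
    "finding": {"normal", "cyst", "mass", "stone"},
    "laterality": {"left", "right", "bilateral"},
}


def validate_annotation(annotation, allowed=ALLOWED_TERMS):
    """Return a list of problems; an empty list means the annotation conforms."""
    errors = []
    for field, terms in allowed.items():
        value = annotation.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif value.strip().lower() not in terms:
            errors.append(f"{field}: '{value}' is not in the lexicon")
    return errors


# A conforming annotation (case and surrounding spaces are ignored).
print(validate_annotation({"finding": "Cyst", "laterality": "left"}))

# A free-text term and a missing field are both rejected.
print(validate_annotation({"finding": "renal lump"}))
```

Rejecting free-text terms at entry time is what keeps the resulting image set on one shared vocabulary, rather than leaving harmonization as a cleanup step afterward.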

Wickesberg said this work, ongoing since 2017, is behind all the current and next-generation AI products at GE Healthcare. Dozens of ML algorithms have been produced this way, he said.

“People think about data science today as the processes they are following for a given product,” said Travis Frosch, global data strategy leader at GE Healthcare. “People, I don’t believe, are thinking about the process of data science to increase the value of the data.” GE Healthcare, Frosch said, recently filed patents on methodologies and techniques that can be performed on the data itself to enrich it.

“This is a big paradigm shift,” Frosch said, “because people typically look at data as, ‘Here are the things I have access to, the data at hand, to build a feature set or product.’ Industry doesn’t, typically, turn the data science loose on the data itself.”

Examples, he said, include having an algorithm churn over the existing dataset and request additional information, such as more annotation from an expert or additional metadata, such as details from the electronic health record (EHR). The goal of this process is to increase an algorithm’s accuracy and its certainty about its conclusions.
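One common way to have an algorithm ask for more annotation is uncertainty sampling, a standard active-learning heuristic: score each image by how unsure the model is and send the least certain ones to an expert. The sketch below uses that heuristic purely as an illustration. It is not the method covered by GE Healthcare's filings, which the article does not detail, and the image IDs and probabilities are made up.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; higher means the model is less certain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_for_review(predictions, k=2):
    """Return the IDs of the k images whose predicted class probabilities are most uncertain."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [image_id for image_id, _ in ranked[:k]]


# Hypothetical (image ID, class probabilities) pairs from a two-class model.
predictions = [
    ("img_001", [0.98, 0.02]),
    ("img_002", [0.55, 0.45]),
    ("img_003", [0.70, 0.30]),
    ("img_004", [0.50, 0.50]),
]

to_review = select_for_review(predictions, k=2)
print(to_review)
```

With these sample values, `img_004` and `img_002` are selected, since their probabilities are closest to an even split. The same ranking could just as well trigger a request for EHR metadata instead of expert annotation.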

There are other initiatives to improve the classification of AI training sets at the front end.

One of GE Healthcare’s partners, the American College of Radiology Data Science Institute, announced in April the launch of a free software platform, ACR AI-LAB, designed to encourage radiologists to participate in the creation, validation, and use of AI. Both GE Healthcare and GPU chip designer Nvidia are involved in ACR AI-LAB.[1]

Keith Dreyer, the DSI’s chief science officer, said one of the key goals of the software will be to give clinicians who don’t have a background in AI or ML a way to “create the challenge, annotate the data, and have it create a model.” He said it was important that this process doesn’t “add more burden or complexity to what the clinician is doing.”

In a statement at the launch of AI-LAB, Dreyer described the problem this way: “Acquiring the necessary large amounts of patient data for algorithm training has been a huge problem for developers up to this point.”[2]

Also in April, the Consumer Technology Association launched an initiative to set standards for the use of artificial intelligence in healthcare, backed by more than 30 vendors and provider associations. The group will examine and advance AI technology in consumer health, fitness, and wellness technology, and recommend best practices. Major tech outfits such as BlackBerry, Google, and IBM are among the initial members.

“We did preliminary work on AI last year,” said Brian Markwalter, senior vice president of research and standards at CTA, where he oversees its market-research capability and its ANSI-accredited standards-development operation, which is composed of more than 70 committees, subcommittees, and working groups.

Given the fast-growing markets for AI-enabled health and fitness, it isn’t surprising that CTA’s first working group in the space is in healthcare. Markwalter said the first interest for both the parent AI committee and the healthcare group is making sure there is a common taxonomy, “one more generally for AI, and probably one for healthcare.” Another early focus, he said, will be on “trustworthiness,” which will tackle issues that include bias, privacy, provenance, normalization, and metadata.

Another problem (or opportunity) will be to establish, early on, working groups that cut across devices and uses, because the data from different sensors (sleep trackers, heart rate trackers, step counters in the consumer-tracker space) may be combined for new insights. Indeed, CTA’s sleep tracker, heart rate and step counter working groups are already working together, he said.

 


[1] GE Healthcare Accelerates AI Model Development and Deployment with Launch of Edison Integration to American College of Radiology AI-LAB™, Business Wire, April 8, 2019, https://www.businesswire.com/news/home/20190408005494/en/GE-Healthcare-Accelerates-AI-Model-Development-Deployment.

[2] American College of Radiology Launches ACR AI-LAB™ to Engage Radiologists in AI Model Development, Cision PR Newswire, April 5, 2019, https://www.prnewswire.com/news-releases/american-college-of-radiology-launches-acr-ai-lab-to-engage-radiologists-in-ai-model-development-300825336.html.