Real big data progress depends on top-quality data

Inaccuracy and inconsistency still plague healthcare Big Data, says one data expert, not to mention that sometimes the data being analyzed aren’t even the right kind of information.
Jeff Rowe

Conventional wisdom holds that healthcare is either in the middle of, or about to begin, a data revolution.

But while few stakeholders doubt the potential advances this revolution will bring, some are suggesting there’s at least one major obstacle to be cleared: inconsistent data quality.

Writing recently in the BMJ, Kiret Dhindsa, Ph.D., a postdoctoral fellow specializing in healthcare at Toronto’s Vector Institute for Artificial Intelligence, pointed out that, currently, the massive amount of healthcare data being generated is riddled with inconsistencies and inaccuracies. And sometimes, even if the data are good, they’re not the right kind of information.

“As a machine-learning scientist wanting to work with health data, the picture that is painted in all of these editorials is nothing like what I experience,” he said in a related interview with HCANews.

For example, when it comes to medical imaging, many doctors write diagnostic notes directly on the images they take. “Given a set of such images, a machine-learning algorithm might learn to be highly accurate in analyzing the images. But what happens in the real world where not every doctor makes such annotations and there’s little consistency in annotations among different physicians?” The short answer: failure.

“One of the first things you discover when you work with multi-institutional data sets is that it is usually easier for a machine-learning algorithm to identify which hospital a sample came from than whether the sample is from a healthy or sick individual,” Dhindsa said. “That should make clear the extent of the problem.”
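
To make that point concrete, here is a minimal sketch using purely synthetic numbers, not anything drawn from Dhindsa’s work. It assumes two hypothetical hospitals whose equipment or protocols shift a single measurement far more than illness does; the offsets (5.0 for the site, 0.5 for disease) and the simple threshold classifier are illustrative assumptions only.

```python
import random
import statistics


def make_samples(n_per_group, seed=0):
    """Generate synthetic (value, hospital, sick) rows for two hypothetical hospitals."""
    rng = random.Random(seed)
    samples = []
    # Assumed site effect: hospital B's readings are shifted by 5.0 units.
    for hospital, site_offset in (("A", 0.0), ("B", 5.0)):
        for sick in (False, True):
            # Assumed disease effect: illness shifts the reading by only 0.5 units.
            disease_offset = 0.5 if sick else 0.0
            for _ in range(n_per_group):
                value = rng.gauss(0.0, 1.0) + site_offset + disease_offset
                samples.append((value, hospital, sick))
    return samples


def threshold_accuracy(train, test, label_index):
    """Fit a cut halfway between the two class means on train; score it on test."""
    labels = sorted({row[label_index] for row in train})
    means = {
        label: statistics.mean(row[0] for row in train if row[label_index] == label)
        for label in labels
    }
    low, high = sorted(labels, key=means.get)
    cut = (means[low] + means[high]) / 2
    correct = sum((high if row[0] > cut else low) == row[label_index] for row in test)
    return correct / len(test)


train = make_samples(200, seed=1)
test = make_samples(200, seed=2)

hospital_acc = threshold_accuracy(train, test, label_index=1)   # which hospital?
diagnosis_acc = threshold_accuracy(train, test, label_index=2)  # sick or healthy?

print(f"hospital accuracy:  {hospital_acc:.2f}")
print(f"diagnosis accuracy: {diagnosis_acc:.2f}")
```

Under these assumptions the same simple rule separates the hospitals far more reliably than it separates sick from healthy patients, because the site-to-site shift swamps the clinical signal. Real multi-institutional data are far messier, but the mechanism is the one Dhindsa describes.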

In his BMJ editorial, co-written with two colleagues, Dhindsa argues that “the big data ‘revolution’ won’t happen in healthcare until we find ways to standardize health data collection across institutions and improve the overall quality of the data.”

According to Dhindsa, the premise of his editorial is widely accepted within the machine-learning community, but “what’s happening is that our voices are being drowned out by researchers who are not actually at the intersection of machine learning and healthcare, because they paint an exciting vision for the future and don’t really talk about why the challenges in the field don’t simply boil down to overcoming technical problems.”

Dhindsa’s editorial doesn’t dispute that machine learning can affect healthcare. Rather, “it’s a call to arms, asking those in the healthcare and machine learning communities to work together to devise and adopt consistent standards around healthcare data.” 

Ironically, there’s also the potential for artificial intelligence to help solve the problem of bad data, by analyzing existing data and transforming them into a standardized format. That won’t be a holistic solution, Dhindsa said, but rather something of a temporary fix that will buy healthcare institutions time to restructure their data management systems.

“I suspect that in the meantime, data scientists will end up throwing out huge amounts of data due to quality issues,” he said.