‘Clean’ and unbiased Big Data key to effective AI algorithms

AI algorithms developed using Big Data could free providers to focus more directly on caring for their patients, say researchers, but developers need to guard against replicating biases embedded in the underlying data.
Jeff Rowe

New technologies, including AI and machine learning, have the potential to address numerous health issues across entire populations, but only if researchers and developers ensure the data used are balanced and fully representative.

That’s the caution voiced in a recent commentary in npj Digital Medicine by researchers from Stanford University and the NYU School of Medicine.

Noting the “tsunami of big data” that is rapidly being gobbled up to develop AI algorithms, among other applications, the researchers point out that “such algorithms—agnostic to the sources, or validity, of the big data used for training—have the potential to worsen preexisting demographic disparities in healthcare. Racial biases anchored in historically biased training datasets have led to racially biased predictive models for criminal justice, hiring decisions, allocation of social services/benefits, issuance of supportive housing, and evaluation of child abuse.”

In broad terms, they argue, “(a)wareness of data deficiencies, structures for data inclusiveness, strategies for data sanitation, and mechanisms for data correction can help realize the potential of big data for a personalized medicine era. Applied deliberately, these considerations could help mitigate risks of perpetuation of health inequity amidst widespread adoption of novel applications of big data.”

In short, they say, “the concept of ‘garbage in, garbage out’ is of the utmost importance for medical algorithms trained on healthcare datasets and impacting patients downstream.”

In response, the researchers recommend a number of strategies developers can take to protect data integrity: increasing data transparency by annotating “training datasets with labeling metadata, (thus) documenting biases intrinsic to them”; redesigning data collection methods to ensure variety in the data, not simply volume; and giving practitioners who interpret emerging studies transparency into the characteristics of the underlying datasets.
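To make the first two strategies concrete, here is a minimal sketch of what “labeling metadata” and a representativeness check might look like in practice. All field names, thresholds, and the population shares are illustrative assumptions, not part of the researchers’ proposal.

```python
# Hypothetical sketch: attach provenance metadata to a training dataset and
# flag demographic groups that fall well below their population share.
# Field names and the tolerance threshold are illustrative, not prescriptive.
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    source: str                  # where the records came from
    collection_period: str       # when they were collected
    known_biases: list = field(default_factory=list)  # documented intrinsic biases

def representation_gaps(records, group_key, reference_shares, tolerance=0.5):
    """Flag groups whose share of the dataset is less than `tolerance`
    times their share of the reference population."""
    counts = {}
    for r in records:
        g = r[group_key]
        counts[g] = counts.get(g, 0) + 1
    total = sum(counts.values())
    gaps = []
    for group, ref_share in reference_shares.items():
        share = counts.get(group, 0) / total
        if share < tolerance * ref_share:
            gaps.append((group, round(share, 3), ref_share))
    return gaps

# Usage: a toy dataset heavily skewed toward one group.
records = [{"group": "A"}] * 90 + [{"group": "B"}] * 10
meta = DatasetMetadata(
    source="single urban hospital system",
    collection_period="2015-2020",
    known_biases=["underrepresents rural patients"],
)
print(representation_gaps(records, "group", {"A": 0.6, "B": 0.4}))
# → [('B', 0.1, 0.4)]: group B holds 10% of the data but 40% of the population.
```

Shipping metadata like this alongside a dataset is one way to give downstream practitioners the transparency the commentary calls for, without altering the data itself.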

Moreover, they note, given the increased use of ever-greater amounts of patient data, “across all of these strategies, privacy of patient health information (PHI) must be prioritized. . . . Compromise of privacy amidst accelerating data generation and use threatens the medical, financial, and social wellbeing of patients: for instance, discrimination in health insurance and job employment on the basis of PHI can perpetuate health disparities by impacting access to services and medications.”