While new AI technologies are being rapidly introduced across the healthcare sector, there are still no established best practices to help stakeholders ensure those technologies’ reliability and safety.
That’s according to a team of scholars from Stanford University who expressed their concerns in a recent review in Nature.
“Are medical devices able to demonstrate performance that can be generalized to the entire intended population? Are commonly faced shortcomings of AI (overfitting to training data, vulnerability to data shifts, and bias against underrepresented patient subgroups) adequately quantified and addressed?”
Those were just two of the regulatory questions that, in their view, need more definitive answers.
To highlight their concerns, the team set out to understand how the FDA is addressing issues of test-data quality, transparency, bias, and algorithm monitoring by building a database of 130 AI devices the agency approved between January 2015 and December 2020. For each algorithm, the researchers assessed the number of patients enrolled in the evaluation study; the number of sites used in the evaluation; whether the test data were collected and evaluated retrospectively or prospectively; and whether performance stratified by disease subtype or demographic subgroup was reported.
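Those per-device attributes map naturally onto a simple record structure. The sketch below is purely illustrative and assumes hypothetical field names and values; it is not drawn from the researchers’ actual database, but it shows the kind of information tracked for each device.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeviceEvaluation:
    """One FDA-approved AI device and the evaluation attributes the review tracked."""
    device_name: str
    patients_enrolled: Optional[int]   # patients in the evaluation study, if reported
    evaluation_sites: Optional[int]    # number of clinical sites, if publicly reported
    prospective: bool                  # True if a prospective study was reported
    stratified_performance: bool       # performance reported by disease subtype or demographic subgroup

# Illustrative record only; the values are invented, not taken from the study.
example = DeviceEvaluation(
    device_name="Hypothetical CADe device",
    patients_enrolled=500,
    evaluation_sites=2,
    prospective=False,
    stratified_performance=False,
)
```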
The review showed that 126 of the 130 AI devices underwent only retrospective studies at the time of their submission, and none of the 54 high-risk devices were evaluated in prospective studies.
“For most devices,” they wrote, “the test data for the retrospective studies were collected from clinical sites before evaluation, and the endpoints measured did not involve a side-by-side comparison of clinicians’ performances with and without AI.”
“More prospective studies are needed for full characterization of the impact of the AI decision tool on clinical practice, which is important, because human–computer interaction can deviate substantially from a model’s intended use. For example, most computer-aided detection diagnostic devices are intended to be decision-support tools rather than primary diagnostic tools,” the researchers stated, adding, “A prospective randomized study may reveal that clinicians are misusing this tool for primary diagnosis and that outcomes are different from what would be expected if the tool were used for decision support.”
In addition, the researchers found that 93 of the 130 AI devices analyzed did not have a publicly reported multi-site assessment as part of their evaluation studies. Of the 41 devices that did report the number of evaluation sites, four were evaluated at only one site and eight at only two sites.
“This suggests that a substantial proportion of approved devices might have been evaluated only at a small number of sites, which often tend to have limited geographic diversity,” the researchers noted.
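To make the arithmetic behind that observation concrete, here is a minimal sketch, building on the hypothetical DeviceEvaluation records above, of how such a site-count breakdown could be tallied; the figures in the closing comment are the review’s reported numbers, not the output of real data.

```python
def site_count_summary(devices):
    """Tally devices that publicly reported a site count, and how many of
    those were evaluated at only one or only two sites."""
    reported = [d for d in devices if d.evaluation_sites is not None]
    single_site = sum(1 for d in reported if d.evaluation_sites == 1)
    two_site = sum(1 for d in reported if d.evaluation_sites == 2)
    return len(reported), single_site, two_site

# Per the review: 41 devices reported site counts;
# 4 of those used a single site and 8 used only two.
```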