MIT study finds “error-riddled” data sets used for AI testing

The team encouraged AI developers to create cleaner data sets for evaluating models and tracking the field’s progress, while also recommending that researchers improve their own data hygiene.
Jeff Rowe

Data sets are a critical element in testing new AI models, but many of the most commonly used data sets are riddled with label errors that could result in faulty assumptions about the quality of the AI.

That’s according to a new study from MIT that looked at the 10 most cited AI data sets “that researchers use to evaluate machine-learning models as a way to track how AI capabilities are advancing over time.”

“Large labeled data sets have been critical to the success of supervised machine learning across the board in domains such as image classification, sentiment analysis, and audio classification,” the report explained. “Yet, the processes used to construct datasets often involve some degree of automatic labeling or crowdsourcing, techniques which are inherently error-prone. Even with controls for error correction, errors can slip through.”

The underlying problem, the report says, is that “(r)esearchers rely on benchmark test datasets to evaluate and measure progress in the state-of-the-art and to validate theoretical findings. If label errors occurred profusely, they could potentially undermine the framework by which we measure progress in machine learning. Practitioners rely on their own real-world datasets which are often more noisy than carefully-curated benchmark datasets. Label errors in these test sets could potentially lead practitioners to incorrect conclusions about which models actually perform best in the real world.”

For the study, explained writer Karen Hao at MIT Technology Review, MIT graduate students Curtis G. Northcutt and Anish Athalye, along with MIT alum Jonas Mueller, used “training data sets to develop a machine-learning model and then used it to predict the labels in the testing data. If the model disagreed with the original label, the data point was flagged up for manual review. Five human reviewers on Amazon Mechanical Turk were asked to vote on which label—the model’s or the original—they thought was correct. If the majority of the human reviewers agreed with the model, the original label was tallied as an error and then corrected.”
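In rough pseudocode, that flag-and-review procedure looks something like the sketch below. This is an illustration of the described workflow, not the authors’ actual pipeline: the scikit-learn classifier stands in for whatever model was trained, and `collect_reviewer_votes` is a hypothetical placeholder for the Amazon Mechanical Turk step.

```python
# Minimal sketch of the flag-and-review procedure described above.
# Assumptions: a scikit-learn-style classifier stands in for the trained model,
# and collect_reviewer_votes() is a hypothetical stub for the crowdsourced vote.
from sklearn.linear_model import LogisticRegression


def collect_reviewer_votes(example, candidates, n_reviewers=5):
    """Hypothetical placeholder: in the study, five Mechanical Turk workers
    voted on which candidate label (model's or original) they thought was correct."""
    raise NotImplementedError("replace with a real crowdsourcing step")


def flag_and_correct(X_train, y_train, X_test, y_test):
    """Train on the training split, flag test labels the model disagrees with,
    and accept the model's label when a majority of reviewers side with it."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predicted = model.predict(X_test)

    corrected = list(y_test)
    n_errors = 0
    for i, (pred, orig) in enumerate(zip(predicted, y_test)):
        if pred == orig:
            continue  # model agrees with the original label: nothing to review
        votes = collect_reviewer_votes(X_test[i], candidates=(pred, orig))
        if votes.count(pred) > votes.count(orig):  # majority backs the model
            corrected[i] = pred                    # original label tallied as an error
            n_errors += 1
    return corrected, n_errors
```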

Previous studies, she noted, have found other flaws in data sets, such as racist and sexist labels and photos of people’s faces obtained without consent; the purpose of this study was to determine how often labels are simply wrong.

“Our findings,” wrote the researchers, “imply ML practitioners might benefit from correcting test set labels to benchmark how their models will perform in real-world deployment, and by using simpler/smaller models in applications where labels for their datasets tend to be noisier than the labels in gold-standard benchmark datasets.”
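In practice, that advice amounts to re-scoring candidate models against both the original and the corrected test labels to see whether the ranking changes. A rough, illustrative sketch (the `accuracy_score` call is scikit-learn’s; the function and variable names are assumptions for illustration):

```python
# Illustrative only: compare how candidate models score against the original
# test labels versus the reviewer-corrected ones.
from sklearn.metrics import accuracy_score


def compare_rankings(models, X_test, y_original, y_corrected):
    """Return each model's accuracy under both label sets, so a practitioner can
    see whether label noise changes which model appears to perform best."""
    results = {}
    for name, model in models.items():
        preds = model.predict(X_test)
        results[name] = {
            "original": accuracy_score(y_original, preds),
            "corrected": accuracy_score(y_corrected, preds),
        }
    return results
```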