A team of computer scientists from MIT investigated ten frequently cited datasets used to evaluate machine learning systems and discovered that roughly 3.4% of the labels were inaccurate or mislabeled. An error rate of that size poses significant challenges for AI systems benchmarked against these datasets.
The datasets, which have been cited over 100,000 times, include text-based sources drawn from newsgroups, Amazon, and IMDb. Common errors included Amazon product reviews with flipped sentiment labels: positive reviews marked as negative and vice versa. In image datasets, problems ranged from confusing animal species to labeling an image by a less prominent object (for instance, a photo of a mountain bike tagged simply as "water bottle" because of the bottle attached to its frame). One notable mistake labeled an image of a baby as a "nipple."
One dataset, derived from YouTube videos, featured a three-and-a-half-minute clip of a YouTuber talking that was labeled "church bell," even though bells are only audible in the last 30 seconds. Another clip, of a Bruce Springsteen performance, was misclassified as an orchestra.
To uncover these errors, the researchers employed a framework called confident learning, which uses a model's predicted probabilities to estimate which labels in a dataset are likely to be noisy. Validation through Mechanical Turk confirmed that roughly 54% of the flagged labels were indeed incorrect. The QuickDraw test set exhibited the highest error rate, with about 5 million mislabeled examples, roughly 10% of its total.
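As a rough illustration of the idea (a minimal sketch, not the authors' full method, which is implemented more completely in the open-source cleanlab library), confident learning can be boiled down to per-class confidence thresholds: each class's threshold is the model's average confidence on examples carrying that label, and an example is flagged when the model exceeds the threshold for some class other than its given label. The function name and toy data below are invented for illustration.

```python
# Simplified sketch of the per-class thresholding behind confident learning.
# The full method builds a "confident joint" and prunes candidates more
# carefully; this version only flags examples for human review.
import numpy as np

def flag_label_issues(labels, pred_probs):
    """Return a boolean mask of examples whose given label looks suspect.

    labels     : (n,) integer array of given (possibly noisy) labels
    pred_probs : (n, k) array of out-of-sample predicted probabilities
    """
    n, k = pred_probs.shape
    # Per-class threshold: the model's average self-confidence on examples
    # that carry that label.
    thresholds = np.array([
        pred_probs[labels == j, j].mean() if np.any(labels == j) else 1.0
        for j in range(k)
    ])
    # An example is a candidate error when some class other than its given
    # label meets that class's threshold.
    above = pred_probs >= thresholds        # (n, k) boolean
    above[np.arange(n), labels] = False     # ignore the given label itself
    return above.any(axis=1)

# Toy example: five examples, two classes; the last one looks mislabeled.
labels = np.array([0, 0, 1, 1, 0])
pred_probs = np.array([[0.90, 0.10],
                       [0.80, 0.20],
                       [0.20, 0.80],
                       [0.10, 0.90],
                       [0.15, 0.85]])  # labeled 0, but the model says class 1
print(flag_label_issues(labels, pred_probs))  # only the last example is flagged
```

In practice, the predicted probabilities come from cross-validation, so the model never scores examples it was trained on.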
The team established a website where users can explore the label errors. While many of the flagged labels are clearly wrong, others are debatable: a close-up of a Mac command key labeled "computer keyboard" looks accurate enough, and the confident learning method also flagged a correctly labeled image of tuning forks, suggesting it was a menorah.
Even slight labeling inaccuracies can have significant consequences for machine learning outcomes. If a model cannot tell a grocery item from a bunch of crabs, it is hard to trust it with tasks such as pouring a drink accurately.