A recent report from the Stanford Internet Observatory has revealed that LAION-5B, a major open-source AI dataset used to train popular text-to-image generators such as Stable Diffusion 1.5 and Google’s Imagen, contains at least 1,008 instances of child sexual abuse material (CSAM), with thousands more suspected. Released in March 2022, the dataset comprises more than 5 billion images and associated captions scraped from the internet. The report warns that CSAM in the training data could lead AI systems built on it to generate new and potentially realistic depictions of child abuse.
In response, LAION announced to 404 Media that it is temporarily removing its datasets “out of an abundance of caution” to ensure their safety before they are republished.
LAION’s datasets have faced scrutiny before. In October 2021, cognitive scientist Abeba Birhane published a paper analyzing LAION-400M, an earlier dataset. The paper documented highly problematic content, including explicit images and text depicting rape and pornography.
In September 2022, artist Lapine discovered that private medical photos taken by her doctor in 2013 were included in the LAION-5B dataset. She found them through Have I Been Trained, a website that lets people search AI training datasets for their own work.
A class-action lawsuit, Andersen et al. v. Stability AI Ltd. et al., filed in January 2023, included LAION in its allegations against Stability AI, Midjourney, and DeviantArt. The plaintiffs claimed that Stability AI illegally downloaded billions of copyrighted images, with LAION allegedly supplying the scraped data used to create Stable Diffusion.
Award-winning artist Karla Ortiz, who has worked with leading companies such as Industrial Light & Magic and Marvel Studios, spoke at an FTC panel in October about concerns related to the LAION-5B dataset. She noted, "LAION-5B contains 5.8 billion text and image pairs that include my work and that of almost everyone I know. Beyond intellectual property, it also contains deeply concerning material like private medical records, non-consensual pornography, and images of children."
Andrew Ng, a prominent figure in AI and former head of Google Brain, has expressed concern over the potential impact of restricting access to datasets like LAION-5B. In his DeepLearning.AI newsletter, he emphasized that recent machine learning advances have depended on access to abundant, freely available data. Ng argued that limiting access to such datasets would slow progress in fields such as art, education, and drug development, and urged the AI community instead to improve transparency in how data is collected and used.
LAION, which stands for Large-scale Artificial Intelligence Open Network, was co-founded by Christoph Schuhmann, who conceived the idea while chatting with AI enthusiasts on Discord. He aimed to build an open-source dataset for training text-to-image models. Within weeks, LAION had amassed 3 million image-text pairs, a collection that eventually grew to more than 5 billion.
LAION has also been an active voice in discussions about open-source AI, advocating for accelerated research and a collaborative international computing cluster for training large-scale AI models. Notably, much of LAION’s visual data comes from online shopping platforms such as Shopify, eBay, and Amazon, a provenance that researchers at the Allen Institute for AI recently examined in a study of LAION-2B-en, the English-language subset of LAION-5B. They found that roughly 6 percent of the dataset’s documents originated from Shopify, highlighting the need for further investigation into the sources of the image data used to train AI models.
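Findings like that 6 percent figure come from tallying the web domains behind each image URL in the dataset’s released metadata. As a rough, hypothetical sketch (not the Allen Institute’s actual code), the following Python script counts domain shares in a single LAION metadata shard; the local file name is a placeholder, and the uppercase "URL" column name is an assumption based on LAION’s public parquet releases.

```python
# Hypothetical sketch: estimate what share of a LAION metadata shard's image
# URLs point at each web domain. Not the Allen Institute study's code; the
# shard file name is a placeholder and the "URL" column is assumed from
# LAION's public parquet releases.
from collections import Counter
from urllib.parse import urlparse

import pandas as pd

def root_domain(url: str) -> str:
    """Reduce a URL to a coarse domain label, e.g. cdn.shopify.com -> shopify.com."""
    host = urlparse(url).netloc.lower()
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

df = pd.read_parquet("laion2B-en-shard-00000.parquet")  # placeholder path
counts = Counter(root_domain(u) for u in df["URL"].dropna())

total = sum(counts.values())
for domain, n in counts.most_common(10):
    print(f"{domain}: {n / total:.2%}")
```

A full audit would, of course, stream every shard of the dataset and use a proper public-suffix list rather than the two-label domain heuristic above, but the basic approach is the same.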