LAION, the German research organization behind the dataset used to train generative AI models like Stable Diffusion, has launched an updated dataset it claims is "thoroughly cleaned of known links to suspected child sexual abuse material (CSAM)."
The newly released dataset, Re-LAION-5B, is a revised version of the older LAION-5B, with fixes made on the recommendations of organizations such as the Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection, and the now-disbanded Stanford Internet Observatory. Re-LAION-5B is available in two formats: Re-LAION-5B Research and Re-LAION-5B Research-Safe, the latter of which additionally removes not-safe-for-work (NSFW) content. Both versions have been filtered for thousands of links to known and "likely" CSAM, according to LAION.
“From the start, LAION has committed to eliminating illegal content from its datasets, implementing measures to achieve this,” the organization stated in a blog post. “We strictly follow the principle of removing illegal content as soon as it becomes known.”
It's crucial to understand that LAION's datasets do not contain images themselves. Rather, they are curated indexes of links to images and the images' alt text, both drawn from a different dataset, Common Crawl, a repository of scraped websites and web pages.
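To make the distinction concrete, here is a minimal sketch of what a single entry in a LAION-style index might look like. The field names (url, caption, width, height) are illustrative assumptions, not LAION's exact schema.

```python
# A minimal sketch of one entry in a LAION-style index.
# Field names here are illustrative, not LAION's exact schema.
record = {
    "url": "https://example.com/images/cat.jpg",        # link to an image hosted elsewhere
    "caption": "a tabby cat sleeping on a windowsill",   # alt text scraped alongside the page
    "width": 1024,
    "height": 768,
}
# The dataset distributes billions of such records as metadata files;
# anyone training a model must fetch the images from their original hosts.
```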
The announcement of Re-LAION-5B follows a December 2023 Stanford Internet Observatory investigation that highlighted issues with a subset of the original LAION-5B dataset, specifically LAION-5B 400M, which was found to include at least 1,679 links to illegal images sourced from social media and adult websites. The report also noted the presence of "a wide range of inappropriate content, including pornographic imagery, racist language, and harmful stereotypes."
While the Stanford researchers acknowledged that removing the offending material would be difficult, and noted that the presence of CSAM does not necessarily influence the output of models trained on such datasets, LAION chose to take LAION-5B offline temporarily. The report recommended that models trained on LAION-5B be deprecated and their distribution halted where possible. In a perhaps related move, AI startup Runway recently removed its Stable Diffusion 1.5 model from the AI hosting platform Hugging Face, prompting requests for further details. In 2023, Runway partnered with Stability AI, the creator of Stable Diffusion, to develop the model.
The new Re-LAION-5B dataset contains approximately 5.5 billion text-image pairs and is released under an Apache 2.0 license. LAION says the accompanying metadata allows third parties to clean their existing copies of LAION-5B by removing the links to illegal content.
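As a rough illustration of that kind of cleanup, the sketch below keeps only the rows of an existing copy whose links also appear in the cleaned metadata. It assumes parquet shards with a "url" column; the actual file names, layout, and matching keys in LAION's release may differ.

```python
import pandas as pd

# Sketch of the cleanup LAION describes: retain only the rows of an existing
# LAION-5B shard whose links also appear in the cleaned Re-LAION-5B metadata.
# Filenames and the "url" column are assumptions, not LAION's actual layout.
old_shard = pd.read_parquet("laion5b_shard_0000.parquet")      # hypothetical filename
clean_meta = pd.read_parquet("relaion5b_shard_0000.parquet")   # hypothetical filename

allowed = set(clean_meta["url"])                 # links present in the cleaned release
filtered = old_shard[old_shard["url"].isin(allowed)]

filtered.to_parquet("laion5b_shard_0000.cleaned.parquet")
print(f"kept {len(filtered)} of {len(old_shard)} rows")
```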
LAION emphasizes that its datasets are intended for research, not commercial use. But if history is any guide, that won't stop some organizations: Google, for one, has previously used LAION datasets to train its image-generating models.
“In total, 2,236 links to suspected CSAM have been removed by matching our lists with those provided by our partners,” LAION added in the blog post. “This also includes 1,008 links identified in the Stanford Internet Observatory's December 2023 report. We strongly encourage all research labs and organizations using the outdated LAION-5B to transition to the Re-LAION-5B datasets immediately.”