LAION, the German research organization behind the dataset used to train generative AI models like Stable Diffusion, has launched an updated dataset it claims is "thoroughly cleaned of known links to suspected child sexual abuse material (CSAM)."
The newly released dataset, Re-LAION-5B, is a revised version of the older LAION-5B, with fixes made on the recommendations of organizations such as the Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection, and the now-disbanded Stanford Internet Observatory. Re-LAION-5B is available in two formats: Re-LAION-5B Research and Re-LAION-5B Research-Safe, the latter of which additionally removes not-safe-for-work (NSFW) content. Both versions have been filtered for thousands of links to known and "likely" CSAM, according to LAION.
“From the start, LAION has committed to eliminating illegal content from its datasets, implementing measures to achieve this,” the organization stated in a blog post. “We strictly follow the principle of removing illegal content as soon as it becomes known.”
It's crucial to understand that LAION's datasets do not contain images themselves. Rather, they are curated indexes of links to images and the images' alt text, both drawn from a different dataset, Common Crawl, a repository of scraped websites and web pages.
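To make the distinction concrete, here is a minimal sketch of what a single entry in a LAION-style index might look like. The field names (url, caption, width, height) are illustrative assumptions, not LAION's exact schema.

```python
# A minimal sketch of one entry in a LAION-style index.
# Field names here are illustrative, not LAION's exact schema.
record = {
    "url": "https://example.com/images/cat.jpg",        # link to an image hosted elsewhere
    "caption": "a tabby cat sleeping on a windowsill",   # alt text scraped alongside the page
    "width": 1024,
    "height": 768,
}
# The dataset distributes billions of such records as metadata files;
# anyone training a model must fetch the images from their original hosts.
```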
The announcement of Re-LAION-5B follows a December 2023 Stanford Internet Observatory investigation that highlighted issues with a subset of the original LAION-5B dataset, specifically LAION-5B 400M, which was found to include at least 1,679 links to illegal images sourced from social media and adult websites. The report also noted the presence of "a wide range of inappropriate content, including pornographic imagery, racist language, and harmful stereotypes."
While the Stanford researchers acknowledged that removing the offending material would be difficult, and noted that the presence of CSAM does not necessarily influence the output of models trained on such datasets, LAION chose to take LAION-5B offline temporarily. The report recommended that models trained on LAION-5B be deprecated and their distribution halted where possible. In a perhaps related move, AI startup Runway recently removed its Stable Diffusion 1.5 model from the AI hosting platform Hugging Face, prompting requests for further details. In 2023, Runway partnered with Stability AI, the creator of Stable Diffusion, to develop the model.
The new Re-LAION-5B dataset contains approximately 5.5 billion text-image pairs and is released under an Apache 2.0 license. LAION says the accompanying metadata allows third parties to clean their existing copies of LAION-5B by removing the links to illegal content.
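As a rough illustration of that kind of cleanup, the sketch below keeps only the rows of an existing copy whose links also appear in the cleaned metadata. It assumes parquet shards with a "url" column; the actual file names, layout, and matching keys in LAION's release may differ.

```python
import pandas as pd

# Sketch of the cleanup LAION describes: retain only the rows of an existing
# LAION-5B shard whose links also appear in the cleaned Re-LAION-5B metadata.
# Filenames and the "url" column are assumptions, not LAION's actual layout.
old_shard = pd.read_parquet("laion5b_shard_0000.parquet")      # hypothetical filename
clean_meta = pd.read_parquet("relaion5b_shard_0000.parquet")   # hypothetical filename

allowed = set(clean_meta["url"])                 # links present in the cleaned release
filtered = old_shard[old_shard["url"].isin(allowed)]

filtered.to_parquet("laion5b_shard_0000.cleaned.parquet")
print(f"kept {len(filtered)} of {len(old_shard)} rows")
```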
LAION emphasizes that its datasets are intended for research, not commercial use. But if history is any guide, that won't stop some organizations: Google, for one, has previously used LAION datasets to train its image-generating models.
“In total, 2,236 links to suspected CSAM have been removed by matching our lists with those provided by our partners,” LAION added in the blog post. “This also includes 1,008 links identified in the Stanford Internet Observatory's December 2023 report. We strongly encourage all research labs and organizations using the outdated LAION-5B to transition to the Re-LAION-5B datasets immediately.”