Spawning Aims to Create More Ethical AI Training Datasets

Jordan Meyer and Mathew Dryhurst launched Spawning AI with the mission of empowering artists to manage the online use of their works. Their newest initiative, Source.Plus, is designed to curate “non-infringing” media for training AI models.

Initially, Source.Plus will feature a dataset of nearly 40 million images that are either in the public domain or licensed under Creative Commons’ CC0, which allows creators to relinquish most legal rights to their works. Meyer asserts that, although it is significantly smaller than many existing generative AI training datasets, the Source.Plus dataset is “high-quality” enough to effectively train advanced image-generating models.

“With Source.Plus, we’re building a universal ‘opt-in’ platform,” Meyer explained. “Our aim is to simplify the process for rights holders to contribute their media for generative AI training on their terms, while also making it easy for developers to integrate that media into their training workflows.”

Rights Management

The ethics of training generative AI models, especially in art generation with tools like Stable Diffusion and OpenAI’s DALL-E 3, continues to stir debate and holds significant implications for artists, regardless of the outcome.

Generative AI models learn to create outputs (like photorealistic art) by training on an extensive array of data — in this case, images. Some developers contend that fair use permits them to scrape data from public sources without considering the copyright status. Others attempt to strike a balance, compensating or attributing content owners for their contributions to training datasets.

Meyer, as CEO of Spawning, believes that a universally accepted best approach has yet to emerge. “AI training often defaults to the most accessible data, which isn’t always the most fair or responsibly sourced,” he said. “Artists and rights holders have lacked control over how their data is utilized for AI training, while developers haven’t had high-quality alternatives that respect data rights.”

Available in limited beta, Source.Plus builds upon Spawning’s existing tools for managing art provenance and usage rights. In 2022, Spawning introduced HaveIBeenTrained, a platform allowing creators to opt out of training datasets used by partners like Hugging Face and Stability AI. After securing $3 million in funding from investors including True Ventures and Seed Club Ventures, Spawning launched ai.txt, which enables websites to “set permissions” for AI, along with Kudurru, a system to combat data-scraping bots.

Source.Plus marks Spawning’s initial venture into creating and curating an in-house media library. Meyer notes that the initial PD/CC0 image dataset can be utilized for both commercial and research purposes.

“Source.Plus isn’t just a repository for training data; it’s an enrichment platform designed to support the entire training pipeline,” Meyer continued. “Our ultimate goal is to develop a high-quality, non-infringing CC0 dataset capable of supporting a robust AI model within the year.”

Setting Higher Standards

Organizations like Getty Images, Adobe, Shutterstock, and AI startup Bria claim to use only fairly sourced data for their model training, with Getty going as far as deeming its generative AI products “commercially safe.” However, Meyer asserts that Spawning seeks to elevate the standard for what constitutes fair data sourcing.

Source.Plus filters images based on artist opt-outs and training preferences, and provides provenance information on how and where each image was sourced. It specifically excludes images not licensed under CC0, including those with a Creative Commons BY 1.0 license, which requires attribution. Furthermore, Spawning actively monitors for copyright disputes arising from sources such as Wikimedia Commons, where someone other than the creator may have indicated a work’s copyright status, possibly inaccurately.
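
As a rough illustration of the filtering described above, the following is a minimal sketch rather than Spawning’s actual pipeline. The `ImageRecord` layout, the license labels, and the sample URLs are all hypothetical, since the article does not describe the Source.Plus schema.

```python
from dataclasses import dataclass

# Hypothetical record layout; the real Source.Plus schema is not public in this article.
@dataclass(frozen=True)
class ImageRecord:
    url: str
    license: str     # assumed labels such as "CC0", "PD", "CC-BY-1.0"
    source: str      # provenance: where the image was collected from
    opted_out: bool  # True if the artist has opted out of AI training

# Only public-domain and CC0 works qualify; attribution licenses do not.
ALLOWED_LICENSES = {"CC0", "PD"}

def is_eligible(record: ImageRecord) -> bool:
    """Keep a record only if its license is allowed and the artist has not opted out."""
    return record.license in ALLOWED_LICENSES and not record.opted_out

def filter_eligible(records):
    """Return the eligible records, preserving their original order."""
    return [r for r in records if is_eligible(r)]

sample = [
    ImageRecord("https://example.org/a.jpg", "CC0", "museum-archive", False),
    ImageRecord("https://example.org/b.jpg", "CC-BY-1.0", "wiki-upload", False),
    ImageRecord("https://example.org/c.jpg", "CC0", "wiki-upload", True),
    ImageRecord("https://example.org/d.jpg", "PD", "museum-archive", False),
]

eligible_urls = [r.url for r in filter_eligible(sample)]
print(eligible_urls)
```

In this sketch the CC BY image is dropped because of its license and the third image because of the opt-out, even though its license would otherwise qualify.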

“We have thoroughly validated the reported licenses of the images we’ve gathered, excluding any questionable licenses — a step often overlooked by other ‘fair’ datasets,” Meyer emphasized.

Historically, problematic images, encompassing violent, explicit, and sensitive personal content, have plagued both open and commercial training datasets. For instance, the maintainers of the LAION dataset were compelled to remove a library after reports revealed it contained medical records and depictions of child sexual abuse; recently, a study from Human Rights Watch noted that a LAION repository included the faces of Brazilian children without their consent. Additionally, Adobe’s stock media library, used to train its generative AI models like Firefly, was found to include AI-generated images from competitors such as Midjourney.

To tackle these issues, Spawning is utilizing classifier models to identify nudity, gore, personal identification, and other undesirable content in images. Acknowledging that no classifier is foolproof, Spawning plans to enable users to “flexibly” filter the Source.Plus dataset by adjusting the classifiers’ detection thresholds.
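
Threshold-based filtering of this kind might look roughly like the sketch below. The classifier labels, scores, and threshold values are invented for illustration; the article does not say how Spawning’s classifiers report their results.

```python
from typing import Dict, List

def passes_filters(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Keep an image only if every classifier score is below its threshold.

    A missing score is treated as 1.0, so an unscored image is rejected
    rather than silently accepted (a conservative assumption).
    """
    return all(scores.get(label, 1.0) < limit for label, limit in thresholds.items())

def select(images: Dict[str, Dict[str, float]], thresholds: Dict[str, float]) -> List[str]:
    """Return the sorted names of images that pass every threshold."""
    return sorted(name for name, s in images.items() if passes_filters(s, thresholds))

# Hypothetical classifier outputs in the range 0.0 to 1.0.
images = {
    "a.jpg": {"nudity": 0.02, "gore": 0.01, "personal_id": 0.10},
    "b.jpg": {"nudity": 0.40, "gore": 0.05, "personal_id": 0.03},
    "c.jpg": {"nudity": 0.01, "gore": 0.00, "personal_id": 0.65},
}

strict = {"nudity": 0.2, "gore": 0.2, "personal_id": 0.2}
lenient = {"nudity": 0.5, "gore": 0.5, "personal_id": 0.7}

strict_result = select(images, strict)
lenient_result = select(images, lenient)
print(strict_result, lenient_result)
```

Lower thresholds exclude more borderline images at the cost of a smaller dataset, which is the trade-off an adjustable filter leaves to the user.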

“We employ moderators to ensure data ownership verification,” Meyer added. “We also integrate remediation features allowing users to flag any objectionable or potentially infringing works, with an audit trail for how that data was used.”

Fair Compensation for Creators

Most compensation programs for creators contributing to generative AI training data have faced challenges. Many rely on opaque metrics for calculating payouts, while others offer amounts deemed inadequate by artists.

Take Shutterstock as an example. The stock media library, which has secured deals with AI vendors worth millions, contributes to a “contributors fund” for the artwork used in training its generative AI models or licensed to third parties. However, Shutterstock lacks transparency regarding expected earnings for artists and does not allow them to set their pricing terms; estimates suggest earnings of only $15 for 2,000 images, an amount many consider unsatisfactory.

Once Source.Plus moves beyond its beta phase this year and expands to datasets beyond PD/CC0, it will take a distinct approach, allowing artists and rights holders to determine pricing on a per-download basis. Spawning will impose a minimal flat fee — “a tenth of a penny,” according to Meyer.

Additionally, customers can subscribe to Source.Plus Curation for $10 per month, on top of the typical per-image download fee. The plan lets users manage private image collections, download the dataset up to 10,000 times a month, and get early access to new features such as “premium” collections and data enrichment.
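
To put the figures in perspective, here is a small worked calculation. It assumes, which the article does not confirm, that the flat fee applies per download and is added on top of the artist’s price; the one-cent artist price is purely hypothetical.

```python
from decimal import Decimal

FLAT_FEE = Decimal("0.001")    # "a tenth of a penny"; assumed to apply per download
SUBSCRIPTION = Decimal("10")   # Source.Plus Curation, per month

def download_cost(artist_price: Decimal, downloads: int, subscribed: bool = False) -> Decimal:
    """Total a developer pays: (artist price + flat fee) per download, plus any subscription."""
    total = (artist_price + FLAT_FEE) * downloads
    if subscribed:
        total += SUBSCRIPTION
    return total

def artist_payout(artist_price: Decimal, downloads: int) -> Decimal:
    """Amount that goes to the artist under the assumptions above."""
    return artist_price * downloads

# The Shutterstock estimate cited in the article: $15 for 2,000 images.
shutterstock_per_image = Decimal("15") / Decimal("2000")

# Hypothetical artist-set price of one cent per download, over 2,000 downloads.
price = Decimal("0.01")
cost = download_cost(price, 2000)
cost_with_plan = download_cost(price, 2000, subscribed=True)
payout = artist_payout(price, 2000)

print(shutterstock_per_image, cost, cost_with_plan, payout)
```

Under these assumptions, the Shutterstock estimate works out to $0.0075 per image, while an artist charging one cent per download would receive $20 of the $22 paid for 2,000 downloads.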

“While we offer recommendations based on industry standards and our own metrics, ultimately, it is the contributors to the dataset who decide what makes it worthwhile for them,” Meyer explained. “We have designed this pricing model intentionally to ensure artists receive the majority of revenue while allowing them to set their own terms for participation. We believe this revenue model is far more favorable for artists compared to traditional percentage splits, leading to higher payouts and greater transparency.”

Should Source.Plus gain the traction Spawning hopes for, there are intentions to broaden its offerings beyond images to other media types like audio and video. Spawning is currently in discussions with unnamed firms to include their data in Source.Plus and hopes to explore creating its own generative AI models using the Source.Plus datasets.

“We aspire to provide rights holders interested in the generative AI economy with fair compensation opportunities,” Meyer stated. “We also hope to open avenues for artists and developers who have felt conflicted about AI to engage in a manner respectful to their fellow creatives.”

It's clear that Spawning has a unique opportunity to shape the landscape here. Source.Plus appears to be one of the most promising ventures aimed at involving artists in generative AI development while allowing them to receive compensation for their contributions.

As my colleague Amanda Silberling noted, the rise of apps like the art-hosting community Cara, which gained traction after Meta announced potential AI training on Instagram content, signifies that the creative community is at a tipping point. Artists are urgently seeking alternatives to platforms they feel are exploiting their work — and Source.Plus may offer a viable solution.

However, because Spawning is a VC-backed business, it is fair to ask whether Source.Plus can genuinely uphold artists’ best interests while scaling successfully, especially considering the challenges of moderating extensive user-generated content.

Only time will tell.
