California's AI Training Transparency Law: Why Many Companies Are Hesitant to Confirm Compliance

On Sunday, California Governor Gavin Newsom signed AB-2013 into law, a groundbreaking bill requiring companies that develop generative AI systems to publish high-level summaries of the data used to train their models. The summaries must cover who owns the data, how it was acquired, and whether it includes any copyrighted or personal information.

Responses from major AI companies regarding compliance have been mixed. OpenAI, Anthropic, Microsoft, Google, Amazon, Meta, and several startups, including Stability AI, Midjourney, Udio, Suno, Runway, and Luma Labs, were asked whether they would comply, but fewer than half responded, and Microsoft explicitly declined to comment.

Only Stability, Runway, and OpenAI confirmed their intention to comply with AB-2013. An OpenAI spokesperson stated, “OpenAI complies with the law in jurisdictions we operate in, including this one.” Stability's representative expressed support for regulations that protect the public without hindering innovation.

It's worth noting that AB-2013's disclosure requirements are not immediate. The law applies to systems released after January 2022, ChatGPT and Stable Diffusion among them, but companies have until January 2026 to begin publishing their training data summaries. It also applies only to systems made available to Californians, which leaves vendors some room to maneuver.

However, the silence from other vendors may also reflect the complexities of how generative AI systems are trained. Training data is often sourced from the web, with companies scraping vast quantities of images, music, videos, and other content.

In the past, AI developers routinely disclosed their training data sources, typically in accompanying technical papers. For instance, Google has previously indicated that it utilized the public LAION dataset for its early image generation model, Imagen. Earlier academic papers often referenced The Pile, an open-source text collection encompassing various academic studies and codebases.

Today's competitive landscape has turned training datasets into closely guarded trade secrets, which helps explain companies' reluctance to disclose them. Detailed knowledge of training data also carries legal risk: the LAION dataset includes copyrighted and privacy-infringing material, while The Pile contains Books3, a collection of pirated works by authors such as Stephen King.

Numerous lawsuits have already emerged over alleged misuse of training data, with more cases surfacing each month. Authors and publishers claim that OpenAI, Anthropic, and Meta have used copyrighted texts — including those from Books3 — without permission. Music labels have taken legal action against Udio and Suno for allegedly utilizing music tracks without compensating the artists. Meanwhile, artists have initiated class-action lawsuits against Stability and Midjourney, citing data scraping practices akin to theft.

Given these ongoing legal battles, the stakes of AB-2013 for vendors are significant. The law requires public disclosure of potentially incriminating details about training datasets, including when a dataset was first used and whether data collection is ongoing.

The scope of AB-2013 is extensive. It mandates that any organization that “substantially modifies” an AI system — meaning any fine-tuning or retraining — must also disclose their training data sources. While a few exemptions exist, they mainly relate to AI systems used in national cybersecurity and defense contexts, such as for aircraft operations.

Many vendors are optimistic that the legal principle of fair use will protect them, and they are actively asserting this defense in court and public statements. Companies like Meta and Google have adjusted their platform settings and terms of service to enable broader data collection for training purposes.

In an environment driven by competition and bolstered by the belief that fair use defenses will ultimately prevail, some companies have aggressively pursued training on IP-protected data. Reports have indicated that, at one point, Meta utilized copyrighted books for training despite internal legal concerns. Evidence also suggests that Runway extracted training data from Netflix and Disney films, while OpenAI reportedly utilized YouTube videos without creators’ permission to train its models, including GPT-4.

As previously discussed, there is a possibility that generative AI companies could avoid repercussions regardless of AB-2013's requirements. Courts may support fair use advocates and determine that generative AI is sufficiently transformative, contrary to claims of plagiarism made by The New York Times and other plaintiffs.

Conversely, AB-2013 could force vendors to either withhold certain models from California or to release modified versions that are specifically trained using only fair use and licensed datasets. Given the potential legal ramifications, some companies might opt for a strategy that minimizes liability and avoids contentious disclosures.

Assuming the law faces no challenges or stays, we can expect to have a clearer understanding of its implications by the January 2026 deadline for compliance.
