“It would be impossible to train today’s leading AI models without using copyrighted materials,” OpenAI stated in its filing to the UK House of Lords, which made headlines across the web earlier this year.
In fact, this argument is at the crux of the company’s public and legal defense of the controversial mass data scraping it uses to train its AI models, including the GPT-3.5/4 large language models (LLMs) that power its hit product ChatGPT. Implicitly, it also underpins the practices of competitors such as Google, Mistral, Meta, Anthropic, and Cohere. Critics argue OpenAI should have sought affirmative express consent and/or paid licensing fees to owners for the use of copyrighted data, but the company says its practices are fair, transformative use and that it operates under the longstanding norms of the internet, where many companies have scraped content for years to power search engine indexes and other useful features without mass complaint. The fight continues in various ongoing lawsuits.
But a new model is challenging that assumption, or at least the notion that it’s impossible to create a useful model without relying on copyrighted data.
The new LLM is called KL3M (Kelvin Legal Large Language Model, pronounced “Clem”), and it is the work of 273 Ventures, a two-year-old startup co-founded by Daniel Martin Katz, a law professor at the Illinois Institute of Technology and chief strategy officer (CSO) of the venture, and his “frequent collaborator” Michael Bommarito, a legal technology entrepreneur who serves as 273 Ventures’ CEO. The duo previously co-founded LexPredict, an older AI legal startup, and sold it to global law company Elevate.
KL3M was released in late February 2024, but today it earned the distinction of being the first LLM to receive a “Licensed Model (L) Certification” from independent auditing company Fairly Trained, a non-profit founded earlier this year and led by former Stability AI executive Ed Newton-Rex. Wired magazine, where my wife works as editor-in-chief, was first to report the news.
Fairly Trained (L) certification is awarded only to companies that can prove, through an application and review process, that their AI model training data was obtained and used under “a contractual agreement with a party that has the rights required to enter such an agreement” or is public domain or openly licensed. Certification also carries a fee, ranging from $150 upfront plus $500 annually to $500 upfront plus $6,000 annually. KL3M met these requirements.
“Today we are very excited to announce that the Kelvin Legal Large Language Model (KL3M) is now Certified as Fairly Trained,” wrote Katz on his account on the social network X. “KL3M is the very first LLM (in any category) to obtain such a certification.”
“Generative AI can exist without exploiting copyrighted work without permission,” wrote Fairly Trained in a blog post announcing the certification of KL3M and four other entities: Voicemod, which offers AI speech and singing models; music companies Infinite Album and Lemonaide; and AI-driven group Frostbite Orckings.
How was KL3M trained?
According to Katz, who spoke in a brief telephone interview today, 273 Ventures has since its inception been “painstakingly collecting data that would be not problematic” from sources including U.S. government document releases and old legal filings, all in the public domain.
“We weren’t sure that you could do such a thing [training an AI model] without using enormous amounts of copyrighted information,” said Katz. “We thought it would be possible in at least a certain scope to have success, particularly in the legal, financial, and regulatory arenas where there is a reasonably large amount of material that does not have copyright on it.”
Katz noted that not all of these industries offer uniform public domain documents and that it varies dramatically by country — for example, in the UK, some governmental entities or agencies can exert Crown Copyright over documents and data they produce.
A big part of the early months of 273 Ventures was sorting out which documents and data could be used to train KL3M without infringing or even risking infringement. That data was itself eventually bundled into a product as well, the Kelvin Legal DataPack, which contains more than 150 billion tokens and was released in August 2023.
KL3M, for its part, was trained on a “high-quality, curated English subset of the Kelvin Legal DataPack,” including a manual review of 10,000 documents and “a dataset with approximately 350 billion tokens.” 273 Ventures describes its training regime for KL3M in more detail here.
The results are, so far, two versions of KL3M: kl3m-170m with 170 million parameters (the attributes that govern an AI model) and the larger kl3m-1.7b with 1.7 billion parameters. The smaller kl3m-170m is less performant but can run on hardware as low-powered and cheap as a MacBook Air with an M1 chip, compared to the Nvidia RTX 4060 8GB GPU required for the larger model (and many other competing LLMs).
Chart comparing the two versions of KL3M from 273 Ventures. Credit: 273 Ventures.
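To give a rough sense of why the two parameter counts map to such different hardware, here is a back-of-the-envelope sketch of the memory needed just to hold each model's weights. It assumes 2 bytes per parameter (16-bit precision), which is an illustrative assumption and not something 273 Ventures has stated, and it ignores activations and other runtime overhead. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate memory (in GB) needed to store a model's weights.

    bytes_per_parameter=2 assumes 16-bit weights; this is an assumption
    for illustration, not a published KL3M specification.
    """
    return num_parameters * bytes_per_parameter / 1e9


# Parameter counts as reported in the article.
models = {
    "kl3m-170m": 170_000_000,
    "kl3m-1.7b": 1_700_000_000,
}

for name, params in models.items():
    print(f"{name}: roughly {weight_memory_gb(params):.2f} GB of weights")
```

Under these assumptions, the larger model's weights alone come to a few gigabytes, which is consistent with it needing a dedicated GPU with 8GB of memory while the smaller model fits comfortably on a laptop.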
273 Ventures is also preparing to release a 3.7-billion parameter variant of KL3M next month.
What is KL3M good for and how much does it cost?
On its product webpage, KL3M is advertised as helpful for “drafting and revising time entries and invoices, drafting and revising contract clauses, drafting and revising SEC filings like 10-K and 8-K report sections, [and] drafting obvious patents…”
Though designed with law firms and the legal industry in mind (where customers are especially sensitive to questions of data provenance and legality), Katz said he was actually shocked by how well KL3M generalizes beyond this target sector.
“Just think about it this way: the law touches on pretty much every topic in society,” Katz explained. “And governments put out a lot of source material that teaches you concepts and the use of language…I’m a little personally surprised, but it really does have a broader reach than we would have thought.”
When initially announcing the model last month, 273 Ventures produced several charts benchmarking KL3M against other models in its class. It found that the 1.7-billion parameter version had lower (and thus better) perplexity, a measure of how poorly a model predicts the next token, than 10 other leading models, including GPT-2 Large and openllama3b_v2, at least in writing legal material and Wiki entries.
Chart showing KL3M’s performance on perplexity benchmark compared to other AI models named. Credit: 273 Ventures.
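For readers unfamiliar with the metric, perplexity is conventionally computed as the exponential of the average negative log-probability a model assigns to each token in a test text. The sketch below illustrates that standard definition with made-up probabilities. It is not 273 Ventures' benchmarking code, and the function name and sample values are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_negative_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_negative_log_prob)


# Hypothetical per-token probabilities from two imaginary models
# scoring the same four-token text.
confident_model = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
uncertain_model = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

print(perplexity(confident_model))  # about 2
print(perplexity(uncertain_model))  # about 10
```

The model that assigns higher probability to the tokens that actually appear gets the lower score, which is why lower perplexity is read as better.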
KL3M’s 1.7-billion parameter model also scored much lower (and better) on toxic outputs than other small models in its class, including Microsoft’s much-vaunted Phi-2.
Chart showing KL3M-1.7b’s performance in toxicity measurements compared to other AI models. Credit: 273 Ventures
Katz said the model is already in use among several law-firm customers, whom he declined to name for confidentiality reasons.
The cost of the model is also not publicly available, though Katz invited interested parties to email 273 Ventures for more information.