A New Model Challenges Big Tech’s Core AI Argument

By 24matins.uk, published 7 June 2025 at 17:45, updated on 7 June 2025 at 17:45.


Researchers have trained a language model entirely on public-domain and openly licensed data, challenging the claim by leading AI companies that capable models cannot be built without copyright-protected content and prompting renewed debate within the tech industry over current approaches.

Tl;dr

Researchers built an AI model using only open data.

Model matches commercial AI from two years ago.

Challenges industry claims about copyright necessity.

Challenging Industry Assumptions

A longstanding assertion by major tech players is now under scrutiny. Until recently, the idea that a high-performing AI language model cannot be built without copyright-protected content went largely uncontested. A group of researchers from fourteen international institutions, including MIT, Carnegie Mellon, and the Vector Institute, set out to test this supposed impossibility. Their experiment yielded a language model that does not rival today’s industry leaders in raw power but marks a significant step toward more ethically sourced training data.

A Herculean Effort with Open Data

At the heart of this endeavor lies an enormous dataset drawn solely from public-domain or openly licensed sources. Compiling more than 8 terabytes of documents, including 130,000 books from the Library of Congress, was no simple feat. Automation alone was not enough; much of the material required manual annotation, legal verification, and painstaking review. As researcher Stella Biderman put it, this was “incredibly tedious work”, with every website’s licensing terms requiring close examination to ensure compliance.

The Results: A Modest Yet Significant Achievement

This ethically trained LLM has seven billion parameters, the same scale as Llama 2-7B, released by Meta in 2023. It does not reach the latest performance peaks, but it stands on par with major commercial models from just two years ago. Notably, the research team refrained from publishing exhaustive benchmarks against today’s top-tier AI systems, emphasizing the demonstration of feasibility over direct competition.

The project runs counter to two recent statements made on the industry’s behalf:

OpenAI told a UK parliamentary committee that such development would be “impossible without protected content”.

An expert witness for Anthropic similarly asserted that “LLMs likely wouldn’t exist if every work needed licensing”.

Sparking Legal and Ethical Debate

This progress came at a cost: building the model took longer and required more effort than approaches that rely on freely scraped content, at least for now. Industrial scalability remains elusive. Nevertheless, the project chips away at a central argument frequently invoked in regulatory debates over artificial intelligence. While its immediate impact on prevailing business models may be limited, these findings seem likely to resurface as evidence in future high-profile legal cases involving AI development and copyright law. In an industry often dominated by technical bravado and sweeping declarations, this study’s nuanced approach could prove unexpectedly influential.
