A New Model Challenges Big Tech’s Core AI Argument

A new language model trained only on public-domain and openly licensed data is challenging the claim, put forward by leading AI companies, that capable models cannot be built without copyrighted material. The result is prompting renewed debate within the tech industry over current training practices and their implications.
Challenging Industry Assumptions
A longstanding assertion by major tech players is now under scrutiny. Until recently, the idea that a high-performing AI language model can only be built by drawing on copyright-protected content went largely uncontested. A group of researchers from fourteen international institutions, including MIT, Carnegie Mellon, and the Vector Institute, set out to test that supposed impossibility. Their experiment yielded a language model that does not rival today’s industry leaders in raw power but marks a significant step toward ethically sourced training data.
A Herculean Effort with Open Data
At the heart of this endeavor lies an enormous dataset drawn solely from public-domain or openly licensed sources. Compiling more than 8 terabytes of documents, including 130,000 books from the Library of Congress, was no simple feat. Automation alone was not enough: much of the material demanded manual annotation, legal verification, and painstaking review. As researcher Stella Biderman put it, this was “incredibly tedious work,” with every website’s licensing terms requiring close examination to ensure compliance.
The Results: A Modest Yet Significant Achievement
This ethically trained LLM has seven billion parameters, the same scale as Llama 2-7B, released by Meta in 2023. It does not reach the latest performance peaks, but it stands on par with major commercial models from just two years ago. Notably, the research team refrained from publishing exhaustive benchmarks against today’s top-tier AI systems, emphasizing a demonstration of feasibility over direct competition.
Sparking Legal and Ethical Debate
Of course, progress came at a cost: building the model took longer and required more effort than approaches reliant on freely scraped content—at least for now. Industrial scalability remains elusive. Nevertheless, this project chips away at a central argument frequently invoked in regulatory debates over artificial intelligence. While its immediate impact on prevailing business models may be limited, it seems likely that these findings will resurface as evidence in future high-profile legal cases involving AI development and copyright law. In an industry often dominated by technical bravado and sweeping declarations, this study’s nuanced approach could prove unexpectedly influential.