But if you’re not familiar with the AI industry and copyright law, you might be wondering: why would a company spend millions of dollars on books only to destroy them? Behind this odd legal exercise lies a more fundamental driver: the AI industry is insatiably hungry for high-quality text.
Competition for high-quality training data
To understand why Anthropic wanted to scan millions of books, it helps to know how AI researchers build large language models (LLMs): by feeding billions of words into neural networks. During training, the AI system processes the text repeatedly, building statistical relationships between words and concepts along the way.
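The idea of "building statistical relationships between words" can be illustrated with a deliberately tiny sketch. This is not Anthropic's pipeline or a real neural network; it is a toy bigram model that simply counts which word tends to follow which, which is the simplest version of the statistics an LLM fits over billions of words:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "billions of words" of training text.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigram transitions: how often each word follows another.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def most_likely_next(word):
    """Return the statistically most frequent next word in the corpus."""
    return transitions[word].most_common(1)[0][0]

# "cat" follows "the" twice; "mat" and "fish" only once each.
print(most_likely_next("the"))  # → cat
```

A real LLM replaces these raw counts with learned neural-network weights and conditions on long contexts rather than a single previous word, but the quality point carries over: the model can only reproduce the statistics of whatever text it was fed.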
The quality of the training data fed into the neural network directly affects the capabilities of the resulting AI model. Models trained on well-edited books and articles tend to produce more coherent, accurate output than those trained on low-quality text such as random YouTube comments.
Publishers legally control the content that AI companies desperately want, but AI companies don’t always want to negotiate licenses. The first-sale doctrine offered a workaround: once you buy a physical copy of a book, you can do whatever you like with that copy, including destroying it. In other words, buying physical books provided a legal path to the text inside them.
However, even when it’s legal, buying millions of books is expensive. So, like many AI companies before it, Anthropic initially took the quick and easy path. According to court filings, in its hunt for high-quality training data, Anthropic first chose to amass pirated digital books, avoiding what CEO Dario Amodei described as the “legal/practice/business” slog of complex licensing negotiations with publishers. But by 2024, Anthropic had grown “less gung ho” about using pirated ebooks for legal reasons and wanted a safer source.