Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

Published

April 2, 2025

OpenAI has repeatedly faced accusations of training its AI on copyrighted content without permission. A new paper by an AI watchdog organization alleges that the company used non-public books it had not licensed to train its more advanced AI models.

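AI models act as prediction engines: trained on large amounts of data such as books, movies, and TV shows, they learn patterns and ways to extrapolate from them. Developers can supplement this real-world material with synthetic, AI-generated data, but training solely on synthetic data poses risks, such as degrading a model's performance.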

The paper from the AI Disclosures Project suggests that OpenAI may have trained its GPT-4o model on paywalled books from O’Reilly Media without a licensing agreement. The study used a method called DE-COP to detect copyrighted content in the training data of language models.

The paper's co-authors probed how well OpenAI models recognized content from O’Reilly Media books. GPT-4o showed greater recognition of paywalled content than older models such as GPT-3.5 Turbo.
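For readers curious how a test of this kind can work, the general idea behind multiple-choice membership checks such as DE-COP is to show a model one verbatim excerpt alongside several paraphrases and see how often it picks the original. The sketch below is a toy illustration only, not the paper's implementation: the "models" are stand-in functions rather than real language models, and every name in it (`guess_rate`, `make_memorizing_chooser`, the sample sentences) is hypothetical.

```python
import random
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical stand-in for a model: given the answer options, it returns
# the index of the option it believes is the verbatim excerpt.
Chooser = Callable[[Sequence[str]], int]
QuizItem = Tuple[str, List[str]]  # (verbatim excerpt, paraphrases)


def guess_rate(choose: Chooser, items: Sequence[QuizItem], seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim excerpt."""
    rng = random.Random(seed)
    correct = 0
    for verbatim, paraphrases in items:
        options = [verbatim, *paraphrases]
        rng.shuffle(options)  # hide the position of the right answer
        if options[choose(options)] == verbatim:
            correct += 1
    return correct / len(items)


def make_memorizing_chooser(seen: Iterable[str]) -> Chooser:
    """Toy 'model' that recognizes only the exact texts it was given."""
    memory = set(seen)

    def choose(options: Sequence[str]) -> int:
        for i, text in enumerate(options):
            if text in memory:
                return i
        return 0  # nothing recognized: fall back to the first option

    return choose


# Made-up sample data: one verbatim sentence plus three paraphrases each.
items: List[QuizItem] = [
    ("The cache is invalidated on every write.",
     ["Each write invalidates the cache.",
      "Every write causes the cache to be cleared.",
      "Writes always invalidate the cache."]),
    ("A mutex guards the shared counter.",
     ["The shared counter is protected by a lock.",
      "A lock protects the counter that threads share.",
      "Access to the shared counter is serialized by a mutex."]),
    ("Indexes speed up reads but slow down writes.",
     ["Reads get faster with indexes, while writes get slower.",
      "An index helps queries and hurts inserts.",
      "Indexing trades write speed for read speed."]),
]

seen_model = make_memorizing_chooser(verbatim for verbatim, _ in items)
unseen_model = make_memorizing_chooser([])

print("saw the text:  ", guess_rate(seen_model, items))
print("never saw text:", guess_rate(unseen_model, items))
```

With four options per quiz, a model with no exposure to the text should land near the 25% chance level over a large enough sample, while a rate well above chance hints that the excerpts may have been in its training data. Three items are far too few to draw conclusions; a real study would query an actual language model over many excerpts and apply more careful statistics.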

While the findings are not definitive proof, they raise concerns about OpenAI’s training practices. The company has been criticized for its approach to using copyrighted data and may face further legal challenges.

OpenAI does pay for some of its training data and has mechanisms in place for copyright owners to request content removal. Despite these efforts, the company continues to face scrutiny over its data practices.

As OpenAI navigates legal battles and public scrutiny, the allegations raised in the O’Reilly paper add to the ongoing debate surrounding AI ethics and copyright law.

OpenAI has not provided a response to the allegations at this time.