New York Times successfully removes copyrighted content from AI training dataset
NYT is concerned that AI models provide answers directly, diverting users from original sources
Many online content creators have become aware that tech companies have used their work to train AI models without permission or compensation. Some are now taking steps to address this issue.
The New York Times discovered that Common Crawl, one of the largest AI training datasets, contained links to its paywalled articles and other copyrighted content. Common Crawl has been accumulating web data since 2007 and has served as a foundation for training various large language models, including OpenAI's GPT-3. Approximately 12.5% of Google's Infiniset data comes from C4, a refined version of Common Crawl.
Although AI models benefit significantly from this training data, The New York Times has concerns. Because these models answer questions directly, drawing in this case on NYT's copyrighted content, they divert users from the original source of the information.
"We simply asked that our content be removed, and were pleased that Common Crawl complied with our request and recognized The Times's ownership of our quality journalistic content," Charlie Stadtlander, a spokesman at The New York Times, told Business Insider.
The New York Times reached out to the Common Crawl Foundation earlier this year to request the removal of its content from the dataset. In addition to complying, Common Crawl committed not to scrape any more content from The New York Times in the future, as detailed in a letter sent to the US Copyright Office.
The New York Times also found its paywalled articles and other copyrighted material in several other widely used AI training datasets. In a letter to the US Copyright Office, the NYT said that about 1.2% of a recreated version of WebText, the dataset previously used to train OpenAI's GPT-2, consisted of content from its publication.
It is unclear whether The New York Times has managed to get its content removed from WebText and other AI training datasets, Business Insider reports.