The Company Quietly Funneling Paywalled Articles to AI Developers
The Atlantic / Alex Reisner / Nov 5, 2025

“A search for nytimes.com in any crawl from 2013 through 2022 shows a ‘no captures’ result, when in fact there are articles from nytimes.com in most of these crawls.”

Common Crawl’s massive internet archive may be giving AI companies access to paywalled journalism, according to a new report.

The Nonprofit Doing the AI Industry’s Dirty Work
“The web archive Common Crawl has been quietly funneling paywalled articles to AI companies—and lying to publishers about it.”

The Common Crawl Foundation is little known outside of Silicon Valley. The Common Crawl Foundation has been scraping the internet for over a decade, creating a vast archive used by AI companies to train models, including paywalled content. In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites.
And the foundation appears to be lying to publishers about this—as well as masking the actual contents of its archives.