“The popularity of generative AI has made the demand for content used to train models or run inference on skyrocket, and although some AI companies clearly identify their web scraping bots, not all AI companies are being transparent,” Cloudflare staff wrote in a blog post.
According to authors of the post, “Google reportedly paid $60 million a year to license Reddit’s user generated content, Scarlett Johansson alleged OpenAI used her voice for their new personal assistant without her consent, and most recently, Perplexity has been accused of impersonating legitimate visitors in order to scrape content from websites. The value of original content in bulk has never been higher.”
Last year, Cloudflare introduced a way for any of its customers, on any plan, to block specific categories of bots, including certain AI crawlers. These bots, said Cloudflare, observe requests in sites’ robots.txt files, and do not use unlicensed content to train their models, nor gather to feed for retrieval-augmented generation (RAG) applications.