Training-Data Scraping
Legal Definition
Training-data scraping is the automated collection of content from a platform in order to train artificial-intelligence models, typically without a licence or the platform's permission. Reddit has become a focal point of disputes over the practice because its large archive of human conversation is highly valuable for training large language models, and because the company has chosen to license that data to some AI firms while pursuing those it accuses of taking it without authorisation.
In June 2025 Reddit sued Anthropic, alleging that Anthropic's automated systems had accessed Reddit content more than a hundred thousand times to train the Claude models, despite Anthropic having been told not to and without users' consent. The suit was framed around breach of Reddit's terms of service and unfair competition rather than copyright. Reddit has contrasted such conduct with the paid licensing deals it struck with companies including OpenAI and Google, which include privacy and content-deletion protections. The disputes matter because they test how far platforms can assert control over publicly visible but commercially valuable user content, and because the outcomes may shape the legal status of scraping for AI more broadly.