Pushshift
Technical Definition
Pushshift was a large-scale data service that continuously archived Reddit posts and comments and made them searchable by time range, keyword, author, and subreddit. Created by Jason Baumgartner, it captured historical Reddit content that the official API did not easily expose, and it became an essential resource for academic researchers, journalists, and especially moderators, who used it and tools built on it to investigate user history, detect ban evasion, and identify trolls and coordinated abuse.
Pushshift's public access was cut off during Reddit's 2023 API changes. Reddit stated that the service violated its API rules—noting its data had been used to train large language models—and revoked its access, after which Pushshift was no longer publicly available. The loss was a significant point of contention in the broader API controversy: researchers reported that work on topics such as radicalization, harassment, and online recovery communities was disrupted, and moderators lost archival tools they relied on to keep communities safe. Reddit later restored limited Pushshift access for vetted moderators through an application process, but the episode underscored how dependent Reddit's research and moderation ecosystem had become on third-party archives.
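The search dimensions named above (time range, keyword, author, subreddit) map naturally onto query parameters. The sketch below only builds a query URL and makes no network request, since the service is no longer publicly available. The base URL and the parameter names (`q`, `author`, `subreddit`, `after`, `before`, `size`) follow the historically documented public API as best recalled and should be treated as assumptions. `build_search_url` is a hypothetical helper, not part of any Pushshift client.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumption: base URL of the historical public endpoint (no longer publicly reachable).
BASE_URL = "https://api.pushshift.io/reddit/search"


def build_search_url(kind, keyword=None, author=None, subreddit=None,
                     after=None, before=None, size=25):
    """Build a Pushshift-style search URL (hypothetical helper).

    kind: "comment" or "submission".
    after / before: timezone-aware datetimes bounding the time range.
    """
    if kind not in ("comment", "submission"):
        raise ValueError("kind must be 'comment' or 'submission'")

    params = {}
    if keyword:
        params["q"] = keyword            # keyword search
    if author:
        params["author"] = author        # filter by username
    if subreddit:
        params["subreddit"] = subreddit  # filter by community
    if after is not None:
        params["after"] = int(after.timestamp())    # epoch seconds, lower bound
    if before is not None:
        params["before"] = int(before.timestamp())  # epoch seconds, upper bound
    params["size"] = size                # maximum number of results requested

    return f"{BASE_URL}/{kind}/?{urlencode(params)}"


# Example: comments mentioning "ban evasion" in one subreddit during 2022.
url = build_search_url(
    "comment",
    keyword="ban evasion",
    subreddit="AskReddit",
    after=datetime(2022, 1, 1, tzinfo=timezone.utc),
    before=datetime(2023, 1, 1, tzinfo=timezone.utc),
)
print(url)
```

Passing timezone-aware datetimes keeps the epoch conversion independent of the machine's local time zone, which matters when a time-range query has to be reproducible across researchers or moderation tools.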
Sources
- 01 The Pushshift Reddit Dataset, arXiv (Academic, 2020)
- 02 Pushshift API, GitHub (Official / Reddit, 2023)
- 03