Biography
Jason Baumgartner is a data scientist who created Pushshift, the large-scale Reddit data-collection and archiving service whose archive became the de facto standard corpus for academic research about Reddit. He launched Pushshift in 2015 to address the limitations of Reddit's official API for systematic study; the service continuously ingested submissions and comments and made them available through searchable APIs and downloadable monthly data dumps.
The archive grew to encompass billions of comments and hundreds of millions of submissions across millions of subreddits, reaching back to the site's early years. The foundational reference is the peer-reviewed paper 'The Pushshift Reddit Dataset,' co-authored by Baumgartner with Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn and published at the International AAAI Conference on Web and Social Media in 2020. The dataset has since been cited widely in computational social-science work spanning disinformation analysis, extremism detection, health informatics, and online-governance research.
Pushshift was directly affected by Reddit's 2023 API changes. Reddit stated that Pushshift had violated Reddit's API terms and cut off the service's access, disrupting a resource relied on by academics, journalists, and moderators. A coalition of researchers and civil-society groups published an open letter warning that restricting access threatened public-interest research and online-safety work.
Unlike many figures in the field, Baumgartner is a practitioner and data-infrastructure builder rather than a tenured academic, but the dataset he assembled underpinned much of the era's empirical Reddit scholarship, including studies of deplatforming and mass bans. His work is a frequent reference point in debates over researcher access to platform data.