Reddit comment scraper

Experimental-clinical psychologists have finally started to capitalise on naturalistic internet data sources for research. For example, recent papers have used data from the social news aggregator Reddit (e.g., Hipp et al., 2015) and the social networking site Tumblr (e.g., Cavazos-Rehg et al., 2016).

The AskReddit subreddit is a particularly useful source. Here, hundreds or thousands of people provide answers to single questions. In many cases these data may be difficult to come by through other means. For example, Hipp et al. (2015) “scraped” data from a thread asking rapists what their motivations were and whether they regretted it or not.

Luckily, Reddit provides access to their data freely through their API, and several libraries have been written to make accessing data via the API extremely easy.

I’ve used PRAW, an API wrapper written in Python, to write a script that allows researchers to easily emulate Hipp et al.’s data acquisition methodology. It scrapes all comments from a given thread and saves the resulting object as a Python pickle, then selects only the first-level comments and saves their contents as a .csv file. You can then feed this file into the qualitative or quantitative analysis of your choice, for example a sentiment analysis via TidyText in R.
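To give a rough idea of what this workflow looks like, here is a minimal sketch of the same steps. It is not the released script: the function names (fetch_comments, top_level, save_outputs), the chosen columns, and the decision to pickle plain dictionaries rather than PRAW objects are all my assumptions for illustration. It also assumes you have registered a Reddit app to obtain a client ID and secret, and it identifies first-level comments by checking that the parent ID starts with "t3_", the prefix Reddit uses for submissions.

```python
import csv
import pickle


def fetch_comments(thread_url, client_id, client_secret, user_agent):
    """Download every comment in a thread as a list of plain dicts.

    Hypothetical helper: credentials come from a Reddit app you register yourself.
    """
    import praw  # third-party; imported here so the helpers below work without it

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    submission = reddit.submission(url=thread_url)
    # Expand every "load more comments" stub; this can be slow on large threads.
    submission.comments.replace_more(limit=None)
    return [
        {
            "id": c.id,
            "parent_id": c.parent_id,
            "author": str(c.author) if c.author else "[deleted]",
            "score": c.score,
            "body": c.body,
        }
        for c in submission.comments.list()  # flattened: all levels of the tree
    ]


def top_level(comments):
    """Keep only first-level comments (those whose parent is the submission)."""
    return [c for c in comments if c["parent_id"].startswith("t3_")]


def save_outputs(comments, pickle_path, csv_path):
    """Pickle all comments, then write first-level comments to a .csv file."""
    with open(pickle_path, "wb") as f:
        pickle.dump(comments, f)

    first_level = top_level(comments)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["id", "author", "score", "body"],
            extrasaction="ignore",  # drop parent_id from the .csv
        )
        writer.writeheader()
        writer.writerows(first_level)
    return len(first_level)
```

In use, you would pass the thread URL and your own credentials to fetch_comments, then hand the result to save_outputs along with the two file paths you want. Keeping the download separate from the filtering and saving means the full pickle can be re-filtered later without hitting the API again.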

Download the current release or check out the project page if you’d like to see the most recent code or contribute to the project.