For the past couple of weeks I've been helping a student gather a collection of historical tweets, based on specific hashtags. You may be aware that Twitter's search API limits searches to only the past week or so, which sucks for those doing historical research. I'm still on the hunt for a tool that would allow me to easily gather older tweets, so please let me know if you have one! I tried Octoparse and Import.io, but neither ended up being reliable for this purpose.
At this point I reached out to Ed Summers, creator of the awesome command-line tool twarc, who suggested I try scraping with Webrecorder.io (pro tip: use the autoscroll button at the top). Webrecorder.io did a pretty decent job of capturing my historical search, but it took a fair amount of work. It seemed to bog down after a while, so I ended up chunking my searches into 2-3 day spans, which meant running the scrape about a dozen times to capture the entire event. And then I was knee-deep in learning about WARC files. I couldn't find a workable tool for extracting either the full tweets or the tweet-ids from the WARC files, so simply having the entire search results at my disposal still didn't help me.
Then I did what I should've done first: I searched to see if anyone had already posted an archive of the tweets my student was looking for. And someone had. If you're ever doing social media research around crisis situations, you'll want to know about CrisisLex.org.
So now we almost had what we needed. But Twitter's terms also say that if you've collected a pile of tweets, you can't post the tweets themselves for someone else to download; you can only post a file of the tweet-ids. Ed again has some good thoughts on this rule. This is why I needed to "hydrate" the tweet-ids contained in the CrisisLex files in order to get the full details of the original tweets.
twarc can do this, but somehow I botched my extraction of the tweet-ids from the .csv provided by CrisisLex, and hydration didn't seem to be working for me. So I went crying back to Ed to see if he knew what I was doing wrong, and he showed me a screenshot of a tool he was using to show that it was all working on his end.
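For what it's worth, the id-extraction step that tripped me up is simple in principle. Here's a minimal sketch, assuming your CrisisLex CSV has a `tweet_id` column (the exact column name and quoting vary by dataset, so check your file's header first); the sample data below is made up for illustration:

```python
import csv
import io

def extract_tweet_ids(csv_text, id_column="tweet_id"):
    """Pull the tweet-id column out of a CrisisLex-style CSV.

    The column name is an assumption -- inspect your file's header.
    Ids are sometimes wrapped in extra quotes to keep spreadsheets
    from mangling the long numbers, so keep only the digits.
    """
    ids = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get(id_column, "")
        digits = "".join(ch for ch in raw if ch.isdigit())
        if digits:
            ids.append(digits)
    return ids

# Made-up two-line sample in the shape described above
sample = 'tweet_id,text\n"\'266269932671606786\'",example tweet\n'
print(extract_tweet_ids(sample))  # → ['266269932671606786']
```

Write the resulting ids out one per line and you have exactly the kind of file that twarc or Hydrator expects as input.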
Hey, if I can find a "program" to do something so I don't have to do it on the command line, I am one happy camper! It turned out he was using Hydrator, which works on OS X, Windows, and Linux. You feed it a file of tweet-ids and it spits out a JSON file AND a CSV file of the actual tweets. Golden. My student is tickled and I can move on to something new, but with a wicked new tool at my disposal. I owe you several drinks, Ed!
Update: April 8, 2020 - see this post for tips on prepping a dataset to work within Hydrator.