A week or so ago I came across a series of posts from Spencer Greenhalgh in which he described how he used R to geolocate a large collection of tweets gathered around the terrorist attacks in Paris and plot them on a map. The trouble is that very few users actually include geolocation information in their tweets, so Spencer had to figure out a way to grab an approximation of that information by mining each user's Twitter profile page for their stated home location. This wouldn't indicate where users were when they actually tweeted, but it gave him a good idea of where they usually tweeted from (assuming the profile page could be trusted), which was good enough for his experiments.
I've been playing with OpenRefine of late, so I thought I'd take a stab at the same experiment with that hammer. I figured out all the pieces but was unsuccessful in the end, and I can't tell you exactly why. I got Spencer's archive of tweets and was easily able to import it into OpenRefine. The problem came when I started to pull in the user profile pages: no matter how I sliced and diced the archive, the fetch would always time out on some section or another. The full data set contains just over 6,000 lines, so I tried breaking it in half by date, into smaller pieces by the client used to send the tweet, and by unique vs. duplicate users. I was able to get a full scrape of the user profile pages for about 600 users, but every other group I tried would simply hang. I tried all sorts of variations on the throttle delay, from 250 to 10,000 milliseconds, but the scrape would always stall at around 67% or 89%, depending on the set I was trying. Maybe it was a memory issue? Because it simply hung, I never got an error message to work with. If you have an idea what might've been the problem, I'd LOVE to know!
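For what it's worth, here's a minimal sketch of that same fetch step done outside OpenRefine, in Python with the requests library. The screen names here are made-up placeholders, and the explicit timeout at least makes a stuck request fail loudly instead of hanging the whole batch:

import time
import requests

# Made-up screen names; in practice these would come from the tweet archive.
users = ["someuser", "anotheruser"]

pages = {}
for name in users:
    try:
        # A hard timeout turns a silent hang into a visible error.
        resp = requests.get("https://twitter.com/" + name, timeout=30)
        if resp.status_code == 200:
            pages[name] = resp.text
    except requests.RequestException as err:
        print(name, err)
    time.sleep(1)  # throttle between requests, like OpenRefine's delay setting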
The GREL expression I eventually used to dig the user location out of the profile pages I could get was as follows:
value.parseHtml().select("span.ProfileHeaderCard-locationText")[0].toString()
That searches through the scraped profile page HTML for the CSS class around the user-stated location and just returns that bit of info, which would look like this:
<span class="ProfileHeaderCard-locationText u-dir" dir="ltr"> Atlanta, GA </span>
It did take me a while to figure this bit out, so for future reference I found this page most useful.
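If you'd rather do that selection outside OpenRefine, a rough Python equivalent using BeautifulSoup (my substitution, not part of the original workflow) looks like this:

from bs4 import BeautifulSoup

# The sample span from above, standing in for a full scraped profile page.
html = '<span class="ProfileHeaderCard-locationText u-dir" dir="ltr"> Atlanta, GA </span>'

soup = BeautifulSoup(html, "html.parser")
span = soup.select_one("span.ProfileHeaderCard-locationText")
if span is not None:
    print(str(span))  # the whole element, tags and all, like GREL's toString()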
So now we've got a column of messy data, which I cleaned with the following expression:
replace(value,/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/,'')
I found that one at https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML, and once run (plus a trim() tacked on the end to drop the leading and trailing whitespace) it leaves behind the nice clean string of text we're after:
Atlanta, GA
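The same strip-and-trim is easy enough in plain Python too, and a much simpler pattern than the wiki regex is plenty for this one span:

import re

# The span captured by the earlier select step.
span_html = '<span class="ProfileHeaderCard-locationText u-dir" dir="ltr"> Atlanta, GA </span>'

# Strip anything that looks like a tag, then trim the leftover whitespace.
clean = re.sub(r"<[^>]+>", "", span_html).strip()
print(clean)  # Atlanta, GA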
And from there it would be trivial to run that column through any number of free online mapping tools to generate a map of where users say they're from.
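As one example (my pick, the original posts don't name a tool), the geopy library can turn those strings into coordinates via the free Nominatim geocoder, which asks that you stay under one request per second:

from geopy.geocoders import Nominatim

# Nominatim requires a user_agent; this app name is made up.
geolocator = Nominatim(user_agent="tweet-location-map-demo")

location = geolocator.geocode("Atlanta, GA")
if location is not None:
    print(location.latitude, location.longitude)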
So the biggest headache was getting all the HTML from the user profile page. JSON would've been MUCH cleaner, but like Spencer I wanted to try to do this without using the Twitter API, so AFAIK I was stuck with HTML.
Fast forward to today, when I see Nick Ruest's post, A look at 14,939,154 #paris #Bataclan #parisattacks #porteouverte tweets. I was having trouble with 6,000 tweets, and Nick was churning through almost 15 million tweets with a far more appropriate tool called twarc.
twarc is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as a JSON object that is exactly what was returned from the Twitter API, and tweets are stored as line-oriented JSON. twarc runs in three modes: search, filter stream, and hydrate. In each mode, twarc will stop and resume activity to work within the Twitter API's rate limits.
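Going by twarc's documentation, the library side is about this simple. The credentials below are placeholders, and note that this route does use the Twitter API, which means the user's stated location comes back as a plain JSON field instead of scraped HTML:

from twarc import Twarc

# Placeholder credentials from a registered Twitter app.
consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

for tweet in t.search("#paris"):
    # Each tweet is the full JSON object returned by the API,
    # so the profile location is just a field lookup:
    print(tweet["user"]["screen_name"], tweet["user"]["location"])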
I had tried to work with twarc a few months ago but couldn't get it running on the shared server I was on. After seeing just what it's capable of, though, I will get it up and running, hog, frog, or dog!
My takeaway from this is that while there may be many different ways to solve a problem, they're certainly not all equal. Spencer stuck with what he knew and got something usable out of it. I tried with what I knew and failed, and Nick used what he knew and kicked both our butts. Spencer, do look at twarc too! :-)
Upward and onward!