I first read about it on Reddit, followed shortly by the CANLIB-DATA Listserv, but as of today Google has a new search engine dedicated to research data sets, the cleverly-named Google Dataset Search.
The good: Surfacing this stuff is great! Google is using schema.org markup to discover datasets, and has a pretty extensive page on how it all works. Results link through to Google Scholar to show who has cited a dataset. Likely not comprehensive, but a good start. Oh wait, it's not very good at all: I thought it was linking via the DOI, but it's just some sort of keyword matching. I just found a declassified Los Alamos report from 1957 in the top spot that supposedly cites one of these datasets :-/ Right idea, totally the wrong approach.
The annoying: Just as with Google Scholar, there's no way to know exactly what is and isn't being indexed. It's also annoying not to have a count of the number of results. And I can't get the "share" button to work, but that may very well be something specific to my browser and one of its extensions, so not a huge deal right now.
The weird: Of course I did a search for MPOW, and "University of Calgary" auto-suggests to a record about our institutional repository, but nothing from within our IR. Do we not conform with schema.org? (entirely possible). Why is that link from the French version of the National Research Council Canada?
The bad: No filters or facets of any sort - boo!
I have already found a couple of datasets of interest, and one that eventually led through to a deleted dataset, making me wonder how fresh the index is.
Definitely one to watch!
I subscribe to The Daily, from Statistics Canada, and was intrigued earlier this week when they announced that Canadian Cancer Statistics 2018 would be released the following day. This was the first time I remembered seeing StatsCan announce that something would be released, rather than that something had been released. Also, I couldn't recall a time when they had pointed to another organization's content rather than their own.
I expected to find a simple spreadsheet or two containing numbers around cancer, so was confused to see that it seemed to be much more of a complete information package, including a Media Release page. So I checked it out, and I must say I was thoroughly impressed! This is a very well-written report, easily understood by laypeople, yet meaty enough to be of interest to physicians and others who have been paying a lot more attention to cancer than I have. Early on in the report they do an excellent job explaining how the stages of cancer (the main focus of this special report) are classified, and I now feel I have a much better understanding of this system. It's not hard; I just never paid attention.
The report itself is only about 25 pages (PDF), though it's double that including appendices, references and such. Each section of the report does an excellent job of describing what's being presented, why, and how to interpret the results. I was pleased to see lots of information, again well presented, at the end of the report on where to go for those statistics I had initially sought. They include a shorter version of this PPT presentation that describes how to use CANSIM, Statistics Canada's socio-economic database, which I'll probably borrow for other presentations I do.
So why am I bothering to come out of hibernation to write a blog post on cancer? Not for me, but for you. Well, all of us. I hadn't realized until reading this report that nearly 50% of Canadians (and possibly Americans / humans?) will be diagnosed with cancer in their lifetime (holy crap!), and that cancer is the leading cause of death in this country. But I also learned that regardless of the type of cancer, our chances of living are far greater if the cancer is diagnosed early, in stage I or II. Nobody likes tests, but hey, if you're approaching or are over the age of 50, do yourself and your loved ones a favour by asking your doctor what screening tests can be done. Here's some background on screening for Canadians and Americans.
One of my all-time favourite open source tools is Open Refine, which "is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data." Today I'm talking about that extending bit. One of the nice features of Open Refine is that it can run on Linux, Windows, or MacOS. I only have experience with the latter two, but am pretty sure what follows applies to Linux as well.
So here it is in a nutshell. When you install an extension in Open Refine, sometimes there will be an indication in the upper right of your Open Refine page that the extension is installed, but just as often there is no indication other than a new choice within a drop-down, or a new function that's now available in GREL. The first few extensions I ever installed happened to show up as a choice in the upper right, so I assumed that's how all extensions behaved. I'm embarrassed to tell you how much time I have wasted installing, uninstalling, then trying a different location, all in vain, as I restarted Refine only to find no shiny new extension greeting me in the upper right corner.
How many extensions do I have correctly installed here? (The pull-down only relates to the named extension you see)
Wrong - I have four installed! Until earlier today I too would've sworn it was only one, but these other three operate elsewhere within the program, either through a drop-down, or as part of GREL:
And how did I finally figure this out? I finally read the frickin' readme file within one of those extension folders and it explained all. Hope this helps someone else out there. I've gone ahead and clarified the instructions on how to install extensions as well.
As an aside, if you haven't looked at it lately, Open Refine seems to have a new lease on life, with a couple of recent point releases and a new push on looking for new contributors (even non-coders).
For the past couple of weeks I've been helping a student gather a collection of historical tweets, based on specific hashtags. You may be aware that Twitter's search API limits searches to only the past week or so, which sucks for those doing historical research. I'm still on the hunt for a tool that would allow me to easily gather older tweets, so please let me know if you have one! I tried Octoparse and Import.io, but neither ended up being reliable for this purpose.
At this point I reached out to Ed Summers, creator of the awesome command-line tool twarc, who suggested I try scraping with Webrecorder.io (pro tip: use the autoscroll button at the top). Webrecorder.io did a pretty decent job of capturing my historical search, but it took a fair amount of work. It seemed to bog down after a while, so I ended up chunking my searches into 2-3 day spans, which meant running the scrape about a dozen times to capture the entire event. And then I was knee-deep in learning about WARC files. I couldn't find a workable tool to extract either the full tweets or the tweet-ids from the WARC files, so simply having the entire search results at my disposal still didn't help me.
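In hindsight, one crude way to get tweet-ids out of captured pages, once you can get at the page text (warcio is one Python library for reading WARC records), is to regex for /status/ URLs. A rough sketch on made-up HTML:

```python
import re

def extract_tweet_ids(html):
    """Pull unique tweet ids out of any text containing twitter.com status URLs."""
    ids = re.findall(r'twitter\.com/[^/"]+/status(?:es)?/(\d+)', html)
    # de-duplicate while preserving first-seen order
    return list(dict.fromkeys(ids))

# Made-up sample standing in for the page text pulled out of a WARC record
sample = '''<a href="https://twitter.com/someuser/status/1234567890">tweet</a>
<a href="https://twitter.com/otheruser/status/9876543210">another</a>
<a href="https://twitter.com/someuser/status/1234567890">dupe</a>'''
print(extract_tweet_ids(sample))  # ['1234567890', '9876543210']
```

Not pretty, but a file of ids like this is exactly what the hydration tools further down want as input.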
Then I did what I should've done first. I searched to see if anyone had posted an archive of the tweets my student was looking for. And someone had. If ever you're doing social media research around crisis situations, you'll want to know about CrisisLex.org.
So now we almost had what we needed. But Twitter also says that if you've collected a pile of tweets, you can't post them for someone else to download; you can only post a file of the tweet-ids. Ed again has some good thoughts on this rule. This is why I needed to "hydrate" the tweet-ids contained in the CrisisLex files in order to get the actual details of the original tweets.
twarc does this, but I somehow screwed up my extraction of the tweet-ids from the .csv provided by CrisisLex, so it didn't seem to be working correctly for me. So I went crying back to Ed to see if he knew what I was doing wrong, and he showed me a screenshot of a tool he used to prove that it was all working for him.
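For the record, pulling the id column out of a CSV takes only a few lines of Python with the standard library; the column name and quoting below are assumptions, so check the header of the actual CrisisLex file:

```python
import csv, io

def extract_ids(csv_text, column="tweet_id"):
    """Pull the tweet-id column out of a CSV and return one bare id per line
    (the format twarc and Hydrator expect), stripping stray quotes and
    whitespace. The column name here is an assumption."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ids = [row[column].strip().strip("'\"") for row in reader]
    return "\n".join(ids)

# Invented sample rows mimicking a CrisisLex-style file
sample = "tweet_id,text\n'123456',hello\n'789012',world\n"
print(extract_ids(sample))
```

Had I done it this way instead of hand-mangling the file, the ids would have come out clean on the first try.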
Hey, if I can find a "program" to do something so I don't have to do it in the command line, I am one happy camper! Turned out he was using Hydrator, which works on OS X, Windows and Linux. You feed it a file of tweet-ids and it spits out a JSON file AND a CSV file of the actual tweets. Golden. My student is tickled and I can move on to something new, but with a wicked new tool at my disposal. I owe you several drinks, Ed!
As I did last year, I have captured all the tweets from this year's Semantic Web in Libraries conference using the TAGS tool for Google Sheets. As of this posting there are 630 tweets with the hashtag #SWIB, 573 of them unique. Last year had 1,736 tweets(!), with 1,633 of them unique.
Here's the archive for 2016. Have fun!
The other day Google Earth released updated historical imagery, so I thought it'd be neat to take a peek at two spots I know have grown a lot since 1985, the date of the earliest imagery.
I also wanted a quick and dirty way to make a before/after image slider, as shown in the above post, so I found JuxtaposeJS, which does just the trick! There are other options if you do want to get fancy.
So here's Calgary:
And here's Ft. McMurray:
From the original post, "Note that the new data is created by blending all Landsat/Sentinel 2 data for a whole year to remove clouds and snow cover. The result is that changes that happen on timescales less than a year, such as seasonal changes, will not be visible."
I could look at stuff like this all day!
Yesterday was a pretty good day for Open Data in Alberta. First, the City of Calgary went live with a Socrata version of their Open Data portal. Previously they had been running on some sort of Microsoft SharePoint (my guess) site, and it was not a particularly pretty or useful thing. Really looking forward to working with their data in this new format, and with sweet APIs.
Second up, the announcement of Open Data Areas Alberta (ODAA). From their landing page:
Open Data Areas Alberta (ODAA) is a new undertaking being spearheaded by Alberta Data Partnerships to put extensive data in the hands of those who can use it. Datasets from six key rural areas across Alberta is available for no cost to inspired entrepreneurs, SMEs and creative problem solvers. These will include earth observation, remote sensing, geospatial data, environmental data, and social and economic datasets from private industry and government.
Those six areas include Beaver Hills, Fox Creek, Taber, Fort McMurray, RMH Sylvan and Utikuma Lake. It's good, current (for the most part) data, too! Want 17GB of orthophotos of the Fort McMurray Oilsands? Here you go. Wonder why Taber grows such great corn? Maybe the Agricultural Regions of Alberta Soil Information Database (AGRASID) will tell the tale.
Mita Williams has an excellent blog post titled Why Libraries Should Maintain the Open Data of Their Communities. It's a long (for a blog post) but important read, and includes an excellent history of how Canadian government data has evolved, and how it compares (poorly) to US government data.
I've been making the same case for Libraries and Open Data myself this year, but in a much less eloquent and scholarly way :-) While Mita's post is based on slightly older research (2014), the bibliography is still a great place to learn pretty much all you need to know on the subject. Having similarly researched over the course of this year, I sadly don't think much has come out in the interim, except maybe Brian Jackson's 2015 article, The State of Canadian Library Data.
If you're at all interested in the role libraries can or should play when it comes to Open Data, you owe it to yourself to carve out some time to give Mita's post a read.
A few weeks ago two new data sets appeared on the Open Data Canada website, WD - Grants and Contributions over $25,000 and Project Geo-Information, which "provides geographical information on projects issued by or on behalf of Western Economic Diversification Canada." Not particularly well-named data sets, but that's another issue. I thought it would be a fun exercise to combine the two, since the geo information is kinda worthless without the actual project data to go with it.
First I imported both files into Google Sheets to see if I could merge the data sets there. I found an add-in called Merge Sheets that should've done it, but perhaps the files were too large, so no dice. Knock yourself out if you want to try (note the two sheet tabs at the bottom).
So I ended up using VLOOKUP in Excel to combine the two files. I always have a hard time wrapping my head around VLOOKUP, but I got it done. I had to save the result as a .csv in order to import it into other tools; saved as .xlsx it throws errors because the formula continues to reference a separate sheet. Maybe it would've behaved if I'd worked with the two sheets in a single workbook rather than as separate files. Anyhoo, the .csv worked when I then brought that new file into Google Fusion Tables for mapping fun.
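For anyone else allergic to VLOOKUP, the same one-to-one join takes only a few lines of Python; the column names and sample rows below are invented stand-ins for the two WD files:

```python
import csv, io

def vlookup_merge(left_csv, right_csv, key):
    """Join two CSVs on a shared key column: like VLOOKUP, but with no
    formulas left behind to break when you save as .csv."""
    right = {row[key]: row for row in csv.DictReader(io.StringIO(right_csv))}
    merged = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        row.update(right.get(row[key], {}))  # unmatched keys just keep the left columns
        merged.append(row)
    return merged

# Hypothetical sample data standing in for the grants and geo files
grants = "project_id,recipient,amount\nP1,Acme Co,50000\nP2,Foo Ltd,75000\n"
geo = "project_id,lat,lon\nP1,51.05,-114.07\nP2,53.55,-113.49\n"
for row in vlookup_merge(grants, geo, "project_id"):
    print(row["recipient"], row["lat"], row["lon"])
```

The result is a plain list of flat rows, ready to write back out as a single .csv for Fusion Tables or anything else.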
And the end result is:
Seeing it on a map immediately makes me wonder why some big money is being spent in Ontario. I also immediately saw that one record had an incorrect lat/lon assigned to it, putting a BC project in ON, but I fixed that, so it doesn't appear on the above map.
Next up, if I make the time, is to assign a different icon to different types of projects, or perhaps different sizes / colours for different grant amounts. And explore what other online mapping tools might make this easy to work with now that the full data set exists.
A new data set was recently released by Agriculture and Agri-Food Canada, Land Use for 2010. As I was poking at it I saw that this data set in turn leads to an interactive map in which one can explore not only the 2010 data set, but also 2000 and 1990. As I was poking at that, I saw that one of the land-use classifications was for "settled areas", and I thought it would be interesting to see how certain spots have changed over the past 20 years covered in the data set. Knowing that Fort McMurray has grown a bit lately I zoomed in there and captured the following changes, which I found flabbergasting! I'm not familiar with that part of the Province, but have always heard "Fort McMurray" and "the tar sands" together, so hadn't realized the tar sands are actually a fair bit north of Ft. McMurray. So that's what you're seeing here, the growth of the Alberta Tar Sands over the past 20 years. Wow.
Two seemingly-contradictory open data stories crossed my path today.
First, The Scholarly Kitchen notes that Scientific Reports is On Track To Become Largest Journal In The World, and one of the reasons may be that PLOS ONE, the current Largest Journal, requires
authors to make “all data underlying the findings described in their manuscript fully available without restriction, with rare exception.” All PLOS manuscripts must include a data availability statement and authors are strongly encouraged to make their data available in a public archive before publication. In contrast, Scientific Reports’ policy merely states that authors should share upon request.
So as recently noted in the New England Journal of Medicine, apparently scientists just really don't wanna be bothered to share the data upon which their publications are based. FFS.
A couple of great quotes from the announcement: “The Libraries supports the freedom of inquiry of our patrons and is dedicated to providing our community with access to information. We are keen to contribute to our community, and the broader open data movement, by allowing all citizens to access, share and reuse data that we produce,” and “While we can’t anticipate how people will make use of this open data set, we are excited to provide access to it in support of continued creativity and innovation within our community.”
Was this a burdensome task to make this data available? Rumour around here is that UofA has been working for a couple of years to remove confidentiality clauses from their subscription agreements, so yeah, it was, but it's still worth it! See scientists, that's how to do it.
Which other libraries are making this type of data open, anyone know?
Last week I was intrigued when this paper came out: Trust, tribalism and tweets: has political polarization made science a “wedge issue”?. In it, "Helmuth and his Northeastern colleagues analyzed the Twitter accounts of U.S. senators to see which legislators followed research-oriented science organizations, including those covering global warming. Democrats, they found, were three times more likely than Republicans to follow them, leading the researchers to note that “overt interest in science may now primarily be a ‘Democrat’ value.”" Interesting approach, I thought. I was very pleased to learn that the data sets for the article are available at Northeastern University's data repository (yay open science!).
Then for fun I thought I'd check the datasets to see whether any US senators followed Edward Snowden. None did! Then I realized that's 'cause the data sets are from February 2015, and @Snowden didn't appear on twitter until the fall of 2015! Doh!
Now I'm on a rabbit hunt to figure out how to compare @Snowden's 2.12 million followers to see if any of them are US senators. Since I had the list of senators from the aforementioned data sets, I thought it would be easiest to just get a list of all @Snowden's followers and compare the two. Surely somebody has a utility that will allow me to download all 2.12 million followers, right? Not so much :-(
I started with something in PHP I found on GitHub that sounded absolutely perfect, but it flat out wouldn't work, and so far no feedback from the developer. Next I checked with Nick Ruest to see if twarc might be able to do it, and he suggested I look at tweepy (Python) instead. That looked really promising, except I couldn't figure out how to work around the rate limits Twitter imposes on gathering such a large follower list. Next up, a random kind stranger suggested a program written in Ruby called simply "t". I now LOVE this program, in part because it's very simple and very well documented. But it still falls short because of the rate limits. :-( And then my knight in shining armour, Ed Summers, came through with some real-life working examples in tweepy! I'm now 3 hours into gathering the full list of @Snowden's followers, and napkin math suggests I'm only halfway through. Fingers crossed for when I come back in on Monday!
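The trick in those working examples, as I understand it, is simply to sleep out the rate-limit window instead of dying (in tweepy 3.x, passing wait_on_rate_limit=True to tweepy.API does this for you). Here's the paging logic sketched with a stubbed fetch function standing in for Twitter's cursored followers/ids endpoint, since the real call needs API credentials:

```python
import time

def fetch_follower_page(cursor):
    """Stub standing in for Twitter's followers/ids call, which returns up to
    5,000 ids per page plus a next_cursor (0 means no more pages)."""
    pages = {
        -1: (list(range(0, 5000)), 10),    # first page
        10: (list(range(5000, 7500)), 0),  # last page
    }
    return pages[cursor]

def all_followers(fetch, sleep=time.sleep):
    """Walk a cursored endpoint page by page, pausing between calls so the
    rate limiter never cuts us off."""
    ids, cursor = [], -1
    while cursor != 0:
        page, cursor = fetch(cursor)
        ids.extend(page)
        sleep(0)  # against the real API you'd wait out the 15-minute window here
    return ids

print(len(all_followers(fetch_follower_page)))  # 7500
```

At 5,000 ids per call and 15 calls per 15-minute window, 2.12 million followers works out to over 400 calls, i.e. roughly seven hours of mostly waiting, which is in the same ballpark as my napkin math.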
I'd already spent too much time on this, though I've learned tons along the way and will be able to use a lot of it in the future. But I still wanted to find out which US senators followed @snowden. Back to t, where I found the command "t does_follow". Hell, there's only 100 senators, so I ended up doing it manually, which in the end only took about 10-15 minutes. And here's what I got for each and every one:
ppival$ t does_follow chriscoons snowden
No, @chriscoons does not follow @snowden.
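Once the follower harvest finishes, the same check could be done in one shot with a set intersection. A minimal sketch (the handles below are invented, and a real comparison would match on numeric user ids rather than screen names):

```python
# Hypothetical screen names; in practice the senator list would come from the
# paper's dataset and the follower list from the tweepy harvest.
senators = {"chriscoons", "senatorx", "senatory"}
snowden_followers = {"somefan", "anotherfan", "yetanotherfan"}

# Senators who follow @snowden = the intersection of the two sets
overlap = senators & snowden_followers
print(sorted(overlap))  # [] -> same answer as the manual check: nobody
```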
Talk about an echo chamber! I guess if he's still officially a traitor it looks bad for a senator to be acknowledging him? Sure, if he says something useful I'm sure someone would share it with the VIP, but c'mon, you can't even hear what he has to say or acknowledge his existence by following him on twitter? That's pretty lame, IMHO. btw, neither Hillary nor Donald follows @snowden either, nor do they follow each other. Whatever.
Thanks also to John Brosz and Andrew Pasterfield for helping me work through code.
My colleague Christie pointed me to this interesting bit of research, Survey of scholarly communication tool usage. I'm pretty sure I didn't contribute to the original survey, but over 20,000 scholars did! The survey was run by two librarians from Utrecht University Library, Bianca Kramer and Jeroen Bosman. I've only started to poke at the data, but I'm so very impressed at how much of the data they've made available for exploration, and the tools they've used to help visualize the data:
The data are available in various ways:
- DATA: The full dataset of the Innovations in Scholarly Communication Survey with raw (anonymized) and cleaned data files, a list of variables and the original survey questionnaire is available through Zenodo.
- DASHBOARD: An interactive Silk dashboard is available to allow easy filtering and comparisons.
- DESCRIPTION: Description of the dataset is in a Data Note publication on F1000 Research.
- SCRIPT SHARING: We are still working to facilitate shared analysis through a platform for script sharing and execution. This is most useful for those who want to do their own research.
Here are some results to whet your appetite:
Definitely worth poking around - find out what the researchers in the disciplines you support are using to search, access, alert, read, analyze, share, write, manage references, archive, and publish!
I'm about halfway through an excellent Coursera MOOC called Getting and Cleaning Data, part of my plan to learn R. We were just shown a slide listing a bunch of collections where we could find data sets to play with, and the first one on the list had disappeared. Turns out it was originally created by Hilary Mason, who apparently is someone I should know about in the data world! Hilary used to be the chief scientist at bitly.com, and the disappearance of this excellent list makes me wonder if they had a falling out. Regardless, and even though it's available on Archive.org, I thought I'd recreate it myself as a Delicious Tag Bundle, so I did. I added annotations to many of them, weeded out one or two that were really dead and gone, and added a couple as well, so it's not quite canonical ;-) Of course Delicious has been pretty flaky of late, so fingers crossed IT doesn't disappear!