One of the topics of discussion at last week's Summon Advisory Board was the status of de-duping records returned by Summon. On the face of it, it seems to be a simple issue: if the titles and authors match, throw the duplicate records out and you're good to go. The Summon technical team explained why it's a little harder than that.
As we know, Summon collects metadata from multiple sources, and thus might pick up the same citation from two, three, or more publishers or aggregators. The problem is that different sources will include different information, and do you really want the tool (Summon) deciding which information is important for you? I don't. Here's an example of a duplicate record:
There's no easy way to get a screen capture of the full record, so here's just the text of the first, then the second, and I've highlighted the unique components of each record.
Cloudy skies: assessing public understanding of global warming
Authors: Sterman, John D and Sweeney, Linda Booth
Publication Title: System Dynamics Review
Date: 22/2002
Volume: 18
Issue: 2
Pages: 207 - 240
ISSN: 0883-7066
DOI: 10.1002/sdr.242
Language: English

Cloudy skies: assessing public understanding of global warming
Authors: John D. Sterman and Linda Booth Sweeney
Abstract
Surveys show that most Americans believe global warming is real. But many advocate delaying action until there is more evidence that warming is harmful. The stock and flow structure of the climate, however, means wait and see policies guarantee further warming. Atmospheric CO2 concentration is now higher than any time in the last 420,000 years, and growing faster than any time in the past 20,000 years. The high concentration of CO2 and other greenhouse gases (GHGs) generates significant radiative forcing that contributes to warming... [PUBLICATION ABSTRACT]
Publication Title: System Dynamics Review
Date: 07/2002
Volume: 18
Issue: 2
Start Page: 207
ISSN: 0883-7066
Genre: Feature, Feature
Subjects: Studies, Public good, Global warming, Public policy, Emissions, United States, Experimental/theoretical, Social policy, Pollution control
Language: English
While the second record is obviously much more complete, the first one contains at least two pretty vital pieces of information the second is missing: the end page and the DOI.
So the problem the Summon team is working on is a way not so much to deduplicate, but to automatically merge these records. Progress is being made, but seeing it laid out like that made it more obvious why it's not just done already.
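To make the merge idea a bit more concrete, here's a minimal sketch in Python of what "merge rather than throw one out" could look like for the two records above. To be clear, this is not how Summon does it; I have no idea what their code looks like. The `merge_records` and `normalize` helpers and the "Title" key are made up for illustration, and I've left out the abstract and subjects to keep it short. The idea is simply to keep every field that appears in either record, and where the two disagree, flag the conflict instead of letting the tool silently pick a winner.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(str(value).lower().split())


def merge_records(first, second):
    """Union two metadata records (hypothetical helper, not Summon's method).

    Returns (merged, conflicts). Fields found in only one record are copied
    over. Fields that disagree keep the first record's value in `merged`,
    and both values are reported in `conflicts` for a human to review.
    """
    merged = {}
    conflicts = {}
    fields = list(first) + [f for f in second if f not in first]
    for field in fields:
        a, b = first.get(field), second.get(field)
        if a is None or b is None:
            merged[field] = a if a is not None else b
        elif normalize(a) == normalize(b):
            merged[field] = a
        else:
            merged[field] = a
            conflicts[field] = [a, b]
    return merged, conflicts


# The two records from this post, trimmed; "Title" is an assumed field name.
record_a = {
    "Title": "Cloudy skies: assessing public understanding of global warming",
    "Authors": "Sterman, John D and Sweeney, Linda Booth",
    "Publication Title": "System Dynamics Review",
    "Date": "22/2002",
    "Volume": "18",
    "Issue": "2",
    "Pages": "207 - 240",
    "ISSN": "0883-7066",
    "DOI": "10.1002/sdr.242",
    "Language": "English",
}
record_b = {
    "Title": "Cloudy skies: assessing public understanding of global warming",
    "Authors": "John D. Sterman and Linda Booth Sweeney",
    "Publication Title": "System Dynamics Review",
    "Date": "07/2002",
    "Volume": "18",
    "Issue": "2",
    "Start Page": "207",
    "ISSN": "0883-7066",
    "Genre": "Feature, Feature",
    "Language": "English",
}

merged, conflicts = merge_records(record_a, record_b)
print(sorted(conflicts))
```

Even this toy version shows where the trouble is. The DOI and page range carry over from the first record and the genre from the second, but the author strings and dates don't match exactly, so something (or someone) still has to decide which form to show.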