March 31, 2009
Why Not Harvest Citation Data Before Papers Are Published?
José H. Canós Cerdá, Eduardo Mena Nieto and Manuel Llavador Campos argue that citation analysis needs a transformation because of important shortcomings: incomplete coverage of publications, low accuracy of citation data, and the costly process of harvesting citation data after papers have been published. In What's Wrong with Citation Counts? (D-Lib Magazine, March/April 2009), the authors propose a brilliant alternative: acquiring citation data automatically from papers before they are published, storing the data in a Global Citation Registry, and making the Registry's data readily available for bibliometric analysis. From the article:
A look at the internal processes of both commercial and free citation management systems shows that citation harvesting, which uses costly techniques such as optical character recognition, machine learning, and others, starts after papers have been published. Notice, however, that in a high number of papers, citation information is generated, and hence can be collected, much earlier. In fact, most scientists prepare their papers using word processing systems that have accompanying bibliography management utilities. BibTeX, for instance, is able to generate bibliography lists in LaTeX documents from metadata stored in the so-called ".bib" files. Microsoft Office Word 2007 has a built-in bibliography manager, and users of earlier versions can manage their bibliographies using third-party applications such as EndNote or RefWorks. All these bibliography managers are aware of the citations included in papers, but such information is systematically discarded when the camera ready copies of the papers are sent for publication. At that time, citation records must be built again from scratch, which results in additional costs, errors and delays. Moreover, different companies maintain different citation databases with highly overlapping content, possibly in different formats that complicate interoperability.
Better management of the citation data throughout the lifecycle of a paper will improve data quality and significantly reduce the cost of citation generation. Instead of viewing scientific publishing as a number of disconnected activities, we claim that a framework should be defined for a global workflow, from document creation to publication, involving different actors who would participate collaboratively. Citation data would be generated only once – at the time of document creation – after which such data could flow from one activity to the next. Consequently, there would no longer be a need to harvest citation data again after a paper's publication. The citation records thus generated should be stored in a Global Citation Registry (GCR), maintained by independent organizations similarly to the way in which Internet domain names or ISBN codes are managed. As envisioned, the GCR would be freely accessible for queries; and updates to it would be made by the entities responsible for the publications of papers, that is, companies or organizations acting as publishers.
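To make the idea concrete, here is a minimal sketch of what collecting citation data at authoring time might look like for a LaTeX paper. It is not from the D-Lib article: the function names, the record fields and the regular expressions are assumptions made for illustration, and a real tool would need a proper BibTeX parser rather than pattern matching.

```python
import re

# Hypothetical patterns, chosen for illustration only.
# Matches \cite{a,b}, \citep[p. 3]{a} and similar commands.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the start of a .bib entry such as "@article{knuth84,".
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def cited_keys(latex_source):
    """Return citation keys in order of first appearance, without duplicates."""
    keys = []
    for group in CITE_RE.findall(latex_source):
        for key in group.split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys


def bib_entry_types(bib_source):
    """Map each entry key in the .bib text to its lower-cased entry type."""
    return {key: kind.lower() for kind, key in ENTRY_RE.findall(bib_source)}


def citation_records(paper_id, latex_source, bib_source):
    """Build one record per cited key; entry_type is None if the key is not in the .bib text."""
    entries = bib_entry_types(bib_source)
    return [
        {"citing": paper_id, "cited_key": key, "entry_type": entries.get(key)}
        for key in cited_keys(latex_source)
    ]
```

Records like these are the kind of data the authors suggest a publisher could pass along to the Global Citation Registry at publication time, instead of reconstructing them later from the printed reference list.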
[JH]