Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

Julie A McMurry; Nick Juty; Niklas Blomberg; Tony Burdett; Tom Conlin; Nathalie Conte; Mélanie Courtot; John Deck; Michel Dumontier; Donal K Fellows; Alejandra Gonzalez-Beltran; Philipp Gormanns; Jeffrey Grethe; Janna Hastings; Jean-Karim Hériché; Henning Hermjakob; Jon C Ison; Rafael C Jimenez; Simon Jupp; John Kunze; Camille Laibe; Nicolas Le Novère; James Malone; Maria Jesus Martin; Johanna R McEntyre; Chris Morris; Juha Muilu; Wolfgang Müller; Philippe Rocca-Serra; Susanna-Assunta Sansone; Murat Sariyar; Jacky L Snoep; Stian Soiland-Reyes; Natalie J Stanford; Neil Swainston; Nicole Washington; Alan R Williams; Sarala M Wimalaratne; Lilly M Winfree; Katherine Wolstencroft; Carole Goble; Christopher J Mungall; Melissa A Haendel; Helen Parkinson

doi:10.1371/journal.pbio.2001414

PLoS Biol. 2017 Jun 29;15(6):e2001414. doi: 10.1371/journal.pbio.2001414. eCollection 2017 Jun.

Authors

Julie A McMurry¹, Nick Juty², Niklas Blomberg³, Tony Burdett², Tom Conlin¹, Nathalie Conte², Mélanie Courtot², John Deck⁴, Michel Dumontier⁵, Donal K Fellows⁶, Alejandra Gonzalez-Beltran⁷, Philipp Gormanns⁸, Jeffrey Grethe⁹, Janna Hastings¹⁰, Jean-Karim Hériché¹¹, Henning Hermjakob², Jon C Ison¹², Rafael C Jimenez², Simon Jupp², John Kunze¹³, Camille Laibe², Nicolas Le Novère¹⁰, James Malone², Maria Jesus Martin², Johanna R McEntyre², Chris Morris¹⁴, Juha Muilu¹⁵, Wolfgang Müller¹⁶, Philippe Rocca-Serra⁷, Susanna-Assunta Sansone⁷, Murat Sariyar¹⁷, Jacky L Snoep¹⁸ ¹⁹, Stian Soiland-Reyes⁶, Natalie J Stanford⁶, Neil Swainston²⁰, Nicole Washington²¹, Alan R Williams⁶, Sarala M Wimalaratne², Lilly M Winfree¹, Katherine Wolstencroft²², Carole Goble⁶, Christopher J Mungall²¹, Melissa A Haendel¹, Helen Parkinson²

Affiliations

¹ Department of Medical Informatics and Epidemiology and OHSU Library, Oregon Health & Science University, Portland, Oregon, United States of America.
² European Bioinformatics Institute, European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom.
³ ELIXIR Hub, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom.
⁴ Berkeley Natural History Museums, University of California at Berkeley, Berkeley, California, United States of America.
⁵ Institute of Data Science, Maastricht University, Maastricht, the Netherlands.
⁶ School of Computer Science, The University of Manchester, Manchester, United Kingdom.
⁷ Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom.
⁸ Institute of Experimental Genetics, Helmholtz Centre Munich, German Research Center for Environmental Health, Neuherberg, Germany.
⁹ Center for Research in Biological Systems, University of California San Diego, La Jolla, California, United States of America.
¹⁰ Babraham Institute, Cambridge, United Kingdom.
¹¹ European Molecular Biology Laboratory, Heidelberg, Germany.
¹² Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark.
¹³ California Digital Library, Oakland, California, United States of America.
¹⁴ Science and Technology Facilities Council, Daresbury Laboratory, Warrington, United Kingdom.
¹⁵ Genomics Coordination Center, Department of Genetics, University Medical Center Groningen and Groningen Bioinformatics Center, University of Groningen, Groningen, the Netherlands.
¹⁶ Scientific Databases and Visualization at Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.
¹⁷ Institute for Medical Informatics, Bern University of Applied Sciences, Engineering and Information Technology, Bern, Switzerland.
¹⁸ Manchester Institute of Biology, University of Manchester, Manchester, United Kingdom.
¹⁹ Department of Biochemistry, Stellenbosch University, Stellenbosch, South Africa.
²⁰ Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals, University of Manchester, Manchester, United Kingdom.
²¹ Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America.
²² Leiden Institute of Advanced Computer Science, Leiden University, Leiden, the Netherlands.

Abstract

In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision, and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.

MeSH terms

Biological Science Disciplines / methods*
Biological Science Disciplines / statistics & numerical data
Biological Science Disciplines / trends
Computational Biology / methods*
Computational Biology / trends
Data Mining / methods*
Data Mining / statistics & numerical data
Data Mining / trends
Databases, Factual / statistics & numerical data
Databases, Factual / trends
Forecasting
Humans
Internet
Software Design*
Software*
