Machines first, humans second: on the importance of algorithmic interpretation of open chemistry data

Alex M Clark; Antony J Williams; Sean Ekins

doi:10.1186/s13321-015-0057-7

J Cheminform. 2015 Mar 22;7:9. doi: 10.1186/s13321-015-0057-7. eCollection 2015.

Authors

Alex M Clark¹, Antony J Williams², Sean Ekins³

Affiliations

¹ Molecular Materials Informatics, 1900 St. Jacques #302, Montreal, H3J 2S1, QC Canada.
² Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC 27587 USA.
³ Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526 USA; Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010 USA.

Abstract

The current rise in the use of open lab notebook techniques means that an increasing number of scientists make chemical information freely and openly available to the entire community as a series of micropublications released shortly after the conclusion of each experiment. We propose that this trend be accompanied by a thorough examination of data sharing priorities. We argue that the most significant immediate beneficiaries of open data are in fact chemical algorithms, which are capable of absorbing vast quantities of data and using it to present concise insights to working chemists, on a scale that could not be achieved by traditional publication methods. Making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans rather than consumption by machine learning algorithms. We discuss some of the complex issues involved in fixing current methods, as well as some of the immediate benefits that can be gained when open data is published correctly using unambiguous machine-readable formats.

Graphical Abstract: Lab notebook entries must target both visualisation by scientists and use by machine learning algorithms.

Keywords: Cheminformatics; File formats; Machine learning; Open lab notebooks; Public data.