Deep learning for single-molecule science

Tim Albrecht; Gregory Slabaugh; Eduardo Alonso; S M Masudur R Al-Arif

doi:10.1088/1361-6528/aa8334

Nanotechnology. 2017 Oct 20;28(42):423001. doi: 10.1088/1361-6528/aa8334. Epub 2017 Aug 1.

Authors

Tim Albrecht¹, Gregory Slabaugh, Eduardo Alonso, S M Masudur R Al-Arif

Affiliation

¹ Department of Chemistry, Imperial College London, Exhibition Road, London SW7 2AZ, United Kingdom.

PMID: 28762339

Abstract

Exploring and making predictions based on single-molecule data can be challenging, not only because of the sheer size of the datasets, but also because a priori knowledge about the signal characteristics is typically limited and the signal-to-noise ratio is poor. For example, hypothesis-driven data exploration, informed by an expectation of the signal characteristics, can lead to interpretation bias or loss of information. Equally, even when the different data categories are known, e.g., the four bases in DNA sequencing, it is often difficult to know how to make the best use of the available information content. The latest developments in machine learning (ML), so-called deep learning (DL), offer interesting new avenues to address such challenges. In some applications, such as speech and image recognition, DL has been able to outperform conventional ML strategies and even human performance. However, to date DL has not been applied much in single-molecule science, presumably in part because relatively little is known within the field about the 'internal workings' of such DL tools. In this Tutorial, we attempt to illustrate, in a step-by-step guide, how one such tool, a convolutional neural network (CNN), may be used for base calling in DNA sequencing applications. We compare it with a support vector machine (SVM) as a more conventional ML method, and discuss some of the strengths and weaknesses of the approach. In particular, a 'deep' neural network has many features of a 'black box', which has important implications for how we look at and interpret data.
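
As a rough illustration of the kind of model the abstract refers to, the following minimal sketch runs the forward pass of a very small one-dimensional CNN that maps a signal trace to probabilities over the four bases. It is not the architecture used in the Tutorial: the layer layout (one convolution, ReLU, global max pooling, one dense layer, softmax), the kernel count and width, the random untrained weights, and the synthetic Gaussian "signal" are all assumptions made only so that the example is self-contained.

```python
import math
import random

BASES = ["A", "C", "G", "T"]  # the four output categories


def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation of a signal with a single kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(signal, kernels, weights, biases):
    """Convolution -> ReLU -> global max pool -> dense layer -> softmax."""
    # One pooled feature per kernel: the strongest rectified response.
    features = [max(relu(conv1d(signal, k))) for k in kernels]
    # Dense layer: one score per base.
    scores = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical, untrained parameters (random values, fixed seed).
rng = random.Random(0)
n_kernels, kernel_width = 3, 5
kernels = [[rng.uniform(-1, 1) for _ in range(kernel_width)]
           for _ in range(n_kernels)]
weights = [[rng.uniform(-1, 1) for _ in range(n_kernels)] for _ in BASES]
biases = [0.0] * len(BASES)

# Synthetic stand-in for a measured trace; real data would replace this.
signal = [rng.gauss(0.0, 1.0) for _ in range(50)]

probs = forward(signal, kernels, weights, biases)
call = BASES[probs.index(max(probs))]
print(dict(zip(BASES, probs)), "->", call)
```

Because the weights are random, the predicted base carries no meaning here; in practice the kernels and dense weights would be fitted to labelled traces, and the intermediate features are exactly the part that is hard to interpret, which is the 'black box' aspect noted above.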