SPECTRa-T: machine-based data extraction and semantic searching of chemistry e-theses

Jim Downing; Matt J Harvey; Peter B Morgan; Peter Murray-Rust; Henry S Rzepa; Diana C Stewart; Alan P Tonge; Joe A Townsend

doi:10.1021/ci9003688

J Chem Inf Model. 2010 Feb 22;50(2):251-61. doi: 10.1021/ci9003688.

Authors

Jim Downing¹, Matt J Harvey, Peter B Morgan, Peter Murray-Rust, Henry S Rzepa, Diana C Stewart, Alan P Tonge, Joe A Townsend

Affiliation

¹ Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, UK.

PMID: 20088574

Abstract

The SPECTRa-T project has developed text-mining tools to extract named chemical entities (NCEs), such as chemical names and terms, and chemical objects (COs), e.g., experimental spectral assignments and physical chemistry properties, from electronic theses (e-theses). Although NCEs were readily identified within the two major document formats studied, only the use of structured documents enabled identification of chemical objects and their association with the relevant chemical entity (e.g., systematic chemical name). A corpus of theses was analyzed and it is shown that a high degree of semantic information can be extracted from structured documents. This integrated information has been deposited in a persistent Resource Description Framework (RDF) triple-store that allows users to conduct semantic searches. The strengths and weaknesses of several document formats are reviewed.
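
To make the idea of a semantic search over a triple-store concrete, the following minimal sketch stores a few subject-predicate-object triples in memory and looks up the chemical names associated with a given type of chemical object. It is not the SPECTRa-T implementation: the identifiers (thesis:1#cmpd3), the predicates (ex:hasName, ex:hasObject, ex:objectType), and the sample data are all hypothetical, and a real deployment would use a persistent RDF store and a query language rather than a Python list.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical triples; every identifier and predicate here is invented
# for illustration and does not come from the SPECTRa-T project.
TRIPLES: List[Triple] = [
    ("thesis:1#cmpd3", "ex:hasName", "2-methylpropan-1-ol"),
    ("thesis:1#cmpd3", "ex:hasObject", "thesis:1#nmr3"),
    ("thesis:1#nmr3", "ex:objectType", "1H NMR"),
    ("thesis:1#cmpd7", "ex:hasName", "benzoic acid"),
    ("thesis:1#cmpd7", "ex:hasObject", "thesis:1#mp7"),
    ("thesis:1#mp7", "ex:objectType", "melting point"),
]


def match(s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that agrees with the given pattern.

    A None argument acts as a wildcard for that position.
    """
    for subj, pred, obj in TRIPLES:
        if ((s is None or subj == s)
                and (p is None or pred == p)
                and (o is None or obj == o)):
            yield (subj, pred, obj)


def names_with_object_type(object_type: str) -> List[str]:
    """Return names of compounds linked to a chemical object of this type."""
    names: List[str] = []
    # Step 1: find chemical objects of the requested type.
    for chem_obj, _, _ in match(p="ex:objectType", o=object_type):
        # Step 2: find the compound each object is attached to.
        for compound, _, _ in match(p="ex:hasObject", o=chem_obj):
            # Step 3: read off the compound's name.
            for _, _, name in match(s=compound, p="ex:hasName"):
                names.append(name)
    return names


if __name__ == "__main__":
    print(names_with_object_type("1H NMR"))
```

The three nested pattern matches play the role of a join across triples, which is the step that depends on each chemical object having been associated with its chemical entity during extraction.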

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Academic Dissertations as Topic*
Chemistry / education*
Data Mining / methods*
Databases, Factual
Electronic Data Processing
False Positive Reactions
Software*