Machine Learning Model for Identifying Antioxidant Proteins Using Features Calculated from Primary Sequences

Luu Ho Thanh Lam; Ngoc Hoang Le; Le Van Tuan; Ho Tran Ban; Truong Nguyen Khanh Hung; Ngan Thi Kim Nguyen; Luong Huu Dang; Nguyen Quoc Khanh Le

doi:10.3390/biology9100325

Biology (Basel). 2020 Oct 6;9(10):325. doi: 10.3390/biology9100325.

Authors

Luu Ho Thanh Lam¹,², Ngoc Hoang Le³, Le Van Tuan⁴, Ho Tran Ban⁵, Truong Nguyen Khanh Hung¹,⁴, Ngan Thi Kim Nguyen⁶, Luong Huu Dang⁷, Nguyen Quoc Khanh Le¹,⁸,⁹

Affiliations

¹ International Master/PhD Program in Medicine, College of Medicine, Taipei Medical University, Taipei City 110, Taiwan.
² Children's Hospital 2, Ho Chi Minh City 700000, Vietnam.
³ Graduate Institute of Biomedical Materials and Tissue Engineering, College of Biomedical Engineering, Taipei Medical University, Taipei City 110, Taiwan.
⁴ Orthopedic and Trauma Department, Cho Ray Hospital, Ho Chi Minh City 700000, Vietnam.
⁵ Department of Pediatric Surgery, University of Medicine and Pharmacy, Ho Chi Minh City 700000, Vietnam.
⁶ School of Nutrition and Health Sciences, Taipei Medical University, Taipei City 110, Taiwan.
⁷ Department of Otolaryngology, University of Medicine and Pharmacy, Ho Chi Minh City 700000, Vietnam.
⁸ Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City 106, Taiwan.
⁹ Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City 106, Taiwan.

Abstract

Antioxidant proteins play an important role in many aspects of cellular life. They protect the cell and DNA from oxidative substances (such as peroxide, nitric oxide, and oxygen free radicals), which are known as reactive oxygen species (ROS). Free radical generation and antioxidant defenses are opposing factors in the human body, and a balance between them is necessary to maintain health. An unhealthy lifestyle or age-related degeneration can disrupt this balance, leading to more ROS than antioxidants and thereby damaging health. In general, the antioxidant mechanism is the combination of antioxidant molecules and ROS in a one-electron reaction. Computational models that promptly identify antioxidant candidates are essential for supporting antioxidant detection experiments in the laboratory. In this study, we propose a machine learning-based model for this prediction task, built from a benchmark set of sequencing data. The experiments were conducted using 10-fold cross-validation during training and were validated on three independent datasets. Different machine learning and deep learning algorithms were evaluated on an optimal set of sequence features. Among them, Random Forest achieved the highest performance in identifying antioxidant proteins. Our optimal model achieved an accuracy of 84.6%, with balanced sensitivity (81.5%) and specificity (85.1%), for antioxidant protein identification on the training dataset. The results on the independent datasets also showed that our model performs favorably compared with previously published works on antioxidant protein identification.
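
The evaluation terms used in the abstract (10-fold cross-validation, accuracy, sensitivity, specificity) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `binary_metrics` and `k_fold_indices` are hypothetical, the labels 1 (antioxidant) and 0 (non-antioxidant) are an assumed encoding, and the split shown is a plain unshuffled, non-stratified partition, since the abstract does not say how the folds were drawn.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for binary labels.

    Assumed encoding (not stated in the source): 1 = antioxidant, 0 = non-antioxidant.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        # Fraction of all proteins classified correctly.
        "accuracy": (tp + tn) / n if n else 0.0,
        # Fraction of true antioxidant proteins that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of non-antioxidant proteins correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


def k_fold_indices(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    for fold in range(k):
        test_idx = indices[fold::k]  # every k-th sample, offset by the fold number
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


if __name__ == "__main__":
    # Toy labels only; these are not data from the study.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
    print(binary_metrics(y_true, y_pred))
    for train_idx, test_idx in k_fold_indices(len(y_true), k=4):
        print(len(train_idx), test_idx)
```

In a real pipeline, a classifier such as Random Forest would be fitted on each training split and scored on the corresponding held-out fold, with the metrics then averaged across folds. When the classes are imbalanced, reporting sensitivity and specificity alongside accuracy shows whether the model favors one class.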

Keywords: Random Forest; antioxidant proteins; computational modeling; feature selection; machine learning; protein sequencing.
