AgTC and AgETL: open-source tools to enhance data collection and management for plant science research

Luis Vargas-Rojas; To-Chia Ting; Katherine M Rainey; Matthew Reynolds; Diane R Wang

doi:10.3389/fpls.2024.1265073

Front Plant Sci. 2024 Feb 21;15:1265073. doi: 10.3389/fpls.2024.1265073. eCollection 2024.

Authors

Luis Vargas-Rojas¹, To-Chia Ting¹, Katherine M Rainey¹, Matthew Reynolds², Diane R Wang¹

Affiliations

¹ Department of Agronomy, Purdue University, West Lafayette, IN, United States.
² Wheat Physiology Group, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico.

Abstract

Advancements in phenotyping technology have enabled plant science researchers to gather large volumes of information from their experiments, especially those that evaluate multiple genotypes. To fully leverage these complex and often heterogeneous data sets (i.e., those that differ in format and structure), scientists must invest considerable time in data processing, and data management has emerged as a major barrier to downstream application. Here, we propose a pipeline, comprising two newly developed open-source programs, to enhance data collection, processing, and management in plant science studies. The first, called AgTC, is a series of programming functions that generates comma-separated values (CSV) file templates for collecting data in a standard format using either a lab-based computer or a mobile device. The second series of functions, AgETL, executes the steps of an Extract-Transform-Load (ETL) data integration process in which data are extracted from heterogeneously formatted files, transformed to meet standard criteria, and loaded into a database. There, data are stored and can be accessed for data analysis-related processes, including dynamic data visualization through web-based tools. Both AgTC and AgETL are flexible enough to be applied across plant science experiments without programming knowledge on the part of the domain scientist, and their functions are executed in Jupyter Notebook, a browser-based interactive development environment. Additionally, all parameters are easily customized from central configuration files written in the human-readable YAML format. Using three experiments from research laboratories in university and non-governmental organization (NGO) settings as test cases, we demonstrate the utility of AgTC and AgETL to streamline critical steps from data collection to analysis in the plant sciences.
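
To make the Extract-Transform-Load idea concrete, the following is a minimal Python sketch of the general pattern. It is not the AgETL code or its API. It assumes two differently formatted CSV inputs, an in-memory SQLite database, and a hard-coded column mapping, whereas the abstract states that the real tools take their parameters from YAML configuration files. All names here (`COLUMN_ALIASES`, `extract`, `transform`, `load`, `run_pipeline`, and the column names) are hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical mapping from heterogeneous headers to standard column names.
# In a configurable tool this would come from a config file, not from code.
COLUMN_ALIASES = {
    "plot": "plot_id",
    "plot_id": "plot_id",
    "genotype": "genotype",
    "entry": "genotype",
    "height_cm": "plant_height_cm",
    "plant height (cm)": "plant_height_cm",
}

# Two illustrative inputs that hold the same kind of data in different formats.
FILE_A = "Plot,Genotype,Height_cm\n101,G1,85.5\n102,G2,90.0\n"
FILE_B = "plot_id,entry,Plant height (cm)\n201,G1, 88.0\n"


def extract(text):
    """Extract: read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: rename columns to standard names and cast height to float."""
    cleaned = []
    for row in rows:
        std = {}
        for key, value in row.items():
            name = COLUMN_ALIASES.get(key.strip().lower())
            if name is not None:  # columns without a known alias are dropped
                std[name] = value.strip()
        std["plant_height_cm"] = float(std["plant_height_cm"])
        cleaned.append(std)
    return cleaned


def load(conn, rows):
    """Load: append the standardized rows to a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations "
        "(plot_id TEXT, genotype TEXT, plant_height_cm REAL)"
    )
    conn.executemany(
        "INSERT INTO observations VALUES (:plot_id, :genotype, :plant_height_cm)",
        rows,
    )
    conn.commit()


def run_pipeline():
    """Run extract, transform, and load for each input; return the connection."""
    conn = sqlite3.connect(":memory:")
    for text in (FILE_A, FILE_B):
        load(conn, transform(extract(text)))
    return conn
```

Calling `run_pipeline()` returns a connection whose `observations` table holds the rows from both inputs under one set of column names, which is the property that makes downstream querying and visualization simpler.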

Keywords: data aggregation; data pipeline; data processing; database; extract-transform-load; plant phenotyping.

Grants and funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. LV-R was supported by a CONACYT graduate fellowship from the Mexican government. Funding for the experimental test cases was provided by HedWIC #DFs-19-0000000013 to MR and USDA NIFA #2022-67013-36205 to DW.