Best Practices for Data Workflows in Ag Science

R
data workflow
research reproducibility
Published August 7, 2018

Image credit: Glowa Volta

I was struggling with the data set. The first 50 rows were summary stats (…I think), and then the actual data started. The file was over 250 columns wide, composed largely of phenotypic traits. Perhaps one third of the columns were unique variables gathered over several years, but the variable names were inconsistent across years. Worse, there was no information on the variables, what they were measuring, or what scale they were on. The closest hint was a pattern of color coding applied to groups of columns: yellow, green, blue. There appeared to be partial duplication and/or combination of some columns, but we could not reconstruct exactly what had happened to the data. No one could recall the full history of the file, what the color coding meant, who had pre-processed the data, or what they had done. We did know that the raw data were long lost.

This was neither the first nor the last messy data set with an unknown history to complicate my ability to organize and analyze the information.

I ran into the same problems again and again with poorly managed data sets: missing variable descriptions, inconsistent names across years, undocumented pre-processing, and lost raw files.

There’s no way to sugarcoat it: management of raw experimental data and the analysis workflow is largely abysmal in the agricultural sciences, my academic home for 14 years. The process of cleaning, rearranging, filtering and preparing experimental data for downstream analyses is poorly described in most papers, if it is described at all. As many of us know, much of the data conditioning and preparation is done by students and technicians with little formal guidance or documentation of their work.

Since data sets can and should live on for decades, whether as part of long-term agricultural research, for meta-analyses, or for other uses, understanding how the raw data were parsed, cleaned and interpreted is important. Assumptions that made sense in the initial analysis may need re-evaluation later. Intermediate files can turn out to be useful. Organizing the analysis workflow helps researchers conduct their work in a regimented and reproducible way.
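
None of this requires heavy tooling. As a minimal sketch of what a scripted cleaning step can look like in R (the file paths, column names, and cleaning steps here are hypothetical), each transformation lives in a script: the raw file is read but never overwritten, and the intermediate result is saved for later re-use.

```r
# clean_field_data.R
# Read the untouched raw file, apply documented cleaning steps,
# and write an intermediate file. The raw data are never edited in place.
library(readr)
library(dplyr)

raw <- read_csv("data/raw/field_trial_2016.csv")  # hypothetical raw file

cleaned <- raw %>%
  rename(plant_height_cm = PltHt) %>%             # standardize an inconsistent name
  mutate(plant_height_cm = as.numeric(plant_height_cm)) %>%
  filter(!is.na(plant_height_cm))                 # drop rows missing the trait

# Keep the intermediate file so later re-analyses can start from it
write_csv(cleaned, "data/intermediate/field_trial_2016_cleaned.csv")

# Record the software environment used for this step
writeLines(capture.output(sessionInfo()), "data/intermediate/session_info.txt")
```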

Here are a few concrete recommendations for improving workflow:

- Keep the raw data untouched and archived; make all changes in copies or, better, in scripts.
- Script every cleaning, filtering and rearranging step rather than editing files by hand, so the full history of the data can be reconstructed.
- Maintain a data dictionary describing each variable, what it measures, and what scale or units it is on (a sketch follows this list).
- Use consistent variable names across years and files.
- Never encode information in formatting such as cell color; if it matters, give it a column of its own.
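
Given how much trouble the undocumented columns above caused, the data dictionary deserves special mention. A minimal sketch in R (the variables shown are hypothetical) builds a plain-text dictionary that lives beside the data:

```r
# make_data_dictionary.R
# One row per variable: what it measures, its units, and when it was collected.
# Stored as plain text beside the data so the meanings are never lost.
library(tibble)
library(readr)

dictionary <- tribble(
  ~variable,         ~description,               ~units,  ~years_collected,
  "plant_height_cm", "Plant height at maturity", "cm",    "2015-2018",
  "yield_kg_ha",     "Grain yield per plot",     "kg/ha", "2015-2018",
  "flowering_date",  "Date of 50% anthesis",     "date",  "2016-2018"
)

write_csv(dictionary, "data/data_dictionary.csv")
```

A spreadsheet tab or a plain CSV both work; what matters is that the dictionary travels with the data and is updated whenever a variable is added or renamed.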

The core reason I am interested in these issues is to improve reproducibility. As I dealt with the aforementioned data problems, I could not help but wonder whether I was making the right choices with the data, and I wished I had more information on the background of these legacy data sets. This is particularly important as meta-studies continue to gain in popularity. Unifying data sets and leveraging their combined power is one of the priorities for agricultural productivity identified by the National Academy of Sciences in their recent report, Science Breakthroughs to Advance Food and Agricultural Research by 2030.

For more information, see this excellent article by Kara Woo and Karl Broman; it’s short and gives easy-to-follow best practices for working with spreadsheets.

Scott Long at Indiana University put together an informative set of slides delving further into the principles of data workflow. Patrick Schloss, developer of the bioinformatics program mothur, recently published an article on the problem of reproducibility in microbiome research and how to fix it. Among other things, he recommended implementing robust workflow practices.