Software developers craft the tools of modern science, and CyVerse provides the collaboration space where they can get creative, connect with end users, and publish their products. One such collaboration led to an open source software tool for ecologists everywhere, described in a recent special methods edition of the journal Ecosphere.
In spring of 2021, Ryan Bartelme, then a digital agriculture science analyst at CyVerse, was working with Tyson Swetnam, a University of Arizona research assistant professor at the BIO5 Institute and now a CyVerse co-principal investigator, and other UArizona recipients of a National Science Foundation grant for Harnessing the Data Revolution.
The award provided funding to investigate unanalyzed troves of publicly available ecological datasets hosted by the National Ecological Observatory Network, or NEON. The group found a set of soil microbiome datasets, for which an R software package for statistical computing had been created. However, it was difficult to run the package on different types of computers, which might not have compatible software versions installed.
Bartelme, who has a background in microbial ecology, offered to build a Docker software container housing all of the code to analyze the datasets in the NEON R package, so that it would work for anyone.
How to do data science, collaboratively
Meanwhile at the University of California, Santa Cruz, Clara Qin, a PhD candidate in the laboratory of environmental scientist Kai Zhu, was struggling to understand why her collaborators couldn’t use the NEON R package she had created for them.
Qin studies soil fungal biogeography, investigating where different types of fungi exist and thrive. She integrates data from NEON and soil microbe DNA sequencing studies to look for patterns in fungal biology and in the geographic locations where fungi are found.
“This work is important because it has long been assumed that all bacteria and fungi were not limited by dispersal, that they could be found anywhere,” Qin said. “It was believed only the environment would select where they would actually grow.”
With advancements in molecular analysis methods enabling DNA sequencing of environmental samples, ecologists have now shown that many species of fungi and bacteria can be found only in certain parts of the world.
“One reason this matters is because there are very specific relationships between disease-causing fungi or symbiotic fungi and their hosts,” Qin explained.
Qin’s main research question is: in what contexts or geographies do different fungal species fail to appear, even though the habitat is suitable for them, and why? “Does it depend on species’ traits, and if so, which traits? That remains an open question.”
Qin has a master’s degree in statistics, which she was eager to bring to her work in ecology. “I wanted to have more quantitatively rigorous ways of showing the significance of ecological patterns,” she said.
Her background had given her an introduction to programming, which she then used to develop an R package, based on a software program called dada2, for researchers to process the NEON soil microbe data. “I realized most ecology researchers would probably struggle with the bioinformatics behind the processing pipeline.”
But she found that even with her package, her collaborators’ computer systems were often missing some vital component, and so the program would fail to run. “It wouldn’t universally work,” she said.
That’s when one of Qin’s colleagues mentioned that they knew someone at CyVerse who might be working on the same problem.
Finding friends in scientific places
“I had been developing this for my own use,” Bartelme said. Then Dawson Fairbanks, a UArizona PhD candidate in environmental science studying under Rachel Gallery, who knew Bartelme as an instructor for CyVerse’s Foundational Open Science Skills, or FOSS, course, told him that an entire working group was tackling exactly this problem: trying to create a viable R package, called neonMicrobe, to analyze the NEON microbial datasets.
As it turned out, the connection ran deeper: Qin also had been a student in CyVerse FOSS. “There were a lot of gaps in my computing background, and some were somewhat foundational,” Qin mused. “I had assumed this was a simple task because we have the data, so it’s just a matter of processing it. I had some inflated confidence about it.”
Bartelme found himself contributing a vital component of cyberinfrastructure to a project he hadn’t even known about. His container brings all of the tools in Qin’s NEON R package together so that anyone on any computer can run the application and complete an analysis. It will also work on a high-performance computing, or HPC, system.
“It’s been awesome to make this transition from working on my own, to working with a group, to finding out the group was all formerly my students,” Bartelme said.
“Hopefully, this publication will give microbial ecologists everywhere an easy way to process these data,” Qin said.