Workflow systems turn raw data into scientific knowledge

Finn is head of the sequence-families team at the European Bioinformatics Institute (EBI) in Hinxton, UK; Meyer is a computer scientist at Argonne National Laboratory in Lemont, Illinois. Both run facilities that let researchers perform a computationally intensive process called metagenomic analysis, which allows microbial communities to be reconstructed from shards of DNA. It would be helpful, they realized, if they could try each other’s code. The problem was that their analytical ‘pipelines’ — the carefully choreographed computational steps required to turn raw data into scientific knowledge — were written in different languages. Meyer’s team was using an in-house system called AWE, whereas Finn was working with nearly 9,500 lines of Python code.

“It was a horrible Python code base,” says Finn — complicated, and difficult to maintain. “Bits had been bolted on in an ad hoc fashion over seven years by at least four different developers.” And it was “heavily tied to the compute infrastructure”, he says, meaning it was written for specific computational resources and a particular way of organizing files, and thus essentially unusable outside the EBI. Because the EBI wasn’t running AWE, the reverse was also true: Meyer’s pipeline could not be used at Hinxton. Then Finn and Meyer learnt about the Common Workflow Language (CWL).

CWL is a way of describing analytical pipelines and computational tools — one of more than 250 systems now available, including such popular options as Snakemake, Nextflow and Galaxy. Although they speak different languages and support different features, these systems have a common aim: to make computational methods reproducible, portable, maintainable and shareable. CWL is essentially an exchange language that researchers can use to share pipelines, whichever workflow system they run. For Finn, that language brought sanity to his code base, reducing it by around 73%. Importantly, it has made it easier to test, execute and share new methods, and to run them in the cloud.
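To give a flavour of the format, a CWL description is a plain YAML document that declares a tool’s command line, its inputs and its outputs. The sketch below is a minimal, hypothetical example, not taken from Finn’s pipeline: it wraps the standard Unix utility grep to count matching lines, and any CWL-compliant engine, such as the cwltool reference implementation, could execute it.

    #!/usr/bin/env cwl-runner
    # Minimal CWL tool description (illustrative only): wraps `grep -c`
    # to count the lines of a file that match a pattern.
    cwlVersion: v1.0
    class: CommandLineTool
    baseCommand: [grep, -c]
    inputs:
      pattern:             # text to search for; becomes the first argument
        type: string
        inputBinding:
          position: 1
      infile:              # file to search; becomes the second argument
        type: File
        inputBinding:
          position: 2
    stdout: count.txt      # capture the count printed to standard output
    outputs:
      count:
        type: stdout

Because the description spells out every input, output and command-line binding, rather than assuming a particular directory layout or compute cluster, an engine can stage the files and run the tool anywhere: for instance, cwltool count.cwl --pattern ACGT --infile reads.fa (the file names here are again illustrative).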