“Holes in the Tree of Life”: Why and how phylogenetic data must be published

Ross Mounce @rmounce and Joseph W Brown have been tweeting about the lack of data to support published phylogenetic studies. (Readers of this blog will know that Ross and I start work in October to extract trees from published PDFs – an awful statement of how bad the situation is.)

Very simply, phylogenetic data are key to our understanding of the history, ecology and biodiversity of the planet. If we don’t understand species then we shall lose them, and if we don’t understand how species interact we shall lose ecosystems. Look into the details of pollination and often the loss of one species affects others directly. (Though Darwin was wrong about the cats -> clover chain http://triscience.com/Species/Field/the-cats-to-clover-chain/doculite_view ).

Most peer-reviewed phylogenetics is in closed journals (40 USD for a 1-day read). It’s appallingly arrogant to assume that anyone who needs it (academics) can get the info. But worse, almost none of the data are published. Phylogenetic trees are mainly computed from molecular information (DNA sequences of key genes) and are costly to produce. Yet the data are relatively simple. They are well understood (30+ years of sequence / gene repositories) and they are compact (accession numbers are often fine). An uncompressed tree takes perhaps a few kilobytes, and with indexing/compression a complete study could be published in ca. 1 MB. That’s less than the size of many single images!
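To make the byte counts concrete, here is a sketch (all names and branch lengths invented for illustration) that builds a hypothetical 1000-taxon tree in Newick, the standard plain-text notation for phylogenies, and measures how little space it takes:

```python
import zlib

# Build a hypothetical "caterpillar" tree of 1000 taxa (t0..t999) in
# Newick format. Taxon names and branch lengths are invented.
tree = "t0:0.1"
for i in range(1, 1000):
    tree = "({},t{}:0.1)".format(tree, i)
newick = tree + ";"

raw = newick.encode("utf-8")
packed = zlib.compress(raw)

# A 1000-tip tree is on the order of 10 KB uncompressed, and the
# repetitive syntax compresses well -- tiny next to a single figure.
print(len(raw), len(packed))
```

Real study trees have richer labels and metadata, but the order of magnitude is the same: storage cost is not the obstacle.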

Here’s what sparked the discussion: http://www.botanyconference.org/engine/search/index.php?func=detail&aid=167 I’ll quote it in full and argue that any reasonably literate person could understand it. I have highlighted some parts.

Missing data lead to holes in the tree of life.

The fundamental importance of archiving scientific datasets has received increasing attention over the past several years, and failure to properly archive data can adversely affect study reproducibility. However, in plant systematics (or evolutionary biology) there has been no comprehensive review that examines the deposition practices of the underlying phylogenetic datasets and trees that are the foundation of the discipline. Furthermore, there is little understanding of how the deposition rate of DNA sequence alignments and phylogenetic trees has changed over time. In the process of gathering data to build the first tree of life for all ~1.9 million named species (the Open Tree of Life Project), we sifted through over 7200 peer-reviewed phylogenetic studies published between the years 2000 and 2012. Our survey covered over 100 journals and included publications focusing on green plants, animals, fungi, microbial eukaryotes, bacteria, and archaea. This broad survey included 1243 seed plant publications. Overall, we found that only 17% of examined studies made nucleotide alignment data and/or trees available in an accessible repository such as TreeBASE or Dryad. Within seed plants, only 24% of studies from the past 12 years have been archived. Furthermore, most corresponding authors (54% for seed plants) that we contacted for un-deposited datasets and trees did not respond to our repeated (2) requests for data. Thus, most of the trees and alignments produced during the past several decades is essentially lost forever. The plant systematics community needs to significantly improve data deposition practices to ensure that crucial data (trees, alignments) are archived and thus freely available to other interested scientists. 
Our results illustrate that voluntary data submission policies have not worked, and dictate the urgent need to adopt new policies requiring public archiving of DNA sequence alignments and trees in a routine manner as is done routinely with raw sequence data. These stark findings should encourage the systematic community as well as journal editorials to adopt data sharing policies that require deposition of alignments and resulting phylogenetic trees in established databases prior to publication.

Very simply (this applies to many subjects):

Many/most authors don’t care about making their science available to the world. The final result of their work is a “scholarly article”, not useful, reusable, verifiable science that can be built on and re-used by policy makers and citizens. The authors do not feel that being publicly funded gives them any obligations to the public. The ivory tower rewards their work only in the torrid market of scholarship, not for its wider value to the world.

It has worked in some subjects – sequences/genes, crystal structures, galaxies. Here the disciplines have developed cultures where scientists are expected and then mandated to deposit data. The commonest ways are (a) on publishers’ websites (e.g. crystallography) and (b) in domain repositories (e.g. sequences).

Making phylogenetic data available for each study is technically straightforward. The byte count is insignificant in today’s world. The standards and protocols (e.g. NeXML) exist. The problem is 99% a people problem.
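As a sketch of how machine-readable the core data already are, here is a minimal parser (the function name and the labels-only simplification are mine) that turns a Newick string into nested Python tuples. Real libraries such as Biopython’s Bio.Phylo or DendroPy, and richer standards like NeXML, handle the full formats:

```python
def parse_newick(s):
    """Parse a simplified Newick string (labels only, no branch
    lengths or comments) into nested tuples of taxon names."""
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":           # internal node: a list of children
            pos += 1                # consume "("
            children = [node()]
            while s[pos] == ",":
                pos += 1            # consume ","
                children.append(node())
            pos += 1                # consume ")"
            return tuple(children)
        start = pos                 # leaf: read the taxon label
        while s[pos] not in ",();":
            pos += 1
        return s[start:pos]

    return node()

print(parse_newick("((A,B),(C,D));"))  # → (('A', 'B'), ('C', 'D'))
```

A format this simple has no technical excuse for non-deposition; the barrier is cultural, not computational.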

The problem is community. In some cases the learned societies are more concerned with generating income than with serving science. (Where are the publishers that actually make subscription material available to the world within – say – 6 months of publication?) Many are actually making it more difficult. The last 12 months have confirmed that most legacy publishers are part of the problem, not the solution.

So how, if publishers are antagonistic or indifferent to requiring the publication of data, do we manage it? The universities are totally vapid today – they have shown no leadership. So the only clear path is funder mandates.

And that will work. I’ve seen the pressure that the NSF mandate on data management has applied in the US, and I think it’s starting to work. That’s got to happen everywhere. So my message to funders is:

Mandate the deposition of data at time of publication. And if not, chop 10% off the grant.

That works. It’s a lot of work, but it’s a trivial amount compared with the current loss of data (which I estimate at >> 100 billion USD per year).