The broken promise that undermines human genome research

“Data sharing was a core principle that led to the success of the Human Genome Project 20 years ago. Now scientists are struggling to keep information free….

So in 1996, the HGP [Human Genome Project] researchers got together to lay out what became known as the Bermuda Principles, with all parties agreeing to make the human genome sequences available in public databases, ideally within 24 hours — no delays, no exceptions.

Fast-forward two decades, and the field is bursting with genomic data, thanks to improved technology both for sequencing whole genomes and for genotyping them by sequencing a few million select spots to quickly capture the variation within. These efforts have produced genetic readouts for tens of millions of individuals, and they sit in data repositories around the globe. The principles laid out during the HGP, and later adopted by journals and funding agencies, meant that anyone should be able to access the data created for published genome studies and use them to power new discoveries….

The explosion of data led governments, funding agencies, research institutes and private research consortia to develop their own custom-built databases for handling the complex and sometimes sensitive data sets. And the patchwork of repositories, with various rules for access and no standard data formatting, has led to a “Tower of Babel” situation, says Haussler….”

The SHRUG; Development Data Lab

“The Socioeconomic High-resolution Rural-Urban Geographic Platform for India (SHRUG) is a geographic platform that facilitates data sharing between researchers working on India. It is an open access repository currently comprising dozens of datasets covering India’s 500,000 villages and 8000 towns using a set of common geographic identifiers that span 25 years….”

Curating a COVID-19 Data Repository and Forecasting County-Level Death Counts in the United States · Special Issue 1 – COVID-19: Unprecedented Challenges and Chances

Abstract:  As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this article we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative death counts at the county level in the United States up to 2 weeks ahead. Using data from January 23 to June 20, 2020, we develop and combine multiple forecasts using ensembling techniques, resulting in an ensemble we refer to as combined linear and exponential predictors (CLEP). Our individual predictors include county-specific exponential and linear predictors, a shared exponential predictor that pools data together across counties, an expanded shared exponential predictor that uses data from neighboring counties, and a demographics-based shared exponential predictor. We use prediction errors from the past 5 days to assess the uncertainty of our death predictions, resulting in generally applicable prediction intervals, maximum (absolute) error prediction intervals (MEPI). MEPI achieves a coverage rate of more than 94% when averaged across counties for predicting cumulative recorded death counts 2 weeks in the future. Our forecasts are currently being used by the nonprofit organization Response4Life to determine the medical supply need for individual hospitals and have directly contributed to the distribution of medical supplies across the country. We hope that our forecasts and data repository at https://covidseverity.com can help guide necessary county-specific decision making and help counties prepare for their continued fight against COVID-19.
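
To make the interval idea in the abstract concrete, here is a minimal Python sketch of a "maximum error" prediction interval built from the last 5 days of forecast errors. It is an illustration only: the function name, the choice to normalize each error by the prediction, the clipping at zero, and the sample numbers are all assumptions made for this example and are not taken from the paper, which should be consulted for the authors' exact definition of MEPI.

```python
from typing import Sequence, Tuple


def max_error_interval(
    past_predictions: Sequence[float],
    past_observed: Sequence[float],
    new_prediction: float,
    window: int = 5,
) -> Tuple[float, float]:
    """Return (lower, upper) around new_prediction, sized by the worst recent error."""
    if len(past_predictions) != len(past_observed):
        raise ValueError("prediction and observation histories must align")
    # Keep only the most recent `window` (prediction, observation) pairs.
    recent = list(zip(past_predictions, past_observed))[-window:]
    if not recent:
        raise ValueError("need at least one past prediction")
    # Relative absolute error of each recent forecast (assumed normalization).
    errors = [abs(obs - pred) / abs(pred) for pred, obs in recent if pred != 0]
    worst = max(errors, default=0.0)
    # Widen the new forecast by the worst error; counts cannot be negative.
    lower = max(new_prediction * (1 - worst), 0.0)
    upper = new_prediction * (1 + worst)
    return lower, upper


# Hypothetical county history: five past forecasts and the recorded counts.
lower, upper = max_error_interval(
    past_predictions=[100, 110, 120, 130, 140],
    past_observed=[105, 108, 126, 130, 133],
    new_prediction=150,
)
print(f"interval: [{lower:.1f}, {upper:.1f}]")
```

Taking the maximum rather than the average of the recent errors makes the interval deliberately conservative: one bad forecast in the window is enough to widen it.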

Doing it Right: A Better Approach for Software & Data | Dryad news and views

“The Dryad and Zenodo teams are proud to announce the launch of our first formal integration. As we’ve noted over the last years, we believe that the best way to support the broad scientific community in publishing their outputs is to leverage each other’s strengths and build together. Our plan has always been to find ways to seamlessly connect software publishing and data curation in ways that are both easy enough that the features will be used but also beneficial to the researchers re-using and building on scientific discoveries. This month, we’ve released our first set of features to support exactly that….”

Generalist Repositories

“While NIH encourages the use of domain-specific repositories where possible, such repositories are not available for all datasets. When investigators cannot locate a repository for their discipline or the type of data they generate, a generalist repository can be a useful place to share data. Generalist repositories accept data regardless of data type, format, content, or disciplinary focus. NIH does not recommend a specific generalist repository and the list below, which is not exhaustive, is provided as a guide for locating generalist repositories….”

The Federated Research Data Repository (FRDR) is Now in Full Production! – Portage Network

“Portage’s Federated Research Data Repository (FRDR) has officially launched into full production! Full production offers many new features and benefits:

Publish research data in a Canadian-owned, bilingual national repository option
1 TB of repository storage available to all faculty members at Canadian post-secondary institutions – more storage may be available upon request
Secure repository storage, distributed geographically across multiple Compute Canada Federation hosting sites
Data curation support provided by Portage
Ability to work with multiple collaborators on a single submission 
Your data will be discoverable alongside other Canadian collections in the FRDR Discovery Portal…”

A Review of Open Research Data Policies and Practices in China

Abstract:  This paper initially conducts a literature review and content analysis of the open research data policies in China. Next, a series of exemplars describe data practices to promote and enable the use of open research data, including open data practices in research programs, data repositories, data journals, and citizen science. Moreover, the top four driving forces are identified and analyzed along with their responsible guiding work. In addition, the “landscape of open research data ecology in China” is derived from the literature review and from observations of actual cases, where the interaction and mutual development of data policies, data programs, and data practices are recognized. Finally, future trends of research data practices within China and internationally are discussed. We hope the analysis provides perspective on current open data practices in China along with insight into the need for additional research on scientific data sharing and management.

Data Repository Platforms: A Primer | Ithaka S+R

“Since there is a robust landscape of research data sharing spaces, we decided to conduct exploratory, high-level research on a number of data repositories, primarily to inform our own data deposit protocols. We regularly deposit data from the US Faculty Survey, Library Director Survey, as well as several other research projects with ICPSR. Recognizing that our research on a variety of characteristics of data repositories may yield utility for other researchers, today we are publishing a summary of our findings.

Below you can find seven repositories compared side-by-side in tabular format. We have highlighted particular factors that are key for informing decision-making: disciplinary scope, typical timelines for processing datasets, associated costs, and services offered (such as data curation)….”

#ASAPpdb: Structural biologists commit to releasing data with preprints – ASAPbio

“The Protein Data Bank (PDB) was established as the first open access repository for biological data, and the datasets it hosts have been invaluable to research in fundamental biology and the understanding of health and disease. Just this month, we witnessed the announcement of the AlphaFold2 results toward structure prediction, made possible thanks to the more than 170,000 freely accessible structures in the PDB which provided “training data” for the structure prediction software.

It was not always the case that such structural biology data were freely available, even upon journal publication. From the founding of the PDB in 1971 until the late 1980s, most journals did not require deposition of structures in a public database. A key moment was a petition, circulated in 1987 by a group of leading structural biologists, demanding that the data created be made openly available upon journal publication. This petition led to major journals adopting data deposition standards. In the early 1990s, the National Institute of General Medical Sciences (NIGMS) imposed similar requirements on all grantees. 

The revolution in publishing made possible by preprints calls for a re-evaluation of data disclosure practices in structural biology. While journal review processes take weeks, months, or even years, preprints allow researchers to rapidly communicate their findings to the community. However, withholding access to PDB files that accompany preprints inhibits the progress towards scientific discovery which preprints can enable. 

Commitment

We pledge to publicly release our PDB files (and associated structure factor, restraint, and map files) with deposition of our preprints.

We encourage all structural biologists to also deposit raw data in appropriate resources (e.g. EMPIAR, proteindiffraction.org, https://data.sbgrid.org/, etc). …”

Scientific Data recommended repositories

“Spreadsheet listing data repositories that are recommended by Scientific Data (Springer Nature) as being suitable for hosting data associated with peer-reviewed articles. Please see the repository list on Scientific Data’s website for the most up to date list….”

What Data Repository Should I use? – a help article for using figshare

“There is a home for every dataset.

Ideally each subject would have a subject specific data repository with custom metadata capture and a subject specialist to curate the data. If there is a subject specific repository that is suitable for your datasets, use that one.

A list of subject specific repositories that are recommended by Scientific Data (Springer Nature) as being suitable for hosting data associated with peer-reviewed articles can be found here.

If there is no subject specific repository, you should check with your Institution’s Library as to whether they have a data repository. If they do, they will most likely have a team of data librarians to guide you through your dataset publication.

If there is no subject specific repository, you should upload and publish your data on Figshare.com, or another suitable generalist repository. How to upload and publish your data on Figshare.com

For more information on which repository to use, head to Re3Data….”

Open Context: Web-based research data publishing

“Open Context reviews, edits, annotates, publishes and archives research data and digital documentation. We publish your data and preserve it with leading digital libraries. We take steps beyond archiving to richly annotate and integrate your analyses, maps and media. This links your data to the wider world and broadens the impact of your ideas….”