Openness in Big Data and Data Repositories | SpringerLink

Abstract:  There is a growing expectation, or even requirement, for researchers to deposit a variety of research data in data repositories as a condition of funding or publication. This expectation recognizes the enormous benefits of data collected and created for research purposes being made available for secondary uses, as open science gains increasing support. This is particularly so in the context of big data, especially where health data is involved. There are, however, also challenges relating to the collection, storage, and re-use of research data. This paper gives a brief overview of the landscape of data sharing via data repositories and discusses some of the key ethical issues raised by the sharing of health-related research data, including expectations of privacy and confidentiality, the transparency of repository governance structures, access restrictions, as well as data ownership and the fair attribution of credit. To consider these issues and the values that are pertinent, the paper applies the deliberative balancing approach articulated in the Ethics Framework for Big Data in Health and Research (Xafis et al. 2019) to the domain of Openness in Big Data and Data Repositories. Please refer to that article for more information on how this framework is to be used, including a full explanation of the key values involved and the balancing approach used in the case study at the end.

 

Critics decry access, transparency issues with key trove of coronavirus sequences | Science | AAAS

“In December 2020, software developer Angie Hinrichs at the University of California, Santa Cruz (UCSC), applied for access to a labor-saving data feed from GISAID, a nonprofit database of viral sequences including those of the pandemic coronavirus, SARS-CoV-2. She wanted GISAID’s data so she could display mutations on UCSC’s coronavirus Genome Browser. That tool ties any position in the virus’ nearly 30,000-letter genome to other scientific information, much as Google Maps shows gas stations and restaurants near addresses.

With more than 700,000 genomes from more than 160 countries, GISAID is by far the world’s largest database of SARS-CoV-2 sequences. Access to the free, nonprofit repository has become vital to Hinrichs and thousands of other scientists and public health agencies tracking the virus’ alarmingly rapid evolution.

But instead of getting a direct data feed, Hinrichs lost her existing access to two conveniently packaged GISAID files that are the next best thing. She emailed GISAID repeatedly pleading for restored access, but hasn’t gotten it. Since December, she has had to download GISAID’s sequences 10,000 at a time, with no access to most of the metadata unless she looks at each of the 10,000 sequences individually. …

But critics complain about GISAID’s constraints on access, chief among them its prohibition on resharing of its data. Its agreement for access to the direct data feed also requires applicants to use only GISAID data in their websites and tools, as well as only GISAID-approved strain names. (GISAID says allowing users to mix data on their websites “would duplicate data already in GISAID, resulting in bias and distorted results.”)….”

DataCite Repository Selector

“Repository Finder, a pilot project of the Enabling FAIR Data Project led by the American Geophysical Union (AGU) in partnership with DataCite and the Earth, space and environment sciences community, can help you find an appropriate repository to deposit your research data. The tool is hosted by DataCite and queries the re3data registry of research data repositories….”

“As part of the FAIRsFAIR project, which aims to supply practical solutions for the use of the FAIR data principles throughout the research data life cycle, the Repository Finder is extended to query for repositories relevant to FAIRsFAIR Project….”

The broken promise that undermines human genome research

“Data sharing was a core principle that led to the success of the Human Genome Project 20 years ago. Now scientists are struggling to keep information free….

So in 1996, the HGP [Human Genome Project] researchers got together to lay out what became known as the Bermuda Principles, with all parties agreeing to make the human genome sequences available in public databases, ideally within 24 hours — no delays, no exceptions.

 

Fast-forward two decades, and the field is bursting with genomic data, thanks to improved technology both for sequencing whole genomes and for genotyping them by sequencing a few million select spots to quickly capture the variation within. These efforts have produced genetic readouts for tens of millions of individuals, and they sit in data repositories around the globe. The principles laid out during the HGP, and later adopted by journals and funding agencies, meant that anyone should be able to access the data created for published genome studies and use them to power new discoveries….

The explosion of data led governments, funding agencies, research institutes and private research consortia to develop their own custom-built databases for handling the complex and sometimes sensitive data sets. And the patchwork of repositories, with various rules for access and no standard data formatting, has led to a “Tower of Babel” situation, says Haussler….”

The SHRUG; Development Data Lab

“The Socioeconomic High-resolution Rural-Urban Geographic Platform for India (SHRUG) is a geographic platform that facilitates data sharing between researchers working on India. It is an open access repository currently comprising dozens of datasets covering India’s 500,000 villages and 8000 towns using a set of common geographic identifiers that span 25 years….”

Curating a COVID-19 Data Repository and Forecasting County-Level Death Counts in the United States · Special Issue 1 – COVID-19: Unprecedented Challenges and Chances

Abstract:  As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this article we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative death counts at the county level in the United States up to 2 weeks ahead. Using data from January 23 to June 20, 2020, we develop and combine multiple forecasts using ensembling techniques, resulting in an ensemble we refer to as combined linear and exponential predictors (CLEP). Our individual predictors include county-specific exponential and linear predictors, a shared exponential predictor that pools data together across counties, an expanded shared exponential predictor that uses data from neighboring counties, and a demographics-based shared exponential predictor. We use prediction errors from the past 5 days to assess the uncertainty of our death predictions, resulting in generally applicable prediction intervals, maximum (absolute) error prediction intervals (MEPI). MEPI achieves a coverage rate of more than 94% when averaged across counties for predicting cumulative recorded death counts 2 weeks in the future. Our forecasts are currently being used by the nonprofit organization Response4Life to determine the medical supply need for individual hospitals and have directly contributed to the distribution of medical supplies across the country. We hope that our forecasts and data repository at https://covidseverity.com can help guide necessary county-specific decision making and help counties prepare for their continued fight against COVID-19.

Doing it Right: A Better Approach for Software & Data | Dryad news and views

“The Dryad and Zenodo teams are proud to announce the launch of our first formal integration. As we’ve noted over the last years, we believe that the best way to support the broad scientific community in publishing their outputs is to leverage each other’s strengths and build together. Our plan has always been to find ways to seamlessly connect software publishing and data curation in ways that are both easy enough that the features will be used but also beneficial to the researchers re-using and building on scientific discoveries. This month, we’ve released our first set of features to support exactly that….”

Generalist Repositories

“While NIH encourages the use of domain-specific repositories where possible, such repositories are not available for all datasets. When investigators cannot locate a repository for their discipline or the type of data they generate, a generalist repository can be a useful place to share data. Generalist repositories accept data regardless of data type, format, content, or disciplinary focus. NIH does not recommend a specific generalist repository and the list below, which is not exhaustive, is provided as a guide for locating generalist repositories….”

The Federated Research Data Repository (FRDR) is Now in Full Production! – Portage Network

“Portage’s Federated Research Data Repository (FRDR) has officially launched into full production! Full production offers many new features and benefits:

Publish research data in a Canadian-owned, bilingual national repository option
1 TB of repository storage available to all faculty members at Canadian post-secondary institutions – more storage may be available upon request
Secure repository storage, distributed geographically across multiple Compute Canada Federation hosting sites
Data curation support provided by Portage
Ability to work with multiple collaborators on a single submission 
Your data will be discoverable alongside other Canadian collections in the FRDR Discovery Portal…”

A Review of Open Research Data Policies and Practices in China

Abstract:  This paper initially conducts a literature review and content analysis of the open research data policies in China. Next, a series of exemplars describe data practices to promote and enable the use of open research data, including open data practices in research programs, data repositories, data journals, and citizen science. Moreover, the top four driving forces are identified and analyzed along with their responsible guiding work. In addition, the “landscape of open research data ecology in China” is derived from the literature review and from observations of actual cases, where the interaction and mutual development of data policies, data programs, and data practices are recognized. Finally, future trends of research data practices within China and internationally are discussed. We hope the analysis provides perspective on current open data practices in China along with insight into the need for additional research on scientific data sharing and management.

 

Data Repository Platforms: A Primer | Ithaka S+R

“Since there is a robust landscape of research data sharing spaces, we decided to conduct exploratory, high-level research on a number of data repositories, primarily to inform our own data deposit protocols. We regularly deposit data from the US Faculty Survey, Library Director Survey, as well as several other research projects with ICPSR. Recognizing that our research on a variety of characteristics of data repositories may yield utility for other researchers, today we are publishing a summary of our findings.

Below you can find seven repositories compared side-by-side in tabular format. We have highlighted particular factors that are key for informing decision-making: disciplinary scope, typical timelines for processing datasets, associated costs, and services offered (such as data curation)….”