Data sharing practices across knowledge domains: a dynamic examination of data availability statements in PLOS ONE publications

Abstract:  As the importance of research data gradually grows in sciences, data sharing has come to be encouraged and even mandated by journals and funders in recent years. Following this trend, the data availability statement has been increasingly embraced by academic communities as a means of sharing research data as part of research articles. This paper presents a quantitative study of which mechanisms and repositories are used to share research data in PLOS ONE articles. We offer a dynamic examination of this topic from the disciplinary and temporal perspectives based on all statements in English-language research articles published between 2014 and 2020 in the journal. We find a slow yet steady growth in the use of data repositories to share data over time, as opposed to sharing data in the paper or supplementary materials; this indicates improved compliance with the journal’s data sharing policies. We also find that multidisciplinary data repositories have been increasingly used over time, whereas some disciplinary repositories show a decreasing trend. Our findings can help academic publishers and funders to improve their data sharing policies and serve as an important baseline dataset for future studies on data sharing activities.


Show your work: Tools for open developmental science – ScienceDirect

Abstract:  Since grade school, students of many subjects have learned to “show their work” in order to receive full credit for assignments. Many of the reasons for students to show their work extend to the conduct of scientific research. And yet multiple barriers make it challenging to share and show the products of scientific work beyond published findings. This chapter discusses some of these barriers and how web-based data repositories help overcome them. The focus is on Databrary, a data library specialized for storing and sharing video data with a restricted community of institutionally approved investigators. Databrary was designed by and for developmental researchers, and so its features and policies reflect many of the specific challenges faced by this community, especially those associated with sharing video and related identifiable data. The chapter argues that developmental science poses some of the most interesting, challenging, and important questions in all of science, and that by openly sharing much more of the products and processes of our work, developmental scientists can accelerate discovery while making our scholarship much more robust and reproducible.


Dear Colleague Letter: Effective Practices for Making Research Data Discoverable and Citable (Data Sharing) (nsf22055) | NSF – National Science Foundation

“This Dear Colleague Letter describes and encourages effective practices for publicly sharing research data, including the use of persistent digital identifiers (PDIs).

Datasets underpinning published research findings are expected to be shared with other researchers, at no more than incremental cost and within a reasonable time. Data-sharing holds numerous benefits, from enabling broader research collaboration, through facilitating transparency and solidifying confidence in scientific research, to providing increased resources for teaching and education purposes. Recent studies found that research articles containing a link to data in a repository have markedly higher usage and visibility. Discoverable and citable data also serve to reduce barriers to entry for junior researchers, scientists in under-served communities, and researchers from underrepresented and minority groups, thus enabling improved implementation of open science principles.

The nature of digital data produced during research may vary among the different topical disciplines encompassed by the field of Materials Research. Most often, digital research data comprise one or more of the following: raw data files collected using experimental instrumentation and converted into digital format; digital files of processed experimental data; video and animation files; numerical data produced by computer simulations or computational models; computer code, scripts, software, software documentation and user manuals developed as part of the research project; digital files of theoretical models, protocols, and methods; educational, instructional, and training materials.

Open-access data sharing platforms (data repositories) comprise the most efficient way to publish and share research data. Moreover, as long-term data curation and preservation are core to their mission, data repositories provide a stable means for data preservation. Upon publication of a dataset, most repositories automatically generate a citation for the data, which includes identifying metadata such as the archiving repository, the data’s author(s), and a PDI such as a digital object identifier (DOI). A DOI is a unique and persistent digital identifier, which, when assigned to a digital entity such as a dataset, remains unchanged over the lifetime of the object. Having a DOI (or other form of PDI) from an open-access repository renders data findable, accessible, and readily citable. Searchable global registries of data repositories provide information on indexed repositories to help researchers identify the most appropriate ones. In the case where a suitable repository is not available, researchers are strongly encouraged to use their institutional digital repositories, which typically issue DOIs to institutionally hosted content….”
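As a rough illustration of the citation metadata the letter describes (repository, authors, and a DOI), the Python sketch below assembles a dataset citation string. The record fields, the citation layout, and the DOI `10.1234/example` are hypothetical placeholders, not the output format of any particular repository; the only real-world detail relied on is that a bare DOI resolves through the `https://doi.org/` proxy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata for a deposited dataset."""
    authors: List[str]
    year: int
    title: str
    repository: str
    doi: str  # bare DOI without a URL prefix (placeholder values only)


def doi_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable link via the doi.org proxy."""
    return "https://doi.org/" + doi.strip()


def format_citation(record: DatasetRecord) -> str:
    """Build a simple citation string; the layout is an assumption."""
    authors = "; ".join(record.authors)
    return (
        f"{authors} ({record.year}). {record.title} [Data set]. "
        f"{record.repository}. {doi_url(record.doi)}"
    )


if __name__ == "__main__":
    example = DatasetRecord(
        authors=["Doe, J.", "Roe, R."],
        year=2022,
        title="Example measurements",
        repository="Example Repository",
        doi="10.1234/example",  # made-up DOI for illustration
    )
    print(format_citation(example))
```

Because the DOI stays fixed for the lifetime of the dataset, a citation built this way keeps pointing at the same object even if the repository reorganizes its own URLs.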

The craft and coordination of data curation: complicating “workflow” views of data science

Abstract:  Data curation is the process of making a dataset fit-for-use and archiveable. It is critical to data-intensive science because it makes complex data pipelines possible, makes studies reproducible, and makes data (re)usable. Yet the complexities of the hands-on, technical and intellectual work of data curation are frequently overlooked or downplayed. Obscuring the work of data curation not only renders the labor and contributions of the data curators invisible; it also makes it harder to tease out the impact curators’ work has on the later usability, reliability, and reproducibility of data. To better understand the specific work of data curation, and thereby explore ways of showing curators’ impact, we conducted a close examination of data curation at a large social science data repository, the Inter-university Consortium for Political and Social Research (ICPSR). We asked, What does curatorial work entail at ICPSR, and what work is more or less visible to different stakeholders and in different contexts? And, how is that curatorial work coordinated across the organization? We triangulate accounts of data curation from interviews and records of curation in Jira tickets to develop a rich and detailed account of curatorial work. We find that curators describe a number of craft practices needed to perform their work, which defies the rote sequence of events implied by many lifecycle or workflow models. Further, we show how best practices and craft practices are deeply intertwined.


Data Policies and Principles

“Recognizing the crucial role of open and effective data and information exchange to the Belmont Challenge, the Belmont Forum adopted open Data Policy and Principles based on the recommendations from the Community Strategy and Implementation Plan (CSIP) at its 2015 annual meeting of Principals in Oslo, Norway. The policy signals a commitment by funders of global environmental change research to increase access to scientific data, a step widely recognized as essential to making informed decisions in the face of rapid changes affecting the Earth’s environment….

Data should be:


Discoverable through catalogues and search engines
Accessible as open data by default, and made available with minimum time delay
Understandable in a way that allows researchers—including those outside the discipline of origin—to use them
Manageable and protected from loss for future use in sustainable, trustworthy repositories…

Research data must be:

Discoverable through catalogues and search engines, with data access and use conditions, including licenses, clearly indicated. Data should have appropriate persistent, unique and resolvable identifiers.
Accessible by default, and made available with minimum time delay, except where international and national policies or legislation preclude the sharing of data as Open Data. Data sources should always be cited.
Understandable and interoperable in a way that allows researchers, including those outside the discipline of origin, to use them. Preference should be given to non-proprietary international and community standards via data e-infrastructures that facilitate access, use and interpretation of data. Data must also be reusable and thus require proper contextual information and metadata, including provenance, quality and uncertainty indicators. Provision should be made for multiple languages.
Manageable and protected from loss for future use in sustainable, trustworthy repositories with data management policies and plans for all data at the project and institutional levels. Metrics should be exploited to facilitate the ability to measure return on investment, and can be used to implement incentive schemes for researchers, as well as provide measures of data quality.
Supported by a highly skilled workforce and a broad-based training and education curriculum as an integral part of research programs. …”

Home – NIH ODSS Search Workshop

“The goal of the Workshop is to explore current capabilities, gaps and opportunities for global data search across the data ecosystem. The Workshop will explore selected science drivers across these main themes:

Using search to build cohorts: finding data across different platforms/repositories using patient attributes in order to create a cohort of patients for clinical analysis
Using search to find relevant data & repositories: finding data & repositories in order to access and analyze the data further, including its use for creating computational models.
Using search for (complex) information retrieval: answering specific questions without the additional burden of data download or analysis…”

Tracing the footsteps of open research data in China

Abstract:  While the scientific research value, economic value and social value of research data have become increasingly apparent, the significance of open research data has reached a consensus. This article gives an introduction to open research data policies and measures in China, and reports on the status of constructing necessary infrastructure, specifically open data repositories. We compare open data repositories in China and Western countries in terms of scale, subject distribution, data policies, service and content operations. In addition, this article summarizes methods and motivations for data sharing among researchers in China. Finally, the paper discusses the characteristics, potential problems and challenges of China’s open research data practices. We conclude with some suggestions for the future development of open research data in China from data policy, infrastructure construction, compliance with international standards and norms, credibility and influence improvement, incentives for data sharing and encouraging data sharing research practices.

An open repository of real-time COVID-19 indicators | PNAS

Abstract:  The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making.
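The revision tracking mentioned in the abstract can be pictured with a small Python sketch: each value is stored with both the date it describes and the date it was issued, so a modeler can ask what was known "as of" a given day. This is a toy illustration under that assumption only; the class and method names are hypothetical and it is not the COVIDcast API or its client packages.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple


class VersionedSignal:
    """Toy store that keeps every issued revision of a daily value."""

    def __init__(self) -> None:
        # reference date -> list of (issue date, value) pairs
        self._revisions: Dict[date, List[Tuple[date, float]]] = {}

    def record(self, reference_date: date, issue_date: date, value: float) -> None:
        """Store a value for reference_date as published on issue_date."""
        self._revisions.setdefault(reference_date, []).append((issue_date, value))

    def as_of(self, reference_date: date, as_of_date: date) -> Optional[float]:
        """Return the latest value issued on or before as_of_date, if any."""
        known = [
            (issued, value)
            for issued, value in self._revisions.get(reference_date, [])
            if issued <= as_of_date
        ]
        if not known:
            return None
        # Pick the revision with the most recent issue date.
        return max(known, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    cases = VersionedSignal()
    # Made-up numbers: an initial report that is later revised upward.
    cases.record(date(2020, 4, 1), date(2020, 4, 2), 100.0)
    cases.record(date(2020, 4, 1), date(2020, 4, 5), 120.0)
    print(cases.as_of(date(2020, 4, 1), date(2020, 4, 3)))
    print(cases.as_of(date(2020, 4, 1), date(2020, 4, 6)))
```

Keeping every revision rather than overwriting values is what lets a forecast be evaluated against the data that was actually available when it was made.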


Incentivising research data sharing: a scoping review

Abstract:  Background: Numerous mechanisms exist to incentivise researchers to share their data. This scoping review aims to identify and summarise evidence of the efficacy of different interventions to promote open data practices and provide an overview of current research.

Methods: This scoping review is based on data identified from Web of Science and LISTA, limited from 2016 to 2021. A total of 1128 papers were screened, with 38 items being included. Items were selected if they focused on designing or evaluating an intervention or presenting an initiative to incentivise sharing. Items comprised a mixture of research papers, opinion pieces and descriptive articles.

Results: Seven major themes in the literature were identified: publisher/journal data sharing policies, metrics, software solutions, research data sharing agreements in general, open science ‘badges’, funder mandates, and initiatives.

Conclusions: A number of key messages for data sharing include: the need to build on existing cultures and practices, meeting people where they are and tailoring interventions to support them; the importance of publicising and explaining the policy/service widely; the need to have disciplinary data champions to model good practice and drive cultural change; the requirement to resource interventions properly; and the imperative to provide robust technical infrastructure and protocols, such as labelling of data sets, use of DOIs, data standards and use of data repositories.