“Optional Data Curation Feature Use by Harvard Dataverse Repository Users” by Ceilyn Boyd

Abstract:  Objective: Investigate how different groups of depositors vary in their use of optional data curation features that provide support for FAIR research data in the Harvard Dataverse repository.

Methods: A numerical score based upon the presence or absence of characteristics associated with the use of optional features was assigned to each of the 29,295 datasets deposited in Harvard Dataverse between 2007 and 2019. Statistical analyses were performed to investigate patterns of optional feature use amongst different groups of depositors and their relationship to other dataset characteristics.

Results: Members of groups make greater use of Harvard Dataverse’s optional features than individual researchers. Datasets that undergo a data curation review before submission to Harvard Dataverse, are associated with a publication, or contain restricted files also make greater use of optional features.

Conclusions: Individual researchers might benefit from increased outreach and improved documentation about the benefits and use of optional features to improve their datasets’ level of curation beyond the FAIR-informed support that the Harvard Dataverse repository provides by default. Platform designers, developers, and managers may also use the numerical scoring approach to explore how different user groups use optional application features.
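As a rough illustration of the scoring approach described in the Methods, the sketch below gives each dataset one point per optional feature present and then averages the scores by depositor group. The feature names (`has_keywords`, `has_subject_terms`, `has_file_descriptions`, `has_custom_license`) and the `depositor_group` field are hypothetical placeholders; the abstract does not list the characteristics or weights the study used.

```python
from dataclasses import dataclass

# Hypothetical optional-feature flags, chosen only for illustration.
FEATURES = (
    "has_keywords",
    "has_subject_terms",
    "has_file_descriptions",
    "has_custom_license",
)


@dataclass
class Dataset:
    depositor_group: str  # hypothetical label, e.g. "group" or "individual"
    has_keywords: bool = False
    has_subject_terms: bool = False
    has_file_descriptions: bool = False
    has_custom_license: bool = False


def feature_score(dataset: Dataset) -> int:
    """Count how many optional features are present (one point each)."""
    return sum(1 for name in FEATURES if getattr(dataset, name))


def mean_score_by_group(datasets: list[Dataset]) -> dict[str, float]:
    """Average the presence/absence score within each depositor group."""
    totals: dict[str, tuple[int, int]] = {}
    for ds in datasets:
        total, count = totals.get(ds.depositor_group, (0, 0))
        totals[ds.depositor_group] = (total + feature_score(ds), count + 1)
    return {group: total / count for group, (total, count) in totals.items()}
```

A comparison between groups would then start from `mean_score_by_group(datasets)`; the statistical tests applied in the study are not described in the abstract and are not reproduced here.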

Opening Your Scholarship: Why should I DASH and Dataverse?

“Learn practices and platforms to achieve your open access goals!

Highlights on Harvard DASH and Dataverse.


– Sonia Barbosa, Manager of Data Curation, Harvard Dataverse, Manager of the Murray Research Archive

– Julie Goldman, Research Data Services Librarian

– Colin Lukens, Senior Repository Manager, Harvard Library Office for Scholarly Communication

– Katie Mika, Data Services Librarian …”

Dataverse and OpenDP: Tools for Privacy-Protective Analysis in the Cloud | Mercè Crosas

“When big data intersects with highly sensitive data, both opportunity to society and risks abound. Traditional approaches for sharing sensitive data are known to be ineffective in protecting privacy. Differential Privacy, deriving from roots in cryptography, is a strong mathematical criterion for privacy preservation that also allows for rich statistical analysis of sensitive data. Differentially private algorithms are constructed by carefully introducing “random noise” into statistical analyses so as to obscure the effect of each individual data subject. OpenDP is an open-source project for the differential privacy community to develop general-purpose, vetted, usable, and scalable tools for differential privacy, which users can simply, robustly and confidently deploy.

Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data. It facilitates making data available to others, and allows you to replicate others’ work more easily. Researchers, journals, data authors, publishers, data distributors, and affiliated institutions all receive academic credit and web visibility.  A Dataverse repository is the software installation, which then hosts multiple virtual archives called Dataverses. Each dataverse contains datasets, and each dataset contains descriptive metadata and data files (including documentation and code that accompany the data).

This session examines ongoing efforts to realize a combined use case for these projects that will offer academic researchers privacy-preserving access to sensitive data. This would allow both novel secondary reuse and replication access to data that otherwise is commonly locked away in archives.  The session will also explore the potential impact of this work outside the academic world.”
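The “random noise” idea in the session description above can be sketched with the textbook Laplace mechanism applied to a counting query. This is a minimal illustration under stated assumptions, not the OpenDP API and not a vetted implementation: the function names and the `epsilon` parameter are chosen for this example, and naive floating-point sampling like this is known to be unsuitable for production use.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise.

    The difference of two independent Exp(1) draws follows Laplace(0, 1).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(values, predicate, epsilon: float, rng=None) -> float:
    """Return a noisy count of the values that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace mechanism uses scale 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

For example, `private_count(ages, lambda a: a >= 40, epsilon=0.5)` returns the true count plus noise. A smaller `epsilon` means more noise and stronger privacy; a larger one means less noise and weaker privacy.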

Dataverse Community Meeting 2020

“The annual Dataverse Community Meeting is an opportunity to build, grow, and enrich the global community. Like the open-source Dataverse product itself, the activities of the Dataverse Community Meetings are community-driven. Over three days of presentations, workshops, and working group meetings we aim to promote and learn about behavioral and technical solutions and standards for curating, sharing, and preserving data that can be discovered and reused across disciplines to reproduce and advance research.

The Dataverse Community Meeting is hosted by Harvard’s Institute for Quantitative Social Science. Learn more about The Dataverse Project at our dataverse.org site….”

Advancing computational reproducibility in the Dataverse data repository platform

Abstract:  Recent reproducibility case studies have raised concerns showing that much of the deposited research has not been reproducible. One of their conclusions was that the way data repositories store research data and code cannot fully facilitate reproducibility due to the absence of a runtime environment needed for the code execution. New specialized reproducibility tools provide cloud-based computational environments for code encapsulation, thus enabling research portability and reproducibility. However, they do not often enable research discoverability, standardized data citation, or long-term archival like data repositories do. This paper addresses the shortcomings of data repositories and reproducibility tools and how they could be overcome to improve the current lack of computational reproducibility in published and archived research outputs.


COVID-19 Data Collection

“This is a general collection of COVID-19 data deposited in the Harvard Dataverse repository. The list in this collection is maintained by the Harvard Dataverse data curation team (IQSS and Harvard Library). Researchers who deposit their related data into Harvard Dataverse will have their data linked to this collection, to increase discoverability of their data. Please use the contact link if you have any questions about this collection.”

Dataverse: Sign Up for a One-Day Workshop | Institute for Quantitative Social Science

“The Harvard Dataverse data management and curation team is holding a one day workshop on Monday, March 16th 2020, from 9-12 pm. Come learn about the Harvard Dataverse for data sharing and preservation. You will have an opportunity to discuss your research project and data sharing needs, including:

The purpose of research data sharing
How to organize research data for sharing
Options for sharing deidentified or sensitive data
Data analysis and visualization tools provided by Harvard Dataverse
Dataset and file level DOIs and data citations
How to manage a team project on Harvard Dataverse, and much more.

We will also discuss the importance of curating data to meet FAIR data guidelines when sharing your data on Harvard Dataverse.  Space is limited and laptops are required….”

The Big Data Challenge – Recommendations by Mercè Crosas – Big Data Value

“Currently, Mercè’s team is in the process of implementing datatags for datasets in the Harvard Dataverse repository. This has been a big task due to legal compliance issues, security requirements and the conditions set by various data agreements. These datasets often contain sensitive information about individuals and therefore safeguards need to be put in place to protect these individuals. Policies on data sharing play a critical role in balancing the benefits and risks. The average citizen wants privacy and safety of his data but has little time for data governance. As the amount of data driven products is only expected to increase, so is the demand of citizens for privacy management. It is important to map the data beforehand because the manner in which relevant regulation is to be attached to the data is dependent on the data itself. When regulation changes, the datatags will have to be adapted as well, for instance by providing an updated version of the tag. For these purposes, they teamed up with lawyers helping them with the verification of the datatags. More recently, Mercè has been involved as one of the co-PIs with the OpenDP project, an open-source platform for differential privacy libraries. This work would allow sensitive datasets to be mined and analyzed while preserving their privacy and without ever being accessed directly by the researchers. Dataverse, DataTags, and OpenDP will together provide a privacy-preserving platform for sharing and analyzing sensitive data….”

European Dataverse Workshop 2020

“Are you looking for a repository software to run your research data repository?

Are you already using Dataverse and want to exchange experiences and learn more about Dataverse?

>> Join us at the European Dataverse Workshop 2020!

Date: January 23-24, 2020 Venue: UiT The Arctic University of Norway

Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data.

For more information about the European Dataverse Workshop 2020, see the workshop webpage.


Save the date!”

One Step Closer to the “Paper of the Future” | Research Data Management @Harvard

“As a researcher who is trying to understand the structure of the Milky Way, I often deal with very large astronomical datasets (terabytes of data, representing almost two billion unique stars). Every single dataset we use is publicly available to anyone, but the primary challenge in processing them is just how large they are. Most astronomical data hosting sites provide an option to remotely query sources through their web interface, but it is slow and inefficient for our science….

To circumvent this issue, we download all the catalogs locally to Harvard Odyssey, with each independent survey housed in a separate database. We use a special python-based tool (the “Large-Survey Database”) developed by a former post-doctoral scholar at Harvard, which allows us to perform fast queries of these databases simultaneously using the Odyssey computing cluster….

To extract information from each hdf5 file, we have developed a sophisticated Bayesian analysis pipeline that reads in our curated hdf5 files and outputs best fits for our model parameters (in our case, distances to local star-forming regions near the sun). Led by a graduate student and co-PI on the paper (Joshua Speagle), the python codebase is publicly available on GitHub with full API documentation. In the future, it will be archived with a permanent DOI on Zenodo. Also on GitHub users will find full working examples of the code, demonstrating how users can read in the publicly available data and output the same style of figures seen in the paper. Sample data are provided, and the demo is configured as a jupyter notebook, so interested users can walk through the methodology line-by-line….”

Open Access Week at Harvard Library 2018 | Communications

“In celebration of OA Week, the Harvard Library Office for Scholarly Communication will share some great news about OA and the Harvard Community:

  • The OSC will launch a new OA policy for staff, researchers, and scholars to use open-access licensing
  • we will share our annual statistics from around the world, highlighting Harvard’s scholarship’s impact
  • reveal the new and improved Harvard open-access repository, DASH (Digital Access to Scholarship at Harvard).

In addition, the Harvard Library OSC and the Research Data Management Program are teaming up to co-sponsor a series of events during OA week, including an open-access open house, interactive workshops on ORCID, reproducibility, Dataverse, and more. See the schedule for more details….”