Nine best practices for research software registries and repositories [PeerJ]

Abstract:  Scientific software registries and repositories improve software findability and research transparency, provide information for software citations, and foster preservation of computational methods in a wide range of disciplines. Registries and repositories play a critical role by supporting research reproducibility and replicability, but developing them takes effort and few guidelines are available to help prospective creators of these resources. To address this need, the FORCE11 Software Citation Implementation Working Group convened a Task Force to distill the experiences of the managers of existing resources in setting expectations for all stakeholders. In this article, we describe the resultant best practices which include defining the scope, policies, and rules that govern individual registries and repositories, along with the background, examples, and collaborative work that went into their development. We believe that establishing specific policies such as those presented here will help other scientific software registries and repositories better serve their users and their disciplines.

 

Author interview: Nine best practices for software repositories and registries

PeerJ talks to Daniel Garijo about the recently published PeerJ Computer Science article Nine best practices for research software registries and repositories. The article is featured in the PeerJ Software Citation, Indexing, and Discoverability Special Issue.


 

Can you tell us a bit about yourself?

This work would not have been possible without the SciCodes community, the participants of the 2019 Scientific Software Registry Collaboration Workshop, and the FORCE11 Software Citation Implementation Working Group. It all started when a task force of that working group undertook the initial work detailed in the paper, and then formed SciCodes to continue working together. We are a group of software enthusiasts who maintain and curate research software repositories and registries across disciplines, including geosciences, neuroscience, biology, and astronomy; the initiative currently counts more than 20 resources and 30 participants worldwide.

 

Can you briefly explain the research you published in PeerJ?

In examining the literature, we found best practices and policy suggestions for many different aspects of science, software, and data, but none that specifically addressed software repositories and registries. Our goal was to examine our own and other similar resources, share practices, discuss common challenges, and develop a set of basic best practices for these resources.  

 

What did you find, and what impact do these practices have?

We were surprised to find a lot of diversity between our resources. We expected that our domains, missions, and the types of software in our collections would differ, but we expected more commonality in the software metadata our different resources collect! We had far fewer fields in common than expected. For example, some resources collect information on which operating system a software package runs on, while others do not. In retrospect, this makes sense: disciplines have different goals and expectations for sharing and reusing research software, and they differ in how heterogeneous their technology landscapes are.
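To make "software metadata fields" concrete, a registry entry might record information along these lines. This is a hypothetical sketch loosely following the CodeMeta vocabulary; the field set and values shown are illustrative, not any particular registry's actual schema:

```json
{
  "name": "example-software",
  "description": "Hypothetical entry for a research software package.",
  "codeRepository": "https://example.org/repo/example-software",
  "programmingLanguage": "Python",
  "license": "https://spdx.org/licenses/MIT",
  "operatingSystem": "Linux"
}
```

Fields such as `operatingSystem` are exactly the kind that some registries collect and others omit, which is why cross-registry metadata turned out to have fewer fields in common than expected.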

 

The practices outlined in our work aim to strengthen registries and repositories by enacting policies that make our resources more transparent to users and that encourage us to think more about the long-term availability of software entries. They also provide a way for us to work cooperatively toward making our metadata searchable across resources, as software that is useful in one field may have applications in another.

Our proposed practices are already having an impact. They have helped member registries audit themselves and begin enacting policies and procedures that strengthen their operations, encouraging long-term success for their communities. Through this paper, we hope that other registries find these practices useful in improving their own operations and, just maybe, contribute to the conversation by joining SciCodes.

 

What kinds of lessons do you hope your readers take away from the research?

We hope the proposed practices will help new and existing resources consider key aspects of their maintainability, metadata, and future availability. We expected that converging on common practices would be easy, but developing policies and practices that cover a wide range of disciplines and missions proved challenging. We are grateful to our funders for enabling us to convene such a great group of experts and, of course, to the experts for contributing their time to improving our initial draft.

 

How did you first hear about PeerJ, and what persuaded you to submit to us?

An editor of this special issue on software citation, indexing and discoverability (https://peerj.com/special-issues/84-software) mentioned that this would be an interesting paper for the community. While not fitting neatly into this category, we felt that the workshop discussions and resulting best practices contribute substantially to the software citation ecosystem, as repositories and registries are a mechanism to promote discovery, reuse, and credit for software.

 



GitHub Repositories with Links to Academic Papers: Open Access, Traceability, and Evolution

Abstract:  Traceability between published scientific breakthroughs and their implementations is essential, especially for Open Source Software that implements bleeding-edge science in its code. However, establishing the link between GitHub repositories and academic papers can prove difficult, and the impact of such links remains unknown. This paper investigates the role of academic paper references contained in these repositories. We conducted a large-scale study of 20 thousand GitHub repositories to establish the prevalence of references to academic papers. We use a mixed-methods approach to identify Open Access (OA), traceability, and evolutionary aspects of the links. Although referencing a paper is not typical, we find that the vast majority of referenced academic papers are OA. In terms of traceability, our analysis revealed that machine learning is the most prevalent topic among the repositories, and that these repositories tend to be affiliated with academic communities. More than half of the papers do not link back to any repository. A case study of referenced arXiv papers shows that most of these papers are high-impact, influential, and aligned with academia, being referenced by repositories written in different programming languages. From the evolutionary aspect, we find very few changes in which papers are referenced and in the links to them.

 


 

Microsoft has reportedly acquired GitHub – The Verge

“Microsoft has reportedly acquired GitHub, and could announce the deal as early as Monday. Bloomberg reports that the software giant has agreed to acquire GitHub, and that the company chose Microsoft partly because of CEO Satya Nadella. Business Insider first reported that Microsoft had been in talks with GitHub recently.

GitHub is a vast code repository that has become popular with developers and companies hosting their projects, documentation, and code. Apple, Amazon, Google, and many other big tech companies use GitHub. Microsoft is the top contributor to the site, and has more than 1,000 employees actively pushing code to repositories on GitHub. Microsoft even hosts its own original Windows File Manager source code on GitHub. The service was last valued at $2 billion back in 2015, but it’s not clear exactly how much Microsoft has paid to acquire GitHub….”

Code Ocean | Discover & Run Scientific Code

“Our mission is to make the world’s scientific code more reusable, executable and reproducible.

 

Code Ocean is a cloud-based computational reproducibility platform that provides researchers and developers an easy way to share, discover and run code published in academic journals and conferences.

More and more of today’s research includes software code, statistical analysis and algorithms that are not included in traditional publishing. But they are often essential to reproducing the research results and reusing them in new products or research. This creates a major roadblock for researchers, one that inspired the first steps of Code Ocean as part of the 2014 Runway Startup Postdoc Program at the Jacobs Technion-Cornell Institute. Today, the company employs more than 10 people and officially launched the product in February 2017.

For the first time, researchers, engineers, developers and scientists can upload code and data in 10 programming languages and link working code in a computational environment with the associated article for free. We assign a Digital Object Identifier (DOI) to the algorithm, providing correct attribution and a connection to the published research.

The platform provides open access to the published software code and data to view and download for everyone for free. But the real treat is that users can execute all published code without installing anything on their personal computer. Everything runs in the cloud on CPUs or GPUs according to the user needs. We make it easy to change parameters, modify the code, upload data, run it again, and see how the results change….”