The Rise of GitHub in Scholarly Publications

Abstract:  The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). Similarly, the Software Heritage Foundation is working to archive public source code, but there is value in archiving the issue threads, pull requests, and wikis that provide important context to the code while maintaining their original URLs. In current implementations, source code and its ephemera are not preserved, which presents a problem for scholarly projects where reproducibility matters. To understand and quantify the scope of this issue, we analyzed the use of GHP URIs in the arXiv and PMC corpora from January 2007 to December 2021. In total, there were 253,590 URIs to GitHub, SourceForge, Bitbucket, and GitLab repositories across the 2.66 million publications in the corpora. We found that GitHub, GitLab, SourceForge, and Bitbucket were collectively linked to 160 times in 2007 and 76,746 times in 2021. In 2021, one out of five publications in the arXiv corpus included a URI to GitHub. The complexity of GHPs like GitHub is not amenable to conventional Web archiving techniques. Therefore, the growing use of GHPs in scholarly publications points to an urgent and growing need for dedicated efforts to archive their holdings in order to preserve research code and its scholarly ephemera.
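
The kind of count the abstract reports (URIs to GitHub, GitLab, SourceForge, and Bitbucket found in publication text) can be illustrated with a minimal sketch. This is not the authors' pipeline: the regular expression, the `GHP_HOSTS` table, and the `count_ghp_uris` helper are assumptions made for illustration, and real corpora would need more careful URI normalization.

```python
import re
from collections import Counter

# Hostnames for the four platforms named in the abstract (assumed mapping).
GHP_HOSTS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "bitbucket.org": "Bitbucket",
    "sourceforge.net": "SourceForge",
}

# Illustrative pattern, not the study's extraction rule: optional scheme,
# optional subdomains, one of the hosts above, then an optional path.
GHP_PATTERN = re.compile(
    r"(?<![\w.-])(?:https?://)?(?:[\w-]+\.)*"
    r"(github\.com|gitlab\.com|bitbucket\.org|sourceforge\.net)\b"
    r"(?:/[^\s)\]>]*)?",
    re.IGNORECASE,
)


def count_ghp_uris(text: str) -> Counter:
    """Count URIs per platform in a block of publication text."""
    counts = Counter()
    for match in GHP_PATTERN.finditer(text):
        counts[GHP_HOSTS[match.group(1).lower()]] += 1
    return counts


sample = "Code: https://github.com/a/b and gitlab.com/c/d, plus notgithub.com/x"
print(count_ghp_uris(sample))
```

Running the helper over each publication and summing the counters by year would give a per-platform time series of the sort the abstract summarizes.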

Open Data on GitHub: Unlocking the Potential of AI

Abstract:  GitHub is the world’s largest platform for collaborative software development, with over 100 million users. GitHub is also used extensively for open data collaboration, hosting more than 800 million open data files, totaling 142 terabytes of data. This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research. We analyze the existing landscape of open data on GitHub and the patterns of how users share datasets. Our findings show that GitHub is one of the largest hosts of open data in the world and has experienced an accelerated growth of open data assets over the past four years. By examining the open data landscape on GitHub, we aim to empower users and organizations to leverage existing open datasets and improve their discoverability, ultimately contributing to the ongoing AI revolution to help address complex societal issues. We release the three datasets that we have collected to support this analysis as open datasets.

GitHub is Sued, and We May Learn Something About Creative Commons Licensing – The Scholarly Kitchen

“I have had people tell me with doctrinal certainty that Creative Commons licenses allow text and data mining, and insofar as license terms are observed, I agree. The making of copies to perform text and data mining, machine learning, and AI training (collectively “TDM”) without additional licensing is authorized for commercial and non-commercial purposes under CC BY, and for non-commercial purposes under CC BY-NC. (Full disclosure: CCC offers RightFind XML, a service that supports licensed commercial access to full-text articles for TDM with value-added capabilities.)

I have long wondered, however, about the interplay between the attribution requirement (i.e., the “BY” in CC BY) and TDM. After all, the bargain with those licenses is that the author allows reuse, typically at no cost, but requires attribution. Attribution under the CC licenses may be the author’s primary benefit and motivation, as few authors would agree to offer the licenses without credit.

In the TDM context, this raises interesting questions:

Does the attribution requirement mean that the author’s information may not be removed as a data element from the content, even if inclusion might frustrate the TDM exercise or introduce noise into the system?
Does the attribution need to be included in the data set at every stage?
Does the result of the mining need to include attribution, even if hundreds of thousands of CC BY works were mined and the output does not include content from individual works?

While these questions may have once seemed theoretical, that is no longer the case. An analogous situation involving open software licenses (GNU and the like) is now being litigated….”

Code citation was made possible by research software engineers in Germany and the Netherlands | eScience Center

Did you ever have to cite your work when writing an essay for school? Or are you a researcher who can attest to the importance of being cited in research papers? Or perhaps you are a journalist who wants to cite your source to support your story? If you can relate to any of these scenarios, then we can all agree that giving credit where credit is due is important. Well, did you know that up until recently, it was very difficult for software developers to receive credit for their code or for others to cite their work? Thanks to a group of research software engineers in Germany and right here at the Netherlands eScience Center, code citation is now possible! How did they make this happen? For the story behind the scenes, read on.

GitHub repositories with links to academic papers: Public access, traceability, and evolution – ScienceDirect

Abstract:  Traceability between published scientific breakthroughs and their implementation is essential, especially in the case of open-source scientific software which implements bleeding-edge science in its code. However, aligning the link between GitHub repositories and academic papers can prove difficult, and the current practice of establishing and maintaining such links remains unknown. This paper investigates the role of academic paper references contained in these repositories. We conduct a large-scale study of 20 thousand GitHub repositories that make references to academic papers. We use a mixed-methods approach to identify public access, traceability and evolutionary aspects of the links. Although referencing a paper is not typical, we find that a vast majority of referenced academic papers are public access. These repositories tend to be affiliated with academic communities. More than half of the papers do not link back to any repository. We find that academic papers from top-tier SE venues are not likely to reference a repository, but when they do, they usually link to a GitHub software repository. In a network of arXiv papers and referenced repositories, we find that the most referenced papers are (i) highly-cited in academia and (ii) are referenced by repositories written in different programming languages.

New feature: built-in citation support in GitHub

Nat Friedman on Twitter:

“We’ve just added built-in citation support to GitHub so researchers and scientists can more easily receive acknowledgments for their contributions to software. Just push a CITATION.cff file and we’ll add a handy widget to the repo sidebar for you. Enjoy!”
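
For readers unfamiliar with the file the tweet mentions, here is a minimal sketch of generating one. It assumes the Citation File Format 1.2.0 keys `cff-version`, `message`, `title`, `authors`, `version`, and `date-released`; the `build_citation_cff` helper and all sample values are hypothetical, and the sketch does not escape quotation marks inside field values.

```python
def build_citation_cff(title, authors, version, date_released):
    """Return the text of a minimal CITATION.cff file.

    authors is a list of (family_names, given_names) tuples;
    date_released is a YYYY-MM-DD string.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        "authors:",
    ]
    for family, given in authors:
        lines.append(f'  - family-names: "{family}"')
        lines.append(f'    given-names: "{given}"')
    lines.append(f'version: "{version}"')
    lines.append(f"date-released: {date_released}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
print(build_citation_cff("Example Tool", [("Doe", "Jane")], "1.0.0", "2021-08-01"))
```

Saving the returned text as `CITATION.cff` in the repository root and pushing it is the step the tweet describes.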

Julia Reda – GitHub Copilot is not infringing your copyright

GitHub is currently causing a lot of commotion in the Free Software scene with its release of Copilot. Copilot is an artificial intelligence trained on publicly available source code and texts. It produces code suggestions to programmers in real time. Since Copilot also uses the numerous GitHub repositories under copyleft licences such as the GPL as training material, some commentators accuse GitHub of copyright infringement, because Copilot itself is not released under a copyleft licence, but is to be offered as a paid service after a test phase. The controversy touches on several thorny copyright issues at once. What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community.

US: Archivists’ Victory over Overbroad Copyright Claim | Human Rights Watch

“A decision by GitHub, a leading software development platform, to reinstate a popular free software tool for downloading videos, means that human rights groups will be able to continue to use the software without interruption to preserve documentation of human rights abuses, Human Rights Watch, Mnemonic, and WITNESS said today. GitHub had removed the code for the software, youtube-dl, from its platform in response to a request by the Recording Industry Association of America Inc (RIAA)….”

GitHub preserves its open-source software code deep in the arctic for future generations – SiliconANGLE

“GitHub Inc. said today it has delivered a copy of all of the open-source software code stored on its website to a data repository at the Arctic World Archive, which is a very long-term archival facility buried 250 meters deep in the permafrost of an Arctic mountain.

The operation is part of the GitHub Archive Program, which is a project announced last year that aims to preserve today’s open-source software for future generations. To do that, GitHub said, it will store its code in an archive called the GitHub Arctic Code Vault, which it says has been built to last for a thousand years….”

Octopub

“An ODI experiment, Octopub offers a simple way to prepare and check a dataset, and publish it online onto the GitHub platform….

Data isn’t open until an open licence has been applied. You can choose a licence that suits your needs.

If you know nothing about licences, nothing to worry about, we’ll help you choose….

Want your data to be high quality? Reusable? Machine readable? We encourage you to apply schemas to your files, and we can help you get started….

Octopub can check the quality of your CSV files for common errors.

We’ll give you quality feedback, and you can review and re-upload as often as you need to until the data you want to publish is of the highest standard….”

GitHub Archive Program: the journey of the world’s open source code to the Arctic – The GitHub Blog

“At GitHub Universe 2019, we introduced the GitHub Archive Program along with the GitHub Arctic Code Vault. Our mission is to preserve open source software for future generations by storing your code in an archive built to last a thousand years.

On February 2, 2020, we took a snapshot of all active public repositories on GitHub to archive in the vault. Over the last several months, our archive partner Piql wrote 21TB of repository data to 186 reels of piqlFilm (digital photosensitive archival film). Our original plan was for our team to fly to Norway and personally escort the world’s open source code to the Arctic, but as the world continues to endure a global pandemic, we had to adjust our plans. We stayed in close contact with our partners, waiting for the time when it was safe for them to travel to Svalbard. We’re happy to report that the code was successfully deposited in the Arctic Code Vault on July 8, 2020. …”

Investigating the Scholarly Git Experience Survey

“You have been invited to take part in a research study to learn more about how people in academia interact with version control, specifically Git, and source code hosting platforms (e.g. GitLab, SourceForge, GitHub). This study will be conducted by Vicky Steeves and Sarah Nguyen of NYU Division of Libraries.

If you agree to be in this study, you will be asked to complete a 30 question survey about version control and source code hosting….”

GitHub is now free for all teams | TechCrunch

“GitHub today announced that all of its core features are now available for free to all users, including those that are currently on free accounts. That means free unlimited private repositories with unlimited collaborators for all, including teams that use the service for commercial projects, as well as up to 2,000 minutes per month of free access to GitHub Actions, the company’s automation and CI/CD platform….”

GitHub Repositories with Links to Academic Papers: Open Access, Traceability, and Evolution

Abstract:  Traceability between published scientific breakthroughs and their implementation is essential, especially in the case of Open Source Software that implements bleeding-edge science in its code. However, aligning the link between GitHub repositories and academic papers can prove difficult, and the link impact remains unknown. This paper investigates the role of academic paper references contained in these repositories. We conducted a large-scale study of 20 thousand GitHub repositories to establish the prevalence of references to academic papers. We use a mixed-methods approach to identify Open Access (OA), traceability and evolutionary aspects of the links. Although referencing a paper is not typical, we find that a vast majority of referenced academic papers are OA. In terms of traceability, our analysis revealed that machine learning is the most prevalent topic of repositories. These repositories tend to be affiliated with academic communities. More than half of the papers do not link back to any repository. A case study of referenced arXiv papers shows that most of these papers are high-impact and influential, align with academia, and are referenced by repositories written in different programming languages. From the evolutionary aspect, we find very few changes to the papers being referenced or to the links to them.