An Open-Publishing Response to the COVID-19 Infodemic

Abstract: The COVID-19 pandemic catalyzed the rapid dissemination of papers and preprints investigating the disease and its associated virus, SARS-CoV-2. The multifaceted nature of COVID-19 demands a multidisciplinary approach, but the urgency of the crisis combined with the need for social distancing measures presents unique challenges to collaborative science. We applied a massive online open publishing approach to this problem using Manubot. Through GitHub, collaborators summarized and critiqued COVID-19 literature, creating a review manuscript. Manubot automatically compiled citation information for referenced preprints, journal publications, websites, and clinical trials. Continuous integration workflows retrieved up-to-date data from online sources nightly, regenerating some of the manuscript’s figures and statistics. Manubot rendered the manuscript into PDF, HTML, LaTeX, and DOCX outputs, immediately updating the version available online upon the integration of new content. Through this effort, we organized over 50 scientists from a range of backgrounds who evaluated over 1,500 sources and developed seven literature reviews. While many efforts from the computational community have focused on mining COVID-19 literature, our project illustrates the power of open publishing to organize both technical and non-technical scientists to aggregate and disseminate information in response to an evolving crisis.
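The citation compilation described in the abstract works from prefixed citation keys (Manubot supports prefixes such as `doi:`, `pmid:`, `arxiv:`, and `url:`) that are resolved to full metadata at build time. The sketch below only illustrates the prefix-parsing idea; the source list and function are assumptions for illustration, not Manubot's actual implementation.

```python
# Illustrative sketch of prefix-based citation keys of the kind Manubot
# resolves at build time. The KNOWN_SOURCES set and parse_citekey function
# are hypothetical; Manubot's real resolver supports more sources and
# fetches full CSL metadata for each identifier.

KNOWN_SOURCES = {"doi", "pmid", "arxiv", "url", "clinicaltrials"}

def parse_citekey(citekey: str) -> tuple[str, str]:
    """Split a citation key like 'doi:10.1101/2020.01.30.927871' into
    a (source, identifier) pair, validating the source prefix."""
    source, sep, identifier = citekey.partition(":")
    if not sep or source not in KNOWN_SOURCES:
        raise ValueError(f"unrecognized citation key: {citekey!r}")
    return source, identifier
```

Splitting on only the first colon matters for `url:` keys, whose identifiers themselves contain colons.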


Sci-Hub Celebrates 10 Years Of Existence, With A Record 88 Million Papers Available, And A Call For Funds To Help It Add AI And Go Open Source | Techdirt

“To celebrate ten years offering a large proportion of the world’s academic papers for free — against all the odds, and in the face of repeated legal action — Sci-Hub has launched a funding drive:

Sci-Hub is run fully on donations. Instead of charging for access to information, it creates a common pool of knowledge free for anyone to access.

The donations page says that “In the next few years Sci-Hub is going to dramatically improve”, and lists a number of planned developments. These include a better search engine, a mobile app, and the use of neural networks to extract ideas from papers and make inferences and new hypotheses. Perhaps the most interesting idea is for the software behind Sci-Hub to become open source. The move would address in part a problem discussed by Techdirt back in May: the fact that Sci-Hub is a centralized service, with a single point of failure. Open sourcing the code — and sharing the papers database — would allow multiple mirrors to be set up around the world by different groups, increasing its resilience….”

Harnessing Scholarly Literature as Data to Curate, Explore, and Evaluate Scientific Research

Abstract: There currently exist hundreds of millions of scientific publications, with more being created at an ever-increasing rate. This is leading to information overload: the scale and complexity of this body of knowledge is increasing well beyond the capacity of any individual to make sense of it all, overwhelming traditional, manual methods of curation and synthesis. At the same time, the availability of this literature and surrounding metadata in structured, digital form, along with the proliferation of computing power and techniques to take advantage of large-scale and complex data, represents an opportunity to develop new tools and techniques to help people make connections, synthesize, and pose new hypotheses. This dissertation consists of several contributions of data, methods, and tools aimed at addressing information overload in science. My central contribution to this space is Autoreview, a framework for building and evaluating systems to automatically select relevant publications for literature reviews, starting from small sets of seed papers. These automated methods have the potential to help researchers save time and effort when keeping up with relevant literature, as well as to surface papers that more manual methods may miss. I show that this approach can work to recommend relevant literature, and can also be used to systematically compare different features used in the recommendations. I also present the design, implementation, and evaluation of several visualization tools. One of these is an animated network visualization showing the influence of a scholar over time. Another is SciSight, an interactive system for recommending new authors and research by finding similarities along different dimensions. Additionally, I discuss the current state of available scholarly data sets; my work curating, linking, and building upon these data sets; and methods I developed to scale graph clustering techniques to very large networks.
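As a rough illustration of the seed-paper idea, the toy sketch below ranks candidate papers by how many seed papers they are linked to in a citation graph. The data structures and scoring here are assumptions for illustration only; Autoreview itself trains and evaluates supervised models over much richer features than simple link counts.

```python
from collections import Counter

def expand_from_seeds(citations, seeds, top_n=5):
    """Toy sketch of seed-based literature expansion: candidates are papers
    linked (citing or cited) to any seed, ranked by how many seeds they touch.

    `citations` maps a paper ID to the set of paper IDs it cites.
    This is a hypothetical baseline, not Autoreview's method.
    """
    # Build an undirected link map so citing and cited papers both count.
    links = {}
    for paper, cited in citations.items():
        for other in cited:
            links.setdefault(paper, set()).add(other)
            links.setdefault(other, set()).add(paper)
    # Score each non-seed neighbor by the number of seeds it links to.
    scores = Counter()
    for seed in seeds:
        for neighbor in links.get(seed, ()):
            if neighbor not in seeds:
                scores[neighbor] += 1
    return [paper for paper, _ in scores.most_common(top_n)]
```

A real system would go on to extract features (citation counts, text similarity, co-authorship) for each candidate and train a classifier against held-out papers from known reviews.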


Update: 1201 Exemption to Enable Text and Data Mining Research | Authors Alliance

“Authors Alliance, joined by the Library Copyright Alliance and the American Association of University Professors, is petitioning the Copyright Office for a new three-year exemption to the Digital Millennium Copyright Act (“DMCA”) as part of the Copyright Office’s eighth triennial rulemaking process. If granted, our proposed exemption would allow researchers to bypass technical protection measures (“TPMs”) in order to conduct text and data mining (“TDM”) research on literary works that are distributed electronically and motion pictures. Recently, we met with representatives from the U.S. Copyright Office to discuss the proposed exemption, focusing on the circumstances in which access to corpus content is necessary for verifying algorithmic findings and ways to address security concerns without undermining the goal of the exemption….”

Words Algorithm Collection – finding closely related open access books using text mining techniques | LIBER Quarterly: The Journal of the Association of European Research Libraries

Open access platforms and retail websites are both trying to present the most relevant offerings to their patrons. Retail websites deploy recommender systems that collect data about their customers. These systems are successful but intrude on privacy. As an alternative, this paper presents an algorithm that uses text mining techniques to find the most important themes of an open access book or chapter. By locating other publications that share one or more of these themes, it is possible to recommend closely related books or chapters.

The algorithm splits the full text into trigrams. It removes all trigrams containing words that are commonly used in everyday language and in (open access) book publishing. The most frequently occurring remaining trigrams are distinctive to the publication and indicate the themes of the book. The next step is finding publications that share one or more of the trigrams. The strength of the connection can be measured by counting – and ranking – the number of shared trigrams. The algorithm was used to find connections between 10,997 titles: 67% in English, 29% in German and 6% in Dutch or a combination of languages. The algorithm is able to find connected books across languages.
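The pipeline described above can be sketched roughly as follows. The stopword list and the top-20 cutoff are illustrative placeholders; the published algorithm uses a far larger list of common words and its own ranking details.

```python
from collections import Counter

# Placeholder stopword list; the real algorithm filters a much larger set of
# words common in everyday language and in (open access) book publishing.
STOPWORDS = {"the", "and", "of", "in", "is", "open", "access"}

def trigrams(text):
    """Yield word trigrams from the text that contain no stopword."""
    words = [w.lower() for w in text.split()]
    for i in range(len(words) - 2):
        tri = tuple(words[i:i + 3])
        if not any(w in STOPWORDS for w in tri):
            yield tri

def top_trigrams(text, n=20):
    """The most frequent remaining trigrams, taken as the text's themes."""
    return {tri for tri, _ in Counter(trigrams(text)).most_common(n)}

def connection_strength(text_a, text_b):
    """Connection strength: number of top trigrams two publications share."""
    return len(top_trigrams(text_a) & top_trigrams(text_b))
```

Because the themes are surface trigrams rather than language-specific models, the same comparison works across the English, German, and Dutch titles mentioned above whenever distinctive phrases (names, technical terms) recur verbatim.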

It is possible to use the algorithm for several use cases, not just recommender systems. Creating benchmarks for publishers or creating a collection of connected titles for libraries are other possibilities. Apart from the OAPEN Library, the algorithm can be applied to other collections of open access books or even open access journal articles. Combining the results across multiple collections will enhance its effectiveness.

Transforming Library Services for Computational Research with Text Data: Environmental Scan, Stakeholder Perspectives, and Recommendations for Libraries – ACRL Insider

“ACRL announces the publication of a new white paper, Transforming Library Services for Computational Research with Text Data: Environmental Scan, Stakeholder Perspectives, and Recommendations for Libraries, from authors Megan Senseney, Eleanor Dickson Koehl, Beth Sandore Namachchivaya, and Bertram Ludäscher.

This report from the IMLS National Forum on Data Mining Research Using In-Copyright and Limited-Access Text Datasets seeks to build a shared understanding of the issues and challenges associated with the legal and socio-technical logistics of conducting computational research with text data. It captures preparatory activities leading up to the forum and its outcomes to (1) provide academic librarians with a set of recommendations for action and (2) establish a research agenda for the LIS community….”

FAIR Island: Networked, Machine-Actionable DMPs for Open Science | RDA

“Imagine a dream scenario for Open Data advocates: A working field station that supports scientists with research data management practices that allow for their data to be used beyond the initial purpose of the project! Tetiaroa is such a place and the FAIR Island Project supports researchers as we translate the broader FAIR principles into optimal data policies and technical infrastructure by leveraging RDA outputs including standards that support networked, machine-actionable Data Management Plans (DMPs), and Persistent Identifiers (PIDs). Leveraging the global research data management community’s work, FAIR Island provides a real-world example where data and knowledge collected on Tetiaroa will be curated and made openly available as quickly as possible….”

Legal Literacies for Text Data Mining: New OER | Authors Alliance

“Authors Alliance is pleased to share the news of the open release of a comprehensive open educational resource (OER) on legal issues related to text data mining.

The new OER covers material taught at the Building Legal Literacies for Text Data Mining institute (funded by the National Endowment for the Humanities and led by Rachael Samberg and Tim Vollmer of UC Berkeley Library), and covers copyright, technological protection measures, privacy, and ethical considerations. It also helps other digital humanities professionals and researchers run their own similar institutes by describing in detail how the programming was developed and delivered, and includes ideas for hosting shorter literacy teaching sessions. Authors Alliance’s Executive Director, Brianna Schofield, co-authored a chapter on copyright in the OER.

Until now, humanities researchers conducting text data mining in the U.S. have had to maneuver through a thicket of legal issues without much guidance or assistance. The new OER empowers digital humanities researchers and professionals (such as librarians, consultants, and other institutional staff) to confidently navigate United States law, policy, ethics, and risk within digital humanities text data mining projects so that they can more easily engage in this type of research and contribute to the advancement of knowledge….”

Sci-Hub Pledges Open Source & AI Alongside Crypto Donation Drive | TorrentFreak

“Sci-Hub founder Alexandra Elbakyan has launched a donation drive to ensure the operations and development of the popular academic research platform. For safety reasons, donations can only be made in cryptocurrencies but the pledges include a drive to open source the project and the introduction of artificial intelligence to discover new hypotheses….”

NIH-Wide Strategic Plan: Fiscal Years 2021-2025

“NIH is committed to making findings from the research that it funds accessible and available in a timely manner, while also providing safeguards for privacy, intellectual property, security, and data management. For instance, NIH-funded investigators are expected to make the results and accomplishments of their activities freely available within 12 months of publication. NIH also encourages investigators to share results prior to peer review, such as through preprints, to speed the dissemination of their findings and enhance the rigor of their work through informal peer review. A robust culture of data sharing is critical to continued progress in science, maximizing NIH’s investment in research, and assurance of the highest levels of transparency and rigor. To this end, NIH will continue to promote opportunities for data management and sharing while allowing flexibility for various data types, sharing platforms, and strategies. Additionally, NIH is implementing a policy requiring that all applications include data sharing and management plans that consider input from stakeholders….”

Building Legal Literacies for Text Data Mining – Simple Book Publishing

“This book explores the legal literacies covered during the virtual Building Legal Literacies for Text Data Mining Institute, including copyright (both U.S. and international law), technological protection measures, privacy, and ethical considerations. It describes in detail how we developed and delivered the 4-day institute, and also provides ideas for hosting shorter literacy teaching sessions. Finally, we offer reflections and take-aways on the Institute….”

Now available: Open educational resource of Building Legal Literacies for Text Data Mining – UC Berkeley Library Update

“Last summer we hosted the Building Legal Literacies for Text Data Mining institute. We welcomed 32 digital humanities researchers and professionals to the weeklong virtual training, with the goal to empower them to confidently navigate law, policy, ethics, and risk within digital humanities text data mining (TDM) projects. Building Legal Literacies for Text Data Mining (Building LLTDM) was made possible through a grant from the National Endowment for the Humanities. 

Since the remote institute in June 2020, the participants and project team reconvened in February 2021 to discuss how participants had been thinking about, performing, or supporting TDM in their home institutions and projects with the law and policy literacies in mind.

To maximize the reach and impact of Building LLTDM, we have now published a comprehensive open educational resource (OER) of the contents of the institute. The OER covers copyright (both U.S. and international law), technological protection measures, privacy, and ethical considerations. It also helps other digital humanities professionals and researchers run their own similar institutes by describing in detail how we developed and delivered programming (including our pedagogical reflections and take-aways), and includes ideas for hosting shorter literacy teaching sessions. The resource (available as a web-book or in downloadable formats such as PDF, EPUB, and MOBI) is in the public domain under the CC0 Public Domain Dedication, meaning it can be accessed, reused, and repurposed without restriction. …”

A missed deadline: the state of play of the Copyright Directive | Europeana Pro

About two years ago, the Copyright in the Digital Single Market (CDSM) Directive was adopted, obliging European Union Member States to ‘bring into force the laws, regulations and administrative provisions necessary to comply with this Directive by 7 June 2021’. On this date, we take a look at the progress of Member States, and at some of the policy choices they have made.

PIJIP and Wikimedia Germany Co-Host RightsCon 2021 Panel – American University Washington College of Law

“PIJIP Director Sean Flynn co-hosted a panel titled Access to Digital Education in the Time of COVID-19: Copyright and Public Health Emergencies as part of RightsCon 2021. He hosted the discussion with Justus Dreyling, the project manager of international regulation with Wikimedia Germany.

The panel focused on the impact of inadequate copyright rules on access to and use of educational materials in digital settings as well as how new legal instruments at the international level could solve these problems and facilitate access to knowledge….”

Open Science in Kenya: Where Are We? | Research Metrics and Analytics

“According to the United Nations Educational, Scientific, and Cultural Organization (UNESCO), Open Science is the movement to make scientific research and data accessible to all. It has great potential for advancing science. At its core, it includes (but is not limited to) open access, open data, and open research. Some of the associated advantages are promoting collaboration, sharing and reproducibility in research, and preventing the reinvention of the wheel, thus saving resources. As research becomes more globalized and its output grows exponentially, especially in data, the need for open scientific research practices is more evident — the future of modern science. This has resulted in a concerted global interest in open science uptake. Even so, barriers still exist. The formal training curriculum in most, if not all, universities in Kenya does not equip students with the knowledge and tools to subsequently practice open science in their research. Therefore, to work openly and collaboratively, there is a need for awareness and training in the use of open science tools. These have been neglected, especially in most developing countries, and remain barriers to the cause. Moreover, there is scanty research on the state of affairs regarding the practice and/or adoption of open science. Thus, we developed, through the OpenScienceKE framework, a model to narrow the gap. A sensitize-train-hack-collaborate model was applied in Nairobi, the economic and administrative capital of Kenya. Using the model, we sensitized through seminars, trained on the use of tools through workshops, applied the skills learned in training through hackathons to collaboratively answer the question on the state of open science in Kenya. While the former parts of the model had 20–50 participants, the latter part mainly involved participants with a bioinformatics background, leveraging their advanced computational skills. 
This model resulted in an open resource that researchers can use to publish as open access cost-effectively. Moreover, we observed a growing interest in open science practices in Kenya through literature search and data mining and that lack of awareness and skills may still hinder the adoption and practice of open science. Furthermore, at the time of the analyses, we surprisingly found that out of the 20,069 papers downloaded from bioRxiv, only 18 had Kenyan authors, the majority of which (16) were international collaborations. This may suggest poor uptake of the use of preprints among Kenyan researchers. The findings in this study highlight the state of open science in Kenya and challenges facing its adoption and practice while bringing forth possible areas for primary consideration in the campaign toward open science. It also proposes a model (sensitize-train-hack-collaborate model) that may be adopted by researchers, funders and other proponents of open science to address some of the challenges faced in promoting its adoption in Kenya….”