Tracking Science: How Libraries can Protect Data and Scientific Freedom | ZBW MediaTalk

A modern expression states: If you are not paying for the product, you are the product yourself. How can libraries help to prevent tracking in science, thereby protecting the data of the researchers and, in an idealistic sense, scientific freedom? In an interview, Julia Reda reveals the starting points and pitfalls.

Rough Notes on ‘Surveillance Capitalism in our Libraries’ | Open Working

This is a very partial and personal list of sources and documents related to the topic of ‘Surveillance Capitalism in our Libraries’. I am sure there is much more to be added. It is shared in case anyone else finds it useful. A commentable version of the document is available here

Singapore starts making its copyright law fit for the digital world; others need to follow its example – Walled Culture

“Singapore’s previous copyright law provided a broad “fair dealing” right that allowed a range of general uses.  The title of this exception has now been changed from “fair dealing” to “fair use.” That might seem a trivial change, but it’s significant.  Fair dealing rights are, in general, more limited than fair use ones.  The adoption of the latter term is further confirmation that Singapore’s new Copyright Law is moving in the right direction, and aims to provide greater freedoms for the general public, rather than fewer, as has so often been the case in this sector.”

[Survey on text and data mining]

“Thank you for taking this survey about research use of text and data mining (TDM). We understand that by taking it, you consent to giving us this information. This research is led by Prof. Patricia Aufderheide in the School of Communication at American University and Brandon Butler at University of Virginia library. We are looking at the challenges faced by researchers doing TDM, especially in regard to copyright issues. This study was given a waiver by the Institutional Review Board of American University. You will be anonymous to us, unless you choose to give us your name, in which case your identity and all information associated with it will be kept confidential. The data are stored securely, via the Qualtrics platform and American University. We may reuse this information in later studies, but always keeping either anonymity or confidentiality, according to the option you have chosen.”

The General Index: Search 107 million research papers for free – Big Think

“Millions of research papers get published every year, but the majority lie behind paywalls. A new online catalogue called the General Index aims to make it easier to access and search through the world’s research papers. Unlike other databases which include the full text of research papers, the General Index only allows users to access snippets of content….” 

An exploratory analysis: extracting materials science knowledge from unstructured scholarly data | Emerald Insight

Abstract:  Purpose

The output of academic literature has increased significantly due to digital technology, presenting researchers with a challenge across every discipline, including materials science, as it is impossible to manually read and extract knowledge from millions of published works. The purpose of this study is to address this challenge by exploring knowledge extraction in materials science, as applied to digital scholarship. An overriding goal is to help inform readers about the status of knowledge extraction in materials science.


Design/methodology/approach

The authors conducted a two-part analysis: a comparison of knowledge extraction methods applied to materials science scholarship across a sample of 22 articles, followed by a comparison of HIVE-4-MAT, an ontology-based knowledge extraction application, and MatScholar, a named entity recognition (NER) application. This paper covers contextual background and a review of three tiers of knowledge extraction (ontology-based, NER and relation extraction), followed by the research goals and approach.


Findings

The results indicate three key needs for researchers to consider for advancing knowledge extraction: the need for materials science focused corpora; the need for researchers to define the scope of the research being pursued, and the need to understand the tradeoffs among different knowledge extraction methods. This paper also points to future material science research potential with relation extraction and increased availability of ontologies.


Originality/value

To the best of the authors’ knowledge, there are very few studies examining knowledge extraction in materials science. This work makes an important contribution to this underexplored research area.

Giant, free index to world’s research papers released online

“In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers.

The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California that he founded.

Malamud says that because his index doesn’t contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers’ copyright restrictions on the re-use of paywalled articles. However, one legal expert says that publishers might question the legality of how Malamud created the index in the first place….”
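To make the idea of such an index concrete, here is a minimal sketch of a table that maps short word sequences (up to five words) to the articles containing them. It is not Malamud's actual pipeline: the function names, the whitespace tokenization, and the sample article IDs and texts are all assumptions made for illustration.

```python
from collections import defaultdict


def ngrams(text, max_n=5):
    """Yield every run of 1..max_n consecutive words in the text."""
    # Assumption: lowercase + whitespace splitting stands in for real tokenization.
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(articles, max_n=5):
    """Map each n-gram to the set of article IDs in which it appears."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for gram in ngrams(text, max_n):
            index[gram].add(article_id)
    return index


# Hypothetical article IDs and texts, invented for this example.
articles = {
    "doi:example-1": "text mining unlocks paywalled research papers",
    "doi:example-2": "libraries support text mining of research papers",
}

index = build_index(articles)
hits = sorted(index["research papers"])
print(hits)
```

Because only fragments of at most five words are stored, the six-word full sentence of the first sample article never appears as a key, which mirrors the distinction between snippets and full text drawn in the excerpt.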

An Open-Publishing Response to the COVID-19 Infodemic

Abstract:  The COVID-19 pandemic catalyzed the rapid dissemination of papers and preprints investigating the disease and its associated virus, SARS-CoV-2. The multifaceted nature of COVID-19 demands a multidisciplinary approach, but the urgency of the crisis combined with the need for social distancing measures present unique challenges to collaborative science. We applied a massive online open publishing approach to this problem using Manubot. Through GitHub, collaborators summarized and critiqued COVID-19 literature, creating a review manuscript. Manubot automatically compiled citation information for referenced preprints, journal publications, websites, and clinical trials. Continuous integration workflows retrieved up-to-date data from online sources nightly, regenerating some of the manuscript’s figures and statistics. Manubot rendered the manuscript into PDF, HTML, LaTeX, and DOCX outputs, immediately updating the version available online upon the integration of new content. Through this effort, we organized over 50 scientists from a range of backgrounds who evaluated over 1,500 sources and developed seven literature reviews. While many efforts from the computational community have focused on mining COVID-19 literature, our project illustrates the power of open publishing to organize both technical and non-technical scientists to aggregate and disseminate information in response to an evolving crisis.


Sci-Hub Celebrates 10 Years Of Existence, With A Record 88 Million Papers Available, And A Call For Funds To Help It Add AI And Go Open Source | Techdirt

“To celebrate ten years offering a large proportion of the world’s academic papers for free — against all the odds, and in the face of repeated legal action — Sci-Hub has launched a funding drive:

Sci-Hub is run fully on donations. Instead of charging for access to information, it creates a common pool of knowledge free for anyone to access.

The donations page says that “In the next few years Sci-Hub is going to dramatically improve”, and lists a number of planned developments. These include a better search engine, a mobile app, and the use of neural networks to extract ideas from papers and make inferences and new hypotheses. Perhaps the most interesting idea is for the software behind Sci-Hub to become open source. The move would address in part a problem discussed by Techdirt back in May: the fact that Sci-Hub is a centralized service, with a single point of failure. Open sourcing the code — and sharing the papers database — would allow multiple mirrors to be set up around the world by different groups, increasing its resilience….”

Harnessing Scholarly Literature as Data to Curate, Explore, and Evaluate Scientific Research

Abstract:  There currently exist hundreds of millions of scientific publications, with more being created at an ever-increasing rate. This is leading to information overload: the scale and complexity of this body of knowledge is increasing well beyond the capacity of any individual to make sense of it all, overwhelming traditional, manual methods of curation and synthesis. At the same time, the availability of this literature and surrounding metadata in structured, digital form, along with the proliferation of computing power and techniques to take advantage of large-scale and complex data, represents an opportunity to develop new tools and techniques to help people make connections, synthesize, and pose new hypotheses. This dissertation consists of several contributions of data, methods, and tools aimed at addressing information overload in science. My central contribution to this space is Autoreview, a framework for building and evaluating systems to automatically select relevant publications for literature reviews, starting from small sets of seed papers. These automated methods have the potential to help researchers save time and effort when keeping up with relevant literature, as well as surfacing papers that more manual methods may miss. I show that this approach can work to recommend relevant literature, and can also be used to systematically compare different features used in the recommendations. I also present the design, implementation, and evaluation of several visualization tools. One of these is an animated network visualization showing the influence of a scholar over time. Another is SciSight, an interactive system for recommending new authors and research by finding similarities along different dimensions. Additionally, I discuss the current state of available scholarly data sets; my work curating, linking, and building upon these data sets; and methods I developed to scale graph clustering techniques to very large networks.


Update: 1201 Exemption to Enable Text and Data Mining Research | Authors Alliance

“Authors Alliance, joined by the Library Copyright Alliance and the American Association of University Professors, is petitioning the Copyright Office for a new three-year exemption to the Digital Millennium Copyright Act (“DMCA”) as part of the Copyright Office’s eighth triennial rulemaking process. If granted, our proposed exemption would allow researchers to bypass technical protection measures (“TPMs”) in order to conduct text and data mining (“TDM”) research on literary works that are distributed electronically and motion pictures. Recently, we met with representatives from the U.S. Copyright Office to discuss the proposed exemption, focusing on the circumstances in which access to corpus content is necessary for verifying algorithmic findings and ways to address security concerns without undermining the goal of the exemption….”

Words Algorithm Collection – finding closely related open access books using text mining techniques | LIBER Quarterly: The Journal of the Association of European Research Libraries

Open access platforms and retail websites are both trying to present the most relevant offerings to their patrons. Retail websites deploy recommender systems that collect data about their customers. These systems are successful but intrude on privacy. As an alternative, this paper presents an algorithm that uses text mining techniques to find the most important themes of an open access book or chapter. By locating other publications that share one or more of these themes, it is possible to recommend closely related books or chapters.

The algorithm splits the full text into trigrams. It removes all trigrams containing words that are commonly used in everyday language and in (open access) book publishing. The most frequently occurring remaining trigrams are distinctive to the publication and indicate the themes of the book. The next step is finding publications that share one or more of the trigrams. The strength of the connection can be measured by counting – and ranking – the number of shared trigrams. The algorithm was used to find connections between 10,997 titles: 67% in English, 29% in German and 6% in Dutch or a combination of languages. The algorithm is able to find connected books across languages.

It is possible to use the algorithm for several use cases, not just recommender systems. Creating benchmarks for publishers or creating a collection of connected titles for libraries are other possibilities. Apart from the OAPEN Library, the algorithm can be applied to other collections of open access books or even open access journal articles. Combining the results across multiple collections will enhance its effectiveness.
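The trigram procedure described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the stop-word list, the cutoff `k`, the whitespace tokenization, and the sample titles and texts are all hypothetical.

```python
from collections import Counter

# Hypothetical stop list; the paper's actual list of everyday and
# book-publishing words is not reproduced in the abstract.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "chapter", "press"}


def top_trigrams(text, k=2):
    """Return the k most frequent trigrams that contain no stop word."""
    words = text.lower().split()
    trigrams = zip(words, words[1:], words[2:])
    kept = [t for t in trigrams if not STOP_WORDS.intersection(t)]
    return {t for t, _ in Counter(kept).most_common(k)}


def rank_related(target, themes):
    """Rank other titles by how many theme trigrams they share with target."""
    shared = {
        title: len(themes[target] & grams)
        for title, grams in themes.items()
        if title != target
    }
    # Keep only titles with at least one shared trigram, strongest first.
    return sorted(
        ((t, n) for t, n in shared.items() if n > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Invented sample texts, chosen so that the ranking is easy to follow.
books = {
    "Book A": "open access publishing models open access publishing models for university presses",
    "Book B": "open access publishing models in european libraries and open access publishing models in research",
    "Book C": "medieval trade routes and medieval trade routes of northern europe",
    "Book D": "access publishing models and access publishing models",
}

themes = {title: top_trigrams(text) for title, text in books.items()}
related = rank_related("Book A", themes)
print(related)
```

In this toy collection, Book B shares two theme trigrams with Book A and Book D shares one, so they are recommended in that order, while Book C shares none and is left out. No reader data is involved at any point, which is the privacy argument the abstract makes.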

Transforming Library Services for Computational Research with Text Data: Environmental Scan, Stakeholder Perspectives, and Recommendations for Libraries – ACRL Insider

“ACRL announces the publication of a new white paper, Transforming Library Services for Computational Research with Text Data: Environmental Scan, Stakeholder Perspectives, and Recommendations for Libraries, from authors Megan Senseney, Eleanor Dickson Koehl, Beth Sandore Namachchivaya, and Bertram Ludäscher.

This report from the IMLS National Forum on Data Mining Research Using In-Copyright and Limited-Access Text Datasets seeks to build a shared understanding of the issues and challenges associated with the legal and socio-technical logistics of conducting computational research with text data. It captures preparatory activities leading up to the forum and its outcomes to (1) provide academic librarians with a set of recommendations for action and (2) establish a research agenda for the LIS community….”