A modern saying goes: if you are not paying for the product, you are the product. How can libraries help prevent tracking in science, thereby protecting researchers' data and, in an idealistic sense, scientific freedom? In an interview, Julia Reda discusses the starting points and the pitfalls.
This is a very partial and personal list of sources and documents related to the topic of ‘Surveillance Capitalism in our Libraries’. I am sure there is much more to be added. It is shared in case anyone else finds it useful. A commentable version of the document is available here.
“Singapore’s previous copyright law provided a broad “fair dealing” right that allowed a range of general uses. The title of this exception has now been changed from “fair dealing” to “fair use.” That might seem a trivial change, but it’s significant. Fair dealing rights are, in general, more limited than fair use ones. The adoption of the latter term is further confirmation that Singapore’s new Copyright Law is moving in the right direction, and aims to provide greater freedoms for the general public, rather than fewer, as has so often been the case in this sector.”
“Thank you for taking this survey about research use of text and data mining (TDM). We understand that by taking it, you consent to giving us this information. This research is led by Prof. Patricia Aufderheide in the School of Communication at American University and Brandon Butler at University of Virginia library. We are looking at the challenges faced by researchers doing TDM, especially in regard to copyright issues. This study was given a waiver by the Institutional Review Board of American University. You will be anonymous to us, unless you choose to give us your name, in which case your identity and all information associated with it will be kept confidential. The data are stored securely, via the Qualtrics platform and American University. We may reuse this information in later studies, but always keeping either anonymity or confidentiality, according to the option you have chosen.”
Abstract: JATSdecoder is a general toolbox that facilitates text extraction and analytical tasks on NISO-JATS coded XML documents. Its function JATSdecoder() outputs metadata, the abstract, the sectioned text and the reference list as easily selectable elements. One of the biggest repositories of open access full texts covering biology and the medical and health sciences is PubMed Central (PMC), with more than 3.2 million files. This report provides an overview of the PMC document collection processed with JATSdecoder(). The development of extracted tags is displayed for the full corpus over time and in greater detail for some meta tags. Possibilities and limitations for text miners working with scientific literature are outlined. The NISO-JATS tags are used quite consistently nowadays and allow a reliable extraction of metadata and text elements. International collaborations are more present than ever. There are obvious errors in the date stamps of some documents. Only about half of all articles from 2020 contain at least one author listed with an author identification code. Since many authors share the same name, the identification of person-related content is problematic, especially for authors with Asian names. JATSdecoder() reliably extracts key metadata and text elements from NISO-JATS coded XML files. When combined with the rich, publicly available content in PMC's database, new monitoring and text mining approaches can be carried out easily. Any selection of article subsets should be performed carefully, with inclusion and exclusion criteria on several NISO-JATS tags, as both the subject and keyword tags are used quite inconsistently.
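The kind of tag-based extraction the abstract describes can be sketched in a few lines. This is a minimal illustration using Python's standard library, not the JATSdecoder package itself (which is written in R); the tag names are standard NISO-JATS, but the sample document and the `extract_meta` helper are assumptions for demonstration.

```python
# Minimal sketch: pulling a few common NISO-JATS tags (title, abstract,
# authors) out of a JATS-coded article with Python's stdlib XML parser.
import xml.etree.ElementTree as ET

JATS_SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group><article-title>Example title</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
          <contrib-id contrib-id-type="orcid">0000-0000-0000-0000</contrib-id>
        </contrib>
      </contrib-group>
      <abstract><p>Example abstract.</p></abstract>
    </article-meta>
  </front>
</article>"""

def extract_meta(xml_text):
    """Return title, abstract text, and author names from a JATS document."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title")
    abstract = " ".join(p.text or "" for p in root.findall(".//abstract//p"))
    authors = [
        (c.findtext(".//surname"), c.findtext(".//given-names"))
        for c in root.findall(".//contrib[@contrib-type='author']")
    ]
    return {"title": title, "abstract": abstract, "authors": authors}

meta = extract_meta(JATS_SAMPLE)
```

As the abstract notes, real-world use needs more care than this: tags like subject and keyword are applied inconsistently across the PMC corpus, and author disambiguation needs more than a name field.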
“Millions of research papers get published every year, but the majority lie behind paywalls. A new online catalogue called the General Index aims to make it easier to access and search through the world’s research papers. Unlike other databases which include the full text of research papers, the General Index only allows users to access snippets of content….”
The output of academic literature has increased significantly due to digital technology, presenting researchers in every discipline, including materials science, with a challenge: it is impossible to manually read and extract knowledge from millions of published articles. The purpose of this study is to address this challenge by exploring knowledge extraction in materials science, as applied to digital scholarship. An overriding goal is to help inform readers about the status of knowledge extraction in materials science.
The authors conducted a two-part analysis: a comparison of knowledge extraction methods applied to materials science scholarship across a sample of 22 articles, followed by a comparison of HIVE-4-MAT, an ontology-based knowledge extraction application, and MatScholar, a named entity recognition (NER) application. This paper covers contextual background and a review of three tiers of knowledge extraction (ontology-based, NER and relation extraction), followed by the research goals and approach.
The results indicate three key needs for researchers to consider in advancing knowledge extraction: the need for materials-science-focused corpora; the need for researchers to define the scope of the research being pursued; and the need to understand the tradeoffs among different knowledge extraction methods. This paper also points to future materials science research potential with relation extraction and increased availability of ontologies.
To the best of the authors’ knowledge, there are very few studies examining knowledge extraction in materials science. This work makes an important contribution to this underexplored research area.
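The simplest of the three extraction tiers the paper reviews, ontology-based extraction, amounts to matching text against a controlled vocabulary. A toy sketch, assuming a tiny hypothetical term list (HIVE-4-MAT itself works against full materials ontologies, and real systems handle synonyms and morphology):

```python
# Illustrative sketch of ontology-based knowledge extraction: find every
# controlled-vocabulary term that occurs in a passage of text.
import re

# Hypothetical mini-ontology: term -> class. A stand-in for real
# materials science ontologies, not taken from HIVE-4-MAT.
ONTOLOGY_TERMS = {
    "band gap": "PhysicalProperty",
    "perovskite": "MaterialClass",
    "annealing": "Process",
}

def extract_terms(text, ontology):
    """Return sorted (term, class) pairs for each ontology term in the text."""
    found = []
    for term, cls in ontology.items():
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            found.append((term, cls))
    return sorted(found)

hits = extract_terms(
    "The perovskite film showed a reduced band gap after annealing.",
    ONTOLOGY_TERMS,
)
```

NER systems such as MatScholar go a step further, recognizing entity mentions that are not in any predefined list, and relation extraction adds the links between them.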
“In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers.
The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California, that he founded.
Malamud says that because his index doesn’t contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers’ copyright restrictions on the re-use of paywalled articles. However, one legal expert says that publishers might question the legality of how Malamud created the index in the first place….”
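The core idea, an inverted index of word snippets up to five words long pointing back to the articles that contain them, can be sketched simply. This is a toy illustration of the concept only; the General Index's actual format, scale, and tooling are not reproduced here, and the sample article IDs are invented.

```python
# Toy inverted index: map every n-gram of up to five words to the set of
# articles it appears in, without storing any article's full text.
from collections import defaultdict

def ngrams(words, max_n=5):
    """Yield every snippet of 1..max_n consecutive words."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def build_index(articles):
    """articles: dict of article_id -> text. Returns ngram -> set of ids."""
    index = defaultdict(set)
    for art_id, text in articles.items():
        for gram in ngrams(text.lower().split()):
            index[gram].add(art_id)
    return index

index = build_index({
    "doi:10.1/a": "text and data mining of research papers",
    "doi:10.1/b": "data mining techniques for materials",
})
```

Searching the index then means looking up a phrase and getting back article identifiers, which is exactly what lets researchers locate relevant papers without the index redistributing their full text.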
“Glen Worthey, from the HathiTrust Research Center, will speak on the center’s latest initiatives in text and data mining.”
Abstract: The COVID-19 pandemic catalyzed the rapid dissemination of papers and preprints investigating the disease and its associated virus, SARS-CoV-2. The multifaceted nature of COVID-19 demands a multidisciplinary approach, but the urgency of the crisis combined with the need for social distancing measures present unique challenges to collaborative science. We applied a massive online open publishing approach to this problem using Manubot. Through GitHub, collaborators summarized and critiqued COVID-19 literature, creating a review manuscript. Manubot automatically compiled citation information for referenced preprints, journal publications, websites, and clinical trials. Continuous integration workflows retrieved up-to-date data from online sources nightly, regenerating some of the manuscript’s figures and statistics. Manubot rendered the manuscript into PDF, HTML, LaTeX, and DOCX outputs, immediately updating the version available online upon the integration of new content. Through this effort, we organized over 50 scientists from a range of backgrounds who evaluated over 1,500 sources and developed seven literature reviews. While many efforts from the computational community have focused on mining COVID-19 literature, our project illustrates the power of open publishing to organize both technical and non-technical scientists to aggregate and disseminate information in response to an evolving crisis.
“To celebrate ten years offering a large proportion of the world’s academic papers for free — against all the odds, and in the face of repeated legal action — Sci-Hub has launched a funding drive:
Sci-Hub is run fully on donations. Instead of charging for access to information, it creates a common pool of knowledge free for anyone to access.
The donations page says that “In the next few years Sci-Hub is going to dramatically improve”, and lists a number of planned developments. These include a better search engine, a mobile app, and the use of neural networks to extract ideas from papers and make inferences and new hypotheses. Perhaps the most interesting idea is for the software behind Sci-Hub to become open source. The move would address in part a problem discussed by Techdirt back in May: the fact that Sci-Hub is a centralized service, with a single point of failure. Open sourcing the code — and sharing the papers database — would allow multiple mirrors to be set up around the world by different groups, increasing its resilience….”
Abstract: There currently exist hundreds of millions of scientific publications, with more being created at an ever-increasing rate. This is leading to information overload: the scale and complexity of this body of knowledge is increasing well beyond the capacity of any individual to make sense of it all, overwhelming traditional, manual methods of curation and synthesis. At the same time, the availability of this literature and surrounding metadata in structured, digital form, along with the proliferation of computing power and techniques to take advantage of large-scale and complex data, represents an opportunity to develop new tools and techniques to help people make connections, synthesize, and pose new hypotheses. This dissertation consists of several contributions of data, methods, and tools aimed at addressing information overload in science. My central contribution to this space is Autoreview, a framework for building and evaluating systems to automatically select relevant publications for literature reviews, starting from small sets of seed papers. These automated methods have the potential to help researchers save time and effort when keeping up with relevant literature, as well as surfacing papers that more manual methods may miss. I show that this approach can work to recommend relevant literature, and can also be used to systematically compare different features used in the recommendations. I also present the design, implementation, and evaluation of several visualization tools. One of these is an animated network visualization showing the influence of a scholar over time. Another is SciSight, an interactive system for recommending new authors and research by finding similarities along different dimensions. Additionally, I discuss the current state of available scholarly data sets; my work curating, linking, and building upon these data sets; and methods I developed to scale graph clustering techniques to very large networks.
“Authors Alliance, joined by the Library Copyright Alliance and the American Association of University Professors, is petitioning the Copyright Office for a new three-year exemption to the Digital Millennium Copyright Act (“DMCA”) as part of the Copyright Office’s eighth triennial rulemaking process. If granted, our proposed exemption would allow researchers to bypass technical protection measures (“TPMs”) in order to conduct text and data mining (“TDM”) research on literary works that are distributed electronically and motion pictures. Recently, we met with representatives from the U.S. Copyright Office to discuss the proposed exemption, focusing on the circumstances in which access to corpus content is necessary for verifying algorithmic findings and ways to address security concerns without undermining the goal of the exemption….”
Open access platforms and retail websites are both trying to present the most relevant offerings to their patrons. Retail websites deploy recommender systems that collect data about their customers. These systems are successful but intrude on privacy. As an alternative, this paper presents an algorithm that uses text mining techniques to find the most important themes of an open access book or chapter. By locating other publications that share one or more of these themes, it is possible to recommend closely related books or chapters.
The algorithm splits the full text into trigrams. It removes all trigrams containing words that are commonly used in everyday language and in (open access) book publishing. The most frequently occurring remaining trigrams are distinctive to the publication and indicate the themes of the book. The next step is finding publications that share one or more of these trigrams. The strength of the connection can be measured by counting – and ranking – the number of shared trigrams. The algorithm was used to find connections between 10,997 titles: 67% in English, 29% in German and 6% in Dutch or a combination of languages. The algorithm is able to find connected books across languages.
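The steps above can be sketched as follows. This is a minimal reading of the described approach, not the paper's implementation: the stopword list is a tiny stand-in for the authors' full list of everyday and publishing terms, and the function names are illustrative.

```python
# Sketch of the trigram-based recommender described above: extract each
# publication's distinctive trigrams, then rank other publications by how
# many trigrams they share with the target.
from collections import Counter

# Hypothetical mini stopword list (the paper uses a much larger one).
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}

def distinctive_trigrams(text):
    """All trigrams that contain no common everyday word."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + 3])
        for i in range(len(words) - 2)
        if not STOPWORDS & set(words[i:i + 3])
    }

def recommend(target, collection):
    """Rank titles in `collection` by trigrams shared with `target`."""
    base = distinctive_trigrams(target)
    scores = Counter(
        {title: len(base & distinctive_trigrams(text))
         for title, text in collection.items()}
    )
    return [(t, s) for t, s in scores.most_common() if s > 0]

ranked = recommend(
    "open access book publishing metadata standards quality",
    {
        "B1": "metadata standards quality in open access book publishing",
        "B2": "cooking with seasonal vegetables",
    },
)
```

Because matching is on literal shared trigrams rather than on language-specific models, the same counting works across a mixed-language collection, which is consistent with the cross-language connections reported above.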
It is possible to use the algorithm for several use cases, not just recommender systems. Creating benchmarks for publishers or creating a collection of connected titles for libraries are other possibilities. Apart from the OAPEN Library, the algorithm can be applied to other collections of open access books or even open access journal articles. Combining the results across multiple collections will enhance its effectiveness.
“ACRL announces the publication of a new white paper, Transforming Library Services for Computational Research with Text Data: Environmental Scan, Stakeholder Perspectives, and Recommendations for Libraries, from authors Megan Senseney, Eleanor Dickson Koehl, Beth Sandore Namachchivaya, and Bertram Ludäscher.
This report from the IMLS National Forum on Data Mining Research Using In-Copyright and Limited-Access Text Datasets seeks to build a shared understanding of the issues and challenges associated with the legal and socio-technical logistics of conducting computational research with text data. It captures preparatory activities leading up to the forum and its outcomes to (1) provide academic librarians with a set of recommendations for action and (2) establish a research agenda for the LIS community….”