GitHub is Sued, and We May Learn Something About Creative Commons Licensing – The Scholarly Kitchen

“I have had people tell me with doctrinal certainty that Creative Commons licenses allow text and data mining, and insofar as license terms are observed, I agree. The making of copies to perform text and data mining, machine learning, and AI training (collectively “TDM”) without additional licensing is authorized for commercial and non-commercial purposes under CC BY, and for non-commercial purposes under CC BY-NC. (Full disclosure: CCC offers RightFind XML, a service that supports licensed commercial access to full-text articles for TDM with value-added capabilities.)

I have long wondered, however, about the interplay between the attribution requirement (i.e., the “BY” in CC BY) and TDM. After all, the bargain with those licenses is that the author allows reuse, typically at no cost, but requires attribution. Attribution under the CC licenses may be the author’s primary benefit and motivation, as few authors would agree to offer the licenses without credit.

In the TDM context, this raises interesting questions:

Does the attribution requirement mean that the author’s information may not be removed as a data element from the content, even if inclusion might frustrate the TDM exercise or introduce noise into the system?
Does the attribution need to be included in the data set at every stage?
Does the result of the mining need to include attribution, even if hundreds of thousands of CC BY works were mined and the output does not include content from individual works?
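The questions above can be made concrete with a small sketch. The snippet below shows one (purely hypothetical) way a TDM pipeline might strip attribution metadata out of the mined text, so the mining step sees clean data, while a provenance index preserves the information needed to credit authors later. All field names and the data are invented for illustration.

```python
# Hypothetical sketch: separating attribution metadata from mined text while
# keeping a provenance index, so credit can be reconstructed after mining.
# Field names ("author", "license", etc.) are illustrative only.

def split_attribution(works):
    """Return (texts, provenance): bare texts for mining, plus a parallel
    index mapping each text back to its author and license."""
    texts, provenance = [], []
    for work in works:
        texts.append(work["text"])
        provenance.append({"author": work["author"], "license": work["license"]})
    return texts, provenance

corpus = [
    {"text": "Gene X regulates pathway Y.", "author": "A. Author", "license": "CC BY"},
    {"text": "Pathway Y is inhibited by Z.", "author": "B. Author", "license": "CC BY-NC"},
]

texts, provenance = split_attribution(corpus)
# The mining step consumes only `texts`; `provenance` answers
# "who must be credited?" at whatever stage attribution is required.
```

Whether carrying attribution out-of-band like this actually satisfies the "BY" condition is, of course, exactly the open legal question the post raises.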

While these questions may have once seemed theoretical, that is no longer the case. An analogous situation involving open software licenses (GNU and the like) is now being litigated….”

COMMUNIA Association – The Italian Implementation of the New EU Text and Data Mining Exceptions

“This blog post analyses the implementation of the copyright exceptions for Text and Data Mining (TDM), which Italian law defines as any automated technique designed to analyse large amounts of text, sound, images, data or metadata in digital format to generate information, including patterns, trends, and correlations (Art. 70 ter (2) LdA). As we will see in more detail below, the Italian lawmaker decided to introduce some novelties when implementing Art. 3, while following the text of the Directive more closely when implementing Art. 4….
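The statutory definition quoted above ("automated technique ... to generate information, including patterns, trends, and correlations") can be illustrated with a minimal example. The sketch below counts word co-occurrences across documents, one of the simplest "patterns" a TDM process can extract; the corpus and approach are invented for illustration.

```python
# A minimal illustration of the statutory definition of TDM: an automated
# technique that analyses text to generate information such as patterns
# and correlations. Here the "pattern" is word co-occurrence in documents.
from collections import Counter
from itertools import combinations

def cooccurrences(documents):
    """Count how often each unordered word pair appears in the same document."""
    pairs = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs

docs = ["copyright law reform", "copyright exception reform"]
top = cooccurrences(docs)
# ("copyright", "reform") co-occurs in both documents, so its count is 2.
```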

Notably, the new Italian exception also allows the communication to the public of the research outcome when such outcomes are expressed through new original works. In other words, the communication of protected materials resulting from computational research processes is permitted, provided that such results are included in an original publication, data collection or other original work.

The right of communication to the public was not contemplated in the original government draft; it was introduced in the last version of the article to accommodate the comments of the Joint Committees of the Senate and the Joint Committees of the Chamber, both highlighting the need to specify that the right of communication to the public concerns only the results of research, where expressed in new original works.


The beneficiaries of the TDM exception for scientific purposes are research organisations and cultural heritage institutions. Research organisations essentially reflect the definition offered by the directive. These are universities, including their libraries, research institutes or any other entity whose primary objective is to conduct scientific research activities or to conduct educational activities that include scientific research, which alternatively: …

The Italian lawmaker did not expressly contemplate any specific and fast procedure for cases where technical protection measures prevent a beneficiary from carrying out the permitted acts under both TDM exceptions. However, the law now grants beneficiaries the right to extract a copy of material protected by technological measures in certain cases. Under Art. 70-sexies, LdA, beneficiaries of the TDM exception for scientific purposes (as well as beneficiaries of the exception for digital and cross-border teaching activities) shall have the right to extract a copy of the protected material, when technological measures are applied based on agreements or on administrative procedures or judicial decisions. In order to benefit from this right, the person shall have lawful possession of copies of the protected material (or have had legal access to them), shall respect the conditions and the purposes provided for in the exception, and such extraction shall not conflict with the normal exploitation of the work or the other materials or cause an unjustified prejudice to the rights holders….”

Antibiotic discovery in the artificial intelligence era – Lluka – Annals of the New York Academy of Sciences – Wiley Online Library

Abstract:  As the global burden of antibiotic resistance continues to grow, creative approaches to antibiotic discovery are needed to accelerate the development of novel medicines. A rapidly progressing computational revolution—artificial intelligence—offers an optimistic path forward due to its ability to alleviate bottlenecks in the antibiotic discovery pipeline. In this review, we discuss how advancements in artificial intelligence are reinvigorating the adoption of past antibiotic discovery models—namely natural product exploration and small molecule screening. We then explore the application of contemporary machine learning approaches to emerging areas of antibiotic discovery, including antibacterial systems biology, drug combination development, antimicrobial peptide discovery, and mechanism of action prediction. Lastly, we propose a call to action for open access of high-quality screening datasets and interdisciplinary collaboration to accelerate the rate at which machine learning models can be trained and new antibiotic drugs can be developed.


Legal reform to enhance global text and data mining research | Science

“The resistance to TDM exceptions in copyright comes primarily from the multinational publishing industry, which is a strong voice in copyright debates and tends to oppose expansions to copyright exceptions. But the success at adopting exceptions for TDM research in the US and EU already—where publishing lobbies are strongest—shows that policy reform in this area is possible. Publishers need not be unduly disadvantaged by TDM exceptions because publishers can still license access to their databases, which researchers must obtain in the first instance, and can offer products that make TDM and other forms of research more efficient and effective.”

[2208.06178] Mining Legal Arguments in Court Decisions

Abstract:  Identifying, classifying, and analyzing arguments in legal discourse has been a prominent area of research since the inception of the argument mining field. However, there has been a major discrepancy between the way natural language processing (NLP) researchers model and annotate arguments in court decisions and the way legal experts understand and analyze legal argumentation. While computational approaches typically simplify arguments into generic premises and claims, arguments in legal research usually exhibit a rich typology that is important for gaining insights into the particular case and applications of law in general. We address this problem and make several substantial contributions to move the field forward. First, we design a new annotation scheme for legal arguments in proceedings of the European Court of Human Rights (ECHR) that is deeply rooted in the theory and practice of legal argumentation research. Second, we compile and annotate a large corpus of 373 court decisions (2.3M tokens and 15k annotated argument spans). Finally, we train an argument mining model that outperforms state-of-the-art models in the legal NLP domain and provide a thorough expert-based evaluation. All datasets and source codes are available under open licenses at this https URL.
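The span-level annotation the abstract describes can be pictured with a small data structure: argument spans located by character offsets and carrying a fine-grained type label rather than a generic premise/claim tag. The label names and the example text below are invented for illustration, not taken from the paper's ECHR scheme.

```python
# Hypothetical sketch of span-level argument annotation with a rich typology.
# Offsets index into the decision text; labels here are invented examples.
from dataclasses import dataclass

@dataclass
class ArgumentSpan:
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends (exclusive)
    label: str   # fine-grained argument type, e.g. "precedent"

decision = "The Court recalls its judgment in X v. Y. The interference was not proportionate."
spans = [
    ArgumentSpan(0, 41, "precedent"),
    ArgumentSpan(42, 81, "proportionality"),
]

# Each annotated span can be recovered from the text by slicing:
# decision[spans[0].start:spans[0].end]
```

An argument mining model then learns to predict both the span boundaries and these typed labels, rather than a flat premise/claim distinction.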


Data Protections and Licenses Affecting Text and Data Mining for Machine Learning


Abstract:  Machines don’t read works or data. Machines need to first abstract and then format data for learning and then apply tagging and other metadata to model the data into something the machine can “understand.” Legal protections aren’t purpose-built to allow machines to abstract data from a work, process it, model it, and then re-present it. Most licenses aren’t purpose-built for that either. This document walks the reader through the known protections and licenses, examining whether they cover machine learning practices, and then proposes a license structure for that purpose.

Editorial: Mining Scientific Papers, Volume II: Knowledge Discovery and Data Exploitation

“The processing of scientific texts, which includes the analysis of citation contexts but also the task of information extraction from scientific papers for various applications, has been the object of intensive research during the last decade. This has become possible thanks to two factors. The first one is the growing availability of scientific papers in full text and in machine-readable formats together with the rise of the Open Access publishing of papers on online platforms such as ArXiv, Semantic Scholar, CiteSeer, or PLOS. The second factor is the relative maturity of open source tools and libraries for natural language processing that facilitate text processing (e.g., Spacy, NLTK, Mallet, OpenNLP, CoreNLP, Gate, CiteSpace). As a result, a large number of experiments have been conducted by processing the full text of papers for citation context analysis, but also summarization and recommendation of scientific papers….”
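One of the core tasks the editorial mentions, citation context analysis, can be sketched in a few lines: find each citation marker in a paper's full text and keep the sentence around it as the citation's context. The bracketed-number citation style and the regex-based sentence splitting below are simplifying assumptions; real pipelines (built on tools like those listed above) handle many citation formats.

```python
# Hedged sketch of citation context extraction: pair each bracketed numeric
# citation marker ("[12]") with the sentence that contains it. The citation
# style and naive sentence splitting are assumptions for illustration.
import re

def citation_contexts(text):
    """Return (citation_number, sentence) pairs for bracketed citations."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    contexts = []
    for sentence in sentences:
        for match in re.finditer(r"\[(\d+)\]", sentence):
            contexts.append((int(match.group(1)), sentence))
    return contexts

sample = "Prior work established the method [3]. We extend it here. Results agree with [3] and [7]."
contexts = citation_contexts(sample)
# Two distinct sentences cite reference [3]; one also cites [7].
```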

NISO vision interview with CORE’s Petr Knoth on the role of text mining in scholarly communication – Research

“This Vision Interview with Petr Knoth, Senior Research Fellow in Text and Data Mining at the Open University and Head of CORE, served as the opening segment of the NISO Hot Topic virtual conference, Text and Data Mining, held on May 25, 2022. Todd Carpenter spoke at length with Knoth about the many ways in which text and data mining impacts the present as well as the future. They discussed just how innovative this technology can be for the needs of researchers in the information community….”

The LOTUS initiative for open knowledge management in natural products research | eLife

Abstract:  Contemporary bioinformatic and chemoinformatic capabilities hold promise to reshape knowledge management, analysis and interpretation of data in natural products research. Currently, reliance on a disparate set of non-standardized, insular, and specialized databases presents a series of challenges for data access, both within the discipline and for integration and interoperability between related fields. The fundamental elements of exchange are referenced structure-organism pairs that establish relationships between distinct molecular structures and the living organisms from which they were identified. Consolidating and sharing such information via an open platform has strong transformative potential for natural products research and beyond. This is the ultimate goal of the newly established LOTUS initiative, which has now completed the first steps toward the harmonization, curation, validation and open dissemination of 750,000+ referenced structure-organism pairs. LOTUS data is hosted on Wikidata and regularly mirrored. Data sharing within the Wikidata framework broadens data access and interoperability, opening new possibilities for community curation and evolving publication models. Furthermore, embedding LOTUS data into the vast Wikidata knowledge graph will facilitate new biological and chemical insights. The LOTUS initiative represents an important advancement in the design and deployment of a comprehensive and collaborative natural products knowledge base.


Past and future uses of text mining in ecology and evolution | Proceedings of the Royal Society B: Biological Sciences

Abstract:  Ecology and evolutionary biology, like other scientific fields, are experiencing an exponential growth of academic manuscripts. As domain knowledge accumulates, scientists will need new computational approaches for identifying relevant literature to read and include in formal literature reviews and meta-analyses. Importantly, these approaches can also facilitate automated, large-scale data synthesis tasks and build structured databases from the information in the texts of primary journal articles, books, grey literature, and websites. The increasing availability of digital text, computational resources, and machine-learning based language models have led to a revolution in text analysis and natural language processing (NLP) in recent years. NLP has been widely adopted across the biomedical sciences but is rarely used in ecology and evolutionary biology. Applying computational tools from text mining and NLP will increase the efficiency of data synthesis, improve the reproducibility of literature reviews, formalize analyses of research biases and knowledge gaps, and promote data-driven discovery of patterns across ecology and evolutionary biology. Here we present recent use cases from ecology and evolution, and discuss future applications, limitations and ethical issues.


Open Research Knowledge Graph

“The Open Research Knowledge Graph (ORKG) aims to describe research papers in a structured manner. With the ORKG, papers are easier to find and compare….

Research is a fundamental pillar of societal progress. Yet, scientific communities face great difficulties in sharing their findings. With approximately 2.5 million newly published scientific articles per year, it is impossible to keep track of all relevant knowledge. Even in small fields, researchers often find themselves drowning in a publication flood, contributing to major scientific crises such as the reproducibility crisis, the shortcomings of peer review, and ultimately the loss of knowledge.

The underlying problem is that we never updated our methods of scholarly communication to exploit the possibilities of digitalization. This is where the Open Research Knowledge Graph comes into play!

The ORKG makes scientific knowledge human- and machine-actionable and thus enables completely new ways of machine assistance. This will help researchers find relevant contributions to their field and create state-of-the-art comparisons and reviews. With the ORKG, scientists can explore knowledge in entirely new ways and share results even across different disciplines….”
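The idea of machine-actionable, comparable paper descriptions can be illustrated with a toy example (this is not the ORKG data model or API, and the property names are invented): once papers are described as structured property/value records, a "comparison" is just an alignment of shared properties across papers.

```python
# Illustrative sketch only (not the ORKG API): papers described as structured
# property/value records, with a comparison built by aligning shared
# properties across papers. All property names and values are invented.

papers = {
    "Paper A": {"method": "CNN", "dataset": "ImageNet", "accuracy": 0.91},
    "Paper B": {"method": "Transformer", "dataset": "ImageNet", "accuracy": 0.94},
}

def compare(papers, properties):
    """Build a property-by-paper comparison table from structured records."""
    return {prop: {name: desc.get(prop) for name, desc in papers.items()}
            for prop in properties}

table = compare(papers, ["method", "accuracy"])
# table["method"] maps each paper to its method, ready to render as one
# row of a state-of-the-art comparison.
```

The point of the sketch is the contrast with prose: a machine can align and query these records directly, which a PDF of the same findings does not allow.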

Aligning the Research Library to Organizational Strategy – Ithaka S+R

“Open access has matured significantly in recent years. The UK and EU countries have largely committed to a “gold” version of open access, driven by transformative agreements with the major incumbent publishing houses.[14] The US policy environment has been far more mixed, with a great deal of “green” open access incentivized by major scientific funders, although some individual universities pursued transformative agreements. Both Canadian and US libraries have benefitted from the expansion of free and open access in strengthening their position at the negotiating table with major publishers.[15]

Progress on open access has radically expanded public access to the research literature. It has also brought with it a number of second-order effects. Some of them are connected to the serious problems in research integrity and the growing crisis of trust in science.[16] Others can be seen in the impacts on the scholarly publishing marketplace and the platforms that support discovery and access.[17]

While open access has made scientific materials more widely available, it has not directly addressed the challenges in translating scholarship for public consumption. Looking ahead, it is likely that scholarly communication will experience further changes as computers increasingly supplant human readership. Over time, scientific outputs may look less and less like the traditional journal article as standardized data, methods, protocols, and other scientific artifacts become vital for computational consumption….”

The need for open access and natural language processing | PNAS

“In PNAS, Chu and Evans (1) argue that the rapidly rising number of publications in any given field actually hinders progress. The rationale is that, if too many papers are published, the really novel ideas have trouble finding traction, and more and more people tend to “go along with the majority.” Review papers are cited more and more instead of original research. We agree with Chu and Evans: Scientists simply cannot keep up. This is why we argue that we must bring the powers of artificial intelligence/machine learning (AI/ML) and open access to the forefront. AI/ML is a powerful tool and can be used to ingest and analyze large quantities of data in a short period of time. For example, some of us (2) have used AI/ML tools to ingest 500,000+ abstracts from online archives (relatively easy to do today) and categorize them for strategic planning purposes. This letter offers a short follow-on to Chu and Evans (hereafter CE) to point out a way to mitigate the problems they delineate….

In conclusion, we agree with CE (1) on the problems caused by the rapid rise in scientific publications, outpacing any individual’s ability to keep up. We propose that open access, combined with NLP, can help effectively organize the literature, and we encourage publishers to make papers open access, archives to make papers easily findable, and researchers to employ their own NLP as an important tool in their arsenal.”
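The large-scale categorization the letter describes (ingesting hundreds of thousands of abstracts and sorting them into categories) can be reduced to a toy sketch. Here the "model" is simple keyword overlap against invented category lexicons; the AI/ML tools the authors used would instead be trained classifiers, but the input/output shape of the task is the same.

```python
# Toy sketch of abstract categorization. The category lexicons and the
# sample abstract are invented; real pipelines use trained ML classifiers
# rather than keyword overlap, but the task shape is the same.

CATEGORIES = {
    "materials": {"alloy", "polymer", "crystal"},
    "biology": {"gene", "protein", "cell"},
}

def categorize(abstract):
    """Pick the category whose lexicon overlaps the abstract's words most."""
    words = {w.strip(".,;:") for w in abstract.lower().split()}
    scores = {cat: len(words & lexicon) for cat, lexicon in CATEGORIES.items()}
    return max(scores, key=scores.get)

label = categorize("The gene encodes a membrane protein expressed in each cell.")
# The biology lexicon matches three words, so the abstract lands in "biology".
```

Open access matters here precisely because such a pipeline is only as useful as the fraction of the literature it is allowed to ingest.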