Defining a Machine-Readable Friendly License for Cloud Contribution Environments

Abstract:  There are two types of contribution environments that have been widely written about in the last decade: closed environments controlled by the promulgator, and open-access environments seemingly controlled by everyone and no one at the same time. In closed environments, the promulgator has sole discretion to control both the intellectual property at hand and the integrity of that content. In open-access environments, the intellectual property (IP) is controlled to varying degrees by the Creative Commons license associated with the content, and it is solely up to the promulgator to control the integrity of that content. In addition, open-access environments do not offer native protection to data in a way that allows the data to be accessed and utilized for text mining (TM), natural language processing (NLP), or machine learning (ML). It is our intent in this paper to lay out a third option: that of a federated cloud environment wherein all members of the federation agree upon terms for copyright protection, the integrity of the data at hand, and the use of that data for TM, NLP, or ML.
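The abstract does not specify how the federation's agreed terms would be encoded; as a purely hypothetical sketch, a machine-readable license record might expose the agreed TDM permissions as structured fields that tooling can check before ingesting content. All field and function names below are illustrative assumptions, not from the paper:

```python
# Hypothetical machine-readable license record for a federated cloud
# contribution environment. Field names are illustrative only.
license_record = {
    "license_id": "federation-tdm-1.0",
    "attribution_required": True,
    "uses_permitted": {"text_mining", "nlp", "ml_training"},
    "commercial_use": False,
}

def use_allowed(record, use, commercial=False):
    """Return True if the requested use is permitted by the record."""
    if use not in record["uses_permitted"]:
        return False
    if commercial and not record["commercial_use"]:
        return False
    return True

print(use_allowed(license_record, "text_mining"))                    # True
print(use_allowed(license_record, "ml_training", commercial=True))   # False
```

A crawler or mining pipeline could evaluate such a record automatically before downloading content, which is the kind of "native protection" the abstract says open-access environments currently lack.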

Changing word meanings in biomedical literature reveal pandemics and new technologies | BioData Mining

Abstract:  While we often think of words as having a fixed meaning that we use to describe a changing world, words are also dynamic and changing. Scientific research can also be remarkably fast-moving, with new concepts or approaches rapidly gaining mind share. We examined scientific writing, both preprint and pre-publication peer-reviewed text, to identify terms that have changed and examine their use. One particular challenge that we faced was that the shift from closed to open access publishing meant that the size of available corpora changed by over an order of magnitude in the last two decades. We developed an approach to evaluate semantic shift by accounting for both intra- and inter-year variability using multiple integrated models. This analysis revealed thousands of change points in both corpora, including for terms such as ‘cas9’, ‘pandemic’, and ‘sars’. We found that the consistent change points between pre-publication peer-reviewed and preprinted text are largely related to the COVID-19 pandemic. We also created a web app for exploration that allows users to investigate individual terms. To our knowledge, our research is the first to examine semantic shift in biomedical preprints and pre-publication peer-reviewed text, and it provides a foundation for future work to understand how terms acquire new meanings and how peer review affects this process.
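The article's "multiple integrated models" are not reproduced here, but the general idea of flagging change points by comparing year-to-year semantic shift against the series' own variability can be sketched as a toy illustration (hand-rolled cosine distance over per-year term vectors; this is an assumption-laden simplification, not the authors' method):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def change_points(yearly_vectors, z=2.0):
    """Flag years whose shift from the previous year is large relative
    to the variability of all year-to-year shifts for the term."""
    years = sorted(yearly_vectors)
    dists = [cosine_distance(yearly_vectors[a], yearly_vectors[b])
             for a, b in zip(years, years[1:])]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [years[i + 1] for i, d in enumerate(dists) if d > mean + z * std]

# Toy example: a term whose vector is stable from 2014-2019, then flips in 2020.
vectors = {year: [1.0, 0.0] for year in range(2014, 2020)}
vectors[2020] = [0.0, 1.0]
print(change_points(vectors))  # [2020]
```

In practice the per-year vectors would come from embeddings trained on each year's corpus and aligned across years; the thresholding here stands in for the paper's more careful treatment of intra- and inter-year variability.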

OpCitance: Citation contexts identified from the PubMed Central open access articles | Scientific Data

Abstract:  OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the “citation contexts”). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing styles. Only 0.5% of citation contexts remain unidentified due to technical or human issues, e.g., references not mentioned by the authors in the text or improper XML nesting, which is more common among older articles (pre-2000). PubMed IDs (PMIDs) linked to inline citations in the XML files differed from citations harvested using the NCBI E-Utilities for 70.96% of the articles. Using an in-house citation matcher, called Patci, 6.84% of the referenced PMIDs were supplemented and corrected. OpCitance includes fewer articles in total than the Semantic Scholar Open Research Corpus, but it has 160 thousand unique articles, a higher inline citation identification rate, and a more accurate reference mapping to PMIDs. We hope that OpCitance will facilitate citation context studies in particular and benefit text-mining research more broadly.
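As a rough illustration of the kind of parsing involved (not the OpCitance pipeline itself), inline citations in PMC JATS XML are marked with `<xref ref-type="bibr">` elements, and a minimal context extractor might pair each such element's reference id with the text of its enclosing paragraph:

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a PMCOA article body; real files are far messier,
# which is exactly the difficulty the paper describes.
SAMPLE = """<body><p>Deep parsing is hard
<xref ref-type="bibr" rid="B1">[1]</xref>, as noted before
<xref ref-type="bibr" rid="B2">[2]</xref>.</p></body>"""

def citation_contexts(xml_text):
    """Return (reference id, paragraph text) pairs for each inline
    bibliographic <xref>, using the whole paragraph as a crude context."""
    root = ET.fromstring(xml_text)
    contexts = []
    for p in root.iter("p"):
        text = " ".join("".join(p.itertext()).split())  # flatten + normalize
        for xref in p.findall(".//xref[@ref-type='bibr']"):
            contexts.append((xref.get("rid"), text))
    return contexts

print(citation_contexts(SAMPLE))
```

A production system would instead segment paragraphs into sentences and handle malformed nesting, which the abstract notes is the main source of the 0.5% unidentified contexts.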



Copyright, Text Mining and AI Training Tickets, Tue, 18 Apr 2023 at 12:00 PM | Eventbrite

“Join us for a conversation with Faith Majekolagbe, Michael Geist and Ruth L. Okediji on copyright issues and recent developments around the world regarding text and data mining, notably in Canada and African countries.

Conversations about the modernisation of copyright frameworks have arisen around the world in light of the computational analysis and processing of copyrighted content for greater knowledge and information generation through text and data mining techniques and inputs for AI development, especially the training of machine learning models. Do and should copyright frameworks support the unlicensed use of in-copyright works for text and data mining and the development of artificial intelligence? What are the interests at stake? …”

Ivy+ Text Data Mining Education for Advocacy (TEA) Task Force Phase One Report: Actions and Interventions to Address Concerns with Text Data Mining Platforms

“The Ivy Plus Libraries Confederation (IPLC) Digital Scholarship Affinity Group assembled the Text Data Mining Education for Advocacy (TEA) Task Force to develop shared, open, and accessible educational materials to improve researchers’ literacy on the current influx of third party vendor Text Data Mining (TDM) platforms. This task force was specifically entrusted to examine use and limitations related to the “closedness” of each platform and the direct and collateral effects of the monetization of data on these systems for transparency, collaboration, cross-platforming, and publishing. This report offers constructive criticism of the opaque box, all-or-nothing approach that vendors are taking in order to offer information to support researchers, openness, and equity. Over the course of six months, this task force produced a literature review to examine the current discourse on these emerging platforms, as well as prototype user profiles to clarify researcher needs and evaluate whether the platforms actually meet those needs. This report reflects the results of those efforts to identify exciting new opportunities for assessing emerging TDM platforms….”

Tackling the Law of Text and Data Mining for Computational Research – Duke University Libraries Blogs

“Over the last several years, Duke, like many other institutions, has made a significant investment in computational research, recognizing that such research techniques can have wide-ranging benefits, from translational research in the biomedical sciences to the digital humanities; this work can be, and has been, transformative. Much of this work relies on researchers being able to engage in text and data mining (TDM) to produce the datasets necessary for large-scale computational analysis. For the sciences, this can range from compiling research data across a whole series of research projects, to collecting large numbers of research articles for computer-aided systematic reviews. For the humanities, it may mean assembling a corpus of digitized books, DVDs, music, or images for analysis into how language, literary themes, or depictions have changed over time….

The techniques and tools for text and data-mining have advanced rapidly, but one constant for TDM researchers has been a fear of legal risk. For data-sets composed of copyrighted works, the risk of liability can seem staggering. With copyright’s statutory damages set as high as $150,000 per work infringed, a corpus of several hundred works can cause real concern. 

However, the risks of just avoiding copyrighted works are also high….”

McCracken | Licensing Challenges Associated With Text and Data Mining: How Do We Get Our Patrons What They Need? | Journal of Librarianship and Scholarly Communication

Abstract:  Today’s researchers expect to be able to complete text and data mining (TDM) work on many types of textual data. But they are often blocked more by contractual limitations on what data they can use, and how they can use it, than they are by what data may be available to them. This article lays out the different types of TDM processes currently in use, the issues that may block researchers from being able to do the work they would like, and some possible solutions.

GitHub is Sued, and We May Learn Something About Creative Commons Licensing – The Scholarly Kitchen

“I have had people tell me with doctrinal certainty that Creative Commons licenses allow text and data mining, and insofar as license terms are observed, I agree. The making of copies to perform text and data mining, machine learning, and AI training (collectively “TDM”) without additional licensing is authorized for commercial and non-commercial purposes under CC BY, and for non-commercial purposes under CC BY-NC. (Full disclosure: CCC offers RightFind XML, a service that supports licensed commercial access to full-text articles for TDM with value-added capabilities.)

I have long wondered, however, about the interplay between the attribution requirement (i.e., the “BY” in CC BY) and TDM. After all, the bargain with those licenses is that the author allows reuse, typically at no cost, but requires attribution. Attribution under the CC licenses may be the author’s primary benefit and motivation, as few authors would agree to offer the licenses without credit.

In the TDM context, this raises interesting questions:

Does the attribution requirement mean that the author’s information may not be removed as a data element from the content, even if inclusion might frustrate the TDM exercise or introduce noise into the system?
Does the attribution need to be included in the data set at every stage?
Does the result of the mining need to include attribution, even if hundreds of thousands of CC BY works were mined and the output does not include content from individual works?

While these questions may have once seemed theoretical, that is no longer the case. An analogous situation involving open software licenses (GNU and the like) is now being litigated….”

COMMUNIA Association – The Italian Implementation of the New EU Text and Data Mining Exceptions

“This blog post analyses the implementation of the copyright exceptions for Text and Data Mining, which is defined in the Italian law as any automated technique designed to analyse large amounts of text, sound, images, data or metadata in digital format to generate information, including patterns, trends, and correlations (Art. 70 ter (2) LdA). As we will see in more detail below, the Italian lawmaker decided to introduce some novelties when implementing Art. 3, while following more closely the text of the Directive when implementing Art. 4….

Notably, the new Italian exception also allows the communication to the public of the research outcome when such outcomes are expressed through new original works. In other words, the communication of protected materials resulting from computational research processes is permitted, provided that such results are included in an original publication, data collection or other original work.

The right of communication to the public was not contemplated in the original government draft; it was introduced in the last version of the article to accommodate the comments of the Joint Committees of the Senate and the Joint Committees of the Chamber, both highlighting the need to specify that the right of communication to the public concerns only the results of research, where expressed in new original works.


The beneficiaries of the TDM exception for scientific purposes are research organisations and cultural heritage institutions. Research organisations essentially reflect the definition offered by the directive. These are universities, including their libraries, research institutes or any other entity whose primary objective is to conduct scientific research activities or to conduct educational activities that include scientific research, which alternatively: …

The Italian lawmaker did not expressly contemplate any specific and fast procedure for cases where technical protection measures prevent a beneficiary from carrying out the permitted acts under both TDM exceptions. However, the law now grants beneficiaries the right to extract a copy of material protected by technological measures in certain cases. Under Art. 70-sexies, LdA, beneficiaries of the TDM exception for scientific purposes (as well as beneficiaries of the exception for digital and cross-border teaching activities) shall have the right to extract a copy of the protected material, when technological measures are applied based on agreements or on administrative procedures or judicial decisions. In order to benefit from this right, the person shall have lawful possession of copies of the protected material (or have had legal access to them), shall respect the conditions and the purposes provided for in the exception, and such extraction shall not conflict with the normal exploitation of the work or the other materials or cause an unjustified prejudice to the rights holders….”

Antibiotic discovery in the artificial intelligence era – Lluka – Annals of the New York Academy of Sciences – Wiley Online Library

Abstract:  As the global burden of antibiotic resistance continues to grow, creative approaches to antibiotic discovery are needed to accelerate the development of novel medicines. A rapidly progressing computational revolution—artificial intelligence—offers an optimistic path forward due to its ability to alleviate bottlenecks in the antibiotic discovery pipeline. In this review, we discuss how advancements in artificial intelligence are reinvigorating the adoption of past antibiotic discovery models—namely natural product exploration and small molecule screening. We then explore the application of contemporary machine learning approaches to emerging areas of antibiotic discovery, including antibacterial systems biology, drug combination development, antimicrobial peptide discovery, and mechanism of action prediction. Lastly, we propose a call to action for open access of high-quality screening datasets and interdisciplinary collaboration to accelerate the rate at which machine learning models can be trained and new antibiotic drugs can be developed.


Legal reform to enhance global text and data mining research | Science

“The resistance to TDM exceptions in copyright comes primarily from the multinational publishing industry, which is a strong voice in copyright debates and tends to oppose expansions to copyright exceptions. But the success at adopting exceptions for TDM research in the US and EU already—where publishing lobbies are strongest—shows that policy reform in this area is possible. Publishers need not be unduly disadvantaged by TDM exceptions because publishers can still license access to their databases, which researchers must obtain in the first instance, and can offer products that make TDM and other forms of research more efficient and effective.”

[2208.06178] Mining Legal Arguments in Court Decisions

Abstract:  Identifying, classifying, and analyzing arguments in legal discourse has been a prominent area of research since the inception of the argument mining field. However, there has been a major discrepancy between the way natural language processing (NLP) researchers model and annotate arguments in court decisions and the way legal experts understand and analyze legal argumentation. While computational approaches typically simplify arguments into generic premises and claims, arguments in legal research usually exhibit a rich typology that is important for gaining insights into the particular case and applications of law in general. We address this problem and make several substantial contributions to move the field forward. First, we design a new annotation scheme for legal arguments in proceedings of the European Court of Human Rights (ECHR) that is deeply rooted in the theory and practice of legal argumentation research. Second, we compile and annotate a large corpus of 373 court decisions (2.3M tokens and 15k annotated argument spans). Finally, we train an argument mining model that outperforms state-of-the-art models in the legal NLP domain and provide a thorough expert-based evaluation. All datasets and source codes are available under open lincenses at this https URL.