[2208.06178] Mining Legal Arguments in Court Decisions

Abstract:  Identifying, classifying, and analyzing arguments in legal discourse has been a prominent area of research since the inception of the argument mining field. However, there has been a major discrepancy between the way natural language processing (NLP) researchers model and annotate arguments in court decisions and the way legal experts understand and analyze legal argumentation. While computational approaches typically simplify arguments into generic premises and claims, arguments in legal research usually exhibit a rich typology that is important for gaining insights into the particular case and applications of law in general. We address this problem and make several substantial contributions to move the field forward. First, we design a new annotation scheme for legal arguments in proceedings of the European Court of Human Rights (ECHR) that is deeply rooted in the theory and practice of legal argumentation research. Second, we compile and annotate a large corpus of 373 court decisions (2.3M tokens and 15k annotated argument spans). Finally, we train an argument mining model that outperforms state-of-the-art models in the legal NLP domain and provide a thorough expert-based evaluation. All datasets and source codes are available under open licenses at this https URL.


Data Protections and Licenses Affecting Text and Data Mining for Machine Learning


Abstract:  Machines don’t read works or data. Machines need to first abstract and then format data for learning and then apply tagging and other metadata to model the data into something the machine can “understand.” Legal protections aren’t purpose-built to allow machines to abstract data from a work, process it, model it, and then re-present it. Most licenses aren’t purpose-built for that either. This document walks the reader through all the known protections and licenses as to whether they cover machine learning practices. It then postulates a proposed license structure for that purpose.

Editorial: Mining Scientific Papers, Volume II: Knowledge Discovery and Data Exploitation

“The processing of scientific texts, which includes the analysis of citation contexts but also the task of information extraction from scientific papers for various applications, has been the object of intensive research during the last decade. This has become possible thanks to two factors. The first one is the growing availability of scientific papers in full text and in machine-readable formats together with the rise of the Open Access publishing of papers on online platforms such as ArXiv, Semantic Scholar, CiteSeer, or PLOS. The second factor is the relative maturity of open source tools and libraries for natural language processing that facilitate text processing (e.g., Spacy, NLTK, Mallet, OpenNLP, CoreNLP, Gate, CiteSpace). As a result, a large number of experiments have been conducted by processing the full text of papers for citation context analysis, but also summarization and recommendation of scientific papers….”

NISO vision interview with CORE’s Petr Knoth on the role of text mining in scholarly communication – Research

“This Vision Interview with Petr Knoth, Senior Research Fellow in Text and Data Mining at the Open University and Head of CORE (core.ac.uk), served as the opening segment of the NISO Hot Topic virtual conference, Text and Data Mining, held on May 25, 2022. Todd Carpenter spoke at length with Knoth about the many ways in which text and data mining impacts the present as well as the future. They discussed just how innovative this technology can be for the needs of researchers in the information community….”

The LOTUS initiative for open knowledge management in natural products research | eLife

Abstract:  Contemporary bioinformatic and chemoinformatic capabilities hold promise to reshape knowledge management, analysis and interpretation of data in natural products research. Currently, reliance on a disparate set of non-standardized, insular, and specialized databases presents a series of challenges for data access, both within the discipline and for integration and interoperability between related fields. The fundamental elements of exchange are referenced structure-organism pairs that establish relationships between distinct molecular structures and the living organisms from which they were identified. Consolidating and sharing such information via an open platform has strong transformative potential for natural products research and beyond. This is the ultimate goal of the newly established LOTUS initiative, which has now completed the first steps toward the harmonization, curation, validation and open dissemination of 750,000+ referenced structure-organism pairs. LOTUS data is hosted on Wikidata and regularly mirrored on https://lotus.naturalproducts.net. Data sharing within the Wikidata framework broadens data access and interoperability, opening new possibilities for community curation and evolving publication models. Furthermore, embedding LOTUS data into the vast Wikidata knowledge graph will facilitate new biological and chemical insights. The LOTUS initiative represents an important advancement in the design and deployment of a comprehensive and collaborative natural products knowledge base.


Past and future uses of text mining in ecology and evolution | Proceedings of the Royal Society B: Biological Sciences

Abstract:  Ecology and evolutionary biology, like other scientific fields, are experiencing an exponential growth of academic manuscripts. As domain knowledge accumulates, scientists will need new computational approaches for identifying relevant literature to read and include in formal literature reviews and meta-analyses. Importantly, these approaches can also facilitate automated, large-scale data synthesis tasks and build structured databases from the information in the texts of primary journal articles, books, grey literature, and websites. The increasing availability of digital text, computational resources, and machine-learning based language models have led to a revolution in text analysis and natural language processing (NLP) in recent years. NLP has been widely adopted across the biomedical sciences but is rarely used in ecology and evolutionary biology. Applying computational tools from text mining and NLP will increase the efficiency of data synthesis, improve the reproducibility of literature reviews, formalize analyses of research biases and knowledge gaps, and promote data-driven discovery of patterns across ecology and evolutionary biology. Here we present recent use cases from ecology and evolution, and discuss future applications, limitations and ethical issues.


Open Research Knowledge Graph

“The Open Research Knowledge Graph (ORKG) aims to describe research papers in a structured manner. With the ORKG, papers are easier to find and compare….

Research is a fundamental pillar of societal progress. Yet, scientific communities face great difficulties in sharing their findings. With approximately 2.5 million newly published scientific articles per year, it is impossible to keep track of all relevant knowledge. Even in small fields, researchers often find themselves drowning in a publication flood, contributing to major scientific crises such as the reproducibility crisis, the deficiency of peer-review and ultimately the loss of knowledge.

The underlying problem is that we never updated our methods of scholarly communication to exploit the possibilities of digitalization. This is where the Open Research Knowledge Graph comes into play!

The ORKG makes scientific knowledge human- and machine-actionable and thus enables completely new ways of machine assistance. This will help researchers find relevant contributions to their field and create state-of-the-art comparisons and reviews. With the ORKG, scientists can explore knowledge in entirely new ways and share results even across different disciplines….”

Aligning the Research Library to Organizational Strategy – Ithaka S+R

“Open access has matured significantly in recent years. The UK and EU countries have committed largely to a “gold” version of open access, driven largely by transformative agreements with the major incumbent publishing houses.[14] The US policy environment has been far more mixed, with a great deal of “green” open access incentivized by major scientific funders, although some individual universities pursued transformative agreements. Both Canadian and US libraries have benefitted from the expansion of free and open access in strengthening their position at the negotiating table with major publishers.[15]

Progress on open access has radically expanded public access to the research literature. It has also brought with it a number of second-order effects. Some of them are connected to the serious problems in research integrity and the growing crisis of trust in science.[16] Others can be seen in the impacts on the scholarly publishing marketplace and the platforms that support discovery and access.[17]

While open access has made scientific materials more widely available, it has not directly addressed the challenges in translating scholarship for public consumption. Looking ahead, it is likely that scholarly communication will experience further changes as a result of computers increasingly supplanting human readership. The form of the scientific output may decreasingly look like the traditional journal article as over time standardized data, methods, protocols, and other scientific artifacts become vital for computational consumption….”

The need for open access and natural language processing | PNAS

“In PNAS, Chu and Evans (1) argue that the rapidly rising number of publications in any given field actually hinders progress. The rationale is that, if too many papers are published, the really novel ideas have trouble finding traction, and more and more people tend to “go along with the majority.” Review papers are cited more and more instead of original research. We agree with Chu and Evans: Scientists simply cannot keep up. This is why we argue that we must bring the powers of artificial intelligence/machine learning (AI/ML) and open access to the forefront. AI/ML is a powerful tool and can be used to ingest and analyze large quantities of data in a short period of time. For example, some of us (2) have used AI/ML tools to ingest 500,000+ abstracts from online archives (relatively easy to do today) and categorize them for strategic planning purposes. This letter offers a short follow-on to Chu and Evans (hereafter CE) to point out a way to mitigate the problems they delineate….

In conclusion, we agree with CE (1) on the problems caused by the rapid rise in scientific publications, outpacing any individual’s ability to keep up. We propose that open access, combined with NLP, can help effectively organize the literature, and we encourage publishers to make papers open access, archives to make papers easily findable, and researchers to employ their own NLP as an important tool in their arsenal.”

Recommendations on the Transformation of Academic Publishing: Towards Open Access

“Three central arguments support this transformation: (1) Openly accessible publications can be read, reviewed and used more quickly and more widely by other researchers. This increases the quality of research and accelerates scientific progress. (2) OA makes scientific knowledge more widely available outside of the scientific community and lowers the threshold for various transfer activities. This increases the social effectiveness of (publicly funded) research. (3) Up to now, the business model of publishers has been based on rights of use. As they will no longer be granted exclusive rights under OA, publishers will become publication service providers and will compete with other providers. This may strengthen the negotiating position of scientific institutions vis-à-vis such service providers and improve the innovative capacity, cost transparency and cost efficiency of the publication system.

As far as the Council is concerned, the goal of the transformation is for academic publications to be made freely available immediately, permanently, at the original publication venue and in the citable, peer-reviewed and typeset version of record under an open licence (CC BY). This so-called gold route to OA (gold OA) is compatible with various business models…. 

For orientation in this market, the Council recommends that the Alliance of Science Organisations in Germany agree on common requirements for quality assurance of content (especially in terms of peer review processes) as well as for high-quality publication services. In the medium term, academic publications should not only be openly accessible, but also machine-readable through open, structured formats and semantic annotations….

“Gold OA” should not be equated with funding via article processing charges (APC)….

As the WR sees it, all third-party funders are obliged to fully finance the publication costs arising from publishing the results of the research they are funding….”


copyright act: Educators Push For Amendment To Copyright Act | Pune News – Times of India

“Senior academicians and vice-chancellors of universities in the city have demanded inclusive digital education for which technology and infrastructural advances will have to be matched with changes in the copyright law enacted in 1957.

It is related specifically to open educational resources, digitisation of resource material and their sharing or lending, text and data mining, procurement and sharing of e-resources, digitally supported teaching activities, including distance learning.
In their research, the professors have stated that the amendments to the Copyright Act also need to ease operations of public libraries, institutional libraries, galleries, museums and archives in physical and digital frameworks, including the National Digital Library of India….”

Using pretraining and text mining methods to automatically extract the chemical scientific data | Emerald Insight

Abstract:  Purpose

In computational chemistry, the chemical bond energy (pKa) is essential, but most pKa-related data are submerged in scientific papers, with only a few data that have been extracted by domain experts manually. The loss of scientific data does not contribute to in-depth and innovative scientific data analysis. To address this problem, this study aims to utilize natural language processing methods to extract pKa-related scientific data in chemical papers.


Design/methodology/approach

Based on the previous Bert-CRF model combined with dictionaries and rules to resolve the problem of a large number of unknown words of professional vocabulary, in this paper, the authors proposed an end-to-end Bert-CRF model with inputting constructed domain wordpiece tokens using text mining methods. The authors use standard high-frequency string extraction techniques to construct domain wordpiece tokens for specific domains. And in the subsequent deep learning work, domain features are added to the input.


Findings

The experiments show that the end-to-end Bert-CRF model could have a relatively good result and can be easily transferred to other domains because it reduces the requirements for experts by using automatic high-frequency wordpiece tokens extraction techniques to construct the domain wordpiece tokenization rules and then input domain features to the Bert model.


Originality/value

By decomposing lots of unknown words with domain feature-based wordpiece tokens, the authors manage to resolve the problem of a large amount of professional vocabulary and achieve a relatively ideal extraction result compared to the baseline model. The end-to-end model explores low-cost migration for entity and relation extraction in professional fields, reducing the requirements for experts.
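
To make the idea in this abstract concrete, here is a minimal Python sketch of how frequent strings might be turned into domain wordpiece tokens. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the extraction technique, so plain character n-gram counting stands in for it, and the function names `frequent_substrings` and `wordpiece_tokenize`, the thresholds, and the toy word list are all hypothetical. The Bert-CRF model itself is not shown.

```python
from collections import Counter


def frequent_substrings(words, min_len=2, max_len=6, min_count=2):
    """Count character n-grams across words; keep those seen at least min_count times.

    Assumption: n-gram counting stands in for the unspecified
    "high-frequency string extraction" step.
    """
    counts = Counter()
    for word in words:
        for n in range(min_len, max_len + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return {s for s, c in counts.items() if c >= min_count}


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; continuation pieces get a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No vocabulary entry covers this position, so give up on the word.
            return [unk]
        pieces.append(match if start == 0 else "##" + match)
        start = end
    return pieces


# Toy domain word list (hypothetical); single lowercase letters act as a fallback.
domain_words = ["benzoic", "benzene", "benzyl", "acetic", "acetate", "acetyl", "benzoate"]
vocab = frequent_substrings(domain_words) | set("abcdefghijklmnopqrstuvwxyz")

print(wordpiece_tokenize("benzoyl", vocab))  # ['benzo', '##yl']
```

The unseen word "benzoyl" is split into two pieces that each recur in the toy word list, which is the sense in which unknown professional vocabulary can be decomposed without a hand-built dictionary.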

Text and Data Mining | NISO website

“Not so long ago, Text and Data Mining (TDM) — the automated detection of patterns and extraction of knowledge from machine-readable content or data — was a particular area of interest. So much so, that libraries and content providers developed licensing language and other resources to support researchers wanting to work with and manipulate this material, including a proliferation of LibGuides and APIs. But where are we now in identifying available resources and tools for TDM activities?

This virtual conference will provide an “explainer” for information professionals tasked with supporting researchers who are just beginning to engage with TDM, and wondering how to pull the data they need, how it is structured, and how they can expect to engage with it. Our speakers will cover essential technology, how it is deployed and used, the scope of support that the library may be asked to provide, and the spectrum of options for collaboration between information professionals and content and service providers.”