Using pretraining and text mining methods to automatically extract the chemical scientific data | Emerald Insight

Abstract:  Purpose

In computational chemistry, the acid dissociation constant (pKa) is essential, but most pKa-related data are buried in scientific papers, and only a small fraction has been extracted manually by domain experts. Leaving this scientific data inaccessible hinders in-depth and innovative analysis. To address this problem, this study aims to use natural language processing methods to extract pKa-related scientific data from chemistry papers.

Design/methodology/approach

Building on a previous BERT-CRF model that combined dictionaries and rules to handle the large number of unknown words in professional vocabulary, the authors propose an end-to-end BERT-CRF model whose input includes domain wordpiece tokens constructed with text mining methods. Standard high-frequency string extraction techniques are used to build domain wordpiece tokens for a specific domain, and these domain features are then added to the input of the subsequent deep learning model.
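
The paper's code is not reproduced here. A minimal sketch of the general idea (mining high-frequency strings from a small domain corpus and registering them as extra wordpiece tokens before fine-tuning a BERT-based tagger) might look like the following; the Hugging Face transformers tokenizer API and the toy thresholds are assumptions for illustration, not the authors' implementation.

```python
# Sketch: mine frequent domain strings and register them as extra wordpiece
# tokens before fine-tuning a BERT-based sequence tagger. Assumes the Hugging
# Face `transformers` tokenizer API; the paper's own pipeline may differ.
from collections import Counter
from transformers import BertTokenizerFast

def frequent_substrings(corpus, min_len=3, max_len=10, min_count=2):
    """Count substrings of whitespace-delimited tokens and keep frequent ones."""
    counts = Counter()
    for text in corpus:
        for word in text.split():
            for n in range(min_len, min(max_len, len(word)) + 1):
                for i in range(len(word) - n + 1):
                    counts[word[i:i + n]] += 1
    return [s for s, c in counts.items() if c >= min_count]

corpus = [  # toy stand-in for a corpus of chemistry sentences
    "The pKa of acetic acid in aqueous solution is 4.76.",
    "Dissociation constants (pKa) of carboxylic acids in aqueous solution.",
]
domain_tokens = frequent_substrings(corpus, min_count=2)  # toy threshold

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
num_added = tokenizer.add_tokens(domain_tokens)
print(f"Added {num_added} domain wordpiece tokens")
# A downstream BERT-CRF tagger would then call
# model.resize_token_embeddings(len(tokenizer)) before fine-tuning.
```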

Findings

The experiments show that the end-to-end BERT-CRF model achieves relatively good results and can be transferred to other domains with little effort: automatic high-frequency wordpiece token extraction builds the domain wordpiece tokenization rules, and the resulting domain features are fed into the BERT model, which reduces the reliance on domain experts.

Originality/value

By decomposing a large number of unknown words into domain feature-based wordpiece tokens, the authors resolve the problem of extensive professional vocabulary and achieve a relatively good extraction result compared with the baseline model. The end-to-end model explores low-cost transfer of entity and relation extraction to professional fields, reducing the reliance on domain experts.

Making Biomedical Sciences publications more accessible for machines | SpringerLink

Abstract:  With the rapidly expanding catalogue of scientific publications, especially within the Biomedical Sciences field, it is becoming increasingly difficult for researchers to search for, read or even interpret emerging scientific findings. PubMed, just one of the current biomedical data repositories, comprises over 33 million citations for biomedical research, and over 2500 publications are added each day. To further strengthen the impact of biomedical research, we suggest that there should be more synergy between publications and machines. By bringing machines into the realm of research and publication, we can greatly augment the assessment, investigation and cataloging of the biomedical literary corpus. The effective application of machine-based manuscript assessment and interpretation is now crucial, and potentially stands as the most effective way for researchers to comprehend and process the tsunami of biomedical data and literature. Many biomedical manuscripts are currently published online in poorly searchable document types, with figures and data presented in formats that are partially inaccessible to machine-based approaches. The structure and format of biomedical manuscripts should be adapted to facilitate machine-assisted interrogation of this important literary corpus. In this context, it is important to embrace the concept that biomedical scientists should also write manuscripts that can be read by machines. It is likely that an enhanced human–machine synergy in reading biomedical publications will greatly enhance biomedical data retrieval and reveal novel insights into complex datasets.
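
As a small illustration of what machine-readable access to this corpus already looks like, PubMed records can be retrieved programmatically through NCBI's E-utilities; in the sketch below the query term and result limit are arbitrary examples.

```python
# Sketch: retrieve PubMed records programmatically via NCBI E-utilities.
# The search term and retmax are arbitrary examples; production use should
# add an API key and respect NCBI rate limits.
import requests

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# 1) esearch: find PubMed IDs matching a query.
ids = requests.get(
    f"{BASE}/esearch.fcgi",
    params={"db": "pubmed", "term": "machine learning[Title]",
            "retmax": 5, "retmode": "json"},
    timeout=30,
).json()["esearchresult"]["idlist"]

# 2) efetch: pull the matching records, abstracts included, as XML.
records = requests.get(
    f"{BASE}/efetch.fcgi",
    params={"db": "pubmed", "id": ",".join(ids),
            "rettype": "abstract", "retmode": "xml"},
    timeout=30,
).text
print(records[:500])
```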

The General Index: Search 107 million research papers for free – Big Think

“Millions of research papers get published every year, but the majority lie behind paywalls. A new online catalogue called the General Index aims to make it easier to access and search through the world’s research papers. Unlike other databases which include the full text of research papers, the General Index only allows users to access snippets of content….” 

An exploratory analysis: extracting materials science knowledge from unstructured scholarly data | Emerald Insight

Abstract:  Purpose

The output of academic literature has increased significantly due to digital technology, presenting researchers with a challenge across every discipline, including materials science, as it is impossible to manually read and extract knowledge from millions of published articles. The purpose of this study is to address this challenge by exploring knowledge extraction in materials science, as applied to digital scholarship. An overriding goal is to help inform readers about the status of knowledge extraction in materials science.

Design/methodology/approach

The authors conducted a two-part analysis: a comparison of knowledge extraction methods applied to materials science scholarship across a sample of 22 articles, followed by a comparison of HIVE-4-MAT, an ontology-based knowledge extraction application, and MatScholar, a named entity recognition (NER) application. This paper covers contextual background and a review of three tiers of knowledge extraction (ontology-based, NER and relation extraction), followed by the research goals and approach.
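
For readers unfamiliar with the NER tier discussed here, a minimal token-level tagging example is sketched below; the Hugging Face pipeline and the general-purpose model named in it are assumptions for illustration, not the domain-specific models used by MatScholar or HIVE-4-MAT.

```python
# Sketch: named entity recognition over a materials-science sentence with a
# general-purpose model. MatScholar and HIVE-4-MAT rely on their own
# domain-specific models and ontologies; this only illustrates the NER tier.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",     # assumed general-purpose NER model
    aggregation_strategy="simple",   # merge wordpieces back into entity spans
)

sentence = "Thin films of LiFePO4 were annealed at 600 C at Oak Ridge National Laboratory."
for entity in ner(sentence):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```

A general-purpose model like this mainly recognizes people, organizations and locations; tagging materials entities such as LiFePO4 is exactly where the domain-specific models compared in the paper come in.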

Findings

The results indicate three key needs for researchers to consider for advancing knowledge extraction: the need for materials science focused corpora; the need for researchers to define the scope of the research being pursued; and the need to understand the tradeoffs among different knowledge extraction methods. This paper also points to future materials science research potential with relation extraction and increased availability of ontologies.

Originality/value

To the best of the authors’ knowledge, there are very few studies examining knowledge extraction in materials science. This work makes an important contribution to this underexplored research area.

Giant, free index to world’s research papers released online

“In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers.

The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California that he founded.

Malamud says that because his index doesn’t contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers’ copyright restrictions on the re-use of paywalled articles. However, one legal expert says that publishers might question the legality of how Malamud created the index in the first place….”
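
The basic idea of an index of short phrases keyed to the articles that contain them can be illustrated with a toy inverted index; the sketch below is a simplification and not the actual schema of the released General Index files.

```python
# Sketch: a toy phrase-to-article inverted index, in the spirit of the
# General Index (which lists n-grams of up to five words next to the
# articles in which they appear). Not the actual schema of the release.
from collections import defaultdict

def ngrams(words, max_n=5):
    """Yield every 1- to max_n-word phrase in a list of words."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

index = defaultdict(set)

def add_article(doc_id, text):
    for phrase in ngrams(text.lower().split()):
        index[phrase].add(doc_id)

add_article("doi:10.1000/xyz123", "CRISPR Cas9 gene editing in human cells")  # hypothetical IDs
add_article("doi:10.1000/abc456", "gene editing ethics and public policy")

# Look up every article whose text contains a given phrase.
print(sorted(index["gene editing"]))
```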

WikiJournal Preprints/Aggregation of scholarly publications and extracted knowledge on COVID19 and epidemics – Wikiversity

“This project aims to use modern tools, especially Wikidata (and Wikipedia), R, Java and text mining, together with semantic tools, to create a modern integrated resource of all current published information on viruses and their epidemics. It relies on collaboration and gifts of labour and knowledge.

The world faces (and will continue to face) viral epidemics which arise suddenly and where scientific/medical knowledge is a critical resource. Despite over 100 billion USD spent on medical research worldwide, much knowledge is behind publisher paywalls and only available to rich universities. Moreover, it is usually badly published and dispersed without coherent knowledge tools. It particularly disadvantages the Global South….”
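
On the Wikidata side of such an aggregation, linking extracted virus or epidemic mentions to identifiers can start from the public MediaWiki API; in the sketch below the search string is an arbitrary example, and the project's own R/Java/text-mining tooling is not shown.

```python
# Sketch: look up Wikidata items by label via the public MediaWiki API, as a
# starting point for linking extracted virus/epidemic mentions to Wikidata
# identifiers. The search string is an arbitrary example.
import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbsearchentities",
        "search": "SARS-CoV-2",
        "language": "en",
        "format": "json",
    },
    timeout=30,
)
for hit in resp.json().get("search", []):
    print(hit["id"], "-", hit.get("description", ""))
```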

A web application to extract key information from journal articles

“Non-expert readers are thus typically unable to understand scientific articles, unless they are curated and made more accessible by third parties who understand the concepts and ideas contained within them. With this in mind, a team of researchers at the Texas Advanced Computing Center (TACC) at the University of Texas at Austin, Oregon State University (OSU) and the American Society of Plant Biologists (ASPB) has set out to develop a tool that can automatically extract important phrases and terminology from research papers in order to provide useful definitions and enhance their readability….”
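
As a rough illustration of the underlying task, candidate key terms can be surfaced from article text with something as simple as TF-IDF scoring; the sketch below is not the TACC/OSU/ASPB tool, whose approach is more sophisticated.

```python
# Sketch: surface candidate key terms from article text with TF-IDF scores.
# The tool described above is more sophisticated; this only illustrates the
# basic idea of scoring terms that are distinctive for one article.
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [  # toy stand-ins for article text
    "Stomatal conductance regulates transpiration in drought-stressed maize.",
    "Auxin transport shapes root architecture in Arabidopsis seedlings.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
scores = vectorizer.fit_transform(articles)

# Top-scoring terms for the first article are candidate key phrases.
terms = vectorizer.get_feature_names_out()
row = scores[0].toarray().ravel()
for score, term in sorted(zip(row, terms), reverse=True)[:5]:
    print(f"{term}: {score:.2f}")
```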

Paper Digest

“Paper Digest uses AI to generate an automatic summary of a given research paper. You can simply provide a DOI (digital object identifier) or the URL of a PDF file, and Paper Digest will return a bulleted summary of the paper. This works only for open access full-text articles that allow derivative generation (i.e. CC BY equivalent). If you receive an error message and no summary is generated, it is most likely that either the full text is not available to use or the license does not allow derivative generation….”
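
Whether a DOI resolves to openly licensed full text can often be checked programmatically before attempting a summary; the sketch below uses the Crossref REST API with an arbitrary example DOI, and is not Paper Digest's own check.

```python
# Sketch: check a DOI's registered license via the Crossref REST API before
# attempting derivative generation. This is not Paper Digest's own check;
# the DOI below is an arbitrary example.
import requests

doi = "10.1371/journal.pmed.0020124"   # arbitrary example DOI
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
work = resp.json().get("message", {})

license_urls = [lic.get("URL", "") for lic in work.get("license", [])]
is_cc_by = any("creativecommons.org/licenses/by" in url for url in license_urls)

print("Title:", (work.get("title") or ["<unknown>"])[0])
print("CC BY licensed:", is_cc_by)
```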

Underlay

“The Underlay is a global, distributed graph database of public knowledge. Initial hosts will include universities and individuals, such that no single group controls the content. This is an attempt to replicate the richness of private knowledge graphs in a public, decentralized manner….

Powerful collections of machine-readable knowledge are growing in importance each year, but most are privately owned (e.g., Google’s Knowledge Graph, Wolfram Alpha, Scopus). The Underlay aims to secure such a collection as a public resource. It also gives chains of provenance a central place in its data model, to help tease out bias or error that can appear at different layers of assumption, synthesis, and evaluation….

The Underlay team is developing the protocols, first instances, and governing rules of this knowledge graph. Information will be added at first by building focused, interpretive overlays — knowledge curated for a particular audience. Overlays could for instance be journals, maps, or timelines, incorporating many sources of more granular information into a single lens….

[Coming in Phase 2:] A network of Underlay nodes at different institutions, demonstrating local vs global updating. An initial pipeline for extracting structured knowledge and sources from documents to populate lower layers. Tools to sync with existing structured repositories such as Wikidata, Freebase, and SHARE. And tools to visualize what is in the Underlay and how it is being used….”
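
The core idea of statements that always carry their chain of provenance can be rendered as a toy data structure; the sketch below is only an illustration, not the Underlay's actual data model or protocols.

```python
# Sketch: a toy assertion-with-provenance record, illustrating a knowledge
# graph in which every statement carries its chain of sources. The Underlay's
# actual data model and protocols are more elaborate.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Provenance:
    source: str      # e.g. a DOI, URL, or upstream overlay
    method: str      # how the statement was obtained (curation, extraction, ...)
    retrieved: str   # ISO 8601 timestamp

@dataclass
class Assertion:
    subject: str
    predicate: str
    obj: str
    provenance: List[Provenance] = field(default_factory=list)

stmt = Assertion(
    subject="aspirin",
    predicate="treats",
    obj="headache",
    provenance=[Provenance(
        source="doi:10.1000/example",   # hypothetical source identifier
        method="manual curation",
        retrieved="2021-10-07T00:00:00Z",
    )],
)
print(stmt)
```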

Taylor & Francis is bringing AI to academic publishing – but it isn’t easy | The Bookseller

“Leading academic publisher Taylor & Francis is developing natural language processing technology to help machines understand its books and journals, with the aim to enrich customers’ online experiences and create new tools to make the company more efficient.

The first step extracts topics and concepts from text in any scholarly subject domain, and shows recommendations of additional content to online users based on what they are already reading, allowing them to discover new research more easily. Further steps will lead to semantic content enrichment for more improvements in areas such as relatedness, better searches, and finding peer-reviewers and specialists on particular subjects….”
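
A minimal illustration of that first step (recommending related content by comparing term vectors of what a reader is viewing against the rest of a catalogue) is sketched below with TF-IDF and cosine similarity; Taylor & Francis's actual NLP stack is not public, so this only names the general technique.

```python
# Sketch: recommend related items by cosine similarity of TF-IDF vectors.
# Taylor & Francis's actual NLP stack is not public; this only illustrates
# the "recommend more like what you are reading" step in miniature.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalogue = {  # hypothetical titles mapped to toy descriptions
    "Qualitative methods in education research": "interviews coding thematic analysis classrooms",
    "Bayesian statistics for social scientists": "priors posterior inference regression models",
    "Machine learning for survey data": "classification regression models feature selection surveys",
}

titles = list(catalogue)
vectors = TfidfVectorizer().fit_transform(catalogue.values())

currently_reading = 2  # the reader is viewing "Machine learning for survey data"
sims = cosine_similarity(vectors[currently_reading], vectors).ravel()

# Rank the rest of the catalogue by similarity to the current item.
for score, title in sorted(zip(sims, titles), reverse=True):
    if title != titles[currently_reading]:
        print(f"{score:.2f}  {title}")
```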
