Semantic wikis as flexible database interfaces for biomedical applications | Scientific Reports

Abstract:  Several challenges prevent extracting knowledge from biomedical resources, including data heterogeneity and the difficulty of obtaining and collaborating on data and annotations by medical doctors. Therefore, flexibility in their representation and interconnection is required; it is also essential to be able to interact easily with such data. In recent years, semantic tools have been developed: semantic wikis are collections of wiki pages that can be annotated with properties and so combine flexibility and expressiveness, two desirable aspects when modeling databases, especially in the dynamic biomedical domain. However, semantics and collaborative analysis of biomedical data remain an unsolved challenge. The aim of this work is to create a tool for easing the design and the setup of semantic databases and to make it possible to enrich them with biostatistical applications. As a side effect, this will also make them reproducible, fostering their application by other research groups. A command-line tool has been developed for creating all structures required by Semantic MediaWiki. In addition, a way to expose statistical analyses as R Shiny applications in the interface is provided, along with a facility to export Prolog predicates for reasoning with external tools. The software made it possible to create a set of biomedical databases for the Neuroscience Department of the University of Padova in a more automated way. They can be extended with additional qualitative and statistical analyses of data, including for instance regressions, geographical distribution of diseases, and clustering. The software is released as open-source code and published under the GPL-3 license at


Data Conversion Laboratory and The New York Public Library to Speak at Digital Book World 2023 in New York City | SSP Society for Scholarly Publishing

“Fresh Meadows, NY, January 5, 2023–Data Conversion Laboratory (DCL), an industry leader in structured data and content transformations, and The New York Public Library (NYPL), the second largest library in the US, second only to the Library of Congress, will speak at Digital Book World in New York City on January 17. Greg Cram, Associate General Counsel and Director, Information Policy at NYPL and Mark Gross, President at DCL, will present a case study that details how NYPL with the support of DCL has digitized historical records of the US Copyright Office, making those records searchable, accessible, and useful for new product development and more.

Data extraction combined with well-structured content is being used to enrich the Catalog of Copyright Entries to create an important resource for the world. Each year, millions of people interact with the NYPL’s digital content, including databases, online classes and programs, digitized collections items (including manuscripts and photographs), and more. The new addition of digitized records from the US Copyright Office will add another element, giving the public the ability to discover content, narrow search results, identify relevant records, and view both machine-readable text and an image of the printed record….”

Making preprints more usable as a source of information – TH Köln

From Google’s English:  “In the project PIXLS – Preprint Information eXtraction for Life Sciences, TH Köln and ZB MED will develop an application over the next three years that automatically indexes preprint servers. This will enable the research community to make better use of current information that is published on preprint servers and therefore hardly appears in conventional indexing and search systems.

The German Research Foundation (DFG) is funding the project as part of the e-Research Technologies framework programme.”

Framework for entity extraction with verification: application to inference of data set usage in research publications | Emerald Insight

Abstract:  Purpose

The purpose of this paper is to propose an extensible framework for extracting data set usage from research articles.


Design/methodology/approach

The framework uses a training set of manually labeled examples to identify word features surrounding data set usage references. Using the word features and general entity identifiers, candidate data sets are extracted and scored separately at the sentence and document levels. Finally, the extracted data set references can be verified by the authors using a web-based verification module.


Findings

This paper successfully addresses a significant gap in entity extraction literature by focusing on data set extraction. In the process, this paper: identified an entity-extraction scenario with specific characteristics that enable a multiphase approach, including a feasible author-verification step; defined the search space for word feature identification; defined scoring functions for sentences and documents; and designed a simple web-based author verification step. The framework is successfully tested on 178 articles authored by researchers from a large research organization.


Originality/value

Whereas previous approaches focused on completely automated large-scale entity recognition from text snippets, the proposed framework is designed for longer, high-quality texts, such as research publications. The framework includes a verification module that makes it possible to request validation of the discovered entities from the authors of the research publications. This module shares some similarities with general crowdsourcing approaches, but the target scenario increases the likelihood of meaningful author participation.
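
The sentence-level and document-level scoring described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cue-word list, the capitalized-phrase candidate pattern, and the choice of the maximum sentence score as the document score are all assumptions made for illustration, whereas the paper learns its word features from manually labeled examples.

```python
import re
from collections import defaultdict

# Hypothetical cue words; the paper derives its word features from labeled data.
CUE_WORDS = {"data", "dataset", "using", "from", "obtained", "analyzed"}

# Naive candidate pattern (an assumption): two or more consecutive capitalized words.
CANDIDATE = re.compile(r"[A-Z][A-Za-z0-9]+(?:\s+[A-Z][A-Za-z0-9]+)+")


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_score(sentence, candidate):
    """Fraction of the words surrounding the candidate that are cue words."""
    context = sentence.replace(candidate, " ")
    words = re.findall(r"[a-z]+", context.lower())
    if not words:
        return 0.0
    return sum(w in CUE_WORDS for w in words) / len(words)


def document_scores(text):
    """Score each candidate per sentence, then keep its best sentence score."""
    per_sentence = defaultdict(list)
    for sentence in split_sentences(text):
        for match in CANDIDATE.finditer(sentence):
            cand = match.group(0)
            per_sentence[cand].append(sentence_score(sentence, cand))
    return {cand: max(scores) for cand, scores in per_sentence.items()}
```

In a workflow like the one the abstract describes, candidates above some threshold would then be passed to the authors for verification rather than accepted automatically.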

Editorial: Mining Scientific Papers, Volume II: Knowledge Discovery and Data Exploitation

“The processing of scientific texts, which includes the analysis of citation contexts but also the task of information extraction from scientific papers for various applications, has been the object of intensive research during the last decade. This has become possible thanks to two factors. The first one is the growing availability of scientific papers in full text and in machine-readable formats together with the rise of the Open Access publishing of papers on online platforms such as ArXiv, Semantic Scholar, CiteSeer, or PLOS. The second factor is the relative maturity of open source tools and libraries for natural language processing that facilitate text processing (e.g., Spacy, NLTK, Mallet, OpenNLP, CoreNLP, Gate, CiteSpace). As a result, a large number of experiments have been conducted by processing the full text of papers for citation context analysis, but also summarization and recommendation of scientific papers….”

Using pretraining and text mining methods to automatically extract the chemical scientific data | Emerald Insight

Abstract:  Purpose

In computational chemistry, the chemical bond energy (pKa) is essential, but most pKa-related data are submerged in scientific papers, with only a few data that have been extracted by domain experts manually. The loss of scientific data does not contribute to in-depth and innovative scientific data analysis. To address this problem, this study aims to utilize natural language processing methods to extract pKa-related scientific data in chemical papers.


Design/methodology/approach

Based on the previous Bert-CRF model combined with dictionaries and rules to resolve the problem of a large number of unknown words of professional vocabulary, in this paper, the authors proposed an end-to-end Bert-CRF model with inputting constructed domain wordpiece tokens using text mining methods. The authors use standard high-frequency string extraction techniques to construct domain wordpiece tokens for specific domains. And in the subsequent deep learning work, domain features are added to the input.


Findings

The experiments show that the end-to-end Bert-CRF model could have a relatively good result and can be easily transferred to other domains because it reduces the requirements for experts by using automatic high-frequency wordpiece tokens extraction techniques to construct the domain wordpiece tokenization rules and then input domain features to the Bert model.


Originality/value

By decomposing lots of unknown words with domain feature-based wordpiece tokens, the authors manage to resolve the problem of a large amount of professional vocabulary and achieve a relatively ideal extraction result compared to the baseline model. The end-to-end model explores low-cost migration for entity and relation extraction in professional fields, reducing the requirements for experts.
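
The idea of building domain wordpiece tokens from high-frequency strings can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the length and frequency thresholds, and the greedy longest-match splitting with a `##` continuation prefix are hypothetical choices, and the abstract does not specify the authors' actual extraction procedure.

```python
from collections import Counter


def frequent_substrings(terms, min_len=3, max_len=8, min_count=2):
    """Collect substrings that occur in at least `min_count` distinct terms."""
    counts = Counter()
    for term in terms:
        t = term.lower()
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(t) - n + 1):
                seen.add(t[i:i + n])
        counts.update(seen)  # each substring counted once per term
    return {s for s, c in counts.items() if c >= min_count}


def tokenize(word, vocab):
    """Greedy longest-match-first split; unmatched characters become 1-char pieces."""
    word = word.lower()
    pieces = []
    i = 0
    while i < len(word):
        end = len(word)
        while end > i and word[i:end] not in vocab:
            end -= 1
        if end == i:  # nothing in the vocabulary starts here
            end = i + 1
        piece = word[i:end]
        pieces.append(piece if i == 0 else "##" + piece)
        i = end
    return pieces
```

For example, a vocabulary built from a handful of chemical names lets an otherwise unknown compound name be split into recurring domain fragments instead of a single unknown token.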

Making Biomedical Sciences publications more accessible for machines | SpringerLink

Abstract:  With the rapidly expanding catalogue of scientific publications, especially within the Biomedical Sciences field, it is becoming increasingly difficult for researchers to search for, read or even interpret emerging scientific findings. PubMed, just one of the current biomedical data repositories, comprises over 33 million citations for biomedical research, and over 2500 publications are added each day. To further strengthen the impact of biomedical research, we suggest that there should be more synergy between publications and machines. By bringing machines into the realm of research and publication, we can greatly augment the assessment, investigation and cataloging of the biomedical literary corpus. The effective application of machine-based manuscript assessment and interpretation is now crucial, and potentially stands as the most effective way for researchers to comprehend and process the tsunami of biomedical data and literature. Many biomedical manuscripts are currently published online in poorly searchable document types, with figures and data presented in formats that are partially inaccessible to machine-based approaches. The structure and format of biomedical manuscripts should be adapted to facilitate machine-assisted interrogation of this important literary corpus. In this context, it is important to embrace the concept that biomedical scientists should also write manuscripts that can be read by machines. It is likely that an enhanced human–machine synergy in reading biomedical publications will greatly enhance biomedical data retrieval and reveal novel insights into complex datasets.


The General Index: Search 107 million research papers for free – Big Think

“Millions of research papers get published every year, but the majority lie behind paywalls. A new online catalogue called the General Index aims to make it easier to access and search through the world’s research papers. Unlike other databases which include the full text of research papers, the General Index only allows users to access snippets of content….” 

An exploratory analysis: extracting materials science knowledge from unstructured scholarly data | Emerald Insight

Abstract:  Purpose

The output of academic literature has increased significantly due to digital technology, presenting researchers with a challenge across every discipline, including materials science, as it is impossible to manually read and extract knowledge from millions of published works. The purpose of this study is to address this challenge by exploring knowledge extraction in materials science, as applied to digital scholarship. An overriding goal is to help inform readers about the status of knowledge extraction in materials science.


Design/methodology/approach

The authors conducted a two-part analysis: a comparison of knowledge extraction methods applied to materials science scholarship, across a sample of 22 articles, followed by a comparison of HIVE-4-MAT, an ontology-based knowledge extraction application, and MatScholar, a named entity recognition (NER) application. This paper covers contextual background and a review of three tiers of knowledge extraction (ontology-based, NER and relation extraction), followed by the research goals and approach.


Findings

The results indicate three key needs for researchers to consider for advancing knowledge extraction: the need for materials science focused corpora; the need for researchers to define the scope of the research being pursued, and the need to understand the tradeoffs among different knowledge extraction methods. This paper also points to future material science research potential with relation extraction and increased availability of ontologies.


Originality/value

To the best of the authors’ knowledge, there are very few studies examining knowledge extraction in materials science. This work makes an important contribution to this underexplored research area.

Giant, free index to world’s research papers released online

“In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers.

The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California that he founded.

Malamud says that because his index doesn’t contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers’ copyright restrictions on the re-use of paywalled articles. However, one legal expert says that publishers might question the legality of how Malamud created the index in the first place….”
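
To make concrete what an index of words and snippets up to five words long, listed next to the articles in which they appear, could look like, here is a small Python sketch. It is an assumption-laden illustration, not the General Index's actual format or build process: the function names, the article identifiers, and the lowercase alphanumeric tokenization are hypothetical.

```python
import re
from collections import defaultdict


def ngrams(text, max_n=5):
    """Yield every run of 1 to max_n consecutive words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(articles, max_n=5):
    """Map each n-gram to the set of article IDs that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for gram in ngrams(text, max_n):
            index[gram].add(article_id)
    return index


# Hypothetical two-article corpus for illustration.
articles = {
    "a1": "Gene expression in yeast cells",
    "a2": "Yeast cells under stress",
}
index = build_index(articles)
print(sorted(index["yeast cells"]))
```

Because only short fragments and article identifiers are stored, a lookup shows where a phrase occurs without reproducing the full text of any article.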