Weaving a Semantic Web across OSS repositories: unleashing a new potential for academia and practice

Abstract: Several public repositories and archives of “facts” about libre software projects, maintained either by open source communities or by research communities, have been flourishing on the Web in recent years. These have enabled new analyses and support new quality assurance tasks.

This paper presents some complementary existing tools, projects and models, proposed by both OSS actors and research initiatives, that are likely to lead to useful future developments, both for the study of the FLOSS phenomenon and for the practitioners in FLOSS development projects, provided that interoperability is fostered throughout.

A goal of the research conducted within the HELIOS project is to address bug traceability issues. To that end, we investigate the potential of Semantic Web technologies for navigating between the many different bug tracker systems scattered across the open source ecosystem.

By using Semantic Web techniques, it is possible to interconnect the databases containing data about the development of open-source software projects, letting OSS participants identify resources, annotate them, and further interlink them using dedicated properties, thereby collectively designing a distributed semantic graph. Such links, expressed with standard Semantic Web techniques, pave the way to new applications (including ones meant for “end-users”). For instance, this may make research efforts less fragmented, and development communities could also use it to improve Quality Assurance tasks.
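The interlinking idea can be illustrated with a minimal sketch. It assumes hypothetical bug URIs and a hypothetical `sameIssueAs` property (neither is taken from the HELIOS project): reports from several trackers are stored as subject/predicate/object triples, and following the links collects every report that describes the same issue.

```python
from collections import defaultdict, deque

# Hypothetical property and bug URIs, used only for illustration.
SAME_ISSUE = "http://example.org/vocab#sameIssueAs"

triples = [
    ("http://bugs.example.org/debian/1001", SAME_ISSUE,
     "http://bugs.example.org/upstream/42"),
    ("http://bugs.example.org/upstream/42", SAME_ISSUE,
     "http://bugs.example.org/ubuntu/77"),
    ("http://bugs.example.org/debian/2002", SAME_ISSUE,
     "http://bugs.example.org/upstream/99"),
]


def linked_reports(triples, start):
    """Return every bug URI reachable from `start` through SAME_ISSUE links."""
    # Treat the link as symmetric: index each triple in both directions.
    neighbours = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == SAME_ISSUE:
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)

    # Breadth-first traversal over the link graph.
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in neighbours[current]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return sorted(seen - {start})


related = linked_reports(triples, "http://bugs.example.org/debian/1001")
print(related)
```

In a real deployment the triples would come from each tracker's published RDF rather than from an in-memory list; the traversal logic stays the same.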

Taylor & Francis is bringing AI to academic publishing – but it isn’t easy | The Bookseller

“Leading academic publisher Taylor & Francis is developing natural language processing technology to help machines understand its books and journals, with the aim to enrich customers’ online experiences and create new tools to make the company more efficient.

The first step extracts topics and concepts from text in any scholarly subject domain, and shows recommendations of additional content to online users based on what they are already reading, allowing them to discover new research more easily. Further steps will lead to semantic content enrichment for more improvements in areas such as relatedness, better searches, and finding peer-reviewers and specialists on particular subjects….”
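As a rough illustration of recommending content from extracted concepts, the sketch below ranks documents by the overlap between their concept sets (Jaccard similarity). The document identifiers, the concept sets, and the choice of Jaccard similarity are all assumptions made for this example; the excerpt does not say which method Taylor & Francis uses.

```python
# Hypothetical catalogue: each document maps to concepts already extracted
# from its text by some upstream step.
catalogue = {
    "doc-a": {"semantic web", "linked data", "ontology"},
    "doc-b": {"linked data", "ontology", "metadata"},
    "doc-c": {"peer review", "publishing"},
}


def jaccard(a, b):
    """Share of concepts two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(current_id, catalogue, top_n=2):
    """Rank the other documents by concept overlap with the current one."""
    current = catalogue[current_id]
    scored = [
        (jaccard(current, concepts), doc_id)
        for doc_id, concepts in catalogue.items()
        if doc_id != current_id
    ]
    # Drop unrelated documents, then sort by score (ties broken by id).
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]


recommendations = recommend("doc-a", catalogue)
print(recommendations)
```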

SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research | Journal of the American Medical Informatics Association | Oxford Academic

“Unlocking the data contained within both structured and unstructured components of electronic health records (EHRs) has the potential to provide a step change in data available for secondary research use, generation of actionable medical insights, hospital management, and trial recruitment. To achieve this, we implemented SemEHR, an open source semantic search and analytics tool for EHRs….”

Automating semantic publishing – IOS Press

Abstract: “Semantic Publishing involves the use of Web and Semantic Web technologies and standards for the semantic enhancement of a scholarly work so as to improve its discoverability, interactivity, openness and (re-)usability for both humans and machines. Recently, people have suggested that the semantic enhancements of a scholarly work should be undertaken by the authors of that scholarly work, and should be considered as integral parts of the contribution subjected to peer review. However, this requires that the authors should spend additional time and effort adding such semantic annotations, time that they usually do not have available. Thus, the most pragmatic way to facilitate this additional task is to use automated services that create the semantic annotation of authors’ scholarly articles by parsing the content that they have already written, thus reducing the additional time required of the authors to that for checking and validating these semantic annotations. In this article, I propose a generic approach called compositional and iterative semantic enhancement (CISE) that enables the automatic enhancement of scholarly papers with additional semantic annotations in a way that is independent of the markup used for storing scholarly articles and the natural language used for writing their content.”

Semantic Scholar – An academic search engine for scientific articles

“We’ve pulled over 40 million scientific papers from sources like PubMed, Nature, and ArXiv….Our AI analyzes research papers and pulls out authors, references, figures, and topics….We link all of this information together into a comprehensive picture of cutting-edge research….What if a cure for an intractable cancer is hidden within the results of thousands of clinical studies? We believe that in 20 years’ time, AI will be able to connect the dots between studies to identify hypotheses and suggest experiments that would otherwise be missed. That’s why we’re building Semantic Scholar and making it free and open to researchers everywhere.

Semantic Scholar is a project at the Allen Institute for Artificial Intelligence (AI2). AI2 was founded to conduct high-impact research and engineering in the field of artificial intelligence. We’re funded by Paul Allen, Microsoft co-founder, and led by Dr. Oren Etzioni, a world-renowned researcher and professor in the field of artificial intelligence….”

Evaluating open access journals using Semantic Web technologies and scorecards | Journal of Information Science – Maria Hallo, Sergio Luján-Mora, Alejandro Maté, 2017

“This paper describes a process to develop and publish a scorecard from an OAJ (Open Access Journal) on the Semantic Web using Linked Data technologies in such a way that it can be linked to related datasets. Furthermore, methodological guidelines are presented with activities related to each step of the process. The proposed process was applied to a university OAJ, including the definition of the KPIs (Key Performance Indicators) linked to the institutional strategies, the extraction, cleaning and loading of data from the data sources into a data mart, the transformation of data into RDF (Resource Description Framework), and the publication of data by means of a SPARQL endpoint using the Virtuoso software. Additionally, the RDF data cube vocabulary has been used to publish the multidimensional data on the Web. The visualization was made using CubeViz, a faceted browser to present the KPIs in interactive charts.”
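The RDF transformation step can be sketched as follows, assuming a hypothetical `http://example.org/oaj/` namespace, hypothetical dimension and measure properties, and invented KPI rows (none of these come from the paper). Only the `rdf:type`, `qb:Observation`, and `qb:dataSet` IRIs belong to the standard RDF and RDF Data Cube vocabularies. Each row becomes one observation, serialized as N-Triples lines.

```python
# IRIs from the RDF, RDF Data Cube, and XML Schema vocabularies.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB = "http://purl.org/linked-data/cube#"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical namespace and KPI rows, for illustration only.
EX = "http://example.org/oaj/"
rows = [
    {"year": 2015, "kpi": "articles_published", "value": 48},
    {"year": 2016, "kpi": "articles_published", "value": 55},
]


def observation_triples(index, row):
    """Serialize one KPI row as N-Triples lines describing a qb:Observation."""
    obs = f"<{EX}obs/{index}>"
    year, kpi, value = row["year"], row["kpi"], row["value"]
    return [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> <{EX}dataset/kpis> .",
        # The dimension and measure properties below are hypothetical.
        f'{obs} <{EX}dimension/year> "{year}"^^<{XSD}gYear> .',
        f"{obs} <{EX}dimension/kpi> <{EX}kpi/{kpi}> .",
        f'{obs} <{EX}measure/value> "{value}"^^<{XSD}integer> .',
    ]


lines = [
    line
    for index, row in enumerate(rows, start=1)
    for line in observation_triples(index, row)
]
print("\n".join(lines))
```

The resulting lines could then be loaded into a triple store and queried through its SPARQL endpoint; a full data cube would also need a dataset and structure definition, which this sketch leaves out.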

Research Libraries Powering Sustainable Knowledge in the Digital Age: LIBER Europe Strategy 2018-2022

“The 2018-2022 LIBER Strategy, which will steer LIBER’s development over the next five years, will support LIBER libraries in facing coming changes in the European working environment such as the various initiatives in advancing Open Science. It will also enable research in LIBER organisations to be world class. The leading role of LIBER brings added value to the implementation of the Strategy at a European level. …The term Open Science is not mentioned specifically in the Strategy. Instead, we emphasise innovative scholarly communication and digital skills and services, as well as research infrastructures to enable sustainable knowledge in the digital age…. Our Vision for the research landscape in 2022 is that the role of research libraries will lie in Powering Sustainable Knowledge in the Digital Age:

• Open Access is the predominant form of publishing;

• Research Data is Findable, Accessible, Interoperable and Reusable (FAIR);

• Digital Skills underpin a more open and transparent research life cycle;

• Research Infrastructure is participatory, tailored and scaled to the needs of the diverse disciplines;

• The cultural heritage of tomorrow is built on today’s digital information….

Open Access of Research Publications: this theme will encompass developing innovative services on top of the repository network, developments regarding Open Access business models for journals and the role of libraries therein, and the possibilities for libraries as Open Access publishers and innovative publishing…Semantic Interoperability; Open and Linked Data: research libraries are experts in metadata and ontologies and need to take a leadership role and engage with other stakeholders to ensure interoperability and accessibility of content….”

An artificial future | Research Information

“One of the most exciting data projects we [Elsevier] are working on at the moment is with a UK based charity, Findacure. We are helping the charity to find alternative treatment options for rare diseases such as Congenital Hyperinsulinism by offering our informatics expertise, and giving them access to published literature and curated data through our online tools, at no charge.

We are also supporting The Pistoia Alliance, a not-for-profit group that aims to lower barriers to collaboration within the pharmaceutical and life science industry. We have been working with its members to collaborate and develop approaches that can bring benefits to the entire industry. We recently donated our Unified Data Model to the Alliance; with the aim of publishing an open and freely available format for the storage and exchange of drug discovery data. I am still proud of the work I did with them back in 2009 on the SESL project (Semantic Enrichment of Scientific Literature), and my involvement continues as part of the special interest group in AI….”

WarSampo: Publishing and Using Linked Open Data about the Second World War

“The WarSampo system 1) initiates and fosters large scale Linked Open Data (LOD) publication of WW2 data from distributed, heterogeneous data silos and 2) demonstrates and suggests its use in applications and DH research. WarSampo is to our best knowledge the first large scale system for serving and publishing WW2 LOD on the Semantic Web for machine and human users. Its knowledge graph metadata contains over 9 million associations (triples) between data items including, e.g., a complete set of over 95,000 death records of Finnish WW2 soldiers, 160,000 authentic photos taken during the war, 32,000 historical places on historical maps, 23,000 war diaries of army units, and 3,400 memoir articles written by the veterans after the war. WarSampo data comes from several Finnish organizations and sources, such as National Archives, Defense Forces, Land Survey of Finland, Wikipedia/DBpedia, text books, and magazines.

WarSampo has two separate components: 1) WarSampo Data Service for machines and 2) WarSampo Semantic Portal with various applications for human users.”

Science Beam – using computer vision to extract PDF data | Labs | eLife

“There’s a vast trove of science out there locked inside the PDF format. From preprints to peer-reviewed literature and historical research, millions of scientific manuscripts today can only be found in a print-era format that is effectively inaccessible to the web of interconnected online services and APIs that are increasingly becoming the digital scaffold of today’s research infrastructure….Extracting key information from PDF files isn’t trivial. …It would therefore certainly be useful to be able to extract all key data from manuscript PDFs and store it in a more accessible, more reusable format such as XML (of the publishing industry standard JATS variety or otherwise). This would allow for the flexible conversion of the original manuscript into different forms, from mobile-friendly layouts to enhanced views like eLife’s side-by-side view (through eLife Lens). It will also make the research mineable and API-accessible to any number of tools, services and applications. From advanced search tools to the contextual presentation of semantic tags based on users’ interests, and from cross-domain mash-ups showing correlations between different papers to novel applications like ScienceFair, a move away from PDF and toward a more open and flexible format like XML would unlock a multitude of use cases for the discovery and reuse of existing research….We are embarking on a project to build on these existing open-source tools, and to improve the accuracy of the XML output. One aim of the project is to combine some of the existing tools in a modular PDF-to-XML conversion pipeline that achieves a better overall conversion result compared to using individual tools on their own. 
In addition, we are experimenting with a different approach to the problem: using computer vision to identify key components of the scientific manuscript in PDF format….To this end, we will be collaborating with other publishers to collate a broad corpus of valid PDF/XML pairs to help train and test our neural networks….”
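One very simple way to combine several extraction tools in a modular pipeline is sketched below. The tool outputs are invented, and the "first non-empty value wins" merge rule is an assumption chosen for brevity, not eLife's actual strategy. The merged fields are then written out as a small JATS-style front matter fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical outputs of two extraction tools for the same PDF; None marks
# a field that the tool failed to extract.
tool_outputs = [
    {"title": "An example manuscript", "abstract": None},
    {"title": "An example manuscript", "abstract": "A short example abstract."},
]


def merge_fields(outputs, fields=("title", "abstract")):
    """For each field, keep the first non-empty value across the tools."""
    return {
        field: next((out[field] for out in outputs if out.get(field)), None)
        for field in fields
    }


def to_jats_like(fields):
    """Build a small JATS-style front matter tree from the merged fields."""
    article = ET.Element("article")
    meta = ET.SubElement(ET.SubElement(article, "front"), "article-meta")
    if fields["title"]:
        title_group = ET.SubElement(meta, "title-group")
        ET.SubElement(title_group, "article-title").text = fields["title"]
    if fields["abstract"]:
        abstract = ET.SubElement(meta, "abstract")
        ET.SubElement(abstract, "p").text = fields["abstract"]
    return article


merged = merge_fields(tool_outputs)
xml_text = ET.tostring(to_jats_like(merged), encoding="unicode")
print(xml_text)
```

A more realistic pipeline would score each tool per field against a corpus of known PDF/XML pairs rather than trusting list order.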

Research Articles in Simplified HTML: a Web-first format for HTML-based scholarly articles

Abstract: Purpose: this paper introduces the Research Articles in Simplified HTML (or RASH), which is a Web-first format for writing HTML-based scholarly papers; it is accompanied by the RASH Framework, i.e. a set of tools for interacting with RASH-based articles. The paper also presents an evaluation that involved authors and reviewers of RASH articles submitted to the SAVE-SD 2015 and SAVE-SD 2016 workshops.

Design: RASH has been developed in order to: be easy to learn and use; share scholarly documents (and embedded semantic annotations) through the Web; support its adoption within the existing publishing workflow.

Findings: the evaluation study confirmed that RASH can already be adopted in workshops, conferences and journals and can be quickly learnt by researchers who are familiar with HTML.

Research limitations: the evaluation study also highlighted some issues in the adoption of RASH, and in general of HTML formats, especially by less technically savvy users. Moreover, additional tools are needed, e.g. for enabling additional conversion from/to existing formats such as OpenXML.

Practical implications: RASH (and its Framework) is another step towards enabling the definition of formal representations of the meaning of the content of an article, facilitating its automatic discovery, enabling its linking to semantically related articles, providing access to data within the article in actionable form, and allowing integration of data between papers.

Social implications: RASH addresses the intrinsic needs related to the various users of a scholarly article: researchers (focussing on its content), readers (experiencing new ways for browsing it), citizen scientists (reusing available data formally defined within it through semantic annotations), publishers (using the advantages of new technologies as envisioned by the Semantic Publishing movement).

Value: RASH focuses strictly on writing the content of the paper (i.e., organisation of text + semantic annotations) and leaves all the issues about its validation, visualisation, conversion, and semantic data extraction to the various tools developed within its Framework.