A protocol for adding knowledge to Wikidata, a case report | bioRxiv

Abstract: Pandemics, even more than other scientific questions, require swift integration of knowledge and identifiers. In a setting where there is a large number of loosely connected projects and initiatives, we need a common ground, also known as a “commons”. Wikidata, a public knowledge graph aligned with Wikipedia, is such a commons, but Wikidata may not always have the right schema for the urgent questions. In this paper, we address this problem by showing how a data schema required for the integration can be modelled with entity schemas represented by shape expressions. As a telling example, we describe the process of aligning resources on the genomics of the SARS-CoV-2 virus and related viruses, as well as how shape expressions can be defined for Wikidata to help others studying the SARS-CoV-2 pandemic. How this model can be used to make data from various resources interoperable is demonstrated by integrating data from NCBI Taxonomy, NCBI Genes, UniProt, and WikiPathways. Based on that model, a set of automated applications, or bots, was written for regular updates of these sources in Wikidata and added to a platform for automatically running these updates. Although this workflow was developed and applied in the context of the SARS-CoV-2 pandemic, it was also applied to other human coronaviruses (MERS, SARS, SARS-CoV-2, Human coronavirus NL63, Human coronavirus 229E, Human coronavirus HKU1, Human coronavirus OC43) to demonstrate its broader applicability.
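To make the idea of aligning external identifiers with Wikidata more concrete, here is a minimal sketch (not the authors' bots or their entity schemas) of how one might look up Wikidata items by NCBI Taxonomy ID. It assumes that Wikidata property P685 holds the NCBI Taxonomy ID and that the public query service lives at https://query.wikidata.org/sparql; the function names are hypothetical, and the code only builds the query and request URL without sending anything.

```python
from urllib.parse import urlencode

# Assumption: public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
# Assumption: P685 is the Wikidata property for the NCBI Taxonomy ID.
NCBI_TAXONOMY_ID_PROPERTY = "P685"


def build_taxon_query(ncbi_taxon_ids):
    """Build a SPARQL query for items carrying any of the given NCBI Taxonomy IDs."""
    ids = [str(tid) for tid in ncbi_taxon_ids]
    if not ids:
        raise ValueError("at least one NCBI Taxonomy ID is required")
    for tid in ids:
        # Taxonomy IDs are plain integers; rejecting anything else also
        # prevents quote characters from leaking into the query text.
        if not tid.isdigit():
            raise ValueError(f"not a numeric taxonomy ID: {tid!r}")
    values = " ".join(f'"{tid}"' for tid in ids)
    return (
        "SELECT ?item ?itemLabel ?taxid WHERE {\n"
        f"  VALUES ?taxid {{ {values} }}\n"
        f"  ?item wdt:{NCBI_TAXONOMY_ID_PROPERTY} ?taxid .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def build_request_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{params}"


if __name__ == "__main__":
    # 2697049 is the NCBI Taxonomy ID of SARS-CoV-2.
    query = build_taxon_query([2697049])
    print(query)
    print(build_request_url(query))
```

The resulting URL can be fetched with any HTTP client. Checking that the returned items conform to an entity schema would be a separate step, done with a shape expression validator.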



“In the face of a global health contingency, the vital role of Open Access is endorsed: to bring knowledge to all corners of the world, to allow science to be quickly and timely accessible so that its contribution is reflected in the improvement of the quality of human life, in saving lives and in the development of a better society for all. Open Access initiatives such as Redalyc have been working towards this goal for 18 years. Today, the AmeliCA/Redalyc alliance reaffirms its commitment to Open Access and continues to develop technology which it is now applied to the semantic dissemination of articles published on topics of interest in epidemiology, pandemics and related topics. This development enable to publish more than 6 thousand articles in Linked Open Data (LOD) format so that they can be processed and interconnected in the LOD knowledge cloud and allow users to browse content and access to full-texts in a thematic discovery service….”

“AmeliCA/Redalyc run an ontology-based algorithm, previously developed called OntoOAI (Becerril-García & Aguado-López, 2018), on their databases to extract epidemics-related content. The results include: an ontology representation of the knowledge published in 6,557 scientific articles including concepts and relations, as well as their attributes, a directed-graph thematic content browser to access to full-texts and a dataset available at SPARQL endpoint to query the results as part of Linked Open Data….”



The SPAR Ontologies

“Over the past eight years, we have been involved in the development of a set of complementary and orthogonal ontologies that can be used for the description of the main areas of the scholarly publishing domain, known as the SPAR (Semantic Publishing and Referencing) Ontologies. In this paper, we introduce this suite of ontologies, discuss the basic principles we have followed for their development, and describe their uptake and usage within the academic, institutional and publishing communities….”