Datasheets for Digital Cultural Heritage Datasets – Journal of Open Humanities Data

Abstract:  Sparked by issues of quality and lack of proper documentation for datasets, the machine learning community has begun developing standardised processes for establishing datasheets for machine learning datasets, with the intent of providing context and information on provenance, purposes, composition, the collection process, recommended uses, and societal biases reflected in training datasets. This approach fits well with practices and procedures established in GLAM institutions, such as establishing collections’ descriptions. However, digital cultural heritage datasets are marked by specific characteristics. They are often the product of multiple layers of selection; they may have been created for purposes other than establishing a statistical sample according to a specific research question; they change over time and are heterogeneous. Punctuated by a series of recommendations to create datasheets for digital cultural heritage, the paper addresses the scope and characteristics of digital cultural heritage datasets; possible metrics and measures; and lessons from concepts similar to datasheets and/or established workflows in the cultural heritage sector. This paper includes a proposal for a datasheet template that has been adapted for use in cultural heritage institutions, and which proposes to incorporate information on the motivation and selection criteria, digitisation pipeline, data provenance, the use of linked open data, and version information.
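The template fields the abstract lists can be pictured as a small machine-readable record. The sketch below is purely illustrative — the field names paraphrase the paper's proposal and are not an official schema:

```python
# Illustrative sketch of a machine-readable datasheet for a digitised
# collection. Field names paraphrase the paper's proposed template; the
# sample values are invented.
from dataclasses import dataclass, field, asdict

@dataclass
class HeritageDatasheet:
    """Core fields proposed for cultural-heritage datasheets."""
    motivation: str                 # why the dataset was created
    selection_criteria: str         # how items entered the collection
    digitisation_pipeline: str      # scanning/OCR/transcription steps
    provenance: str                 # custodial history of the originals
    linked_open_data: list = field(default_factory=list)  # external URIs
    version: str = "1.0.0"          # datasets change over time, so do sheets

sheet = HeritageDatasheet(
    motivation="Open a newspaper collection for text mining",
    selection_criteria="Titles digitised first on grounds of fragility",
    digitisation_pipeline="600 dpi scans -> OCR -> sampled manual correction",
    provenance="Legal-deposit copies held since 1890",
    linked_open_data=["http://example.org/authority/press-titles"],
)
print(asdict(sheet)["version"])  # 1.0.0
```

Recording the version explicitly reflects the paper's point that heritage datasets change over time and their documentation must change with them.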

Frontiers | Editorial: Linked Open Bibliographic Data for Real-time Research Assessment

“Despite the value of open bibliographic resources, they can involve inconsistencies that should be solved for better accuracy. As an example, OpenCitations mistakenly includes 1370 self-citations and 1498 symmetric citations as of April 30, 2022. They can also involve several biases that can provide a distorted mirror of the research efforts across the world (Martín-Martín, Thelwall, Orduna-Malea, & Delgado López-Cózar, 2021). That is why these databases need to be enhanced from the perspective of data modeling, data collection, and data reuse. This goes in line with the current perspective of the European Union on reforming research assessment (CoARA, 2022). In this topical collection, we are honored to feature novel research works in the context of allowing the automatic generation of real-time research assessment reports based on open bibliographic resources. We are happy to host research efforts emphasizing the importance of open research data as a basis for transparent and responsible research assessment, assessing the data quality of open resources to be used in real-time research evaluation, and providing implementations of how online databases can be combined to feed dashboards for real-time scholarly assessment….”
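The two error classes the editorial mentions are easy to detect mechanically. A minimal sketch on toy citation data (the DOIs are invented): a self-citation here is an entity recorded as citing itself, and a symmetric citation is a pair of works each recorded as citing the other.

```python
# Toy citation data: each tuple is (citing work, cited work).
citations = {
    ("doi:10.1/a", "doi:10.1/b"),
    ("doi:10.1/b", "doi:10.1/a"),   # symmetric with the pair above
    ("doi:10.1/c", "doi:10.1/c"),   # a work citing itself
    ("doi:10.1/a", "doi:10.1/c"),
}

# An entity citing itself.
self_citations = {(citing, cited) for citing, cited in citations if citing == cited}

# Both directions of a mutual pair end up in this set, so each symmetric
# pair is counted twice.
symmetric = {
    (citing, cited)
    for citing, cited in citations
    if citing != cited and (cited, citing) in citations
}
print(len(self_citations), len(symmetric) // 2)  # 1 1
```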

ARIADNE PLUS – Ariadne infrastructure

“The ARIADNEplus project is the extension of the previous ARIADNE Integrating Activity, which successfully integrated archaeological data infrastructures in Europe, indexing in its registry about 2,000,000 datasets (ARIADNE portal). ARIADNEplus will build on the ARIADNE results, extending and supporting the research community that the previous project created and further developing the relationships with key stakeholders such as the most important European archaeological associations, researchers, heritage professionals, national heritage agencies and so on. The new enlarged partnership of ARIADNEplus covers all of Europe. It now includes leaders in different archaeological domains like palaeoanthropology, bioarchaeology and environmental archaeology as well as other sectors of archaeological sciences, including all periods of human presence from the appearance of hominids to present times. Transnational Activities together with the planned training will further reinforce the presence of ARIADNEplus as a key actor.

The ARIADNEplus data infrastructure will be embedded in a cloud that will offer the availability of Virtual Research Environments where data-based archaeological research may be carried out. The project will furthermore develop a Linked Data approach to data discovery, making available to users innovative services, such as visualization, annotation, text mining and geo-temporal data management. Innovative pilots will be developed to test and demonstrate the innovation potential of the ARIADNEplus approach.

ARIADNEplus is funded by the European Commission under the H2020 Programme, contract no. H2020-INFRAIA-2018-1-823914….”

The State of Wikidata and Cultural Heritage: 10 Years In | Wiki Education

Wiki Education is hosting webinars all of October to celebrate Wikidata’s 10th birthday. Below is a summary of our first event. Watch Tuesday’s webinar in full on our YouTube. Sign up for our next three events here.

Never before has the world had a tool like Wikidata. The semantic database behind Wikipedia and virtual assistants like Siri and Alexa is only ten years old this month, and yet with almost 1 billion unique items, it’s the biggest open database ever. Wiki Education’s “Wikidata Will” Kent gathered key players in the Wikidataverse to reflect on the last ten years and set our sights on the next ten. Kelly Doyle, the Open Knowledge Coordinator for the Smithsonian Institution; Andrew Lih, Wikimedian at Large with Smithsonian Institution and Wikimedia strategist with the Metropolitan Museum of Art; and Lane Rasberry, Wikimedian in Residence at University of Virginia’s Data Science Institute discussed the “little database that could” (not so little anymore!).



Enriching Wikidata with Linked Open Data

Abstract:  Large public knowledge graphs, like Wikidata, contain billions of statements about tens of millions of entities, thus inspiring various use cases to exploit such knowledge graphs. However, practice shows that much of the relevant information that fits users’ needs is still missing in Wikidata, while current linked open data (LOD) tools are not suitable to enrich large graphs like Wikidata. In this paper, we investigate the potential of enriching Wikidata with structured data sources from the LOD cloud. We present a novel workflow that includes gap detection, source selection, schema alignment, and semantic validation. We evaluate our enrichment method with two complementary LOD sources: a noisy source with broad coverage, DBpedia, and a manually curated source with a narrow focus on the art domain, Getty. Our experiments show that our workflow can enrich Wikidata with millions of novel statements from external LOD sources with high quality. Property alignment and data quality are key challenges, whereas entity alignment and source selection are well-supported by existing Wikidata mechanisms. We make our code and data available to support future work.
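One step in such a workflow, semantic validation, can be sketched in a few lines. This is a simplified illustration — the constraint table, entity types, and identifiers below are invented, not the paper's actual method or real Wikidata data — of how candidate statements from an external LOD source might be checked against value-type constraints before being proposed:

```python
# Simplified semantic-validation step: reject candidate statements whose
# object value has the wrong type for the property (cf. Wikidata's
# value-type constraints). All identifiers below are illustrative.

# Expected class of the object value for each property.
CONSTRAINTS = {
    "P170": "human",        # creator -> should point to a human
    "P276": "location",     # location -> should point to a location
}

# Entity types, as an earlier alignment step might have resolved them.
ENTITY_TYPES = {
    "Q_rembrandt": "human",
    "Q_amsterdam": "location",
    "Q_painting": "artwork",
}

def validate(statements):
    """Split (subject, property, value) triples into accepted and
    rejected lists based on the value-type constraints."""
    accepted, rejected = [], []
    for subj, prop, value in statements:
        expected = CONSTRAINTS.get(prop)
        actual = ENTITY_TYPES.get(value)
        if expected is None or actual == expected:
            accepted.append((subj, prop, value))
        else:
            rejected.append((subj, prop, value))
    return accepted, rejected

candidates = [
    ("Q_nightwatch", "P170", "Q_rembrandt"),  # creator is a human: keep
    ("Q_nightwatch", "P276", "Q_painting"),   # location is an artwork: reject
]
accepted, rejected = validate(candidates)
print(len(accepted), len(rejected))  # 1 1
```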


Library as Laboratory: Analyzing Biodiversity Literature at Scale

“Imagine the great library of life, the library that Charles Darwin said was necessary for the “cultivation of natural science” (1847). And imagine that this library is not just hundreds of thousands of books printed from 1500 to the present, but also the data contained in those books that represents all that we know about life on our planet. That library is the Biodiversity Heritage Library (BHL) The Internet Archive has provided an invaluable platform for the BHL to liberate taxonomic names, species descriptions, habitat description and much more. Connecting and harnessing the disparate data from over five-centuries is now BHL’s grand challenge. The unstructured textual data generated at the point of digitization holds immense untapped potential. Tim Berners-Lee provided the world with a semantic roadmap to address this global deluge of dark data and Wikidata is now executing on his vision. As we speak, BHL’s data is undergoing rapid transformation from legacy formats into linked open data, fulfilling the promise to evaporate data silos and foster bioliteracy for all humankind….”

Moving Koha library catalogue into linked data using the LODRefine | Emerald Insight

Abstract:  Purpose

The purpose of this paper is to investigate linked data through the use of open-source technology. It demonstrates the transformation of library bibliographic data into linked data, which allows easy searching across numerous collections of information.


Design/methodology/approach

An open-source operating system, Ubuntu, hosting the LAMP (Linux, Apache, MySQL, PHP) stack is used, and open-source tools are employed throughout. LODRefine is used to convert all of Koha’s bibliographic data into linked data that is then made available on the Web. The framework has been conceptualized and formulated on linked data principles and the corresponding search algorithms.


Findings

Linked data services have been made publicly available to library users across a variety of forms of data. Information may be sought quickly and easily through an interface built on multiple search structures, which also meets the needs of users who rely on the linked data search mechanism to find information. Through modern scripts and algorithms, library users can now easily search the linked-data-enabled services.


Originality/value

This paper demonstrates how quickly and easily linked data may be developed and generated from bibliographic details held in a spreadsheet. The entire procedure can be completed by specialists within the library setting. A further advantage is that the SPARQL endpoint allows visitors to explore distinct concepts and aspects through independent URIs and URLs.
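The core transformation the paper describes — a spreadsheet-style bibliographic row becoming RDF — can be sketched as follows. This is an illustration of the idea, not LODRefine itself; the base URI and column names are hypothetical, and the output uses Dublin Core terms:

```python
# Map one Koha-style bibliographic record (a dict of column -> value)
# to a Turtle fragment with a dereferenceable subject URI.
# The base URI is a placeholder, not a real Koha installation.
BASE = "http://example.org/koha/record/"

def record_to_turtle(row):
    """Return a Turtle fragment describing one bibliographic record."""
    subject = f"<{BASE}{row['biblionumber']}>"
    lines = ["@prefix dc: <http://purl.org/dc/elements/1.1/> ."]
    # Spreadsheet column -> Dublin Core predicate.
    mapping = {"title": "dc:title", "author": "dc:creator", "year": "dc:date"}
    for column, predicate in mapping.items():
        if row.get(column):
            lines.append(f'{subject} {predicate} "{row[column]}" .')
    return "\n".join(lines)

row = {"biblionumber": "42", "title": "Linked Data", "author": "Heath, T.", "year": "2011"}
turtle = record_to_turtle(row)
print(turtle)
```

Each record gets its own URI, which is what lets a SPARQL endpoint serve individual concepts independently, as the abstract notes.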

Representing COVID-19 information in collaborative knowledge graphs: a study of Wikidata | Zenodo

Abstract:  Information related to the COVID-19 pandemic ranges from biological to bibliographic and from geographical to genetic. Wikidata is a vast interdisciplinary, multilingual, open collaborative knowledge base of more than 88 million entities connected by well over a billion relationships and is consequently a web-scale platform for broader computer-supported cooperative work and linked open data. Here, we introduce four aspects of Wikidata that make it an ideal knowledge base for information on the COVID-19 pandemic: its flexible data model, its multilingual features, its alignment to multiple external databases, and its multidisciplinary organization. The structure of the raw data is highly complex, so converting it to meaningful insight requires extraction and visualization, the global crowdsourcing of which adds both additional challenges and opportunities. The created knowledge graph for COVID-19 in Wikidata can be visualized, explored and analyzed in near real time by specialists, automated tools and the public, for decision support as well as educational and scholarly research purposes via SPARQL, a semantic query language used to retrieve and process information from databases saved in Resource Description Framework (RDF) format.
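A minimal sketch of the SPARQL access the abstract describes. The query asks for items that are instances of COVID-19 (Q84263196 is, at the time of writing, the Wikidata item for COVID-19); to keep the example offline, the request URL is only constructed, not sent:

```python
# Build (but do not send) a GET request against the Wikidata Query Service.
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q84263196 .   # instance of (P31) COVID-19
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

# A client such as urllib.request.urlopen() would fetch this URL and
# receive JSON-formatted bindings.
url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
print(url[:40])
```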