“The LOC-DB project will develop ready-to-use tools and processes based on linked-data technology that make it possible for a single library to meaningfully contribute to an open, distributed infrastructure for the cataloguing of citations. The project aims to prove that, by widely automating cataloguing processes, it is possible to add substantial benefit to academic search tools by regularly capturing citation relations. These data will be made available in the semantic web to enable future reuse. Moreover, we will document the effort involved, as well as the quantity and quality of the data, in a well-founded cost-benefit analysis.
The project will use well-known methods of information extraction and adapt them to work with arbitrary layouts of reference lists in electronic and print media. The obtained raw data will be aligned and linked with existing metadata sources. Moreover, it will be shown how these data can be integrated into library catalogues. The system will be deployable for productive use by a single library, but in principle it will also scale to use in a network….”
Identified elements of knowledge shared across research disciplines and mapped the elements critical to the successful transfer of knowledge from document to user.
Built a technology-facilitated process whereby complex analyses can be distilled for ease of discovery and use. Findings from published research papers in only the top academic journals are added to the platform daily.
Implemented a search results display feature that (1) uses consistent, logical expressions about research findings rather than happenstance excerpts of text, and (2) prioritizes results not by popularity, but according to validity and usefulness (e.g., research design). …”
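The ranking idea in point (2) above — ordering results by validity of research design rather than popularity — can be sketched as follows. The design tiers and their weights are invented for illustration; the platform's actual scoring scheme is not described in the excerpt:

```python
# Hypothetical tiers: stronger research designs rank higher.
DESIGN_VALIDITY = {
    "meta-analysis": 4,
    "randomized-controlled-trial": 3,
    "cohort-study": 2,
    "case-report": 1,
}

def rank_results(results: list[dict]) -> list[dict]:
    """Order results by design validity (descending), ignoring popularity."""
    return sorted(
        results,
        key=lambda r: DESIGN_VALIDITY.get(r["design"], 0),
        reverse=True,
    )

# A heavily viewed case report still ranks below a little-read meta-analysis.
ranked = rank_results([
    {"title": "B", "design": "case-report", "views": 90_000},
    {"title": "A", "design": "meta-analysis", "views": 120},
])
```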
“As a political scientist who regularly encounters so-called “open data” in PDFs, this problem is particularly irritating. PDFs may have “portable” in their name, making them display consistently on various platforms, but that portability means any information contained in a PDF is irritatingly difficult to extract computationally.”
Abstract: “Aside from improving the visibility and accessibility of scientific publications, many scientific Web repositories also assess researchers’ quantitative and qualitative publication performance, e.g., by displaying metrics such as the h-index. These metrics have become important for research institutions and other stakeholders to support impactful decision-making processes such as hiring or funding decisions. However, scientific Web repositories typically offer only simple performance metrics and limited analysis options. Moreover, the data and algorithms to compute performance metrics are usually not published. Hence, it is not transparent or verifiable which publications the systems include in the computation and how the systems rank the results. Many researchers are interested in accessing the underlying scientometric raw data to increase the transparency of these systems. In this paper, we discuss the challenges and present strategies to programmatically access such data in scientific Web repositories. We demonstrate the strategies as part of an open source tool (MIT license) that allows research performance comparisons based on Google Scholar data. We would like to emphasize that the scraper included in the tool should only be used if consent was given by the operator of a repository. In our experience, consent is often given if the research goals are clearly explained and the project is of a non-commercial nature.”
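The h-index mentioned in this abstract has a simple standard definition: the largest h such that at least h of a researcher's publications have h or more citations each. A self-contained sketch, with made-up citation counts (this is the general metric, not the tool's code):

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that at least h publications have >= h citations."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with >= 4 citations each
```

The abstract's transparency concern is precisely that repositories compute such metrics without disclosing which publications enter the `citations` list.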
Abstract: Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
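Knowledge Vault fuses noisy extractions with prior knowledge using supervised machine learning. As a much simpler illustration of combining independent extraction signals into one belief — not the paper's actual method — here is a noisy-OR combination, which treats each extractor as an independent detector:

```python
def noisy_or(confidences: list[float]) -> float:
    """Probability a fact is true given independent extractor
    confidences, via noisy-OR: 1 minus the chance all are wrong."""
    p_all_wrong = 1.0
    for p in confidences:
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong

# Three weak, independent signals combine into a stronger belief.
combined = noisy_or([0.6, 0.5, 0.3])  # 1 - 0.4*0.5*0.7 = 0.86
```

Real fusion systems go further, learning per-extractor reliability and calibrating the resulting probabilities against held-out labeled facts.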
“StrepHit (pronounced “strep hit”, means “Statement? reference it!”) is a Natural Language Processing pipeline that harvests structured data from raw text and produces Wikidata statements with reference URLs. Its datasets will feed the primary sources tool.
In this way, we believe StrepHit will dramatically improve the data quality of Wikidata through a reference suggestion mechanism for statement validation, and will help Wikidata to become the gold-standard hub of the Open Data landscape….”
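To show the general shape of what a pipeline like StrepHit emits, here is a minimal sketch of a Wikidata-style statement carrying a reference URL. The helper function, dict layout, and example values are illustrative assumptions, not StrepHit's actual output format; P854 is Wikidata's real “reference URL” property:

```python
def make_statement(subject: str, prop: str, value: str, ref_url: str) -> dict:
    """Build a minimal Wikidata-style statement with one reference URL."""
    return {
        "subject": subject,      # Wikidata item ID, e.g. Q42 (Douglas Adams)
        "property": prop,        # Wikidata property ID, e.g. P106 (occupation)
        "value": value,          # e.g. Q36180 (writer)
        "references": [
            {"P854": ref_url}    # P854 = Wikidata's "reference URL" property
        ],
    }

stmt = make_statement("Q42", "P106", "Q36180", "https://example.org/source")
```

Attaching the source URL to every harvested statement is what enables the reference-suggestion mechanism for statement validation described above.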