The Semantic Scholar Open Data Platform

The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date, with 200M+ papers, 80M+ authors, 550M+ paper-authorship edges, and 2.4B+ citation edges. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings. In this paper, we describe the components of the S2 data processing pipeline and the associated APIs offered by the platform. We will update this living document to reflect changes as we add new data offerings and improve existing services.
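
As a rough illustration of what querying one of the platform's APIs might look like, here is a minimal sketch that only builds a paper-lookup URL and does not send any request. The base URL, the `/paper/{id}` path, and the `fields` parameter reflect the public Graph API as commonly documented, but treat them as assumptions and check the current API documentation before relying on them. The helper name `paper_lookup_url` and the placeholder paper ID are hypothetical and do not come from the excerpt above.

```python
from urllib.parse import quote, urlencode

# Assumed base URL of the public Graph API; verify against current docs.
GRAPH_API_BASE = "https://api.semanticscholar.org/graph/v1"


def paper_lookup_url(paper_id: str, fields: list[str]) -> str:
    """Build (but do not send) a lookup URL for one paper.

    paper_id: an identifier such as a prefixed arXiv ID (assumed format).
    fields:   names of the attributes to request, e.g. "title".
    """
    # Keep commas readable in the comma-separated field list.
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    # Keep the ":" of prefixed identifiers intact in the path segment.
    return f"{GRAPH_API_BASE}/paper/{quote(paper_id, safe=':')}?{query}"


# Hypothetical placeholder ID, used only to show the URL shape.
print(paper_lookup_url("arXiv:0000.00000", ["title", "authors"]))
```

The resulting URL could then be fetched with any HTTP client; rate limits and API keys are outside the scope of this sketch.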

A Milestone for Semantic Scholar

“Even the most diligent scientists need a quick primer on the latest research. Which is why Semantic Scholar, the AI-powered platform for academic papers, can come in handy when you want to know the latest studies on, say, Covid-19 or Russian troll accounts. And this month, the rapidly-evolving search engine turns six, while also hitting another milestone: uploading 200 million papers to its archives. “Semantic Scholar is a poster child for AI2’s mission: AI for the Common Good,” says Oren Etzioni, CEO of the Allen Institute for AI, which created the project. “When we launched it, we had no idea that it would serve upwards of 8 million users per month just a few years later.”

What began in 2015 as a database for some 3 million computer science papers has recently grown into much more. Along with adding neuroscience papers, then biomedicine, then all fields of science in 2019, the platform last year launched the CORD-19 dataset and paper, a comprehensive dataset of more than 300,000 full-text Covid-19-related papers, and more than 840,000 metadata entries in total, that’s available to anyone, thus facilitating further research on the pandemic. To date, this largest single collection — with some articles on coronaviruses that would otherwise languish behind paywalls and others that date back to the 1950s — has been downloaded more than 200,000 times, and has become the basis of the most popular Kaggle competition ever….”

Lost Green OA articles from Semantic Scholar

“Over the last two weeks, one of the largest repositories we index, Semantic Scholar, removed most of the articles it had been hosting. The end result for Unpaywall is that about 1 million formerly Green OA articles are now Closed. This is about 12% of all Green OA. We’re working on finding new locations for as many articles as we can.

The total number of articles removed from Semantic Scholar was about 8 million, but most of them are still OA because we had other locations….”

Introducing TLDRs on Semantic Scholar | by Semantic Scholar | AI2 Blog | Nov, 2020 | Medium

“TLDRs (Too Long; Didn’t Read) are super-short summaries of the main objective and results of a scientific paper, generated using expert background knowledge and the latest GPT-3 style NLP techniques. This new feature is now available in beta for nearly 10 million computer science papers and counting in Semantic Scholar.

Staying up to date with scientific literature is a key part of any researchers’ workflow, and parsing a long list of papers from various sources by reading paper abstracts is time-consuming. The new TLDR feature in Semantic Scholar puts single-sentence, automatically-generated paper summaries right on the search results and author pages, allowing you to quickly locate the right papers and spend your time reading what matters to you….”

S2ORC: The Semantic Scholar Open Research Corpus

“S2ORC is a general-purpose corpus for NLP and text mining research over scientific papers.

We’ve curated a unified resource that combines aspects of citation graphs (i.e. rich paper metadata, abstracts, citation edges) with a full text corpus that preserves important scientific paper structure (i.e. sections, inline citation mentions, references to tables and figures).
Our corpus covers 136M+ paper nodes with 12.7M+ full text papers and connected by 467M+ citation edges by unifying data from many different sources covering many different academic disciplines and identifying open-access papers using services like Unpaywall. …”

Exploring the COVID-19 network of scientific research with SciSight | AI2 Blog

“Three months into the coronavirus pandemic, the world’s scientific knowledge of the SARS-CoV-2 virus is rapidly expanding. Reports of potential vaccines and treatments sprout up almost daily. Thousands of papers have been pouring into Semantic Scholar’s COVID-19 Open Research Dataset (CORD-19), a collection of nearly 60,000 scientific publications of potential relevance to the topic, both historical and cutting-edge….

To help accelerate scientific discovery with visualization, last month we launched SciSight, a framework of exploratory search and visualization tools for the COVID-19 literature. The first version of SciSight supported exploring associations between biomedical concepts appearing in the literature. In preliminary user interviews, the tool was found helpful in discovery-oriented search. We now release two important updates of SciSight….”

The rise of the “open” discovery indexes? Lens.org, Semantic Scholar and Scinapse | Musings about librarianship

“In this blog post, I will talk specifically on a very important source of data used by Academic Search engines – Microsoft Academic Graph (MAG) and do a brief review of four academic search engines – Microsoft Academic, Lens.org, Semantic Scholar and Scinapse, which uses MAG among other sources….

We live in a time, where large (>50 million) Scholarly discovery indexes are no longer as hard to create as in the past, thanks to the availability of freely available Scholarly article index data like Crossref and MAG.”

SUPP.AI by AI2

“Dietary and herbal supplements are popular but unregulated. Supplements can interact or interfere with the action of prescription or over-the-counter medications. Currently, it is difficult to find accurate and timely scientific evidence for these interactions.

To solve this problem, Supp.AI automatically extracts evidence of supplement and drug interactions from the scientific literature and presents them here….

To find out more about this work, please read our publication….

Supp.AI is a free service of the non-profit Allen Institute for AI….”

Could This Search Engine Save Your Life? – The Chronicle of Higher Education

“One of the Allen Institute’s priorities is an academically oriented search engine, established in 2015, called Semantic Scholar (slogan: “Cut through the clutter”). The need is great, with more than 34,000 peer-reviewed journals publishing 2.5 million articles a year. “What if a cure for an intractable cancer is hidden within the tedious reports on thousands of clinical studies?,” Etzioni once said.

Although Semantic Scholar has focused so far on computer and biomedical sciences, Etzioni says that the engine will soon push into the social sciences and the humanities as well. The Chronicle spoke with him about information overload, impact factors’ imperfect inevitability, and the promise and perils of AI….”