“Millions of research papers get published every year, but the majority lie behind paywalls. A new online catalogue called the General Index aims to make it easier to access and search through the world’s research papers. Unlike other databases, which include the full text of research papers, the General Index only allows users to access snippets of content….”
There’s a vast amount of research out there, and the volume grows with each passing day. But there’s a problem. Not only is a lot of the existing literature hidden behind a paywall, but it can also be difficult to parse and make sense of in a comprehensive, logical way. What’s really needed is a super-smart version of Google just for academic papers. Enter the General Index, a new database covering some 107.2 million journal articles and totaling 38 terabytes of data in its uncompressed form. It spans more than 355 billion rows of text, each featuring a keyword or phrase plucked from a published paper.
“In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers.
The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California, that he founded.
Malamud says that because his index doesn’t contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers’ copyright restrictions on the re-use of paywalled articles. However, one legal expert says that publishers might question the legality of how Malamud created the index in the first place….”
“The General Index is here to serve as your map to human knowledge. Pulled from 107,233,728 journal articles, The General Index is a searchable collection of keywords and short sentences from published papers that can serve as a map to the paywalled domains of scientific knowledge.
In full, The General Index is a massive 38 terabyte archive of searchable terms. Compressed, it comes to 8.5 terabytes. It can be pulled directly from archive.org, which can be a difficult and lengthy process. People on the /r/DataHoarder subreddit have uploaded the data to a remote server and are spreading it across BitTorrent. You can help by grabbing a seed here.
The General Index does not contain the entirety of the journal articles it references, simply the keywords and n-grams (short sequences of consecutive words containing a keyword) that make tracking down a specific article easier. “This is an early release of the general index, a work in progress,” Carl Malamud, the founder of Public.Resource.org and co-creator of the General Index, said in a video about the archive. “In some cases text extraction failed, sometimes metadata is not available or is perhaps incorrect. While the underlying corpus is large, it is not complete and it is not up to date.”…”
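The five-word limit Malamud mentions means the index stores n-grams of length one through five for each article. As a rough illustration of what that extraction looks like (a minimal sketch, not the project’s actual pipeline, with hypothetical function and variable names):

```python
def ngrams(text, max_n=5):
    """Yield every contiguous word sequence of length 1..max_n from text."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

# A six-word sentence yields 6 + 5 + 4 + 3 + 2 = 20 snippets.
snippets = list(ngrams("The General Index maps human knowledge"))
```

In the real index, each such snippet would be stored alongside identifiers for the articles it appears in, so a researcher can locate papers by phrase without ever holding the full text.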