Jailbreaking the PDF, a collaborative #scholrev project, WE not I

I am really excited about the #scholrev hackathon program put together as “Jailbreaking the PDF”: some additional information at http://duraspace.org/jailbreaking-pdf-hackathon

From Alexander Garcia and Alex Garcia-Castro

Montpellier, France. The upcoming “Jailbreaking the PDF” hackathon (http://scholrev.org/hackathon) will be held Monday, May 27 in Montpellier, France at the Agence Bibliographique de l’enseignement Superieur (ABES).


Currently, the bulk of peer-reviewed scientific knowledge is locked up in PDF documents, from which it is difficult to extract information.

We want to change that.

If you’re interested in hacking on PDFs and exploring ways to access scholarly data in modern ways, this hackathon is for you. There is no registration fee: the event is free. Bring yourself and your favorite laptop, and we’ll supply the food, drinks, wifi, repository, and everything else necessary to hack away.

Future announcements will be posted at http://scholrev.org/hackathon.

As with all hackathons we’ll work it out on the day (and possibly some on the night before). There are some suggested projects at http://wehack.it/hackathons/47-jailbreaking-the-pdf (I have put #ami2 in), but the important thing is to come up with things we can do on the day that will make a real impact. It’s a great chance to show that there is a critical mass of people in #scholrev and that we can achieve things.

The key thing is that we all want to change the world – in this case by repurposing PDFs to liberate information and, in doing so, working out how to change the way we communicate (“publish”) to humans and machines. What makes Jailbreak different is the Open approach – our tools are Open, our data and results are Open.

And it is more important that WE succeed rather than I succeed.

There are several reasons for developing technology and they include:

  • Creating a business and a market
  • Being the first to create something and gain (academic) recognition
  • Changing the world

So I and colleagues have been developing #AMI2, a toolset for turning PDF content into semantic form. I’m not interested in creating a business (at present) and I have the luxury of not needing academic glory. I shall be happy to submit a paper in due course as there are novel aspects but citations aren’t the primary driver.

No, this is my contribution to a toolkit to change the world. Because scholarly publishing critically needs a revolution and it’s not coming from conventional sources. Hacking PDFs can be a major part of the game-changer. And if the software is Open, then we can grow it.

I’m delighted to see that there are other people hacking PDFs and I shall meet some at this workshop. What will I feel if someone else has developed a tool that does things that AMI2 can’t do or does them better? It may surprise you, but I shall feel pleased. And I hope others would feel the same way.

Because it advances us all and makes the overall task easier and quicker. We’ve found this in chemistry software with the Blue Obelisk (http://en.wikipedia.org/wiki/Blue_Obelisk) where over 20 groups write F/OSS software. Each is independent – we don’t try to aggregate this into one monster toolkit (it wouldn’t work). But each looks to see what the others are doing, keeps in gentle touch and avoids needless duplication. I expect this spirit to develop in Jailbreak.

In any case hacking PDFs requires a large amount of heuristics. Examples are:

  • Translating undocumented fonts to Unicode
  • Dealing with graphics
  • Interpreting figures
  • Publisher- and journal-specific annotations
  • Recognising and processing tables
  • Hacking references and metadata

Many of these are never-ending jobs. Many are also boring. But many are ideal for a shared approach. I am very interested to see what the CERMINE: Content ExtRactor and MINEr does:

CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts:

  • document’s metadata, including title, authors, affiliations, abstract, keywords, journal name, volume and issue
  • parsed bibliographic references
  • the structure of document’s sections, section titles and paragraphs

CERMINE is based on a modular workflow, whose architecture ensures that individual workflow steps can be maintained separately. As a result it is easy to perform evaluation, training, improve or replace one step implementation without changing other parts of the workflow. Most step implementations utilize supervised and unsupervised machine-learning techniques, which increases the maintainability of the system, as well as its ability to adapt to new document layouts. CERMINE is a Java library and a web service for extracting metadata and content from scientific articles in born-digital form.

Limitations: This is an experimental service, and results may not be accurate. Uploaded files will be used only for metadata extraction; we do not store uploaded files. Accepted file format is *.pdf; maximum file size is 5 MB.

License: CERMINE is licensed under GNU Affero General Public License version 3.

I’ve run CERMINE on a few files and it looks very useful. It certainly does things AMI2 doesn’t and vice versa. CERMINE is machine-learning based whereas AMI2 is heuristic. Both have to be adjusted when they get a new document type. AMI2 doesn’t yet handle metadata well (still working out some general heuristics) but it addresses italics, bold, strange characters, sub/superscripts, compound document objects (e.g. captioned figures), tables, document sections, etc. There’s undoubtedly a role for both.
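For readers wondering what “heuristic” means in practice, here is a minimal sketch of one possible rule for spotting sub/superscripts from font size and baseline position. It is not AMI2’s actual code: the `Span` type, the `classify_script` function and the threshold values are all assumptions made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text as a PDF extractor might report it (hypothetical type)."""
    text: str
    font_size: float
    baseline: float  # y of the baseline; larger means higher on the page


def classify_script(span, body_size, body_baseline,
                    size_ratio=0.8, min_shift=0.15):
    """Guess whether a span is normal text, a superscript or a subscript.

    Rule of thumb (thresholds are illustrative guesses, not tuned values):
    the span must be noticeably smaller than the body text, and its
    baseline must be shifted by at least `min_shift` of the body size.
    """
    if span.font_size > size_ratio * body_size:
        return "normal"
    shift = (span.baseline - body_baseline) / body_size
    if shift >= min_shift:
        return "superscript"
    if shift <= -min_shift:
        return "subscript"
    return "normal"


if __name__ == "__main__":
    # Body text at 10pt with its baseline at y=100.
    print(classify_script(Span("2", 6.0, 103.0), 10.0, 100.0))
    print(classify_script(Span("2", 6.0, 98.0), 10.0, 100.0))
    print(classify_script(Span("H", 10.0, 100.0), 10.0, 100.0))
```

Thresholds like these are exactly the kind of thing that has to be adjusted for each new document type, which is why sharing them is worthwhile.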

And the opportunity to create shared resources (e.g. fonts, journal templates, common terminology and nomenclature, etc.)

Content-mining and re-use need a community focus and this workshop looks like exactly that.