Fabulous TabulaPDF liberates tables in PDFs; we are collaborators


If you found someone writing software which did some of the same sort of things that your software did, would you be pleased or upset? If you’re an academic in the modern world you might well be upset. “Bugger, we’ve been scooped!”, “We’ll have to replicate their functionality”, etc. Because academic success depends on being the first and beating down the competition. I have seen so many cases where programs have been rewritten solely for career advancement.

So you might think that when I heard about Tabula 6 months ago I would have been upset. Tabula uses PDFBox (as does our AMI) to turn PDFs into something useful for machines. But I was delighted. Because turning PDFs into semantic form is one of the most soul-destroying activities on earth. There’s no map, a new form of PDF can knock you back, and people don’t understand why we spend this time hacking. It’s lonely.

So first and foremost we welcome Tabula as friends. So let’s see them:

It was then wonderful to meet Mike and Manuel at #MozFest. Mike and Manuel are not concerned about their journal impact factor. They want to make the world a better place.

By hacking PDFs?

Yes! Today’s journalist – e.g. @ProPublica – uses data to find and justify stories. Stories are contained in expense slips, company reports, government spends, etc. The UK MP expenses scandal used crowd-sourcing to analyse zillions of MPs’ expense receipts. NHS waste in prescribing non-generic drugs has been highlighted by hacking data. My “crusade” – to liberate factual science from journals – is similar. Most science is destroyed into PDF.

But it can be liberated. So next Wednesday in Oxford (sold out!) I’ll be demonstrating Tabula to get data out of scientific PDFs. It’s immediately understandable and easy to use.

And we face the challenges together. The really horrible aspects of:

  • Optical character recognition (words and numbers in diagrams are often bitmaps)
  • Recognising table formats – many tables are simply layout for humans
  • Restructuring lists
  • Analysing graphs


So Mike and Manuel and I spent an hour swapping our experiences, making friends. We’re looking at each other’s code. We’ll try to avoid duplication.

And most importantly, our community has now grown. Growing from 1 to 2 increases the impact by a factor of 4. It makes our individual efforts more believable. It helps new people join.

I’m going to have to get a better tweetpic…