Content Mining: Recent progress

A lot has happened in the last month and it’s kept me so busy that I haven’t blogged as much as I would have liked.

The simple message is that we are starting to mine the scholarly literature on a global scale, starting with the easiest (sociopolitical) areas. We’ve started to build the community, build the tools and deploy the results. I am not frightened by scale, as there are solutions already in place.

The most important thing is community. If there’s a perceived need then content mining (CM) will happen, and fast. And on Tuesday we made massive community progress.

We met PLoS in the Haymakers pub (Cambridge) and talked about how they could help us crawl PLoS daily. Then PLoS held an OpenDrinks in King’s Cross, London. Everyone was excited about the way scholarly communication could – and in some cases will – open up. BioMedCentral (AmyeK), CrossRef (GeoffB) and lots of OKFers were there. I came away with the strong feeling that we agree on the Why and Whether of CM and have now moved on to the How.

We’re doing a “soft launch” of the Content Mine. Something new every day. Advocacy and news from all sectors of the community. Debugging as we go. So we’ve started, not with a bang, but a snowflake. The avalanche will come.

One of the most important things is that we have set up an OKFN mailing list for CM. Mailing lists are one of the best ways of collecting ideas, resources and community. If you have questions, offers of help or insights, please post. It’s a friendly community.

Some community milestones


2013-11-14 I was invited to present at UK Serials Group (core of librarians, publishers, university admin) and that gave me the chance to put slides together. It was well received and the delegates were interested in CM. UKSG recorded a video, “Scientific Data costs billions but most is thrown away – what should be done?” (ca 25 mins). Many thanks.

2013-11-27 Open-science Oxford. Wonderful event run by Jenny Molloy. It meant I had to get a portable demo ready (more below).



2013-11-25 CKAN/the Datahub. I decided we should use CKAN (OKFN) for the extracted content. CKAN is an open system for managing metadata and URL-based data storage. I think it will do very well for us. It’s got a vibrant developer and user community. Mark Wainwright gave us an excellent intro. We’ve learned how to use it – having people online really helps – and we are doing our bit by contributing back a revised CKANClient-J. You are welcome to browse, but please realise that this is Open Science – we are building it as we go. For example, “In vivo” isn’t a species – it’s a false positive, and we are refining the filters daily.
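To make the “In vivo” problem concrete, here is a minimal sketch of how such a false positive can arise and be filtered. It is not the project’s actual filter: the pattern, the stoplist and the function name `find_species` are all assumptions made for illustration, and the sketch is in Python purely for brevity.

```python
import re

# Naive candidate pattern (an assumption, not the project's rule):
# a capitalised word followed by a lowercase word, as in "Homo sapiens".
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")

# Hypothetical stoplist of Latin phrases that look like binomial names.
STOPLIST = {"in vivo", "in vitro", "in situ"}


def find_species(text):
    """Return binomial-looking phrases, minus known false positives."""
    candidates = BINOMIAL.findall(text)
    return [c for c in candidates if c.lower() not in STOPLIST]
```

A pattern this loose will also catch ordinary sentence openings such as “The mouse”, which is why filters of this kind need continual refinement.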

We’ve been joined by Mark Williamson and Andy Howlett in the Unilever Centre. They are doing a great job in helping refactor the existing code and framework. Andy’s working on a plugin mechanism so that if YOU want to – say – search for galaxies we can make it easy to insert an Astro plugin. Mark has done a huge job on making the system robust and distributable – the command-line interface and deployment we used in Oxford. We are aiming at a system which is very easy to deploy so that when we run workshops it will be easy for all participants.
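As a rough illustration of what a plugin mechanism of this kind might look like, here is a small sketch. It is not the project’s design: the names `Visitor`, `GalaxyVisitor`, `register` and `run_all` are hypothetical, the “Astro plugin” is reduced to a regular expression for Messier and NGC designations, and Python is used only to keep the example short.

```python
import re
from abc import ABC, abstractmethod


class Visitor(ABC):
    """A discipline-specific analysis applied to one document's text."""

    name = "visitor"

    @abstractmethod
    def visit(self, text):
        """Return a list of items found in the text."""


class GalaxyVisitor(Visitor):
    """Hypothetical Astro plugin: finds catalogue designations."""

    name = "astro"
    # Messier and NGC designations, e.g. "M31" or "NGC 224".
    PATTERN = re.compile(r"\b(?:M ?\d{1,3}|NGC ?\d{1,4})\b")

    def visit(self, text):
        return self.PATTERN.findall(text)


# Registry of installed plugins, keyed by plugin name.
REGISTRY = {}


def register(visitor):
    """Make a plugin available to run_all()."""
    REGISTRY[visitor.name] = visitor


def run_all(text):
    """Apply every registered plugin to the text."""
    return {name: v.visit(text) for name, v in REGISTRY.items()}


register(GalaxyVisitor())
```

The point of the design is that the core only knows about `visit`; adding a new discipline means writing one class and registering it.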

The latest tools are all on Bitbucket:

  • Visitor (plugin) architecture for adding discipline-specific analyses.
  • A crawler architecture (currently PLoSOne and soon BMC). I hope to make this very general so it’s easier to create crawlers. Crawlers are never fun. They reflect the horrible effects of creating information for sighted humans only. But they are an excellent place for crowd contributions – one crawler per publisher/journal. There is also a CKAN Datahub repo client.
  • Computer vision for scientific diagrams. All the basic technology exists, with Java solutions for almost all. I’ve been pleasantly surprised how well it performs. It’s experimental. I did a lot on Hough, but it looks like Thinning and Segmentation are actually better. OCR is a problem: Tesseract is C++ and very messy; JavaOCR ought to be the answer but is impenetrable; Lookup (Cross-correlation) is not quite what I want. I think I’m going to have to write a scientific OCR. Anyone interested in joining is very welcome. It won’t be as hairy as it sounds – I have some ideas.
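For readers unfamiliar with the “Lookup (Cross-correlation)” approach to OCR mentioned in the last item, here is a toy sketch of the general idea: compare an unknown glyph against stored templates and pick the best-scoring one. The 3×5 templates and the function names are invented for illustration, and nothing here reflects the project’s code.

```python
from math import sqrt

# Hypothetical 3x5 glyph templates; '#' is ink, '.' is background.
TEMPLATES = {
    "I": [".#.", ".#.", ".#.", ".#.", ".#."],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def to_pixels(rows):
    """Flatten rows of '#'/'.' text into a list of 1.0/0.0 values."""
    return [1.0 if ch == "#" else 0.0 for row in rows for ch in row]


def correlation(a, b):
    """Zero-mean normalised cross-correlation of two pixel lists."""
    if len(a) != len(b):
        raise ValueError("glyph and template must be the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a blank or fully inked image carries no shape
    return sum(x * y for x, y in zip(da, db)) / denom


def lookup(glyph_rows):
    """Return (best matching character, its correlation score)."""
    glyph = to_pixels(glyph_rows)
    scores = {ch: correlation(glyph, to_pixels(t))
              for ch, t in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

This only works when the glyph is already segmented and scaled to the template size, which hints at why plain lookup may fall short for the varied fonts and symbols in scientific diagrams.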



I am a strong believer in convention: create a standard, simple way of doing things so that the number of options you HAVE to specify is small. So, for example, if no --output is given the system puts the results in a known place. That also helps community – you can ask over Skype: “Can you find plos/2013-12-13/daily.log in your /extracted folder?”
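A minimal sketch of this “known place” convention follows. The layout (root folder, then source, then date) is inferred from the Skype example above; the function names and the `extracted` default are assumptions, not the tool’s real behaviour.

```python
from datetime import date
from pathlib import Path

# Assumed conventional root folder for extracted content.
DEFAULT_ROOT = Path("extracted")


def output_dir(source, day, output=None, root=DEFAULT_ROOT):
    """Use the explicit output folder if given, else root/source/date."""
    if output is not None:
        return Path(output)
    return Path(root) / source / day.isoformat()


def log_path(source, day, output=None, root=DEFAULT_ROOT):
    """Location of the daily log under the chosen output folder."""
    return output_dir(source, day, output, root) / "daily.log"
```

With no output option, `log_path("plos", date(2013, 12, 13))` resolves to `extracted/plos/2013-12-13/daily.log`, so everyone running the tool with defaults finds the log in the same relative place.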

And documentation

We like the Bitbucket Wiki and its markdown. It’s relatively easy to record what we have done and what we want people to do.

So – watch out for content appearing flake by flake.


And join in…