The Content Mine: Where do we get FACTS from?

We’re now well under way with The Content Mine – a project to extract 100 million facts from the scholarly literature (mainly journals). Here’s how it will work (also see neighbouring blog posts and the video, https://vimeo.com/78353557, a 5 minute summary of what we are going to do and why it will work). I summarised the problem as:

  • How are we going to get them?
  • How are we going to process them?
  • Where do we put them?
  • Who will help?

And I addressed (3) in part (Wikimedia, but there are additional places).

The overall summary of the people I want to talk to (not necessarily those that have given any approval, and nowhere near complete) is shown:

It’s a bit hard to read (some animals don’t have very good handwriting). But in the period since the video was made I’ve talked with BMC (+Ross Mounce, and we are going back), PLoS (and we shall visit), British Library (and we’ll go back), EuropePMC (AGM tomorrow), might bump into eLife in the pub… and exchanged emails with Greg Wilson (Software Carpentry), Wikimedia (and will visit), etc.

And I’m presenting at UK Serials group (“journals”) on Thursday and I’ll invite legacy publishers to get involved. There are all sorts of reasons why they should like the idea of their data being mined, without their having to do anything other than hold the lawyers back. If I make a good case I’m sure I’ll convince them. [With possibly one or two exceptions].

So Q (1). Where are we going to get the facts?

We’ll get them from standard journal articles that I and millions of others can read (either Openly or because our universities have paid to read). They normally consist of:

  • “the full text”. This is normally PDF, which used to be impossible to read, but I have cracked enough of the (awful) typesetting that you can regard this as a problem solved in principle. The good news is that some articles contain vector diagrams in the PDF and I can read these. About 40% ± 10% of BMC is vector. Most of arXiv is.
  • OR HTML. This is useful because we convert PDF to XHTML and we analyse that. Most Open publishers and many legacy publishers expose HTML. This has some advantages (no errors converting from PDF) but has the problem that the figures and tables may be removed. Sometimes high Unicode points (especially math) are represented by images.
  • OR XML. This is available from Open Access publishers but jealously guarded by legacy ones. Since I can recreate it from the PDF I don’t need it. It’s useful because it uses a single DTD (JATS/NLM).
  • Supplemental data. This can be anything. In good cases (CIF) it’s highly structured and well described. In almost all other cases it’s messy, ranging from CSV (parsable, but without ontology) to XLS (parsable but even worse) to data destroyed as PDF or Word. (Except I can recover some of this).
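To show why a single DTD is handy, here is a minimal sketch of pulling the title and paragraph text out of a JATS-style XML document using only the Python standard library. The sample document is hand-written for illustration (it is not a real article), the function name `extract_title_and_paragraphs` is my own invention, and this is not the Content Mine pipeline itself, just an indication of how little code is needed once the structure is predictable.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written JATS-style fragment (not from any real article).
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An example article</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Results</title>
      <p>The tree contains <italic>42</italic> taxa.</p>
      <p>Branch lengths are given in the supplemental data.</p>
    </sec>
  </body>
</article>"""


def extract_title_and_paragraphs(jats_xml):
    """Return (title, list of paragraph texts) from a JATS-style document."""
    root = ET.fromstring(jats_xml)
    # The article title lives in a fixed place in the front matter.
    title_el = root.find("./front/article-meta/title-group/article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else None
    # Collect every <p> in the body, flattening inline markup such as <italic>
    # and collapsing runs of whitespace.
    paragraphs = [
        " ".join("".join(p.itertext()).split())
        for p in root.findall("./body//p")
    ]
    return title, paragraphs


title, paragraphs = extract_title_and_paragraphs(SAMPLE_JATS)
print(title)
for text in paragraphs:
    print("-", text)
```

The same two path expressions should work on any document that follows this layout, which is the practical benefit of a shared DTD; real articles carry far more markup (namespaces, references, figures) that this sketch ignores.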

Some closed publishers (e.g. Nature, Royal Soc. Chem., Amer. Chem. Soc.) expose their supplemental data and let us mine it already. Others hide it behind paywalls.

I’m going to do this in stages, starting with Open Access publishers (i.e. those who have 95%+ of their output as BOAI-compliant – I do not use confusing terms like “Gold”). I can do this without seeking their permission as the CC-BY licences grant me the right. However I am keeping in touch because it’s good to have a harmonious community – for example I wouldn’t want to burn their servers out by bad practice. (Actually this is complete FUD (see Cameron Neylon) – there is no way that PMR and friends *could* burn out anyone’s servers). But I’d like to be seen to be responsible.
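For anyone wondering what “responsible” might look like in practice, here is a minimal sketch of a throttle that enforces a pause between successive requests to one server. The class name `PoliteThrottle` and the chosen interval are assumptions made for illustration; this is not the project’s actual crawler, and no publisher has specified these numbers.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between successive requests to one server."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the logic can be checked
        # without really waiting.
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Pause if needed so calls are at least min_interval_s apart.

        Returns the number of seconds actually slept.
        """
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

In use you would create one throttle per publisher, for example `PoliteThrottle(2.0)`, and call `wait()` immediately before each download, so that no server sees more than one request every couple of seconds.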

So I’m talking with them. If you are an open access publisher who would like their content mined, and have a community that would appreciate it, get in touch.

I’m starting with BioMedCentral. Why?

(Not because they have a mascot (Gulliver Turtle) and the others don’t). But because Ross and I need what’s in BMC Evolutionary Biology and we’ve developed the tools on that. And because BMC (unlike PLoS) keep vector graphics. BUT! I have talked with PLoS and as a result they are going to …

And we also know that the Royal Society allows content mining by subscribers. So there’s a lot to get started on.

And on Thursday I’ll ask the legacy closed publishers to allow content mining by subscribers. We don’t need their help (i.e. we don’t need XML (because it has no pictures)) – we just want them to chain up their lawyers.

Would I be breaking the law to mine FACTS from content I had paid to read? I don’t think so. “The right to read is the right to mine”. I don’t want to break the law. As I said last week “I have kept within the law – up till now”. I want to be able to continue to say that.

But at some stage, next year, we shall move onto FACTS in the whole literature.