What is TextAndData/ContentMining?

I prefer “ContentMining” to the formal legal phrase “Text and Data Mining” because it emphasizes all kinds of content – audio, photos, videos, diagrams, chemistry, etc. I chose it to assert that non-textual content – such as diagrams – could be factual and therefore uncopyrightable, and because non-textual content is a huge and exciting extra dimension.


Mining is the process of finding useful information in material whose producer did not create it for that purpose. For example, the log books of the British navy – which recorded data on weather – are now being used to study climate change (certainly not what the British Admiralty had in mind). Records of an eclipse in ancient China have been used to study the rotation of the earth. Similarly, forty years ago I studied hundreds of papers on individual crystal structures to determine reaction pathways – again a use completely unexpected by the original authors.


In science, mining is a way to dramatically increase human knowledge simply by running software over existing publications. Initially I had to type the data in by hand (the papers really were papers) and then I developed ways of using electronic information. About 15 years ago I developed tools that could trawl the whole of the crystallographic literature and extract the structures, and we built this into Crystaleye – where the software added much more information than was in the original paper. (We have now merged this with the Crystallography Open Database http://www.crystallography.net ). My vision was to do this for all chemical information – structures, melting points, molecular mass, etc. Ambitious, but technically not impossible. We had useful funding and collaboration with the Royal Society of Chemistry and developed OSCAR as software specifically to extract chemistry from text. Ten years ago things looked exciting – everyone seemed to accept that having access to electronic publications meant that you could extract facts by machine. It stood to reason that machines were simply a better, more accurate, faster way of extracting facts than pencil and retyping.


So what new science can we find by mining?

  • More comprehensive coverage. In 1974 I read and analyzed 1-200 papers in 6 months. In 2017 my software can read 10000 papers in less than a day.
  • More comprehensive within a paper. Very often I would limit the information I extracted because I didn’t have time (e.g. I left out the anisotropic displacements of atoms). Now it’s trivial to include everything.
  • Aggregation and intra-domain analytics. By analysing thousands of papers you can extract trends and patterns that you couldn’t find before. In 1980 I wanted to ask “How valid is the 18-electron rule?” – there wasn’t enough data/time. Now I could answer this within minutes.
  • Aggregation and inter-domain analytics. I think this is where the real win is for most people. “What pesticides are used in what countries where Zika virus is endemic and mosquito control is common?” You cannot get an answer from a traditional search engine – but if we search the full-text literature for pesticide+country+disease+species we can rapidly find the papers with the raw information and then extract and analyze it. “Which antibodies to viruses have been discovered in Liberia?” An easy question for our software to answer, except that the answer was behind a paywall – no-one saw it and the Ebola outbreak was unexpected.
  • Added information. If I find “Chikungunya” in an article, the first thing I do is link it electronically to Wikidata/Wikipedia. This tells me immediately the whole information hinterland of every concept I encounter. It’s also computable – if I find a terpene chemical I can compute the molecular properties on-the-fly. I can, for example, predict the boiling point and hence the volatility without this being mentioned in the article. The literature is a knowledge symbiont.
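
To make the pesticide+country+disease+species idea concrete, here is a minimal sketch of a multi-dictionary co-occurrence search. It is not the ContentMine code: the facet dictionaries, the function names (find_terms, mine) and the two "papers" are all hypothetical, and the sentences are invented placeholders rather than quotations from real articles. It only shows the principle that a paper is kept when it mentions at least one term from every dictionary.

```python
import re

# Hypothetical mini-dictionaries, one per facet; real ones would be far larger.
FACETS = {
    "pesticide": ["deltamethrin", "permethrin", "malathion"],
    "country": ["Brazil", "Colombia", "Liberia"],
    "disease": ["Zika", "Chikungunya", "dengue"],
    "species": ["Aedes aegypti", "Aedes albopictus"],
}


def find_terms(text, terms):
    """Return the dictionary terms that occur in text as whole words (case-insensitive)."""
    found = set()
    for term in terms:
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            found.add(term)
    return found


def mine(papers, facets=FACETS):
    """Keep only the papers that mention at least one term from every facet."""
    hits = {}
    for paper_id, text in papers.items():
        matches = {name: find_terms(text, terms) for name, terms in facets.items()}
        if all(matches.values()):  # an empty set is falsy, so every facet must match
            hits[paper_id] = matches
    return hits


# Invented placeholder sentences, NOT quotations from real papers.
PAPERS = {
    "paper-1": "Deltamethrin was sprayed against Aedes aegypti in Brazil during the Zika outbreak.",
    "paper-2": "We report dengue cases in Colombia.",
}

if __name__ == "__main__":
    for paper_id, matches in mine(PAPERS).items():
        print(paper_id, {name: sorted(terms) for name, terms in matches.items()})
```

Only the first placeholder paper survives, because the second mentions no pesticide and no species. The matched terms per facet are exactly the raw material for the aggregation step described above.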


Everyone is already using the results of Text Mining. Google and other search engines have sophisticated language analysis tools that find all sources with (say) “Chikungunya”. What I want to excite you about is the chance to go much further.

Why do we need other search engines when we have “Google”?


  • Google shows you what it wants you to see. (The same is true for Elsevinger). You do not know how the results were selected, the selection is not reproducible, and you have no control over it. (Also, if you care, Google and Elsevinger monitor everything you do and either monetize it or sell it back to your Vice-Chancellor).
  • Google does not allow you to collect all the papers that fit a given search. It gives links – but try to scrape all these links and you will be cut off. By contrast, Rik Smith-Unna, working with ContentMine (CM), developed “getpapers” – which gives exactly what the research scientist needs: an organized collection of the papers resulting from a search. ContentMine tools such as “AMI” then allow detailed analysis of the contents of the papers.
  • Google can’t be searched by numeric values. Try asking for papers with patients in the age range 12-18 and it’s impossible (you might be lucky that this precise string is used but generally you get nothing). In contrast CM tools can search for numbers, search within graphs, search species and much more. “Find all diterpene volatiles from conifers over 10 metres high at sea level in tropical latitudes” is a straightforward concept for CM software.
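
As an illustration of why numeric search differs from string search, here is a minimal sketch that pulls age ranges out of a sentence and tests them against a query range. It is not how the CM tools are implemented: the regular expression, the function names (age_ranges, within) and the example sentences are my own assumptions for this sketch, and the pattern only covers a few simple phrasings.

```python
import re

# Matches "12-18", "12–18" or "12 to 18" followed by "year(s)" or "yr(s)".
RANGE_RE = re.compile(
    r"(\d{1,3})\s*(?:-|–|to)\s*(\d{1,3})\s*(?:years?|yrs?)\b",
    re.IGNORECASE,
)


def age_ranges(text):
    """Return every (low, high) age range stated in text, as integers."""
    return [(int(low), int(high)) for low, high in RANGE_RE.findall(text)]


def within(text, low, high):
    """True if any age range stated in text lies entirely inside [low, high]."""
    return any(low <= a and b <= high for a, b in age_ranges(text))


if __name__ == "__main__":
    # Invented example sentences, not taken from real papers.
    sentences = [
        "Patients aged 13 to 17 years were enrolled.",
        "Adults aged 40-65 years took part.",
    ]
    for sentence in sentences:
        print(within(sentence, 12, 18), sentence)
```

A string search for “12-18” would miss the first sentence entirely, whereas treating the numbers as numbers finds it because 13 to 17 lies inside the requested range.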


That’s a brief introduction – and I’ll show real demos tomorrow.