STM publishers give Green Light for Text-and-Data Mining and we go ahead

Until this week I and other scholars had been generally forbidden to use machines to read the scientific literature and extract facts (“Text and Data Mining”, TDM or “Content-mining”). The STM publishers had prepared a draft licence which can be summarised as 20 different ways in which “the readers’ machines have no rights”. You’ll remember that we were invited to discuss this in “Licences for Europe” and we indicated we didn’t want licences, we wanted rights.

So I felt I had to raise this at UKSG. In my slides I outlined principles:

“The right to read is the right to mine” and noted that “Unrestricted TDM saves lives”

And recommended that we all make some changes:

  • Libraries – reject TDM restrictions
  • Publishers – Damascene conversion :-)
  • Funders – insist on CC-BY

(A Damascene conversion is a sudden change of heart from the dark side to the light; see Wikipedia.)

So I expected some gentle flak for being unrealistic.

But WOW! Just before my talk Vicky Gardner from Taylor and Francis talked about T+F OA and posted a slide which said TDM was allowed for non-commercial purposes. Gulp! So I asked her. Something like:

PMR: Does TF allow me to mine your non-OA content?

VG: Yes

PMR: The subscription material??

VG: Yes

PMR: Wow! Christmas has come early [Then gets so excited he crashes down a step and into people, chairs, audience.]

It turned out the STM publishers had released a statement on Wed (the day before, and I’d missed it). It starts:

“Signatories [STM] commit to granting the necessary copyright licenses to permit the text and data mining of copyright protected content and other subject matter on reasonable terms for non-commercial scientific research purposes in the European Union.”


“the purpose of non-commercial text and data mining of subscribed journal content for non-commercial scientific research, at no additional cost to researchers/subscribing institutions”.


The document’s a bit abstract in places, but it’s a political document, not a technical one. The general message is that my friends and I can go ahead and mine content as long as we don’t burn out the publishers’ servers or publish copyright material (e.g. licensed pictures of Mus Michaelis (copyright Disney Corp)).


Dear STM publishers: I won’t do either of those deliberately, and if I make a mistake I’ll rectify it and say sorry.


So we’re starting today! Ross and I hacked AMI yesterday to read and emit species from HTML, XML, PDF and SVG. We’ll start extracting species next week and publishing them daily or even more frequently.
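To give a flavour of what “reading and emitting species” from HTML can mean, here is a minimal Python sketch. It is not AMI’s code. It assumes that species appear as italicised Latin binomials (genus plus epithet), which is a common convention but far from a complete rule, and the names `BINOMIAL`, `ItalicText` and `extract_species` are invented for this illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical heuristic: a capitalised genus followed by a lower-case epithet,
# e.g. "Mus musculus". Illustrative only; not AMI's actual rule set.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


class ItalicText(HTMLParser):
    """Collect text found inside <i> and <em> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many italic elements we are currently inside
        self.chunks = []    # text fragments seen while inside italics

    def handle_starttag(self, tag, attrs):
        if tag in ("i", "em"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("i", "em") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_species(html):
    """Return candidate species names in first-seen order, without duplicates."""
    parser = ItalicText()
    parser.feed(html)
    parser.close()
    found = []
    for chunk in parser.chunks:
        for match in BINOMIAL.finditer(chunk):
            name = match.group(0)
            if name not in found:
                found.append(name)
    return found


if __name__ == "__main__":
    page = "<p>We compared <i>Mus musculus</i> with <em>Homo sapiens</em>.</p>"
    for name in extract_species(page):
        print(name)
```

A heuristic like this will pick up italicised phrases that are not species and will miss abbreviated forms such as “M. musculus”, so real mining needs dictionaries and more careful rules on top.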


The big day is Wed 27th November – Oxford Open Science run by Jenny Molloy. We are launching our TDM kit for researchers (completely Open). The idea is that researchers will find it easy to use, that it will save them time, and that they’ll enhance and distribute it. The aim is to get it into widespread use in research labs, where it can grow an open culture of mining and sharing the results.


I’d be delighted to receive offers of help. Over the next few posts I’ll detail what TDM for science is and how to get started.









