The scandal of publisher-forbidden textmining: The vision denied

This is the first of probably several posts on my concerns about textmining. You do NOT have to be a scientist to understand the point with total clarity. This topic is one of the most important I have written about this year. We are at a critical point where, unless we take action, our scholarly rights will be further eroded. What I write here is designed to be submitted to the UK government as evidence if required. I am going to argue that the science and technology of textmining is systematically restricted by scholarly publishers, to the serious detriment of the utilisation of publicly funded research.

What is textmining?

The natural process of reporting science often involves text as well as tables. Here is an example from chemistry (please do not switch off – you do not need to know any chemistry.) I’ll refer to it as a “preparation” as it recounts how the scientist(s) made a chemical compound.

To a solution of 3-bromobenzophenone (1.00 g, 4 mmol) in MeOH (15 mL) was added sodium borohydride (0.3 mL, 8 mmol) portionwise at rt and the suspension was stirred at rt for 1-24 h. The reaction was diluted slowly with water and extracted with CH2Cl2. The organic layer was washed successively with water, brine, dried over Na2SO4, and concentrated to give the title compound as oil (0.8 g, 79%), which was used in the next reaction without further purification. MS (ESI, pos. ion) m/z: 247.1 (M-OH).

The point is that this is a purely factual report of an experiment. No opinion, no subjectivity. A simple, necessary account of the work done. Indeed if this were not included it would be difficult to work out what had been done and whether it had been done correctly. A student who got this wrong in their thesis would be asked to redo the experiment.

This is tedious for a human to read. However, during the twentieth century large industries grew up based on humans reading this and reporting the results. Two of the best known abstracters are the ACS’s Chemical Abstracts and Beilstein’s database (now owned by Elsevier). These abstracting services have been essential for chemistry – to know what has been done and how to repeat it (much chemistry involves repeating previous experiments to make material for further synthesis, testing etc.).

Over the years our group has developed technology to read and “understand” language like this. Credit to Joe Townsend, Fraser Norton, Chris Waudby, Sam Adams, Peter Corbett, Lezan Hawizy, Nico Adams, David Jessop, Daniel Lowe. Their work has resulted in an Open Source toolkit (OSCAR4, OPSIN, ChemicalTagger) which is widely used in academia and industry (including publishers). So we can run ChemicalTagger over this text and get:

EVERY word in this has been interpreted. The colours show the “meaning” of the various phrases. But there is more. Daniel Lowe has developed OPSIN which works out (from a 500-page rulebook from IUPAC) what the compounds are. So he has been able to construct a complete semantic reaction:

If you are a chemist I hope you are amazed. This is a complete balanced chemical reaction with every detail accurately extracted. The fate of every atom in the reaction has been worked out. If you are not a chemist, try to be amazed by the technology which can read “English prose” and turn it into diagrams. This is the power of textmining.
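For readers who want a feel for the very first step of this kind of extraction, here is a toy sketch in Python. It is NOT how OSCAR4, OPSIN or ChemicalTagger work; those tools interpret every word and resolve the chemical names, whereas this only pulls numeric amounts and their units out of the first sentence of the preparation above with a regular expression. The names `QUANTITY` and `extract_quantities`, and the short list of units, are assumptions made purely for illustration.

```python
import re

# First sentence of the preparation quoted earlier in this post.
PREPARATION = (
    "To a solution of 3-bromobenzophenone (1.00 g, 4 mmol) in MeOH (15 mL) "
    "was added sodium borohydride (0.3 mL, 8 mmol) portionwise at rt and the "
    "suspension was stirred at rt for 1-24 h."
)

# A number (optionally a range such as "1-24") followed by a unit.
# The unit list is a hypothetical, deliberately tiny subset.
QUANTITY = re.compile(
    r"(\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?)\s*(mmol|mL|g|h)\b"
)


def extract_quantities(text):
    """Return (amount, unit) pairs in the order they appear in the text."""
    return QUANTITY.findall(text)


if __name__ == "__main__":
    for amount, unit in extract_quantities(PREPARATION):
        print(amount, unit)
```

Note that the "3" in "3-bromobenzophenone" is correctly ignored, because it is not followed by a unit. Deciding that it is part of a compound name, and which compound, is exactly the hard part that the real toolkit handles.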

There are probably about 10 million such preparations reported in the scholarly literature. There is an overwhelming value in using textmining to extract the reactions. In Richard Whitby’s Dial-a-molecule project (EPSRC) the UK chemistry community identified the critical need to text-mine the literature.

So why don’t we?

Is it too costly to deploy?


Will it cause undue load on publisher servers?

No, if we behave in a responsible manner.
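As an illustration of what "responsible" could mean in practice, here is a minimal sketch in Python of a throttle that enforces a minimum gap between successive requests. The class name `Throttle`, its parameters and the interval are hypothetical; no publisher's actual limits are implied, and the fetching itself is left out.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the interval; return the delay."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval_s - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

A miner would call `throttle.wait()` before each download, so that however many papers are queued the server never sees more than one request per interval.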

Does it break confidentiality?

No – all the material is “in the public domain” (i.e. there are no secrets).

Is it irresponsible to let “ordinary people” do this?


Then let’s start!



But Universities pay about 5-10 Billion USD per year as subscriptions for journals. Surely this gives us the right to textmine the content we subscribe to.


Here is part of the contract that Universities sign with Elsevier (I think CDL is California Digital Library but Cambridge’s is similar) see for more resources

The CDL/Elsevier contract includes, at Schedule 1.2(a):


"Subscriber shall not use spider or web-crawling or other software programs, routines, robots or other mechanized devices to continuously and automatically search and index any content accessed online under this Agreement. "


What does that mean?


Whyever did the library sign this?

I have NO IDEA. It’s one of the worst abrogations of our rights I have seen.

Did the libraries not flag this up as a serious problem?

If they did I can find no record.

So the only thing they negotiated on was price? Right?

Appears so. After all 10 Billion USD is pretty cheap to read the literature that we scientists have written. [sarcasm].

So YOU are forbidden to deploy your state-of-the-art technology?

PMR: That’s right. Basically the publishers have destroyed the value of my research. (I exclude CC-BY publishers but not the usual major lot).

What would happen if you actually did try to textmine it?

They would cut the whole University off within a second.

Come on, you’re exaggerating.

Nope – it’s happened twice. And I wasn’t breaking the contract – they just thought I was “stealing content”.

Don’t they ask you to find out if there is a problem?

No. Suspicion of theft. Readers are Guilty until proven innocent. That’s publisher morality. And remember that we have GIVEN them this content. If I wished to datamine my own chemistry papers I wouldn’t be allowed to.

But surely the publishers are responsive to reasonable requests?

That’s the line they are pushing. I will give my own experience in the next post.

So they weren’t helpful?

You will have to find out.

Meanwhile you are going to send this to the government, right?

Right. The UK has commissioned a report on this, from Prof Hargreaves.

And it thinks we should have unrestricted textmining?

Certainly for scientific, technical and medical content.

So what do the publishers say?

They think it’s over the top. After all they have always been incredibly helpful and responsive to academics. So there isn’t a real problem. See

Nonetheless, the UK Publishers Association, which describes its “core service” as “representation and lobbying, around copyright, rights and other matters relevant to our members, who represent roughly 80 per cent of the industry by turnover”, is unhappy. Here’s Richard Mollet, the Association’s CEO, explaining why it is against the idea of such a text-mining exception:

If publishers lost the ability to manage access to allow content mining, three things would happen. First, the platforms would collapse under the technological weight of crawler-bots. Some technical specialists liken the effects to a denial-of-service attack; others say it would be analogous to a broadband connection being diminished by competing use. Those who are already working in partnership on data mining routinely ask searchers to “throttle back” at certain times to prevent such overloads from occurring. Such requests would be impossible to make if no-one had to ask permission in the first place.

They’ve got a point, haven’t they?

PMR: This is appalling disinformation. This is ONLY the content that is behind the publishers’ paywalls. If there were any technical problems they would know where they come from and could arrange a solution.

Then there is the commercial risk. It is all very well allowing a researcher to access and copy content to mine if they are, indeed, a researcher. But what if they are not? What if their intention is to copy the work for a directly competing-use; what if they have the intention of copying the work and then infringing the copyright in it? Sure they will still be breaking the law, but how do you chase after someone if you don’t know who, or where, they are? The current system of managed access allows the bona fides of miners to be checked out. An exception would make such checks impossible.

[“managed access” == total ban]

If you don’t immediately see that this is a spurious argument, then read the Techdirt article. The ideal situation for publishers is if no-one reads the literature. Then it’s easy to control. This is, after all, PUBLISHING (although Orwell would have loved the idea of modern publishing being to destroy communication).

Which leads to the third risk. Britain would be placing itself at a competitive disadvantage in the European & global marketplace if it were the only country to provide such an exception (oh, except the Japanese and some Nordic countries). Why run the risk of publishing in the UK, which opens its data up to any Tom, Dick & Harry, not to mention the attendant technical and commercial risks, if there are other countries which take a more responsible attitude.

So PMR doing cutting-edge research puts Britain at a competitive disadvantage. I’d better pack up.

But not before I have given my own account of what we are missing and the collaboration that the publishers have shown me.

And I’ll return to my views about the deal between University of Manchester and Elsevier.