Guardian article on Content-mining (thanks Alok Jha) makes it mainstream

[I meant to blog this earlier but I have been spending time on writing content-mining software rather than the continued depressing struggle with reactionary commercial publishers of #scholpub.]

On 2012-05-24 the Guardian (a/the mainstream liberal daily newspaper in the UK) published an article in its main news pages on content-mining from scientific #scholpub articles. In the paper it is headed “It’s a useful research tool so why forbid it?” (p14); online it is:

Text mining: what do publishers have against this hi-tech research tool?

Researchers push for end to publishers’ default ban on computer scanning of tens of thousands of papers to find links between genes and diseases

Byline: Alok Jha, Science Correspondent

Alok was the person who promoted “Academic Spring” on the front page of the Guardian last month. He contacted me and others, especially Robert Kiley from Wellcome Trust. Robert and his Wellcome colleagues have made a massive contribution to free scientific information – without Wellcome we would have much poorer involvement. And with Wellcome as a sponsor of UKPMC, Robert is at the frontline of content-mining – he knows firsthand how hard it is to get any help from the publishing industry[*].

The coverage included stories from Casey Bergman + Max Haeussler, Heather Piwowar and myself – detailing carefully and accurately our major ongoing difficulties. Some snippets:

All of them [above] needed access to tens of thousands of research papers at once, so they could use computers to look for unseen patterns and associations across the millions of words in the articles. This technique, called text mining, is a vital 21st-century research method. It uses powerful computers to find links between drugs and side effects, or genes and diseases, that are hidden within the vast scientific literature. These are discoveries that a person scouring through papers one by one may never notice.
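To make "links between genes and diseases" concrete, here is a minimal sketch of the simplest form of the idea: counting how often a gene name and a disease name turn up in the same sentence. It is only an illustration, not the software mentioned at the top of this post. The term lists, the toy "papers" and the function names are all made up for the example; real mining uses curated vocabularies and far better language processing.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical vocabularies for illustration only; a real project would use
# curated gene and disease ontologies rather than hand-typed sets.
GENES = {"BRCA1", "TP53", "CFTR"}
DISEASES = {"breast cancer", "cystic fibrosis", "lung cancer"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurrences(documents):
    """Count (gene, disease) pairs that appear together in one sentence."""
    counts = Counter()
    for doc in documents:
        for sentence in split_sentences(doc):
            lowered = sentence.lower()
            # Gene symbols are matched case-sensitively as whole words.
            genes = {g for g in GENES
                     if re.search(rf"\b{re.escape(g)}\b", sentence)}
            # Disease names are matched case-insensitively as substrings.
            diseases = {d for d in DISEASES if d in lowered}
            counts.update(product(sorted(genes), sorted(diseases)))
    return counts


# Invented sentences standing in for article text.
TOY_PAPERS = [
    "Mutations in BRCA1 raise the risk of breast cancer. TP53 was also sequenced.",
    "We report BRCA1 variants in a breast cancer cohort. CFTR causes cystic fibrosis.",
    "TP53 loss is common in lung cancer and breast cancer.",
]

if __name__ == "__main__":
    for (gene, disease), n in cooccurrences(TOY_PAPERS).most_common():
        print(f"{gene} <-> {disease}: {n}")
```

Three toy sentences tell you nothing. The point is that the same counting, run over tens of thousands of papers, surfaces associations that nobody reading one paper at a time would notice, and that is why bulk access matters.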

It is a technique with big potential. A report published by McKinsey Global Institute last year said that “big data” technologies such as text and data mining had the potential to create €250bn (£200bn) of annual value to Europe’s economy, if researchers were allowed to make full use of it.

Unfortunately, in most cases, text mining is forbidden. Bergman, Murray-Rust, Piwowar and countless other academics are prevented from using the most modern research techniques because the big publishing companies such as Macmillan, Wiley and Elsevier, which control the distribution of most of the world’s academic literature, by default do not allow text mining of the content that sits behind their expensive paywalls.

Absolutely correct.

Any such project requires special dispensation from – and time-consuming individual negotiations with – the scores of publishers that may be involved.

“That’s the key fact which is halting progress in this field,” said Robert Kiley, head of digital services at the Wellcome Trust. “For a lot of people, though there is promise there, the activation effort is just too great.”

Exactly. My research has been set back 2-3 years by fruitless “discussions” with publishers.

Asking for permission from publishers is an option, though time-consuming. The University of British Columbia (UBC) researcher, Heather Piwowar, was trying to map the ways scientists use and share papers.

She was eventually contacted by Alicia Wise, Elsevier’s director of universal access, who convened a conference call with Piwowar, a UBC librarian and five Elsevier colleagues. That conversation led to permission for UBC researchers to text mine the Elsevier journals to which they already had access.

Piwowar said: “It takes a lot of time and a lot of energy and doesn’t scale at all. To me it’s a good result because now I have access to things I didn’t have access to before and also it will also hopefully drive change by people saying, ‘This is not an OK way to build on our scholarly literature.’”

The colossal waste of time is clear. Elsevier want me to negotiate with them and the Cambridge University Library. I have to tell Elsevier what research I want to do. The library has better things to do with its time. So do I.

And it’s technically completely unnecessary. I can access the articles I want by standard means. It’s a pinprick in the daily Elsevier downloads. It’s sheer FUD to suggest I will crash their servers. I don’t want ZIP files from them through a special API. I already have what I want. All I need is Elsevier to say they won’t sue me.

Wise said that, in principle, her company was happy to enable text mining for its content. “We want to help researchers deepen their insight and understanding, we want to help them to advance science and healthcare and we want to be able to do that in ways that help realise the maximum benefit from the content we publish. Text mining is clearly a part of this landscape and it will continue to be and we’re keen to support it.”

“In principle” means nothing. In the comments AW (Alicia Wise) wrote:

Elsevier is leading the research information industry to enable text mining.

NO! BMC and PLoS are leading it. I can mine them – as much as I like and I can’t mine Elsevier at all.

We provide text mining solutions to an array of customers, and we also enable researchers to text mine our content for themselves. This is all done through licensing, which is highly efficient and easily scalable.

So efficient and scalable that I have got nowhere in ca. 3 years. So efficient that we need 5 Elsevier staff for one researcher.

We began partnering with the University of Southern California in 2007 to enable researchers in its Neuroscience Research Institute to content mine and we now have agreements with about 20 universities around the world.

Wow! 20/1500 universities in 5 years. Just over 1%.

We also serve researchers in a broad array of commercial organisations. Earlier this year we announced our acquisition of Ariadne Genomics and QUOSA, companies that both provide state-of-the-art text mining services to improve researcher productivity. We continue to invest to develop an array of text mining ourselves, and we offer other tools through collaboration with partners such as the UK’s National Centre for Text Mining. We are also working with other publishers to ensure that text mining is possible regardless of who has published it or where it is located.

These are all words. I am still not allowed to text-mine. And it is Elsevier who makes the rules – in most science it’s God who makes the rules, but here it’s Mammon. I will write a blog on Elsevier and Helpfulness. “Elsevier is a helpful publisher” is similar to a British bank which advertises “helpful banking”. Think of “helpful banking” whenever you think of Elsevier.

Back to the positive.

So what Alok has done is massive! To get national coverage at this level is a huge boost to the legitimacy of our effort. It means the issue is now clear to everyone and cannot be ignored as a minor fringe activity. The UCSF declaration for Open Access (still not mandatory and therefore of very limited practical effect) mentioned mining. Funders are starting to promote mining. UKPMC is fully aware of its huge potential – the dam is only maintained by publisher lawyers and publisher lobbyists on Capitol Hill (US).

So I have been aggressively tooling up for when I am allowed to mine the scientific content. The Guardian article acted as a trial in the court of public opinion and I think the publishers have very little support there.

But I am starting with BMC. Who knows, maybe there is enough hidden science in just 5% of the scholarly literature?

Today we continue developing our Manifesto on Content Mining



[*] Yes, I exempt PLoS, BMC, and lots of worthy society publishers.