Elsevier, FooBar and Content-mining – yet another Digital Land Grab – wake up academia and fight. Or surrender for ever

I have just discovered Elsevier’s content mining document.

For those who don’t know, I have been trying to get permission to text-mine Elsevier content for two years, have been treated as a second-class citizen, and have ultimately come away with nothing. See http://blogs.ch.cam.ac.uk/pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/ . The analysis in this post will centre round Elsevier but also applies to another major publisher (FooBar, whom I will reveal in later posts if my informant agrees). And I suspect it applies to a large proportion of the rest of the publishing community. I’ll reproduce most of the document. (I don’t have the sacred copyright permission to reproduce it of course, but…). BTW the Elsevier staff in Oxford a year ago promised that they would update me when this document came out but of course they didn’t.

Read http://www.elsevier.com/wps/find/intro.cws_home/contentmining before you read my critique. Consider the implications. Then I’ll indicate why we have been so badly let down by academic libraries or their purchasing agents, who have given away more of our crown jewels without a fight.

If you want to know why I am so angry with University Libraries read the bottom of the post as well.

OK, have you read it? – it’s not very long. I’ll go through and annotate it – like a peer-reviewer. Because after all that’s why we pay Elsevier, isn’t it? – because without them we’d be incapable of organising peer-review. (Elsevier is in italics.)


Overview of content mining

•    Content Mining concerns the automatic processing of large collections of various forms of data and information to identify, organise and perform analysis in order to determine possible links within the content that may not be obvious on initial inspection.

PMR: This is an extraordinarily simplistic view. It probably arises from Elsevier’s limited vision. From Wikipedia:

Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. ‘High quality’ in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Information retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web. There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.

Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and concept extraction out of images/audio/video could be seen as information extraction.
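PMR: To make those definitions concrete, here is a deliberately tiny sketch of the three steps Wikipedia describes: structuring the input, extracting entities, and deriving patterns (here, which entities are mentioned in the same sentence). Everything in it is an assumption made for illustration: the two-document corpus, the hand-made lexicon and the function names are hypothetical, and real chemical text-mining uses far richer linguistic tools than a regular expression and a word list.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; a real project would load full-text articles.
DOCUMENTS = [
    "Aspirin inhibits cyclooxygenase. Aspirin is synthesised from salicylic acid.",
    "Ibuprofen also inhibits cyclooxygenase.",
]

# Hypothetical hand-made lexicon of entity names (lower case).
LEXICON = {"aspirin", "ibuprofen", "cyclooxygenase", "salicylic acid"}


def split_sentences(text):
    """Structure the input: split raw text into sentences at ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence):
    """Return the lexicon entries found in a sentence, sorted alphabetically."""
    lowered = sentence.lower()
    return sorted(
        term
        for term in LEXICON
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def mine(documents):
    """Count sentences mentioning each entity, and sentence-level co-occurrences."""
    mentions = Counter()
    pairs = Counter()
    for doc in documents:
        for sentence in split_sentences(doc):
            entities = find_entities(sentence)
            mentions.update(entities)                # one count per sentence
            pairs.update(combinations(entities, 2))  # alphabetically ordered pairs
    return mentions, pairs


if __name__ == "__main__":
    mentions, pairs = mine(DOCUMENTS)
    print(mentions.most_common())
    print(pairs.most_common())
```

The point of the sketch is that none of this needs a publisher: given machine-readable text, the structuring, extraction and pattern-finding are all things a researcher can run on their own machine.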


•    There are various methods to perform this processing, but there are elements common to all methods, including an automated way to process all sizes and types of content in which to identify relevant information, facilitate its extraction and its analysis.

PMR: This is a woolly sentence – the only relevant concept is automation. This is the key to our struggle for Free/Open information to mine.

•    Content mining has links to semantic technology as it focuses on the interlinks and contextual commonalities to enhance the understanding of the content.

PMR: I have no idea what a “contextual commonality” is. The only meaningful concept here is semantic technology.

•    The development of these mining approaches are of particular importance within the scientific community to drive the interdisciplinary nature of research and support new areas of discovery.


PMR: A safe generalization adding little new insight.


Elsevier’s principles on content mining

  • Elsevier wants to support our customers to advance science and health.

PMR: This is so vapid that it can only be classified as marketing froth. What Elsevier “wants” and what Elsevier provides have no correspondence in reality.

  • We want to help them realise the maximum benefit from our content and enhance insight and understanding through content mining.

PMR: And in practice they do everything possible to retard the independent development of textmining.

  • Our journals and books have added value – we invest in quality content and enrich content to maximise discoverability and usability.

PMR: “maximise usability”??? Double-column (or even single column) PDF is a major destruction of information. Scientists have spent hundreds of person-years (probably thousands) trying to get information out of PDF. Whereas simply providing us with the original author manuscript in Word or LaTeX is all we need. We can add the document semantics. But no, we need Elsevier to provide the content.

  • We believe a transparent content mining policy framework is essential, which needs efficient implementation and flexibility to cover multiple scenarios.

PMR: Devoid of meaning. “Transparent”?? “Efficient” and “flexible”? Weasel words (a Wikipedia term) that imply only Elsevier is clever enough to do this.

  • The framework of open innovation enables and facilitates application development within our content.

PMR: “open” means controlled by Elsevier. The rest of the sentence is unproven, unimplemented marketing speak.

  • Elsevier will continue to manage its content in modern digital formats that facilitate the easy access, use, and re-use of content.



Our approach to providing content mining

  • Elsevier is receiving an increasing number of content mining requests and we are developing solutions to meet customer needs. We are doing this because we realise that researchers and organisations which to derive even more value from our content, but in a way that they choose. Consequently we have adapted our policies to this primary goal.

PMR: “Researchers choose”??? No, Elsevier chooses. Or have I missed a public consultation process??

  • We wish to understand our customers’ text mining requirements and as practically every content mining request has a different goal and there is not a common solution to provide this. Consequently we request that customers looking to mine our content should speak to their Elsevier Account Manager or should contact us directly at universal.access@elsevier.com

PMR: Maybe they do “wish” but they aren’t trying – as this document shows. Yes, every content mining project has a different goal. So, before doing research on OUR own output we have to speak to “our Elsevier account manager”.

PMR: a separate comment for “universal access”. Newspeak. This is so Orwellian it’s unbelievable.

  • We will then discuss the mining request, access to the content (see below), licensing and (where applicable) pricing for the project.

PMR: HERE WE HAVE THE CRUX!!! This is the first meaningful sentence in the whole document. I have to get permission from Elsevier to do research on “their” content. If they don’t like what I want to do they can just block it – or better, fail to respond. “Licensing”. I won’t be able to publish the results Openly. I’ve already seen their contract (see my blog post). We carry out mining for Elsevier to possess the results. To create enhanced content that they can sell to the community for higher prices and higher justification for their added value.

  • Mining requests are often content specific. Customers can choose to mine our full-text content, abstracts, data and other materials. A charge may be applicable dependent on the request.

PMR: “A charge”. I’ll discuss that below. This is the second meaningful sentence.

  • Common requests for Content Mining include:
    • Running extensive searches and using locally loaded content for text mining purposes for research.
    • Extraction of semantic entities from Elsevier content for the purpose of recognition and classification of the relations between them.
    • Performing extensive mining operations on subscribed content, including structuring input text, deriving patterns within this text and evaluation and interpretation of the output.
    • Customers can integrate results on a server used for the subscriber’s own mining system for access and use by its researchers through the subscriber’s internal secure network.

PMR: “the subscriber’s internal secure network”. Again we see the control. Nobody can extract value and publish it to the world.

  • All commercial usage of content mining results arising from Elsevier content will be subject to licensing and will be chargeable. We will discuss the utilisation of results in accordance to each request.

PMR: Do I have to explain the implications of this?

Facilitating access & technology to empower content mining

Elsevier have developed several different methods to allow customers to mine our content. This provides maximum flexibility and multiple options to access the required content. Examples of this include methods to deliver high amounts of content on demand, API access and other solutions associated with specific content types. For example:

  • ScienceDirect and Scopus licence agreements – subscribers to these products may have options to search, download, email and extract content to allow them to perform their requisite analyses
  • Application Marketplace – Enabling developers who wish to design and implement applications to analyse our content, or who may wish to test applications as part of their research within Elsevier content. For further information on SciVerse Applications, please visit http://www.info.sciverse.com/sciverse-applications

PMR: In other words I can only have access through Elsevier’s walled gardens.

Now I think we are at a really dangerous place in the history of modern digital scholarship.

The simple position is that we have given the publishers our content. Up till now they have simply replayed it back to us (at vast cost to us and profit to them). But the cost is irrelevant.

Now they want to control it. And get us to pay even more. Lots more.

And the first library to agree to pay for text-mining access has sold the whole academic community down the river.

It is our RIGHT to text-mine scientific content. We created it and we can use modern tools to mine it. Without any help from publishers.

But when University libraries “purchase subscriptions” they only consider the pricing. They come back and tell us “we got a great deal – we beat the publishers down!” (I think there has been a recent Russell Group “victory”).

But they flabbily sign the ultra-restrictive clauses in the contracts. This is not about copyright; it’s actually signing a much, much more restrictive contract, one that forbids scientists like me any possibility of doing any meaningful chemical linguistic research. So here are two questions for libraries:

  • Has your organization ever challenged the restrictive contracts on text-mining? And won the freedom to text-mine?
  • Have you ever negotiated with a publisher about additional charges for textmining?

Only if you can answer YES and then NO can you hold your head up.

And FooBar? The publisher that wanted to charge huge amounts for one University to do textmining? Yes, they exist. But I want to get accurate documentation.

So if you have any information about publishers wanting fees to allow textmining, please add it as a comment.