European copyright: Cancel Articles 3, 11 and 13

The proposed reforms to European Copyright will be disastrous. If you don’t know about them the links below are the seminal introductions. When you understand, then write to your MEPs as well. You are welcome to refer to this letter and copy some or all.

See Glyn Moody’s comprehensive and accurate arguments against the articles:

Thanks to the wonderful which makes this easy and professional


Thursday 7 June 2018

Dear David Campbell Bannerman, Stuart Agnew, Patrick O’Flynn, Tim Aker, John Flack, Alex Mayer and Geoffrey Van Orden,

Articles 3, 11 and 13 of the Proposed Copyright Directive

I write as a scientist, Reader Emeritus of the University of Cambridge and also as founder of a Cambridge non-profit which employs high-tech staff in Cambridge.

I am desperately concerned about the proposed copyright reforms on which you will soon vote. The issues are concisely summarised by MEP Julia Reda. (I have corresponded with Julia for 5 years and she understands all the issues, both technical and political. She has installed and run our software!)

In summary the proposals are a mess, and unworkable. They bring confusion, rather than clarity and by default bring total power to “copyright owners”. If they are passed they will destroy knowledge-based innovation in Europe which will pass either to Silicon Valley, SE Asia or the Middle East. Knowledge innovators and companies in Europe are now “chilled” by copyright law and fearful of action. By default they will move to countries with more permissive laws, or simply close.

I am one of the most prominent TDM experts in Europe who both develops software and applications and also publicly campaigns for European Copyright reform. My, and ContentMine’s, goal is to develop machine technology which reads the whole scientific / medical literature and extracts validated factual knowledge for the benefit of us all on a daily basis (environment, health, bioscience, etc.). We have been partners in the H2020 project “FutureTDM” which analysed the problems that European TDM faces.

Europe spends 100 Billion Euro on STEM research but much of it (perhaps 80%) is underutilised because we need machines to help us. We work with clients who have to review 50,000 medical articles – taking months – to advise Medicine agencies about treatments and drugs. We build machine-assisted knowledge tools to speed this process by 10-fold or even more. Article 3 will kill this.

And if we use machines we are likely to fall foul of copyright law and be chilled or even face prosecution (as has happened elsewhere). As an example we have had several approaches from companies around the world who want us to mine the literature for them. We always have to consider the copyright problems upfront. For example TDM can only be carried out (without permission) by “Public Interest Research Organizations” for “non-commercial” purposes. Is PM-R a PIRO? If not then citizen-based innovation is being killed. Are we non-commercial? The only way to find out is to be taken to court. Julia Reda has proposed that every citizen should be allowed to carry out TDM for any purpose. That, and only that, is legal certainty.

Do we have to limit our innovation because of the lobbying by European “publishers”, many of whom do not even create their own content but act as rent collectors on 100 B Euro of publicly funded science and medicine?

  • Article 13 stops us publishing knowledge
  • Article 11 stops us telling people about knowledge
  • Article 3 stops us reading knowledge.

Please oppose the current drafts and work with Julia Reda for a positive innovative copyright future for Europe. With your help we can be world leaders.

Yours sincerely,

Peter Murray-Rust


IMLS Forum on Text and Data Mining – 1 – Background

I am honoured to have been invited to Chicago to take part in a forum funded by the Institute of Museum and Library Services (IMLS). Here’s the occasion:

Data Mining Research Using In-copyright and Limited-access Text Datasets

National Forum, April 5 & 6, 2018, Chicago, Illinois
This project will bring together experts and thought leaders for a 1.5 day meeting to articulate an agenda that provides guidelines for libraries to facilitate research access, implement best practices, and mitigate issues associated with methods, approaches, policy, security, and replicability in research that incorporates text datasets that are subject to intellectual property (IP) rights.
Forum attendees will include librarians, researchers, and content providers who will be called to explore issues and challenges for scholars performing data mining and analysis on in-copyright and limited-access text datasets. These datasets are subject to restrictions that lead researchers to obtain permission for each use, to perform non-consumptive research where they do not have read access to the full text corpus, or to work with their library to identify whether the content provider’s licensing terms and agreements allow for use.
This project is funded by the Institute of Museum and Library Services award LG-73-17-0070-17.
PMR> I’m excited and hope I can help. The point is that official, legal Text and Data Mining (“ContentMining”) of in-copyright datasets is effectively non-existent in the UK. That’s potentially surprising as the UK passed an extension (“Hargreaves”) specifically allowing it for non-commercial research purposes and promoted it as a great opportunity for wealth generation.

I would like to do it, but I am not, and may comment on this later. The uncertainties are so great and the difficulties so forbidding that no UK University that I know of actively supports it through infrastructure, tools, financial and legal support. Researchers in the UK are left to find their own solutions, in the face of continued obstruction from content providers. Legal language uses “chilling” to describe when people or organizations are frightened of being sued or otherwise penalized. UK TDM free of publisher restrictions has entered an ice age with little prospect of warming up.

I’ll cover the reasons why in later posts. In the next I will post my own submission to the meeting.

The meeting aims to create a concerted way forward. The US has a different structure from UK, both legal and academic – US copyright supports Fair Use whereas the UK has very little certainty for the reader/miner.

The University of Illinois at Urbana-Champaign (UIUC) has very kindly invited me to visit before IMLS. I’ll be talking to Digital Humanities, Computer Science, Libraries, etc and giving a general talk on Tuesday 3 April. Not sure whether it will be streamed. Then catching a ride with the others going to Chicago.

I hope the IMLS meeting will generate progress – we need it. But it’s very difficult to satisfy all parties without becoming fuzzy and anodyne. That requires a strong sense of purpose.

CopyCamp2017 4: What is (Responsible) ContentMining?

My non-profit organization has the goal of making contentmining universally available to everyone through three arms:

  • Advocacy. Why it’s so valuable and why you should convince others and why restrictions should be removed.
  • Community. We need a large vibrant public community of practice.
  • Tools. We need to be able to do this easily.

There is a lot of apathy and a considerable amount of push-back and obfuscation (mainly from mega-publishers) and it’s important that we do things correctly. So 4 of us wrote a document on how to do it responsibly:

Responsible Content Mining
Maximilian Haeussler, Jennifer Molloy,
Peter Murray-Rust and Charles Oppenheim
The prospect of widespread content mining of the scholarly literature is emerging, driven by the promise of increased permissions due to copyright reform in countries such as the UK and the support of some publishers, particularly those that publish Open Access journals. In parallel, the growing software toolset for mining, and the availability of ontologies such as DBPedia mean that many scientists can start to mine the literature with relatively few technical barriers. We believe that content mining can be carried out in a responsible, legal manner causing no technical issues for any parties. In addition, ethical concerns including the need for formal accreditation and citation can be addressed, with the further possibility of machine-supported metrics. This chapter sets out some approaches to act as guidelines for those starting mining activities.

Content mining refers to automated searching, indexing and analysis of the digital scholarly literature by software. Typically this would involve searching for particular objects to extract, e.g. chemical structures, particular types of images, mathematical formulae, datasets or accession numbers for specific databases. At other times, the aim is to use natural language processing to understand the structure of an article and create semantic links to other content.

and we gave a typical workflow (which will be useful when we discuss copyright).


Of course there are variants, and particularly where we start with bulk downloading and then searching. For example we are now downloading all Open content, processing it and indexing against Wikidata. There is little point in everybody doing the same thing and, because the result is Open, everyone can share the results of processing.

We’ll use this diagram in later posts.


CopyCamp2017 3: The Hague Declaration and why ContentMining is important

In 2015 LIBER (The European body for Research Libraries) collected a number of leading figures in the Library and Scholarship world to create the Hague Declaration on freedom for Text and Data Mining. This stated not only the aspirations but also the reasons for demanding freedom, and I reproduce a chunk of it here for CopyCamp2017 to consider.

The Hague Declaration aims to foster agreement about how to best enable access to facts, data and ideas for knowledge discovery in the Digital Age. By removing barriers to accessing and analysing the wealth of data produced by society, we can find answers to great challenges such as climate change, depleting natural resources and globalisation.

PMR: note that this is about why it’s so important – the answers to the health of the planet and the beings on it may be hidden in the scientific literature and Mining can pull this out.



New technologies are revolutionising the way humans can learn about the world and about themselves. These technologies are not only a means of dealing with Big Data, they are also a key to knowledge discovery in the digital age; and their power is predicated on the increasing availability of data itself. Factors such as increasing computing power, the growth of the web, and governmental commitment to open access to publicly-funded research are serving to increase the availability of facts, data and ideas.

However, current legislative frameworks in different legal jurisdictions may not be cast in a way which supports the introduction of new approaches to undertaking research, in particular content mining. Content mining is the process of deriving information from machine-readable material. It works by copying large quantities of material, extracting the data, and recombining it to identify patterns and trends.

At the same time, intellectual property laws from a time well before the advent of the web limit the power of digital content analysis techniques such as text and data mining (for text and data) or content mining (for computer analysis of content in all formats). These factors are also creating inequalities in access to knowledge discovery in the digital age. The legislation in question might be copyright law, law governing patents or database laws – all of which may restrict the ability of the user to perform detailed content analysis.

Researchers should have the freedom to analyse and pursue intellectual curiosity without fear of monitoring or repercussions. These freedoms must not be eroded in the digital environment. Likewise, ethics around the use of data and content mining continue to evolve in response to changing technology.

Computer analysis of content in all formats, that is content mining, enables access to undiscovered public knowledge and provides important insights across every aspect of our economic, social and cultural life. Content mining will also have a profound impact for understanding society and societal movements (for example, predicting political uprisings, analysing demographical changes). Use of such techniques has the potential to revolutionise the way research is performed – both academic and commercial.

PMR: This shows clearly the potential of ContentMining and the friction that the current legal system (mainly copyright) places on it, by default.
And a non-exhaustive list of benefits:


The potential benefits of content mining are vast and include:

  • Addressing grand challenges such as climate change and global epidemics

  • Improving population health, wealth and development

  • Creating new jobs and employment

  • Exponentially increasing the speed and progress of science through new insights and greater efficiency of research

  • Increasing transparency of governments and their actions

  • Fostering innovation and collaboration and boosting the impact of open science

  • Creating tools for education and research

  • Providing new and richer cultural insights

  • Speeding economic and social development in all parts of the globe

So what should be done? I’ll leave that to the next post.


CopyCamp 2: workshop on ContentMining – what is it and how to do it

In the last post I explained why I became interested in contentmining to do scientific research and started to explain how it is still a major political and legal challenge. I am excited that I have been asked to run a workshop at CopyCamp, and here is the information I am giving to participants. (You may also find my slides useful.)

Workshops on TDM/contentmining cover many areas and the precise format of this one will depend on the participants. On the program notes I suggested:

  • hackers who can make tools (such as R, Python, etc.) do exciting things
  • scientists (including citizens) who want to explore questions in bioscience
  • librarians who want to explore C21st ways of creating knowledge
  • open activists who want to change policy both by political means and using tools
  • young people. We have had wonderful contributions from a 15-year-old

So if everyone wants to talk about European and UK copyright politics, that’s fine. But we also have tools and tutorials showing how mining is done and we suggest people get some hands-on experience. It’s probably going to be a good idea to work in small groups where there are complementary skills:

Dear workshop participant:
I am delighted that you have signed up to my workshop on Friday 29th at CopyCamp.
Wikidata, ContentMine and the automatic liberation of factual data: (The Right to Read is the Right To Mine) The workshop will explore how Open Source tools can extract factual information from the Open Access scientific literature (specialising in BioMedicine). We will introduce Wikidata, a rapidly growing collection of 30 million high-quality data and metadata and use it to index scientific articles. Participants will query the literature at EuropePMC using “getpapers” and retrieve hundreds or thousands of full-text articles [snip…]
We will adapt the workshop to the skills and wishes of participants when we assemble, though please contact me earlier if there are things you would like to do. Topics can be chosen from:
* online demo of mining
* installation of full ContentMine software stack, and use of public repositories (EuropePubMedCentral, arXiv)
* introduction to WikiFactMine for extracting facts from open access publications.
* political and legal aspects of contentmining (with a European and UK slant)
If any participants are connected with (Polish) Wikipedia that could be valuable and exciting. (By default we shall use English Wikipedia). Note that Wikidata carries a large number of links to other language Wikipedias and this may be a valuable resource to explore.
If you want to run the full ContentMine stack it’s a good idea to install beforehand, so here are the instructions for *adventurous* members of the workshop:

This is a VM and should be independent of the operating system of the host machine. It has been tested in several installations but there may be problems with non-US/UK keyboards and encodings. By default the tutorial is in English (all the resources, EuropePMC, dictionaries are also in English and generally use only ASCII 32-127).

Of course anyone anywhere can also try out the tutorials.
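
For readers who would like a feel for what "querying the literature at EuropePMC" involves before installing anything, here is a minimal Python sketch. It is not getpapers and does not reproduce how getpapers works; it only builds a search URL for the public Europe PMC REST service and reads identifiers out of a decoded response. The endpoint address, the OPEN_ACCESS:y filter and the shape of the JSON are my assumptions about the current service and should be checked against its documentation; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of the public Europe PMC REST search service.
EUPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query, page_size=25, open_access_only=True):
    """Return a search URL for `query` (hypothetical helper, not part of getpapers)."""
    if open_access_only:
        # Assumed query syntax for restricting results to Open Access articles.
        query = f"({query}) AND OPEN_ACCESS:y"
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUPMC_SEARCH}?{urlencode(params)}"


def extract_pmcids(response_json):
    """Collect PMC identifiers from a decoded search response.

    Assumes the response looks like {"resultList": {"result": [{...}, ...]}}
    and skips any record without a "pmcid" field.
    """
    results = response_json.get("resultList", {}).get("result", [])
    return [record["pmcid"] for record in results if "pmcid" in record]
```

Fetching the URL (for example with urllib.request) and passing the decoded JSON to extract_pmcids gives a list of articles to download. Collecting the full text in an organized way is the part that getpapers actually does for you.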

CopyCamp: why Copyright reform has failed TDM / ContentMining – 1 The vision and the tragedy

I am honoured to have been invited to speak at CopyCamp2017, “The Internet of Copyrighted Things”. I’ve not been to CopyCamp before, but I’ve been to similar events and I’m delighted to see it is sponsored by organisations, some of which I belong to, that are fighting for digital freedom. In these posts I’ll show why copyright has failed science; this post shows why knowledge is valuable and must be free.

I’m giving a workshop on Thursday and talking on Friday (after scares from Ryanair) and I’m blogging (as I often do) to clear my thoughts and help add to the static slides. This is the latest in a 40-year journey of hope, which is increasingly destroyed by copyright maximalism. I am being turned from an innovative scientist who had a dream of building something excitingly new to an angry activist who is fighting for everyone’s rights. I can accept when science doesn’t work because it often just doesn’t; I get angry when mega-capitalists are using science as a way to generate money and in their wake destroying something potentially wonderful.

Here’s the story. 45 years ago I had my first scientific insight – working with Jack Dunitz in Zurich – that by collecting many seemingly unrelated observations (in this case crystal structures) I could find new science by looking at the patterns between them (“reaction pathways”). This is knowledge-driven research, where a scientist takes the results of others and interprets them in different ways. It’s as old as science itself, exemplified in chemistry by Mendeleev’s collection of the properties of compounds and analysis in the Periodic Table of the Elements. Mendeleev didn’t measure all those properties – many will have been reported in the scientific literature – his genius was to make sense out of seemingly unrelated properties.

40 years ago chemists started to use computers to carry out simple chemical artificial intelligence – analysis of spectra and chemical synthesis. I was entranced by the prospect, but realised it relied on large amounts of knowledge to take it further. I was transformed by TimBL’s vision of the Semantic Web – where knowledge could be computed. I moved to Cambridge in 1999 with the long-term aim to create “chemical AI”. I created a dream – the WorldWide Molecular Matrix – where knowledge would be constantly captured, formalized and logic or knowledge engines would extract, or even create, new chemical insights.

To do this we’d need automatic extraction of information using machines – thousands of articles or even more. In 2005-2010 I was funded (with others) by EPSRC and JISC to develop tools to extract chemical knowledge from the scientific literature. It’s hard and horrible because scientific papers are not authored to be read by machines. I have spent years writing code to do this and now have a toolset which can read tens of thousands of papers a day (or more if we pay for clouds) and extract high quality chemistry. This chemistry is novel because it’s too expensive and boring to extract by hand and would be an important addition to what we have. As an example Nick Day in my group built CrystalEye which extracted 250,000 crystal structures, improved them and published them under an Open Licence – we’ve now joined forces with the wonderful Crystallography Open Database. Later Peter Corbett, Daniel Lowe, and Lezan Hawizy built novel, Open, software for extracting chemistry from the text of papers.

So now I have everything I want – thousands of scientific articles every day, maybe 10-15% containing some chemistry, and a set of Open tools that anyone can use and improve. I’m ready to try the impossible dream – of building a chemical AI…

What will it find?

NOTHING. Because if I or anyone use it without the PUBLISHER’s permission, the University will be immediately cut off by the publisher because …

… because it might upset their market. Or their perceived dominance over researchers. This isn’t a scare or over-reaction – there are enough stories of scientists of many disciplines being cut off arbitrarily to show it’s standard. One day 2 years ago the American Chemical Society’s automatic triggers cut off 200 universities. Publishers send bullying mails “you have been illegally downloading content” (totally untrue), or “stealing” (also untrue).

This is now so common that many researchers and even more librarians are scared of publishers. This blog has outlined much of this in the past and it’s not getting better. My dream has been destroyed by avarice, fear and conservatism. I’ll outline the symptoms, what needs to be done and urge citizens to own this problem and assert that they have a fundamental right to open scientific knowledge.

My slides at CopyCamp provide additional material.

WLIC/IFLA2017: UBER for scholarly communications and libraries? It’s already here…

You all know of the digital revolution that is changing the world of service – Amazon, UBER, AirBnB, coupled to Facebook, Google, Siri, etc. The common feature is a large corporation (usually from Silicon Valley) which builds a digital infrastructure that controls and feeds off service providers. UBER doesn’t own taxis, and takes no responsibility for their actions. AirBnB doesn’t own hotels, Amazon doesn’t have shopfronts. But they act as the central point for searches, and they design and control the infrastructure. Could it happen for scholcom / libraries? TL;DR it’s already happened.

You may love UBER, may accept it as part of change, or rebel against it. If you want to save money or save time it’s probably great. If you don’t care whether the drivers are insured or maintain their vehicles, fine. If you don’t care about regulation, and think that a neoliberal market will determine best practices, I can’t convince you.

But if you are a conventional service provider (hotels, taxis) you probably resent the newcomers. If you are blind, or have reduced mobility, and are used to service provision by taxis you’ll probably be sidelined. UBER and the rest provide what is most cost-effective for them, not what the community needs.

So could it happen in scholarly communications and academic libraries? Where the merit of works is determined by communities of practice? Where all the material is created by academics, and reviewed by academics? Isn’t the dissemination overseen by the Universities and their libraries? And isn’t there public oversight of the practices?


It’s overseen and tightly controlled by commercial companies who have no public governance, can make the rules and who can break the rules and get away with it. While the non-profit organizations are nominally academic societies, in practice many are controlled by managers whose primary requirement is often to generate income as much as to spread knowledge. The worth of scientists is determined not by community acclaim or considered debate but by algorithms run by the mega-companies. Journals are, for the most part, created and managed by corporations. Society journals exist, and new journals are created, but many increasingly end up being commercialised. What role does the Library have?

Very little.

It nominally carries out the purchase – but has little freedom in a market which is designed for the transfer of money, not knowledge. In the digital era, libraries should be massively innovating new types of knowledge, not simply acting as agents for commercial publishers.

So now Libraries have a chance to change. Where they can take part in the creation of new knowledge. To help researchers. To defend freedom.

It’s probably the last great opportunity for libraries:

Content-mining (aka Text and Data Mining, TDM).

This is a tailor-made opportunity for Libraries to show what they can contribute. TDM has been made legal and encouraged in the UK for 3 years. Yet no UK Library has made a significant investment, no UK Vice Chancellor has spoken positively of the possibilities, no researchers have been encouraged. [1]

And many have been discouraged – formally – including me.

Mining is as revolutionary as the printing press. Libraries should be welcoming it rather than neglecting or even obstructing it. If they don’t embrace it, then the science library will go the way of the corner shop, the family taxi, the pub. These are becoming flattened by US mega-corporations. Products are designed and disseminated by cash-fed algorithms.

The same is happening with libraries.

There is still time to act. Perhaps 6 months. Universities spend 20,000,000,000 USD per year (20 Billion) on scholarly publishing – almost all goes to mega-corporations. If they spent as little as 1% of that (== 200 Million USD) on changing the world it would be transformative. And if they did this by supporting Early Career Researchers (of all ages) it could change the world.

If you are interested, read the next blog post. Tomorrow.

[1] The University of Cambridge Office of Scholarly Communication ran the first UK University meeting on TDM last month.


ContentMine at IFLA2017: The future of Libraries and Scholarly Communications


I am delighted to have been invited to talk at IFLA, the global overarching body for Libraries of all sorts. I’m in session 232:
Congress Programme, IASE Conference Room 24.08.2017, 10:45 – 12:45

Session 232 Being Open About Open – Academic & Research Libraries, FAIFE and Copyright and Other Legal Matters


What’s FAIFE? It’s

The overall objective of IFLA/FAIFE is to raise awareness of the essential correlation between the library concept and the values of intellectual freedom  …
Monitor the state of intellectual freedom within the library community
Respond to violations of free access to information and freedom of expression

I share these views. But freedom of access and freedom of expression is under threat in the digital world. Mega-corporations control content and services and are actively trying to claw more control, for example by controlling the right to post hyperlinks to scholarly articles – even open access – (“Link Tax”).

And recently


I have spent 3-4 years on the edge of the political arena and I’ve seen how hard companies fight to remove our rights and to give them control.


And we need your help.

If you are a librarian, then you can only protect access to knowledge by actively fighting for it.

That means you. Not waiting for someone to create a product that you can buy.


By actively creating the scholarly infrastructure of the future and embedding rights for everyone.

Now, for the first and possibly the last time, we have an opportunity for libraries to make their own contribution to freedom.


I’ve set up the non-profit organization which promotes three areas for fighting for freedom:


  • Community. The community deserves better from academia, and the community is willing to help, if given the chance. The biggest communal knowledge creation is in Wikimedia and we are working with them to make high-quality knowledge universally created and universally available.



We now have tools which can create the next generation of scholarly knowledge – for everyone.


But YOU can and must help.


IFLA has very generously given us workshop time for a demonstration and discussion of Text and Data Mining (TDM)


Imperial Hall 23.08.2017, 11:45 – 13:30

Session 199 Text and Data Mining (TDM) Workshop for Data Discovery and Analytics – Big Data Special Interest Group (SIG)


We’ll be giving simple hands-on web demonstrations of Mining, interspersed with the chance to discuss policy and investment in tools, practices and people. Especially young people. No prior knowledge required.

This is (hopefully) the first of several blogs.


What is TextAndData/ContentMining?

I prefer “ContentMining” to the formal legal phrase “Text and Data Mining” because it emphasizes all kinds of content – audio, photos, videos, diagrams, chemistry, etc. I chose it to assert that non-textual content – such as diagrams – could be factual and therefore uncopyrightable. And because it’s a huge extra exciting dimension.


Mining is the process of finding useful information where the producer hadn’t created it for that specific process. For example the log books of the British navy – which recorded data on weather – are now being used to study climate change (certainly not in the minds of the British Admiralty). Records of an eclipse in ancient China have been used to study the rotation of the earth. Similarly, forty years ago I studied hundreds of papers of individual crystal structures to determine reaction pathways – again completely unexpected to the original authors.


In science, mining is a way to dramatically increase our human knowledge simply by running software over existing publications. Initially I had to type this in by hand (the papers really were papers) and then I developed ways of using electronic information. Ca 15 years ago I developed tools which could trawl over the whole of the crystallographic literature and extract the structures and we built this into CrystalEye – where the software added much more information than in the original paper. (We have now merged this with the Crystallography Open Database.) My vision was to do this for all chemical information – structures, melting points, molecular mass, etc. Ambitious, but technically not impossible. We had useful funding and collaboration with the Royal Society of Chemistry and developed OSCAR as software specifically to extract chemistry from text. Ten years ago things looked exciting – everyone seemed to accept that having access to electronic publications meant that you could extract facts by machine. It stood to reason that machines were simply a better, more accurate, faster way of extracting facts than pencil and retyping.


So what new science can we find by mining?

  • More comprehensive coverage. In 1974 I read and analyzed 100-200 papers in 6 months. In 2017 my software can read 10000 papers in less than a day.
  • More comprehensive within a paper. Very often I would limit the information because I didn’t have time (e.g. the anisotropic displacements of atoms). Now it’s trivial to include everything.
  • Aggregation and intra-domain analytics. By analysing thousands of papers you can extract trends and patterns that you couldn’t do before. In 1980 I wanted to ask “How valid is the 18-electron rule?” – there wasn’t enough data/time. Now I could answer this within minutes.
  • Aggregation and inter-domain analytics. I think this is where the real win is for most people. “What pesticides are used in what countries where Zika virus is endemic and mosquito control is common?”. You cannot get an answer from a traditional search engine – but if we search the full-text literature for pesticide+country+disease+species we can rapidly find those papers with the raw information and then extract and analyze it. “Which antibodies to viruses have been discovered in Liberia?”. An easy question for our software to answer, except it was behind a paywall – no-one saw it and the Ebola outbreak was unexpected.
  • Added information. If I find “Chikungunya” in an article, the first thing I do is link it electronically to Wikidata/Wikipedia. This tells me immediately the whole information hinterland of every concept I encounter. It’s also computable – if I find a terpene chemical I can compute the molecular properties on-the-fly. I can, for example, predict the boiling point and hence the volatility without this being mentioned in the article. The literature is a knowledge symbiont.
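The pesticide+country+disease+species search described above can be sketched as a simple co-occurrence filter. This is only an illustration, not the ContentMine implementation: the `FACETS` term lists and the function names `facet_hits` and `matches_all_facets` are invented for this example, and a real system would use large curated dictionaries rather than a handful of words.

```python
import re

# Hypothetical, tiny term lists for illustration only.
FACETS = {
    "pesticide": {"ddt", "permethrin", "malathion"},
    "country": {"brazil", "uganda", "colombia"},
    "disease": {"zika", "dengue"},
    "species": {"aedes aegypti", "aedes albopictus"},
}


def facet_hits(text):
    """Return, for each facet, the set of dictionary terms found in the text."""
    lowered = text.lower()
    hits = {}
    for facet, terms in FACETS.items():
        hits[facet] = {
            term
            for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", lowered)
        }
    return hits


def matches_all_facets(text):
    """True only if at least one term from every facet occurs in the text."""
    return all(facet_hits(text).values())


# Toy "full texts" standing in for downloaded papers.
papers = {
    "paper-a": "Permethrin spraying against Aedes aegypti in Brazil during the Zika outbreak.",
    "paper-b": "Dengue cases in Colombia rose sharply last year.",
}
selected = [name for name, text in papers.items() if matches_all_facets(text)]
```

Only papers that mention a term from every facet survive the filter; those are the ones worth passing on to more detailed extraction.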


Everyone is already using the results of Text Mining. Google and other search engines have sophisticated language analysis tools that find all sources with (say) “Chikungunya”. What I want to excite you about is the chance to go much further.

Why do we need other search engines when we have “Google”?


  • Google shows you what it wants you to see. (The same is true for Elsevinger). You do not know how these were selected, it’s not reproducible, and you have no control. (Also, if you care, Google and Elsevinger monitor everything you do and either monetize it or sell it back to your Vice-Chancellor).
  • Google does not allow you to collect all the papers that fit a given search. They give links – but try to scrape all these links and you will be cut off. By contrast Rik Smith-Unna, working with ContentMine (CM), developed “getpapers” – which is exactly what the research scientist needs – an organized collection of the papers resulting from a search. ContentMine tools such as “AMI” then allow detailed analysis of the contents of the papers.
  • Google can’t be searched by numeric values. Try asking for papers with patients in the age range 12-18 and it’s impossible (you might be lucky that this precise string is used but generally you get nothing). In contrast CM tools can search for numbers, search within graphs, search species and much more. “Find all diterpene volatiles from conifers over 10 metres high at sea level in tropical latitudes” is a straightforward concept for CM software.
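To make the numeric-search point concrete, here is a minimal sketch of finding papers that mention an age range overlapping 12-18. It is not how the CM tools work internally; the pattern `AGE_RANGE` and the function names are assumptions made for this example, and the pattern only catches simple phrasings such as “aged 13 to 17”.

```python
import re

# Matches phrases like "aged 13 to 17", "age 12-18" or "aged 12–18".
AGE_RANGE = re.compile(
    r"\baged?\s+(\d{1,3})\s*(?:-|\u2013|to)\s*(\d{1,3})",
    re.IGNORECASE,
)


def age_ranges(text):
    """Extract (low, high) integer pairs for every age range found in the text."""
    return [(int(low), int(high)) for low, high in AGE_RANGE.findall(text)]


def overlaps(found, wanted):
    """True if two closed intervals share at least one value."""
    return found[0] <= wanted[1] and wanted[0] <= found[1]


def mentions_age_range(text, wanted):
    """True if any age range in the text overlaps the wanted range."""
    return any(overlaps(found, wanted) for found in age_ranges(text))
```

Because the ranges are compared as numbers, “aged 13 to 17” counts as a hit for 12-18 even though the string “12-18” never appears, which is exactly what a string-matching search engine cannot do.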


That’s a brief introduction – and I’ll show real demos tomorrow.


Text and Data Mining: Overview

Tomorrow the University of Cambridge Office of Scholarly Communication is running a 1-day Symposium on Text and Data Mining. I have been asked to present ContentMine, a project funded by the Shuttleworth Foundation through a personal Fellowship, which has evolved into a not-for-profit company.

I hope to write several blog posts before tomorrow, and maybe some afterwards. I have been involved in mining science from the semi-structured literature for about 40 years and shall give a scientific slant. As I have only got 20-25 minutes I am recording thoughts here so people can have time to explore the more complex aspects.

Machines are now part of our information future and present, but many sectors, including academia, have not embraced this. Whereas supermarkets, insurance, social media are all modernised, scholarly communication still works with “papers”. These papers contain literally billions of dollars of unrealised value but very few people care about this. As a result we are not getting the full value of technical and medical funding, much of which is wasted through the archaic physical format and outdated attitudes.

These blog posts will cover the following questions – how many depends on how the story develops. They include:

  • What mining could be used for and why it could revolutionise science and scholarship
  • Why TDM in the UK and Europe (and probably globally) has been a total political and organizational failure.
  • What directions are we going in? (TL;DR you won’t enjoy them unless you are a monopolistic C21st exploiter, in which case you’ll rejoice.)
  • What I personally am doing to fight the cloud of digital enclosure.

There are 3 arms to ContentMine activities:

  • Advocacy/political. Trying to change the way we work top-down, through legal reforms, funding, etc. (TL;DR it’s not looking bright)
  • Tools. ContentMining needs a new generation of Open tools and we are developing these. The vision is to create intelligent scientific information rather than e-paper (PDF). Much of this is recently enhanced by the development of
  • Community. The great hope is the creative activity of young people (and people young at heart). Young people are sick of the tired publisher-academic complex which epitomises everything old, with meretricious values.

This sounds very idealistic – and perhaps it is. But the Academic-Publisher complex is all-pervasive – it kills young people’s hopes and actions. Our values are managed by algorithms that Elsevinger sells to Vice-chancellors to manage “their” research. The AP complex has destroyed the potential of TDM in the UK and elsewhere and so we must look to alternative approaches.

For me there is a personal sadness. 15 years ago I could mine the literature and no-one cared. I had visions of building the Open shared scientific information of the future. I called it – after Gibson’s vision of the matrix in cyberspace. It draws on the vision of TimBL and the semantic web, and the idea of global free information. It was technically ahead of its time by perhaps 15 years, but now – with Wikidata, and modern version control (Git) – we can actually build this.

So my vision is to mine the whole of the scientific literature and create a free scientific resource for the whole world.

It’s technically possible and we have developed the means to do it. And we’ve started. And we will show you how, and how you can help.

But we can only do it on a small part of the literature because the Academic-Publisher complex has forbidden it on the rest.



How Wikidata can change the world of scientific information 1/n

>> Hang on! What’s Wikidata? And Wikimedia? I’ve heard of Wikipedia, but…

Wikipedia is a free encyclopedia. It doesn’t do everything. It’s one of about 12 projects under the aegis of the Wikimedia Foundation. It’s the one everyone has heard of, but there are lots of others which are also about making structured information and knowledge available for free and freely reusable by everyone. For example Wikimedia Commons is a huge resource of free images, videos, etc. Many of them are linked from Wikipedia articles but there are lots more which can be re-used in all sorts of ways. Teaching, research, new media …

>> OK, so Wikidata is the same thing for data? …

… Yes, but it’s not “all the world’s free data”. It’s carefully described data, carefully selected, and with clear provenance. When you find some Wikidata you know:

  •  what it is
  • where it came from
  • how it can be used
  • what other data it is related to

>> So give me an example. If I want to find out where Zika is endemic, then can I find it in Wikidata?

… Yes. Good example. Actually “Zika” represents quite a lot of different things. It represents a virus…

>> Yes, but surely that’s it?

… No, it also represents the fever caused by the virus. They aren’t the same …

>> OK, I can see that. OK there would have to be two entries…

… No, there’s more. Do you know where Zika virus was first discovered?

>> In Africa? But no idea where…

… In the Zika forest – in Uganda. The virus was named after the forest. So it’s got a separate identifier. Lots of diseases are named after the place where they were first identified.

And then there are people called “Zika”

>> But they wouldn’t cause any confusion?

… Yes, some of them are authors of scientific papers. Which have nothing to do with Zika virus, Zika forest, Zika fever…

>> H’mm. So if I search for “Zika” in G**gle, I’ll get all of these?

… G**gle will guess what you want, and add in what it and its sponsors want you to see. So I didn’t find any authors in the first 4 pages. It’s powerful, but it’s not objective, and it’s not reproducible. If you search tomorrow you’ll get different results.

>> And Wikidata is more objective?

… Yes. Wikidata has different entries (items) for each of the categories above. The virus, the fever, the forest and the authors have different identifiers.

>> identifiers?

… Yes. Good information systems have unique identifiers for each piece of information. Your passport number is unique. That’s what the machines read at airports. So here are some identifiers:

  • Zika Virus Q202864
  • Zika Fever Q8071861
  • Zika Forest Q22138769

Have a look at the Zika virus item; that’s got masses of information about Zika virus. Oh, and here’s a botanist, Peter Francis Zika, whose Wikidata identifier is Q21613657.

>> Help – that’s too much at once…

… understood

>> H’m. So does everything in the scientific world have an identifier in Wikidata?

… no – there’s far too much. Even G**gle won’t get everything. But everything with a Wikipedia article will (or should) have a Wikidata item. And lots of things are in Wikidata that don’t have articles. The Wikidata community has imported lots of information directly from authoritative sources.

>> OK, so I can assume that every *important* scientific fact is in Wikidata?

… that depends on what is “important”. But there are already huge amounts of bioscientific information. Drugs, diseases …

>> Hm, my brain is really starting to overheat. Let’s take a break and come back. Maybe with some more examples??

… certainly with some more examples. I’ll show you how items can be linked together by properties…

>> OK. We’ve not even talked about how it will change science. You may have to reteach me some of this when we next meet…

… Just remember “Wikidata”. Be seeing you.
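The point about identifiers can be shown with a small sketch. It uses only the four identifiers quoted in the conversation above; the table `ZIKA_ITEMS` and the helper names are made up for this illustration, and it does not query Wikidata itself.

```python
# Identifiers as quoted in the conversation above; descriptions are informal labels.
ZIKA_ITEMS = {
    "Q202864": "Zika virus",
    "Q8071861": "Zika fever",
    "Q22138769": "Zika Forest",
    "Q21613657": "Peter Francis Zika (botanist)",
}


def candidates(label):
    """Return (identifier, description) pairs whose description mentions the label."""
    wanted = label.lower()
    return sorted(
        (qid, description)
        for qid, description in ZIKA_ITEMS.items()
        if wanted in description.lower()
    )


def entity_url(qid):
    """Build the public Wikidata page address for an identifier."""
    return f"https://www.wikidata.org/wiki/{qid}"
```

Searching on the bare word “Zika” returns four different things, while each identifier points at exactly one of them. That is why software should store the identifier rather than the word.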

The critical role of e-Theses: award acceptance speech at NDLTD

I am honoured by this award; I’ll describe the current struggle for ownership of digital scholarly knowledge, emphasize young people and machine-understandable theses and suggest practices.


Early Career Researchers see the digital literature – including theses – as a primary research resource. We’ve set up ContentMine – a non-profit supporting machine reading and analysis of scholarship. There are 10-20,000 journal articles a day – and several hundred theses – so machines are essential. Today we’re announcing 6 ContentMine fellows – all of whom have exciting projects to create new bioscience from the scholarly literature.


But this brave new world is often opposed by the Publisher-Academic complex. Academia feeds knowledge and public money into companies who in return define the scholarly infrastructure and the rules by which Academia has to play.

The key issue is who controls scholarship? Universities? Students? Researchers? Or corporations only answerable to their shareholders? How many universities have been arbitrarily cut off by publishers with the accusation that “their” content is being stolen? Knowledge that should be available to the whole world is being controlled and monitored. Increasingly, universities are acquiescent and even required by publishers to police “compliance”.


Last month one of our fellows – a graduate student colleague in the Netherlands – was legally mining the literature to detect malpractice – such as unjustifiable statistical procedures. After 30,000 downloads a publisher cut off the University and – without discussion – wrote denouncing him for “stealing” content. They required his research be stopped. The University complied. Then another publisher. And a third. Last month Cambridge was cut off for 3 weeks by one publisher. No explanation. No dialogue.


Europe is trying to reform copyright to support research. I am working with them, but there’s massive lobbying by publishers. They want to control and monitor everything. Textual content, repositories for data, metadata, metrics for academic glory.


Machine-understandable e-theses represent one of the remaining areas not controlled by publishers. They are a new opportunity for universities and a knowledge resource for everyone – citizens as well as academics. They report billions of dollars of research, and are often the only place where it’s published. To maximize the spread of knowledge – which young people are passionate about – here are some suggestions:

  • Be proud of theses.
  • Think of “use” rather than “deposit”
  • Make theses globally discoverable.
  • Involve citizens everywhere. Think of the Global South.
  • Don’t repeat the mistakes of the “West”. Do it differently.
  • Release immediately.
  • Use DOCX, TeX, CSV, SVG, XHTML, besides PDF.
  • Use versioned text and data (Git, Dat …).
  • Use openly controlled international repositories.
  • Use permissive licences allowing mining and re-use.
  • Do not hand over rights for content, discovery or access.
  • Don’t buy systems – Encourage young people to build them.
  • Experiment with Open Notebook Science.
  • Encourage and use e-theses as a primary tool for research.
  • Use Wikipedia / Wikidata as the default metadata for scholarship.


And a warning: Unless libraries take this type of opportunity now they will be increasingly replaced by commercial services and disappear. E-theses and young people are your chance.


“Dialogue” with Elsevier – story-2 (“Despicable” Legal Weasel Words)

ContentMine is going to mine the whole scholarly literature (10,000 articles every day). We’d hoped to do this some months ago, and one of the reasons we haven’t is the massive pushback from major publishers. Technically, legally, politically.

UK government note: You are about to spend about 40 M GBP each year with Elsevier. The real costs are about 2.5 M GBP according to Bjoern Brembs. A significant amount of the rest (even after the huge profit of ca. 38% (yes!)) is spent on lobbyists, reps, lawyers, firewalls, captchas, etc. Much of their time is spent trying to make it as difficult as possible to create the Scholarly Commons [1] where we can read, use and re-use the literature without constantly looking up to worry about publishers.

So one of the aspects is legal agreements. We need legal agreement in all sorts of areas, buying houses, hiring staff, etc. These are often between two parties and they negotiate (e.g. on price and exactly what is included) and most of the time it’s relatively understood what the bargain is.

But not with Elsevier. Elsevier produce devious, complex, bespoke legal agreements unlike any other publisher. They never use a standard form if they can complicate and mislead. You may think I’m being unfair and biassed, but I have spent many days challenging them over text and data mining (TDM). They put in specific restrictions and clauses about what they hold onto. Despite the fact that TDM is legal in the UK, they try to persuade you that you have to make a separate agreement with them (an API). You don’t. What they do is probably legal, but it’s immoral and unethical.

Here’s the most recent unpleasantness. A common way to publish your work as Open Access is to pay the publisher (often a lot of money) to allow you to use a CC BY licence. And you retain all rights as author. Straightforward publishers like BMC have done this for 10 years and I have published with them perfectly happily.

So when you hear that Elsevier’s licence is CC BY you think fine, I continue to own the paper and Elsevier have a non-exclusive right to use it.

But no. Elsevier has written weasel words into the small print. You no longer own the paper. It may be CC BY but it’s Elsevier’s. And the weasel words are there to look like you are getting what you paid for, but actually you have to be a lawyer to be sure that you have actually been fooled.

Does this matter? At first sight not. And if you trust Elsevier, maybe not. But I don’t, and nor does Heather Morrison and nor does Michael Eisen. So let’s listen to them:

Here’s one of the ubiquitous Elsevier staff trying to convince Michael, and here’s Michael’s response. I’ll leave it there; the TL;DR is that this contract is misleading and should be rejected. Michael calls it “despicable”. I wish that Universities treated licences as serious and challenged them rather than leaving Michael, Mike Taylor, Heather, Charles Oppenheim, Ross Mounce, me, etc. to do this work voluntarily. After all it’s the Universities who contract with the publishers, and they just don’t seem to care whether their money is well spent.

Read the following from MikeE. If you teach law students, set it as an exercise to pick holes in…

<quote from=”mikeEisen” >
Elsevier is tricking authors into surrendering their rights
By MICHAEL EISEN | Published: MAY 24, 2016
A recent post on the GOAL mailing list by Heather Morrison alerted me to the following sneaky aspect of Elsevier’s “open access” publishing practices.

To put it simply, Elsevier have distorted the widely recognized concept of open access, in which authors retain copyright in their work and give others permission to reuse it, and where publishers are a vehicle authors use to distribute their work, into “Elsevier access” in which Elsevier, and not authors, retain all rights not granted by the license. As a result, despite highlighting the “fact” that authors retain copyright, they have ceded all decisions about how their work is used, if and when to pursue legal action for misuse of their work and, crucially, if they use a non-commercial license they are making Elsevier the sole beneficiary of commercial reuse of their “open access” content.

For some historical context, when PLOS and BioMed Central launched open access journals over a decade ago, they adopted the use of Creative Commons licenses in which authors retain copyright in their work, but grant in advance the right for others to republish and use that work subject to restrictions that differ according to the license used. PLOS and BMC and most true open access publishers use the CC-BY license, whose only condition is that any reuse must be accompanied by proper attribution.

When PLOS, BioMed Central and other true open access publishers began to enjoy financial success, established subscription publishers like Elsevier began to see a business opportunity in open access publishing, and began offering a variety of “open access” options, where authors pay an article-processing charge in order to make their work available under one of several licenses. The license choices at Elsevier include CC-BY, but also CC-BY-NC (which does not allow commercial reuse) and a bespoke Elsevier license that is even more limiting (nobody else can reuse or redistribute these works).

At PLOS, authors do not need to transfer any rights to the publisher, since the agreement of authors to license their work under CC-BY grants PLOS (and anyone else) all the rights they need to publish the work. However, this is not true with more restrictive licenses like CC-BY-NC, which, by itself, does not give Elsevier the right to publish works. Thus, if either CC-BY-NC or Elsevier’s own license are used, the authors have to grant publishing rights to Elsevier.

However, as Morrison points out, the publishing agreement that Elsevier open access authors sign is far more restrictive. Instead of just granting Elsevier the right to publish their work:

Authors sign an exclusive license agreement, where authors have copyright but license exclusive rights in their article to the publisher**.

**This includes the right for the publisher to make and authorize commercial use, please see “Rights granted to Elsevier” for more details.

(Text from Elsevier’s page on Copyright).

This is not a subtle distinction. Elsevier and other publishers that offer it routinely push CC-BY-NC to authors under the premise that they don’t want to allow people to use their work for commercial purposes without their permission. Normally this would be the case with a work licensed under CC-BY-NC. But because exclusive rights to publish works licensed with CC-BY-NC are transferred to Elsevier, the company, and not the authors, are the ones who determine what commercial reuse is permissible. And, of course, it is Elsevier who profit from granting these rights.

It’s bad enough that Elsevier plays on misplaced fears of commercial reuse to convince authors not to grant the right to commercial reuse, which violates the spirit and goals of open access. But to convince people that they should retain the right to veto commercial reuses of their work, and then seize all those rights for themselves, is despicable.

– See more at:



[1] Maryann Martone’s phrase.


“Dialogue” with Elsevier – story-1 (Will Elsevier publish Crystallographic Data?)

TL;DR. I continue to try to get public data out of Elsevier. I think I should be able to – every other publisher has no problem. After some not-very-useful replies Elsevier simply give up answering me.

Over the last 7-8 years I have had major issues with Elsevier on many aspects – licensing, paywalls, availability etc. It’s normally impossible to find anyone who gives me a straight answer. I believe that a modern company should have a clear channel of communication – where requests are handled formally and there is accountability when things go wrong.

Indeed some do: Cambridge and Oxford University Presses. They are parts of the Universities and so have to abide by Freedom Of Information request rules: they must give clear public answers to questions within a given period of time (20 working days), and I have used this. By contrast Elsevier find many ways of not answering questions.

So I try again – and I leave it to you to decide whether this is a company that the UK should give 40 M GBP of taxpayers’ money to. When I go to meetings about scholarly publishing and informatics there are increasingly representatives from Elsevier who mingle with the other delegates.  They are “friendly” and “want to help us”. So here’s the first story – there may be more. Always remember that we are paying them.

Background: Everyone (except Elsevier) thinks that non-sensitive scientific data accompanying an article should be in the public domain. This is critical because the data:

  • is there to support the claims in the article.
  • can be re-used by others for many purposes (data-driven science, deriving parameters, aggregation, simulation – a huge list). I have spent much of my scientific career re-using public data.

I’m going to take crystallographic data – my field, but also central to much modern science. Its publication has been supported by International Unions, CODATA, and many other respected scientific bodies.

And almost all publishers make it public. American Chemical Society, Royal Society of Chemistry, Acta Crystallographica and many more.

But not Elsevier. They either hide it behind a paywall, or send it to the Cambridge Crystallographic Data Centre, who provide it under a subscription licence (a trivial amount, probably < 1%, is available for free, but NOT for re-use). I wrote to the “Director of Universal Access” some years ago and got waffle.

I and many others think this is outrageous. It’s public data, not Elsevier’s. The science in the paper is seriously diminished without the data. I help run the Crystallography Open Database (COD) which has hundreds of thousands of structures. Will Elsevier give these back to the public?

The only route that I have is the “helpful” reps I meet. So a month ago I met one. He agreed to take my concern into Elsevier. At least I would get a clear answer…

I am including all the letters. I have removed the name of the Elsevier rep.

[1,2,3] TL;DR he agreed to take something on. It wasn’t his department. He’s sent it off.

[4,5] He discovers that authors can send their files to [1] CCDC or [2] Elsevier (behind the paywall). This defines the scope of the question. (It doesn’t tell me anything I didn’t already know). Most of the data are in the second category.

[6] PMR reiterates that by hiding data Elsevier is going against all other responsible parties in the field.

[7] Elsevier replies that they assume I only want data of type 2 and that they can make them available “openly” behind a Mendeley login.

[8] PMR replies that we want all the data (this is consistent with every other publisher) and that data behind a Mendeley login are not open. PMR lists 6 questions that he would like answered.

I have not had the courtesy of a reply even after 19 days, so I can only assume Elsevier regard me as not worth continuing to answer.


[1]  PMR 2016-04-19
I thank you for our conversation yesterday, where you agreed that all factual supplemental crystallographic data published with papers in Elsevier journals should be made available without restrictions (effectively CC0). You agreed that you would work with your technical colleagues to see how this could be done as soon as possible (“flipping a switch”). You agreed that this would bring Elsevier into line with most other major publishers (ACS, RSC, IUCr, Nature) who have for many years released all their crystallographic data (“CIF”s) into public view on their websites, without restrictions. This data would include both current published data (probably back to about 1990) and all future data.

The Crystallography Open Database (COD) (PMR is a board member) has a 10-year record of accepting and validating CIFs from all domains (organic, organometallic, inorganic, metals and alloys) and then offering them to the world for re-use under an effectively CC0 licence. It also provides a variety of modern search and analysis software.

I ask that you commit to this publicly now and am confident that COD will be willing to host the data if Elsevier does not wish to mount them on its web pages.

[2] Elsevier 2016-04-19 =======

To be very clear – what we agreed was that I would look into this and get back to you with a clear response one way or the other. I made no commitment as this is not in my area of responsibility. I’d appreciate in the interests of establishing trust between us that we are both careful in reporting our conversations accurately.

Per our conversation I have already reached out to my colleagues to understand the current situation w.r.t. Crystallographic supplemental files in our journals. I will let you know as soon as I hear back.

[3] PMR: 2016-04-19 =========

Thank you,

I would not intend to publish anything representing your views and position that you weren’t happy with.

For reference I shall forward you Elsevier’s less-than-useful reply 3 years ago. If you can do better than this, fine, else it will be a waste of my time.

It would be useful for you to set a tight timescale. If you aren’t able to give a clear yes/no in a month from now it will be yet another “we’ll look into it for you” that disappears, and I shall regard it as “no”.

After a month I shall announce Elsevier’s decision as reported to me.

If you want technical help and explanation of what we want and how to make it available, then I am sure Saulius will be delighted to help.

[4] Elsevier 2016-04-19 ==========

I’ll follow up as promised and let you know the outcome

[5] Elsevier 2016-05-05 ==========

I wanted to let you know that I am making some progress in discussions with internal colleagues w.r.t. how we currently treat CIF files, but I haven’t got fully to the bottom of the story. A couple of facts I have discovered:

  • We give authors the choice as to whether they deposit their CIF files with an external database and provide us with a link to the file, or have us host their files as Supplementary Material. You can see examples of the two cases as follows
  1. Articles linking out to CCDC using data banner links (e.g.
  2. Articles with CIF files delivered as supplementary material (e.g.

I will keep you informed as I learn more. However, for confirmation I assume that your main interest is in securing open access to files in category 2 above?


[6] PMR 2016-05-05 ============

To clarify our relationship. I am acting as a board member of the Crystallography Open Database, a Public Interest Research Organization (PIRO), and copying them. You are a formal representative of Elsevier / RELX. I regard our correspondence as in the public interest and intend to publish all of it.

I list below a number of direct, simple questions to which I request answers. These are ones that I would expect organizations subject to Freedom Of Information requests (including, for example OUP and CUP, and myself) to have to answer. Although Elsevier is not subject to FOI I expect the same comprehensiveness and clarity of response. There is also a request for crystallographic data.

I have set out our expectations. In summary: All major publishers except Elsevier make their crystallographic data fully and publicly available, effectively CC0. Elsevier’s policy is in direct conflict with national and international science organizations such as CODATA, ICSU and the International Union of Crystallography. Elsevier’s position in withholding scientific data is in direct opposition to the norms and expectations of the scientific world.

[7] Elsevier 2016-05-05 ==========

I wanted to let you know that I am making some progress in discussions with internal colleagues w.r.t. how we currently treat CIF files, but I haven’t got fully to the bottom of the story.

  1. Articles linking out to CCDC using data banner links (e.g.
  2. Articles with CIF files delivered as supplementary material (e.g.


I will keep you informed as I learn more. However, for confirmation I assume that your main interest is in securing open access to files in category 2 above?


[8] PMR 2016-05-05 ============

Thank you. I have allocated you the same length of time (one month/20 working days) as for a UK FOI request to provide information. If you are also, as I expect, working towards a change in Elsevier policy and practice, then it will be necessary at the end of the month to detail what you have set in motion and with what expected timescale. Until 21st May I accept that I will not make our discussions public.

>ELS> For articles with CIF files that we host as supplementary material, we are still evaluating both from the technical point of view and the legal point of view the feasibility of making these available openly via our new Mendeley data platform (

The word “openly” is imprecise.  I note that Mendeley requires a login which is inconsistent with Openness. I assume therefore that Mendeley will impose its own terms and conditions, which by definition will be inconsistent with CC0.

Note that under the new 2014 UK exception to Copyright I can legally mine the data associated with any Elsevier publication that I have the right to read. Since the data itself is uncopyrightable, and since a journal is not a database covered by European sui generis database rights,  I can therefore download all CIFs as part of my personal non-commercial research and I can publish the data from that research. There is no benefit in using Mendeley. From Elsevier’s point of view it would possibly be preferable to bundle this historic data now and ship it to COD and we would be happy to make this technically possible. Otherwise I shall extract it under the UK law.

>ELS> I will keep you informed as I learn more. However, for confirmation I assume that your main interest is in securing open access to files in category 2 above?

>PMR> **NO**. Our interest is in ALL supporting crystallographic data that is associated with ALL Elsevier publications, in line with all other major publishers.

I would therefore like answers to the following questions and will publish answers when the month is up. I have framed them so that many can be answered with Yes/No/DeclineToAnswer. I use the phrase “NonOpen database” to refer to databases such as those run by CCDC, ICSD and other organizations which do not make the total data available under CC0. (Note that if these questions were submitted to OUP I would expect them all to be fully answered under FOI).

  1. Does Elsevier hold copies of ALL raw CIFs associated with Elsevier publications, or if not can it obtain these CIFs?
  2. Please provide a complete list of all NonOpen databases that Elsevier requires or allows authors to submit crystallographic data to. Please indicate whether Elsevier has the right to obtain ALL the crystallographic data in BULK associated with their publications. Does Elsevier have formal contractual relations with these NonOpen databases? Please indicate what these contracts allow and forbid.
  3. Please indicate how Elsevier decides on its policy on crystallographic data. Does it consult with ICSU, CODATA or IUCr? When was the policy last reviewed? What is the mechanism for PIROs to formally request changes in policy?
  4. Please provide a list of all files of Type 1 and Type 2 and a service where updates of these lists can be obtained.

These are requests for information.

Our request for crystallographic data, which is consistent will all major scientific bodies, funding bodies and all major publishers other than yourselves, for the files themselves is:

5.Please provide all files of type 1 and 2 , or an open mechanism (e.g. an API) where all these files can be obtained. Please confirm that redistribution of the files is permitted without further permission.

  1. Please indicate that Elsevier is committing to changing the policy to make supplemental data files publicly, freely and openly available. Please indicate the process that has been initiated and how it will report back to the world.

I will publish the correspondence, unedited, on 21st May.



19 days have elapsed without the courtesy of a reply

Taxi Ken and I discuss the UK’s negotiations with Elsevier

When I go to the airport by taxi [1] I try to get the same taxi driver, Ken [2]. Ken is a shining example of why every citizen of the world needs access to the whole scholarly literature – open and for free.

You often hear publishers (and some academics) say “ordinary people wouldn’t understand the science”. This is appallingly arrogant, and blatantly untrue. In the taxi we are discussing whether people listen more to scientists from Cambridge than to those from less-well-known universities:

PMR: Doctors in a Western Australian hospital struggled for many years to convince the medical profession of the true cause of stomach ulcers

KEN: You mean Campylobacter.

This is the point. Ken has no University education. But he knows the cause of ulcers and he knows the precise scientific name [3]. (Read the story of Barry Marshall and Robin Warren; everyone should be able to follow it. They published their results [4] in The Lancet, a well-known medical journal.)

Oh, dear, Ken. I’m sorry you can’t read this classic paper unless you fork out 36 USD – and you would then have just 24 hours to read it. And you can’t show it to your mates – that’s copyright violation. Oh, and all the money goes to Elsevier – none to the authors.

Taxi drivers are an underclass. Only academics in Cambridge are allowed to read about Helicobacter. What? It was published in 1983? Yes, that’s far too recent to make it Open and Free for taxi-drivers. One of the most important papers in science? Won a Nobel prize? You expect taxi-driver tax-payers to be allowed to read work they fund?? Sorry. Just keep driving taxis.

It’s a moral imperative to publish science for everyone. Not just academics but also taxi drivers. Next time you are in a taxi, don’t sit back but ask your driver: “are you interested in science?”. Not everyone is, but everyone who is interested in science can be a scientist. It’s a matter of attitude and philosophy, not a white coat.

PMR: So we had a meeting last week to discuss negotiations with Elsevier.

KEN: Elsevier, the publisher?… (Ken is interested in politics, and science/Cambridge. He knows about Elsevier.)

PMR: Yes…

And I continue to set the scene:

A week ago a small selected group of concerned Cambridge academics (including PMR), and library staff, met with Jisc (who are advising HEFCE – who fund English universities), to find out about Jisc’s negotiations with Elsevier about university subscriptions. Almost 40M GBP a year of taxpayer and student money for academics to read journals. Until now I didn’t even realise there was a negotiation – it has been kept very quiet indeed. The deal has to be concluded by 24:00 2016-12-31; if not, the subscriptions are cancelled and even Cambridge academics won’t be able to read The Lancet. (The journal which Ken still can’t read anyway, thus bringing Cambridge academics and taxi-drivers even closer together.)

The meeting opened my eyes to the massive and visceral resistance that Elsevier was putting up against any normal “negotiations”. (I’ve negotiated with equipment suppliers before – one deal was >2M GBP in today’s money.) What came over to me very clearly was that this is not about price. It’s a battle for control. Who makes the decisions about the dissemination of scholarly knowledge, whether or not paid for by the taxpayers and student fees?

Elsevier (along with Digital Science from Nature/Springer, etc.) are rapidly taking over our academic infrastructure. Last week they bought SSRN, the social sciences repository. They now control preprints in a significant part of academia. They are selling the Universities PURE – a system of repositories that are under Elsevier control and where I fear that we are the product as well as the customers. What does Open matter if the gateway is controlled by a publisher who Openwashes the language to legitimize its control?

It’s not because Elsevier are the biggest, it’s because they are the most ruthless, arrogant publisher. I have been dealing with them for ca 8 years. They treat me as a nobody – a nuisance. I’m far from the only one. It’s critical that we wrest back control. After all it’s us who are paying the money. And it’s every taxi driver in the world who ultimately suffers.

The UK is not the first country to negotiate with Elsevier. Last year (2015) the Dutch did. They announced that unless they got a set of nonnegotiable demands they would walk away from Elsevier and cancel subscriptions.

What actually happened?

I don’t know. The Dutch agreed something with Elsevier. What? Don’t know, because Elsevier requires secrecy and the Dutch agreed. You can read reports before the deal and after the deal. Maybe, Ken, you can make more of them than me.

The worst possible thing would be an announcement on 2017-01-01:

“The UK and Elsevier have concluded a 40M GBP agreement about purchase of Elsevier publications and services. The details are commercially sensitive but [some important person] says: ‘This is a good deal for the UK …’”

The thinking world will say: “The Dutch dug their heels in a bit but finally Elsevier won. The same has happened in the UK”.

The unthinking will say: “I didn’t even realise that the UK and Elsevier were negotiating. Ah well, I don’t read the scientific literature because I can’t afford it.”

Ken had already said – his words – “The theft of Knowledge”.

This has to be fought in public, starting now. Elsevier are effectively a monopoly. The people of Europe – 250,000 – fought software patents and won. The people of many countries took on Microsoft and won. The people of the UK, and later Europe, should take on Elsevier and win. It won’t be easy but it has to be done.

And if the taxi drivers take to the streets it can happen.

And in case you are wondering why we are paying 40 Million GBP for electronic content, which Elsevier neither authors nor referees, here is Bjoern Brembs with the breakdown of costs. Bjoern – a neuroscientist – asks “Why haven’t we already canceled all subscriptions?”

The question in the title is serious: of the ~US$10 billion we collectively pay publishers annually world-wide to hide publicly funded research behind paywalls, we already know that only between 200-800 million go towards actual costs. The rest goes towards profits (~3-4 billion) and paywalls/other inefficiencies (~5 billion).

So the UK will be paying 20M GBP to Elsevier for gross inefficiencies (which from my own experience I can confirm) and technology to stop Ken, and Chris Hartgerink (“Elsevier stopped me doing my research”), reading science.

And we should pay them just 3M GBP at the most.

Or, much better, do it ourselves. We’d do it cheaper, better, faster. Bjoern, a senior scientist, says so and I agree.
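The 20M and 3M GBP figures above come from scaling Bjoern’s worldwide proportions down to the UK’s roughly 40M GBP. Here is a minimal sketch of that arithmetic. It assumes the UK spend splits in the same proportions as the worldwide total, which is a simplifying assumption rather than a known fact, and the variable and function names are purely illustrative:

```python
# Worldwide estimates quoted from Bjoern Brembs, in billions of USD per year.
WORLD_TOTAL = 10.0
WORLD_ACTUAL_COSTS = (0.2, 0.8)   # "between 200-800 million"
WORLD_INEFFICIENCIES = 5.0        # "paywalls/other inefficiencies"

# Approximate UK spend with Elsevier, in millions of GBP per year.
UK_DEAL_GBP_M = 40.0


def uk_share(world_amount):
    """Scale a worldwide amount to the UK deal, assuming identical proportions."""
    return UK_DEAL_GBP_M * world_amount / WORLD_TOTAL


# Portion of the UK deal attributable to paywalls and other inefficiencies.
inefficiencies = uk_share(WORLD_INEFFICIENCIES)

# Lower and upper bounds on the portion attributable to actual publishing costs.
actual_low, actual_high = (uk_share(x) for x in WORLD_ACTUAL_COSTS)

print(f"Inefficiencies: {inefficiencies:.1f}M GBP")
print(f"Actual costs:   {actual_low:.1f}M to {actual_high:.1f}M GBP")
```

On these assumptions the inefficiency share is about 20M GBP, and even the upper bound on actual costs is only a little over 3M GBP.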


[1] Don’t worry – it is the cheapest overall cost (over car parking or hotels).

[2] Not his real name, but he is happy for me to publish our discussion. He cares.

[3] It has been renamed to Helicobacter in the intervening period.

[4] Marshall BJ, Warren JR (June 1983). “Unidentified curved bacilli on gastric epithelium in active chronic gastritis”. Lancet 321 (8336): 1273–5. doi:10.1016/S0140-6736(83)92719-8. PMID 6134060. [from Wikipedia].