Machines are better referees than humans but we’ll be sued if we use them

Andy Howlett and Mark Williamson in our group have been developing fantastic software.

It can read the whole scientific literature and analyse it in minute detail. One of the things we are starting with is chemistry. ChemVisitor (part of AMI2) can read chemical structure diagrams and chemical names and work out what they mean.

It takes less than a second. That’s pretty impressive, and we’ll be reporting this at the ACS meeting next month. Here’s the first picture we chose.

Our software can read the whole chemical literature every day and work out all the compounds. And I can do it on my laptop.


Hey – hang on – you’re violating copyright! And copyright is more important than science, isn’t it? Well, actually I am not violating it here, because this is from a CC-BY paper (I omit the attribution for a reason you’ll see). But yes, if it were from a Tetrahedron (Elsevier) article or the Journal of the American Chemical Society I would have to get permission. I’d probably have to pay. I wouldn’t be allowed to do X, Y or Z… It would take days, with little likelihood of success.

And all I am doing is science. Note that chemical structure diagrams are NOT creative works. They are data. They are the only effective way of communicating what the compound is. But Elsevier and ACS and Nature and Science and … will all challenge me with lawyers if I take diagrams from non-CC-BY articles (e.g. from Nature).

Now Andy has just mailed to say that this diagram is wrong. One of the compounds is incorrectly drawn. He’s contacted the author, who has agreed. The error matters. These are compounds that many of you may eat. If the compound has the wrong name or formula then the science is badly flawed. And that can mean people die.

So try it for yourself. Which compound is wrong? (*I* don’t know yet.) How would you find out? Maybe you would go to Chemical Abstracts (ACS). Last time I looked it cost 6 USD to look up a compound. Looking up every compound in this one diagram comes to about 50 dollars, just to check whether the literature is right. And you would be forbidden from publishing what you found there (ACS sent the lawyers to Wikipedia for publishing CAS registry numbers). What about Elsevier’s Reaxys? Almost certainly as bad.

But isn’t there an Open collection of molecules? PubChem at the NIH? Yes, and ACS lobbied on Capitol Hill to have it shut down as it was “socialised science instead of the private sector”. They nearly won. (Henry Rzepa and I ran a campaign to highlight the issue.) So yes, we can use PubChem, and we have, and that’s how Andy’s software discovered the mistake.
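For the curious, here is roughly what such a check against PubChem could look like. This is only a sketch and NOT Andy's code: it assumes the software has already turned the drawn structure into a molecular formula, and the function names (`formula_url`, `formulas_from_response`, `lookup_formulas`, `label_matches`) are made up for illustration. It uses PubChem's public PUG REST interface to ask what formula PubChem holds for the printed name, then compares the two.

```python
import json
import urllib.parse
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def formula_url(name):
    """Build the PUG REST URL asking for the molecular formula of a named compound."""
    quoted = urllib.parse.quote(name)
    return f"{PUG_REST}/compound/name/{quoted}/property/MolecularFormula/JSON"


def formulas_from_response(payload):
    """Collect every MolecularFormula in a decoded PUG REST property response."""
    props = payload.get("PropertyTable", {}).get("Properties", [])
    return {p["MolecularFormula"] for p in props if "MolecularFormula" in p}


def lookup_formulas(name):
    """Ask PubChem (needs network access) which formulas it holds for this name."""
    with urllib.request.urlopen(formula_url(name), timeout=30) as resp:
        return formulas_from_response(json.load(resp))


def label_matches(drawn_formula, known_formulas):
    """True if the formula read from the diagram agrees with PubChem.

    Assumption: both formulas are written in the same element order
    (PubChem uses Hill order, e.g. C9H8O4).
    """
    return drawn_formula in known_formulas


if __name__ == "__main__":
    # Hypothetical case: a diagram labelled "aspirin" whose drawn structure
    # was interpreted as C9H8O4.
    known = lookup_formulas("aspirin")
    print("label and drawing agree" if label_matches("C9H8O4", known) else "MISMATCH")
```

A formula comparison is the crudest possible test; it cannot tell isomers apart, so a real checker would compare full structures. But even this catches a wrongly drawn atom, and it costs nothing per lookup.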

This was the first diagram we analysed. Does that mean that every paper in the literature contains mistakes?

Almost certainly yes.

But they have been peer-reviewed.

Yes – and we wrote software (OSCAR) 10 years ago that could do the machine reviewing. And it showed mistakes in virtually every paper.

So we plan to do this for every new paper. It’s technically possible. But if we do it what will happen?

If I sign the Elsevier content-mining click-through (I won’t) then I agree not to disadvantage Elsevier’s products. And pointing out publicly that they are full of errors might just do that. And if I don’t?…

Elsevier will cut off the University of Cambridge and the University will then contact me and tell me I have broken the sacred conditions that they have signed. Because no University ever challenges conditions that publishers set. The only thing that matters is price. So all universities have agreed with the publishers that readers cannot carry out text and data mining. They didn’t ask me – they just signed my rights away. If I continue I’ll probably face disciplinary action.

And the scientific literature will continue to be stuffed full of errors. And people will continue to die because of them.

Does anyone care? I don’t think so as no-one (ZERO) from a University has commented on my analysis of Elsevier’s restrictive TDM licence. They’ll just go ahead and sign it. Because it’s the easiest thing to do.