Data Liberation and the Long Tail (and a puzzle)

Next Tuesday I am giving an invited talk at Oxford on Open Data ( http://digital-research.oerc.ox.ac.uk/ , http://digital-research.oerc.ox.ac.uk/programme ), and I am also involved with a session run by the OKF immediately afterwards. As always I don’t know what I am going to say until 0200 on the morning of the talk; this gives me a chance to talk with delegates and get a feel for what is valuable.

I’ll touch on at least the following:

  • The Long Tail. Scientific disciplines which have little formal information infrastructure but huge amounts of science: bioscience outside mainstream bioinformatics support (phylogenetics, for example), chemistry, materials science, observational sciences (other than astronomy), and much computational and simulation research. Much of the data is valuable but thrown away. I estimate billions (sic) of dollars is wasted through non-existent infrastructure.
  • Graduate Students. A seriously misused resource. Much of the innovation comes from third-year postgraduates and we need to give them expression.
  • Software/informatics as a first-class activity. Builders of scientific software are often denigrated as not “doing proper science”, but they are every bit as important as the scientists who build telescopes and other instruments.
  • Bottom-up communities. There is a huge cognitive/informatics surplus if we treat the citizen community as equals and not inferiors. (Much of the software we work with is developed outside “research universities”. We should be helping this grow.)
  • Liberation software. I and others are building software which will free data in dark silos, repositories, theses, journals, etc. I’ll present some of this in the afternoon briefly. The main battle we face is closed minds and vested interests; liberation software will leapfrog many of these.

I’ll be showing some of this in action, but here’s a taster. It comes from the supplemental information in a paper behind a publisher’s firewall. I don’t know if I am allowed to show it, but I’ll take the chance. It’s a mass spectrum: in simple terms it measures the mass of a molecule (here to 4 significant figures). Here are some questions. (Please add answers as comments, because then I know people are interested and I might also learn something.) [BTW this is how it appears in the paper. I assume the journal prints text upside down to make it easy for Australians, but I have to hang from the ceiling to read this.]

Questions (in order of difficulty):

  • What’s the constitutional formula of the compound? (relatively easy for chemists)
  • How many peaks are there? (harder than it looks)
  • How would you find where this diagram was published? (very hard)

On Tuesday I will show how the Liberation Software AMI2 can be used to answer question 3.