Teaching #ami2 to recognize biological names (binomial)


Erithacus rubecula (Wikimedia Commons) “the Robin”


#ami2 can now read the text of scientific articles as HTML (she has a little trouble with bold letters and strange fonts but we’ll teach her how to manage). Here is how she finds organisms in text. Having created the HTML (which is also XML) she can search it with XPath. XPath is one of the simplest and most powerful search tools for moderate-sized chunks of information. Here she searches a page for italic phrases with at least one space, e.g.:

I heard an Erithacus rubecula today. (@rmounce points out the capitalization: the species part, rubecula, is lower case!)

AMI has extracted the HTML (<i>…</i> means italics):

<p>I heard an <i>Erithacus rubecula</i> today.</p>

Now she creates an XPath:

".//html:i[contains(.,' ')]"

This means:

  • .// anywhere in the document (we can increase the precision later)
  • html:i a chunk of italics
  • contains(.,' ') which (.) contains a space (' ')
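For readers who want to try the same search without XOM, here is a minimal sketch in Python. It is not AMI's code. It assumes the page is well-formed XHTML in the standard XHTML namespace, and because the standard-library ElementTree does not support the XPath contains() function, the space test is done in plain Python instead. The second paragraph of the sample is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The standard XHTML namespace, which the "html:" prefix in the XPath refers to.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Sample page: the first paragraph is the example from the text,
# the second is a hypothetical single-word italic for contrast.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>I heard an <i>Erithacus rubecula</i> today.</p>"
    "<p>Published in <i>Cladistics</i>.</p>"
    "</body></html>"
)


def italic_phrases_with_space(xhtml_text):
    """Return the text of every <i> element whose content contains a space."""
    root = ET.fromstring(xhtml_text)
    phrases = []
    # root.iter() walks the whole tree, playing the role of ".//" in the XPath.
    for italic in root.iter("{%s}i" % XHTML_NS):
        text = "".join(italic.itertext())  # the string value "." of the element
        if " " in text:                    # stands in for contains(.,' ')
            phrases.append(text)
    return phrases


print(italic_phrases_with_space(SAMPLE))
```

The single-word italic is skipped and only the two-word phrase comes back, which is all the XPath above asks for.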

It’s not flowing prose, but it’s trivial for AMI. And the result (using XOM’s query(), which runs on Jaxen) is:

  • & Evolution
  • 16S, COI
  • 16S, COI, COII
  • 16S, P
  • Achillea macrophylla, Adenostyles alliarae
  • Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
  • Advances in Chrysomelidae Biology 1.
  • Ae. triuncialis
  • Aegilops geniculata
  • Annals of the Entomological Society of
  • Annals of the Entomological Society of America
  • Annual Review of Ecology and
  • Applied Statistics
  • BMC Bioinformatics
  • BMC Evolutionary Biology
  • Bioinformatics 2005, 21(24):4423-4424. 69. Sikes DS, Lewis PO: PAUPRat: PAUP implementation of the parsimony ratchet.
  • Biological Journal
  • Biology and Evolution
  • Boston University, Boston,
  • COI (13 PPIc among 16 polymorphic sites) and
  • COII, P
  • Cladistics-the International Journal of the Willi Hennig Society
  • Current Biology
  • Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs
  • Diabrotica virgifera
  • Die Käfer Mitteleuropas.
  • Doronicum clusii
  • Doronicum grandiflorum

Clearly not all italics are organisms. Many are bibliographic indicators. There are two simple ways to improve the precision:

  • Remove false positives. We can probably remove most of the bibliographic items by context (they occur on title pages and in reference lists).
  • Include only known species. This is probably the best way forward, and we have an excellent Open Source tool (Linnaeus) from Casey Bergman and colleagues at Manchester, covering more than 10,000 of the commonest species.
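To make the second idea concrete, here is a rough sketch in Python. It does not use Linnaeus or its API. KNOWN_GENERA is a tiny hand-made lexicon (an assumption, built from the genera visible in the list above) standing in for a real dictionary, and the rule "every comma-separated part must start with a known genus or an abbreviation of one" is a simplification chosen for illustration.

```python
# Hypothetical mini-lexicon standing in for a real species dictionary.
KNOWN_GENERA = {
    "Achillea", "Adenostyles", "Aegilops", "Cirsium",
    "Diabrotica", "Doronicum", "Petasites", "Senecio",
}

# A sample of the italic phrases found by the XPath search.
CANDIDATES = [
    "& Evolution",
    "16S, COI",
    "Achillea macrophylla, Adenostyles alliarae",
    "Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio",
    "Ae. triuncialis",
    "Aegilops geniculata",
    "BMC Bioinformatics",
    "Boston University, Boston,",
    "Current Biology",
    "Diabrotica virgifera",
    "Die Käfer Mitteleuropas.",
    "Doronicum clusii",
    "Doronicum grandiflorum",
]


def starts_with_known_genus(name):
    """True if the first word is a known genus or an abbreviation of one."""
    words = name.split()
    if not words:
        return False
    first = words[0]
    if first in KNOWN_GENERA:
        return True
    # Abbreviated genus such as "Ae." for Aegilops.
    if first.endswith(".") and len(first) > 1:
        stem = first[:-1]
        return any(genus.startswith(stem) for genus in KNOWN_GENERA)
    return False


def keep_known_organisms(candidates):
    """Keep phrases in which every comma-separated part starts with a known genus."""
    kept = []
    for candidate in candidates:
        parts = [part.strip() for part in candidate.split(",") if part.strip()]
        if parts and all(starts_with_known_genus(part) for part in parts):
            kept.append(candidate)
    return kept


for phrase in keep_known_organisms(CANDIDATES):
    print(phrase)
```

A genus-only lexicon is cruder than a proper species dictionary, but it is enough to separate the journal titles and gene names from the organisms in this sample.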

There are other ways:

  • Morphology and lexical analysis of digraphs (the letter frequency in organisms is very different from English prose – higher vowel frequency for example).
  • Local context (include Hearst patterns … but hey, I have to go…)
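The letter-frequency idea can be sketched just as briefly. The Python function below (my own illustration, not part of AMI) measures the fraction of letters in a phrase that are vowels. It counts only the ASCII vowels a, e, i, o, u, which is an assumption, and no threshold is implied: a real classifier would need to be tuned on many names and much prose.

```python
VOWELS = set("aeiou")  # assumption: ASCII vowels only, "y" and accented letters excluded


def vowel_fraction(phrase):
    """Return the fraction of alphabetic characters in phrase that are vowels."""
    letters = [ch for ch in phrase.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in VOWELS for ch in letters) / len(letters)


# Compare a binomial with a journal title from the list above.
for phrase in ["Erithacus rubecula", "Current Biology"]:
    print(phrase, round(vowel_fraction(phrase), 2))
```

On this one pair the binomial scores higher than the journal title, in line with the observation above, though two phrases prove nothing on their own.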

So we easily get:

  • Achillea macrophylla, Adenostyles alliarae
  • Achillea, Adenostyles, Cirsium, Doronicum, Petasites, Senecio
  • Ae. triuncialis
  • Aegilops geniculata
  • Diabrotica virgifera
  • Doronicum clusii
  • Doronicum grandiflorum

So I hope you are now clear about how powerful content-mining is, how it will revolutionise science and how it is a crime against human knowledge to restrict its deployment.