How ContentMine can help you! Our example looks up a “tummy bug” for Natalie

Yesterday Tom, Natalie and I had coffee together. Natalie’s a Vet student – at Royal Veterinary College – and we got talking about her project – 8 weeks doing practical research on Isospora. I’ve never heard of it. No idea what it is.

But ContentMine will know, so we’ll ask it…

We’ll be showing you in later posts how it all works, but just accept that we type:

getpapers -q isospora -x

Wait a minute for ca 207 open access papers to be downloaded, and then

cmine isospora

And wait another minute for ami to crunch through the data. Ami has now created summary files, and we’ll look at full.dataTables.html, which gives an overall view of all the “plugins” we have used (species, genes, words, etc.). Here are the first few papers:

[Screenshot: the first few rows of full.dataTables.html]

No need to squint: we’ll describe them in more detail. (Note: some of the links are broken and there are a few false positives; both are being cleaned up.)


The first column of results gives links to the papers (PMC2758902 is a PubMedCentral id and clicking it will link to the EuropePubMedCentral repository of full text papers). Yes, YOU can read them. 200 free papers. If you are interested in Isospora, they are all yours! So here’s the first paper of the 200:

[Screenshot: the table row for PMC2758902]


We still don’t know what Isospora is, so let’s click on Isospora belli. It’s linked to Wikipedia, which says:

Cystoisospora belli, previously known as Isospora belli, is a parasite that causes an intestinal disease known as cystoisosporiasis.[1] This protozoan parasite is opportunistic in immune suppressed human hosts.[2] It primarily exists in the epithelial cells of the small intestine, and develops in the cell cytoplasm.[2] The distribution of this coccidian parasite is cosmopolitan, but is mainly found in tropical and subtropical areas of the world such as the Caribbean, Central and S. America, India, Africa, & S.E. Asia. In the U.S., it is usually associated with HIV infection and institutional living.[3]

So, to paraphrase,

“Isospora is the old name of a nasty tummy bug, found mainly, but not exclusively, in the sub/tropical world that can infect HIV-sufferers”

Biological science is often hard to read for newcomers, but with practice you learn how to translate. Here’s a sentence from one paper:

Coprological examination of fresh stool specimens revealed coccidian oocysts of the genus Isospora in 36% of the birds


In plain English: “We examined birdshit and found parasite eggs in 36%.”

The long words are useful – they aren’t there just to put you off or be pompous. They help translate between human languages, and they increase precision. If we search for “parasite eggs in birds” we might end up with bird eggs, whereas “oocysts” is more precise. ContentMine loves precise words because they reduce false positives (results that aren’t relevant to what you want).
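To make the point about precision concrete, here is a minimal Python sketch. It is not how ContentMine or ami actually searches: the three sentences are made up for illustration (only the second echoes the paper quoted above), and `matches` is a hypothetical helper.

```python
import re

# Toy sentences, invented for illustration; not taken from the downloaded papers.
SENTENCES = [
    "The birds laid three eggs in each nest.",
    "Stool specimens revealed coccidian oocysts of the genus Isospora.",
    "Parasite eggs were found in droppings from the birds.",
]


def matches(term, sentences):
    """Return the sentences containing `term` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


vague = matches("eggs", SENTENCES)       # also catches the sentence about bird eggs
precise = matches("oocysts", SENTENCES)  # catches only the parasite sentence

print(len(vague), "hits for 'eggs'")
print(len(precise), "hits for 'oocysts'")
```

In this toy set the everyday word picks up a sentence about nests, while the technical word does not. The price is that the technical word also misses the sentence that describes the parasite in everyday language.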

Column “words” is a list of the commonest word tokens. In this case it’s just “patients”. That confirms that the paper is probably about human infection (though Natalie and other Vets call animals “patients”). So were we right? Click on PMC2758902 and we’ll see:

[Screenshot: the top of paper PMC2758902]

So it’s about HIV, and drug treatment. Where’s the Isospora? Search down the full text and we find:

The reasons for hospitalization were: disseminated tuberculosis (month 5), reactivation of oropharyngeal Kaposi’s sarcoma (month 3), and Isospora belli diarrhea with severe dehydration

So if you are interested in finding all papers where Isospora has infected HIV patients, ContentMine can immediately help you.
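As a rough idea of what such a filter does, here is a small Python sketch. It assumes the paper texts are already available as plain strings; the identifiers and texts are made up, and `mentioning_all` is a hypothetical helper, not part of ContentMine.

```python
# Toy full texts keyed by made-up identifiers; real input would be the downloaded papers.
PAPERS = {
    "paper_a": "HIV patients were hospitalized with Isospora belli diarrhea.",
    "paper_b": "Oocysts of the genus Isospora were found in 36% of the birds.",
    "paper_c": "Antiretroviral therapy outcomes in HIV patients.",
}


def mentioning_all(papers, terms):
    """Return the sorted ids of papers whose text contains every term (case-insensitive)."""
    wanted = [t.lower() for t in terms]
    return sorted(
        pid for pid, text in papers.items()
        if all(t in text.lower() for t in wanted)
    )


both = mentioning_all(PAPERS, ["Isospora", "HIV"])
print(both)
```

Two terms appearing in the same paper only suggests a connection; you still have to read the paper to see whether it really describes Isospora infection in HIV patients.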

Natalie’s main interest is veterinary, so we’ll look at the next few papers. But that shows how much there is in just ONE paper, and why we need machines to help us. Natalie probably mainly wants papers about animals and we can address that as well…


… in the next blog post!