Content-mining: #animalgarden write a crawler

 

Chuff: Hello Gulliver! I’m the OKFN OKAPI.

Gulliver: Hello! I’m Gulliver, the BMC turtle.

C: I know – you’ve got a blog http://gulliverturtle.wordpress.com/ where you tell the world about #openaccess. What’s #ami2 doing?

G: She’s crawling BioMedCentral content.

C: Gosh! It looks painful.

G: It’s very painful for humans, but as you know #ami2 has no emotions. So it’s not a problem. She’s good at it.

C: Yes, she doesn’t get tired, angry, bored. She does exactly what she is told by PMR. So how does it work?

G: Here’s my content page: http://www.biomedcentral.com/content – that tells you where all the articles are.

A: PMR told me to read each bibliodata and follow the link. I have to do that for all the papers.

C: Well, there are only 25, so it won’t take too long.

G: We’re MUCH more popular than that. This is one PAGE. There are nearly 6000 pages.

C: Wow! Is that because you’re Open Access?

G: Yes. PMR publishes many of his papers with me.

C: Let’s see. 25 * 6000 = 150,000. Wow! What a lot of Open Access papers. Can #ami2 do them all?

G: Yes. We’ll tell her how long she has to wait between reading each paper.

C: But #ami2’s very fast. She doesn’t need to wait.

G: If she tries to download papers too quickly – like 1000 per second – it might confuse our servers, because that might look like a hostile robot.

C: But #ami2 IS a robot.

G: Yes, but she’s a friendly robot. We’ll tell her what the maximum speed is.

C: She told me it was 6 seconds for PLoSONE. (I wish PLoSONE would get a mascot – all #openaccess publishers should have animals).

G: Yes. PeerJ has Charlie the monkey. Anyway, let’s do the sums. 6 secs is 10 papers per minute. We need 150,000 / 10 = 15,000 minutes.

C: Which is about 10 days. That’s to get the backlog. How many would it be per day?
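Gulliver’s sums can be checked in a few lines. This is just the arithmetic from the dialogue (a 6-second delay, a 150,000-paper backlog) – nothing here comes from PMR’s actual crawler code.

```java
// The back-of-envelope sums from the dialogue: a 6-second delay between
// requests gives 10 papers per minute, so a 150,000-paper backlog takes
// 15,000 minutes, i.e. about 10 days.
public class CrawlArithmetic {

    // papers per minute for a fixed per-request delay in seconds
    static int papersPerMinute(int delaySeconds) {
        return 60 / delaySeconds;
    }

    // total minutes to fetch `papers` articles at that rate
    static int minutesForBacklog(int papers, int delaySeconds) {
        return papers / papersPerMinute(delaySeconds);
    }

    public static void main(String[] args) {
        int minutes = minutesForBacklog(150_000, 6);
        System.out.println(minutes + " minutes");          // prints "15000 minutes"
        System.out.println(minutes / 60 / 24 + " days");   // prints "10 days"
    }
}
```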

G: I think about 150. I’ll have to check with Amye. That’s BMC Amye!

C: Yes, I met her at the @OpenDrinks last week. That’s about the same number as PLoSONE. 150 articles is 15 minutes.

G: And you can do all BMC at the same time as PLoSONE, because you can alternate requests every 3 seconds.
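The alternation Gulliver describes can be sketched as a simple schedule: one request every 3 seconds, publishers taken in turn, so each publisher still sees only a 6-second gap. This is an illustration of the idea, not PMR’s code; a real crawler would sleep and fetch at each slot.

```java
import java.util.ArrayList;
import java.util.List;

// Build the request schedule for two publishers, alternating every 3
// seconds. Each publisher ends up with one request every 6 seconds.
public class AlternatingSchedule {

    static List<String> schedule(String a, String b, int slots) {
        List<String> plan = new ArrayList<>();
        for (int i = 0; i < slots; i++) {
            String publisher = (i % 2 == 0) ? a : b;   // take turns
            plan.add("t=" + (i * 3) + "s " + publisher);
        }
        return plan;
    }

    public static void main(String[] args) {
        for (String slot : schedule("PLoSONE", "BMC", 6)) {
            System.out.println(slot);
        }
        // PLoSONE gets t=0,6,12s; BMC gets t=3,9,15s -- 6 seconds apart each
    }
}
```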

C: That’s clever. Wow – there’s even a journal for data mining: http://www.biodatamining.org/. How big is an article?

G: Depends what you want. http://www.biodatamining.org/content/pdf/1756-0381-6-21.pdf is about 6 MBytes. But there’s also HTML and XML.

C: Are they all the same?

G: Not quite. The XML and HTML are quite similar. #ami2 can read them easily. They don’t have any pictures.

C: But I like pictures.

G: The pictures are there, but separate. You have to follow links.

C: #ami2 can do that. We have to organize it… But the PDF may contain vector graphics. PMR loves vector graphics because he can teach #ami2 to build real science from the pictures. He’s not so keen on PNGs and GIFs and TIFFs and JPEGs. But #ami2 isn’t perfect at reading PDFs. No one is perfect at reading PDFs except sighted humans. It’s hard to teach animals like #ami2. But she’s improving.

G: OK – sounds like you want the PDF AND the XML AND the HTML.

C: Sounds like it. So we have to let #ami2 know where they are. Trouble is every publisher does it differently. PMR’s just written AbstractCrawler.java.

G: So will that read BMC?

C: No. It won’t read anything. It only downloads things.

G: So will it download my articles?

C: Not yet. We have to write a special crawler for each publisher. PMR’s written PlosoneRecentCrawler, which extends AbstractCrawler. He’s tried to make it easy to include a new publisher.
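PMR’s AbstractCrawler and PlosoneRecentCrawler aren’t shown in the post, so the sketch below only guesses at the design described: an abstract class owning the polite download loop, with a per-publisher subclass saying where its article links are. Every name and signature here is hypothetical.

```java
import java.util.List;

public class CrawlerSketch {

    // Hypothetical shape of the abstract crawler the post describes:
    // the base class owns the delay-and-download loop, subclasses only
    // supply publisher-specific article URLs.
    static abstract class AbstractCrawler {
        final long delayMillis;

        AbstractCrawler(long delayMillis) {
            this.delayMillis = delayMillis;
        }

        // publisher-specific: list the article URLs on one content page
        abstract List<String> articleUrls(int page);

        // shared: walk the pages, pausing between downloads
        void crawl(int pages) throws InterruptedException {
            for (int page = 1; page <= pages; page++) {
                for (String url : articleUrls(page)) {
                    System.out.println("downloading " + url);
                    Thread.sleep(delayMillis);   // stay friendly to the server
                }
            }
        }
    }

    // the "Gulliver Crawler" Gulliver asks for -- hypothetical; a real
    // version would fetch and parse each BMC content page
    static class GulliverCrawler extends AbstractCrawler {
        GulliverCrawler(long delayMillis) {
            super(delayMillis);
        }

        @Override
        List<String> articleUrls(int page) {
            return List.of("http://www.biomedcentral.com/content/?page="
                    + page + "&itemsPerPage=100");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // tiny delay for the demo; the dialogue suggests ~6000 ms for real use
        new GulliverCrawler(10).crawl(2);
    }
}
```

The point of the split is the one Chuff makes: the delay logic is written once, and each new publisher only costs a small subclass.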

G: Please speak nicely to him and ask him to write a Gulliver Crawler.

C: I will try. He’s a bit tired. Humans need to rest.

G: How boring and inconvenient!

C: He sometimes writes code during the cricket. But he’s upset about the cricket…

G: I suppose #ami2 is pleased about the cricket?

C: No – remember she has the emotional apparatus of a FORTRAN compiler. PMR will point her at the BMC content page. She will start at the first article and then go on to the next. Let’s start at:

http://www.biomedcentral.com/content/?page=1&itemsPerPage=100

I *think* that means that we start at page 1. If we go to the next page we find:

http://www.biomedcentral.com/content/?page=2&itemsPerPage=100
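Since the two URLs above differ only in the page parameter, the crawl can simply count pages. A minimal sketch of the URL builder (the ~6000-page total comes from Gulliver earlier in the conversation):

```java
// Build the BMC content-listing URLs by incrementing the page parameter.
public class ContentPages {

    static String pageUrl(int page) {
        return "http://www.biomedcentral.com/content/?page=" + page
                + "&itemsPerPage=100";
    }

    public static void main(String[] args) {
        for (int page = 1; page <= 3; page++) {   // up to ~6000 in real life
            System.out.println(pageUrl(page));
        }
    }
}
```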

G: That looks promising. But remember that these pages are updated. Their content could change as we add new papers.

C: Oh my stripes and paws! Let’s ask Amye. She wrote to PMR:

We have both a search API, an OAI-PMH API and an FTP location with a zip file of all our XML. The OAI-PMH API allows for retrieval of article metadata and fulltext XML for all articles or specific subsets. It also allows flexible date stamp restriction. So that should meet your needs. We also have a feed API (giving the latest articles, picks and most viewed) and a case report API for our Cases Database.
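OAI-PMH is a standard protocol, so the verb and parameter names below are real; the endpoint URL is a placeholder, because Amye’s mail doesn’t give it. A sketch of building the kind of date-restricted ListRecords request she describes:

```java
// Build an OAI-PMH ListRecords request with a date-stamp restriction.
// verb, metadataPrefix, from and until are standard OAI-PMH parameters;
// the endpoint used in main() is a made-up placeholder.
public class OaiPmhRequest {

    static String listRecords(String endpoint, String prefix,
                              String from, String until) {
        return endpoint + "?verb=ListRecords"
                + "&metadataPrefix=" + prefix
                + "&from=" + from
                + "&until=" + until;
    }

    public static void main(String[] args) {
        // oai_dc is the Dublin Core format every OAI-PMH server must support
        System.out.println(listRecords("http://example.org/oai", "oai_dc",
                "2013-07-01", "2013-07-31"));
    }
}
```

The from/until pair is what gives the "flexible date stamp restriction" Amye mentions: a daily crawl would just ask for yesterday’s records.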

G: Well done. I think it will take another mail or skype to clarify…

C: Goodbye for now, and goodbye from #ami2