Content Mining starts today!

There is now an unstoppable interest in, and desire for, content mining. People want to know how, when, where, what the problems are … all sorts of things. So Jenny Molloy and Katelyn Rogers (OKFN) have set up a mailing list. Join in the normal way:

Here’s my second post:

Many thanks to Jenny Molloy and Katelyn Rogers for setting up this list.


Last night we had a get-together in London, catalysed by PLoS, with representation from OKFN, BioMedCentral, CrossRef, eLife, … all the usual suspects. There was lots of discussion about content mining, and I encouraged people to post their ideas to this list.


Here are some potential topics:


* what’s a responsible way to run a crawler over content?

* what are current practices for obtaining content?

* what are the legal and contractual aspects of CM?

* what types of content can be mined? What are the technical, social, contractual bases?

* what software exists?

* how do I do natural language processing?

* what can I get from images?

* where can we put the mined content?

* where can we find dictionaries for annotating content?

* where’s the next meeting on content-mining?




We are also developing the technology very rapidly. We have two trial datasets in CKAN where we’ve extracted species, and we’ll be discussing these over the next 2-3 days. The intention is to extract facts from about 150 PLoS ONE articles every day and put them in the Datahub. We’re talking with Amye Kenall from BioMedCentral about the best way to crawl all of BMC daily, and we’ll be revisiting BMC after Ross’s viva. We’ve asked Geoff Bilder from CrossRef to post some exciting ideas which we discussed last night …


(Must rush to the OKF/BL hack/love-in today…)