Content-mining: how can I help?

I got a message today offering help with content-mining (CM). Great! CM isn’t a single activity; ideally it’s a community of collaborating people and organizations combining resources. The first thing you can do is join, and post to, the open-contentmining mailing list: https://lists.okfn.org/mailman/listinfo/open-contentmining. Here’s what I replied:

I’m delighted to have had an offer of help with content-mining. The good news is:

*Everyone has a role to play in content-mining*

Here are some important areas – please submit others. There are lots of micro-tasks that everyone can become involved in.

==project==

* identifying a need

* coordinating a community effort

* summarising current practice (e.g. rights, barriers, resources)

* creating resources (e.g. corpora)

* running a project

==crawling==

* identifying sites to mine

* collecting bibliographic metadata (e.g. tables of content)

* agreeing web-friendly protocols (e.g. delay times)

* writing or finding crawlers

* creating or deploying crawl scripts

* managing workflow manually or automatically

* recording crawl log

* saving crawled materials
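
As a rough illustration of how the protocol, logging and saving items above fit together, here is a minimal Python sketch. It assumes the site publishes a robots.txt whose lines have already been obtained; `polite_crawl`, the `cm-sketch-bot` user agent and the caller-supplied `fetch` function are hypothetical names invented for this example, not part of any existing content-mining tool.

```python
import time
from urllib.robotparser import RobotFileParser


def polite_crawl(urls, fetch, robots_lines, user_agent="cm-sketch-bot",
                 default_delay=1.0, sleep=time.sleep):
    """Fetch each allowed URL, pausing between requests and keeping a log.

    fetch        -- caller-supplied function url -> content (hypothetical)
    robots_lines -- the lines of the site's robots.txt
    """
    rules = RobotFileParser()
    rules.parse(robots_lines)
    # Use the site's Crawl-delay if it declares one, else our own default.
    delay = rules.crawl_delay(user_agent) or default_delay

    saved, log = {}, []
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            log.append((url, "skipped (robots.txt)"))
            continue
        if saved:
            # Wait before every request after the first successful one.
            sleep(delay)
        saved[url] = fetch(url)
        log.append((url, "fetched"))
    return saved, log
```

Passing `fetch` and `sleep` in as arguments keeps the sketch independent of any particular HTTP library and lets it be tried out without touching a real site.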

==document==

* formalising structure of document (e.g. sections)

* creating or finding vocabularies for annotation

==generic tools==

* crawlers

* PDF readers

* flat text readers

* graphics analyzers

* image analyzers

==databases==

* customization

==natural language==

* collection of NLP tools

* vocabularies

* corpora for training

* training

* testing

* domain tools

==graphics==

* reconstruction of diagrams from primitives

* SVG tools

==images==

* selection

* cropping

* binarisation

* edge detection/segments

* optical character recognition
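
To give a flavour of one of these micro-tasks, here is a small sketch of binarisation using Otsu's method, one common way of choosing a threshold (the list above does not prescribe any particular technique). It assumes a greyscale image held as a list of rows of integers in the range 0-255; the function names are made up for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the dark class yet
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarise(image):
    """Map a greyscale image (rows of 0-255 ints) to rows of 0/1."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p > t else 0 for p in row] for row in image]
```

Real scanned pages usually need more care (uneven lighting, noise), so this is only a starting point for experimenting before moving to an image library.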

==text==

* fonts

==tables==

* reconstruction

* interpretation

==audio==

==video==

==semantics==

* annotation

* links

==domain==

* maths

* chemistry

* geo

* dates

* units of measurement
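
As a toy example of a domain micro-task, the sketch below pulls number-plus-unit pairs out of running text with a regular expression. The unit list is a deliberately tiny, hand-picked assumption for illustration; a real tool would draw on a proper units vocabulary and handle ranges, exponents and thousands separators.

```python
import re

# A small, hand-picked set of units (an assumption for this sketch only).
QUANTITY = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mg|kg|g|mL|L|mm|cm|km|m|°C|K)\b"
)


def find_quantities(text):
    """Return (value, unit) pairs found in text, in order of appearance."""
    return [(float(m.group("value")), m.group("unit"))
            for m in QUANTITY.finditer(text)]
```

For example, calling `find_quantities` on "Heated 2.5 g of sample in 10 mL water at 37 °C for 3 h." picks up the mass, volume and temperature, and ignores "3 h" because hours are not in the unit list.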

==argumentation==

* document structure

* sentiment analysis

==documentation==

==sociopoliticololegal==

==community==

* mailing lists

* crowdcrafting