Text and Data Mining: Overview

Tomorrow the University of Cambridge Office of Scholarly Communication is running a 1-day Symposium on Text and Data Mining (https://docs.google.com/document/d/1l4N2fSFgpL3iMbjKC3IxHz7GpNVvERB5NzxqWp8jZQo/edit). I have been asked to present http://ContentMine.org, a project funded by the Shuttleworth Foundation through a personal Fellowship, which has since evolved into a not-for-profit company.

I hope to write several blog posts before tomorrow, and maybe some afterwards. I have been involved in mining science from the semi-structured literature for about 40 years and shall give a scientific slant. As I have only 20-25 minutes to speak, I am recording thoughts here so that people have time to explore the more complex aspects.

Machines are now part of our information future and present, but many sectors, including academia, have not embraced this. Whereas supermarkets, insurance and social media have all modernised, scholarly communication still works with “papers”. These papers contain literally billions of dollars of unrealised value, but very few people care about this. As a result we are not getting the full value of technical and medical funding, much of which is wasted through the archaic physical format and outdated attitudes.

These blog posts will cover the following questions – how many depends on how the story develops. They include:

  • What mining could be used for and why it could revolutionise science and scholarship
  • Why TDM in the UK and Europe (and probably globally) has been a total political and organizational failure.
  • What directions are we going in? (TL;DR you won’t enjoy them unless you are a monopolistic C21st exploiter, in which case you’ll rejoice.)
  • What I personally am doing to fight the cloud of digital enclosure.

There are 3 arms to ContentMine activities:

  • Advocacy/political. Trying to change the way we work top-down, through legal reforms, funding, etc. (TL;DR it’s not looking bright)
  • Tools. ContentMining needs a new generation of Open tools and we are developing these. The vision is to create intelligent scientific information rather than e-paper (PDF). Much of this has recently been enhanced by the development of https://en.wikipedia.org/wiki/Wikidata
  • Community. The great hope is the creative activity of young people (and people young at heart). Young people are sick of the tired publisher-academic complex which epitomises everything old, with meretricious values.

This sounds very idealistic – and perhaps it is. But the Academic-Publisher complex is all-pervasive – it kills young people’s hopes and actions. Our values are managed by algorithms that Elsevinger sells to Vice-chancellors to manage “their” research. The AP complex has destroyed the potential of TDM in the UK and elsewhere and so we must look to alternative approaches.

For me there is a personal sadness. 15 years ago I could mine the literature and no-one cared. I had visions of building the Open shared scientific information of the future. I called it https://en.wikipedia.org/wiki/World_Wide_Molecular_Matrix – after Gibson’s vision of the matrix in cyberspace. It draws on the vision of TimBL and the semantic web, and the idea of global free information. It was technically ahead of its time by perhaps 15 years, but now – with Wikidata, and modern version control (Git) – we can actually build this.

So my vision is to mine the whole of the scientific literature and create a free scientific resource for the whole world.

It’s technically possible and we have developed the means to do it. And we’ve started. And we will show you how, and how you can help.

But we can only do it on a small part of the literature because the Academic-Publisher complex has forbidden it on the rest.