Documents and Data…

Last month I was on Dr. Kiki’s Science Hour. Besides being a lot of fun (despite my technical problems, which were part of my recent move to GNU/Linux and away from Mac!), I also discovered that at least one person I went to high school with is a fan of Dr. Kiki, because he told everyone about the show at my recent high school reunion. Good stuff.

In the show, I did my usual rant about the web being built for documents, not for data. And that got me a great question by email. I wrote a long answer that I decided was a better blog post than anything else. Here goes.

Although I’m familiar with the Creative Commons & Science Commons, the interview really helped me understand the bigger picture of the work you do. Among many other significant and timely anecdotes, I received the message that the internet is built around document search and not data search. This comment intrigued me immensely. I want to explore that a little more to understand exactly what you meant. Most importantly, I want to understand what you believe the key differences between the documents and the data are. From one perspective, the documents contain the data; from another, the data forms the documents.

True, in some cases. But in the case of complex adaptive systems – like the body, the climate, or our national energy usage – the data are frequently not part of a document. They exist in massive databases which are loosely coupled, and are accessed by humans not through search engines but through large-scale computational models. There are so many layers of abstraction between user and data that it’s often hard to know where the actual data at the base of a model reside.

This is at odds with the fundamental nature of the Web. The Web is a web of documents. Those documents are all formatted the same way, using a standard markup language, and the same protocol to send copies of those documents around. Because the language allows for “links” between documents, we can navigate the Web of documents by linking and clicking.

There’s more fundamental stuff to think about. Because the right to link is granted to creators of web pages, we get lots of links. And because we get lots of links (and there aren’t fundamental restrictions on copying the web pages) we get innovative companies like Google that index the links and rank web pages, higher or lower, based on the number of links referring to those pages. Google doesn’t know, in any semantic sense, what the pages are about, what they mean. It simply has the power to do clustering and ranking at a scale never before achieved, and that turns out to be good enough.
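To make the point concrete, here is a deliberately naive sketch of link-based ranking in Python. It is not Google’s actual algorithm (real ranking is far more sophisticated), and the page names are hypothetical, but it shows how far you can get with no semantics at all: counting who links to whom is enough to produce an ordering.

```python
# Toy link-based ranking: count incoming links per page.
# A deliberately naive sketch, not Google's real algorithm;
# all page names are hypothetical.

links = {
    # page -> pages it links to
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

# No understanding of what the pages mean is needed:
# rank them purely by how many other pages link to them.
incoming = {}
for page, outlinks in links.items():
    for target in outlinks:
        incoming[target] = incoming.get(target, 0) + 1

ranking = sorted(incoming, key=incoming.get, reverse=True)
print(ranking)  # "c.example" comes first: three pages link to it
```

The clustering-and-ranking trick works precisely because documents volunteer their own structure through links; it says nothing about meaning.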

But in the data world, very little of this applies. The data exist in a world almost without links. There is no accepted standard language for marking up data, though some are emerging. And even if you had one, you would only run into the next problem – the problem of semantics and meaning. So far at least, statistics aren’t good enough to help us structure data the way they structure documents.
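A small invented example of the semantics problem: suppose two labs publish “the same” measurement. Nothing in the data itself tells a machine that the fields correspond, or that the units differ – a human has to encode that mapping by hand, dataset by dataset, which is exactly what doesn’t scale. (All names and values below are hypothetical.)

```python
# Toy illustration of the semantics problem in data integration.
# Two hypothetical labs record the same blood-glucose measurement
# under different field names and different units.

lab_a = {"subject": "S-101", "glucose_mg_dl": 99}
lab_b = {"patient_id": "S-101", "blood_glucose": 5.5}  # in mmol/L

# A human knows: patient_id means subject, and mmol/L * 18 is roughly mg/dL.
# A crawler doesn't. Integration needs an explicit, hand-written mapping:
field_map = {"patient_id": "subject", "blood_glucose": "glucose_mg_dl"}
unit_convert = {"blood_glucose": lambda v: round(v * 18.0)}

merged = dict(lab_a)
for key, value in lab_b.items():
    if key in field_map:
        converted = unit_convert.get(key, lambda x: x)(value)
        merged.setdefault(field_map[key], converted)

print(merged)  # the two records now line up under one vocabulary
```

Multiply this by thousands of databases and millions of fields, and you see why links and page-counting don’t rescue us here.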

From what you posited and the examples you gave, I envision a search engine which has the capacity to form documents out of data using search terms, e.g. enter two variables and get a graph as a result instead of page results. Not too far from what Wolfram Alpha is working on, but indexing all the data rather than pre-tabulated information from a single server/provider. Perhaps I’m close, but I want to make sure we’re on the same sheet of music.

I’m actually hoping for some far more basic stuff. I am less worried about graphing and documents. If you’re at that level, you have (a) already found the data you need and (b) figured out what questions you want to ask of it.

This is the world in which one group of open data advocates lives. It’s the world of apps that help you catch the bus in Boston. It’s one that doesn’t worry much about data integration, or data interoperability, because it’s simple data – where is the bus and how fast is it going? – and because it’s mapped against a grid we understand, which is…well, a map.

But the world I live in isn’t so simple. Deeply complex modeling of climate events, of energy usage, of cancer progression – these are not so easy to turn into iPhone apps. And their output shouldn’t be a document. It’s the wrong metaphor. We don’t need a “map” of cancer – we need a model that tells us, given certain inputs, what our decision matrix looks like.

I didn’t really get this myself until we started playing around with massive-scale data integration at Creative Commons. But since then, in addition to what we do here, I’ve been to the NCBI, I’ve been to Oak Ridge National Lab, I’ve been to CERN…and the data systems they maintain are monstrous. They’re not going to be copied and maintained elsewhere, at least not without lots of funding. They’re not “webby” like mapping projects are. There aren’t a lot of hackers who can use them, nor is there a vast toolset for working with them.

So I guess I’m less interested in search engines for data than I am in making sure that people who are building the models can use crawlers to find the data they want, and that they are legally allowed to harvest that data and integrate it. Doing so is not going to be easy. But if we don’t design for that world, for model-driven access, then harvest and integration will quickly approach NP levels of complexity. We cannot assume that the tools and systems that let us catch the bus will let us cure cancer. They may, someday, evolve into a common system, and I hope they do – but for now, the iPhone approach is using a slingshot against an armored division.
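What “crawlers that find the data” might look like, in miniature: a crawler that walks a web of documents but keeps, rather than indexes, the datasets it encounters along the way. The sketch below runs over a simulated, entirely hypothetical site (no network access), and recognizes “data” crudely by file extension – a stand-in for the real discovery problem, not a solution to it.

```python
# Toy crawler over a simulated web: follow document links,
# collect the datasets they point to. The site structure and
# file names are hypothetical; extension-matching stands in
# for real (much harder) data discovery.

site = {
    "index.html": ["about.html", "results.csv"],
    "about.html": ["data/climate.nc", "index.html"],
}

DATA_EXTENSIONS = (".csv", ".nc", ".hdf5")

def harvest(start):
    seen, found = set(), []
    stack = [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        for link in site.get(page, []):
            if link.endswith(DATA_EXTENSIONS):
                found.append(link)   # a dataset, not a document
            else:
                stack.append(link)   # another document to crawl
    return found

print(harvest("index.html"))  # both datasets discovered via links
```

Even in this toy, the hard parts are exactly the ones the essay names: knowing that a file is data at all, whether you may legally take it, and what its fields mean once you have it.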