Dorothea has written a typically good post challenging the role of RDF in the linked data web, and in particular, its necessity as a common data format.

I was struck by how many of her analyses were spot on, though my conclusions are different from hers. But she nails it when she says:

First, HTML was hardly the only part of the web stack necessary to its explosion. TCP/IP, anyone?

I’m on about this all the time. The idea that we are in web-1995-land for data astounds me. I’d be happy to be proven wrong – trust me, thrilled – but I don’t see the core infrastructure in place for a data web to explode. I see an exploding capability to generate data, and exploding computational capacity to process data. I don’t see the technical standards in place that would enable the concurrent explosion of distributed, decentralized data networks and distributed innovation on data by users.

The Web sits on a massive stack of technical standards that pre-dated it but were perfectly suited to a massive pile of hypertext. The way the domain name system gave human-readable domains to dotted quads lent itself easily to nested trees of documents linked to each other, and those documents didn’t need any more machine-readable context than some instructions to the computer about how to display the text. It’s vital to remember that we as humans are socially wired to use documents, in a way that deeply enabled the explosion of the Web: all we had to do was standardize what those documents looked like and where they were located.

On top of that, at exactly the moment the information on the web started to scale, a key piece of software emerged – the web browser – that made the web, and in many ways the computer itself, easier to use. The graphical web browser wasn’t an obvious invention. We don’t have anything like it for data.

My instinct is that it’s going to be at least ten years’ worth of technical development, especially around drudgery like provenance, naming, versioning of data, but also including things like storage and federated query processing, before the data web is ready to explode. I just don’t see those problems being quick problems, because they aren’t actually technical problems. They’re social problems that have to be addressed in technology. And them’s the worst.

We simply aren’t yet wired socially for massive data. We’ve had documents for hundreds of years. We have only had truly monstrous-scale data for a couple of decades.

Take climate. Climate science data used to be traded on 9-track tapes – as recently as the 1980s. Each 9-track tape maxes out at 140MB. For comparison’s sake, I am shopping for a 2TB backup drive at home. 2TB in 9-tracks is a stack of tapes taller than the Washington Monument. We made that jump in less than 30 years, which is less than a full career-generation for a working scientist. The move to petabyte scale computing is having to be wedged into a system of scientific training, reward, incentives, and daily practice for which it is not well suited. No standard fixes that.
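The arithmetic behind that stack-of-tapes claim is easy to check. A rough sketch in Python – the 140MB capacity is from above, but the 25mm of shelf thickness per reel is my own assumption, and actual reels vary:

```python
# Back-of-the-envelope check of the tape-stack claim.
# Assumed figure: ~25 mm of stack height per 9-track reel (a rough guess).
TAPE_CAPACITY_MB = 140
TAPE_THICKNESS_M = 0.025
TARGET_MB = 2 * 1_000_000           # 2 TB, expressed in MB
MONUMENT_HEIGHT_M = 169             # Washington Monument

tapes_needed = TARGET_MB / TAPE_CAPACITY_MB       # ~14,286 reels
stack_height_m = tapes_needed * TAPE_THICKNESS_M  # ~357 m

print(f"{tapes_needed:.0f} tapes, {stack_height_m:.0f} m stack")
# prints: 14286 tapes, 357 m stack
```

Under those assumptions the stack comes out at roughly twice the monument’s height, so the comparison holds with room to spare.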

Documents were easy. We have a hundreds-of-years-old system of citing others’ work that makes it easy, or easier, to give credit and reward achievement. We have a culture for how to name the documents, and an industry based on making them “trusted” and organized by discipline. You can and should argue about whether or not these systems need to change on the web, but I don’t think you can argue that the document culture is a lot more robust than the data culture.

I think we need to mandate data literacy the way we mandate language literacy, but I’m not holding my breath that it’s going to happen. ’Til then, the web will get better and better for scientists, the way the internet makes logistics easier for Wal-Mart. We’ll get simple mashups, especially of data that can be connected to a map. But the really complicated stuff, like oceanic carbon, that stuff won’t be usable for a long time by anyone not trained in the black arts of data curation, interpretation, and model building.

Dorothea raises another point I want to address:

“not all data are assertions” seems to escape some of the die-hardiest RDF devotees. I keep telling them to express Hamlet in RDF and then we can talk.

This “express Hamlet in RDF” argument is a MacGuffin, in my opinion – it will be forgotten by the third act of the data web. But damn if it’s not a popular argument to make. Clay Shirky did it best.

But it’s irrelevant. We don’t need to express Hamlet in RDF to make expressing data in RDF useful. It’s like getting mad at a car because it’s not an apple. There are absolute boatloads of data out there that absolutely need to be expressed in a common format. Doing climate science or biology means hundreds of databases, filling at rates unimaginable even a few years ago. I’m talking terabytes a day, soon to be petabytes a day. That’s what RDF is for.

It’s not for great literature. I’ll keep to the document format for The Bard, and so will everyone else. But he does have something to remind us about the only route to the data web:

Tomorrow and tomorrow and tomorrow,
Creeps in this petty pace from day to day

It’s going to be a long race, but it will be won by patience and day-by-day advances. It must be won that way, because otherwise we won’t get the scale we need. Mangy approaches that work for Google Maps mashups won’t cut it. RDF might not be able to capture love, or literature, and it may be a total pain in the butt, but it does really well on problems like “how do I make these 49 data sources mix together so I can run a prediction of when we should start building desalination plants along the Pacific Northwest seacoast due to lower snowfall in the Cascade Mountains?”

That’s the kind of problem that has to be modelable, and it has to run against every piece of data possible. It’s an important question to understand as completely as possible. The lack of convenience imposed by RDF is a small price to pay for the data interoperability it brings in this context, to this class of problem.

As more and more infrastructure does emerge to solve this class of problem, we’ll get the benefits of rapid incremental advances on making that infrastructure usable to the Google Maps hacker. We’ll get whatever key piece, or pieces, of data software that we need to make massive scale data more useful. We’ll solve some of those social problems with some technology. We’ll get a stack that embeds a lot of that stuff down into something the average user never has to see.

RDF will be one of the key standards in the data stack, one piece of the puzzle. It’s basically a technology that duct-tapes databases to one another and allows for federated queries to be run. Obviously there needs to be more in the stack. SPARQL is another key piece. We need to get the names right. But we’ll get there, tomorrow and tomorrow and tomorrow…
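To make that duct-tape claim concrete, here is a minimal sketch in plain Python of why a shared triple format makes merging trivial. The station names and predicates are invented for illustration; real RDF would use URIs, and the pattern match at the end stands in for what a SPARQL engine does:

```python
# Each "database" exports its records as (subject, predicate, object) triples.
source_a = {
    ("station42", "snowfall_mm", 812),
    ("station42", "year", 2009),
}
source_b = {
    ("station42", "located_in", "Cascade Mountains"),
    ("station7", "located_in", "Olympic Mountains"),
}

# Because both sources share the same shape, merging is just set union --
# no per-pair schema mapping required.
merged = source_a | source_b

# A query is then a pattern match over the merged graph:
# "snowfall for every station in the Cascades".
cascade_stations = {s for (s, p, o) in merged
                    if p == "located_in" and o == "Cascade Mountains"}
snowfall = {s: o for (s, p, o) in merged
            if p == "snowfall_mm" and s in cascade_stations}
print(snowfall)  # {'station42': 812}
```

The point is the shape, not the code: once 49 sources all speak triples, adding the 50th is a union, not a bespoke integration project.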
