World Opera, Collaborative Science, and Getting On The One

(blows off the dust since the last entry)

(Life trumped blogging; my first child was born in March)

Just before I went into the parent tunnel, which is awesome by the by, I attended a seminar conducted by Niels Windfeld Lund, General Manager of the World Opera.

Not my usual event. But music’s always been a passion for me, and I performed a lot as a kid – lots of trumpet, both the sort of American wind orchestra stuff (seated and marching…yes, a band geek) and some jazz, plus a little bit of drums. These days I plink around on an acoustic bass, badly, but well enough that I’ll be able to sing lullabies to my newborn. Starting to play again made me realize how much of music is a conversation, just as science is a conversation (well, an argument). And so this world opera thing seemed like an interesting way to come at the problem from a field that is semantically a world away from science, but in design space remarkably similar.

This was the second of two recent musical events for me that bear on open collaborative science. The first, which we can draw a lesson from, is the World Choir: a collaborative choir of more than 2,000 people that got tons of press as a collaboration breakthrough. It was designed as an asynchronous request for videos, with a ton of post-processing to stitch them into a single video.

Then there’s the World Opera, which is all about actually performing an opera live, with the performers in multiple cities around the world (it needs better marketing help – the first sentence on the website is “You may not have heard about World Opera”). There’s tons of dimensionality baked into that idea from the start. Niels got it funded in his northern Norwegian home of Tromsø after first pitching telemedicine, which has the same fundamental requirements as online opera: big fiber, low latency, great audio/video capability, and the ability to do meaningful real-time interaction with remote sites. Surgery or string rehearsal, you have to be able to replicate an intense real-life experience. The government apparently preferred opera, which is both wonderful and improbable to me.

They wrestled, or are wrestling, with technical and existential questions. How to use the inevitable delay between performers, turning it into something like the acoustics of a concert hall. Whether to use a live conductor as a distinct part of the performance, or a metronome, or something in between. Basic stuff, like how to practice together but apart. They are able to do it, but it takes much more work than it would in a regular opera.

It’s hard to get a group on the one. It’s hard even when you’re all in the same place. That’s why good labs or departments (or startups) have regular journal clubs, regular lunch sessions, coffee machines that require a fair amount of time to prepare a drink. It helps create that extra time in which the individuals involved fall into a rhythm together. Eric Schadt has called it the “clock gene” of a good lab. And it’s been hard to create virtually in the sciences.

My gut is that we have the two musical performances mixed up. A lot of what we mean by open science is the choir: we’ll do crowdsourced data collection, we’ll see a surge of data from impassioned observers into online groups like Sage Bionetworks, but that data will have to be painstakingly synced and organized before we get a beautiful model.
Real collaborative science is going to be hard, like the opera, because it’ll be hard to get on the one. Big questions, both technical and epistemic, have to get answered.

Collaborative opera is totally disruptive to regular opera. It will be resisted, its flaws will be evident with no post-processing to make it shiny. It’s not just sound, it’s a story, it’s acting, it’s interaction between the performers themselves and the audience. It’s going to suck compared to a purist’s opera – at first.

But as the group learns, and they will, it’ll suck a lot less. Then it’ll be really, really good. Incremental innovation will smooth a ton of edges. The performers will figure out if they want an avatar or a conductor. They’ll get used to the latency, and “hear” it when they play together. There’ll be a disruptive technology, probably made by a frustrated musician, that makes some vital but boring process suddenly either a) easier, b) more stable, or both. Online opera will become a vital part of opera.

These problems, this inherent resistance (in the electrical sense, not the political or incentive sense), are the sort of thing we have to get used to in open science. We can run a bunch of virtual choirs – that’s what 23andMe is doing, and I’m a customer. But our infrastructure, and our design thinking, and most of all our expectations, have to support opera, because it is, like science, hard.


Documents and Data…

Last month I was on Dr. Kiki’s Science Hour. Besides having a lot of fun (despite my technical problems, which were part of my recent move to GNU/Linux and away from Mac!), I also discovered that at least one person I went to high school with is a fan of Dr. Kiki, because he told everyone about the show at my recent high school reunion. Good stuff.

In the show, I did my usual rant about the web being built for documents, not for data. And that got me a great question by email. I wrote a long answer that I decided was a better blog post than anything else. Here goes.

Although I’m familiar with the Creative Commons & Science Commons, the interview really helped me understand the bigger picture of the work you do. Among many other significant and timely anecdotes, I received the message that the internet is built around document search and not data search. This comment intrigued me immensely. I want to explore that a little more to understand exactly what you meant. Most importantly, I want to understand what you believe the key differences between the documents and the data are. From one perspective, the documents contain the data; from another, the data forms the documents.

True, in some cases. But in the case of complex adaptive systems – like the body, the climate, or our national energy usage – the data are frequently not part of a document. They exist in massive databases which are loosely coupled, and are accessed by humans not through search engines but through large-scale computational models. There are so many layers of abstraction between user and data that it’s often hard to know where the actual data at the base of a model reside.

This is at odds with the fundamental nature of the Web. The Web is a web of documents. Those documents are all formatted the same way, using a standard markup language, and the same protocol to send copies of those documents around. Because the language allows for “links” between documents, we can navigate the Web of documents by linking and clicking.

There’s more fundamental stuff to think about. Because the right to link is granted to creators of web pages, we get lots of links. And because we get lots of links (and there aren’t fundamental restrictions on copying the web pages) we get innovative companies like Google that index the links and rank web pages, higher or lower, based on the number of links referring to those pages. Google doesn’t know, in any semantic sense, what the pages are about, what they mean. It simply has the power to do clustering and ranking at a scale never before achieved, and that turns out to be good enough.
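Here’s a toy sketch of that idea (not Google’s actual algorithm, which weights links rather than simply counting them, and every page name here is made up): rank pages purely from the link structure, without ever reading what the pages say.

```python
# Toy link-based ranking: count inbound links and sort. No semantics anywhere.
from collections import defaultdict

# A tiny hypothetical web: each page lists the pages it links to.
links = {
    "home.html": ["paper.html", "data.html"],
    "blog.html": ["paper.html"],
    "news.html": ["paper.html", "blog.html"],
    "paper.html": ["data.html"],
    "data.html": [],
}

inbound = defaultdict(int)
for source, targets in links.items():
    for target in targets:
        inbound[target] += 1

# More pages pointing at you means a higher rank; the ranker never opens a page.
ranking = sorted(links, key=lambda page: inbound[page], reverse=True)
print(ranking)  # paper.html comes out on top, purely from link counts
```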

But in the data world, very little of this applies. The data exist in a world almost without links. There is no accepted standard language for marking up data, though some are emerging. And if you had one, all you would get is another problem – the problem of semantics and meaning. So far at least, the statistics aren’t good enough to help us structure data the way they structure documents.

From what you posited and the examples you gave, I envision a search engine which has the capacity to form documents out of data using search terms, e.g. enter two variables and get a graph as a result instead of page results. Not too far from what ‘Wolfram Alpha’ is working on, but indexing all the data rather than pre-tabulated information from a single server/provider. Perhaps I’m close but I want to make sure we’re on the same sheet of music.

I’m actually hoping for some far more basic stuff. I am less worried about graphing and documents. If you’re at that level, you’ve a) already found the data you need and b) know what questions you want to ask about it.

This is the world in which one group of open data advocates live. It’s the world of apps that help you catch the bus in Boston. It’s one that doesn’t worry much about data integration, or data interoperability, because it’s simple data – where is the bus and how fast is it going? – and because it’s mapped against a grid we understand, which is…well, a map.

But the world I live in isn’t so simple. Deeply complex modeling of climate events, of energy usage, of cancer progression – these are not so easy to turn into iPhone apps. And the way we treat them shouldn’t be to output a document. It’s the wrong metaphor. We don’t need a “map” of cancer – we need a model that tells us, given certain inputs, what our decision matrix looks like.

I didn’t really get this myself until we started playing around with massive-scale data integration at Creative Commons. But since then, in addition to what we do here, I’ve been to the NCBI, I’ve been to Oak Ridge National Lab, I’ve been to CERN…and the data systems they maintain are monstrous. They’re not going to be copied and maintained elsewhere, at least not without lots of funding. They’re not “webby” the way mapping projects are. There aren’t a lot of hackers who can use them, nor is there a vast toolset to draw on.

So I guess I’m less interested in search engines for data than I am in making sure that the people building the models can use crawlers to find the data they want, and that they are legally allowed to harvest and integrate that data. Doing so is not going to be easy. But if we don’t design for that world, for model-driven access, then harvest and integration will quickly approach NP levels of complexity. We cannot assume that the tools and systems that let us catch the bus will let us cure cancer. They may, someday, evolve into a common system, and I hope they do – but for now, the iPhone approach is using a slingshot against an armored division.


Marking and Tagging the Public Domain

I am cribbing significant amounts of this post from a Creative Commons blog post about tagging the public domain. Attribution goes to Diane Peters for the material I’ve incorporated 🙂

The big news is that, 18 months after we launched CC0 1.0 – our public domain waiver that allows rights holders to place a work as nearly as possible into the public domain, worldwide – it has been a success. CC0 has proven a valuable tool for governments, scientists, data providers, providers of bibliographic data, and many others throughout the world. CC0 has been used by the pharmaceutical industry giant GSK as well as by the emerging open data leader Sage Bionetworks (disclosure – I’m on the Board of Sage, though not of GSK!).

At the time we published CC0, we made note of a second public domain tool under development – a tool that would make it easy for people to tag and find content already in the public domain. That tool, our new “Public Domain Mark” (PDM), is now published for comment.

The PDM allows works already in the public domain to be marked and tagged in a way that clearly communicates the work’s PD status, and allows it to be easily discoverable. The PDM is not a legal instrument like CC0 or our licenses — it can only be used to label a work with information about its public domain copyright status, not change a work’s current status under copyright. However, just like CC0 and our licenses, PDM has a metadata-supported deed and is machine readable, allowing works tagged with PDM to be findable on the Internet. (Please note that the example used on the sample deed is purely hypothetical at the moment.)
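To make “machine readable” concrete, here is a minimal sketch – not CC’s official tooling, and the work URL is hypothetical – of the kind of statement the mark boils down to: an RDF assertion, using the cc:license property from the CC REL vocabulary, that a particular work carries the Public Domain Mark. A crawler can pick that up without parsing the page text.

```python
from rdflib import Graph, Namespace, URIRef

CC = Namespace("http://creativecommons.org/ns#")   # CC REL vocabulary

g = Graph()
work = URIRef("http://example.org/scans/1843-folio.jpg")           # hypothetical work
pdm = URIRef("http://creativecommons.org/publicdomain/mark/1.0/")  # PDM 1.0 deed

# The whole "tag": this work is marked with the Public Domain Mark.
g.add((work, CC.license, pdm))

# Serialized, this is what a crawler or search tool would discover.
print(g.serialize(format="turtle"))
```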

We are also releasing for public comment general purpose norms — voluntary guidelines or “pleases” that providers and curators of PD materials may request be followed when a PD work they have marked is thereafter used by others. Our PDM deed as well as an upcoming enhanced CC0 deed will support norms in addition to citation metadata, which will allow a user to easily cite the author or provider of the work through copy-paste HTML.

This is absolutely critical to science, because it addresses at last the biggest reason that people misuse copyright licenses on uncopyrightable materials and data sets: the confusion of the legal right of attribution in copyright with the academic and professional norm of citing the efforts one builds on. Making it easy to cite, regardless of the law, is one of the keys to making the public domain something we can construct through individual private choice at scale, not just by getting governments to adopt it.

The public comment period will close on Wednesday, August 18th. Why so short? For starters, PDM is not a legal tool in the sense that our licenses and CC0 are legally operative – no legal rights are being surrendered or affected, and there is no accompanying legal code to finesse. Just as importantly, however, we believe that having the mark used sooner rather than later will allow early adopters to provide us with invaluable feedback on actual implementations, which will let us improve the marking tool in the future.

The primary venue for submitting comments and discussing the tool is the cc-licenses mailing list. We look forward to hearing from you!

There are a lot of fascinating projects around how to do the non-legal work of data. The Sage Commons has seen a bunch of them come together, but in this context I want to call out one in particular: the SageCite project, driven by UKOLN, the University of Manchester, and the British Library, which is going to develop and test an entire framework for citation, not attribution, using bioinformatics as a test case.

My own hope is that by making citation a cut-and-paste process inside the Creative Commons legal tools that work on the public domain, we can facilitate the emergence of frameworks like SageCite, so that the legal aspects fade away on the data sets and databases themselves and the focus can be on the more complex network models of complex adaptive systems. And I’m tremendously excited to see members of the community leveraging the Sage project to do independent, crucial work on the topic of citation. Like Wikipedia, Sage won’t work unless it is something that we all own together and work on for our own reasons.

This is still only the beginning of really open data – public domain data – that complies with the Panton Principles. Creative Commons has spent six long years studying the open data issue, and rolling out the policy, tools, and technologies that make it possible for end users from the Dutch government to the Polar Information Commons to create their own open data systems.

We still have to avoid the siren song of property rights on data, and of license proliferation. But it’s starting to feel like momentum is building behind public domain data, and behind the Creative Commons tools that make it a reality. Making citation one-click, and making it easy to tag and mark the public domain, is part of that momentum. Please help us by commenting on the tools, and by promoting their use when you run across an open data project where the terms are unclear.


rdf:about="Shakespeare"

Dorothea has written a typically good post challenging the role of RDF in the linked data web, and in particular, its necessity as a common data format.

I was struck by how many of her analyses were spot on, though my conclusions are different from hers. But she nails it when she says:

First, HTML was hardly the only part of the web stack necessary to its explosion. TCP/IP, anyone?

I’m on about this all the time. The idea that we are in web-1995-land for data astounds me. I’d be happy to be proven wrong – trust me, thrilled – but I don’t see the core infrastructure in place for a data web to explode. I see an exploding capability to generate data, and exploding computational capacity to process data. I don’t see the technical standards in place to enable the concurrent explosion of distributed, decentralized data networks and distributed innovation on data by users.

The Web sits on a massive stack of technical standards that pre-dated it, but that were perfectly suited to a massive pile of hypertext. The way the domain name system gave human-readable domains to dotted quads lent itself easily to nested trees of documents linked to each other, and didn’t need any more machine-readable context than some instructions to the computer about how to display the text. It’s also vital to remember that we as humans are socially wired to use documents, which was deeply enabling to the explosion of the Web: all we had to do was standardize what those documents looked like and where they were located.

On top of that, at exactly the moment in time that the information on the web started to scale, a key piece of software emerged – the web browser – that made the web, and in many ways the computer itself, easier to use. The graphic web browser wasn’t an obvious invention. We don’t have anything like it for data.

My instinct is that it’s going to be at least ten years’ worth of technical development, especially around drudgery like provenance, naming, versioning of data, but also including things like storage and federated query processing, before the data web is ready to explode. I just don’t see those problems being quick problems, because they aren’t actually technical problems. They’re social problems that have to be addressed in technology. And them’s the worst.

We simply aren’t yet wired socially for massive data. We’ve had documents for hundreds of years. We have only had truly monstrous-scale data for a couple of decades.

Take climate. Climate science data used to be traded on 9-track tapes – as recently as the 1980s. Each 9-track tape maxes out at 140MB. For comparison’s sake, I am shopping for a 2TB backup drive at home. 2TB in 9-tracks is a stack of tapes taller than the Washington Monument. We made that jump in less than 30 years, which is less than a full career-generation for a working scientist. The move to petabyte-scale computing is having to be wedged into a system of scientific training, reward, incentives, and daily practice for which it is not well suited. No standard fixes that.
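A quick back-of-the-envelope check of that stack, with the reel thickness as an explicit assumption (roughly the half-inch tape width plus a little clearance for the flanges); the capacity and drive size are the numbers above, and the Washington Monument is about 169 meters tall.

```python
TAPE_CAPACITY_MB = 140        # max capacity of one 9-track tape
DRIVE_TB = 2                  # the backup drive I'm shopping for
REEL_THICKNESS_M = 0.012      # assumed ~12 mm per reel, stacked flat
WASHINGTON_MONUMENT_M = 169   # ~169 m tall

tapes_needed = DRIVE_TB * 1_000_000 / TAPE_CAPACITY_MB    # ~14,300 tapes
stack_height_m = tapes_needed * REEL_THICKNESS_M          # ~171 m

print(f"{tapes_needed:,.0f} tapes, ~{stack_height_m:.0f} m of stack "
      f"vs. {WASHINGTON_MONUMENT_M} m of monument")
```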

Documents were easy. We have a hundreds-of-years-old system of citing others’ work that makes it easy, or easier, to give credit and reward achievement. We have a culture for how to name the documents, and an industry based on making them “trusted” and organizing them by discipline. You can and should argue about whether or not these systems need to change on the web, but I don’t think you can argue against the fact that the document culture is a lot more robust than the data culture.

I think we need to mandate data literacy the way we mandate language literacy, but I’m not holding my breath that it’s going to happen. Until then, the web will get better and better for scientists, the way the internet makes logistics easier for Wal-Mart. We’ll get simple mashups, especially of data that can be connected to a map. But the really complicated stuff, like oceanic carbon – that stuff won’t be usable for a long time by anyone not trained in the black arts of data curation, interpretation, and model building.

Dorothea raises another point I want to address:

“not all data are assertions” seems to escape some of the die-hardiest RDF devotees. I keep telling them to express Hamlet in RDF and then we can talk.

This “express Hamlet in RDF” argument is a MacGuffin, in my opinion – it will be forgotten by the third act of the data web. But damn if it’s not a popular argument to make. Clay Shirky did it best.

But it’s irrelevant. We don’t need to express Hamlet in RDF to make expressing data in RDF useful. It’s like getting mad at a car because it’s not an apple. There are absolute boatloads of data out there that absolutely need to be expressed in a common format. Doing climate science or biology means hundreds of databases, filling at rates unimaginable even a few years ago. I’m talking terabytes a day, soon to be petabytes a day. That’s what RDF is for.

It’s not for great literature. I’ll keep to the document format for The Bard, and so will everyone else. But he does have something to remind us about the only route to the data web:

Tomorrow and tomorrow and tomorrow,
Creeps in this petty pace from day to day

It’s going to be a long race, but it will be won by patience and day-by-day advances. It must be won that way, because otherwise we won’t get the scale we need. Mangy approaches that work for Google Maps mashups won’t cut it. RDF might not be able to capture love, or literature, and it may be a total pain in the butt, but it does really well on problems like “how do I make these 49 data sources mix together so I can run a prediction of when we should start building desalination plants along the Pacific Northwest seacoast due to lower snowfall in the Cascade Mountains?”

That’s the kind of problem that has to be modelable, and the model has to run against every piece of data possible. It’s an important question to understand as completely as possible. The lack of convenience imposed by RDF is a small price to pay for the data interoperability it brings in this context, to this class of problem.
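Here is a minimal sketch of what that mixing looks like in practice, using two tiny made-up RDF sources (every name, predicate, and number here is invented for illustration; a real climate query would span dozens of institutional stores):

```python
from rdflib import Graph

# Source one: a hypothetical snow-monitoring station.
station_data = """
@prefix ex: <http://example.org/vocab#> .
<http://example.org/station/cascades-1> ex:snowfallCm 310 ;
    ex:inBasin <http://example.org/basin/columbia> .
"""

# Source two: a hypothetical registry of which basins feed which plants.
river_data = """
@prefix ex: <http://example.org/vocab#> .
<http://example.org/basin/columbia> ex:feedsPlant <http://example.org/plant/desal-pnw-1> .
"""

g = Graph()
g.parse(data=station_data, format="turtle")
g.parse(data=river_data, format="turtle")   # "merging" is just loading into one graph

# One query spanning both sources: which plants depend on basins fed by
# stations reporting low snowfall?
results = g.query("""
    PREFIX ex: <http://example.org/vocab#>
    SELECT ?plant ?snow WHERE {
        ?station ex:snowfallCm ?snow ;
                 ex:inBasin ?basin .
        ?basin ex:feedsPlant ?plant .
        FILTER(?snow < 400)
    }
""")
for plant, snow in results:
    print(plant, snow)
```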

As more and more infrastructure does emerge to solve this class of problem, we’ll get the benefits of rapid incremental advances on making that infrastructure usable to the Google Maps hacker. We’ll get whatever key piece, or pieces, of data software that we need to make massive scale data more useful. We’ll solve some of those social problems with some technology. We’ll get a stack that embeds a lot of that stuff down into something the average user never has to see.

RDF will be one of the key standards in the data stack, one piece of the puzzle. It’s basically a technology that duct-tapes databases to one another and allows for federated queries to be run. Obviously there needs to be more in the stack. SPARQL is another key piece. We need to get the names right. But we’ll get there, tomorrow and tomorrow and tomorrow…


Of Pepsi and ScienceBlogs…

I’ve gotten a few emails about the Pepsi-ScienceBlogs tempest. It’s clearly taken a toll on ScienceBlogs’ credibility. Some of my SciBlings have resigned in protest, and others are taking shots on the topic.

Sponsorship is part of scientific publishing, even in the peer reviewed world. Remember how Merck published an entire fake journal to promote Vioxx? How much money gets spent on reprints that support a company’s position, on articles paid for with corporate research funds?

Today’s hullaballoo is more honest than either of those. My gut reaction is: calm down, world. This was a miserable rollout in which a lack of transparency and community engagement turned a little fire into a conflagration.

I’m not going to resign my blog here – at least, not now. This is not a sponsored blog. I receive a salary from my (non-profit) employer, Creative Commons, and I also take in a little consulting revenue and the odd speaking fee, all in the service of promoting the digital commons. I am checking my arrangements to see whether I’m allowed to fully disclose them, and if I can, I’ll publish the list here.

I also know Adam Bly pretty well on a personal level. He’s a good guy. In full disclosure, he’s been a supporter of Creative Commons personally, and Seed supports the organization professionally as well. But I don’t think that colors my opinion today. He’s not out to sell bad science; he’s out to transform scientific publishing on the internet. I have no doubt that this decision was arrived at after lengthy debate and internal argument.

But I don’t know anything more than what I’ve read on the internet. I filter my SB mail for reading once a week or so, as I get too much email every day as it is. So I found out about this when my twitter feed exploded today.

I am sanguine about the realities of running a site like ScienceBlogs. It’s not free. I’ve run a company. I know what it’s like to hire people in fat times and to lay them off in lean times, how hard it is to tell investors that the revenues are drying up. It’s a perspective that is hard earned. It’s a reality that forces decisions in support of shareholders, not just in support of bloggers and readers. And in a massive recession that becomes even more true.

But that perspective means that the choice is understandable, not that the situation was handled well. If a site like SB is going to do this, then the entire process must be painfully transparent. I’ve watched as sites I love, like Fark.com and some of the various Gawker blogs, began to accept sponsored links – but they are LABELED as such. These SB blogs need to be plastered with the fact that they are indeed bought space, bought by companies, not by individuals thinking freely (like the rest of us). Different graphic design, disclaimer text in the templates, that sort of thing. I would personally love to see a piece of RDFa that my browser can auto-ignore, just as I block pop-up ads.

This screenshot makes it pretty clear it’s sponsored, and that it’s “advertorial” content. It’s a good start, though too late to stop the frenzy that is an internet blamestorm.

The problem is that it’s not something that can be done post hoc. The distinction between content and advertisement needs to be drawn in a fashion that is transparent to the community at large, because although the decision to accept sponsored blogs may help shareholders, it affects the people the sponsored blogs want to associate with (us free-thinking bloggers) and the people they want reading the sponsored blogs (that’s you). And we’re the community that got smeared by the rollout.

That’s the anger; that’s what is driving the reaction here. We weren’t consulted (the royal we) in advance. And even if we hadn’t said anything smart or interesting, getting the chance to chime in on this type of thing would have released the tension in a way that created more trust, not less. I’m going to argue for more transparency from SB as to their finances and decision-making, but I’m not going to leave because of one false step.

Because I’ve made some myself, and I believe in treating others as I would like to be treated.

Obviously, I reserve the right to change my mind as new data rolls in. If the site continues to display the sort of tin ear it has displayed in this one, then I’ll have to refactor. But for today, I’m sticking around, and urging calm thinking and open minds.


Kaitlin Thaney moves on…

I tend to want to make posts on Creative Commons related topics at the CC blog, but this is essentially a personal post, and I also want to have it as widely read in our community as possible.

Today is Kaitlin Thaney‘s last day at CC. She’s been working for us on the Science Commons project for a long time – starting part time in mid 2006, full time in early 2007 – and she’s been an absolutely essential part of our success over the years.

I first met Kaitlin because she was interning, while finishing at Northeastern, for a joint MIT-Microsoft project called iCampus. She started showing up at science talks and asking good questions, and I poached her so we could have her help us with our first Science Commons international data sharing conference, held at the US National Academies. Here are her hands organizing nametags that day.

From the get-go, she’s been an incredible employee. She has taken on every task without question, and shown a remarkable level of skill and savvy and capacity, moving from the boring (nagging me on projects) to the remarkable (working with the Polar Information Commons to get their data towards the public domain) to the ridiculous (finding ironically appropriate plush kidney toys to give to our counsel). She’s also become a dear friend, all the way to flying to Brazil to be a part of my wedding last year.

Kaitlin leaves us for a remarkable opportunity in online science that I am not going to describe in detail here. She’ll be speaking for herself on this topic later in July. All I know is that we, as a group, will miss her, and that I as an individual will miss her too. It’s a good move she’s making, and I wish her all the best. You should follow her on twitter and subscribe to her blog.

Thank you, KT. You’ve been a linchpin.


Brains Open Access Initiative


An old tradition and a new technology have converged to make possible an unprecedented good. The old tradition is the willingness of scientists and scholars to publish the fruits of their research – coming from their brains – in scholarly journals without payment, for the sake of inquiry and knowledge. The problem with this approach is that the brains are not exposed, just the thoughts, and that the brains available have been those physically accessible, such as those at the local university. Thus, those desiring to gain new knowledge through the consumption of peer-reviewed brains have been restricted in their capacity by physical and economic realities.

The new technology that changes everything is the low-cost economy airline. The public good the low-cost airline makes possible is the world-wide distribution of peer-reviewed brains and completely free and unrestricted access to them by all scientists, scholars, teachers, students, and anyone hungry for brains. It is also now possible to visit new areas and taste brains from multiple disciplines, multiple nationalities, and in multiple cuisines.

Removing access barriers to these brains will accelerate satiety, enrich custards, share the brains of the rich with the poor and the poor with the rich, and lay the foundation for uniting humanity in a common intellectual conversation and quest for good brains recipes.

For various reasons, this kind of free and unrestricted availability, which we will call open access, has so far been limited to small portions of the world’s brains. But even in these limited collections, many different initiatives have shown that open access to brains is economically feasible, that it gives us extraordinary power to find and make use of relevant brains, and that it gives brains and their works vast and measurable new visibility, readership, impact, and fresh, seasonal preparations. To secure these benefits for all, we call on all interested institutions and individuals to help open up access to the rest of these brains and remove the barriers, especially the price barriers, that stand in the way. The more who join the effort to advance this cause, the sooner we will all enjoy the benefits of open access to brains.

The brains that should be freely accessible online are those which scholars give to the world without expectation of payment. Primarily, this category encompasses their peer-reviewed academic brains, but it also includes any unreviewed child brains that they might wish to expose for comment or to alert colleagues to tasty young brains. There are many degrees and kinds of wider and easier access to these brains. By “open access” to brains, we mean their free availability via the public airport system, permitting any users to bake, saute, fry, sear, grill, slow cook under pressure, or use as an ingredient in a savory tart, to muddle them for soup, pass them as basis for stock, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the air system itself. The only constraint on cooking and distribution, and the only role for property rights in this domain, should be to give chefs control over the integrity of their work and the right to be properly acknowledged and cited.

While the peer-reviewed brains should be accessible without cost to eaters, brains are not costless to produce. However, experiments show that the overall costs of providing open access to brains are far lower than the costs of traditional forms of dissemination. With such an opportunity to save money and expand the scope of dissemination at the same time, there is today a strong incentive for professional associations, restaurants, grocery stores, Costco, and others to embrace open access as a means of advancing their missions. Achieving open access will require new cost recovery models and financing mechanisms, but the significantly lower overall cost of dissemination is a reason to be confident that the goal is attainable and not merely preferable or utopian.

To achieve open access to scholarly brains, we recommend two complementary strategies. 

I.  Self-Archiving: First, scholars need the tools and assistance to deposit their brains in open archives, a practice commonly called self-archiving. When these archives conform to standards created by the Open Archives Initiative, then search engines and other tools can treat the separate archives as one. Users then need not know which archives exist or where they are located in order to find and make use of their brains.

II. Open-Access Brains: Second, scholars need the means to launch a new generation of brains committed to open access, and to help existing brains that elect to make the transition to open access. Because brains should be disseminated as widely as possible, these new brains will no longer invoke physical property rights to restrict access to and use of the grey matter, white matter, and surrounding fluids. Instead they will use property rights and other tools to ensure permanent open access to all the brains they review. Because price is a barrier to access, these new brains will not charge subscription or access fees, and will turn to other methods for covering their expenses. There are many alternative sources of funds for this purpose, including the foundations and governments that fund procreation, the universities and laboratories that possess students inclined to make more brains, endowments set up by discipline or institution, friends of the cause of open access, profits from the sale of add-ons to the basic brains, funds freed up by the demise or cancellation of brains charging traditional subscription or access fees, or even contributions from the researchers themselves. There is no need to favor one of these solutions over the others for all disciplines or nations, and no need to stop looking for other, creative alternatives.

Open access to peer-reviewed brains is the goal. Self-archiving (I.) and a new generation of open-access brains (II.) are the ways to attain this goal. They are not only direct and effective means to this end, they are within the reach of scholars themselves, immediately, and need not wait on changes brought about by markets or legislation. While we endorse the two strategies just outlined, we also encourage experimentation with further ways to make the transition from the present methods of brain dissemination to open access. Flexibility, experimentation, and adaptation to local circumstances are the best ways to assure that progress in diverse settings will be rapid, secure, and mouthwatering.

The Open Brain Institute, the foundation network founded by philanthropist George Romero, is committed to providing initial help and funding to realize this goal. It will use its resources and influence to extend and promote institutional self-archiving, to launch new open-access brains, and to help an open-access brain system become economically self-sustaining. While the Open Brain Institute’s commitment and resources are substantial, this initiative is very much in need of other organizations to lend their effort and resources.

We invite governments, restaurants, grocers, cooking shows, home cooks, learned societies, professional associations, and individual scholars who share our vision to join us in the task of removing the barriers to open access and building a future in which brains in every part of the world are that much more free to poach in a light cream sauce.

(based on, and with apologies to, the Budapest Open Access Initiative)
(H/T to Joseph Hewitt and Ataraxia Theater for the wicked cool zombie image!)
(for some really open access brain stuff, check out the Neurocommons.)


Open Data and Creative Commons: It’s About Scale…

As part of the series of posts reflecting on the move of Science Commons to Creative Commons HQ, I’m writing today on Open Data.

I was inspired to start the series with open data by the remarkable contribution, by GSK, to the public domain of more than 13,000 compounds known to be active against malaria. They were the first large corporation to implement the CC0 tool for making data into open data. CC0 is the culmination of years of work at Creative Commons, and the story’s going to require at least two posts to tell…

Opening up data was a founding aspect of the Science Commons project at CC. I came to the Creative Commons family after spending six years mucking about in scientific data, first trying to make public databases more valuable at my startup Incellico, and later at the World Wide Web Consortium (W3C) where I helped launch the interest group on the semantic web for life sciences. When I left the W3C in late 2004, data was my biggest passion – and it remains a driving focus of everything we do at Creative Commons.

Data is tremendously powerful. If you haven’t read the Halevy, Norvig & Pereira article on the unreasonable effectiveness of data, go do so, then come back here. It’s essential subtext for what we do at Creative Commons on open data. But suffice to say that with enough data, a lot of problems become tractable that were not tractable before.

Perhaps most important in the sciences, data lets us build and test models. Models of disease, models of climate, models of complex interactions. And as we move from a world in which we analyze at a local scale to one where we analyze at global scale, the interoperability of data starts to be an absolutely essential pre-condition to successful movement across scales. Models rest on top of lots of data that wasn’t necessarily collected to support any given model, and scalable models are the key to understanding and intervening in complex systems. Sage Bionetworks is a great example of the power of models, and the Sage Commons Congress a great example of leveraging the open world to achieve scale.

Building the right model, and responding to the available data, is the difference between thinking we have 100,000 genes or 20,000. Between thinking carbon is a big deal in our climate or not. And scale is at the heart of using models. Relying on our brains to manage data doesn’t scale. Models – the right ones – do scale.

My father (yeah, being data-driven runs in the family!) has done years of important work on the importance of scale that I strongly recommend. His work relates to climate change and climate change adaptation, but it applies equally to most of the complex, massive-scale science out there today. Scale – and integration – are absolutely essential aspects of data, and it is only by reasoning backward from the requirements imposed by scale and integration that we are likely to arrive at the right use cases and tasks for the present day, whether those are technical choices or legal choices about data.

We chose a twin-barreled strategy at Creative Commons for open data.

First, the semantic web was going to be the technical platform. This wasn’t out of a belief that somehow the semantic web would create a Star Trek world in which one could declaim “computer, find me a drug!” and get a compound synthesized. We instead arrived at the semantic web by working backward from the goal of having databases interoperate the way the Web interoperates, where a single query into a search engine would yield results from across tens of thousands of databases, whether or not those databases were designed to work together.

We also wanted, from the start, to make it easy to integrate the web of data and the scholarly literature, because it seemed crazy that articles based on data were segregated away from the data itself. The semantic web was the only option that served both of those tasks, so it was an easy choice – it supports models, and it’s a technology that can scale alongside its success.

The second barrel of the strategy was legal. The law wasn’t, and isn’t, the most important part of open data – the technical issues are far more problematic, given that I could send out a stack of paper-based data with no legal restraints and it’d still be useless.

But dealing with the law is an essential first step. We researched the issue for more than two years, examining the application of Creative Commons copyright licenses for open data, the potential to use the wildly varied and weird national copyright regimes or sui generis regimes for data, the potential utility of applying a contract regime (like we did in our materials transfer work), and more. Lawyers and law professors, technologists and programmers, scientists and commons advocates all contributed.

In our search for the right legal solution for open data, we held conferences at the US National Academy of Sciences, informal study sessions at three international iSummit conferences, and finally a major international workshop at the Sorbonne. We drew in astronomers, anthropologists, physicists, genomicists, chemists, social scientists, librarians, university tech transfer offices, funding agencies, governments, scientific publishers, and more. We heard a lot of opinions, and we saw a pattern emerge. The successful projects that scaled – like the International Virtual Observatory Alliance or the Human Genome Project – used the public domain, not licenses, for their data, and managed conflicts with norms, not with law.

We also ran a technology-driven experiment. We decided to try to integrate hundreds of life science data resources using the Linux approach as our metaphor, where each database was actually a database package and integration was the payoff. We painstakingly converted resource after resource to RDF/OWL packages. We wrote the software that wires them all together into a single triple store. We exposed the endpoint for SPARQL queries. And we made the whole thing available for free.

As part of this, we had to get involved in the OWL 2 working group, the W3C’s technical architecture group, and more. We had to solve very hairy problems about data formats. We even had to develop a new set of theories about how to promote shared URIs for data objects.

Like I said, the technology was a hell of a lot hairier than the law. But it worked. We get more than 37,000 hits a day on the query endpoint. There are at least 20 full mirrors in the wild that we know of. It’s beginning to scale.
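For a sense of what “exposing the endpoint for SPARQL queries” buys on the consumer side, here is a hedged sketch using the SPARQLWrapper library. The endpoint URL and the query are placeholders, not the actual Neurocommons service, but any SPARQL endpoint built this way can be queried the same way, with no knowledge of how the triple store was assembled.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint URL; substitute a real SPARQL service.
endpoint = SPARQLWrapper("http://sparql.example.org/query")
endpoint.setQuery("""
    SELECT ?thing ?label WHERE {
        ?thing <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

# Run the query over HTTP and print whatever the store knows about.
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["thing"]["value"], row["label"]["value"])
```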

But because of the law, we also had to eliminate good databases with funky legal code, including funky legal code meant to foster more sharing. We learned pretty quickly that the first thing you do with those resources is throw them away, even if the licensing was well intentioned. The technical work is just too hard. Adding legal complexity to the system made the work intolerable. When you actually try to build on open data, you learn quickly that unintended use tends to rub up against legal constraints of any kind – share-alike constraints every bit as much as commercial constraints.

We would never have learned this lesson without actually getting deep into the re-use of open databases ourselves. Theory, in this case, truly needed to be informed by practice.

What we learned, first and foremost, was that the combination of truly open data and the semantic web supports the use of that data at Web scale. It’s not about open spreadsheets, or open databases hither and yon. It’s not about posting tarballs to one’s personal lab page. Those are laudable activities, but they don’t scale. Nor does applying licenses that impose constraints on downstream use, because the vast majority of uses for data aren’t yet known – and today’s legal code might well prevent them. Remember that the fundamental early character of the Web was public domain, not copyleft. Fundamental stuff needs fundamental treatment.

And data is fundamental. We can’t treat it, technically or legally, like higher level knowledge products if we want it to serve that fundamental role. The vast majority of it is in fact not “knowledge” – it is the foundation upon which knowledge is built, by analysis, by modeling, by integration into other data and data structures. And we need to begin thinking of data as foundation, as infrastructure, as a truly public good, if we are to make the move towards a web of data, a web that supports models, a world in which data is useful at scale.

I’ll return to the topic in my next post to outline exactly how the Creative Commons toolkit – legal, technical, social – serves the Open Data community.


On Science Commons’ Moving West…

I’ve kept this blog quiet lately – for a wide range of reasons – but a few questions that have come in have prompted me to start up a new series of posts.

The main reason for the lack of posts around here is that I’ve been very busy, and for the most part, I’ve used this blog for a lot of lengthy posts on weighty topics. At least, weighty to me. If you want a more informal channel, you can follow me on twitter, as I prefer tweeting links and midstream thoughts to rapid-fire short blog entries. The joy of a blog like this for me is the chance to explore subjects in greater depth. But it also means that during times of extreme hecticness, I won’t publish here as much.

Anyhow. I’ve been busy with a pretty big task, which is getting me, my family, and the Science Commons operation moved from Boston to San Francisco. We’re moving from our longtime headquarters at MIT into the main Creative Commons offices, and it’s a pretty complex set of logistics on both personal and professional levels.

As an aside, I’m now very close to some downright amazing chicken and waffles, and that’s exciting.

Now, I would have thought the world would interpret this in the clear manner that I see it: we Science Commons folks are, and have always been, part and parcel of the Creative Commons team, so the move didn’t strike me as super-important if you’re not one of the people who has to pack boxes. If you email us, our addresses end with @creativecommons.org. That’s where our paychecks come from. So having us integrate into the headquarters offices doesn’t seem like such a big deal. But I keep getting rumbles that people think we’re somehow “going away” or “disappearing” – and that’s why there’s going to be a series of posts on the move and its implications.

So let me be as blunt as possible: Science at Creative Commons, and the work we do at the Science Commons project, isn’t going anywhere. We are only going to be intensifying our work, actually. You can expect some major announcements in the fall about some major new projects, and you’ll learn a lot about the strategic direction we plan to take then. I can’t talk about it all yet, because not all the moving pieces are settled, but suffice to say the plans are both Big and Exciting. We’ve already added a staff member – Lisa Green – who is both a Real Scientist and experienced in Bay Area science business development, to help us realize those plans.

Our commitments and work over the past six years of operations aren’t going anywhere either. We will continue to be active, vocal, and visible proponents of open access and open data. We will continue to work on making biological materials transfer, and technology transfer, a sane and transparent process. And our commitment to the semantic web – both in terms of its underlying standards and in terms of keeping the Neurocommons up and running – is a permanent one.

You can catch up with our achievements in later posts, or follow our quarterly dispatches. We get a lot of stuff done for a group of six people, and that’s not going to change either.

Some things *are* likely to change. For example, I don’t like the Neurocommons name for that project much any more – it’s far more than neuroscience in terms of the RDF we distribute, and the RDFHerd software will wire together any kind of database that’s formatted correctly. But those changes are changes of branding, not of substance in terms of the work.

It is, however, now time to get our work and the powerful engine that is the Creative Commons headquarters together. I’m tired of seeing the fantastic folks that I work with twice a year. We’re missing a ton of opportunities to bring together knowledge in the HQ – especially around RDFa and metadata for things like scholarly norms – by being physically separated. Not to mention that the San Francisco Bay Area is perhaps the greatest place on earth to meet the people who change the world, every day, through technology.

I’m also tired of living on the road. I’m nowhere near Larry Lessig and Joi Ito in terms of my travel, but I’m closing in on ten years of at least 150,000 miles a year in airplanes. It gets old. Most of our key projects at this point are on the west coast, like Sage Bionetworks and the Creative Commons patent licenses, and we’re developing a major new project in energy data that is going to be centered in the Bay Area as well. The move gives me the advantage of being able to support those projects, which are much more vital to the long term growth of open science than conference engagements, without 12 hours of roundtrip plane flights.

I’ll be looking back at the past years of work in Boston over the coming weeks here. I’m in a reflective mood and it’s a story that needs to be told. We’ve learned a lot, and we’ve had some real successes. And we’re not abandoning a single inch of the ground that we’ve gained in those years. So if you hear tell that we’re disappearing or going away, kindly point whoever says so here and let them know they will have us around for quite some time into the future…


Open Hardware

Creative Commons was fortunate enough to be involved in a fascinating workshop last week in New York on Open Hardware. Video is at the link, photos below.

The background is that I met Ayah Bdeir at the Global Entrepreneurship Week festivities in Beirut, and we started talking about her LittleBits project (which is, crudely, like Legos for electronics assembly – even someone as spatially impaired as me could build a microphone or pressure sensor in minutes).

Ayah introduced me to the whole open hardware (OH) world and asked a lot of very good, hard to answer questions about how to use CC in the context of OH. It became clear that a lot of the people involved in the movement didn’t have a clear grasp of how the various layers of intellectual property might or might not apply.

Ayah suggested in February that we put together a little workshop – almost a teach-in – around a meeting of Arduino advocates happening in NYC on the 18-19 of March. In a matter of three weeks, we got representatives from a bunch of major players to commit: Arduino (world’s largest open hardware platform), BugLabs, Adafruit, Chumby, Make magazine, even Chris Anderson. Mako Hill from the Free Software Foundation came and @rejon made it there at the last minute too, wearing his openmoko and qi hardware hats. Eyebeam hosted it for free, and we picked up the snacks and cheese trays.

I gave a very short intro laying out how the science commons project @ creative commons has spent a lot of time looking at IPRs as a layered problem, dealing with it at data levels, materials levels, and patent levels, as well as the fact-idea-expression relationships in science. This was to create some context for why we might have interesting ideas.

Thinh proceeded to deliver a masterful lecture on IP that went on for hours, though intended to be 30 minutes. It was an interactive, give-and-take, wonderful session to watch, ranging from copyrights to mask works to trade secrets to trademarks and patents. The folks there liked it enough to suspend the break period after five minutes and dive back into IP.

After that we had a lengthy interactive session driven by the OH folks in which they tried to decide what a declaration of principles might look like, how detailed to get, how to engage with existing efforts doing similar things (like OHANDA), the role of publishers like Wired and Make in supporting definitions of open hardware, and how open one had to be in order to be open.

There was no formal outcome at the close of business, but I expect a declaration or statement of some sort to emerge (akin to the Budapest Open Access Initiative from my own world of scholarly publishing). There’s clearly a lot of work to be done. And the reality is that copyrights and patents and trademarks and norms and software and hardware are going to be hard to reconcile into a simple, single license that makes “copyleft hardware” a reality. But it was fun to be in a room with so many passionate, brilliant people who want to make the world a better place through collaborative research.

More to come once results emerge…


Reaching Agreement On The Public Domain For Science


Photo outside the Panton Arms pub in Cambridge, UK, licensed to the public under Creative Commons Attribution-ShareAlike by jwyg (Jonathan Gray).

Today marked the public announcement of a set of principles on how to treat data, from a legal perspective, in the sciences. Called the Panton Principles, they were negotiated over the summer among myself, Rufus Pollock, Cameron Neylon, and Peter Murray-Rust. If you’re too busy to read them directly, here’s the gist: publicly funded science data should be in the public domain, full stop.

If you know me and my work, this is nothing new. We have been saying this since late 2007. I’ve already gotten a dozen emails asking me why this is newsworthy, when it’s actually a less normative version of the Science Commons protocol for open access to data (we used words like “must” and “must not” instead of the “should” and “should not” of the principles).

It’s newsworthy to me because it represents a ratification of the ideals embodied in the protocol by two key groups of stakeholders. First, real scientists – Cameron and Peter are two of the most important working scientists in the open science movement. Getting real scientists into the fold, endorsing the importance of the public domain, is essential. They’re also working in the UK, which has some copyright issues around data that can complicate things in a way we forget about here in the post-colonial Americas.

Second, it’s newsworthy because Rufus and I both signed it. Rufus helped to start the Open Knowledge Foundation, and he’s an important scholar of the public domain. We’re in many ways in the same fraternity – we care about “open” deeply, and we want the commons to scale and grow, because we believe in its role in innovation and creation…indeed, in its role in humanity.

But we’re on different sides of a passionate debate about data and licenses. I’m not going to recapitulate it here; you can find it in the googles if you want. Suffice to say we have argued about the role of the public domain as a first principle for data in general, as opposed to the specifics of data in publicly funded science. But for both of us to sign onto something like this means that even in the midst of heated argument we can find common ground – public money should mean public science, no licenses, no controls on innovation and reuse, globally.

It’s important for the science part. It’s also a good lesson, I hope, that even those of us who find ourselves on opposite sides of arguments inside the open world are usually fighting for the same overall goals. I’ll keep arguing for my points, and Rufus will keep arguing for his, but that should never keep us from remembering the truly common goals we share inside the movement. I’m proud to be a part of it.


Tech4Society, Day 3

I’ll start my final post on the Tech4Society conference by giving thanks to the Ashoka folks for bringing me here to be a part of it. Most of the time, even in the developing world, I’m surrounded by digital natives, or people who have immigrated to the digital nation. It’s an enveloping culture, one that can skew your perception of the world to one where everyone worries about things like copyrights and licenses, and whether or not data should be licensed or in the public domain.

There’s a big world of entrepreneurs out there just hacking in the real world. First life, if you will. High touch, not high tech.

Being enveloped in their world for a few days gave me a lot of new perspectives on the open access and open educational resources movements. As always with this blog, my intention to write may exceed my delivery of text, but I’m going to try to chew through the perspectives. Getting off the road in a few weeks is going to help.

But I now get at a deep level the way that obsessive cultures of information control in the scholarly and educational literature represent a high tax, inbound and outbound, on the entrepreneur, whether social or regular. If you don’t know the canon, you’re doomed to repeat it. And we don’t have the time, the money, or the carbon to repeat experiments we know won’t work. We can’t afford to let good ideas go un-amplified, because we need tens of thousands of good ideas.

At my panel today on scale, we focused mainly on why scale is hard, the problems of scale. The CC experience – going from 2 people in a basement at Stanford to 50 countries in 6 years – is an example of what I called “catastrophic success”. It’s a nice way to think of what I also like to call the Jaws moment, after the scene in the 1975 film where, having hoped to find a shark to catch, they find one muuuuuch bigger than they expected. The relevant quote is “we’re gonna need a bigger boat” – and that is what happens sometimes at internet scale. Entrepreneurs, especially social entrepreneurs, need to know why they want to scale, what scale means to them, and how to measure success. Because if cash isn’t the only metric, the metrics you choose will wind up defining your success at scale.

There was a great question about scaling passion. I am going to try and address that in another post. I’m not quite in a mental state to get that post out yet, though.

It wasn’t just the social entrepreneurs who gave me new perspectives; the CC community did as well. Gautam John challenged me, eloquently and at length, about the way that Creative Commons engages with its community. I went into the argument convinced of my position, and left much less so. That’s as good as arguments get for me.

The Ashoka and Lemelson foundations are doing great work supporting inventors around the world (though I would have liked to see some Eastern bloc inventors – there was a curious lack of Slavic accents, and I wonder why). It was an honor to crash their party.

Tech4Society, Day 2

Getting ready to head up to Tech4Society’s final day. I’m on a panel called the tipping point, about how to scale social entrepreneurial success beyond a local region or state. My instinct is to say “pack your suitcase and start traveling” but that’s not very helpful. Even if it’s how I have been approaching the problem.

Yesterday I wasn’t on a panel. It was a good moment to do some listening. I sat in on a few panels, but was most moved by the trends-in-Africa session. In other trends panels, the trends were things like “open source” – positive trends. In Africa it was all about how difficult the governance problems are, and how an innovator or social entrepreneur is looked on with skepticism at best, and outright hostility at worst, by both local society and the government.

It was still amazing to hear the breadth of ingenuity at work. I heard about training rats to sniff out landmines, clay refrigerators that allow girls to go to school rather than hawking the harvest before it spoils…and in the same breath, about how it takes five hours to get one hour of work done, because of the difficulty of keeping a steady power supply.

At lunch I crashed the Indonesian table, where I was asked if I was part of the youth venture group. Nicest age-related compliment I’ve gotten in a while (the youth venture folks are like 16 years old). But it does strip away any pretense of gravitas I thought I might have had.

I also got to spend some quality time with Richard Jefferson of CAMBIA. Richard is a seasoned social entrepreneur who has been hacking away at the patent problem in “open” biotech for about 20 years now. I always learn a lot from him.

At the end of the day the heat and the jetlag caught me, and I fell asleep before the dinner, which is a bummer.

I’m looking forward to having some time off the road in a few weeks to try and integrate this experience with the other travel over the past four months. There’s a long way between the World Economic Forum at Davos and this. The entrepreneurs here are doing what they do against such long odds that it can make the whole “cult of the successful entrepreneur” in the US look kind of lame.

It doesn’t take a hero to make a social networking site, it just takes some Ruby code. We have layer upon layer upon layer of infrastructure that makes it easy to innovate in the US. We have stable power grids, for the most part, and communications lines. You can buy a computer for under $500, slap Linux on it, and you’re ready to start a software company. You don’t have to pay a registration fee that takes six months, or worry that the government is going to crack down on you (despite what some crackpots may think) if you protest or run a business that disagrees with the ruling elites.

That level of social, political, and technical infrastructure lifts up all of us who benefit from it. It’s invisible to most of us most of the time, and it’s a good thing to be reminded that it’s not something to be taken for granted.

Off to day 3.

On the Nature of Ideas

I did an interview recently where the author, clearly having done some homework, called out an old quote of mine arguing that ideas aren’t like widgets or screws, that they’re not industrial objects.

I’d said that a long time ago, inspired by John Perry Barlow’s Declaration of Independence of Cyberspace. Here’s the money quote: “Your increasingly obsolete information industries would perpetuate themselves by proposing laws, in America and elsewhere, that claim to own speech itself throughout the world. These laws would declare ideas to be another industrial product, no more noble than pig iron. In our world, whatever the human mind may create can be reproduced and distributed infinitely at no cost. The global conveyance of thought no longer requires your factories to accomplish.”

The world that John Perry was talking about has not completely come to pass. Governments have certainly moved to impose more and greater controls. But as Lessig noted just a few years later in Code and Other Laws of Cyberspace, the aspects of cyberspace that promised liberation, a nation of the Mind…those aspects were the output of human-controlled systems, and humans could and would change the rules if they didn’t like the outcomes.

I was there for parts of these conversations. Gave JPB a ride around town, harassing him about the declaration and about Cassidy. I put together Lessig’s book party for Code when it came out. But the thing about ideas stuck with me more than the rest.

I’d studied epistemology, the theory of knowledge. You get a lot of examples of attempts to codify ideas (and brains, the storage tanks of ideas) into the machinery of the time (see the masterful book “Memory Practices in the Sciences” for more). But in the end, ideas resist complete capture.

They’re ethereal. We’ve spent thousands of years trying to codify them, into Plato’s forms, into machines, now into code. The dominant industrial paradigm tends to be the stuff we use to try and understand them and their human substrates – pumps and machines to explain the brain in the industrial age, circuits and pathways in the digital. This ethereal nature makes it hard to get ideas into the powerful information systems of the day, which are based on bits and bytes. It’s one of the reasons that the most powerful idea transmission systems we have are humanist – text, sound, video. It’s why something as lousy as PowerPoint can take over, because it’s a way for people to talk to people.

It’s hard to make ideas into widgets or screws because of this. It’s also hard because we all see the world differently, even those of us who agree. We use common words as proxies to help convey that this red ball is an apple and this green ball is also an apple. Making the word apple into an abstracted computation tool is hard, because you have to decide what it means, and convince others to use your meaning rather than their own. Cyc’s been pushing on this for 25 years and we still don’t have the Star Trek computer recognizing our voices.

But we’re starting to have to try to make ideas at least representable as widgets. The problem is that the information space is overwhelming us as people. Using robots in the lab, sensor networks in the ocean, miniature microphones in public spaces, genotype smears on red light signals, we can generate data at such a level that we simply cannot use our own brains to process it into an information state that lets us extract, test, and generate ideas.

There are two things we can do, one easy and one hard. First, we can make the existing technologies for idea transmission (writing ideas down and publishing them) more democratic and network-friendly. That starts with good formats: putting ideas into PDFs is a terrible idea. The format blocks the ability to take the text out, remix it, translate it, reformat it, text-mine it, have it read to a blind person via text-to-speech, and on and on. It continues with open access (so we don’t create a digital divide in the first place, and so we enable the entrepreneurs of the world, wherever they are).
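
As a small illustration of the difference a good format makes, here’s a sketch (standard-library Python only; the file name is hypothetical) of how easily text comes back out of a structured document once it isn’t trapped in a layout-oriented format like PDF:

```python
# A minimal sketch of what an open, structured format buys you: pulling
# the text out of an HTML article takes a few lines of standard-library
# Python, after which it can be remixed, translated, mined, or fed to
# text-to-speech. The file name here is hypothetical.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, ignoring markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

with open("article.html", encoding="utf-8") as f:
    extractor = TextExtractor()
    extractor.feed(f.read())

plain_text = "\n".join(extractor.chunks)
print(plain_text[:500])  # plain text, ready for reuse
```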

I’m at a conference in Hyderabad called Tech4Society that is packed to the gills with inventors and social entrepreneurs, who for the most part have no access to the scientific and technical literature. It’s all in English – which many, but not all, speak here. It’s very expensive – nuclear physics journals can cost more – per year – than a new car. And this is a tax on the entrepreneurs of the world.

Inventors have to invent. It’s in their blood. And they have the capacity to rapidly combine information from multiple sources to assemble new projects. I heard today of systems that leverage sugar palms in Indonesia to power villages, of local decentralized power panels for wind and solar to give each house its own power, and more and more and more. But this is being done without the newest knowledge, knowledge that is on the web somewhere…but locked up by paywalls.

We as Americans send a lot of money. We’d be a damned sight better off if we sent a lot of knowledge as well.

The Open Access movement is being driven mainly inside the developed world. US and EU librarians feel the pinch of the serials pricing crisis, and funders like the US National Institutes of Health and the Wellcome Trust take policy directions that lead toward the availability of biomedical research. And it’s wonderful that the solutions to these problems all lift the developing world along the way. It seems that the scholarly literature will, in fits and starts, and faster in some disciplines than others, find its proper place on the net, free of commercial restrictions, one of these days.

But it’s not just ideas, it’s what to do with the ideas. Richard Jefferson today made the lovely point that the patent literature is a giant database of recipes to make inventions. And that if you can find the inventions that were patented in the US, but not in India, you’ve got a lot of good stuff to work on in India. This is true. And deeply important.
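
To make that point concrete, here’s a toy sketch of the set difference Richard is describing (the patent records are invented; a real version would query actual patent-family data):

```python
# A back-of-the-envelope sketch of the idea: treat the patent literature
# as a database of recipes, then look for inventions patented in one
# jurisdiction but not another. The records below are invented for
# illustration; a real analysis would use actual patent-family data.

patents = [
    {"invention": "drip irrigation valve", "jurisdictions": {"US", "EP"}},
    {"invention": "low-cost water filter", "jurisdictions": {"US", "IN"}},
    {"invention": "solar crop dryer", "jurisdictions": {"US"}},
]

def unencumbered_in(records, jurisdiction):
    """Inventions with no patent filed in the given jurisdiction."""
    return [r["invention"] for r in records
            if jurisdiction not in r["jurisdictions"]]

# Recipes that are published and enforceable elsewhere, but free to
# practice in India as far as patents are concerned.
print(unencumbered_in(patents, "IN"))
# ['drip irrigation valve', 'solar crop dryer']
```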

But I got a little melancholy thinking of the stuff that comes before an inventor becomes a social entrepreneur, ready to apply for funding or speak in front of 200 people at a conference. Maybe they can’t read the patents and understand the information. Maybe they just need to build some furniture for their house, or fix the stove. I had a sense-memory of the long shelves of books in Home Depot – the how-to guides, the recipes for doing simple stuff, unpatented stuff, but essential stuff – and I look at the amazing user-driven innovative spirit that rules the day in India, and I want to cry at the amount of knowledge these inventors are deprived of. Give these folks the books and get out of the way!

I wish we could come together as a culture and create an open source set of how-to books to parallel the scholarly literature. Those books are how I learned to rewire sockets and to fix plumbing. Where I learned what was dangerous and what was safe. They’re a place where those ideas, laid out in the papers that are becoming free, became methods that I could use. Where the ideas became actionable for me. Imagine if those books could move from my server, where I wrote them, to a server in Africa where someone translates them into Kiswahili or Chichewa. If they could be formatted to be read on the mobile phones ubiquitous across the world. If they could lead to one more hour of light per night through the creation of lightweight photovoltaics.

Has anyone out there done this yet? Anyone interested in doing it? Anyone immediately get a rash and freak out? All of those reactions are interesting to me.

The second part of why ideas are hard will have to wait for the next post. Suffice to say the word “semantic” will feature prominently.

I’ll post more from day 2 tomorrow. Jetlag over and out.
