Monthly Archives: July 2010
In Brief: Applications Are Currently Being Accepted for the 2011 Access to Learning Awards
Report on the 2009 Joint CENDI/NKOS Workshop – Knowledge Organization Systems: Managing to the Future
Conference Report by Marcia Lei Zeng, Kent State University
No-Fault Peer Review Charges: The Price of Selectivity Need Not Be Access Denied or Delayed
Opinion by Stevan Harnad, Université du Québec à Montréal and University of Southampton
Semantically Enhancing Collections of Library and Non-Library Content
Article by James E. Powell, Linn Marks Collins and Mark L. B. Martinez, Los Alamos National Laboratory
The Benefits of Integrating an Information Literacy Skills Game into Academic Coursework: A Preliminary Evaluation
Article by Karen Markey, Fritz Swanson, Chris Leeder, Brian J. Jennings, Beth St. Jean, Victor Rosenberg, Soo Young Rieh, Robert L. Frost, Loyd Mbabu, and Andrew Calvetti, University of Michigan; Gregory R. Peters, Jr., Cyber Data Solutions LLC, and Geoffrey V. Carter and Averill Packard, Saginaw Valley State University
Document Management in the Open University of Catalunya (UOC) Classrooms
Article by Albert Cervera, Universitat Oberta de Catalunya
nature.com OpenSearch: A Case Study in OpenSearch and SRU Integration
Article by Tony Hammond, Nature Publishing Group
Three Topics
Editorial by Laurence Lannom, CNRI
Support for open access policy in Denmark
The International Association of Scientific, Technical and Medical Publishers (STM), the body that primarily represents the highly profitable sector of scholarly publishing, has issued the “STM submission on ‘Recommendations for implementation of Open Access in Denmark’”. Following are some comments on this submission.
Highlights
In brief: STM makes its case in terms of the EU benefiting from a positive balance of trade in scholarly publishing. Denmark may wish to consider whether it is in fact on the supplying side of that balance, and whether this situation risks losing access to the results of Danish-funded research; this is the situation for the vast majority of countries in the world. STM greatly overstates its contribution to scholarly publishing, for example by claiming to “underwrite the creation” of scientific information. This is patent nonsense: for a publicly funded research article, it is the public funder, the university, and the scholarly researcher/author who create the information. STM claims to have taken on the role of preservation; it would be bizarre indeed for the private sector to take on this role. STM bases its economic case on the high cost of moving to the online environment in the 1990s. Any industry that has not noticed the tremendous decrease in the cost of information technology over the last two decades is rather obviously out of touch. Speaking of out of touch, STM claims to have been involved with scholarly publishing for 350 years. In reality, almost all scholarly publishing was in the hands of the not-for-profit sector until after the Second World War, and today the open access publishing community, not mentioned at all in this letter, is sizable and growing, with over 5,000 fully open access, peer-reviewed scholarly journals listed in DOAJ, which is adding more than 2 titles per day.
Details
STM:
STM makes a “Euro 3 billion contribution to the EU’s balance of trade”.
Comments:
While I cannot comment on the overall balance of trade, I would suggest that Denmark consider its own specific situation. The number of very large, highly profitable STM publishers in the world is very small, and I do not believe any of them are centered in Denmark. This makes it highly likely that Denmark, like the vast majority of countries in the world, is on the supplying side of this balance of trade; as in Canada, the cash flow very likely runs outward from Denmark. Another similarity is that Denmark, like Canada, in giving away the results of Danish-funded research to highly profitable publishers in other countries, risks losing access to the results of the research it has funded, should it at some point be unable to afford these publishers' pricey packages.
STM:
STM publishers recognize and applaud the efforts of private sector organizations and government institutions to supply funds to support scholarly research activity. These funding activities are an essential component of today's well-functioning and interdependent system of scholarly scientific communication that relies on each of its major stakeholders (e.g. authors, researchers, primary and secondary publishers, libraries, universities, federal government) to perform a key role in the development and dissemination of peer-reviewed papers. The essential role that publishers play in this system is to underwrite the creation, registration, certification, formalization, improvement, dissemination, preservation and use of scientific information.
Comments:
It is misleading to include the efforts of private sector organizations in this response; a government mandate for government-funded research will obviously not cover privately funded research. STM also greatly exaggerates the role of the publisher. STM does not “underwrite the creation” of scientific information, for example: with publicly funded research, it is the public funder, the university, and the scholarly author(s) who create the information, and it is volunteer peer reviewers and often semi-volunteer editors who are responsible for peer review (registration and certification).
Preservation: it makes no sense to ask the private sector to take on the role of long-term preservation. Any private organization has a right to change its focus of operation, or cease to exist, at any time. The role of preservation should be undertaken by the public sector; this has long been the role of libraries, and should continue to be so.
Consider how ludicrous it would be to mandate that private publishers undertake the role of preservation – a policy that would say something like, “if you undertake to publish scholarly information, it shall be thy responsibility to maintain it, for all eternity!” Now THAT would be an unfunded and unsustainable mandate, wouldn’t it?
STM:
“STM publishers support the maximum sustainable dissemination of the published record of science and for 350 years we have helped create, disseminate and (now) preserve the world’s body of knowledge”.
Comment:
The commercial sector in scholarly publishing emerged after the Second World War. Until then, scholarly publishing was conducted almost exclusively by not-for-profit publishers, and the not-for-profit sector continues to produce a very large portion of scholarly publishing. Far from supporting the maximum sustainable dissemination of the published record of science, STM has taken advantage of an inelastic market to raise prices consistently above inflation, creating the serials crisis.
STM:
Today over 2,000 scientific and scholarly publishers worldwide (including large and small commercial, university presses and learned societies) manage and fund the processing of some 2-3 million manuscripts submitted from researchers and finally produce annually in excess of 1.5 million peer-reviewed published journal articles in some 25,000 journals.
Comments:
STM neglects to mention that today a large and growing share of these publishers are either fully open access publishers (the vetted Directory of Open Access Journals now lists more than 5,000 fully open access, peer-reviewed scholarly journals, and is adding titles at a rate of more than 2 per day) or publishers who voluntarily provide open access after a temporary embargo period (often 6 months to a year). A majority of publishers allow author self-archiving, as listed on the SHERPA RoMEO Publisher Copyright Policies and Self-Archiving site. In other words, the number of publishers who would have to change current policies and practices to conform to a public open access policy mandate is small, and shrinking.
On the funding: it would be more accurate to say that the highly profitable publishers are FUNDED BY the system than that they fund it. It is understandable that their view is different; however, scholarly publishing differs from a typical business in that scholars themselves are by far the largest group of readers as well as the authors. It is more accurate to say that it is libraries, acting on behalf of their scholarly readers and authors, who fund the system, indirectly, through payments to publishers.
STM:
Since the early 1990s, STM publishers have invested heavily in the migration from print based products into electronic, digital versions, with the result that 96% of scientific, technical and medical journals and 87% of journals in arts, humanities and social sciences are available electronically, fully searchable, and accessible on the world wide web.
Comment:
In the early 1990s, it was indeed expensive to move into the online environment. However, technology has evolved, as has publishing software, and the costs of computing have decreased dramatically. Nowadays, using freely available open source software such as Open Journal Systems, the technology requirements of scholarly publishing are a great deal lower than they were in the 1990s. It makes no sense to base present and future economics on a single, time-limited expense.
For more background, please see my book, Scholarly Communication for Librarians. Links to two open access chapters can be found on the main page of my blog. Requests for clarification about any of the details of this post are most welcome.
rdf:about="Shakespeare"
Dorothea has written a typically good post challenging the role of RDF in the linked data web, and in particular, its necessity as a common data format.
I was struck by how many of her analyses were spot on, though my conclusions are different from hers. But she nails it when she says:
First, HTML was hardly the only part of the web stack necessary to its explosion. TCP/IP, anyone?
I’m on about this all the time. The idea that we are in web-1995-land for data astounds me. I’d be happy if I were to be proven wrong – trust me, thrilled – but I don’t see the core base of infrastructure for a data web to explode. I see an exploding capability to generate data, and of computational capacity to process data. I don’t see the technical standards in place that enable the concurrent explosion of distributed, decentralized data networks and distributed innovation on data by users.
The Web sits on a massive stack of technical standards that pre-dated it, but that were perfectly suited to a massive pile of hypertext. The way the domain name system gave human-readable domains to dotted quads lent itself easily to nested trees of documents linked to each other, and didn't need any more machine-readable context than some instructions to the computer about how to display the text. It's vital to remember that we as humans were already socially wired to use documents, and that this was deeply enabling to the explosion of the Web: all we had to do was standardize what those documents looked like and where they were located.
On top of that, at exactly the moment the information on the web started to scale, a key piece of software emerged, the web browser, which made the web, and in many ways the computer itself, easier to use. The graphical web browser wasn't an obvious invention. We don't have anything like it for data.
My instinct is that it’s going to be at least ten years’ worth of technical development, especially around drudgery like provenance, naming, versioning of data, but also including things like storage and federated query processing, before the data web is ready to explode. I just don’t see those problems being quick problems, because they aren’t actually technical problems. They’re social problems that have to be addressed in technology. And them’s the worst.
We simply aren’t yet wired socially for massive data. We’ve had documents for hundreds of years. We have only had truly monstrous-scale data for a couple of decades.
Take climate. Climate science data used to be traded on 9-track tapes – as recently as the 1980s. Each 9-track tape maxes out at 140MB. For comparison’s sake, I am shopping for a 2TB backup drive at home. 2TB in 9-tracks is a stack of tapes taller than the Washington Monument. We made that jump in less than 30 years, which is less than a full career-generation for a working scientist. The move to petabyte scale computing is having to be wedged into a system of scientific training, reward, incentives, and daily practice for which it is not well suited. No standard fixes that.
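For the curious, here is a quick back-of-the-envelope check of that tape-stack comparison in Python. The tape capacity and drive size come from the paragraph above; the roughly one-inch reel thickness is my own assumption, and 555 feet is the height of the Washington Monument.

```python
# Back-of-the-envelope check: 2 TB expressed as a stack of 9-track tapes.
# Reel thickness (~1 inch) is an assumption; capacities are from the post.
TAPE_CAPACITY_MB = 140        # maximum capacity of one 9-track tape
DRIVE_CAPACITY_TB = 2         # the 2 TB backup drive
REEL_THICKNESS_IN = 1.0       # assumed thickness of one tape reel, in inches
MONUMENT_HEIGHT_FT = 555      # height of the Washington Monument

tapes_needed = DRIVE_CAPACITY_TB * 1_000_000 / TAPE_CAPACITY_MB
stack_height_ft = tapes_needed * REEL_THICKNESS_IN / 12

print(f"~{tapes_needed:,.0f} tapes, a stack ~{stack_height_ft:,.0f} ft tall "
      f"(vs. {MONUMENT_HEIGHT_FT} ft for the monument)")
# Roughly 14,300 tapes and a stack around 1,200 feet: about twice the monument.
```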
Documents were easy. We have a hundreds-of-years-old system of citing others' work that makes it easy, or at least easier, to give credit and reward achievement. We have a culture for how to name documents, and an industry based on making them "trusted" and organized by discipline. You can and should argue about whether or not these systems need to change on the web, but I don't think you can deny that the document culture is a lot more robust than the data culture.
I think we need to mandate data literacy the way we mandate language literacy, but I'm not holding my breath that it's going to happen. Till then, the web will get better and better for scientists, the way the internet makes logistics easier for Wal-Mart. We'll get simple mashups, especially of data that can be connected to a map. But the really complicated stuff, like oceanic carbon, won't be usable for a long time by anyone not trained in the black arts of data curation, interpretation, and model building.
Dorothea raises another point I want to address:
“not all data are assertions” seems to escape some of the die-hardiest RDF devotees. I keep telling them to express Hamlet in RDF and then we can talk.
This “express Hamlet in RDF” argument is a MacGuffin, in my opinion – it will be forgotten by the third act of the data web. But damn if it's not a popular argument to make. Clay Shirky did it best.
But it's irrelevant. We don't need to express Hamlet in RDF to make expressing data in RDF useful; it's like getting mad at a car because it's not an apple. There are absolute boatloads of data out there that absolutely need to be expressed in a common format. Doing climate science or biology means hundreds of databases, filling at rates unimaginable even a few years ago. I'm talking terabytes a day, soon to be petabytes a day. That's what RDF is for.
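To make the "data, not literature" point concrete, here is a minimal sketch of the kind of record RDF handles well, written with Python's rdflib. The climate-observation vocabulary and the example.org namespace are invented for illustration; they are not drawn from any real dataset.

```python
# A minimal sketch: one climate observation expressed as RDF triples.
# The namespace and property names are invented for illustration only.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/climate/")   # hypothetical vocabulary

g = Graph()
g.bind("ex", EX)

obs = EX["obs/station42/2010-07-01"]
g.add((obs, RDF.type, EX.SnowfallObservation))
g.add((obs, EX.station, EX["station/42"]))
g.add((obs, EX.observedOn, Literal("2010-07-01", datatype=XSD.date)))
g.add((obs, EX.snowfallCm, Literal(0.0, datatype=XSD.decimal)))

# Serialize to Turtle (a str in recent rdflib versions). Any RDF-aware tool
# can now merge this graph with graphs from other databases that use, or
# map to, the same terms, with no bespoke glue code per pair of databases.
print(g.serialize(format="turtle"))
```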
It's not for great literature. I'll keep to the document format for The Bard, and so will everyone else. But he does have something to remind us about the only route to the data web:
Tomorrow and tomorrow and tomorrow,
Creeps in this petty pace from day to day
It's going to be a long race, but it will be won by patience and day-by-day advances. It must be won that way, because otherwise we won't get the scale we need. Mangy approaches that work for Google Maps mashups won't cut it. RDF might not be able to capture love, or literature, and it may be a total pain in the butt, but it does really well on problems like "how do I make these 49 data sources mix together so I can run a prediction of when we should start building desalination plants along the Pacific Northwest seacoast due to lower snowfall in the Cascade Mountains".
That's the kind of problem that has to be modelable, and the model has to run against every piece of data possible. It's an important question to understand as completely as possible. The inconvenience RDF imposes is a small price to pay for the data interoperability it brings in this context, to this class of problem.
As more and more infrastructure does emerge to solve this class of problem, we’ll get the benefits of rapid incremental advances on making that infrastructure usable to the Google Maps hacker. We’ll get whatever key piece, or pieces, of data software that we need to make massive scale data more useful. We’ll solve some of those social problems with some technology. We’ll get a stack that embeds a lot of that stuff down into something the average user never has to see.
RDF will be one of the key standards in the data stack, one piece of the puzzle. It’s basically a technology that duct-tapes databases to one another and allows for federated queries to be run. Obviously there needs to be more in the stack. SPARQL is another key piece. We need to get the names right. But we’ll get there, tomorrow and tomorrow and tomorrow…
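To give a flavour of that duct-taping, here is a sketch of a federated SPARQL 1.1 query dispatched from Python with rdflib, whose query engine can resolve SERVICE blocks over HTTP. The two endpoint URLs and the ex: vocabulary are hypothetical stand-ins for real snowpack and hydrology databases, not actual services.

```python
# Sketch of a federated SPARQL 1.1 query joining two hypothetical endpoints.
# The endpoint URLs and the ex: vocabulary are placeholders, not real services.
from rdflib import Graph

query = """
PREFIX ex: <http://example.org/climate/>
SELECT ?station ?year ?snowfallCm ?streamflow
WHERE {
  SERVICE <http://example.org/snowpack/sparql> {
    ?station ex:inRegion ex:CascadeMountains ;
             ex:year ?year ;
             ex:annualSnowfallCm ?snowfallCm .
  }
  SERVICE <http://example.org/hydrology/sparql> {
    ?station ex:year ?year ;
             ex:streamflow ?streamflow .
  }
  FILTER (?year >= 1980)
}
"""

# An empty local graph is enough to dispatch the query; the engine fetches
# the SERVICE blocks from the remote endpoints at query time and joins the
# results on ?station and ?year.
for row in Graph().query(query):
    print(row.station, row.year, row.snowfallCm, row.streamflow)
```

The syntax matters less than the fact that the join across two independently maintained sources happens in a standard query language, with no per-pair export and reconciliation step.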
Canada’s Digital Economy Consultation: deadline extended to July 13th
The deadline for Canada's Digital Economy Consultation has been extended to July 13. Please take a few moments to register and vote for an open access policy for Canadian-funded research (under Canada's Digital Content in the Ideas Forum), and for many other fine Ideas such as open data, net neutrality, and support for the Community Access Program.
The updated July 9 version of the open access submission (with added names) was submitted today.