Monthly Archives: September 2011
In Brief: The POCOS Project: Addressing the Challenges of Preserving Complex Visual Digital Objects
In Brief: New Guidelines: CrossRef DOIs to be Displayed as URLs
A New Way to Find: Testing the Use of Clustering Topics in Digital Libraries
Article by Kat Hagedorn and Michael Kargela, University of Michigan; Youn Noh, Yale University; and David Newman, University of California-Irvine
Digitization Practices for Translations: Lessons Learned from the Our Americas Archive Partnership Project
Article by Lorena Gauthereau-Bryson, Robert Estep, and Monica Rivero, Rice University
Automating the Production of Map Interfaces for Digital Collections Using Google APIs
Article by Anna Neatrour, Anne Morrow, Ken Rockwell, and Alan Witkowski, The University of Utah
MapRank: Geographical Search for Cartographic Materials in Libraries
Article by Markus Oehrli, Zentralbibliothek Zürich; Petr Přidal, Klokan Technologies and Moravian Library; Susanne Zollinger, ETH-Bibliothek Map Library; and Rosi Siber, EAWAG
Long-term Preservation for Spatial Data Infrastructures: a Metadata Framework and Geo-portal Implementation
Article by Arif Shaon, Science and Technology Facilities Council, Rutherford Appleton Laboratory and Andrew Woolf, The Bureau of Meteorology, Canberra, Australia
Useful & Interesting
Editorial by Laurence Lannom, CNRI
Open APIs: My attempts to be Openly Open
Having argued that we need to define Open APIs better, I’ll share my experiences as a contributor to Open Knowledge and the challenges that this poses. The first message:
Openness takes effort
Creating copyrighted works costs no effort: every keystroke, every digital photo is automatically protected by copyright. To make something Open, by contrast, you have to explicitly add permission to use it. This is really tedious, and particularly difficult when the tools used have no natural support for adding permissions. So
Most of the time I fail to add explicit permissions
That’s a fact. I am not proud of it. But, to give an example, I have just discovered that almost all my material in Cambridge DSpace does not provide Open permissions. That was never my intention. But the tools don’t allow me to click a single button and change it. I have to add rights to every single document (I have ca. 90). Meanwhile the automatic system continues to pronounce “Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.” This may be true, but it isn’t the sign of a community trying to make material Open.
I have spent the last 20 minutes trying to find out how to add permissions to my material. It’s impossible. (DSpace is one of the worst systems I have encountered.) So, unless you get everything right in DSpace at the time of submission, you cannot edit the permissions. And, of course, by default the permissions are NO, NO, NO. I should add that some of the material has been uploaded by my colleagues. So the next rule:
Unless EVERYONE involved in the system understands and operates Openness it won’t get added.
So all my intended Openness has been wasted. No-one can re-use my material because we cannot find out how to do it.
Now take this blog. I started without a licence. Then I added an NC one. Or rather I asked the blogmeister to add it (I was not in control of the Blog). And that’s the point. Very often you have to ask someone else to actually add the Openness. Then I was convinced of the error of NC and changed to CC-BY.
However in later 2010 the sysadmins changed the way the blog was served. This caused many problems, but among others it destroyed the CC-BY notice and licence. So:
Sometimes the systems will destroy the Openness.
So I would have to put in a ticket to have the CC-BY licence restored. All these little things add up to a lot of hassle. Just to stay where we are.
So to summarise so far. (Remember I WANT everything to be OKD-Open unless indicated otherwise.)
- Blog. Not formally Open. Needs a licence added to the site and to each post?
- DSpace. Almost nothing formally Open. Unable to change that personally. Would have to spend time with the repositarians.
- Posts to lists. I *think* the OKF lists have a blanket Openness. But I’m not sure.
- Tweets. No. And I wouldn’t know how to make them Open.
- Papers. When we publish in BMC or PLoS they are required to be Open and obviously are. Pubs in IUCr are “Green” because we publish on CIF and this is free.
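Much of this per-item hassle could in principle be scripted. As a purely hypothetical sketch (the file layout, notice text and function name are mine, not how this blog or DSpace actually work), a script could append a CC-BY notice to any local post file that lacks one:

```python
from pathlib import Path

# Hypothetical notice text; the real wording would link the exact CC-BY licence version.
NOTICE = "\n\nLicence: CC-BY (http://creativecommons.org/licenses/by/3.0/)\n"

def add_licence_notice(post_dir):
    """Append the notice to every .txt post that does not already carry one.

    Returns the names of the files that were changed."""
    changed = []
    for post in sorted(Path(post_dir).glob("*.txt")):
        text = post.read_text(encoding="utf-8")
        if "CC-BY" not in text:               # idempotent: skip already-labelled posts
            post.write_text(text + NOTICE, encoding="utf-8")
            changed.append(post.name)
    return changed
```

The point of the sketch is the checklist above: Openness only happens when something (a person or a script) explicitly adds the permission to every item.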
Now some better news. All our software is in Open Source Repositories.
- Software/Sourceforge. Required to be Open. Licence in (some of) the source code. Probably LICENSE.txt indicating Artistic Licence.
- Bitbucket. Ditto.
Open Source software Openness is fairly trivial to assert.
Services.
- OPSIN. (http://opsin.ch.cam.ac.uk/ ). This is a free service (intended to be OKD-Open, but not labelled) which converts chemical names to structures. Software is Open Source (Artistic). Data comes from user. Output should be labelled as Open (but isn’t). Would require Daniel to add licence info into website. The primary software (OPSIN) is Open Source. However I expect the webserver is the typical local configuration of Java/python/Freemarker etc. and doesn’t easily transport. So what do we have to do about glueware? If it doesn’t port, is Openness relevant?
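As a user, at least, the OPSIN service is gratis: at the time of writing it could be called over plain HTTP. A minimal sketch follows — the URL pattern (name plus a format suffix such as .smi for SMILES) is my reading of the public site, not a documented contract:

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE = "http://opsin.ch.cam.ac.uk/opsin"

def opsin_url(name, fmt="smi"):
    """Build a request URL for the OPSIN name-to-structure service.

    fmt="smi" asks for SMILES output (assumed URL pattern)."""
    return "%s/%s.%s" % (BASE, quote(name), fmt)

def name_to_smiles(name):
    """Fetch the SMILES string for a chemical name (requires network access)."""
    with urlopen(opsin_url(name)) as resp:
        return resp.read().decode("utf-8").strip()
```

For example, opsin_url("acetone") gives http://opsin.ch.cam.ac.uk/opsin/acetone.smi. Note that nothing in such a call tells the caller whether the *output* is Open — which is exactly the labelling gap described above.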
Services are naturally non-Open by default.
Glueware is a problem.
Data.
- Crystaleye. 250,000 crystal structures (http://wwmm.ch.cam.ac.uk/crystaleye ). We worked hard on this. We have added “All data on this site is licensed under PDDL and all Open Data buttons point implicitly to the PDDL licence.” on the top page, “All of the data generated is ‘Open’, and is signified by the appearance of the Open Data icon on all pages.” at the bottom, and the Open Data button on each page. And yet, according to CKAN, it’s still not Open because it cannot be downloaded in bulk. (Actually it can, and Jim Downing wrote a downloader. This was non-trivial effort, and I’ll come on to this later.) So this is our best attempt (other than software) at Openness.
Even when the data items are OKD-Open, the site is not necessarily so.
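Bulk download is exactly the sort of “non-trivial effort” mentioned above. This is not Jim Downing’s downloader — just a generic sketch of a polite crawl loop, with the fetch function injected so the throttling and error handling can be tested offline:

```python
import time

def bulk_download(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests so the server
    is not hammered.

    `fetch` is any callable url -> content (e.g. a urllib wrapper);
    failures are recorded rather than aborting the whole crawl."""
    results, errors = {}, {}
    for i, url in enumerate(urls):
        if i:                      # no pause before the first request
            sleep(delay)
        try:
            results[url] = fetch(url)
        except Exception as exc:   # one bad page should not stop a 250,000-item run
            errors[url] = str(exc)
    return results, errors
```

The design point is that politeness (the delay) and robustness (carrying on past failures) are what make bulk access to a quarter of a million pages feasible at all.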
So here we are, trying to be Open, finding it a big effort, and failing on several counts. I bet that is generally true for others, especially if they didn’t plan it from the start. So
Openness has to be planned from the start as part of the service.
(The trouble is that much of what we have done wasn’t formally planned! Much is successful experimentation over several years).
Open APIs: fundamentals and the cases of KEGG and Wikipedia
It is now urgent and vital that we define what an “Open API” is. The phrase is widely used, usually without any indication of what it offers and what restrictions – if any – it imposes. This post is a first pass – I don’t expect to get everything “right” and I hope we get comments that evolve towards something generally workable. Among other things we shall need:
- An agreement that this matters and that we must strive for OKD-open
- Tools to help us manage it
- A period of constructive development in trying to create fully Open APIs and a realisation of the problems and costs involved
I shall also list some additional criteria that I think are important or critical.
Firstly the word “Open” (capitalised as such) is intended to convey the letter and the spirit of the Open Definition:
“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”.
This is a necessary but not sufficient condition for an “Open API”. What is the “API” bit?
It stands for “Application Programming Interface” (http://en.wikipedia.org/wiki/Application_programming_interface ). In the current context it means a place (usually a website) where specific pieces of data can be obtained on request. It is often called a “service” and hence comes under the Open Service Definition.
“A service is open if its source code is Free/Open Source Software and non-personal data is open as in the Open Knowledge Definition (OKD).”
This is necessary, but not sufficient for what we now need. The rationale for the addition of F/OSS software is explained in
http://www.opendefinition.org/software-service/
The Open Software Service Definition defines ‘open’ in relation to online (software) services.
An online service, also known under the title of Software as a Service (SaaS), is a service provided by a software application running online and making its facilities available to users over the Internet via an interface (be that HTML presented by a web-browser such as Firefox, via a web-API or by any other means).
PMR: generally agreed. This can cover databases, repositories and other services. I shall try to illustrate this below.
With an online-service, in contrast to a traditional software application, users no longer need to ‘possess’ (own or license) the software to use it. Instead they can simply interact via a standard client (such as web-browser) and pay, where they do pay, for use of the ‘service’ rather than for ‘owning’ (or licensing) the application itself.
PMR: I don’t fully understand this. I think there has to be an option for gratis access, else how does the system qualify as Open? But we do have to consider costs.
The Definition
An open software service is one:
- Whose data is open as defined by the Open Knowledge Definition with the exception that where the data is personal in nature the data need only be made available to the user (i.e. the owner of that account).
- Whose source code is:
  - Free/Open Source Software (that is, available under a license in the OSI or FSF approved list — see note 3).
  - Made available to the users of the service.
I shall revisit “whose data” later, and particularly the need to add a phrase such as “and is made available”.
Notes
- The Open Knowledge Definition requires technological openness. Thus, for example, the data shouldn’t be restricted by technological means such as access control and should be available in an open format.
PMR: Agreed. It may also mean that you do not need to buy/licence proprietary tools to access the data. Is a PDF document Open? The software required to READ it is usually closed. An additional concern here is the use of DRM (Digital Rights Management).
- The OKD also requires that data should be accessible in some machine automatable manner (e.g. through a standardized open API or via download from a standard specified location).
PMR: This is critical. I read this as “ALL the data”.
- The OSI approved list is available at: http://www.opensource.org/licenses/ and the FSF list is at: http://www.gnu.org/philosophy/license-list.html
- For an online-service simply using an F/OSS licence is insufficient since the fact that users only interact with the service and never obtain the software renders many traditional F/OSS licences inoperative. Hence the need for the second requirement that the source code is made publicly available.
PMR: Services almost always involve a hotch-potch of code at the server side (e.g. servlets, database wrappers, etc.) This can be a problem.
- APIs: all APIs associated with the service will be assumed to be open (that is their form may be copied freely by others). This would naturally follow from the fact that the code and data underlying any of the APIs are open.
PMR: This relates to documentation, I assume
- It is important that the service’s code need only be made available to its users so as not to impose excessive obligations on providers of open software services.
PMR: I read this as “here’s the source code but we are not under any obligation to install it for you or to make it work”. I agree with this.
As examples the OSD cites Google Maps (not Open) and Wikipedia (Open):
- Code: Mediawiki is currently F/OSS (and is made available)
- Data: Content of Wikipedia is available under an ‘open’ licence.
One of the oft-quoted aspects of F/OSS is the “freedom to fork” (http://en.wikipedia.org/wiki/Fork_%28software_development%29 , and http://lwn.net/Articles/282261/ ). Forking is often a “bad idea” but is the ultimate tool in preserving Openness, because it means that if the original knowledge stops being Open (becomes closed, dies, is inoperable) then at least in theory someone can take the copy and continue its existence. I think this is fundamental for Open APIs.
The APIs must provide (implicitly or explicitly) the ability for someone to fork the software and content.
It doesn’t have to be easy and it doesn’t have to be cost-free. It just has to be *possible*.
The case of KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/ ) is a clear example of an Open service being closed (http://www.genome.jp/kegg/docs/plea.html ). In brief, the laboratory running the services used to make everything freely available (and implicitly Open) but now:
Starting on July 1, 2011 the KEGG FTP site for academic users will be transferred from GenomeNet at Kyoto University to NPO Bioinformatics Japan, and it will be available only to paid subscribers. The publicly funded portion, the medicus directory, will continue to be freely accessible at GenomeNet. The KEGG FTP site for commercial customers managed by Pathway Solutions will remain unchanged. The new FTP site is available for free trial until the end of June.
…
I would like to emphasize that the KEGG web services, including the KEGG API, will be unaffected by the new mechanism to be introduced on July 1, 2011. Our policy for the use of the KEGG web site will remain unchanged. The only change will be to FTP access. We have already introduced a “Download KGML” link for the KGML files that used to be available only by FTP, and will continue to improve the functionality of KEGG API. I would be very grateful if you could consider obtaining a KEGG FTP subscription as your contribution to the KEGG project.
I am not passing any moral judgment – you cannot pay people with promises. But the point is that an Open Service has become closed. With the “right-to-fork” it is possible to “clone” all the Open material (possibly with FTP) before the closure date and maintain an Open version. This may or may not be cost-effective, but it’s possible.
So what is the KEGG API mentioned above and is it Open? Almost certainly not. It may be useful but it is clear that neither the software nor the complete contents of the database are available.
By contrast Wikipedia remains an Open API. It’s possible to clone enough of the software that matters and all of the content. Installing the software is probably non-trivial (yes, I can run Mediawiki, but there are all sorts of other things: configuration files, quality bots, etc.). And cloning the content means dumping a snapshot at a given time. But at least, if we care enough, it is LEGALLY and technically possible.
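On the content side, Wikipedia publishes full XML dumps, and even a very large dump can be processed as a stream rather than loaded whole. A rough sketch (the element names follow the MediaWiki export format; treat the details as an assumption):

```python
import xml.etree.ElementTree as ET

def iter_pages(xml_source):
    """Stream (title, text) pairs from a MediaWiki-style XML export
    without holding the whole dump in memory."""
    title, text = None, None
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text
        elif tag == "page":
            yield title, text
            elem.clear()                    # free memory as we go
```

This is what the right-to-fork means in practice: the legal permission is only useful because a snapshot like this is also technically obtainable.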
In the next post I will examine some of our own resources and how close they are to “OKD and OSD-open”. We fall down on details but we succeed in motivation.
Why do we continue to use Citations?
I have just got the following mail from Biomed Central about an article we published earlier this year (edited to remove marketing spiel, etc.)
Dear Dr Murray-Rust,
We thought you might be interested to know how many people have read your article:
ChemicalTagger: A tool for semantic text-mining in chemistry
Lezan Hawizy, David M. Jessop, Nico Adams and Peter Murray-Rust
Journal of Cheminformatics, 3:17 (16 May 2011)
http://www.jcheminf.com/content/3/1/17
Total accesses to this article since publication: 2117
This figure includes accesses to the full text, abstract and PDF of the article on the Journal of Cheminformatics website. It does not include accesses from PubMed Central or other archive sites (see http://www.biomedcentral.com/info/libraries/archive). The total access statistics for your article are therefore likely to be significantly higher.
Your article is ‘Highly accessed’ relative to age. See http://www.biomedcentral.com/info/about/mostviewed/ for more information about the ‘Highly accessed’ designation.
These high access statistics demonstrate the high visibility that is achieved by open access publication.
I agree. It does not, of course, mean that 2117 people have read the whole article. I imagine it removes obvious bots. Of course there could be something very compelling in the words in the title. After all (http://blogs.ch.cam.ac.uk/pmr/2011/07/08/plos-one-text-mining-metrics-and-bats/ ) the word “bats” in the title of one PLOSOne paper got 200,000 accesses (or it might have been “fellatio” – I wouldn’t like to guess). So I looked up “tagger” in Urban Dictionary and its main meaning is a graffiti writer. Maybe some of those could use a “chemicaltagger”? But let’s assume it’s noise.
So “Chemicaltagger” has been heavily accessed and probably even read by some accessors. Let’s assume that 10% of accessors – ca. 200 – have read at least parts of the paper. That possibly means the paper is worth something. But not to the lords of the assessment exercise. Only the holy citation matters. So how many citations? Google Scholar (using its impenetrable, but at least free-if-not-open system) gives 3. Where from? Well, from our prepublication manuscripts in DSpace. If we regard these as self-citations (disallowed by some metricmeisters) we get a Humpty sum:
3 – 3 = 0
So the paper is worthless.
If we wait 5 years maybe we’ll get 20 citations (I don’t know). But it’s a funny world where you have to wait 5 years to find out whether something electronic is valued.
So aren’t accesses better than citations? After all don’t we use box office receipts to tell us how good films are? Or viewing figures to tell us the value of a program? [“good” and “value” having special meanings, of course]. So why this absurd reliance on citations? After all Wakefield got 80 citations for his (retracted) paper on MMR Vaccine and autism. Many were highly critical. But it ups the index!
The reason we use citations as a metric is not that they are good – they are awful – but that they are easy. Before online journals the only way we could find out if anyone had noticed a paper was in the reference list. Of course references can be there for many reasons – some positive, some neutral, some negative and many completely ritual. They weren’t devised as a way of measuring value but as a way of helping readers understand the context of the paper and giving due credit (positive and negative) to others.
But, because academia is largely incapable of developing its own system of measuring value, it now relies on others to gather figures. And pays them lots of money. Citations are big business – probably 200–1000 million USD per year. So it’s easier for academia to pay others precious funds. And all parties have a vested interest in keeping this absurd system going. Not because it’s good, but because it saves trouble. And of course the vendors of citation data will want to preserve the market.
This directly stifles academic research in textmining of typed citation data (i.e. trying to understand WHY a citation was provided). Big businesses with lawyers (e.g. Google) are allowed to mine data from academic papers. Researchers such as us are forbidden. Because bibliometrics is a massive business. And any disruptive technology (e.g. Chemicaltagger, which could also be used for citations) must be prevented by legal means. And we have to deprecate access data because that threatens the holy cow and holy income of citation sales.
The sooner we get academic texts safely minable – in bulk – the sooner we shall be able to have believable information. But I think there are many vested interests who will be preventing this. After all what does objectivity matter?
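What “typed” citation mining means can be shown with a deliberately naive sketch: cue words in the sentence around a citation hint at why it was cited. This is a toy — not how ChemicalTagger or any real system works, and the cue lists are invented for illustration:

```python
# Hypothetical cue lists; a real system would use full NLP, not keyword matching.
CUES = {
    "negative": ("however", "in contrast", "fails", "retracted", "disagree"),
    "positive": ("confirms", "extends", "builds on", "agrees with"),
    "ritual":   ("as described in", "previously reported"),
}

def classify_citation_context(sentence):
    """Guess why a work was cited from the sentence containing the citation."""
    s = sentence.lower()
    for label, cues in CUES.items():
        if any(cue in s for cue in cues):
            return label
    return "neutral"
```

Even this toy makes the point of the Wakefield example: a citation count of 80 collapses the distinction between “confirms” and “retracted”, and only mining the surrounding text can recover it.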
Implementing OA – policy cases and comparisons
Armbruster, Chris, Implementing Open Access Policy: First Case Studies. Chinese Journal of Library and Information Science, Vol. 3, No. 4, pp.1-22, 2010. [“a concise summary of many of the pioneering (e.g. QUT, Wellcome, Zurich, HHMI, FWF), comprehensive (e.g. PMC, ukPMC, INRIA/France) and international (e.g. SCOAP3) implementation efforts.”]
Armbruster, Chris, Open Access Policy Implementation: First Results Compared. Learned Publishing, Vol. 24, No. 3, 2011. [“a comparative evaluation discussing the most salient issues, such policy mandates and matching infrastructure requirements, content capture and the issue of scholarly compliance, benefits to authors, and efforts to provide access and enable usage”]
Armbruster, Chris, Implementing Open Access: Policy Case Studies (October 14, 2010). [original report]
Chris Armbruster’s policy cases, comparisons and conclusions make several useful points, some new, others already noted and published by others.
There is also a lot missing from Armbruster’s policy cases, comparisons and conclusions, partly because they do not take into account what has already been observed and published on the subject of OA policy and outcome, and partly because Armbruster fails to cover several of the key institutional repositories and their policies, including the first of them all, and among the most successful: the U Southampton School of Electronics and Computer Science green OA self-archiving mandate. That mandate was adopted in 2003, provided the model for mandatory OA policies in the BOAI Handbook, and continues to provide both OA repository guidance and (free) OA repository software and services; it is also the source of most of the OA policy variants at the institutions that Armbruster does take into account.
There are also some rather important confusions in Armbruster’s conclusions, notably about versions, embargoes, “digital infrastructure,” and the nature of green vs. gold OA.
For those who seek a clear, practical picture of the woods, rather than a rather impressionistic sketch of some of the trees, what both institutions and funders need to do is:
1. Mandate deposit of the author’s final refereed draft, immediately upon acceptance for publication, in the author’s institutional repository.
2. Designate repository deposit as the sole mechanism for submitting refereed publications for institutional performance evaluation and for national research assessment.
3. Implement the email-eprint-request button to tide over researcher needs during the embargo, for any publisher-embargoed deposits.
Once institutions and funders have done that, all the rest will take care of itself (including versions, embargoes, “digital infrastructures” and gold OA).
Beginning this autumn, guidance to institutions and funders worldwide on implementing OA policies will begin to be provided by EnablingOpenScholarship (EOS), founded by the rector of the University of Liege, another institution whose highly successful OA policy Chris Armbruster neglected to mention in his comparisons.
Stevan Harnad
Publication Fees in Open Access Publishing: Sources of Funding and Factors Influencing Choice of Journal
David J Solomon
College of Human Medicine, Michigan State University, E. Lansing, MI USA
Email dsolomon@msu.edu
Bo-Christer Björk
Management and Organization, Hanken School of Economics, Helsinki, Finland
Email bo-christer.bjork@hanken.fi
Accepted Version, 08-18-11: version as accepted for publication by the Journal of the American Society for Information Science and Technology.
(Note: This is a preprint of an article accepted for publication in the Journal of the American Society for Information Science and Technology, copyright © 2011, American Society for Information Science and Technology.)
Submitted Version, 6-30-2011: as submitted to the Journal of the American Society for Information Science and Technology.
Supporting Tables