Estimating Japan’s Annual Rate of Journal Article Self-Archiving

Note added 17 September: Many thanks to Hideki Uchijima, Librarian of the Kanazawa University Library, for providing a very comprehensive and conscientious response.
    The response provides a more accurate estimate of the percentage (11.1%) of Japan’s annual refereed research article output that is currently being self-archived in the 158 Japanese Institutional Repositories harvested by JAIRO. The estimate is based on the ISI Thomson Reuters subset, an excellent first approximation (which we and others have also used for such estimates), and it confirms that Japan’s unmandated self-archiving rate indeed falls within the global baseline of 5-25%.
    Hideki Uchijima also adds the good news that Hokkaido University (already registered in ROARMAP in 2008 as having an OA policy, but not yet an OA mandate) might soon be upgrading to a self-archiving mandate (and this might encourage further universities in Japan to do likewise).
    And last, Hideki Uchijima will now also try to persuade the IR managers of the remaining 81 Japanese universities (out of the 158 in JAIRO) that, unlike Hokkaido University and 76 other Japanese universities, have not yet registered their IRs in ROAR to do so.
    If all librarians, IR managers and OA activists worldwide were as attentive and responsive as Kanazawa University’s librarian, the world would reach its goal of 100% OA far sooner. (Many are, but far, far more need to be!)

Note added 18 September: Andrew A Adams (Meiji University) wrote:
    “During Open Access Week in October both Otaru University of Commerce and Hokkaido University will be holding meetings to promote deposit and adoption of a mandate. I have accepted invitations to speak at both events, arranged by Shigeki Sugita of the library at Otaru University of Commerce and Masako Suzuki of the library at Hokkaido University. Both are keen supporters of Green OA and a deposit mandate and are working hard to persuade managers and faculty at these two very different though physically close universities to adopt mandates (Otaru, being small and with limited funds has an access problem itself, whereas Hokkaido is one of the top ten universities in Japan…”

Congratulations to Japan’s JAIRO for harvesting the 700,000 full-texts (out of one million total) self-archived in Japan’s 158 Institutional Repositories since 2007.

To understand what this figure means, however, the fundamental question is whether or not it represents an increase over the worldwide baseline average for spontaneous (i.e. unmandated) self-archiving, which varies between 5-25% of the total annual output of the primary target content of the Open Access movement: the 2.5 million articles per year published in the planet’s 25,000 peer-reviewed journals across all disciplines and languages.

Of JAIRO’s 700K full-text total, about 110K (15.5%) consisted of journal articles, based on JAIRO’s statistical data.

From the growth chart (if I have interpreted it correctly), about 75% of 50,000 articles (i.e., 35,000 full-texts) were deposited in 2009. If we can assume that those deposits were all articles published within that same year (or the preceding one), then the question is: What percentage of Japan’s (or of those 158 institutions’) annual portion of the 2.5 million articles published yearly worldwide do these 35,000 full-texts represent? Does it exceed the worldwide unmandated baseline of 5-25%?

The reason I raise this question is that absolute figures — even absolute growth rates across years — are not meaningful in themselves. They are only meaningful when expressed as a percentage of total annual output. For a single institutional repository, this means the percentage of that institution’s annual output of refereed journal articles. For Japan’s 158 institutional repositories, it means the percentage of the total annual output of those 158 institutions.

On the conservative assumption that research-active universities publish at least 1000 refereed journal articles per year, the estimate would be that those 35K articles represent at most about 22% of those institutions’ annual refereed journal article output, which falls within the global 5-25% unmandated baseline.
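The back-of-envelope estimate above can be reproduced in a few lines of Python. Note that the 1,000 refereed articles per institution per year is the post’s conservative assumption, not a measured figure:

```python
# Back-of-envelope estimate of Japan's unmandated self-archiving rate,
# using the figures quoted in this post.
institutions = 158               # Japanese IRs harvested by JAIRO
articles_per_institution = 1000  # conservative assumption: refereed articles/year
deposits_2009 = 35_000           # estimated full-text article deposits in 2009

annual_output = institutions * articles_per_institution  # 158,000 articles/year
rate = deposits_2009 / annual_output

print(f"Estimated self-archiving rate: {rate:.1%}")  # ~22.2%
print(f"Within the unmandated 5-25% baseline? {0.05 <= rate <= 0.25}")
```

The result, about 22%, sits at the upper end of the 5-25% baseline but does not exceed it.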

I stress this point because we must not content ourselves with absolute self-archiving totals and growth rates that look sizeable in isolation. The figure to beat is the unmandated baseline of 5-25%, and the only institutions that consistently beat it are those that mandate self-archiving: their deposit rates jump to 60% and approach 100% within a few years.

There are already 170 self-archiving mandates worldwide registered in ROARMAP — 96 institutional, 24 departmental and 46 funder mandates — but alas none yet from Japan. If there are any, it would be very helpful if they would be registered in ROARMAP.

Also, although Japan has at least 158 repositories, only 77 of them are registered in ROAR.

It would be very helpful if the rest were registered in ROAR too…


Björk, B-C., Welling, P., Laakso, M., Majlender, P., Hedlund, T. et al. (2010) Open Access to the Scientific Journal Literature: Situation 2009. PLOS ONE 5(6): e11273.

Gargouri, Y., Hajjem, C., Larivière, V., Gingras, Y., Brody, T., Carr, L. & Harnad, S. (2010) Self-Selected or Mandated, Open Access Increases Citation Impact for Higher Quality Research. PLOS ONE (in press)

Harnad, S. (2008) Estimating Annual Growth in OA Repository Content. Open Access Archivangelism, 9 August 2008.

Sale, A. (2006) Researchers and institutional repositories. In Jacobs, N. (Ed.) Open Access: Key Strategic, Technical and Economic Aspects, chapter 9, pp. 87-100. Chandos Publishing, Oxford.

Sale, A. (2006) The Impact of Mandatory Policies on ETD Acquisition. D-Lib Magazine 12(4), April 2006.

Sale, A. (2006) Comparison of content policies for institutional repositories in Australia. First Monday 11(4), April 2006.

Sale, A. (2006) The acquisition of open access research articles. First Monday 11(9), October 2006.

Sale, A. (2007) The Patchwork Mandate. D-Lib Magazine 13(1/2), January/February 2007.

Stevan Harnad
American Scientist Open Access Forum

Eight More Green Open Access Self-Archiving Mandates Registered in ROARMAP

Eight more Green OA self-archiving mandates have been registered in ROARMAP since June, bringing the worldwide total to 170 (96 institutional mandates, 24 departmental mandates and 46 funder mandates):

Swedish Research Council Formas (SWEDEN)

Central Scientific Library of V.N. Karazin Kharkiv National University (UKRAINE)

Universidade Aberta (PORTUGAL)

Instituto Politécnico de Bragança (PORTUGAL)

University of Surrey (UK)

Erasmus University Rotterdam (NETHERLANDS)

Heart and Stroke Foundation of Canada (CANADA)

Institut français de recherche pour l’exploitation de la mer (Ifremer) (FRANCE)

Documents and Data…

Last month I was on Dr. Kiki’s Science Hour. The show was a lot of fun (despite my technical problems, which came with my recent move from Mac to GNU/Linux!), and I also discovered that at least one person I went to high school with is a fan of Dr. Kiki, because he told everyone about the show at my recent high school reunion. Good stuff.

In the show, I did my usual rant about the web being built for documents, not for data. And that got me a great question by email. I wrote a long answer that I decided was a better blog post than anything else. Here goes.

Although I’m familiar with the Creative Commons & Science Commons, the interview really helped me understand the bigger picture of the work you do. Among many other significant and timely anecdotes, I received the message that the internet is built around document search and not data search. This comment intrigued me immensely. I want to explore that a little more to understand exactly what you meant. Most importantly, I want to understand what you believe the key differences between the documents and the data are. From one perspective, the documents contain the data, from another, the data forms the documents.

True, in some cases. But in the case of complex adaptive systems – like the body, the climate, or our national energy usage – the data are frequently not part of a document. They exist in massive databases which are loosely coupled, and are accessed by humans not through search engines but through large-scale computational models. There are so many layers of abstraction between user and data that it’s often hard to know where the actual data at the base of a model reside.

This is at odds with the fundamental nature of the Web. The Web is a web of documents. Those documents are all formatted the same way, using a standard markup language, and the same protocol to send copies of those documents around. Because the language allows for “links” between documents, we can navigate the Web of documents by linking and clicking.

There’s more fundamental stuff to think about. Because the right to link is granted to creators of web pages, we get lots of links. And because we get lots of links (and there aren’t fundamental restrictions on copying the web pages) we get innovative companies like Google that index the links and rank web pages, higher or lower, based on the number of links referring to those pages. Google doesn’t know, in any semantic sense, what the pages are about, what they mean. It simply has the power to do clustering and ranking at a scale never before achieved, and that turns out to be good enough.
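As a toy illustration of this point (not Google’s actual algorithm, which iterates and weights links rather than merely counting them), pages can be ranked purely by inbound-link counts, with no semantic knowledge of their content. The pages and links below are hypothetical:

```python
from collections import Counter

# Hypothetical web of documents: each page maps to the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

# Count inbound links; nothing here knows what any page is "about".
inbound = Counter(target for targets in links.values() for target in targets)
ranking = sorted(inbound, key=inbound.get, reverse=True)
print(ranking)  # c.html ranks first: three pages link to it
```

The statistics of linking, aggregated at scale, stand in for meaning, and for documents that turns out to be good enough.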

But in the data world, very little of this applies. The data exist in a world almost without links. There is no accepted standard language, though some are emerging, to mark up data. And if you had that, then all you get is another problem – the problem of semantics and meaning. So far at least, the statistics aren’t good enough to help us really structure data the way they structure documents.

From what you posited and the examples you gave, I envision a search engine which has the capacity to form documents out of data using search terms, e.g. enter two variables and get a graph as a result instead of page results. Not too far from what ‘Wolfram Alpha’ is working on, but indexing all the data rather than pre-tabulated information from a single server/provider. Perhaps I’m close but I want to make sure we’re on the same sheet of music.

I’m actually hoping for some far more basic stuff. I am less worried about graphing and documents. If you’re at that level, you’ve a) already found the data you need and b) know what questions you want to ask about it.

This is the world in which one group of open data advocates live. It’s the world of apps that help you catch the bus in Boston. It’s one that doesn’t worry much about data integration, or data interoperability, because it’s simple data – where is the bus and how fast is it going? – and because it’s mapped against a grid we understand, which is…well, a map.

But the world I live in isn’t so simple. Doing deeply complex modeling of climate events, of energy usage, of cancer progression – these are not so easy to turn into iPhone apps. Treating their output as a document is the wrong metaphor. We don’t need a “map” of cancer – we need a model that tells us, given certain inputs, what our decision matrix looks like.

I didn’t really get this myself until we started playing around with massive-scale data integration at Creative Commons. But since then, in addition to what we do here, I’ve been to the NCBI, I’ve been to Oak Ridge National Lab, I’ve been to CERN…and the data systems they maintain are monstrous. They’re not going to be copied and maintained elsewhere, at least, not without lots of funding. They’re not “webby” like mapping projects are. There aren’t many hackers who can use them, nor is there a vast toolset to use.

So I guess I’m less interested in search engines for data than I am in making sure that people who are building the models can use crawlers to find the data they want, and that they can be legally allowed to harvest that data and integrate it. Doing so is not going to be easy. But if we don’t design for that world, for model-driven access, then harvest and integration will quickly approach NP levels of complexity. We cannot assume that the tools and systems that let us catch the bus will let us cure cancer. They may, someday, evolve into a common system, and I hope they do – but for now, the iPhone approach is using a slingshot against an armored division.
