Cornell, Arxiv and Institutional vs. Central Repositories

On Thu, 7 Oct 2010, Joseph Esposito wrote (in liblicense):

What is the current uptake on arXiv for physics articles? Is it 100%, that is, are there any articles in the field that are published in traditional physics journals that do not appear in arXiv?

It varies by field. In HEP and Astro, most published journal articles are also self-archived in Arxiv.

To understand the meaning of this, however, it is important to note that extremely few papers that are self-archived in Arxiv are not (eventually) published in journals: Arxiv is an access-provider — to published and pre-publication research papers. Arxiv is not a publisher: Arxiv neither peer-reviews its contents, nor does it certify that they have been peer-reviewed; the publisher does that.

Hence, like all open access repositories, Arxiv is a supplement to publication, not a substitute for it.

Considering the centrality of arXiv to the physics community, it is difficult to imagine that it would ever disappear (or that anyone would want it to).

No one wants Arxiv to disappear, but I’ll bet that within a decade or sooner Arxiv will just be another automated central harvester of distributed local deposits from authors’ own institutional repositories (IRs), not a central locus of direct, institution-external deposit. In the age of IRs, it is no longer necessary — nor does it make sense — for authors to self-archive institution-externally. It is also a needless central expense to manage deposit centrally. It makes much more sense to deposit institutionally and harvest centrally.

My understanding is that arXiv is funded by a combination of support from Cornell, a large government grant, and contributions from other research universities. If this funding were to disappear (I heard it was threatened a year or two ago), would arXiv be resurrected by the community?

Once all universities have IRs and IR self-archiving mandates, there will be no need to fund repositories for institution-external deposit. Harvesting is cheap. And each university’s IR will be a standard part of its online infrastructure.

Finally, once again taking the centrality of arXiv to the community it serves into consideration, what would happen if a modest deposit fee were assessed, say $50 per article?

The IR cost per paper deposited will be closer to 50c than $50, once all universities are hosting their own output, and mandating that it be deposited.

I am not suggesting that this should or should not happen; I am simply wondering what the outcome would be. (BioMed Central, PLoS, and Hindawi all charge more than this, though they provide additional services.) Would the number of deposits remain about the same? Would the number drop? And if it dropped, how precipitously?

Guess again! Once the burden of hosting, access-provision and archiving is offloaded onto each author’s institution, the only service that journals will need to provide is peer review, and hence journals will be charging institutions a lot less than they are charging now. (Print editions as well as online editions and their costs will be gone too.)

On Fri, Oct 8, 2010 at 12:57 AM, Simeon Warner wrote (in jisc-repositories):

The IR cost per paper deposited will be closer to 50c than $50, once all universities are hosting their own output, and mandating that it be deposited.

I do not think the 50c number is supported by fact or by trend. I know that for Cornell’s IR the number is much closer to $50 than to 50c if one divides cost to operate by the number of new submissions in the same period. (I would love to see data for other IRs.)

Simeon, I can only repeat the premise under which that prediction is made:

“once all universities are hosting their own output, and mandating that it be deposited.”

Cornell has not mandated deposit, and it is far from hosting all of its annual output. Ditto for all but about 100 universities so far worldwide.

(Not to mention that Cornell and many other universities may not have picked the optimal free IR software solution either ;>) …)

For arXiv the number is <$7. We have the benefit of significant scale (65k submissions/year) and a user community that requires very little hand-holding.

Yes, you have significant scale. But, for Arxiv, it is Cornell, a federal grant, plus funds from some universities that are paying for all the deposits, from all universities, in that one central repository.

To repeat: The sensible solution (and probably the only practical, affordable, sustainable one) is for Arxiv — and any other central archives like it in other fields — to harvest their respective content automatically from Institutional Repositories that host their own research output. (Institutions, after all, are the universal providers of all that content.)
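As a rough illustration of "deposit institutionally, harvest centrally", here is a minimal in-memory sketch. Everything in it is hypothetical: the `Deposit` record, the `harvest` function and the two sample repositories are invented for illustration, and no real repository software or harvesting protocol is modelled. The point is only that the central service stores nothing of its own; it builds its subject view from what each institution already hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deposit:
    """One self-archived paper, held in its author's institutional repository."""
    institution: str
    subject: str
    title: str


def harvest(repositories, subject):
    """Build a central subject view by collecting matching deposits from every IR."""
    return [
        deposit
        for repo in repositories
        for deposit in repo
        if deposit.subject == subject
    ]


# Hypothetical IRs, each hosting only its own institution's output.
ir_a = [
    Deposit("University A", "physics", "Paper 1"),
    Deposit("University A", "biology", "Paper 2"),
]
ir_b = [Deposit("University B", "physics", "Paper 3")]

# The central physics archive is just a harvested view over the IRs.
physics_view = harvest([ir_a, ir_b], "physics")
for deposit in physics_view:
    print(deposit.institution, "-", deposit.title)
```

In this sketch the deposit cost falls on the institution that produced the paper, and the central archive only pays for harvesting.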

The annual cost per paper deposited will be far less for an Institutional Repository — hosting only its own research output — once the institutions are indeed hosting all of their own annual research output — and not just a small fragment of it, as now.
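The arithmetic behind the per-deposit figures can be sketched as follows, using the same division Simeon describes (cost to operate divided by new submissions in the same period). The arXiv inputs are the ones quoted above (65k submissions/year at under $7 each). The IR annual cost of $10,000 is a purely hypothetical number chosen for illustration, and the sketch assumes that operating cost stays roughly fixed as deposits grow, which is an assumption, not a measured fact.

```python
def cost_per_deposit(annual_cost, deposits_per_year):
    """Annual operating cost divided by deposits received in the same year."""
    if deposits_per_year <= 0:
        raise ValueError("deposits_per_year must be positive")
    return annual_cost / deposits_per_year


# Figures quoted in the exchange: 65k submissions/year at under $7 each,
# which implies an annual operating cost of under $455,000.
ARXIV_SUBMISSIONS_PER_YEAR = 65_000
ARXIV_COST_CEILING = 7 * ARXIV_SUBMISSIONS_PER_YEAR

# Hypothetical IR with a fixed annual cost (assumption, not real data).
HYPOTHETICAL_IR_ANNUAL_COST = 10_000

# A near-empty IR versus the same IR as deposits grow tenfold and a hundredfold.
for deposits in (200, 2_000, 20_000):
    per_paper = cost_per_deposit(HYPOTHETICAL_IR_ANNUAL_COST, deposits)
    print(f"{deposits:>6} deposits/year: ${per_paper:.2f} per deposit")
```

Under these assumptions the per-deposit cost of the same repository moves from $50 to 50c purely because the denominator grows. Whether real IR costs stay that flat at full capacity is exactly the point in dispute.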

Most institutions today have IRs that are still near-empty rather than at full capacity (as far as OA’s target content is concerned). (The cost/benefit of universities hosting their own grey literature output and other kinds of content they generate is another matter, but not to be reckoned into this comparison with Arxiv regarding per-article cost. IRs can archive lots of kinds of things, including departmental reports or family photo albums, if desired…)

And Cornell, of course, has the double burden of hosting a near-empty, unmandated IR for its own refereed research output, plus the (partial) expense of hosting Arxiv for the rest of the world!


Annual Costs Per Deposit of Hosting Refereed Research Output Centrally Versus Institutionally

Why Cornell’s Institutional Repository Is Near-Empty


This is not to say that IRs aren’t worth the support from their local institution! Compared with the cost of doing research resulting in an article, $50 is pocket change. I think that a key driver for IRs is that they align funding well with mission. At Cornell we consider it a worthwhile service for our faculty to provide considerably more support for the IR than arXiv could provide its users.

There are many valid reasons for institutions creating and supporting their IRs — but only if they mandate that they be filled with their target content.

Among those many valid reasons are economic ones:

ABSTRACT: Among the many important implications of Houghton et al’s (2009) timely and illuminating JISC analysis of the costs and benefits of providing free online access (“Open Access,” OA) to peer-reviewed scholarly and scientific journal articles one stands out as particularly compelling: It would yield a forty-fold benefit/cost ratio if the world’s peer-reviewed research were all self-archived by its authors so as to make it OA. There are many assumptions and estimates underlying Houghton et al’s modelling and analyses, but they are for the most part very reasonable and even conservative. This makes their strongest practical implication particularly striking: The 40-fold benefit/cost ratio of providing Green OA is an order of magnitude greater than all the other potential combinations of alternatives to the status quo analyzed and compared by Houghton et al. This outcome is all the more significant in light of the fact that self-archiving already rests entirely in the hands of the research community (researchers, their institutions and their funders), whereas OA publishing depends on the publishing community. Perhaps most remarkable is the fact that this outcome emerged from studies that approached the problem primarily from the standpoint of the economics of publication rather than the economics of research.

Harnad, S. (2010) The Immediate Practical Implication of the Houghton Report: Provide Green Open Access Now. Prometheus 28 (1). pp. 55-59.

(As a side note I mention that at arXiv we consider free access and free submission to be foundational and thus did not consider an author-pays model. See for more details of our business planning process.)

Arxiv is a repository for articles that have been or will be refereed and published by journals. There is an “author pays” model for paying for that refereeing and publishing through author/institution publication fees (for OA journals), and a subscription model for non-OA journals (which are still the vast majority). But there is not, never was, and never need be an “author pays” model merely to pay for the deposit of the author’s draft of those same articles.

Arxiv is a repository, providing access, not a publisher of refereed research. It is the many different journals in which Arxiv’s depositors publish that are still the ones doing the refereeing and the publishing (i.e., implementing the peer review process and certifying the outcome, if successful, as having met that journal’s established quality standards). And journals need to recover the costs of providing that essential service, either via journal subscription tolls or via “author pays” (i.e., article publication fees).

Once the burden of hosting, access-provision and archiving is offloaded onto each author’s institution, the only service that journals will need to provide is peer review, and hence journals will be charging institutions a lot less than they are charging now. (Print editions as well as online editions and their costs will be gone too.)

Overlay journals are also very interesting and I hope will grow in number. This does not seem to be happening yet though. A trend we see right now is a rather problematic increase in the number of low-quality, author-pays, website-and-little-else online journals. They aggressively promote their articles through open access services such as arXiv while established journals wrestle with the transition.

On this you are entirely right, Simeon (though I think the term “overlay journals” is a misdescription of what may eventually come to pass, once all refereed, published articles are being self-archived in their authors’ IRs).

(And Cornell is aiding and abetting the very trend you mention, by agreeing pre-emptively to subsidize “author pays” costs for (some of) Cornell authors’ articles while failing to mandate self-archiving of all of Cornell authors’ articles, cost-free!)


Harnad, S. (2009) The PostGutenberg Open Access Journal. In: Cope, B. & Phillips, A. (Eds.) The Future of the Academic Journal. Chandos.

In all of this the tools necessary to use IR content effectively still lag well behind the facilities offered by subject repositories.

Many of the necessary tools are not needed at the individual IR level, because search takes place at the harvester level.

What IRs lack is not tools, but content. Once we have the OA’s target content (refereed journal articles), developing the tools is a piece of cake.

One should also not underestimate the cost of building effective collections over harvested data (see, for example, the NSDL experience).

We can cross that bridge when we get to it — if Google Scholar does not cross it for us — once the target content is indeed being deposited in the IRs, globally — because deposit has been universally mandated at long last.

Stevan Harnad
American Scientist Open Access Forum