National Bike Month comes to a close

National Bike Month, which takes place every May, ends this week. To celebrate a month of cycling-focused activities, EveryONE is highlighting some recent cycling research published in PLOS ONE.

As biking becomes ever more popular and bike-sharing programs expand, such as last weekend’s launch in New York City, cycling injuries and fatalities may increase as well. Although most people acknowledge the utility of wearing a helmet, encouraging cyclists to actually use them can be difficult. A study examining the efficacy of several helmet-promotion measures showed that attitudes about helmets making people “look ridiculous” or “old-fashioned” can be hard to counter. Even providing cyclists with free helmets was only mildly successful in convincing non-helmet users to wear one. The most effective measures included pressure from family or friends to wear one, as well as shifting the safety dialogue from helmets as head and brain protection to promoting helmets as “face-protecting” devices. So folks, as you get on your bikes this weekend, protect your face and wear a helmet.

Aesthetics aside, competitive cyclists are frequently seeking ways to improve their performance and speed recovery. Compression sportswear, from sleeves to knee-high socks to shorts, is one current performance-enhancing trend. These products tout improved arterial blood flow from the compression as a way to increase speed, reduce the chance of injury and shorten recovery time. A pair of compression cycling shorts can be quite expensive, though, so before you purchase your way to shorter race times, consider that recent research suggests the claims of efficacy may be overreaching. In fact, a study of athletes wearing compression shorts showed that blood flow to the muscle actually decreased, contradicting many of the claims from these sportswear companies. Getting faster may just require more time in the saddle.

Bike theft is a persistent issue facing casual and competitive cyclists alike, but there’s some good news on this front: recent research showed a relatively simple deterrent to be surprisingly effective. The study found a 62% decrease in bicycle theft in locations where an ominous sign showing a person’s eyes and the words “Cycle Thieves, We Are Watching You” was posted above the bike racks. Theft in locations without these signs rose.


For more research on bikes and cycling performance, visit PLOS ONE.

Citation:  Constant A, Messiah A, Felonneau M-L, Lagarde E (2012) Investigating Helmet Promotion for Cyclists: Results from a Randomised Study with Observation of Behaviour, Using a Semi-Automatic Video System. PLoS ONE 7(2): e31651. doi:10.1371/journal.pone.0031651

Sperlich B, Born D-P, Kaskinoro K, Kalliokoski KK, Laaksonen MS (2013) Squeezing the Muscle: Compression Clothing and Muscle Metabolism during Recovery from High Intensity Exercise. PLoS ONE 8(4): e60923. doi:10.1371/journal.pone.0060923

Nettle D, Nott K, Bateson M (2012) ‘Cycle Thieves, We Are Watching You’: Impact of a Simple Signage Intervention against Bicycle Theft. PLoS ONE 7(12): e51738. doi:10.1371/journal.pone.0051738

Image Credits: cyclist by jesse.millan, poster from pone.0051738

Brain and Behavior: Issue 3 Highlights and Introducing Altmetric!

Brain and Behavior’s latest issue brings together research on the increased risk of anxiety from cigarette smoking, the effects of bisphenol A on social behavior in mice, and how the brain responds to cognitive load. The cover features an image from “Neuroanatomical and neuropharmacological approaches to postictal antinociception-related prosencephalic neurons: the role of muscarinic and nicotinic cholinergic receptors” by Renato Leonardo de Freitas, Luana Iacovelo Bolognesi, André Twardowschy, Fernando Morgan Aguiar Corrêa, Nicola R. Sibson and Norberto Cysne Coimbra.

We’re also excited to announce that Brain and Behavior is participating in a pilot program with the service Altmetric, which offers authors and readers alternative metrics for measuring the impact of articles and datasets across both traditional and social media.

Below is another highlight chosen by the editorial team. 

Crucifixion and median neuropathy
By Jacqueline M. Regan, Kiarash Shahlaie, Joseph C. Watson
Abstract: Crucifixion as a means of torture and execution was first developed in the 6th century B.C. and remained popular for over 1000 years. Details of the practice, which claimed hundreds of thousands of lives, have intrigued scholars as historical records and archaeological findings from the era are limited. As a result, various aspects of crucifixion, including the type of crosses used, methods of securing victims to crosses, the length of time victims survived on the cross, and the exact mechanisms of death, remain topics of debate. One aspect of crucifixion not previously explored in detail is the characteristic hand posture often depicted in artistic renditions of crucifixion. In this posture, the hand is clenched in a peculiar and characteristic fashion: there is complete failure of flexion of the thumb and index finger with partial failure of flexion of the middle finger. Such a “crucified clench” is depicted across different cultures and from different eras. A review of crucifixion history and techniques, median nerve anatomy and function, and the historical artistic depiction of crucifixion was performed to support the hypothesis that the “crucified clench” results from proximal median neuropathy due to positioning on the cross, rather than from direct trauma of impalement of the hand or wrist.

Link to the full table of contents here.

Submit your research here.

Don’t miss an issue!  Register for email table of contents alerts here.

Jailbreaking the PDF – 4; Making text from characters

In previous posts I have shown how we can, in most cases, recover a set of Unicode characters from a PDF. If the original document is standards-compliant (e.g. many government documents), this is almost trivial. For scholarly publications, where the taxpayer/student pays 5000 USD per paper, the publishers refuse to use the standards, so we have to apply heuristics to this awful mess. (I have not yet found a scholarly publisher which is compliant and produces a syntactically manageable PDF – we pay them and they corrupt the information.) But we have enough experience that for a given publisher we are correct 99–99.999% of the time (depending on the discipline – maths is harder than narrative text).

So now we have pages, and on each page an UNORDERED list of characters. (We cannot rely on the order in which characters are transmitted – I spent two “wasted” months trying to use sequences and character groupings.) We have to reconstruct text from the following STANDARD information for each character:

  • Its XY coordinates (raw PDF uses complex coordinates, PDFBox normalises to the page (0-600, 0-800))
  • Its FontFamily (e.g. Helvetica). This matters because semantics are often conveyed by fonts – monospace implies code or data. (I shall upset typographical purists, as strictly I should use “typeface” and not “font” or “font family”; but “FontFamily” is universal in PDF and computer terminology.)
  • Its colour. This can be moderately complex – a character has an outline (stroke) and body (fill) and there are alpha overlays, transparency, etc. But most of the time it’s black.
  • Its font weight: Normal or Bold. It’s complicated when publishers use fonts like MediumBold (greyish).
  • Its size. The size is the actual font-size in pixels, not necessarily the nominal point size.

    Characters in the same font have different extents because of ascenders and descenders.

  • Its width. Monospaced fonts have equal width for all characters.

    Note that “I” and “m” have the same width. Any deliberate spaces also have the same width. That makes it easy to create words. The example above would have words “Aa”, “Ee”, “Qd”. (A word here is better described as a space-separated token, but “word” is simpler. It doesn’t mean it makes linguistic or numeric sense.)

    If the font is not monospaced then we need to know the width of each character. In a proportional font, widths vary.

    See how the “P” is twice as wide as the “I” or “l” in the proportional font. We MUST know the width to work out whether there is a space after it. Because there are NO SPACES in PDFs.

  • Its style. Conflated with “slope”. Most scientists simply think “italic” (as in Java). But we find “oblique” and “underline” and many others. We need to project these to “italic” and “underline” as these have semantics.
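The word-building logic the attributes above make possible can be sketched in a few lines of Java. This is a minimal, illustrative sketch, not #AMI2’s actual code: the `Glyph` record and the half-width gap threshold are my own assumptions, standing in for the XY coordinates and widths the PDF supplies.

```java
import java.util.*;

// Sketch: rebuild space-separated "words" from an unordered bag of
// characters, using only coordinates and widths (PDFs contain no
// explicit space characters).
public class WordBuilder {
    record Glyph(String text, double x, double y, double width) {}

    static List<String> words(List<Glyph> glyphs, double gapFactor) {
        // Sort into reading order: top-to-bottom, then left-to-right.
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y())
                              .thenComparingDouble(Glyph::x));
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null) {
                double gap = g.x() - (prev.x() + prev.width());
                // A new line, or a gap wider than gapFactor * the previous
                // character's width, ends the current word.
                if (g.y() != prev.y() || gap > gapFactor * prev.width()) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.text());
            prev = g;
        }
        if (current.length() > 0) words.add(current.toString());
        return words;
    }

    public static void main(String[] args) {
        // "Aa Ee" on one line, characters deliberately shuffled.
        List<Glyph> glyphs = List.of(
            new Glyph("e", 40, 100, 10),
            new Glyph("A", 0, 100, 10),
            new Glyph("E", 30, 100, 10),
            new Glyph("a", 10, 100, 10));
        System.out.println(words(glyphs, 0.5));  // [Aa, Ee]
    }
}
```

Real pages need more care (baseline jitter, hyphenation, multiple columns), but the core idea is exactly this: sort, then split on gaps measured against character widths.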

Note that Normal|Bold, Normal|Italic and Normal|Underline can be combined to give 8 variants. Conformant PDF makes this easy – PDFBox has an API which includes:

  • public float getItalicAngle()
  • public float getUnderlineThickness()
  • public boolean isBold(Font font)
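Given answers from an API like the one above, collapsing the three binary attributes into one of the eight variants is straightforward. A minimal sketch, assuming the conventions in the text (non-zero italic angle means italic, positive underline thickness means underlined); the method and label names are mine, not PDFBox’s:

```java
// Sketch: map (bold, italic angle, underline thickness) to one of the
// eight Normal/Bold x Italic x Underline variants discussed above.
public class StyleVariants {
    static String variant(boolean bold, float italicAngle, float underlineThickness) {
        StringBuilder sb = new StringBuilder(bold ? "Bold" : "Normal");
        if (italicAngle != 0f) sb.append("|Italic");        // sloped glyphs
        if (underlineThickness > 0f) sb.append("|Underline"); // underlined run
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(variant(false, 0f, 0f));    // Normal
        System.out.println(variant(true, -12f, 0f));   // Bold|Italic
        System.out.println(variant(true, -12f, 0.5f)); // Bold|Italic|Underline
    }
}
```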


If we have all this information then it isn’t too difficult to reconstruct:

  • words
  • Weight of words (bold)
  • Style of word (italic or underline)

Which already takes us a long way.

Do scholarly publishers use this standard?


(You probably guessed this.) For example I cannot get the character width out of ELife, the new Wellcome/MPI/HHMI journal. This seems to be because ELife hasn’t implemented the standard. They launched in 2012. There is no excuse for a modern publisher not being standards-compliant.

So the last posts have shown non-compliance in ELife, PeerJ and BMC. Oh, and PLoSOne also uses opaque fontFamilies (e.g. AdvP49811). So the Open Access publishers all use non-standard fonts.

Do you assume that because closed access publishers charge more, they do better?

I can’t answer that because they have more money to pay lawyers.

I’ll let you guess. Since #AMI2 is Open Source you can do it yourself.

Global Research Council: Counting Gold OA Chicks Before the Green OA Eggs Are Laid

The Global Research Council’s Open Access Action Plan is, overall, timely and welcome, but it is far too focused on OA as (“Gold”) OA publishing, rather than on OA itself (online access to peer-reviewed research free for all).

And although GRC does also discuss OA self-archiving in repositories (“Green” OA), it does not seem to understand Green OA’s causal role in OA itself, nor does it assign it its proper priority.

There is also no mention at all of the most important, effective and rapidly growing OA plan of action, which is for both funders and institutions to mandate (require) Green OA self-archiving. Hence neither does the action plan give any thought to the all-important task of designing Green OA mandates and ensuring that they have an effective mechanism for monitoring and ensuring compliance.

The plan says:

“The major principles and aims of the Action Plan are simple: they are (a) encouragement and support for publishing in open access journals, (b) encouragement and support for author self-deposit into open access repositories, and (c) the creation and inter-connection of repositories.”

Sounds like it covers everything — (a) Gold, (b) Green, and (c) Gold+Green — but the devil is in the details, the causal contingencies, and hence the priorities and sequence of action.

“In transitioning to open access, efficient mechanisms to shift money from subscription budgets into open access publication funds need to be developed.”

But the above statement is of course not about transitioning to OA itself, but just about transitioning to OA publishing (Gold OA).

And the GRC’s action plans for this transition are putting the cart before the horse.

There are very strong, explicit reasons why Green OA needs to come first — rather than double-paying for Gold pre-emptively (subscriptions plus Gold) without first having effectively mandated Green, since it is Green OA that will drive the transition to Gold OA at a fair, affordable, sustainable price:

Plans by universities and research funders to pay the costs of Open Access Publishing (“Gold OA”) are premature. Funds are short; 80% of journals (including virtually all the top journals) are still subscription-based, tying up the potential funds to pay for Gold OA; the asking price for Gold OA is still high; and there is concern that paying to publish may inflate acceptance rates and lower quality standards. What is needed now is for universities and funders to mandate OA self-archiving (of authors’ final peer-reviewed drafts, immediately upon acceptance for publication) (“Green OA”). That will provide immediate OA; and if and when universal Green OA should go on to make subscriptions unsustainable (because users are satisfied with just the Green OA versions) that will in turn induce journals to cut costs (print edition, online edition, access-provision, archiving), downsize to just providing the service of peer review, and convert to the Gold OA cost-recovery model; meanwhile, the subscription cancellations will have released the funds to pay these residual service costs. The natural way to charge for the service of peer review then will be on a “no-fault basis,” with the author’s institution or funder paying for each round of refereeing, regardless of outcome (acceptance, revision/re-refereeing, or rejection). This will minimize cost while protecting against inflated acceptance rates and decline in quality standards.

Harnad, S. (2010) No-Fault Peer Review Charges: The Price of Selectivity Need Not Be Access Denied or Delayed. D-Lib Magazine 16 (7/8).

Action 5: Develop an integrated funding stream for hybrid open access

Worst of all, the GRC action plan proposes to encourage and support hybrid Gold OA, with publishing not just being paid for doubly (via subscriptions to subscription publishers + via Gold OA fees to Gold OA publishers) but, in the case of hybrid Gold, with the double-payment going to the very same publisher, which not only entails double-payment by the research community, but allows double-dipping by the publisher.

That is the way to leave both the price and the timetable for any transition to OA in the hands of the publisher.

Action 6: Monitor and assess the affordability of open access

There is no point monitoring the affordability of Gold OA today, at a stage when it is just a needless double-payment, at the publisher’s current arbitrary, inflated Gold OA asking price.

What does need monitoring is compliance with mandates to provide cost-free Green OA, while subscriptions are still paying in full (and fulsomely) for the cost of publication, as they are today.

Action 7: Work with scholarly societies to transition society journals into open access

The only thing needed from publishers today — whether scholarly or commercial — is that they not embargo Green OA. Most (60%) don’t.

The transition to Gold OA will only come after Green OA has made subscriptions unsustainable, which will not only induce publishers to cut obsolete costs, downsize and convert to Gold OA, but it will also release the concomitant institutional subscription cancellation windfall savings to pay the price of that affordable, sustainable post-Green Gold.

Action 8: Supporting self-archiving through funding guidelines and copyright regulations

“The deposit of publications in open access repositories is often hampered not only by legal uncertainties, but also by the authors’ reluctance to take on such additional tasks. Funding agencies will address this issue by exploring whether and how authors can be encouraged and supported in retaining simple copyrights as a precondition to self-archiving. In doing so, funders will also address authors’ need to protect the integrity of their publications by providing guidance on suitable licenses for such purpose.”

Yes, Green OA needs to be supported. But the way to do that is certainly not just to “encourage” authors to retain copyright and to self-archive.

It is (1) to mandate (require) Green OA self-archiving (as 288 funders and institutions are already doing: see ROARMAP), (2) to adopt effective mandates that moot publisher OA embargoes by requiring immediate-deposit, whether or not access to the deposit is embargoed, and (3) to designate institutional repository deposit as the mechanism for making articles eligible for research performance review. Then institutions will (4) monitor and ensure that their own research output is being deposited immediately upon acceptance for publication.

Action 9: Negotiate publisher services to facilitate deposit in open access repositories

Again, the above is a terribly counterproductive proposal. On no account should it be left up to publishers to deposit articles.

For subscription publishers, it is in their interests to gain control over the Green OA deposit process, thereby making sure that it is done on their timetable (if it is done at all).

For Gold OA, it’s already OA, so depositing it in a repository is no challenge.

It has to be remembered and understood that the “self” in self-archiving is the author. The keystrokes don’t have to be personally executed by the author (students, librarians, secretaries can do the keystrokes too). But they should definitely not be left to publishers to do!

Green OA mandates are adopted to ensure that the keystrokes get done, and on time. Most journals are not Gold OA, but a Green OA mandate requires immediate deposit whether or not the journal is Gold OA, and whether or not access to the deposit is embargoed.

Action 10: Work with publishers to find intelligent billing solutions for the increasing amount of open access articles

The challenge is not to find “billing solutions” for the minority of articles that are published as Gold OA today. The challenge is to adopt an effective, verifiable Green OA mandate to self-archive all articles.

Action 11: Work with repository organisations to develop efficient mechanisms for harvesting and accessing information

This is a non-problem. Harvesting and accessing OA content is already powerful and efficient.

It can of course be made incomparably more powerful and efficient. But there is no point or incentive in doing this while the target content is still so sparse — because it has not yet been made OA (whether Green or Gold)!

Only about 10–40% of content is OA in most fields.

The way to drive that up to the 100% that it could already have been for years is to mandate Green OA.

Then (and only then) will there be the motivation to “develop [ever more] efficient mechanisms for harvesting and accessing [OA] information”.

Action 12: Explore new ways to assess quality and impact of research articles

This too is happening already, and is not really an OA matter. But once most articles are OA, OA itself will generate rich new ways of measuring quality and impact.

Harnad, S. (2009) Open Access Scientometrics and the UK Research Assessment Exercise. Scientometrics 79 (1)

(Some of these comments have already been made in connection with Richard Poynder’s interview of Johannes Fournier.)

Evolutionary Applications Issue 6.4 Now Live

Evolutionary Applications has now published its latest issue. This issue includes a number of top papers highlighted by Editor-in-Chief Louis Bernatchez:

The impact of natural selection on health and disease: uses of the population genetics approach in humans by Estelle Vasseur and Lluis Quintana-Murci

Genomic and environmental selection patterns in two distinct lettuce crop–wild hybrid crosses by Yorike Hartman, Brigitte Uwimana, Danny A. P. Hooftman, Michael E. Schranz, Clemens C. M. van de Wiel, Marinus J. M. Smulders, Richard G. F. Visser and Peter H. van Tienderen

Evolutionary rescue in populations of Pseudomonas fluorescens across an antibiotic gradient by Johan Ramsayer, Oliver Kaltz and Michael E. Hochberg

The journal continues to receive a high number of submissions across all areas of evolutionary biology and we would encourage you to submit your paper to the journal. Evolutionary Applications publishes papers that utilize concepts from evolutionary biology to address biological questions of health, social and economic relevance. In order to better serve the community, we also now strongly encourage submissions of papers making use of modern molecular and genetic methods to address important questions in any of these disciplines and in an applied evolutionary framework. Two of the papers highlighted above are good examples of this sort of paper. Further information about the journal’s aims and scope can be found on the website.

Make sure that you never miss an issue by signing up for free table of contents alerts here >

“Licences4Europe” has not accepted “The Right to Read is the Right to Mine”

One sentence summary (this link has all the documentation)

Stakeholders representing the research sector, SMEs and open access publishers withdraw from Licences for Europe


I have formally been a member of EC-L4E-WG4, a working group of the European Commission concentrating on Text and Data Mining (TDM, though I prefer “Content Mining”). I haven’t attended meetings (due to date clashes) but Ross Mounce has stood in for me and given brilliant presentations. The initial idea of the WG was to facilitate TDM as an added value to conventional publications and other sources. (The current problem is that copyright can be interpreted as forbidding TDM.) When I and others joined this effort it was on the assumption that we would be looking for positive ways forward to encourage TDM.

When I buy a book I can do what I like with it. I can write on it. I can cut it up into bits. I can give/sell the book to someone else. I can give/sell the cut-out bits to someone else. I can stick the cut-out bits into a new book. I can transcribe the factual content. I can do almost anything other than copy non-facts.

With scholarly articles I can’t do any of this. I cannot own an article, I can only rent it. (Appalling concession #1 by Universities went completely unnoticed – I shall blog more.) I cannot extract facts from it. (Even more appalling concession #2 by Universities went completely unnoticed – I shall blog more.) So the publishers have dictated to Universities that we cannot do anything with the 10,000,000,000 USD we give to the publishers each year.

The publishers are now proposing that if we want to use any of OUR content (which we have already paid for) we should pay the publishers MORE. That TDM is an “added service” provided by publishers. It’s not. I can TDM without any help from the publishers. The only thing the publishers are doing is holding us to ransom.

If you don’t feel this is unjust and counterproductive, stop reading. Back to “Licences for Europe”…

The L4E group has had no chance to set its own assumptions. From the outset the chair has insisted that this group is “L4E”, Licences for Europe. The default premise is that document producers can and should add additional restrictions through licences. In short – we have fought this publicly and the chair has failed to listen to us, let alone consider our arguments. Who are we?

  • The Association of European Research Libraries (LIBER)
  • The Coalition for a Digital Economy
  • European Bureau of Library Information and Documentation Associations (EBLIDA)
  • The Open Knowledge Foundation
  • Communia
  • Ubiquity Press Ltd.
  • Trans-Atlantic Consumer Dialogue
  • National Centre for Text Mining, University of Manchester
  • European Network for Copyright in support of Education and Science (ENCES)
  • Jisc

Not a lightweight list. Here’s the formal history:

We welcomed the orientation debate by the Commission in December 2012 and the subsequent commitment to adapt the copyright framework to the digital age. We believe that any meaningful engagement on the legal framework within which data driven innovation exists must, as a point of centrality, address the issue of limitations and exceptions. Having placed licensing as the central pillar of the discussion, the “Licences for Europe” Working Group has not made this focused evaluation possible. Instead, the dialogue on limitations and exceptions is only taking place through the refracted lens of licensing. This incorrectly presupposes that additional relicensing of already licensed content (i.e. double licensing) – and by implication also licensing of the open internet – is the solution to the rapid adoption of TDM technology.

We wrote expressing our concerns (March 14) – some sentences (highlighting is mine):

10. Data driven innovation requires the lowest barriers possible to reusing content. Requiring the relicensing of copyright works one already has lawful access to for a non-competing use is entirely disproportionate, and raises strong ethical questions as it will affect what computer based medical and scientific research can and cannot be undertaken in the EU.

11. A situation where each proposed TDM based research or use of content, to which one already has lawful access, has to be submitted for approval is unscalable, and will raise barriers to research and reduce online innovation. It will slow medical discoveries and data driven innovation inexorably, and will only serve to drive jobs, research, health and wealth-creation elsewhere.

12. For the full potential of data driven innovation to become a reality, a limitation and exception that allows text and data mining for any purpose, which cannot be overridden by private contracts, is required in EU law.

13. Subject to point 3, we must be able to share the results of text and data mining with no hindrances irrespective of copyright laws or licensing terms to the contrary.

14. In the European information society, the right to read must be the right to mine.

(I am particularly pleased that my phrase “the right to read must be the right to mine” expresses our message succinctly.)

Unfortunately the response was anodyne and platitudinous (“win-win solutions for all stakeholders”). It became clear that this group could not make any useful progress and at worst would legitimize the interests of the “content owners”.

So we have withdrawn.


Therefore, we can no longer participate in the “Licences for Europe” process. We maintain that a vibrant internet and a healthy scholarly publishing community need not be at odds with a modern copyright framework that also allows for the barrier-free extraction of facts and data. We have already expressed this view sufficiently well within the Working Group.

And we have concerns about transparency.

We would like to reiterate our request for transparency around the “Licences for Europe” dialogue and kindly request that the following actions be taken:

  • That the list of organisations participating in all of the “Licenses for Europe” Working Groups be made publicly available on the “Licences for Europe” website;
  • That the date of withdrawal for organisations leaving the process is also recorded on this list;
  • That it is made clear on any final documents that the outputs from the working group on TDM are not endorsed by our organisations and communities.


If you feel that we have a right to mine our information, then help us fight for it. Because inaction simply hands our rights to vested interests.

Sharing was Caring for Ancient Humans and Their Prehistoric Pups

While the tale of how man’s best friend came to be (i.e., domestication) is still slowly unfolding, a recently published study in PLOS ONE may provide a little context—or justification?—for dog lovers everywhere. It turns out that even thousands of years ago, humans loved to share food with, play with, and dress up their furry friends.

In the study titled “Burying Dogs in Ancient Cis-Baikal, Siberia: Temporal Trends and Relationships with Human Diet and Subsistence Practices,” biologists, anthropologists, and archaeologists joined forces to investigate the nature of the ancient human-dog relationship by analyzing previously excavated canid remains worldwide, with a large portion of specimens from modern-day Eastern Siberia, Russia. The authors performed genetic analysis and skull comparisons to establish that the canid specimens were most likely dogs, not wolves, which was an unsurprising but important distinction when investigating the human-canine bond. The canid skulls from the Cis-Baikal region most closely resembled large Siberian huskies, or sled dogs. Radiocarbon dating from previous studies also provided information regarding the dates of death and other contextual information at the burial sites.

The researchers found that the dogs buried in Siberia, many during the Early Neolithic period 7,000-8,000 years ago, were only found at burial sites shared with foraging humans. Dogs were found buried in resting positions, or immediately next to humans at these sites, and their graves often included various items or tools seemingly meant for the dogs. One dog in particular was adorned with a red deer tooth necklace around its neck and deer remnants by its side, and another was buried with what appears to be a pebble or toy in its mouth.


By analyzing the carbon and nitrogen in human and dog specimens in this region, the researchers were able to determine similarities in human and dog diets, both of which were rich in fish. This finding may be somewhat surprising because one might assume that dogs helped humans hunt terrestrial game, and would consequently be less likely found among humans that ate primarily fish.

The authors speculate that dogs were considered spiritually similar to humans, and were therefore buried at the same time in the same graves. The nature of the burials and the similarities in diet also point toward an intimate and personal relationship, both emotional and social, between humans and their dogs—one that involved sharing food and giving dogs the same burial rites as the humans they lived among. Ancient dogs weren’t just work animals or hunters, the authors suggest, but important companion animals and friends as well.

Citation: Losey RJ, Garvie-Lok S, Leonard JA, Katzenberg MA, Germonpré M, et al. (2013) Burying Dogs in Ancient Cis-Baikal, Siberia: Temporal Trends and Relationships with Human Diet and Subsistence Practices. PLoS ONE 8(5): e63740. doi:10.1371/journal.pone.0063740

Image Credits: Losey RJ, Garvie-Lok S, Leonard JA, Katzenberg MA, Germonpré M, et al. (2013) Burying Dogs in Ancient Cis-Baikal, Siberia: Temporal Trends and Relationships with Human Diet and Subsistence Practices. PLoS ONE 8(5): e63740. doi:10.1371/journal.pone.0063740

Siberian husky photo by Pixel Spit

Invite: SPARC Europe Open Session at the Pre-LIBER conference in Munich Germany

You are warmly invited to the SPARC Europe Open Session, held in conjunction with the Pre-LIBER conference in Munich, Germany, on 25 June from 2.30pm to 5.30pm.

Venue: Hilton Park Hotel, Am Tucherpark 7, 80538 Munich.

We will discuss how libraries can make open access work now that open access is moving into the mainstream. Can libraries/universities reallocate funds from the big deals to the support of open access publishing? Speakers for this topic are:

Prof. Björn Brembs, Universität Regensburg, Germany
Prof. Dr. Susanne Weigelin-Schwiedrzik, Pro Vice Chancellor, University of Vienna, Austria
Anna Lundén, Coordinator, the Swedish Library Consortium, BIBSAM, National Library of Sweden
Berndt Dugall, Library Director, Johann Wolfgang Goethe-Universität, Frankfurt am Main, Germany

There will also be an informal discussion between Dr Celina Ramjoué of the European Commission and Dr Alma Swan, SPARC Europe’s Director of Advocacy, on how organisations are devoting considerable resources to working for a good open access policy for the EU. How far have we come, and what is the nature of the progress?

Please send a mail to if you wish to participate in the session. 

Invite in German:

Jailbreaking the PDF – 3; Styles and fonts and the problems from publishers

Many scientific publications use specific styling to add semantics. In converting to XML it’s critical that we don’t throw these away at an early stage, yet many common tools discard such styles. #AMI2 does its best to preserve all of them, and I think it does fairly well. There are different reasons for using styles, and I give examples from OA publishers:

  • Bold – used extensively for headings and inline structuring. Note (a) the bold for the heading and (b) the bold at the start of the line.

  • Italic. Species are almost always rendered this way.

  • Monospaced.
    Most computer code is represented in this (abstract) font.

This should have convinced you that fonts and styles matter and should be retained. But many PDF2xxx systems discard them, especially for scholarly publications. There’s a clear standard in PDF for indicating bold and italic, and PDFBox gives a clear API for this. But many scholarly PDFs are awful (did I mention this before?). The BMC fonts don’t declare that they are bold even though they are. Or italic. So we have to use heuristics. If a BMC font has “+20” after its name it’s probably bold, and “+3” means italic.
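As a concrete illustration, the suffix heuristic can be sketched like this (a minimal Python sketch: the function name, the fallback on conventional name markers, and the exact rules are my own assumptions drawn from the observations above, not any documented standard):

```python
import re

def guess_style(font_name):
    """Guess (weight, style) from a PDF font name when the font
    itself fails to declare bold/italic, as the BMC fonts do."""
    weight, style = "normal", "normal"
    # Observed BMC convention: "+20" suffix tends to mean bold, "+3" italic.
    if re.search(r"\+20$", font_name):
        weight = "bold"
    elif re.search(r"\+3$", font_name):
        style = "italic"
    # Fall back to conventional name markers when present.
    lower = font_name.lower()
    if "bold" in lower:
        weight = "bold"
    if "italic" in lower or "oblique" in lower:
        style = "italic"
    return weight, style
```

In practice each publisher’s font naming needs its own heuristics, which is exactly why this is so fragile.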

Isn’t this a fun puzzle?

No. It’s holding science back. Science should be about effective communication. If we are going to use styles rather than proper markup, let’s do it properly. Let’s tell the world it’s bold. Let’s use 65 to mean A.

There are a few cases where an “A” is not an “A”. As in

Most of these have specific mathematical meanings and uses, and most have their own Unicode codepoints. They are not letters in the normal sense of the word – they are symbols. And if they are well created and standard then they are manageable.
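For example, the letter-like mathematical symbols live at their own Unicode codepoints, quite distinct from the ordinary letters they resemble (a quick Python check):

```python
import unicodedata

plain_a = "A"               # U+0041 LATIN CAPITAL LETTER A
math_bold_a = "\U0001D400"  # U+1D400 MATHEMATICAL BOLD CAPITAL A

print(ord(plain_a))                   # 65
print(ord(math_bold_a))               # 119808
print(unicodedata.name(math_bold_a))  # MATHEMATICAL BOLD CAPITAL A
```

A well-made PDF that uses these codepoints is perfectly manageable; the trouble starts when the symbol is stored as something else entirely.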

But now an unnecessary nuisance from PeerJ (and I’m only using Open Access publishers so I don’t get sued):

What are the blue things? They look like normal characters, but they aren’t:

<text fill="#00a6fc" svgx:fontName="MinionExp-Regular" svgx:width="299.0"

x="284.784" y="162.408" font-weight="normal"></text>

<text fill="#00a6fc" svgx:fontName="MinionExp-Regular" svgx:width="299.0"

x="281.486" y="162.408" font-weight="normal"></text>

They are weird codepoints, outside the Unicode range:

These two seem to be small-capital “1” and “0”. They aren’t even valid Unicode characters. Some of our browsers won’t display them:

(Note the missing characters).
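The kind of check involved can be sketched as follows (the function is my own illustration; the ranges follow the Unicode definitions of surrogates and the Private Use Areas, which is where such publisher-invented codes usually land when they are representable at all):

```python
def classify_codepoint(cp):
    """Classify a raw character code from a PDF text run."""
    # Beyond U+10FFFF, or in the surrogate block, it is not Unicode at all.
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return "invalid"
    # Private Use Areas: legal Unicode, but with no standard meaning.
    if (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "private use"
    return "standard"
```

Anything that comes back "invalid" or "private use" cannot be reliably cut and pasted, which is exactly the DOI problem described below.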

Now the DOI is for many people a critically important part of the paper! It’s critical that it is correct and re-usable. But PeerJ (which is a modern publisher and tells us how it has used modern methods to do publishing better and cheaper) seems to have deliberately used totally non-standard characters for DOIs, to the extent that my browser can’t even display them. I’m open to correction – but this is barmy. (The raw PDF paper displays in Firefox, but that’s because the font is represented by glyphs rather than codepoints.) No doubt I’ll be told that it’s more important to have beautiful fonts to reduce eyestrain for humans and that corruption doesn’t matter. Most readers don’t even read the references – they simply cut and paste them.

So let’s look at the references:

Here the various components are represented in different fonts and styles. (Of course it would be better to use approaches such as BibJSON or even BibTeX, but that would make it too easy to get it right.) So here we have to use fonts and styles to guess what the various bits mean. Bold for the authors, followed by a period. A bold number for the year. The title in normal font. The journal in italics. More bold for the volume number. Normal for the pages. Light blue for the DOI.
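A style-driven reference parser along the lines just described might look like this (purely illustrative: the run format, the field rules and the blue hex colour are assumptions based on the description above, not PeerJ’s actual layout specification):

```python
def parse_reference(runs):
    """runs: list of (text, style, colour) tuples extracted from one
    reference, in reading order. Assign fields from style alone."""
    fields = {"authors": None, "year": None, "title": None,
              "journal": None, "volume": None, "pages": None, "doi": None}
    for text, style, colour in runs:
        if colour == "#00a6fc":                       # light blue -> DOI
            fields["doi"] = text
        elif style == "bold" and text.strip().rstrip(".").isdigit():
            n = text.strip().rstrip(".")
            if len(n) == 4 and fields["year"] is None:  # bold 4-digit -> year
                fields["year"] = n
            else:                                       # other bold number -> volume
                fields["volume"] = n
        elif style == "bold":                         # bold text -> authors
            fields["authors"] = text
        elif style == "italic":                       # italic -> journal
            fields["journal"] = text
        elif fields["title"] is None:                 # first normal run -> title
            fields["title"] = text
        else:                                         # remaining normal -> pages
            fields["pages"] = text
    return fields
```

The fragility is obvious: one typesetting quirk (a bold page number, a non-blue DOI) and the guess collapses, which is why real markup beats style-sniffing.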

But at least if we keep the styles then #AMI2 can hack it. Throwing away the styles makes it much harder and much more error prone.

So to summarise #AMI2=PDF2SVG does things that most other systems don’t do:

  • Manages non-standard fonts (but with human labour)
  • Manages styles
  • Converts to Unicode

AMI2 can’t yet manage raw glyphs, but she will in due time. (Unless YOU wish to volunteer – it actually is a fun machine-learning project.)

NOTE: If you are a large commercial publisher then your fonts are just as bad.

Jailbreaking the PDF – 2; Technical aspects (Glyph processing)

A lot of our discussion in Jailbreaking related to technical issues, and this is a – hopefully readable – overview.

PDF is a page description format (does anyone use pages any more, other than publishers and letter writers?) which is designed for sighted humans. At its most basic it transmits a purely visual image of information, which may simply be a bitmap (e.g. a scanned document). That’s currently beyond our ability to automate (but we shall ultimately crack it). More usually it consists of glyphs (the visual representations of characters). All of the following are glyphs for the character “a”.

The minimum that a PDF has to do is to transmit one of these 9 chunks. It can do that by painting black dots (pixels) onto the screen. Humans can make sense of this (they are taught to read), but machines can’t. So it really helps when the publisher adds the codepoint for a character. There’s a standard for this – it’s called Unicode and everyone uses it. Correction: MOST people, but NOT scholarly publishers. Many publishers don’t include codepoints at all but transmit the image of the glyph (sometimes a bitmap, sometimes a set of strokes – a vector/outline font). Here’s a bitmap representation of the first “a”.

You can see it’s made of a few hundred pixels (squares). The computer ONLY knows these are squares. It doesn’t know they are an “a”. We shall crack this in the next few months – it’s called Optical Character Recognition (OCR) and is usually done by machine learning – we’ll pool our resources on this. Most characters in figures are probably bitmapped glyphs, but some are vectors.

In the main text characters SHOULD be represented by a codepoint – “a” is Unicode codepoint 97. (Note that “A” is different, codepoint 65 – I’ll use decimal values.) So does every publisher represent “a” by 97?

Of course not. Publishers’ PDFs are awful and don’t adhere to standards. That’s a really awful problem. Moreover some publishers use 97 to mean α (alpha). Why? Because in some systems there is a symbol font which only has Greek characters, and they use the same numbers.
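The codepoint arithmetic itself is trivial; the danger is a font that draws a different glyph for the same number:

```python
# Unicode fixes 97 as "a" and 65 as "A". A symbol-font PDF that stores 97
# but *draws* a Greek alpha is lying about its content.
assert chr(97) == "a" and ord("a") == 97
assert chr(65) == "A" and ord("A") == 65

# The alpha has its own codepoint and should never be stored as 97.
assert ord("\u03b1") == 945  # GREEK SMALL LETTER ALPHA (U+03B1)
```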

So why don’t publishers fix this? It’s because (a) they don’t care and (b) they can extract more money from academia for fixing it. They probably have the correct codepoint in their XML but they don’t let us have this as they want to charge us extra to read it. (That’s another blog post). Because most publishers use the same typesetters these problems are endemic in the industry. Here’s an example. I’m using BioMedCentral examples because they are Open. I have high praise for BMC but not for their technical processing. (BTW I couldn’t show any of this from Closed publishers as I’d probably be sued).

How many characters are there in this? Unless you read the PDF you don’t know. The “BMC Microbiology” LOGO is actually a set of graphics strokes, and there is no indication that it is meaningful text. But I want to concentrate on the “lambda” in the title. Here is AMI2’s extracted SVG/XML (I have included the preceding “e” of “bacteriophage”):

<text stroke="none" fill="#000000" svgx:fontName="AdvOT46dcae81"

svgx:width="500.0" x="182.691" y="165.703" font-size="23.305"


<text stroke="none" fill="#000000" svgx:fontName="AdvTT3f84ef53"

svgx:width="0.0" x="201.703" y="165.703" font-size="23.305"
Note there is NO explicit space. We have to work it out from the coordinates (182.7 + 0.5*23 << 201.7). But character 108 is an “l” (ell), and so an automatic conversion system creates

This is wrong, unacceptable and potentially highly dangerous – a MU would be converted to an EM, so micrograms could be converted to milligrams.
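The coordinate test can be sketched as a simple gap rule (the 0.3 × font-size threshold is my own illustrative assumption; #AMI2’s actual rule may differ):

```python
def needs_space(prev_x, prev_width, font_size, next_x, threshold=0.3):
    """Decide whether a space fell between two glyphs.

    prev_x     -- x coordinate where the previous glyph starts
    prev_width -- its advance width in 1/1000 font units (per the SVG above)
    font_size  -- current font size in the same coordinate space
    next_x     -- x coordinate where the next glyph starts
    """
    prev_end = prev_x + prev_width / 1000.0 * font_size
    # A gap much bigger than normal inter-glyph spacing implies a space.
    return (next_x - prev_end) > threshold * font_size
```

Plugging in the values from the extract above (the “e” at x=182.691, width 500, font-size 23.305, next glyph at x=201.703) yields a gap of about 7.4 units against a threshold of about 7.0, so a space is inserted.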

All the systems we looked at yesterday made this mistake except #AMI2. So almost all scientific content mining systems will extract incorrect information unless they can correct for this. And there are three ways of doing this:

  • Insisting publishers use Unicode. No hope in hell of that. Publishers (BMC and other OA publishers excluded) in general want to make it as hard as possible to interpret PDFs. So nonstandard PDFs are a sort of DRM. (BTW it would cost a few cents per paper to convert to Unicode – that could be afforded out of the 5500 USD they charge us).
  • Translating the glyphs into Unicode. We are going to have to do this anyway, but it will take a little while.
  • Create lookups for each font. So I have had to create a translation table for the non-standard font AdvTT3f84ef53, which AFAIK no one other than BMC uses and which isn’t documented anywhere. But I will be partially automating this soon, and it’s a finite if soul-destroying task.
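The third option – per-font lookup tables – might look like this in Python (the table encodes the one mapping described above, code 108 rendering as lambda in AdvTT3f84ef53; the structure and names are otherwise illustrative):

```python
# Per-font translation tables: raw character code -> true Unicode codepoint.
FONT_TABLES = {
    # BMC's undocumented font, where 108 ("l") actually renders a lambda.
    "AdvTT3f84ef53": {108: 0x03BB},  # GREEK SMALL LETTER LAMDA, decimal 955
}

def to_unicode(font_name, code):
    """Map a raw character code to its true Unicode character,
    consulting the table for the font when one exists."""
    table = FONT_TABLES.get(font_name, {})
    return chr(table.get(code, code))
```

A fonts community could grow such tables collaboratively; each one only has to be built once per rogue font.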

So AMI2 is able to get:

With the underlying representation of lambda as Unicode 955:

So AMI2 is happy to contribute her translation tables to the Open Jailbreaking community. She’d also like people to contribute, maybe through some #crowdcrafting. It’s pointless for anyone else to do this unless they want to build a standalone competitive system. Because it’s Open they can take AMI2 as long as they acknowledge it in their software. Any system that hopes to do maths is almost certainly going to have to use a translator or OCR.

So glyph processing is the first and essential part of Jailbreaking the PDF.


Jailbreaking the PDF; a wonderful hackathon and a community leap forward for freedom – 1

Yesterday we had a truly marvellous hackathon in Montpellier, in between workshops and the main European Semantic Web Conference. The purpose was to bring together a number of groups who value semantic scholarship and want to free information from the traditional forms of publication. I’ll be blogging later about the legal constraints imposed by the publishing industry, but Jailbreaking is about the technical constraints of publishing information as PDF.

The idea of Jailbreaking was to bring together people who have developed systems, tools, protocols and communities for turning PDF into semantic form. Simply, raw PDF is almost uninterpretable, a bit like binary programs. For about 15 years the spec was not Open, and it was basically a proprietary format from Adobe. The normal way of starting to make any sense of PDF content is to buy tools from companies such as Adobe, and there has been quite a lot of recent advocacy from Adobe staff to consider using PDF as a universal data format. This would be appalling – we must use structured documents for data and text and mixtures. Fortunately there are now a good number of F/OSS tools, my choice being PDFBox, and these volunteers have laboured long and hard in this primitive technology to create interpreters and libraries. PDF can be produced well, but most scholarly publishers’ PDFs are awful.

It’s a big effort to create a PDF2XML system (the end goal). I am credited with the phrase “turning a hamburger into a cow” but it’s someone else’s. If we sat down to plan PDF2XML, we’d conclude it was very daunting. But we have the modern advantage of distributed enthusiasts. Hacking PDF systems by oneself at 0200 in the morning is painful. Hacking PDFs in the company of similar people is wonderful. The first thing is that it lifts the overall burden from you. You don’t have to boil the ocean by yourself. You find that others are working on the same challenge and that’s enormously liberating. They face the same problems and often solve them in different ways or have different priorities. And that’s the first positive takeaway – I am vastly happier and more relaxed. I have friends and the many “I“s are now we. It’s the same liberating feeling as 7 years ago when we created the community for chemistry. Jailbreaking has many of the shared values, though coming from different places.

Until recently most of the tools were closed source, usually for-money though occasionally free-as-in-beer for some uses or communities. I have learnt from bitter experience that you can never build an ongoing system on closed-source components. At some stage they will either be withdrawn, or there will be critical things you want to change or add and that’s simply not possible. And licensing closed source in an open project is a nightmare. It’s an anticommons. So, regretfully, I shall not include Utopia/pdfx from Manchester in my further discussion, because I can’t make any use of it. Some people use its output, and that’s fine – but I would/might want to use some of its libraries.

There was a wonderful coming-together of people with open systems. None of us had the whole picture, but together we covered all of it. Not “my program is better than your program”, but “our tools are better than my system”. So here is a brief overview of the open players who came together (I may miss some individuals; please comment if I have done you an injustice). I’ll explain the technical bits in a later post – here I am discussing the social aspects.

  • LA-PDFText (Gully Burns). Gully was in Los Angeles – in the middle of the night – and showed great stamina. In true hacking spirit I used the time to find out about Gully’s system. I downloaded it and couldn’t get it to install (it needed Java 6). So Gully repackaged it, and within two iterations (an hour) I had it working. That would have taken days conventionally. LA-PDFText is particularly good at discovering blocks (more sophisticated than #AMI2), so maybe I can use it in my work rather than competing.
  • CERMINE. I’ve already blogged about this, but here we had the lead developer, Dominika Tkaczyk, live from Poland. I take comfort from her presence and vice versa. CERMINE integrates text better than #AMI2 at present and has a nice web service.
  • Florida State University. Alexander Garcia, Casey McLaughlin, Leyla Jael Garcia Castro (Biotea), Greg Riccardi and colleagues. They are working on suicide in the context of Veterans’ Administration documents and provided us with an Open corpus of many hundred PDFs. (Some were good, some were really awful.) Alex and Casey ran the workshop with great energy, preparation, food, beer, etc., and arranged the great support from the ABES site.
  • #crowdcrafting. It will become clear that human involvement is necessary in parts of the PDF2XML process – validating our processes, and possibly tweaking final outputs. We connected to Daniel Lombraña González of Crowdcrafting, who took us through the process of building a distributed volunteer community. There was a lot of interest, and we shall be designing clear crowdcrafting-friendly tasks (e.g. “draw a rectangle round the title”, “highlight corrupted characters”, “how many references are there”, etc.).

  • CITALO. This system deduces the type of a citation (reference) from textual analysis. This is a very good example of a downstream application which depends on the XML but is largely independent of how it is created.
  • #AMI2. Our AMI2 system is complementary to many of the others – I am very happy for others to do citation typing or match keywords. AMI2 has several unique features (I’ll explain later), including character identification, graphics extraction (graphics are not images), image extraction, sub- and superscripts, bold and italic. (Most of the other systems ignore graphics completely, and many also ignore bold/italic.)

So we have a wonderful synthesis of people, projects and tools. We all want to collaborate and are all happy to make community success the goal, not individual competition. (And the exciting thing is that it’s publishable and will be heavily cited. We have shown this in the Blue Obelisk publications, where the first has 300 citations, and I’d predict that a coherent Jailbreaking publication would be of great interest.)

So yesterday was a turning point. We have clear trajectories. We have to work to make sure we develop rapidly and efficiently. But we can do this initially in a loose collaboration, later planning meetings and bringing in other collaborators and funding.

So if you are interested in an Open approach to making PDFs Open and semantic, let us know in the comments.


Pre Green-OA Fool’s Gold vs. Post Green-OA Fair Gold

Comment on Richard Poynder’s “The UK’s Open Access Policy: Controversy Continues“:

Yes, the Finch/RCUK policy has had its predictable perverse effects:

1. sustaining arbitrary, bloated Gold OA fees
2. wasting scarce research funds
3. double-paying publishers [subscriptions plus Gold]
4. handing subscription publishers a hybrid-gold-mine
5. enabling hybrid publishers to double-dip
6. abrogating authors’ freedom of journal-choice [economic model/CC-BY instead of quality]
7. imposing re-mix licenses that many authors don’t want and most users and fields don’t need
8. inspiring subscription publishers to adopt and lengthen Green OA embargoes [to maximize hybrid-gold revenues]
9. handicapping Green OA mandates worldwide (by incentivizing embargoes)
10. allowing journal-fleet publishers to confuse and exploit institutions and authors even more

But the solution is also there (as already adopted in Francophone Belgium and proposed by HEFCE for REF):

a. funders and institutions mandate immediate-deposit
b. of the peer-reviewed final draft
c. in the author’s institutional repository
d. immediately upon acceptance for publication
e. whether the journal is subscription or Gold
f. whether access to the deposit is immediate-OA or embargoed
g. whether the license is transferred, retained or CC-BY;
h. institutions implement the repository’s facilitated email eprint request Button;
i. institutions designate immediate-deposit as the mechanism for submitting publications for research performance assessment;
j. institutions monitor and ensure immediate-deposit mandate compliance

This policy restores author choice, moots publisher embargoes, makes Gold and CC-BY completely optional, provides the incentive for author compliance and the natural institutional mechanism for verifying it, consolidates funder and institutional mandates, hastens the natural death of OA embargoes, the onset of universal Green OA, and the resultant institutional subscription cancellations, journal downsizing and transition to Fair-Gold OA at an affordable, sustainable price, paid out of institutional subscription-cancellation savings instead of over-priced, double-paid, double-dipped Fool’s Gold. And of course Fair-Gold OA will license all the re-use rights users need and authors want to allow.

SePublica : Overview of my Polemics presentation #scholrev

This is a list of the points I want to cover when introducing the session on Polemics. A list looks a bit dry but I promise to be polemical. And try to show some demos at the end. The polemics are constructive in that I shall suggest how we can change the #scholpub world by building a better one than the current one.

NOTE: Do not be overwhelmed by the scale of this. Together we can do it.

It is critical we act now

  • Semantics/Mining is now seen as an opportunity by some publishers to “add value” by building walled gardens.
  • Increasing attempts to convince authors to use CC-NC.
  • We must develop semantic resources ahead of this and push the edges

One person can change the world

We must create a coherent community

  • Examples:
    • OpenStreetMap,
    • Wikipedia
    • Galaxyzoo
    • OKFN Crowdcrafting,
    • Blue Obelisk (Chemistry – PMR),
    • ?#scholrev


  • Give power to authors
  • Discover, aggregate and search (“Google for science”)
  • Make the literature computable
  • Enhance readers with semantic aids
  • Smart “invisible” capture of information

Practice before Politics

  • Create compelling examples
  • Add Value
  • Make authors’ lives easier
  • Mine and semanticize current scholarship.

Text Tables Diagrams Data

  • Text (chemistry, species)
  • Tables (Jailbreak corpus)
  • Diagrams chemical spectra, phylogenetic trees
  • Data (output). Quixote

Material to start with

  • Open information (EuropePMC, theses)
  • “data not copyrightable”. Supp data, tables, data-rich diagrams
  • Push the limits of what’s allowed (forgiveness not permission)

Disciplines/artefacts with good effort/return ratio

  • Phylogenetic trees (Ross Mounce + PMR)
  • Nucleic acid sequences
  • Chemical formulae and reactions
  • Regressions and models
  • Clinical/human studies (tables)
  • Dose-response curves

Tools, services, resources

    We need a single-stop location for tools

  • Research-enhancing tools (science equiv of Git/Mercurial). Capture and validate work continuously
  • Common approach to authoring
  • Crawling tools for articles, theses.
  • PDF and Word converters to “XML”
  • Classifiers
  • NLP tools and examples
  • Table hackers
  • Diagram hackers
  • Logfile hackers
  • Semantic repositories
  • Abbreviations and glossaries
  • Dictionaries and dictionary builders


Advocacy, helpers, allies

  • Bodies who may be interested (speculative, I haven’t asked them):
    • Funders of science
    • major Open publishers
    • Funders of social change (Mellon, Sloan, OSF…)
    • SPARC, DOAJ, etc.
    • (Europe)PMC
  • Crowdcrafting (OKF, am involved with this)
  • Wikipedia