Guest post: a technical update from our development team – News Service

“Here are some major bits of work that we have carried out:

Enhancements to our historical data management system. We track all changes to the body of publicly available objects (Journals and Articles) and we have a better process for handling that.
Introduced a more advanced testing framework for the source code. As DOAJ gains more features, the code becomes larger and more complex. To ensure that it is properly tested before going into production, we have started to use parameterised testing on the core components. This allows us to carry out broader and deeper testing to ensure the system is defect free.
A weekly data dump of the entire public dataset (Journals and Articles) which is freely downloadable.
A major data cleanup on articles: a few tens of thousands of duplicates, from historical data or sneaking in through validation loopholes, were identified and removed. We closed the loopholes and cleaned up the data.
A complete new hardware infrastructure, using Cloudflare. This resulted in the significant increase in stability mentioned above and allows us to cope with our growing data set (increasing at a rate of around 750,000 records per year at this point).

And here are some projects we have been working on which you will see come into effect over the next few weeks:

A completely new search front-end. It looks very similar to the old one, but with some major improvements under-the-hood (more powerful, more responsive, more accessible), and gives us the capability to build better, cooler interfaces in the future.
Support for Crossref XML as an article upload format. In the future this may also be extended to the API and we may also integrate directly with Crossref to harvest articles for you. We support the current Crossref schema (4.7) and we will be supporting new versions as they come along….”
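
The parameterised testing mentioned in the excerpt above can be illustrated with a minimal sketch. This is not DOAJ's code: the helper `normalise_issn` and the case table are hypothetical, and the standard-library `unittest.subTest` is used here as an assumption about one way to run the same test body over many inputs.

```python
import unittest


def normalise_issn(raw):
    """Hypothetical helper: return an ISSN as 'NNNN-NNNC', or None if malformed.

    Only the shape is checked (8 characters, last may be 'X'); the check
    digit is deliberately not verified in this sketch.
    """
    cleaned = raw.strip().upper().replace("-", "")
    if len(cleaned) != 8 or not cleaned[:7].isdigit():
        return None
    if not (cleaned[7].isdigit() or cleaned[7] == "X"):
        return None
    return f"{cleaned[:4]}-{cleaned[4:]}"


# One table of (input, expected) pairs drives a single test body.
CASES = [
    ("1234-5678", "1234-5678"),
    ("12345678", "1234-5678"),
    (" 2049-363x ", "2049-363X"),
    ("1234-567", None),
    ("abcd-efgh", None),
]


class NormaliseIssnTest(unittest.TestCase):
    def test_cases(self):
        for raw, expected in CASES:
            # subTest reports each failing case separately instead of
            # stopping at the first one.
            with self.subTest(raw=raw):
                self.assertEqual(normalise_issn(raw), expected)
```

Saved to a file, this can be run with `python -m unittest`. Adding coverage then means adding a row to `CASES` rather than writing a new test function.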

River Valley launch publishing platform, completing their end-to-end publishing solution – River Valley Technologies

“River Valley Technologies announce the launch of RVHost, the innovative content hosting platform, and the final component of their XML-based end-to-end scholarly publishing solution.  

The platform gives unprecedented control to publishers over their content, thus allowing them to create a brand new journal within minutes, or to schedule any publication for a precise date and time, e.g. timed with a press release. RVHost delivers full analytics, including an innovative graphical “history” of a publication, together with any citations. The system is fully customisable, multi-lingual and content agnostic, allowing it to host any form of data or multimedia.

RVHost will be launched in partnership with forward-looking STM journal, GigaScience, which will be using River Valley’s complete end-to-end publishing system. …

Kaveh Bazargan added: “RVHost fully supports Open Access publishing, ensuring compliance with Plan S. We ask all participating journals to adopt the COPE/DOAJ principles of transparency to ensure ethical publishing.” …”

Publishing at the speed of research | EurekAlert! Science News

“Today, the open-access, open-data journal GigaScience and the technology and publishing services company River Valley Technologies announce a new partnership to deliver a research publishing process that is extremely rapid, low-cost, and modular. As a pioneer of open data and open science publishing, GigaScience brings editorial expertise in publishing research that includes all components of the research process: data, source code, workflows, and more. River Valley Technologies, with 30 years of expertise in publishing production, delivers an end-to-end publishing solution, including manuscript submission, content management and hosting, using its collaborative online platforms. The collaboration is developing a publishing process that, in addition to providing on-the-fly article production, will create more interactive articles that can be versioned and forked….”

Open Journal Systems (OJS) sets new standards to achieve OpenAIRE compliance with JATS – OpenAIRE Blogs

“Open Journal Systems (OJS, https://pkp.sfu.ca/ojs/) is an open source journal management and publishing system, developed by the Public Knowledge Project (PKP, https://pkp.sfu.ca/). Around 10,000 journals worldwide and over a thousand journals published in Europe use Open Journal Systems. The latest major version OJS 3 was released in 2016, and since then hundreds of OJS journals have upgraded including large national journal platforms like Tidsskrift.dk and Journal.fi. Therefore, it is important to help the growing number of OJS 3 journals to become compliant with the OpenAIRE infrastructure in terms of comprehensive metadata descriptions of open access articles on research in Europe and beyond….”

PRESS RELEASE: Researchers Respond to Implementation of Plan S | Eurodoc

“A joint response to the implementation guidance for Plan S has today been issued by three organisations representing early-career and senior researchers in Europe. The response by the European Council of Doctoral Candidates and Junior Researchers (Eurodoc), the Marie Curie Alumni Association (MCAA), and the Young Academy of Europe (YAE) offers concrete recommendations on the proposed guidance for implementing Open Access via Plan S.

Our three organisations represent a broad spectrum of researchers in Europe: Eurodoc represents 100,000+ doctoral candidates and postdoctoral researchers from 29 national associations across Europe; MCAA has 10,000+ members who are alumni fellows of the Marie Skłodowska-Curie Actions (MSCA); YAE consists of 200+ outstanding and recognised researchers in Europe. We all strongly support the main goals of Open Science and Plan S.

The joint response builds upon previous recommendations by our organisations on the principles of Plan S and aims to ensure its realistic implementation from the perspective of European researchers. Eurodoc President Gareth O’Neill: “Plan S has shaken the academic community awake and created a lively discussion on Open Access publishing. cOAlition S has addressed some key concerns from researchers in the technical guidance but still leaves other issues open and sets too strict standards for the desired broad adoption of Plan S.” …”

Joint Statement on Implementation Guidance for Plan S

“Plan S is an initiative by cOAlition S to achieve full and immediate Open Access to scientific publications after 01 January 2020 in Europe. At the heart of the plan are 10 principles currently being developed into a set of implementation guidelines. We, representatives of early-career and senior researchers across Europe, have already commented on Plan S and hereby reaffirm our general support and offer our views on the implementation guidance….”

Martin Paul Eve, Response to cOAlition S Implementation Guidelines

“I write to provide feedback in an individual capacity on the Plan S implementation guidelines. I am extremely supportive of the cOAlition’s goals and Plan S in general. I disagree with those who say that the timeline is too short; many of these actors have not taken the opportunities over the last decade to experiment with open access or new business models and have only begun dialogue under the threat of immediate action. That said, I welcome the recent engagement by the Wellcome Trust and UKRI to speak with Learned Societies and to evaluate routes to their transition to Plan S compliance. Developing alternative revenue streams to support the activities of these bodies is not a small task, but it is crucial for the wellbeing of these disciplines, and for open access to prosper. There are a few areas where the document could provide greater clarity….”

My Draft Plan S Implementation Guidance Feedback | Martin Paul Eve | Professor of Literature, Technology and Publishing

“I write to provide feedback in an individual capacity on the Plan S implementation guidelines.

I am extremely supportive of the cOAlition’s goals and Plan S in general. I disagree with those who say that the timeline is too short; many of these actors have not taken the opportunities over the last decade to experiment with open access or new business models and have only begun dialogue under the threat of immediate action. That said, I welcome the recent engagement by the Wellcome Trust and UKRI to speak with Learned Societies and to evaluate routes to their transition to Plan S compliance. Developing alternative revenue streams to support the activities of these bodies is not a small task, but it is crucial for the wellbeing of these disciplines, and for open access to prosper.

There are a few areas where the document could provide greater clarity….”

rOpenSci | pubchunks: extract parts of scholarly XML articles

“The goal of pubchunks is to fetch sections out of XML format scholarly articles. Users do not need to know about XML and all of its warts. They only need to know where their files or XML strings are and what sections they want of each article. Then the user can combine these sections and do whatever they wish downstream; for example, analysis of the text structure or a meta-analysis combining p-values or other data.

The other major format, and more common format, that articles come in is PDF. However, PDF has no structure other than perhaps separate pages, so it’s not really possible to easily extract specific sections of an article. Some publishers provide absolutely no XML versions (cough, Wiley) while others that do a good job of this are almost entirely paywalled (cough, Elsevier). There are some open access publishers that do provide XML (PLOS, Pensoft, Hindawi) – so you have the best of both worlds with those publishers….”
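
The idea of pulling named sections out of article XML can be sketched in a few lines. pubchunks itself is an R package, and the following is not its API: it is a hypothetical Python illustration that assumes JATS-style element names (`article-title`, `abstract`, `body`) and uses only the standard library.

```python
import xml.etree.ElementTree as ET

# A tiny JATS-like document, invented for this example.
SAMPLE = """<article>
  <front><article-meta>
    <title-group><article-title>An example article</article-title></title-group>
    <abstract><p>Short summary.</p></abstract>
  </article-meta></front>
  <body>
    <sec><title>Methods</title><p>We did things.</p></sec>
  </body>
</article>"""

# Assumed mapping from a section name to where it lives in the XML tree.
SECTION_PATHS = {
    "title": ".//article-title",
    "abstract": ".//abstract",
    "body": ".//body",
}


def extract_sections(xml_text, wanted):
    """Return {section name: plain text or None} for the requested sections."""
    root = ET.fromstring(xml_text)
    out = {}
    for name in wanted:
        node = root.find(SECTION_PATHS[name])
        if node is None:
            out[name] = None
        else:
            # Join all text fragments, then collapse runs of whitespace.
            out[name] = " ".join(" ".join(node.itertext()).split())
    return out


sections = extract_sections(SAMPLE, ["title", "abstract", "body"])
print(sections["title"])
print(sections["abstract"])
```

The caller only names the sections it wants; the path table hides the XML structure, which is the convenience the excerpt describes. Real publisher XML varies, so the paths above would need adjusting per source.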

Science Beam – using computer vision to extract PDF data | Labs | eLife

“There’s a vast trove of science out there locked inside the PDF format. From preprints to peer-reviewed literature and historical research, millions of scientific manuscripts today can only be found in a print-era format that is effectively inaccessible to the web of interconnected online services and APIs that are increasingly becoming the digital scaffold of today’s research infrastructure….Extracting key information from PDF files isn’t trivial. …It would therefore certainly be useful to be able to extract all key data from manuscript PDFs and store it in a more accessible, more reusable format such as XML (of the publishing industry standard JATS variety or otherwise). This would allow for the flexible conversion of the original manuscript into different forms, from mobile-friendly layouts to enhanced views like eLife’s side-by-side view (through eLife Lens). It will also make the research mineable and API-accessible to any number of tools, services and applications. From advanced search tools to the contextual presentation of semantic tags based on users’ interests, and from cross-domain mash-ups showing correlations between different papers to novel applications like ScienceFair, a move away from PDF and toward a more open and flexible format like XML would unlock a multitude of use cases for the discovery and reuse of existing research….We are embarking on a project to build on these existing open-source tools, and to improve the accuracy of the XML output. One aim of the project is to combine some of the existing tools in a modular PDF-to-XML conversion pipeline that achieves a better overall conversion result compared to using individual tools on their own. 
In addition, we are experimenting with a different approach to the problem: using computer vision to identify key components of the scientific manuscript in PDF format….To this end, we will be collaborating with other publishers to collate a broad corpus of valid PDF/XML pairs to help train and test our neural networks….”

Panelists discuss JATS for Reuse for OASPA Webinar – OASPA

“JATS4R (JATS for Reuse) is an inclusive group of publishers, vendors, and other interested organisations who use the NISO Journal Article Tag Suite (JATS) XML standard. On Monday 13th March, 2017, OASPA hosted a webinar on the history, goals and recent work of JATS4R and the importance of participation and outreach around JATS4R, and provided a platform for discussions on how the initiative can be advanced in the future.”

NIH Manuscript Collection Optimized for Text-Mining and More

“NIH-supported scientists have made over 300,000 author manuscripts available on PubMedCentral (PMC) since 2008. Now, NIH is making these papers accessible to the public in a format that will allow robust text analyses.

You can download the entire PMC collection of NIH-supported author manuscripts as a package in either XML or plain text formats….”

Inconsistent XML as a Barrier to Reuse of Open Access Content – Journal Article Tag Suite Conference (JATS-Con) Proceedings 2013 – NCBI Bookshelf

Abstract: In this paper, we will describe the current state of some of the tagging of articles within the PMC Open Access subset. As a case study, we will use our experiences developing the Open Access Media Importer, a tool to harvest content from the OA subset for automated upload to Wikimedia Commons.

Tagging inconsistencies stretch across several aspects of the articles, ranging from licensing to keywords to the media types of supplementary materials. While all of these complicate large-scale reuse, the unclear licensing statements had the greatest impact, requiring us to implement text mining-like algorithms in order to accurately determine whether or not specific content was compatible with reuse on Wikimedia Commons.
Besides presenting examples of incorrectly tagged XML from a range of publishers, we will also explore past and current efforts towards standardization of license tagging, and we will describe a set of recommendations related to tagging practices of certain data, to ensure that it is both compatible with existing standards, and consistent and machine-readable.
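
To make the licensing problem in the abstract above concrete, here is a minimal sketch of why inconsistent tagging forces a text-mining fallback. It is not the Open Access Media Importer's code. It assumes JATS-style `<license>` elements that may carry the licence URL as an `xlink:href` attribute, in a nested `<ext-link>`, or only in free text, and the function name and regular expression are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes as "{namespace-uri}name".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Assumed pattern for Creative Commons licence URLs found in free text.
CC_URL = re.compile(r"https?://creativecommons\.org/licenses/[a-z\-]+/\d\.\d/?")


def find_license_url(xml_text):
    """Return the first licence URL found, trying structured tags first."""
    root = ET.fromstring(xml_text)
    for lic in root.iter("license"):
        # 1. Machine-readable: URL on the <license> element itself.
        href = lic.get(XLINK_HREF)
        if href:
            return href
        # 2. Semi-structured: URL on an <ext-link> inside the licence text.
        for link in lic.iter("ext-link"):
            if link.get(XLINK_HREF):
                return link.get(XLINK_HREF)
        # 3. Fallback: scan the prose for something that looks like a CC URL.
        match = CC_URL.search("".join(lic.itertext()))
        if match:
            return match.group(0)
    return None


WELL_TAGGED = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink"><permissions>'
    '<license xlink:href="https://creativecommons.org/licenses/by/4.0/">'
    "<license-p>Open access.</license-p></license></permissions></article>"
)
TEXT_ONLY = (
    "<article><permissions><license><license-p>Distributed under "
    "http://creativecommons.org/licenses/by/3.0/ terms.</license-p>"
    "</license></permissions></article>"
)

print(find_license_url(WELL_TAGGED))
print(find_license_url(TEXT_ONLY))
```

Only the first branch is reliable for machines. The other two exist because the same information is tagged differently from publisher to publisher, which is the barrier to reuse the paper describes.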