PDF Data Extractor (PDE) – A Free Web Application and R Package Allowing the Extraction of Tables from Portable Document Format (PDF) Files and High-Throughput Keyword Searches of Full-Text Articles | bioRxiv

Abstract:  The PDF Data Extractor (PDE) R package is designed to perform comprehensive literature reviews for scientists at any stage in a user-friendly way. The PDE_analyzer_i() function permits the user to filter and search thousands of scientific articles using a simple user interface, requiring no bioinformatics skills. In the additional PDE_reader_i() interface, the user can then quickly browse the sentences with detected keywords, open the full-text article, when required, and convert tables conveniently from PDF files to Excel sheets (pdf2table). Specific features of the literature analysis include the adaptability of analysis parameters and the detection of abbreviations of search words in articles. In this article, we demonstrate and exemplify how the PDE package allows the user-friendly, efficient, and automated extraction of meta-data from full-text articles, which can aid in summarizing the existing literature on any topic of interest. As such, we recommend the use of the PDE package as the first step in conducting an extensive review of the scientific literature. The PDE package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=PDE.
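The workflow the abstract describes can be sketched in a few lines of R. This is a rough illustration only: the function names (`PDE_analyzer_i()`, `PDE_reader_i()`) are taken from the abstract, and the assumption that they launch their interactive interfaces without required arguments should be checked against the CRAN manual for the installed version.

```r
# Sketch of the PDE workflow described above (assumption: the
# interactive functions open their UIs with no required arguments).
install.packages("PDE")  # from CRAN
library(PDE)

# Step 1: filter and keyword-search a folder of PDF articles
# via the point-and-click interface.
PDE_analyzer_i()

# Step 2: browse sentences containing detected keywords, open
# full texts when needed, and export detected tables (pdf2table).
PDE_reader_i()
```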

 

ETDplus Toolkit [Tool Review]

Abstract:  Electronic theses and dissertations (ETDs) have traditionally taken the form of PDFs, and ETD programs and their submission and curation procedures have been built around this format. However, graduate students are increasingly creating non-PDF files during their research, and in some cases these files are just as important as, or more important than, the PDFs that must be submitted to satisfy degree requirements. As a result, both graduate students and ETD administrators need training and resources to support the handling of a wide variety of complex digital objects. The Educopia Institute’s ETDplus Toolkit provides a highly usable set of modules to address this need, openly licensed to allow for reuse and adaptation to a variety of potential use cases.

 

HighWire at 25: Richard Sever (bioRxiv) looks back – Highwire Press

“10 years later I ended up working at Cold Spring Harbor myself, and continuing my relationship with HighWire from a new perspective. The arXiv preprint server for physics had launched in 1991, and my colleague John Inglis and I had often talked about whether we could do something similar for biology. I remember saying we could put together some of HighWire’s existing components, adapt them in certain ways and build something that would function as a really effective preprint server—and that’s what we did, launching bioRxiv in 2013. It was great then to be able to take that experiment to HighWire meetings to report back on. Initially there was quite a bit of skepticism from the community, who thought there were cultural barriers that meant preprints wouldn’t work well for biology, but 7 years and almost 100,000 papers later it’s still there, and still being served very well by HighWire.

When we launched bioRxiv we made it very explicit that we would not take clinical work, or anything involving patients. But the exponential growth of submissions to bioRxiv demonstrated that there was a demand and a desire for this amongst the biomedical community, and people were beginning to suggest that a similar model be trialed for medicine. A tipping point for me was an OpEd in the New York Times (Don’t Delay News of Medical Breakthroughs, 2015) by Eric Topol (Scripps Research) and Harlan Krumholz (Yale University), who would go on to become a co-founder of medRxiv….”

DAISY Publishes White Paper on the Benefits of EPUB 3 – The DAISY Consortium

“The DAISY Consortium has published a white paper encouraging the use of Born Accessible EPUB 3 files for corporate, government and university publications and documents. This important piece of work recognizes the work of the publishing industry who have embraced EPUB 3 as their format of choice for ebooks and digital publishing and focuses on how this same approach should be used for all types of digital content, both online and offline….”

New business models for the open research agenda | Research Information

“The rise of preprints and the move towards universal open access are potential threats to traditional business models in scholarly publishing, writes Phil Gooch

Publishers have started responding to the latter with transformative agreements[1], but if authors can simply upload their research to a preprint server for immediate dissemination, comment and review, why submit to a traditional journal at all? Some journals are addressing this by offering authors frictionless submission direct from the preprint server. This tackles two problems at once: easing authors’ frustrations with existing journal submission systems[2], and providing a more direct route from the raw preprint to the richly linked, multiformat version of record that readers demand and accessibility standards require….

Dissemination of early-stage research as mobile-unfriendly PDF is arguably a technological step backwards. If preprints are here to stay, the reading experience needs to be improved. A number of vendors have developed native XML or LaTeX authoring environments which enable dissemination in richer formats….”

The Push to Replace Journal Supplements with Repositories | The Scientist Magazine®

“But it’s not just broken hyperlinks that frustrate scientists. As papers get more data-intensive and complex, supplementary files often become many times longer than the manuscript itself—in some extreme cases, ballooning to more than 100 pages. Because these files are typically published as PDFs, they can be a pain to navigate, so even if they are available, the information within them can get overlooked. “Most supplementary materials are just one big block and not very useful,” Cooper says.

Another issue is that these files are home to most of a study’s published data, and “you can’t extract data from PDFs except using complex software—and it’s a slow process that has errors,” Murray-Rust tells The Scientist. “This data is often deposited as a token of depositing data, rather than people actually wanting to reuse it.”…

Depositing material that would end up in supplementary files in places other than the journal is becoming an increasingly common practice. Some academics opt to post this information on their own websites, but many others are turning to online repositories offered by universities, research institutions, and companies. …

There are advantages these repositories provide over journal articles, according to Holt. For one, repositories offer the ability to better store and interact with large amounts of openly accessible data than journals typically do. In addition, repositories’ files are labelled with a digital object identifier (DOI), meaning researchers can easily link to it from a published article and make sure to get credit for their work….”

The Scientific Paper Is Obsolete. Here’s What’s Next. – The Atlantic

“Perhaps the paper itself is to blame. Scientific methods evolve now at the speed of software; the skill most in demand among physicists, biologists, chemists, geologists, even anthropologists and research psychologists, is facility with programming languages and “data science” packages. And yet the basic means of communicating scientific results hasn’t changed for 400 years. Papers may be posted online, but they’re still text and pictures on a page.

What would you get if you designed the scientific paper from scratch today? …

Software is a dynamic medium; paper isn’t. When you think in those terms it does seem strange that research like Strogatz’s, the study of dynamical systems, is so often being shared on paper …

I spoke to Theodore Gray, who has since left Wolfram Research to become a full-time writer. He said that his work on the notebook was in part motivated by the feeling, well formed already by the early 1990s, “that obviously all scientific communication, all technical papers that involve any sort of data or mathematics or modeling or graphs or plots or anything like that, obviously don’t belong on paper. That was just completely obvious in, let’s say, 1990,” he said. …”


Time for accessible journals | Research Information

“The case for making publications accessible is so obvious and has been made so often that I won’t waste time here setting out those arguments. You know that accessibility is the right thing to do.

What you may not know is that making a publication accessible has recently become a whole lot more straightforward – and that your publications today are closer to being made properly accessible – than you realise….”

Open and Shut?: Realising the BOAI vision: Peter Suber’s Advice

Peter Suber’s current high-priority recommendations for advancing open access.

Peter Suber, Half a dozen reasons why I support the Jussieu Call for Open Science and Bibl…

“1. I support its call to move beyond PDFs. This is necessary to bypass publisher locks and facilitate reuse, text mining, access by the visually impaired, and access in bandwidth-poor parts of the world. 

2. I applaud its recognition of no-fee or no-APC open-access journals, their existence, their value, and the fact that a significant number of authors will always depend on them. 

3. I join its call for redirecting funds now spent on subscription journals to support OA alternatives. 

4. I endorse its call to reform methods of research evaluation. If we want to assess quality, we must stop assuming that impact and prestige are good proxies for quality. If we want to assess impact, we must stop using metrics that measure it badly and create perverse incentives to put prestige ahead of both quality and access.

5. I support its call for infrastructures that are proof against privatization. No matter how good proprietary and closed-source platforms may initially be, they are subject to acquisition and harmful mutation beyond the control of the non-profit academic world. Even without acquisition, their commitment to OA is contingent on the market, and they carry a permanent risk of trapping rather than liberating knowledge. The research community cannot afford to entrust its research to platforms carrying that risk. 

6. Finally I support what it terms bibliodiversity. While we must steer clear of closed-source infrastructure, subject to privatization and enclosure, we must also steer clear of platform monocultures, subject to rigidity, stagnation, and breakage. Again, no matter how good a monoculture platform may initially be, in the long run it cannot be better than an ecosystem of free and open-source, interoperable components, compliant with open standards, offering robustness, modularity, flexibility, freedom to create better modules without rewriting the whole system, freedom to pick modules that best meet local needs, and freedom to scale up to meet global needs without first overcoming centralized constraints or unresponsive decision-makers. …”

Kopernio | One-click access to PDF articles

“Fast, one-click access to millions of research papers….One-click access to PDFs. No more VPNs, login forms, redirects, frantic Googling and chasing broken links….Jump over paywalls. Automatically search university library subscriptions, pre-print servers, institutional repositories and private blogs for free PDFs….Take your university library with you wherever you go; at home, at conferences, on the beach….Kopernio automagically files away the PDFs you read in your own private Kopernio locker. Come back and read them again later, anywhere, anytime….”

Science Beam – using computer vision to extract PDF data | Labs | eLife

“There’s a vast trove of science out there locked inside the PDF format. From preprints to peer-reviewed literature and historical research, millions of scientific manuscripts today can only be found in a print-era format that is effectively inaccessible to the web of interconnected online services and APIs that are increasingly becoming the digital scaffold of today’s research infrastructure….

Extracting key information from PDF files isn’t trivial. …It would therefore certainly be useful to be able to extract all key data from manuscript PDFs and store it in a more accessible, more reusable format such as XML (of the publishing industry standard JATS variety or otherwise). This would allow for the flexible conversion of the original manuscript into different forms, from mobile-friendly layouts to enhanced views like eLife’s side-by-side view (through eLife Lens). It will also make the research mineable and API-accessible to any number of tools, services and applications. From advanced search tools to the contextual presentation of semantic tags based on users’ interests, and from cross-domain mash-ups showing correlations between different papers to novel applications like ScienceFair, a move away from PDF and toward a more open and flexible format like XML would unlock a multitude of use cases for the discovery and reuse of existing research….

We are embarking on a project to build on these existing open-source tools, and to improve the accuracy of the XML output. One aim of the project is to combine some of the existing tools in a modular PDF-to-XML conversion pipeline that achieves a better overall conversion result compared to using individual tools on their own. In addition, we are experimenting with a different approach to the problem: using computer vision to identify key components of the scientific manuscript in PDF format….To this end, we will be collaborating with other publishers to collate a broad corpus of valid PDF/XML pairs to help train and test our neural networks….”

Release ‘open’ data from their PDF prisons using tabulizer | R-bloggers

“As a political scientist who regularly encounters so-called “open data” in PDFs, this problem is particularly irritating. PDFs may have “portable” in their name, making them display consistently on various platforms, but that portability means any information contained in a PDF is irritatingly difficult to extract computationally.”
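The computational extraction the post refers to looks roughly like the following in R. A minimal sketch: `extract_tables()` is tabulizer’s main entry point, `"report.pdf"` is a placeholder path, and the `output` argument is assumed to behave as in the package documentation.

```r
# Hedged sketch of pulling tables out of a PDF with tabulizer.
# "report.pdf" is a hypothetical input file.
library(tabulizer)

# Return every detected table as a data frame
# (the default output is a list of character matrices).
tabs <- extract_tables("report.pdf", output = "data.frame")

str(tabs)  # one data frame per table found in the document
```

Note that extraction quality depends heavily on the PDF’s internal layout, which is exactly the fragility the quoted post is complaining about.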

Ebooks, Innovation, and the Rebel Within – The Scholarly Kitchen

“As with all good innovators, Peter [Krautzberger, project lead for MathJax] is frustrated. He feels, for example, that advocates of open science focus heavily on sharing of supposedly neutral data, but are still not able to see beyond the PDF. For him open science should be more about how the Web can facilitate communications….”

open science – Why don’t publication venues systematically make the LaTeX source of papers available? – Academia Stack Exchange

“I wonder why most publication venues don’t systematically make the LaTeX source for published papers available? (which implies systematically asking authors for the LaTeX source)

LaTeX sources are more machine-readable than PDFs, and make it easier for humans to reuse parts of them (e.g. math equations or figures), amongst other advantages. I fail to see any downside….”