[2303.14334] The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces

Abstract:  Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, the need for new technology to support the reading process grows. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has changed little in decades. The PDF format for sharing research papers is widely used due to its portability, but it has significant downsides including: static content, poor accessibility for low-vision readers, and difficulty reading on mobile devices. This paper explores the question “Can recent advances in AI and HCI power intelligent, interactive, and accessible reading interfaces — even for legacy PDFs?” We describe the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Through this project, we’ve developed ten research prototype interfaces and conducted usability studies with more than 300 participants and real-world users showing improved reading experiences for scholars. We’ve also released a production reading interface for research papers that will incorporate the best features as they mature. We structure this paper around challenges scholars and the public face when reading research papers — Discovery, Efficiency, Comprehension, Synthesis, and Accessibility — and present an overview of our progress and remaining open challenges.

 

Introducing the new EPUB reader for e-books at the Library of Congress | The Signal

“The Open Access Books Collection on loc.gov includes approximately 6,000 contemporary open access e-books covering a wide range of subjects, including history, music, poetry, technology, and works of fiction. All books in this collection were published under open access licenses, meaning the e-books are available to use and reuse according to the terms of the licenses. Users can access the e-books in the Open Access Books Collection by reading directly online in a browser or downloading the book as a PDF or EPUB file.

When we first made open access e-books available on loc.gov, titles were available for download in either PDF or EPUB format, but PDF was the only one available for reading directly on the website; loc.gov did not support viewing EPUBs in the browser and they were only available for download. As many books were available in both formats or in PDF only, this ensured most titles were viewable directly on the website. However, we recognized an increase in titles available in EPUB only so we are happy to share the news that an EPUB viewer was launched on loc.gov. The viewer makes EPUBs available for reading on loc.gov and provides a richer interface for users….”

A framework for improving the accessibility of research papers on arXiv.org

Abstract:  The research content hosted by arXiv is not fully accessible to everyone due to disabilities and other barriers. This matters because a significant proportion of people have reading and visual disabilities, it is important to our community that arXiv is as open as possible, and if science is to advance, we need wide and diverse participation. In addition, we have mandates to become accessible, and accessible content benefits everyone. In this paper, we will describe the accessibility problems with research, review current mitigations (and explain why they aren’t sufficient), and share the results of our user research with scientists and accessibility experts. Finally, we will present arXiv’s proposed next step towards more open science: offering HTML alongside existing PDF and TeX formats. An accessible HTML version of this paper is also available at https://info.arxiv.org/about/accessibility_research_report.html 

Access is not the same as accessibility: A framework for making research papers truly open – arXiv.org blog

“arXiv has pioneered open access for more than 30 years by removing financial, institutional, and geographic barriers to research. No paywalls or fees, no login required for reading. This approach – which gives researchers maximum control over the release of their results and broad visibility – transformed the research process and launched the open access movement.

However, access is not the same as accessibility, which is the practice of ensuring access regardless of disability. The vast majority of research papers posted to any journal or platform do not meet basic accessibility standards.

In 2022, arXiv completed intensive user research with over 40 people to determine the extent of the problem, evaluate current mitigation efforts, and consider solutions. This work, informed by arXiv staff, accessibility experts, and arXiv readers and authors who use assistive technology, is posted on arXiv in PDF and HTML formats (arXivID: 2212.07286).

In extensive interviews, our research participants shared that finding research, reading it, preparing documents, and submitting work are all steps in the research process where people encounter barriers. In particular, interpreting math equations, figures, and charts is problematic.

Flexible content can help address these issues. Offering well-formatted HTML, alongside PDF and TeX source, will lead to critical accessibility gains. arXiv’s collaboration with ar5iv, which currently renders HTML for approximately 70% of arXiv papers, is a first step in this process. Next, we expect to reduce the error rate and add a link to HTML on arXiv abstract pages….”

Open Inaccessibility

“When a PDF is downloaded, who can read it?

At the start of the year I discussed the social model of disability and inaccessibility in relation to open scholarship, but since then I have not done much more in a practical sense. Here’s the best explanation of the social model of disability I have seen…

Content inaccessibility came back on my radar again when I read a recent study about content accessibility improvements for arXiv. This paper calls content accessibility “the next frontier of open science.” As we see a simultaneous increase in user-generated content platforms for publishing, where there is less control over what and how things get published, I would agree and argue that accessibility will become a bigger topic quickly.

Some of my main takeaways and juxtapositions from this paper include:

There is clear content inaccessibility: only 30% of people using assistive technologies rate all research as accessible (vs. 59% of people not using assistive technologies).
HTML is preferred for accessibility, but non-disabled people prefer PDFs.
Biggest improvement areas for accessibility are (1) PDF formatting, (2) images (alt texts), (3) math accessibility (e.g., MathML for screenreaders), (4) making data in figures parseable by screen readers.
People who don’t use assistive technologies don’t know what is required of them to make accessible documents.
PDF is often preferred because it is easy/easier to save to reference managers….”
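One of the improvement areas listed above, missing alt texts on images, is straightforward to check once a paper is available as HTML. The following is a minimal sketch using only Python's standard library; the class and function names (`MissingAltFinder`, `images_missing_alt`) are hypothetical and do not come from the study or from arXiv's tooling.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collects the src of every <img> that has no usable alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        alt = attr_map.get("alt")
        # Assumption: a missing or whitespace-only alt is reported.
        # Purely decorative images may legitimately use alt="", so a
        # real audit would review these hits rather than treat them
        # all as errors.
        if alt is None or not alt.strip():
            self.missing.append(attr_map.get("src", "(no src)"))


def images_missing_alt(html: str) -> list[str]:
    """Return the src values of images lacking alt text, in document order."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


if __name__ == "__main__":
    sample = '<p><img src="fig1.png" alt="Loss curve"><img src="fig2.png"></p>'
    print(images_missing_alt(sample))
```

A check like this only confirms that alt text exists; whether the text actually describes the figure still needs a human reviewer.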

Welcome to the Single Source Publishing Community | The Single Source Publishing Community (SSPC) is a network of stakeholders from the Open Science community that are interested in Single Source Publishing (SSP) for scholarly purposes – developing open-source software and advocacy.

“The Single Source Publishing Community (SSPC) is a network of stakeholders from the Open Science community that are interested in Single Source Publishing (SSP) for scholarly purposes – developing open-source software and advocacy.”

The PDF is not enough: why science needs open formats – University Library

“From 2019 to 2021, the Modern Publishing project, part of the Hamburg Open Science (HOS) initiative, bundled many years of experience at the Hamburg University of Technology (TUHH) and the Hamburg State and University Library (SUB). The goal: the development of a socio-technical system for single source publishing, i.e. for generating different output formats from one source format. It was based on open source solutions such as GitLab and Open Journal Systems (OJS) to enable an open alternative approach to the publication of scientific results compared to commercial and proprietary publishing offers….

Former team members of the project have founded the Single Source Publishing Community (SSPC). This focuses on scientific writing and publishing with open tools and formats and is a meeting point for researchers, lecturers, publishers and developers. Under the motto “Collaborate more, compete less”, the active members of the community exchange ideas in their monthly meetings on current developments in their projects and discuss strategies for cultural change in the field of scientific publication….

Numerous open-source tools favor the desired sovereignty: software projects such as Open Journal Systems, Vivliostyle, Paged.js, Swapfire, FidusWriter, HedgeDoc, Quarto and, last but not least, pandoc are combined in different ways in the community projects to create alternative open systems.

Many projects use the Markdown format as a source to generate versions complementary to the PDF in the form of HTML, JATS/XML and EPUB. The latter offer the advantage that they retain the semantic labeling of the information they contain and thus open up a wide range of possible applications in automated text mining processes. At the same time, the usability and reach of published scientific findings increases….”
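The single-source idea described in this excerpt, one Markdown file rendered into several output formats, can be sketched as follows. This is an illustration only, not the SSPC's actual pipeline: the file name `paper.md`, the `build` directory, and the helper `pandoc_commands` are assumptions. The pandoc options used (`--standalone`, `--to`, `--output`) and the writer names `html`, `jats` and `epub` are real, but the script only builds the command lines and does not run pandoc.

```python
from pathlib import Path

# Assumed mapping of output extension to pandoc writer name.
TARGETS = {
    ".html": "html",
    ".xml": "jats",
    ".epub": "epub",
}


def pandoc_commands(source: str, out_dir: str = "build") -> list[list[str]]:
    """Build one pandoc command per target format from a single source file."""
    src = Path(source)
    commands = []
    for ext, writer in TARGETS.items():
        out = Path(out_dir) / (src.stem + ext)
        commands.append([
            "pandoc", src.as_posix(),
            "--standalone",          # emit a complete document, not a fragment
            "--to", writer,
            "--output", out.as_posix(),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run to execute it
    # on a machine where pandoc is installed.
    for cmd in pandoc_commands("paper.md"):
        print(" ".join(cmd))
```

Because every output is regenerated from the same source, a correction made once in the Markdown file propagates to all formats.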

New Leaves: Riffling the History of Digital Pagination

Abstract:  This article presents a new history of digital pagination. Virtual pagination works very differently from its print correlate. Despite this, encapsulated and paginated formats have gained a solid digital foothold. Nonetheless, many commentators have argued that we must overcome such a reliance on and continuity with print in the digital space. This article charts a fresh history of the development of digital pagination through a revisionist interrogation of three interrelated phenomena: 1. That digital pages do not behave as do their physical correlates but instead mimic earlier historical forms of print that fused pagination, scrolling, and the tablet form. 2. That the development of PDF was almost abandoned by Adobe’s board of directors, who could see no audience for it. 3. That there are other more robust lineages of constraint for digital pages from cinema and television. Drawing on new correspondence with the creators of the PDF format I argue from these historical tracings that nothing was sure about the development of textual pagination in the digital space. Further, the digital page almost never came to the prominence and dominance now presumed in discussions of digital reading.

Harmon | ETDplus Toolkit [Tool Review] | Journal of Librarianship and Scholarly Communication

Abstract:  Electronic theses and dissertations (ETDs) have traditionally taken the form of PDFs and ETD programs and their submission and curation procedures have been built around this format. However, graduate students are increasingly creating non-PDF files during their research, and in some cases these files are just as or more important than the PDFs that must be submitted to satisfy degree requirements. As a result, both graduate students and ETD administrators need training and resources to support the handling of a wide variety of complex digital objects. The Educopia Institute’s ETDplus Toolkit provides a highly usable set of modules to address this need, openly licensed to allow for reuse and adaption to a variety of potential use cases.

 

ResearchHub | Open Science Community

“ResearchHub’s mission is to accelerate the pace of scientific research. Our goal is to make a modern mobile and web application where people can collaborate on scientific research in a more efficient way, similar to what GitHub has done for software engineering.

Researchers are able to upload articles (preprint or postprint) in PDF form, summarize the findings of the work in an attached wiki, and discuss the findings in a completely open and accessible forum dedicated solely to the relevant article.

Within ResearchHub, papers are grouped in “Hubs” by area of research. Individual Hubs will essentially act as live journals within focused areas, with highly upvoted posts (i.e., the paper and its associated summary and discussion) moving to the top of each Hub.

To help bring this nascent community together and incentivize contribution to the platform, a new ERC20 token, ResearchCoin (RSC), has been created. Users receive RSC for uploading new content to the platform, as well as for summarizing and discussing research. Rewards for contributions are proportionate to how valuable the community perceives the actions to be – as measured by upvotes.”

 

If It’s Open, Is It Accessible? – Association of Research Libraries

“The library and open access (OA) publishing communities have made great strides in making more new scholarship openly available. But have we included readers with vision challenges in our OA plans? Only an estimated 7% of all printed works are available in accessible format, and that statistic might not significantly differ for digital scholarship worldwide….

Libraries need to consider accessibility of the document format, as well as accessibility of the tools and platforms they typically use for OA journal and monograph publishing, storage, and access. According to a blog post by the UX designer for the Directory of Open Access Journals last year, testing of a platform’s web interface can be done easily through free tools such as Lighthouse and Accessibility Insights for Web, both available as web browser extensions, which test accessibility against the World Wide Web Consortium (W3C) Web Content Accessibility Guidelines (WCAG) 2.1 AA.

Earlier this year, the Open Journal Systems (OJS) team at the Public Knowledge Project noted the strides that their Accessibility Interest Group team has made to improve the accessibility of OJS 3.3. Next up, they will be working on a guide to help journal editors create more accessible content within OJS.

 

This leads to the question of the format of open content. Adobe’s Portable Document Format (PDF), ubiquitous and a de facto standard for digital publishing, is typically not the best format for accessibility. Certainly, PDFs can be made WCAG-compliant, but one must make careful efforts to do so….”

Why most academic journals are following outdated publishing practices

“In his Medium article “Scholarly publishing is stuck in 1999,” Springer Nature product manager Stephen Cornelius reproaches the outdated publishing practices many academic journals are using to produce online content. He notes that, despite decades of technological advancement, “research publishing seems stuck with those that were employed when it first went online.” Cornelius points to many areas of digital journal publishing that have been designed to mirror print publishing, such as journals formatting online articles as print-based PDFs, despite there being better ways to produce and present content online….

PDFs are rife with limitations as compared to HTML because, unlike HTML, PDFs:

Cannot support embedded multi-media research files such as videos
Have a poor layout for online reading, generally using columns that require readers to scroll up and down to read content on the same page
Are nearly impossible to read on mobile devices because PDFs are a static page (whereas HTML can be made to have a responsive design)
Do not easily allow for clickable references within the text
Are overwhelmingly not search-optimized for online browsers…

A recent article in The Atlantic titled “The Scientific Paper Is Obsolete” explores the limitations of PDFs and the need for journals, particularly in STEM fields, to adopt internet-based publishing formats in order to support more dynamic presentations of research as well as to make it easier for readers to find articles online….”