[2303.14334] The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces

Abstract:  Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, the need for new technology to support the reading process grows. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has changed little in decades. The PDF format for sharing research papers is widely used due to its portability, but it has significant downsides including: static content, poor accessibility for low-vision readers, and difficulty reading on mobile devices. This paper explores the question “Can recent advances in AI and HCI power intelligent, interactive, and accessible reading interfaces — even for legacy PDFs?” We describe the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Through this project, we’ve developed ten research prototype interfaces and conducted usability studies with more than 300 participants and real-world users showing improved reading experiences for scholars. We’ve also released a production reading interface for research papers that will incorporate the best features as they mature. We structure this paper around challenges scholars and the public face when reading research papers — Discovery, Efficiency, Comprehension, Synthesis, and Accessibility — and present an overview of our progress and remaining open challenges.

 

Single Source Publishing and HTML for Publishers

Definition: Single Source Publishing (SSP) is an approach used by technical publishing systems that focuses on using one source file, shared across content creation and production stages.

In the world of publishing, content creation and production are often disconnected processes. Content creation happens in isolation from the production phases, and the technical systems and file formats used in each stage are often completely separate.

Single Source Publishing (SSP) utilizes a single source file throughout the content creation and production phases.
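
To make the idea concrete, here is a minimal sketch in Python. It assumes a hypothetical toy source format (a title plus a list of paragraphs) and two hypothetical renderers, `render_html` and `render_text`; none of these names come from a particular publishing system. The point is only that every output is derived from the same source object, so an edit to the source reaches all formats.

```python
from __future__ import annotations

from dataclasses import dataclass
from html import escape


@dataclass
class Document:
    """Hypothetical single source: one title and a list of paragraphs."""
    title: str
    paragraphs: list[str]


def render_html(doc: Document) -> str:
    """Derive an HTML fragment from the single source."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in doc.paragraphs)
    return f"<h1>{escape(doc.title)}</h1>\n{body}"


def render_text(doc: Document) -> str:
    """Derive a plain-text rendition from the same source."""
    underline = "=" * len(doc.title)
    return "\n\n".join([f"{doc.title}\n{underline}", *doc.paragraphs])


# One source object; both outputs are generated from it, never edited by hand.
source = Document(
    title="Single Source Publishing",
    paragraphs=["One source file feeds every output format."],
)

html_output = render_html(source)
text_output = render_text(source)
```

In a real SSP workflow the source would typically be a structured file such as XML or Markdown rather than an in-memory object, but the one-source, many-outputs relationship is the same.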

 

We need a plan D | Nature Methods

“Ensuring data are archived and open thus seems a no-brainer. Several funders and journals now require authors to make their data public, and a recent White House mandate that data from federally funded research must be made available immediately on publication is a welcome stimulus. Various data repositories exist to support these requirements, and journals and preprint servers also provide storage options. Consequently, publications now often include various accession numbers, stand-alone data citations and/or supplementary files.

But as the director of the National Library of Medicine, Patti Brennan, once noted, “data are like pictures of children: the people who created them think they’re beautiful, but they’re not always useful”. So, although the above trends are to be applauded, we should think carefully about that word ‘useful’ and ask what exactly we mean by ‘the data’, how and where they should be archived, and whether some data should be kept at all….

Researchers, institutions and funders should collaborate to develop an overarching strategy for data preservation — a plan D. There will doubtless be calls for a ‘PubMed Central for data’. But what we really need is a federated system of repositories with functionality tailored to the information that they archive. This will require domain experts to agree standards for different types of data from different fields: what should be archived and when, which format, where, and for how long. We can learn from the genomics, structural biology and astronomy communities, and funding agencies should cooperate to define subdisciplines and establish surveys of them to ensure comprehensive coverage of the data landscape, from astronomy to zoology….”

Data sharing is the future | Nature Methods

“In late 2022, the US government mandated open-access publication of scholarly research and free and immediate sharing of data underlying those publications for federally funded research beginning no later than 2025. For some fields the necessary standards and infrastructure are largely in place to support these policies. For others, however, many questions remain as to how these mandates can best be met.

In this issue, we feature a Correspondence from Richard Sever that was inspired by the government mandate and the increasing demand for open science. In it, he raises important topics, including deciding which data must be shared, standardizing file formats and developing community guidelines. He also calls for a “federated system of repositories with functionality tailored to the information that they archive,” to meet the needs of many distinct fields….”

Guest Post – Advancing Accessibility in Scholarly Publishing: Recommendations for Digital Accessibility Best Practices – The Scholarly Kitchen

“For publishers, it is also important to be up to date with the following:

PDF/UA (PDF/Universal Accessibility), formally known as ISO 14289-1:2014 (Document management applications — Electronic document file format enhancement for accessibility), is an International Organization for Standardization (ISO) standard for accessible PDF technology. PDF/UA complements WCAG 2.0 and should be used to make PDF files that also conform with WCAG 2.0.
EPUB Accessibility 1.0 Specifications: The EPUB Accessibility 1.0 [now EPUB Accessibility 1.1: Conformance and Discoverability Requirements for EPUB Publications] specification specifies content conformance requirements for verifying the accessibility of EPUB publications, as well as accessibility metadata requirements for the discoverability of EPUB publications.
The Marrakesh Treaty to Facilitate Access to Published Works for Persons Who Are Blind, Visually Impaired, or Otherwise Print Disabled (MVT). “The treaty allows for copyright exceptions to facilitate the creation of accessible versions of books and other copyrighted works for visually impaired persons. It sets a norm for countries ratifying the treaty to have a domestic copyright exception covering these activities and allowing for the import and export of such materials.”…”

Introducing the new EPUB reader for e-books at the Library of Congress | The Signal

“The Open Access Books Collection on loc.gov includes approximately 6,000 contemporary open access e-books covering a wide range of subjects, including history, music, poetry, technology, and works of fiction. All books in this collection were published under open access licenses, meaning the e-books are available to use and reuse according to the terms of the licenses. Users can access the e-books in the Open Access Books Collection by reading directly online in a browser or downloading the book as a PDF or EPUB file.

When we first made open access e-books available on loc.gov, titles were available for download in either PDF or EPUB format, but PDF was the only one available for reading directly on the website; loc.gov did not support viewing EPUBs in the browser and they were only available for download. As many books were available in both formats or in PDF only, this ensured most titles were viewable directly on the website. However, we recognized an increase in titles available in EPUB only so we are happy to share the news that an EPUB viewer was launched on loc.gov. The viewer makes EPUBs available for reading on loc.gov and provides a richer interface for users….”

A framework for improving the accessibility of research papers on arXiv.org

Abstract:  The research content hosted by arXiv is not fully accessible to everyone due to disabilities and other barriers. This matters because a significant proportion of people have reading and visual disabilities, it is important to our community that arXiv is as open as possible, and if science is to advance, we need wide and diverse participation. In addition, we have mandates to become accessible, and accessible content benefits everyone. In this paper, we will describe the accessibility problems with research, review current mitigations (and explain why they aren’t sufficient), and share the results of our user research with scientists and accessibility experts. Finally, we will present arXiv’s proposed next step towards more open science: offering HTML alongside existing PDF and TeX formats. An accessible HTML version of this paper is also available at https://info.arxiv.org/about/accessibility_research_report.html 

Access is not the same as accessibility: A framework for making research papers truly open – arXiv.org blog

“arXiv has pioneered open access for more than 30 years by removing financial, institutional, and geographic barriers to research. No paywalls or fees, no login required for reading. This approach – which gives researchers maximum control over the release of their results and broad visibility – transformed the research process and launched the open access movement.

However, access is not the same as accessibility, which is the practice of ensuring access regardless of disability. The vast majority of research papers posted to any journal or platform do not meet basic accessibility standards.

In 2022, arXiv completed intensive user research with over 40 people to determine the extent of the problem, evaluate current mitigation efforts, and consider solutions. This work, informed by arXiv staff, accessibility experts, and arXiv readers and authors who use assistive technology, is posted on arXiv in PDF and HTML formats (arXivID: 2212.07286).

In extensive interviews, our research participants shared that finding research, reading it, preparing documents, and submitting work are all steps in the research process where people encounter barriers. In particular, interpreting math equations, figures, and charts is problematic.

Flexible content can help address these issues. Offering well-formatted HTML, alongside PDF and TeX source, will lead to critical accessibility gains. arXiv’s collaboration with ar5iv, which currently renders HTML for approximately 70% of arXiv papers, is a first step in this process. Next, we expect to reduce the error rate and add a link to HTML on arXiv abstract pages….”

When XML Marks the Spot: Machine-readable journal articles for discovery and preservation

“If you work with a campus-based journal program and you’re looking to expand the readership and reputation of the articles you publish, adding them to relevant archives and indexes (A&Is) presents a treasure trove of opportunities. A&Is serve as valuable content distribution networks, and inclusion in selective ones is a signal of research quality. You may have heard about XML, one of the primary machine-readable formats academic databases use to ingest content, and wonder if that’s something you need to reach your archiving and indexing goals.

This free webinar, co-hosted by Scholastica, UOregon Libraries, and the GWU Masters in Publishing program, will offer a crash course in the benefits of XML production and use cases, including:

What XML is and the different types required or preferred by academic indexes and archives (with an overview of JATS)
How producing metadata and/or full-text articles in XML can unlock discovery and archiving opportunities with examples
Additional benefits of XML for journal accessibility as well as publishing program and professional development
When XML is needed and when it may not be the best use of journal resources
Ways you can produce XML, including an overview of Scholastica’s production service…”

Open Inaccessibility

“When a PDF is downloaded, who can read it?

At the start of the year I discussed the social model of disability and inaccessibility in relation to open scholarship, but since then I have not done much more in a practical sense. Here’s the best explanation of the social model of disability I have seen…

Content inaccessibility came back on my radar again when I read a recent study about content accessibility improvements for arXiv. This paper calls content accessibility “the next frontier of open science.” As we see a simultaneous increase in user-generated content platforms for publishing, where there is less control over what and how things get published, I would agree and argue that accessibility will become a bigger topic quickly.

Some of my main takeaways and juxtapositions from this paper include:

There is clear content inaccessibility: only 30% of people using assistive technologies rate all research as accessible (vs. 59% of people not using assistive technologies).
HTML is preferred for accessibility, but non-disabled people prefer PDFs.
Biggest improvement areas for accessibility are (1) PDF formatting, (2) images (alt texts), (3) math accessibility (e.g., MathML for screenreaders), (4) making data in figures parseable by screen readers.
People who don’t use assistive technologies don’t know what is required of them to make accessible documents
PDF is often preferred because it is easy/easier to save to reference managers….”
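
One of the improvement areas listed above, missing alt text on images, can be checked mechanically once papers are offered as HTML. The following is a minimal sketch using only Python's standard-library `html.parser`. The class name `MissingAltFinder` and the sample markup are illustrative assumptions, not part of the arXiv study, and a real accessibility audit would cover far more than this one check.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that lacks a non-empty alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # Simplification: an empty alt is also flagged, although empty alt
        # text is legitimate for purely decorative images.
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src", "(no src)"))


# Hypothetical markup standing in for the HTML rendition of a paper.
sample = """
<figure><img src="fig1.png" alt="Loss curve over 10 epochs"></figure>
<figure><img src="fig2.png"></figure>
<figure><img src="fig3.png" alt=""></figure>
"""

finder = MissingAltFinder()
finder.feed(sample)
print(finder.missing)
```

Flagging empty alt text is a deliberate choice here, on the assumption that figures in research papers are rarely decorative; a stricter tool would let authors mark such images explicitly.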

Quarto

“Quarto® is an open-source scientific and technical publishing system built on Pandoc

Create dynamic content with Python, R, Julia, and Observable.
Author documents as plain text markdown or Jupyter notebooks.
Publish high-quality articles, reports, presentations, websites, blogs, and books in HTML, PDF, MS Word, ePub, and more.
Author with scientific markdown, including equations, citations, crossrefs, figure panels, callouts, advanced layout, and more….”