Celebrating UC San Diego’s Mass Contributions to HathiTrust

“This month, the University of California San Diego will send its final shipment of carts filled with library books to be digitized by Google as part of the Google Books Library Project. In total, well over half a million UC San Diego Library volumes have been sent to Google to be digitized and deposited in HathiTrust. This will be the third time the university has taken part in the project. 

UC San Diego joined as an early Google Books partner in 2008. From 2008 to 2011 the campus sent over 470,000 volumes to be digitized. Since rejoining the project in 2017, UC San Diego Library has sent over 111,000 books. The project paused in March 2020 due to pandemic shutdowns, but the campus resumed sending shipments to be digitized in November 2021.

Tens of thousands of volumes from UC San Diego’s International Relations, Pacific Studies, and East Asian Language collections were digitized in the first three years of the project. When UC San Diego rejoined the Google Library Project in 2017, the campus included large numbers of US federal government documents, dissertations, and special collections volumes in its shipments. But throughout all phases of participation in the project, hundreds of thousands of books from the general library collection were digitized….”

How one digital book led to an important COVID-19 discovery

“In early 2020, many scientists believed that particles containing COVID-19 were too large to be airborne. Medical canon held that only particles sized 5 microns or smaller could stay in the air long enough to be transmitted between people over 6 feet apart. But a team of scientists questioned the 5 micron figure. Katie Randall, then a graduate student at Virginia Tech, went to work investigating the origin of the number. “I was working on my dissertation when the pandemic hit, and I had to pause in-person research,” says Katie. “I was supposed to focus on revising my research plan, but when I got the email about this project, I knew I couldn’t say no — it was too important and too intriguing to ignore.”

In her research, Katie located an out-of-print book, Airborne Contagion and Air Hygiene: An Ecological Study of Droplet Infections, written by William Firth Wells in 1955. While she normally would have borrowed the book through an agreement between libraries to share items in their collection, pandemic closures meant that was not an option. Fortunately, she was able to locate a digital copy of the book in HathiTrust, a Google Books partner.

HathiTrust is a nonprofit collaborative of academic and research libraries which preserves digitized items — most of which come from partnerships with Google Books. “Early on, our partner libraries dedicated themselves to digital preservation,” says Mike Furlough, Executive Director of HathiTrust. “But even when preservation was the goal, our thoughts were always on providing access for research and scholarship.”

With the help of the digitized book, Katie discovered that the 5 micron threshold had no real scientific basis — in fact, the experiments detailed in Wells’ book showed the aerosolization of particles as big as 100 microns….”

A Search Engine That Finds You Weird Old Books | by Clive Thompson | Jan, 2022 | Debugger

“Still, sifting through old books can be a hassle. You have to go to those search sites and filter for the right vintage (and public-domain-status). It’s a pain.

So: I decided to partly automate this — by making my own search tool.

Behold the Weird Old Book Finder….

Behind the scenes, here’s what it’s doing, which is pretty simple: i) You type in a query, and ii) my app sends it to Google Books, and filters the results for pre-1927 public domain. Then iii) it picks one at random and displays it to you….”
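
The three steps described above (query, filter for pre-1927 public domain, pick one at random) can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the record fields (`title`, `year`, `public_domain`), the function names, and the sample titles are all invented for the example, and the search call to Google Books is replaced by a hard-coded list so the sketch runs offline.

```python
import random

CUTOFF_YEAR = 1927  # from the description: "pre-1927 public domain"


def eligible(results, cutoff_year=CUTOFF_YEAR):
    """Keep only results marked public domain and dated before the cutoff."""
    return [
        r for r in results
        if r.get("public_domain")
        and r.get("year") is not None
        and r["year"] < cutoff_year
    ]


def pick_weird_old_book(results, rng=random):
    """Return one eligible result at random, or None if nothing qualifies."""
    candidates = eligible(results)
    return rng.choice(candidates) if candidates else None


# Hypothetical search results standing in for a real API response.
sample_results = [
    {"title": "A Treatise on Bees", "year": 1887, "public_domain": True},
    {"title": "Modern Beekeeping", "year": 1998, "public_domain": False},
    {"title": "The Hive and Its Wonders", "year": 1912, "public_domain": True},
    {"title": "Bee Culture Reprint", "year": 1930, "public_domain": True},
]

book = pick_weird_old_book(sample_results)
print(book["title"] if book else "No pre-1927 public-domain match")
```

A real tool would presumably fetch `sample_results` from a search service and map its response into this shape; that mapping is left out here because the excerpt does not describe it.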

Book Review – Along Came Google: A History of Library Digitization – The Scholarly Kitchen

“Meanwhile, Google had only just gone public with an IPO in 2004. That year, at the Frankfurt Book Fair, Google announced its Publisher Program, which promised to support the same type of search functionality. Publishers willingly signed up, unaware that the Library Project would be announced two months later. The Library Project was ambitious, digitizing titles acquired for collections held at Harvard, Stanford, the University of Michigan, the Bodleian Library at Oxford University, and the New York Public Library. This was a breathtaking step farther than Amazon, and the information community was thunderstruck as it tried to process the implications of what such an expansion could mean. 

This is the story that is told in Along Came Google: A History of Library Digitization by Deanna Marcum and Roger Schonfeld (full disclosure, Roger is a regular contributor to this blog). Note the subtitle. This book documents from a library perspective the implications and long-term impact of Google’s move to make a significant corpus of “offline content searchable online” through optimized means of scanning and digitization. The outcome of Google’s ambitious project would ultimately be diminished, due to constraints resulting from extended legal battles, but key library leadership has managed to create the infrastructure needed to sustain and carry on the massive digitization needed. There were significant barriers to that work, as the authors note, despite the fact that “in this story, there are many actors, all of good intentions. Inevitably, it is also a story of limitations and failures to collaborate.” …”

Coyle’s InFormation: Digitization Wars, Redux

“From 2004 to 2016 the book world (authors, publishers, libraries, and booksellers) was involved in the complex and legally fraught activities around Google’s book digitization project. Once known as “Google Book Search,” the company claimed that it was digitizing books to be able to provide search services across the print corpus, much as it provides search capabilities over texts and other media that are hosted throughout the Internet. 

Both the US Authors Guild and the Association of American Publishers sued Google (both separately and together) for violation of copyright. These suits took a number of turns including proposals for settlements that were arcane in their complexity and that ultimately failed. Finally, in 2016 the legal question was decided: digitizing to create an index is fair use as long as only minor portions of the original text are shown to users in the form of context-specific snippets. 

We now have another question about book digitization: can books be digitized for the purpose of substituting remote lending in the place of the lending of a physical copy? This has been referred to as “Controlled Digital Lending (CDL),” a term developed by the Internet Archive for its online book lending services. The Archive has considerable experience with both digitization and providing online access to materials in various formats, and its Open Library site has been providing digital downloads of out of copyright books for more than a decade. Controlled digital lending applies solely to works that are presumed to be in copyright. …”

HathiTrust: A Digital Library Revolution Takes Flight

“The phrase “closed until further notice due to COVID-19” has become all too familiar. And, while we have started to grow accustomed to losing access to many resources that typically define our community existence, there’s one that’s particularly crucial to student and faculty researchers: libraries. For some, it may be easy to write off libraries as “nice-to-have.” But for scholars, they are essential. And as library doors began to shutter throughout California and much of the world, the potential impact on the academic community was profound.

Thankfully, the University of California has been preparing for this moment for decades. In 2008, the UC Libraries co-founded HathiTrust, and started contributing scanned copies of books and journals to the new organization. Based at the University of Michigan (U-M), HathiTrust is a large-scale repository of digital content collaboratively created by academic and research institutions. As researchers lost access to vital hard-copy materials, it initiated an Emergency Temporary Access Service (ETAS) to give UC researchers critical access to more than 13 million digital volumes. This revolution has been immediately impactful — and a profound advancement in sharing digital content….”

CSU Explores the Possibility of a Google Books Partnership – Cal schol.com

“Just heard yesterday that our CSU Council of Library Deans (COLD) approved a request I’d made to send records of our entire CSU print holdings to Google Books for evaluation. Google Books will run a comparison of their current digitized holdings against our holdings and evaluate on their end whether a digitization partnership makes sense. If it does, then the CSU will consider whether it might make sense for us as well….”

Why a National Emergency Library Would Have Been Unnecessary – Disruptive Competition Project

“Last week, in response to the COVID-19 pandemic, the Internet Archive announced the National Emergency Library (“NEL”), which expanded digital access to the books in its collection. The New Yorker welcomed it as “a gift to readers everywhere.” Predictably, the Authors Guild, the Copyright Alliance, and the Association of American Publishers condemned the move as infringing copyright. Overlooked in this controversy is that had the 2008 attempted settlement of the litigation over the Google Library Project been approved by the court, the NEL would likely have been unnecessary….”

B2fxxx: Carl Malamud at the Open University

“Without asking publishers’ permission, Malamud has put a lot of stuff online via a project at Jawaharlal Nehru University (JNU) in India – 125 million journal articles from many sources, from the mid 19th century up to the present.

The storage facility is air-gapped and not connected to the internet. Researchers who want access can bring their computers to the facility and text & data mine the materials there. Without having to read or download the articles which is not permitted, they can, nevertheless, draw scientific insights, thereby circumventing any potential copyright problems. The terms and conditions are modeled on those of the HathiTrust and the store specialises in bioinformatics. The access model is 3-tiered:

Tier 0 is air-gapped and pdfs of the articles

Tier 1 is extracted texts and is also air-gapped

Tier 2 is facts. As there is no copyright on facts, this can be made available openly to everyone….

In 2016 the US Supreme Court rejected the Authors Guild’s request to further appeal the decision, ending the more than a decade long litigation. The Authors Guild also tried suing the HathiTrust but were unsuccessful in that case too. The technicalities of the case were different.  One interesting angle was that the court made a point of noting the value of the HathiTrust approach to making the books available to print disabled and visually impaired.

The bottom line was that Google Books and the HathiTrust were given the ok by the US courts.

In the UK text and data mining is permitted only for non-commercial use. …”

Google Books 2020 Update | Communications

“What would you do if Google came to you and said: You have 1 million items that we would like to scan for you and make available to the world?

Over the past two years, a team from Access Services, Stacks Management, Library Technology Services, Information and Technical Services, Harvard Depository, and ReCAP have been attempting to do just that as part of a Harvard Library Digital Strategies and Innovation (DSI) initiative. This project began nearly a decade after our first partnership with Google Books, and it has been an opportunity to approach this work differently — to identify the challenges that we face at each step of the workflow and to look for creative, iterative ways to meet them….

Between 2004 and 2009, Google scanned 891,164 volumes from Harvard. Google has begun reprocessing those materials, enhancing and correcting the raw images and running them through updated OCR to create better, more searchable, machine-readable text.  

As part of this relationship, we are involved in the Google Library Partners group, an active community of our colleagues from peer institutions who also share their materials with Google. As a group we have been able to advocate for and contribute to reviews for handling of materials, quality assurance in scanning, and expanded treatments for items with foldouts or materials of non-traditional size. We have also led a review of how our peers provide access to materials and are actively partnering with HathiTrust to conduct more research into how users find and utilize these materials….”

4.5 Million UC Volumes Digitized & UC’s Most Popular Full View Books in HathiTrust for 2019 – California Digital Library

“The University of California Libraries recently contributed the 4,500,000th digitized book from their collections to HathiTrust Digital Library–a tremendous achievement resulting from 15 years of continuous digitization work. 

The vast majority of these millions of volumes were generated via the Google Books Library Project, which UC joined in 2006. That year the mass digitization of UC’s library collections began in earnest when the Northern Research Library Facility (NRLF) started sending books to the Google Books Library Project for scanning. UC’s work with the Google Books Library Project has never paused–by the time UC’s 3,000,000th volume was digitized in 2010, UC San Diego, UC Santa Cruz, and UCLA had all begun sending collections to Google for digitization. Since then, UC San Francisco, the Southern Research Library Facility (SRLF), UC Davis, UC Berkeley, UC Riverside, UC Irvine, and UC Santa Barbara have all participated, with UC Santa Barbara, UC Berkeley, UC San Diego, UC Riverside, UCLA, and NRLF continuing to do so….”

The Rebirth of Copyright As an Opt-In System? – The Media Institute

“For most of the history of Anglo-American copyright law, copyright was an opt-in system: Authors had to jump through certain regulatory hoops if they wanted to prevent others from copying their works without consent.  These threshold formalities included registering their works with a government agency, affixing a notice to published copies, depositing exemplars with a centralized library, and more.  A failure to comply with the requirements usually meant a diminution in the authors’ copyright entitlement – and in some cases a wholesale forfeiture, under which the works would pass immediately into the public domain.

After some 200 years, however, U.S. copyright abandoned its formal requirements.  Beginning in 1976 and culminating in 1989, Congress responded to complaints from authors (who had sometimes lost protection due to what they viewed as a technicality) and to pressure to join the international copyright community (which forbade most formalities).  Copyright law accordingly underwent a conversion from opt-in to opt-out.

As a result, copyright protection now arises by operation of law, without any action by the author.  As long as a work contains a modicum of originality and is fixed in some tangible form, copyright automatically protects it, and authors must affirmatively disclaim the entitlement if they don’t want its protection.  And these threshold requirements of originality and fixation are incredibly minimal, such that every reader of this essay is probably the owner of hundreds, and quite possibly thousands, of copyrights – in everything from diary entries to doodles….

Of course, any opt-in proposal would face a number of political obstacles, including the fact that predicating copyright protection on any formality (at least for foreign works) is inconsistent with the international copyright conventions to which the United States is a party.  But the Internet does not stop at the border; if opt-in makes sense here, it will make sense abroad as well.  When the United States and its trade partners are done figuring out what to do with Google Books, then, they should consider a return to copyright’s roots.  Make copyright opt-in once more….”

ASECS at 50: Interview with Robert Darnton

“Of the potential solutions, open research practices are among the most promising. The argument is that transparency acts as an implicit quality control process. If others are able to scrutinise our work—not just the final published output, but the underlying data, code, and so on—researchers will be incentivised to ensure these are high quality.

So, if we think that research could benefit from improved quality control, and if we think that open research might have a role to play in this, why aren’t we all doing it? In a word: incentives….”