Google shared AI knowledge with the world until ChatGPT caught up – The Washington Post

“In February, Jeff Dean, Google’s longtime head of artificial intelligence, announced a stunning policy shift to his staff: They had to hold off sharing their work with the outside world.

For years Dean had run his department like a university, encouraging researchers to publish academic papers prolifically; they had pushed out nearly 500 studies since 2019, according to Google Research’s website.

But the launch of OpenAI’s groundbreaking ChatGPT three months earlier had changed things. The San Francisco start-up kept up with Google by reading the team’s scientific papers, Dean said at the quarterly meeting for the company’s research division. Indeed, transformers — a foundational part of the latest AI tech and the T in ChatGPT — originated in a Google study.

Things had to change. Google would take advantage of its own AI discoveries, sharing papers only after the lab work had been turned into products, Dean said, according to two people with knowledge of the meeting, who spoke on the condition of anonymity to share private information….”

Open Buildings

“Building footprints are useful for a range of important applications, from population estimation, urban planning and humanitarian response, to environmental and climate science. This large-scale open dataset contains the outlines of buildings derived from high-resolution satellite imagery in order to support these types of uses. The project is based in Ghana, with an initial focus on the continent of Africa and newer updates covering South Asia and South-East Asia….”
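The footprints themselves ship as compressed CSV tiles, and, assuming the published layout (a WKT geometry column alongside per-building confidence and area fields), a minimal sketch of loading one tile might look like this; the file name is hypothetical, so check the download page for the exact schema:

```python
import pandas as pd
from shapely import wkt

# Hypothetical local tile; the real data ships as compressed CSV files.
# Column names below assume the published schema (geometry as WKT,
# plus "confidence" and "area_in_meters"); verify against the release.
df = pd.read_csv("open_buildings_tile.csv.gz")

# Keep only high-confidence detections.
df = df[df["confidence"] >= 0.75]

# Parse the WKT polygons and compute a simple summary.
polygons = df["geometry"].apply(wkt.loads)
print(f"{len(polygons)} footprints, "
      f"median area {df['area_in_meters'].median():.1f} m²")
```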

Google’s Got A Secret – Knuckleheads’ Club

“Bandwidth costs money, so there’s a limit to how much and how often website operators will let their websites be crawled. This limit means that website operators are picky about who they let crawl their websites. Only a select few crawlers are allowed access to the entire web, and Google is given extra special privileges on top of that. This isn’t illegal and it isn’t Google’s fault, but this monopoly on web crawling that has naturally emerged prevents any other company from being able to effectively compete with Google in the search engine market.

There Should Be A Public Cache Of The Web

All of Google’s competitors in the search engine market have failed in their own way but most of them have complained bitterly about how Google has such an advantage when it comes to web crawling. We think that there is clearly a failure in this market and government intervention is required to break Google’s hold on the natural monopoly of crawling the web….”
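The gatekeeping the piece describes is mostly enforced through robots.txt, where operators admit some crawlers by name and refuse others. A small sketch using Python's standard-library parser; the site and user-agent strings are illustrative, not any real site's policy:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse a site's robots.txt; the URL here is illustrative.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# An established crawler and a hypothetical newcomer ask the same question.
for agent in ("Googlebot", "SomeNewSearchBot"):
    allowed = rp.can_fetch(agent, "https://example.com/some/page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```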

How Google and Amazon helped the FBI to successfully track the Russian owners of Z-Library – Good e-Reader

“Behind-the-scenes information is slowly pouring out as to what really happened as the cops closed in on Z-Library and eventually took it down. As TorrentFreak reported, active co-operation from companies like Google and Amazon helped the FBI in tracking the activities of the company as well as its Russian owners. Also, from what the investigators revealed, tracking down the owners of what came to be known as the world’s largest digital library proved to be much simpler than they might have thought.

The FBI, armed with search warrants aimed at Google and Amazon, found it relatively easy to unravel the truth given how, as the investigators soon got to know, securing their identities never seemed to be a top priority for the owners, Anton Napolsky and Valeriia Ermakova. Both have since been arrested in Argentina, and chances are that they will be extradited to the US for further investigation….”

Google AI Blog: Announcing the Patent Phrase Similarity Dataset

“Patent documents typically use legal and highly technical language, with context-dependent terms that may have meanings quite different from colloquial usage and even between different documents. The process of using traditional patent search methods (e.g., keyword searching) to search through the corpus of over one hundred million patent documents can be tedious and result in many missed results due to the broad and non-standard language used. For example, a “soccer ball” may be described as a “spherical recreation device”, “inflatable sportsball” or “ball for ball game”. Additionally, the language used in some patent documents may obfuscate terms to the authors’ advantage, so more powerful natural language processing (NLP) and semantic similarity understanding can give everyone the ability to do a thorough search.

The patent domain (and more general technical literature like scientific publications) poses unique challenges for NLP modeling due to its use of legal and technical terms. While there are multiple commonly used general-purpose semantic textual similarity (STS) benchmark datasets (e.g., STS-B, SICK, MRPC, PIT), to the best of our knowledge, there are currently no datasets focused on technical concepts found in patents and scientific publications (the somewhat related BioASQ challenge contains a biomedical question answering task). Moreover, with the continuing growth in size of the patent corpus (millions of new patents are issued worldwide every year), there is a need to develop more useful NLP models for this domain.

Today, we announce the release of the Patent Phrase Similarity dataset, a new human-rated contextual phrase-to-phrase semantic matching dataset, and the accompanying paper, presented at the SIGIR PatentSemTech Workshop, which focuses on technical terms from patents. The Patent Phrase Similarity dataset contains ~50,000 rated phrase pairs, each with a Cooperative Patent Classification (CPC) class as context. In addition to similarity scores that are typically included in other benchmark datasets, we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain related. This dataset (distributed under the Creative Commons Attribution 4.0 International license) was used by Kaggle and USPTO as the benchmark dataset in the U.S. Patent Phrase to Phrase Matching competition to draw more attention to the performance of machine learning models on technical text. Initial results show that models fine-tuned on this new dataset perform substantially better than general pre-trained models without fine-tuning….”
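To make the "general pre-trained models without fine-tuning" baseline concrete, here is a hedged sketch that scores phrase pairs with an off-the-shelf sentence-embedding model. The column names (anchor, target) follow the Kaggle competition's CSV layout, and the local file name is hypothetical:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Hypothetical local copy of the phrase-pair CSV; anchor/target columns
# follow the Kaggle competition layout, so verify against the release.
df = pd.read_csv("phrase_pairs.csv")

# An off-the-shelf general-purpose model, i.e., no patent fine-tuning.
model = SentenceTransformer("all-MiniLM-L6-v2")

emb_a = model.encode(df["anchor"].tolist(), convert_to_tensor=True)
emb_t = model.encode(df["target"].tolist(), convert_to_tensor=True)

# Cosine similarity of each anchor/target pair as a crude similarity score.
scores = util.cos_sim(emb_a, emb_t).diagonal()
print(df.assign(predicted=scores.cpu().numpy()).head())
```

Comparing these raw cosine scores against the human ratings is exactly the gap where, per the post, models fine-tuned on the new dataset pull ahead.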

FBI Gains Access to Sci-Hub Founder’s Google Account Data * TorrentFreak

“Sci-Hub founder Alexandra Elbakyan says that following a legal process, the Federal Bureau of Investigation has gained access to data in her Google account. Google itself informed her of the data release this week, noting that due to a court order, the company wasn’t allowed to inform her sooner….

In an email to Elbakyan dated March 2, 2022, Google advises that following a legal process issued by the FBI, Google was required to hand over data associated with Elbakyan’s account. Exactly what data was targeted isn’t made clear but according to Google, a court order required the company to keep the request a secret….”

Opinion: Pros and Cons of Google vs. Subscription Databases

“During my time overseeing the library services department of a large school district, we found our subscription databases were generally a well-kept secret. The lack of trained school librarians available to teach these resources was part of the issue. But Google was ubiquitous, as was Wikipedia, and they became de facto research sources for students, despite their limitations for such a role.

Google has its place for students and researchers (I used it for this article), as does Google Scholar (which I also used). But for students, subscription databases should also play a central research role, beginning with age-appropriate sources for elementary kids – like National Geographic – and moving up to “Gale in Context” for middle school students, and more scholarly articles for high schoolers from sources like ABC-CLIO….”

Book Review – Along Came Google: A History of Library Digitization – The Scholarly Kitchen

“Meanwhile, Google had only just gone public with an IPO in 2004. That year, at the Frankfurt Book Fair, Google announced its Publisher Program, which promised to support the same type of search functionality. Publishers willingly signed up, unaware that the Library Project would be announced two months later. The Library Project was ambitious, digitizing titles acquired for collections held at Harvard, Stanford, the University of Michigan, the Bodleian Library at Oxford University, and the New York Public Library. This was a breathtaking step farther than Amazon, and the information community was thunderstruck as it tried to process the implications of what such an expansion could mean. 

This is the story that is told in Along Came Google: A History of Library Digitization by Deanna Marcum and Roger Schonfeld (full disclosure, Roger is a regular contributor to this blog). Note the subtitle. This book documents from a library perspective the implications and long-term impact of Google’s move to make a significant corpus of “offline content searchable online” through optimized means of scanning and digitization. The outcome of Google’s ambitious project would ultimately be diminished, due to constraints resulting from extended legal battles, but key library leadership has managed to create the infrastructure needed to sustain and carry on the massive digitization needed. There were significant barriers to that work, as the authors note, despite the fact that “in this story, there are many actors, all of good intentions. Inevitably, it is also a story of limitations and failures to collaborate.” …”

Google turns AlphaFold loose on the entire human genome | Ars Technica

“Just one week after Google’s DeepMind AI group finally described its biology efforts in detail, the company is releasing a paper that explains how it analyzed nearly every protein encoded in the human genome and predicted its likely three-dimensional structure—a structure that can be critical for understanding disease and designing treatments. In the very near future, all of these structures will be released under a Creative Commons license via the European Bioinformatics Institute, which already hosts a major database of protein structures.

In a press conference associated with the paper’s release, DeepMind’s Demis Hassabis made clear that the company isn’t stopping there. In addition to the work described in the paper, the company will release structural predictions for the genomes of 20 major research organisms, from yeast to fruit flies to mice. In total, the database launch will include roughly 350,000 protein structures….”
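The EBI-hosted database serves each prediction as a per-UniProt-accession file. A minimal sketch of pulling one structure down; the URL pattern mirrors the site's public download links, but the model version suffix changes across releases, so treat the exact path as an assumption:

```python
import urllib.request

uniprot_id = "P69905"  # human hemoglobin subunit alpha
url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"

# Download the predicted structure in PDB format.
with urllib.request.urlopen(url) as resp:
    pdb_text = resp.read().decode("utf-8")

# Quick sanity check: count ATOM records in the file.
n_atoms = sum(1 for line in pdb_text.splitlines() if line.startswith("ATOM"))
print(f"{uniprot_id}: {n_atoms} atom records")
```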

Google Dataset Search: Using open access tools during the research process – News – Illinois State

“We often discuss publications and publishing open access (OA) materials in these news items, but the OA movement can be a part of many other steps of the research process. Many researchers choose to make the datasets their research is based on open access as well. This can be done as part of a funding institution’s requirements, to increase transparency and reproducibility, or simply because they wish to make their data easily available to other researchers.

One way students and faculty can find these datasets is through Google Dataset Search. Out of beta since early 2020, Google Dataset Search can be used to find links to datasets that have been published on the web and described via the schema.org standard. Not every dataset is on the web, and not all of those that are use this standard, but Google does claim that over 25 million datasets are indexed for searching….”
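Description via schema.org in practice means a JSON-LD block embedded in the page hosting the dataset. A minimal illustration, built in Python for concreteness; every field value here is made up:

```python
import json

# A minimal schema.org/Dataset record; all values are illustrative.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Campus Tree Inventory",
    "description": "Locations and species of trees on a university campus.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.edu/data/trees.csv",
    },
}

# Embed the output in the page inside a
# <script type="application/ld+json"> ... </script> tag.
print(json.dumps(dataset_jsonld, indent=2))
```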

Google AI Blog: A Step Toward More Inclusive People Annotations in the Open Images Extended Dataset

“In 2016, we introduced Open Images, a collaborative release of ~9 million images annotated with image labels spanning thousands of object categories and bounding box annotations for 600 classes. Since then, we have made several updates, including the release of crowdsourced data to the Open Images Extended collection to improve diversity of object annotations. While the labels provided with these datasets were expansive, they did not focus on sensitive attributes for people, which are critically important for many machine learning (ML) fairness tasks, such as fairness evaluations and bias mitigation. In fact, finding datasets that include thorough labeling of such sensitive attributes is difficult, particularly in the domain of computer vision.

Today, we introduce the More Inclusive Annotations for People (MIAP) dataset in the Open Images Extended collection. The collection contains more complete bounding box annotations for the person class hierarchy in 100k images containing people. Each annotation is also labeled with fairness-related attributes, including perceived gender presentation and perceived age range. With the increasing focus on reducing unfair bias as part of responsible AI research, we hope these annotations will encourage researchers already leveraging Open Images to incorporate fairness analysis in their research….”
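A hedged sketch of exploring those annotations, assuming they ship as a CSV in the usual Open Images style (one row per bounding box with normalized coordinates); the file name and attribute column names below are assumptions, so check the release for the exact schema:

```python
import pandas as pd

# Hypothetical file name; column names assume the Open Images CSV style.
boxes = pd.read_csv("miap_boxes.csv")

# Distribution of the perceived-attribute labels across person boxes.
print(boxes["GenderPresentation"].value_counts())
print(boxes["AgePresentation"].value_counts())

# Normalized coordinates make box area a fraction of the image area.
area = (boxes["XMax"] - boxes["XMin"]) * (boxes["YMax"] - boxes["YMin"])
print(f"Median box covers {area.median():.1%} of its image")
```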

Open search tools need sustainable funding – Research Professional News

“The Covid-19 pandemic has triggered an explosion of knowledge, with more than 200,000 papers published to date. At one point last year, scientific output on the topic was doubling every 20 days. This huge growth poses big challenges for researchers, many of whom have pivoted to coronavirus research without experience or preparation.

Mainstream academic search engines are not built for such a situation. Tools such as Google Scholar, Scopus and Web of Science provide long, unstructured lists of results with little context.

These work well if you know what you are looking for. But for anyone diving into an unknown field, it can take weeks, even months, to identify the most important topics, publication venues and authors. This is far too long in a public health emergency.

The result has been delays, duplicated work, and problems with identifying reliable findings. This lack of tools to provide a quick overview of research results and evaluate them correctly has created a crisis in discoverability itself. …

Building on these, meta-aggregators such as BASE, CORE and OpenAIRE have begun to rival and in some cases outperform the proprietary search engines. …”
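Those aggregators expose public search APIs. A hedged sketch against OpenAIRE's REST interface; the endpoint and parameters follow its documented publication search, but treat them as assumptions and consult the current API docs before relying on them:

```python
import urllib.request

# OpenAIRE publication search; endpoint and parameters are assumptions
# based on its documented REST API, so verify against the current docs.
url = ("https://api.openaire.eu/search/publications"
       "?keywords=covid-19&format=json&size=5")

with urllib.request.urlopen(url) as resp:
    payload = resp.read().decode("utf-8")

print(payload[:500])  # peek at the start of the JSON response
```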

Google Books: how to get the full text of public domain books

“While Google Books has digitised millions of books all over the world with the help of thousands of libraries as part of the Library Project, not all of those digitised books are freely available on the website. Books that are still in copyright cannot be consulted in full-text, even though you might see a snippet preview.

Sometimes, however, Google has not assessed the copyright correctly, and a book is not publicly available even though Google has scanned it and it is out of copyright. All books published before 1900 are out of copyright, as are some books published between 1900 and 1930.

When you know that Google Books has a scan of a book available, and you believe that the book should be in the public domain, you can ask Google to re-evaluate the copyright situation of that publication. The Google Books team will give you an answer in a couple of days….”
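Before filing such a request, you can check how Google currently classifies a scan through the public Books API, whose volumes endpoint reports a publicDomain flag under accessInfo. A minimal sketch; the query string is illustrative:

```python
import json
import urllib.parse
import urllib.request

query = "intitle:Middlemarch"  # an illustrative pre-1900 title
url = ("https://www.googleapis.com/books/v1/volumes?q="
       + urllib.parse.quote(query))

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# Report Google's current public-domain classification for the top hits.
for item in data.get("items", [])[:5]:
    title = item.get("volumeInfo", {}).get("title")
    pd_flag = item.get("accessInfo", {}).get("publicDomain")
    print(f"{title}: public domain = {pd_flag}")
```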