Open Science and Software Assistance: Commentary on “Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora’s Box Has Been Opened”

Abstract:  Májovský and colleagues have investigated the important issue of ChatGPT being used to generate complete scientific articles, including fake data and tables. The reasons ChatGPT poses a significant concern to research reach far beyond the model itself. Once again, the lack of reproducibility and visibility of scientific works creates an environment where fraudulent or inaccurate work can thrive. What are some of the ways in which we can handle this new situation?


Will AI Chatbots Boost Efforts to Make Scholarly Articles Free? Peter Baldwin interview | EdSurge News, May 23, 2023

“…Baldwin’s latest book, “Athena Unbound: Why and How Scholarly Knowledge Should Be Free for All,” looks at the history and future of the open access movement. And fittingly, his publisher made a version of the book available free online. This professor is not arguing that all information should be free. He’s focused on freeing up scholarship made by those who have full-time jobs at colleges, and who are thus not expecting payment from their writing to make a living. In fact, he argues that the whole idea of academic research hinges on work being shared freely so that other scholars can build on someone else’s idea or see from another scholar’s work that they might be going down a dead-end path. The typical open access model makes scholarly articles free to the public by charging authors a processing fee to have their work published in the journal. And in some cases that has caused new kinds of challenges, since those fees are often paid by college libraries, and not every scholar in every discipline has equal access to support. The number of open access journals has grown over the years. But the majority of scholarly journals still follow the traditional subscription model, according to recent estimates. EdSurge recently connected with Baldwin to talk about where he sees the movement going….”

EU governments to rein in unfair academic publishers and unsustainable fees | Science|Business

“Member states have almost settled on a call to make immediate open access the default, with no author fees. But some say the Council needs to do more to prevent AI-generated papers from threatening the integrity of the scientific record.

Research ministers are nearing the finish line in drafting their position on changes needed in academic publishing, with the latest leaked draft revealing the near-final text.

The upcoming paper, to be adopted by ministers in late May, will call on policymakers and publishers to make immediate and unrestricted open access “the default mode in publishing, with no fees for authors.”

While there is a warm reception for this, there is concern that the Council is missing the opportunity to crack down on AI-generated scientific articles, given rising evidence that AI chatbots like ChatGPT could undermine the integrity of academic publishing….”

Defining a Machine-Readable Friendly License for Cloud Contribution Environments

Abstract:  There are two types of contribution environments that have been widely written about in the last decade: closed environments controlled by the promulgator, and open access environments seemingly controlled by everyone and no one at the same time. In closed environments, the promulgator has sole discretion to control both the intellectual property at hand and the integrity of that content. In open access environments, the intellectual property (IP) is controlled to varying degrees by the Creative Commons license associated with the content, while it is solely up to the promulgator to control the integrity of that content. Added to that, open access environments don’t offer native protection to data in such a way that the data can be accessed and utilized for Text Mining (TM), Natural Language Processing (NLP), or Machine Learning (ML). It is our intent in this paper to lay out a third option: a federated cloud environment wherein all members of the federation agree upon terms for copyright protection, the integrity of the data at hand, and the use of that data for TM, NLP, and ML.
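The core idea lends itself to a concrete illustration. The sketch below (ours, not the paper's) shows how a federation member might expose license and usage terms as machine-readable metadata that a crawler checks before ingesting content. The schema.org context and the Creative Commons URL are real; the usage-permission field names (`tdm_permitted` and so on) are hypothetical, since the abstract proposes the concept rather than a concrete schema.

```python
import json

# A hypothetical machine-readable license record for a federated cloud
# contribution environment. The "usageInfo" field names are illustrative
# only; the paper proposes the concept, not this exact schema.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example contributed dataset",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "usageInfo": {
        "tdm_permitted": True,           # Text Mining (TM)
        "nlp_permitted": True,           # Natural Language Processing (NLP)
        "ml_training_permitted": False,  # Machine Learning (ML) training
        "attribution_required": True,
    },
}

def may_train_ml(metadata: dict) -> bool:
    """Allow ML training only when the record explicitly grants it."""
    return bool(metadata.get("usageInfo", {}).get("ml_training_permitted"))

print(json.dumps(record, indent=2))
print("ML training allowed:", may_train_ml(record))  # -> False
```

A harvester in such a federation would call `may_train_ml` (or its equivalent) before adding a record to a training corpus, defaulting to refusal when the permission is absent.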

Google shared AI knowledge with the world until ChatGPT caught up – The Washington Post

“In February, Jeff Dean, Google’s longtime head of artificial intelligence, announced a stunning policy shift to his staff: They had to hold off sharing their work with the outside world.

For years Dean had run his department like a university, encouraging researchers to publish academic papers prolifically; they pushed out nearly 500 studies since 2019, according to Google Research’s website.


But the launch of OpenAI’s groundbreaking ChatGPT three months earlier had changed things. The San Francisco start-up kept up with Google by reading the team’s scientific papers, Dean said at the quarterly meeting for the company’s research division. Indeed, transformers — a foundational part of the latest AI tech and the T in ChatGPT — originated in a Google study.

Things had to change. Google would take advantage of its own AI discoveries, sharing papers only after the lab work had been turned into products, Dean said, according to two people with knowledge of the meeting, who spoke on the condition of anonymity to share private information….”

[2303.14334] The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces

Abstract:  Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, the need for new technology to support the reading process grows. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has changed little in decades. The PDF format for sharing research papers is widely used due to its portability, but it has significant downsides, including static content, poor accessibility for low-vision readers, and difficulty reading on mobile devices. This paper explores the question “Can recent advances in AI and HCI power intelligent, interactive, and accessible reading interfaces — even for legacy PDFs?” We describe the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Through this project, we’ve developed ten research prototype interfaces and conducted usability studies with more than 300 participants and real-world users showing improved reading experiences for scholars. We’ve also released a production reading interface for research papers that will incorporate the best features as they mature. We structure this paper around challenges scholars and the public face when reading research papers — Discovery, Efficiency, Comprehension, Synthesis, and Accessibility — and present an overview of our progress and remaining open challenges.


Data Rivers: Carving Out the Public Domain in the Age of Generative AI by Sylvie Delacroix :: SSRN

Abstract:  The salient question, today, is not whether ‘copyright law [will] allow robots to learn’. The pressing question is whether the fragile data ecosystem that makes generative AI possible can be re-balanced through intervention that is timely enough. The threats to this ecosystem come from multiple fronts. They are comparable in kind to the threats currently affecting ‘water rivers’ across the globe.

First, just as the fundamental human right to water is only possible if ‘reasonable use’ and reciprocity constraints are imposed on the economic exploitation of rivers, so is the fundamental right to access culture, learn and build upon it. It is that right, and the moral aspirations underlying it, that has led millions to share their creative works under ‘open’ licenses. Generative AI tools would not have been possible without access to that rich, high-quality content. Yet few of those tools respect the reciprocity expectations without which the Creative Commons and Open-Source movements cease to be sustainable. The absence of internationally coordinated standards to systematically identify AI-generated content also threatens our ‘data rivers’ with irreversible pollution.

Second, the process that has allowed large corporations to seize control of data and its definition as an asset subject to property rights has effectively enabled the construction of hard structures (canals or dams) that have led the rights of many of those lying up- or downstream of such structures to be ignored. While data protection laws seek to address those power imbalances by granting ‘personal’ data rights, the exercise of those rights remains demanding, just as it is challenging for artists to defend their IP rights in the face of AI-generated works that threaten them with redundancy.

To tackle the above threats, the long overdue reform of copyright can only be part of the required intervention. Equally important is the construction of bottom-up empowerment infrastructure that gives long-term agency to those wishing to share their data and/or creative works. This infrastructure would also play a central role in reviving much-needed democratic engagement. Data not only carries traces of our past. It is also a powerful tool to envisage different futures. There is no doubt that tools such as GPT-4 will change us. We would be fools to believe we may leverage those tools in the service of a variety of futures by merely imposing sets of ‘post-hoc’ regulatory constraints.

Frontiers | Editorial: Data science and artificial intelligence for (better) science

“Meaningful and explainable AI in research can only be fulfilled when as much data as possible is made FAIR (Findable, Accessible, Interoperable, and Reusable). How meaning is communicated in science “as precisely as possible” to machines when we formulate scientific concepts is a key question. Machine readability and interpretability are needed in order to make data and information “Fully AI-Ready” and support data-intensive research (Schultes et al.). The future of science is one where there is only “one computer” and FAIR services see all FAIR data and effectively access a global FAIR database….

Finally, the question is how to enable (better) open science. More relevant today than ever before is the reliance on access to data, artificial intelligence (AI) and machine learning (ML). Data access increasingly determines scientific discoveries and advancements. Data reuse is at the forefront of an emerging “third wave of open data” (Verhulst et al., 2020). But despite progress in implementing open data and FAIR principles, science data asymmetries (disparities in access to science data) are a growing problem and can undermine scientific progress. Comparative research is needed to document them (Verhulst and Young), for instance by investigating the creation of new types of data asymmetries through, e.g., new private-sector investments in data platforms and knowledge repositories; how data portability and interoperability impact the practice of data collaboration; and the relationship and interplay between existing asymmetries and technological and societal drivers. Finally, new methods for achieving a social license for data use and reuse toward the public good are needed, capturing multiple stakeholders’ acceptance of standard practices and procedures.”

Knowledge Pixels

“We aim to kick-start the next revolution in scientific publishing and knowledge sharing. With nanopublications as our core technology, we are taking the first steps towards our vision of the knowledge space to make the use of research results radically more efficient and effective….

We provide software and services to publish scientific findings in a way that is human readable and machine actionable at the same time. Our approach is open, decentralized, and in full accordance with the FAIR principles….”
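To make “human readable and machine actionable at the same time” concrete: a nanopublication packages a single finding as a small RDF dataset with three named graphs (assertion, provenance, and publication info), tied together by a head graph. The sketch below, written with the rdflib Python library, is our own minimal reconstruction of that structure, not Knowledge Pixels code; the finding, the example.org URIs, and the DOI are invented for illustration.

```python
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

NP = Namespace("http://www.nanopub.org/nschema#")  # nanopublication schema
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/np1#")           # invented URIs

ds = Dataset()
head = ds.graph(EX.head)
assertion = ds.graph(EX.assertion)
provenance = ds.graph(EX.provenance)
pubinfo = ds.graph(EX.pubinfo)

# Head graph: wires the three parts into one nanopublication.
head.add((EX.nanopub, RDF.type, NP.Nanopublication))
head.add((EX.nanopub, NP.hasAssertion, EX.assertion))
head.add((EX.nanopub, NP.hasProvenance, EX.provenance))
head.add((EX.nanopub, NP.hasPublicationInfo, EX.pubinfo))

# Assertion graph: the finding itself, one machine-actionable triple.
assertion.add((EX.speciesX, EX.occursIn, EX.regionY))

# Provenance graph: where the claim comes from (invented DOI).
provenance.add((EX.assertion, PROV.wasDerivedFrom,
                URIRef("https://doi.org/10.0000/example")))

# Publication-info graph: metadata about the nanopublication itself.
pubinfo.add((EX.nanopub, PROV.generatedAtTime,
             Literal("2023-05-01T00:00:00Z", datatype=XSD.dateTime)))

print(ds.serialize(format="trig"))
```

Because each part is a named graph, a service can index and query assertions separately from their provenance, which is what makes individual findings reusable across publishers.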

Introducing FAIR findings – Innovators and publishers join forces to make scientific articles’ findings machine-interpretable

“First pilot project by Knowledge Pixels starts with scholarly publishers Pensoft and IOS Press to publish findings as knowledge graph snippets by means of nanopublications….

Until AI technology “learns” how to interpret complex scientific literature and evaluate the data, methodology and evidence behind it, it is our responsibility to make sure that what we know today is optimally available to computer algorithms. In turn, those algorithms would be extremely helpful in assisting researchers to build on the knowledge of yesterday by delivering the right information at the right time in a ready-to-use format.

This is why the team behind Knowledge Pixels, a recent startup that develops software and services, devised a framework to publish scientific findings in a way that is simultaneously human-readable and machine-actionable. To do this, the startup teamed up with forward-looking scholarly publishers Pensoft and IOS Press to implement its goal….”

AI Is Tearing Wikipedia Apart

“As generative artificial intelligence continues to permeate all aspects of culture, the people who steward Wikipedia are divided on how best to proceed. During a recent community call, it became apparent that there is a community split over whether or not to use large language models to generate content. While some people expressed that tools like OpenAI’s ChatGPT could help with generating and summarizing articles, others remained wary. The concern is that machine-generated content has to be balanced with a lot of human review and would overwhelm lesser-known wikis with bad content. While AI generators are useful for writing believable, human-like text, they are also prone to including erroneous information, and even citing sources and academic papers which don’t exist. This often results in text summaries which seem accurate, but on closer inspection are revealed to be completely fabricated….”

[2305.00118] Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

Abstract:  In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.
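The name cloze query the authors use is simple to sketch: mask a single character name in a short passage, ask the model to fill in the blank with no other context, and treat an exact match as evidence the passage was memorized. Below is our own minimal, model-agnostic reconstruction of the idea; `ask_model` is a hypothetical stand-in for whatever LLM API is being probed, and the scoring is a plain exact-match rate rather than the paper's full protocol.

```python
from typing import Callable

def make_name_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Replace one character name in the passage with a mask token."""
    return passage.replace(name, mask)

def name_cloze_rate(
    examples: list[tuple[str, str]],   # (passage, masked name) pairs
    ask_model: Callable[[str], str],   # hypothetical wrapper around an LLM API
) -> float:
    """Fraction of masked names the model reproduces exactly."""
    hits = 0
    for passage, name in examples:
        prompt = (
            "Fill in the [MASK] with the proper name that appears in the "
            "original text. Answer with the name only.\n\n"
            + make_name_cloze(passage, name)
        )
        if ask_model(prompt).strip() == name:
            hits += 1
    return hits / len(examples)

# Usage with a stub "model" that happens to answer correctly:
examples = [("Call me Ishmael.", "Ishmael")]
print(name_cloze_rate(examples, lambda prompt: "Ishmael"))  # -> 1.0
```

A high exact-match rate on passages the model could not plausibly guess is what the authors read as memorization, since the masked names are rarely recoverable from context alone.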


Generative AI Meets Open Culture, Tue, May 2, 2023 at 10:00 AM | Eventbrite

“With the rise of generative artificial intelligence (AI), there has been increasing interest in how AI can be used in the description, preservation and dissemination of cultural heritage. While AI promises immense benefits, it also raises important ethical considerations.

In this session, leaders from Internet Archive, Creative Commons, and Wikimedia Foundation will discuss how public interest values can shape the development and deployment of AI in cultural heritage, including how to ensure that AI reflects diverse perspectives, promotes cultural understanding, and respects ethical principles such as privacy and consent.

Join us for a thought-provoking discussion on the future of AI in cultural heritage, and learn how we can work together to create a more equitable and responsible future.”