[2307.10320] Reproducibility in Machine Learning-Driven Research

Abstract:  Research is facing a reproducibility crisis, in which the results and findings of many studies are difficult or even impossible to reproduce. This is also the case in machine learning (ML) and artificial intelligence (AI) research. Often, this is the case due to unpublished data and/or source-code, and due to sensitivity to ML training conditions. Although different solutions to address this issue are discussed in the research community such as using ML platforms, the level of reproducibility in ML-driven research is not increasing substantially. Therefore, in this mini survey, we review the literature on reproducibility in ML-driven research with three main aims: (i) reflect on the current situation of ML reproducibility in various research fields, (ii) identify reproducibility issues and barriers that exist in these research fields applying ML, and (iii) identify potential drivers such as tools, practices, and interventions that support ML reproducibility. With this, we hope to contribute to decisions on the viability of different solutions for supporting ML reproducibility.



Wikipedia’s Moment of Truth – The New York Times, 18 July 2023

“…In late June, I began to experiment with a plug-in the Wikimedia Foundation had built for ChatGPT. At the time, this software tool was being tested by several dozen Wikipedia editors and foundation staff members, but it became available in mid-July on the OpenAI website for subscribers who want augmented answers to their ChatGPT queries. The effect is similar to the “retrieval” process that Jesse Dodge surmises might be required to produce accurate answers. GPT-4’s knowledge base is currently limited to data it ingested by the end of its training period, in September 2021. A Wikipedia plug-in helps the bot access information about events up to the present day. At least in theory, the tool — lines of code that direct a search for Wikipedia articles that answer a chatbot query — gives users an improved, combinatory experience: the fluency and linguistic capabilities of an A.I. chatbot, merged with the factuality and currency of Wikipedia….”
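The "retrieval" step described in the excerpt above can be illustrated with a toy sketch. This is not the Wikimedia plug-in's actual code: the in-memory `ARTICLES` store, the word-overlap scoring, and the `retrieve` and `build_prompt` helpers are all hypothetical stand-ins, chosen only to show how retrieved text might be placed in front of a chatbot query.

```python
import re

# Hypothetical stand-in for a Wikipedia search index (not real article text).
ARTICLES = {
    "Capsule endoscopy": "Capsule endoscopy is a procedure that records images "
                         "of the digestive tract using a swallowed camera.",
    "Open access": "Open access is a set of practices through which research "
                   "outputs are distributed online free of access charges.",
    "Transformer (deep learning)": "A transformer is a deep learning "
                                   "architecture based on the attention mechanism.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, articles, k=1):
    """Return the k (title, text) pairs sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        articles.items(),
        key=lambda item: len(query_words & tokenize(item[0] + " " + item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, articles, k=1):
    """Prepend the retrieved passages to the question (assumed prompt layout)."""
    context = "\n".join(
        f"[{title}] {text}" for title, text in retrieve(query, articles, k)
    )
    return f"Context:\n{context}\n\nQuestion: {query}"


query = "What is a transformer in deep learning?"
prompt = build_prompt(query, ARTICLES)
print(prompt)
```

A production system would presumably query a live search service and use a stronger ranking method than word overlap; the sketch only shows the general shape of combining retrieved text with a query.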


Cracking double-blind review: Authorship attribution with deep learning | PLOS ONE

Abstract:  Double-blind peer review is considered a pillar of academic research because it is perceived to ensure a fair, unbiased, and fact-centered scientific discussion. Yet, experienced researchers can often correctly guess from which research group an anonymous submission originates, biasing the peer-review process. In this work, we present a transformer-based, neural-network architecture that only uses the text content and the author names in the bibliography to attribute an anonymous manuscript to an author. To train and evaluate our method, we created the largest authorship-identification dataset to date. It leverages all research papers publicly available on arXiv amounting to over 2 million manuscripts. In arXiv-subsets with up to 2,000 different authors, our method achieves an unprecedented authorship attribution accuracy, where up to 73% of papers are attributed correctly. We present a scaling analysis to highlight the applicability of the proposed method to even larger datasets when sufficient compute capabilities are more widely available to the academic community. Furthermore, we analyze the attribution accuracy in settings where the goal is to identify all authors of an anonymous manuscript. Thanks to our method, we are not only able to predict the author of an anonymous work but we also provide empirical evidence of the key aspects that make a paper attributable. We have open-sourced the necessary tools to reproduce our experiments.


Small bowel capsule endoscopy examination and open access database with artificial intelligence: The SEE-artificial intelligence project – PMC

Abstract:  Objectives

Artificial intelligence (AI) may be practical for image classification of small bowel capsule endoscopy (CE). However, creating a functional AI model is challenging. We attempted to create a dataset and an object detection CE AI model to explore modeling problems to assist in reading small bowel CE.


Methods

We extracted 18,481 images from 523 small bowel CE procedures performed at Kyushu University Hospital from September 2014 to June 2021. We annotated 12,320 images with 23,033 disease lesions, combined them with 6161 normal images as the dataset, and examined its characteristics. Based on the dataset, we created an object detection AI model using YOLO v5 and validated it.


Results

We annotated the dataset with 12 types of annotations, and multiple annotation types were observed in the same image. We validated our AI model on 1396 images, and sensitivity for all 12 types of annotations was about 91%, with 1375 true positives, 659 false positives, and 120 false negatives detected. The highest sensitivity for individual annotations was 97%, and the highest area under the receiver operating characteristic curve was 0.98, but the quality of detection varied depending on the specific annotation.


Conclusions

An object detection AI model for small bowel CE using YOLO v5 may provide effective and easy-to-understand reading assistance. In this SEE-AI project, we open our dataset, the weights of the AI model, and a demonstration to experience our AI. We look forward to further improving the AI model in the future.
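As a rough check on the figures in the abstract above, the overall sensitivity can be recomputed from the reported counts (1375 true positives, 659 false positives, 120 false negatives). The sketch below uses the standard definitions of sensitivity and precision; the `detection_metrics` helper is hypothetical, and precision is not reported in the abstract, so it appears here only as a derived quantity.

```python
def detection_metrics(tp, fp, fn):
    """Compute sensitivity (recall) and precision from detection counts."""
    sensitivity = tp / (tp + fn)  # share of real lesions that were detected
    precision = tp / (tp + fp)    # share of detections that were real lesions
    return {"sensitivity": sensitivity, "precision": precision}


# Counts as reported in the abstract.
metrics = detection_metrics(tp=1375, fp=659, fn=120)
print(f"sensitivity = {metrics['sensitivity']:.3f}")
print(f"precision   = {metrics['precision']:.3f}")
```

With these counts the sensitivity comes out near 0.92, close to the "about 91%" stated in the abstract; the small gap may reflect rounding or a different aggregation across annotation types, which the abstract does not specify. The derived precision of roughly 0.68 reflects the comparatively large number of false positives.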

Clarivate Announces Partnership with AI21 Labs as part of its Generative AI Strategy to Drive Growth

“Clarivate Plc (NYSE: CLVT), a global leader in connecting people and organizations to intelligence they can trust to transform their world, today announced a strategic partnership with AI21 Labs, a pioneer in generative artificial intelligence (AI). The collaboration will integrate large language models into solutions from Clarivate, to enable intuitive academic conversational search and discovery, specifically designed to foster researcher excellence and drive success for researchers and students, while adhering to core academic principles and values.

AI has the potential to revolutionize the world, but its effectiveness relies heavily on the quality of the training data. With billions of trusted, curated articles, books, documents and proprietary best in class data points, Clarivate is well-placed to lead the market on this opportunity, providing customers with the highest quality open, licensed and proprietary content, data and insights while mitigating associated risks….”

Open Data on GitHub: Unlocking the Potential of AI

Abstract:  GitHub is the world’s largest platform for collaborative software development, with over 100 million users. GitHub is also used extensively for open data collaboration, hosting more than 800 million open data files, totaling 142 terabytes of data. This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research. We analyze the existing landscape of open data on GitHub and the patterns of how users share datasets. Our findings show that GitHub is one of the largest hosts of open data in the world and has experienced an accelerated growth of open data assets over the past four years. By examining the open data landscape on GitHub, we aim to empower users and organizations to leverage existing open datasets and improve their discoverability — ultimately contributing to the ongoing AI revolution to help address complex societal issues. We release the three datasets that we have collected to support this analysis as open datasets at this https URL.



Open Science and Software Assistance: Commentary on “Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora’s Box Has Been Opened”

Abstract:  Májovský and colleagues have investigated the important issue of ChatGPT being used for the complete generation of scientific works, including fake data and tables. The issues behind why ChatGPT poses a significant concern to research reach far beyond the model itself. Once again, the lack of reproducibility and visibility of scientific works creates an environment where fraudulent or inaccurate work can thrive. What are some of the ways in which we can handle this new situation?


Will AI Chatbots Boost Efforts to Make Scholarly Articles Free? Peter Baldwin interview | EdSurge News, May 23, 2023

“…Baldwin’s latest book, “Athena Unbound: Why and How Scholarly Knowledge Should Be Free for All,” looks at the history and future of the open access movement. And fittingly, his publisher made a version of the book available free online. This professor is not arguing that all information should be free. He’s focused on freeing up scholarship made by those who have full-time jobs at colleges, and who are thus not expecting payment from their writing to make a living. In fact, he argues that the whole idea of academic research hinges on work being shared freely so that other scholars can build on someone else’s idea or see from another scholar’s work that they might be going down a dead-end path. The typical open access model makes scholarly articles free to the public by charging authors a processing fee to have their work published in the journal. And in some cases that has caused new kinds of challenges, since those fees are often paid by college libraries, and not every scholar in every discipline has equal access to support. The number of open access journals has grown over the years. But the majority of scholarly journals still follow the traditional subscription model, according to recent estimates. EdSurge recently connected with Baldwin to talk about where he sees the movement going….”

EU governments to rein in unfair academic publishers and unsustainable fees | Science|Business

“Member states have almost settled on a call to make immediate open access the default, with no author fees. But some say the Council needs to do more to prevent AI-generated papers threatening the integrity of the scientific record.

Research ministers are nearing the finish line in drafting their position on changes needed in academic publishing, with the latest leaked draft revealing the near-final text.

The upcoming paper, to be adopted by ministers in late May, will call on policymakers and publishers to make immediate and unrestricted open access, “the default mode in publishing, with no fees for authors.”

While there is a warm reception for this, there is concern that the Council is missing the opportunity to crack down on AI-generated scientific articles, given rising evidence that the AI chatbots like ChatGPT could undermine the integrity of academic publishing….”

Defining a Machine-Readable Friendly License for Cloud Contribution Environments

Abstract:  There are two types of contribution environments that have been widely written about in the last decade – closed environments controlled by the promulgator and open access environments seemingly controlled by everyone and no one at the same time. In closed environments, the promulgator has the sole discretion to control both the intellectual property at hand and the integrity of that content. In open access environments the Intellectual Property (IP) is controlled to varying degrees by the Creative Commons License associated with the content. It is solely up to the promulgator to control the integrity of that content. Added to that, open access environments don’t offer native protection to data in such a way that the data can be accessed and utilized for Text Mining (TM), Natural Language Processing (NLP) or Machine Learning (ML). It is our intent in this paper to lay out a third option – that of a federated cloud environment wherein all members of the federation agree upon terms for copyright protection, the integrity of the data at hand, and the use of that data for Text Mining (TM), Natural Language Processing (NLP) or Machine Learning (ML).

Google shared AI knowledge with the world until ChatGPT caught up – The Washington Post

“In February, Jeff Dean, Google’s longtime head of artificial intelligence, announced a stunning policy shift to his staff: They had to hold off sharing their work with the outside world.

For years Dean had run his department like a university, encouraging researchers to publish academic papers prolifically; they pushed out nearly 500 studies since 2019, according to Google Research’s website.


But the launch of OpenAI’s groundbreaking ChatGPT three months earlier had changed things. The San Francisco start-up kept up with Google by reading the team’s scientific papers, Dean said at the quarterly meeting for the company’s research division. Indeed, transformers — a foundational part of the latest AI tech and the T in ChatGPT — originated in a Google study.

Things had to change. Google would take advantage of its own AI discoveries, sharing papers only after the lab work had been turned into products, Dean said, according to two people with knowledge of the meeting, who spoke on the condition of anonymity to share private information….”

[2303.14334] The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces

Abstract:  Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, the need for new technology to support the reading process grows. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has changed little in decades. The PDF format for sharing research papers is widely used due to its portability, but it has significant downsides including: static content, poor accessibility for low-vision readers, and difficulty reading on mobile devices. This paper explores the question “Can recent advances in AI and HCI power intelligent, interactive, and accessible reading interfaces — even for legacy PDFs?” We describe the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Through this project, we’ve developed ten research prototype interfaces and conducted usability studies with more than 300 participants and real-world users showing improved reading experiences for scholars. We’ve also released a production reading interface for research papers that will incorporate the best features as they mature. We structure this paper around challenges scholars and the public face when reading research papers — Discovery, Efficiency, Comprehension, Synthesis, and Accessibility — and present an overview of our progress and remaining open challenges.