Past and future uses of text mining in ecology and evolution | Proceedings of the Royal Society B: Biological Sciences

Abstract:  Ecology and evolutionary biology, like other scientific fields, are experiencing an exponential growth of academic manuscripts. As domain knowledge accumulates, scientists will need new computational approaches for identifying relevant literature to read and include in formal literature reviews and meta-analyses. Importantly, these approaches can also facilitate automated, large-scale data synthesis tasks and build structured databases from the information in the texts of primary journal articles, books, grey literature, and websites. The increasing availability of digital text, computational resources, and machine-learning-based language models has led to a revolution in text analysis and natural language processing (NLP) in recent years. NLP has been widely adopted across the biomedical sciences but is rarely used in ecology and evolutionary biology. Applying computational tools from text mining and NLP will increase the efficiency of data synthesis, improve the reproducibility of literature reviews, formalize analyses of research biases and knowledge gaps, and promote data-driven discovery of patterns across ecology and evolutionary biology. Here we present recent use cases from ecology and evolution, and discuss future applications, limitations and ethical issues.
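The literature-triage task the abstract describes can be illustrated with a minimal keyword-screening sketch. Everything below is hypothetical (the mini-corpus, the query terms, and the hit-count scoring rule) and far simpler than the NLP models the authors have in mind, but it shows the basic shape of automated relevance screening:

```python
import re
from collections import Counter

# Hypothetical mini-corpus of abstracts a reviewer might screen.
ABSTRACTS = {
    "doc1": "Meta-analysis of pollinator decline across European grasslands.",
    "doc2": "A new firmware update process for embedded routers.",
    "doc3": "Phylogenetic signal in thermal tolerance of reef fish.",
}

# Illustrative query terms for an ecology/evolution review.
QUERY = {"meta-analysis", "pollinator", "phylogenetic", "thermal", "ecology"}

def tokenize(text):
    """Lowercase and split on anything that is not a letter or hyphen."""
    return [t for t in re.split(r"[^a-z-]+", text.lower()) if t]

def relevance(text, query=QUERY):
    """Score = number of query-term hits in the text."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in query)

def screen(abstracts, threshold=1):
    """Return ids of documents meeting the threshold, highest score first."""
    scored = {doc: relevance(text) for doc, text in abstracts.items()}
    return sorted((d for d, s in scored.items() if s >= threshold),
                  key=lambda d: -scored[d])
```

A real pipeline would replace the hit count with trained relevance models, but the screen-then-rank structure is the same.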


Meta has built a massive new language AI—and it’s giving it away for free | MIT Technology Review

“Meta’s AI lab has created a massive new language model that shares both the remarkable abilities and the harmful flaws of OpenAI’s pioneering neural network GPT-3. And in an unprecedented move for Big Tech, it is giving it away to researchers—together with details about how it was built and trained….

Meta’s move is the first time that a fully trained large language model will be made available to any researcher who wants to study it. The news has been welcomed by many concerned about the way this powerful technology is being built by small teams behind closed doors….”

The future of research revealed | April 20, 2022

“The research ecosystem has been undergoing rapid and profound change, accelerated by COVID-19. This transformation is being fueled by many factors, including advances in technology, funding challenges and opportunities, political uncertainty, and new pressures on women in research. At Elsevier, we have been working with the global research community to better understand these changes and what the world of research might look like in the future. The results were published today in Elsevier’s new Research Futures Report 2.0. The report is free to read and download….”

Full article: Emergence of New Public Discovery Services: Connecting Open Web Searchers to Library Content

Abstract:  A growing number of new public citation databases, available free of charge and accessible on the open web, are offering researchers a new place to start their searching, providing an alternative to Google Scholar and library resources. These new “public discovery services” index significant portions of the scholarly literature, then differentiate themselves by applying technologies like artificial intelligence and machine learning to create result sets that are promoted as more meaningful, easier to navigate, and more engaging than other discovery options. Additionally, these new public discovery services are adopting new linking technologies that connect researchers from citation records to full-text content licensed on their behalf by their affiliated libraries. With these new sites logging millions of sessions a month, they present unique opportunities for libraries to connect to researchers working outside the library, and challenges in how the library can make itself obvious in the user workflow.


How research is being transformed by open data and AI | Popular Science

“How iNaturalist can correctly recognize (most of the time, at least) different living organisms is thanks to a machine-learning model that works off of data collected by its original app, which first debuted in 2008 and is simply called iNaturalist. Its goal is to help people connect to the richly animated natural world around them. 

The iNaturalist platform, which boasts around 2 million users, is a mashup of social networking and citizen science where people can observe, document, share, discuss, learn more about nature, and create data for science and conservation. Outside of taking photos, the iNaturalist app has extended capabilities compared to the gamified Seek. It has a news tab, local wildlife guides, and organizations can also use the platform to host data collection “projects” that focus on certain areas or certain species of interest. 

When new users join iNaturalist, they’re prompted to check a box that allows them to share their data with scientists (although you can still join if you don’t check the box). Images and location information that users agree to share are tagged with a Creative Commons license; otherwise, they’re held under an all-rights-reserved license. About 70 percent of the data on the platform is classified as Creative Commons. “You can think of iNaturalist as this big open data pipe that just goes out there into the scientific community and is used by scientists in many ways that we’re totally surprised by,” says Scott Loarie, co-director of iNaturalist. …

But with an ever-growing amount of data, wrangling these numbers and stats manually becomes virtually impossible. “You would only be able to handle these quantities of data using very advanced computing techniques. This is part of the scientific world we live in today,” Durant adds….

Another problem that researchers have to consider is maintaining the quality of big datasets, which can impinge on the effectiveness of analytics tools. This is where the peer-review process plays an important role….”


Archives, Access and Artificial Intelligence | Transcript Publishing

“Digital archives are transforming the Humanities and the Sciences. Digitized collections of newspapers and books have pushed scholars to develop new, data-rich methods. Born-digital archives are now better preserved and managed thanks to the development of open-access and commercial software. Digital Humanities have moved from the fringe to the center of academia. Yet, the path from the appraisal of records to their analysis is far from smooth. This book explores crossovers between various disciplines to improve the discoverability, accessibility, and use of born-digital archives and other cultural assets….


Pages 7 – 28

Chapter 1: Artificial Intelligence and Discovering the Digitized Photoarchive
Pages 29 – 60

Chapter 2: Web Archives and the Problem of Access: Prototyping a Researcher Dashboard for the UK Government Web Archive
Pages 61 – 82

Chapter 3: Design Thinking, UX and Born-digital Archives: Solving the Problem of Dark Archives Closed to Users
Pages 83 – 108

Chapter 4: Towards Critically Addressable Data for Digital Library User Studies
Pages 109 – 130

Chapter 5: Reviewing the Reviewers: Training Neural Networks to Read Peer Review Reports
Pages 131 – 156

Chapter 6: Supervised and Unsupervised: Approaches to Machine Learning for Textual Entities
Pages 157 – 178

Chapter 7: Inviting AI into the Archives: The Reception of Handwritten Recognition Technology into Historical Manuscript Transcription
Pages 179 – 204

AFTERWORD: Towards a new Discipline of Computational Archival Science (CAS)
Pages 205 – 218 …

[From the Introduction:]

The closure of libraries, archives and museums due to the COVID-19 pandemic has highlighted the urgent need to make archives and cultural heritage materials accessible in digital form. Yet too many born-digital and digitized collections remain closed to researchers and other users due to privacy concerns, copyright and other issues. Born-digital archives are rarely accessible to users. For example, the archival emails of the writer Will Self at the British Library are not listed on the Finding Aid describing the collection, and they are not available to users either onsite or offsite. At a time when emails have largely replaced letters, this severely limits the amount of content openly accessible in archival collections. Even when digital data is publicly available (as in the case of web archives), users often need to physically travel to repositories to consult web pages. In the case of digitized collections, copyright can also be a major obstacle to access. For instance, copyright-protected texts are not available for download from HathiTrust, a not-for-profit collaborative of academic and research libraries preserving 17+ million digitized items (including around 61% not in the public domain)….

It is important to recognize that “dark” archives contain vast amounts of data essential to scholars – including email correspondence….”

The need for open access and natural language processing | PNAS

“In PNAS, Chu and Evans (1) argue that the rapidly rising number of publications in any given field actually hinders progress. The rationale is that, if too many papers are published, the really novel ideas have trouble finding traction, and more and more people tend to “go along with the majority.” Review papers are cited more and more instead of original research. We agree with Chu and Evans: Scientists simply cannot keep up. This is why we argue that we must bring the powers of artificial intelligence/machine learning (AI/ML) and open access to the forefront. AI/ML is a powerful tool and can be used to ingest and analyze large quantities of data in a short period of time. For example, some of us (2) have used AI/ML tools to ingest 500,000+ abstracts from online archives (relatively easy to do today) and categorize them for strategic planning purposes. This letter offers a short follow-on to Chu and Evans (hereafter CE) to point out a way to mitigate the problems they delineate….
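The letter mentions ingesting hundreds of thousands of abstracts and categorizing them for strategic planning. A toy version of such categorization, using bag-of-words vectors and cosine similarity against hand-written category "centroids", can be sketched in pure Python; the categories, seed terms, and abstract below are all invented, and the real system would use far richer representations:

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words vector as a Counter of lowercase tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical category descriptions used as centroids.
CATEGORIES = {
    "materials": bow("alloy corrosion coating materials fatigue"),
    "propulsion": bow("engine combustion propulsion thrust fuel"),
}

def categorize(abstract):
    """Assign the abstract to the most similar category."""
    vec = bow(abstract)
    return max(CATEGORIES, key=lambda c: cosine(vec, CATEGORIES[c]))
```

Run over a large open-access corpus, even a scheme this simple yields a first-pass map of where a field's output is concentrated.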

In conclusion, we agree with CE (1) on the problems caused by the rapid rise in scientific publications, outpacing any individual’s ability to keep up. We propose that open access, combined with NLP, can help effectively organize the literature, and we encourage publishers to make papers open access, archives to make papers easily findable, and researchers to employ their own NLP as an important tool in their arsenal.”

Springer Nature and Université Grenoble Alpes release free enhanced integrity software to tackle fake scientific papers | Corporate Affairs Homepage | Springer Nature

“Springer Nature today announces the release of PySciDetect, its next generation open source research integrity software to identify fake research. Developed in-house with Slimmer AI and released in collaboration with Dr Cyril Labbé from Université Grenoble Alpes, PySciDetect is available for all publishers and those within the academic community to download and use. …


Since launch, SciDetect has been used by a number of publishers and members of the academic community. For Springer Nature, it has scanned over 3.8 million journal articles and over 2.5 million book chapters. Available as open source software, PySciDetect expands on Springer Nature’s commitment to being an active partner to the community, supporting a collaborative effort to detect fraudulent scientific research and protect the integrity of the academic record.”
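SciDetect has been reported to flag suspect papers by measuring their textual similarity to corpora of known automatically generated texts. The sketch below captures only that general idea, using Jaccard overlap of vocabularies; the sample "known fake" sentence and the 0.5 threshold are illustrative, not drawn from the actual tool:

```python
import re

def word_set(text):
    """Set of lowercase word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical snippet in the style of an auto-generated paper.
KNOWN_FAKE = word_set(
    "we disprove the construction of randomized algorithms "
    "using homogeneous epistemologies"
)

def looks_generated(text, threshold=0.5):
    """Flag a text whose vocabulary overlaps a known fake too closely."""
    return jaccard(word_set(text), KNOWN_FAKE) >= threshold
```

A production detector would compare against many reference texts and use distance measures robust to paraphrase, but the compare-against-known-fakes structure is the same.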

Dataset of Indian and Thai banknotes with annotations – ScienceDirect

Abstract:  Multinational banknote detection in real-time environments is an open research problem for the research community. Several studies have sought fast and accurate recognition of banknotes, detection of counterfeit banknotes, and identification of damaged banknotes. State-of-the-art techniques such as machine learning (ML) and deep learning (DL) are displacing the traditional digital image processing methods used for banknote classification. The success of ML or DL projects depends heavily on the size and comprehensiveness of the dataset used. The available datasets have the following limitations:

 1. The size of the existing Indian dataset is insufficient to train ML or DL models [1], [2].

 2. The existing dataset fails to cover all denomination classes [1].

 3. The existing dataset does not include the latest denominations [3].

 4. According to the literature survey, no public, open-access dataset is available for Thai banknotes.

To overcome these limitations, we created a dataset of 3000 images of Indian and Thai banknotes: 2000 images of Indian banknotes and 1000 images of Thai banknotes. The Indian banknotes consist of old and new notes of 10, 20, 50, 100, 200, 500 and 2000 rupees, and the Thai banknotes consist of 20, 50, 100, 500 and 1000 baht.
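A class-labeled image dataset like this is typically organized so that the labels (currency and denomination) can be recovered programmatically, e.g. from filenames. The filename scheme below is hypothetical, not the one the dataset's authors used:

```python
# Hypothetical filenames encoding currency and denomination,
# e.g. "inr_500_0001.jpg" = Indian rupee, 500-denomination, image 1.
FILES = [
    "inr_10_0001.jpg", "inr_2000_0004.jpg",
    "thb_20_0002.jpg", "thb_1000_0003.jpg",
]

def parse_label(filename):
    """Split 'currency_denomination_index.jpg' into (currency, denomination)."""
    currency, denom, _ = filename.split("_", 2)
    return currency, int(denom)

def by_class(files):
    """Group filenames by their (currency, denomination) class label."""
    classes = {}
    for f in files:
        classes.setdefault(parse_label(f), []).append(f)
    return classes
```

Grouping by class in this way is the usual first step before stratified train/test splitting, which keeps every denomination represented in both splits.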

Frontiers | Sounding the Call for a Global Library of Underwater Biological Sounds | Ecology and Evolution

Abstract:  Aquatic environments encompass the world’s most extensive habitats, rich with sounds produced by a diversity of animals. Passive acoustic monitoring (PAM) is an increasingly accessible remote sensing technology that uses hydrophones to listen to the underwater world and represents an unprecedented, non-invasive method to monitor underwater environments. This information can assist in the delineation of biologically important areas via detection of sound-producing species or characterization of ecosystem type and condition, inferred from the acoustic properties of the local soundscape. At a time when worldwide biodiversity is in significant decline and underwater soundscapes are being altered as a result of anthropogenic impacts, there is a need to document, quantify, and understand biotic sound sources–potentially before they disappear. A significant step toward these goals is the development of a web-based, open-access platform that provides: (1) a reference library of known and unknown biological sound sources (by integrating and expanding existing libraries around the world); (2) a data repository portal for annotated and unannotated audio recordings of single sources and of soundscapes; (3) a training platform for artificial intelligence algorithms for signal detection and classification; and (4) a citizen science-based application for public users. Although individually, these resources are often met on regional and taxa-specific scales, many are not sustained and, collectively, an enduring global database with an integrated platform has not been realized. 
We discuss the benefits such a program can provide, previous calls for global data-sharing and reference libraries, and the challenges that need to be overcome to bring together bio- and ecoacousticians, bioinformaticians, propagation experts, web engineers, and signal processing specialists (e.g., artificial intelligence) with the necessary support and funding to build a sustainable and scalable platform that could address the needs of all contributors and stakeholders into the future.

and CORE cooperate to build AI Chemist – Research

“CORE and are extremely pleased to announce the initiation of a new research collaboration funded by the Norwegian Research Council.

Discovering scientific insights about a specific topic is challenging, particularly in an area like chemistry which is one of the top-five most published fields with over 11 million publications and 307,000 patents. The team at have spent the last 5 years building an award-winning AI engine for scientific text understanding. Their patented algorithms for identifying text similarity, extracting tabular data and creating domain-specific entity representations mean they are world leaders in this domain. 

The AI Chemist project is a collaboration between and The Open University, Oxford University, Trinity College Dublin, and University College London. CORE is a not-for-profit platform delivered by The Open University in cooperation with Jisc that hosts the world’s largest collection of open access scientific articles. As of February 2022, the CORE dataset provides metadata information (title, author, abstract, publishing year, etc.) for approximately 210 million articles, and the full text for 29.5 million articles.”
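A metadata-plus-full-text corpus of this kind is typically consumed as structured records. The sketch below parses one such record; the JSON field names (`yearPublished`, `fullText`, etc.) are assumptions for illustration, not the documented CORE schema:

```python
import json

# Hypothetical JSON record carrying the metadata fields the CORE dataset
# is described as providing (title, authors, abstract, publication year).
RECORD = json.loads("""
{
  "title": "Text mining for chemistry",
  "authors": [{"name": "A. Researcher"}, {"name": "B. Scientist"}],
  "abstract": "We survey text mining methods...",
  "yearPublished": 2021,
  "fullText": null
}
""")

def summarize(record):
    """Reduce a record to the fields needed for a citation index."""
    return {
        "title": record["title"],
        "authors": [a["name"] for a in record.get("authors", [])],
        "year": record.get("yearPublished"),
        "has_full_text": record.get("fullText") is not None,
    }
```

The `has_full_text` flag matters in practice: per the figures quoted above, only a fraction of the ~210 million metadata records come with full text.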

Using open access research in our battle against misinformation – Research

“While scientific papers have traditionally been seen as a source of mostly trustworthy information, their use within automated tools in the fight against misinformation, such as claims about vaccine effectiveness or climate change, has been rather limited….

At CORE, we are committed to a more transparent society, free of misinformation. Our data services, providing access to machine-readable information from across the global network of open repositories, are a treasure trove for this use case.

We are therefore excited to support an innovative startup, Consensus, a search engine designed to perform evidence retrieval and assessment over scientific insights.  …”


Cancers | Free Full-Text | Characterizing Malignant Melanoma Clinically Resembling Seborrheic Keratosis Using Deep Knowledge Transfer

Abstract:  Malignant melanomas resembling seborrheic keratosis (SK-like MMs) are atypical, challenging-to-diagnose melanoma cases that carry the risk of delayed diagnosis and inadequate treatment. On the other hand, SK may mimic melanoma, producing a ‘false positive’ with unnecessary lesion excisions. The present study proposes a computer-based approach using dermoscopy images for the characterization of SK-like MMs. Dermoscopic images were retrieved from the International Skin Imaging Collaboration archive. Exploiting image embeddings from the pretrained convolutional network VGG16, we trained a support vector machine (SVM) classification model on a data set of 667 images. SVM optimal hyperparameter selection was carried out using the Bayesian optimization method. The classifier was tested on an independent data set of 311 images with atypical appearance: the MMs lacked a pigmented network and exhibited milia-like cysts, whereas the SK lesions lacked milia-like cysts and had a pigmented network. Atypical MMs were characterized with a sensitivity and specificity of 78.6% and 84.5%, respectively. The advent of deep learning in image recognition has attracted the interest of computer science towards improved skin lesion diagnosis. Open-source, public-access archives of skin images further empower the implementation and validation of computer-based systems that might contribute significantly to complex clinical diagnostic problems such as the characterization of SK-like MMs.
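The reported 78.6% sensitivity and 84.5% specificity follow the standard definitions, which can be computed directly from true and predicted labels. The labels and predictions below are made up for illustration, not taken from the study:

```python
def confusion(y_true, y_pred, positive="MM"):
    """Count TP, FN, TN, FP, treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp, fn, tn, fp

def sensitivity_specificity(y_true, y_pred, positive="MM"):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp, fn, tn, fp = confusion(y_true, y_pred, positive)
    return tp / (tp + fn), tn / (tn + fp)
```

Here sensitivity is the fraction of true melanomas the classifier catches, and specificity is the fraction of benign SK lesions it correctly leaves alone; the trade-off between them is what makes the SK-like-MM problem clinically hard.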

Artificial Intelligence for Public Domain Drug Discovery: Recommendations for Policy Development

“The current drug discovery market is not responding sufficiently to health care needs where it is not adequately lucrative to do so. Unfortunately, there are a number of important yet non-lucrative fields of research in domains including pandemic prevention and antimicrobial resistance, with major current and future costs for society. In these domains, where high-risk public health needs are being met with low R&D investment, government intervention is critical. To maximize the efficiency of the government’s involvement, it is recommended that the government couple its work catalyzing R&D with the creation of a drug development ecosystem that is more conducive to the use of high-impact artificial intelligence (AI) technologies. The scientific and political communities have been ringing alarm-bells over the threat of bacterial resistance to our current antibiotics arsenal and, more generally, the evolving resistance of microbes to existing drugs. Yet, a combination of technical capacity issues and economic barriers has led to an almost complete halt of R&D into treatments that would otherwise address this threat. When a gap arises between what the market is incentivized to produce and the healthcare needs of society, governments must step in. The COVID-19 pandemic illustrates the importance of bridging that gap to ensure we are protected from future threats that would result in similarly devastating consequences. Artificial intelligence (AI) capabilities have contributed to watershed moments across a variety of industries already. The transformative power of AI is showing early signs of success in the drug discovery industry as well. Should AI for drug discovery reach its full potential, it offers the ability to discover new categories of effective drugs, enable intelligent, targeted design of novel therapies, vastly improve the speed and cost of running clinical trials, and further our understanding about the basic science underlying drug and disease mechanics. 
However, the current drug discovery ecosystem is suboptimal for AI research, and this threatens to limit the positive impact of AI. The field requires a shift towards open data and open science in order to feed the most powerful, data-hungry AI algorithms. This shift will catalyze research in areas of high social impact, such as addressing neglected diseases and developing new antibiotic solutions to incoming drug-resistant threats. Yet, while open science and AI promise successes on producing new compounds, they cannot address the challenges associated with market-failure for certain drug categories. Government interventions to stimulate AI-driven pharmaceutical innovation for these drug categories must therefore target the entire drug development and deployment lifecycle to ensure that the benefits of AI technology, as applied to the pharmaceutical industry, result in strong value added to improve healthcare outcomes for the public….

This document puts forward a set of recommendations that, taken together, task governments with the responsibility to promote: 1. Research and development in fields of drug discovery that are valuable to society and necessary to public health, but for which investments are currently insufficient because of market considerations. 2. Uptake of AI throughout the entire drug discovery and development pipeline. 3. A shift in culture and capabilities towards more open-data among stakeholders in academia and industry when undertaking research on drug discovery and development….”