sci2sci – “GitHub for scientists” – AI-friendly research data management and publishing platform | Grants | Gitcoin

“At sci2sci, we are building an electronic lab notebook and a publishing platform in one interface. This will allow researchers to store all experimental data and metadata in one place, and quickly release it to public access with one click.

In a nutshell, we offer full-stack data publishing – from experiment planning through raw data acquisition and analysis to the final research report – all in a single platform, with a number of benefits that a conventional journal PDF manuscript cannot offer:…”


“Open access to shared information is essential for the development and evolution of artificial intelligence (AI) and AI-powered solutions needed to address the complex challenges facing the nation and the world. The Open Knowledge Network (OKN), an interconnected network of knowledge graphs, would provide an essential public-data infrastructure for enabling an AI-driven future. It would facilitate the integration of diverse data needed to develop solutions to drive continued strong economic growth, expand opportunities, and address complex problems from climate change to social equity. The OKN Roadmap describes the key characteristics of the OKN and essential considerations in taking the effort forward in an effective and sustainable manner….”

NSF releases Open Knowledge Network Roadmap report

“The U.S. National Science Foundation today published the Open Knowledge Network Roadmap – Powering the next data revolution report that outlines a strategy for establishing an open and accessible national resource to power 21st century data science and next-generation artificial intelligence. Establishing such a knowledge infrastructure would integrate the diverse data needed to sustain strong economic growth, expand opportunities to engage in data analysis, and address complex national challenges such as climate change, misinformation, disruptions from pandemics, economic equity and diversity….”

PLOS partners with DataSeer to develop Open Science Indicators – The Official PLOS Blog

“To provide richer and more transparent information on how PLOS journals support best practice in Open Science, we’re going to begin publishing data on ‘Open Science Indicators’ observed in PLOS articles. These Open Science Indicators will initially include (i) sharing of research data in repositories, (ii) public sharing of code and, (iii) preprint posting, for all PLOS articles from 2019 to present. These indicators – conceptualized by PLOS and developed with DataSeer, using an artificial intelligence-driven approach – are increasingly important to PLOS achieving its mission. We plan to share the results openly to support Open Science initiatives by the wider community.”

Google AI Blog: Announcing the Patent Phrase Similarity Dataset

“Patent documents typically use legal and highly technical language, with context-dependent terms that may have meanings quite different from colloquial usage and even between different documents. The process of using traditional patent search methods (e.g., keyword searching) to search through the corpus of over one hundred million patent documents can be tedious and result in many missed results due to the broad and non-standard language used. For example, a “soccer ball” may be described as a “spherical recreation device”, “inflatable sportsball” or “ball for ball game”. Additionally, the language used in some patent documents may obfuscate terms to their advantage, so more powerful natural language processing (NLP) and semantic similarity understanding can give everyone access to do a thorough search.

The patent domain (and more general technical literature like scientific publications) poses unique challenges for NLP modeling due to its use of legal and technical terms. While there are multiple commonly used general-purpose semantic textual similarity (STS) benchmark datasets (e.g., STS-B, SICK, MRPC, PIT), to the best of our knowledge, there are currently no datasets focused on technical concepts found in patents and scientific publications (the somewhat related BioASQ challenge contains a biomedical question answering task). Moreover, with the continuing growth in size of the patent corpus (millions of new patents are issued worldwide every year), there is a need to develop more useful NLP models for this domain.

Today, we announce the release of the Patent Phrase Similarity dataset, a new human-rated contextual phrase-to-phrase semantic matching dataset, and the accompanying paper, presented at the SIGIR PatentSemTech Workshop, which focuses on technical terms from patents. The Patent Phrase Similarity dataset contains ~50,000 rated phrase pairs, each with a Cooperative Patent Classification (CPC) class as context. In addition to similarity scores that are typically included in other benchmark datasets, we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain related. This dataset (distributed under the Creative Commons Attribution 4.0 International license) was used by Kaggle and USPTO as the benchmark dataset in the U.S. Patent Phrase to Phrase Matching competition to draw more attention to the performance of machine learning models on technical text. Initial results show that models fine-tuned on this new dataset perform substantially better than general pre-trained models without fine-tuning….”
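The (anchor, target, context, score) format used in the Kaggle competition makes the task easy to prototype. As a deliberately naive, hedged illustration – not Google's model, and with invented phrase pairs and scores – the sketch below scores phrase pairs with a character-overlap baseline, the kind of weak baseline that fine-tuned models are reported to beat:

```python
from difflib import SequenceMatcher

# Hypothetical phrase pairs in the shape used by the Kaggle
# "U.S. Patent Phrase to Phrase Matching" competition:
# (anchor, target, CPC context, human similarity score in [0, 1]).
# The pairs and gold scores here are invented for illustration.
pairs = [
    ("soccer ball", "spherical recreation device", "A63", 0.75),
    ("soccer ball", "inflatable sportsball",       "A63", 0.75),
    ("soccer ball", "soccer ball",                 "A63", 1.00),
    ("soccer ball", "hydraulic pump",              "A63", 0.00),
]

def char_overlap(a: str, b: str) -> float:
    """Naive baseline: character-sequence overlap ratio.

    Real submissions used fine-tuned transformer models; this crude
    surface measure is exactly what fails on pairs like
    "soccer ball" / "spherical recreation device".
    """
    return SequenceMatcher(None, a, b).ratio()

preds = [char_overlap(anchor, target) for anchor, target, _, _ in pairs]
gold = [score for *_, score in pairs]
```

Identical strings get a perfect overlap score, but the semantically close paraphrases do not, which is the gap the human-rated dataset is meant to expose.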

The Open Knowledge Network

“OKN, the open knowledge network, is a concept that envisions connecting many public, government, and other open datasets into one large connected information infrastructure. The benefit of creating this infrastructure is that government agencies, private organizations, academics, and anyone can better access and utilize open data to drive evidence-based policies, develop novel AI capabilities, and tackle societal problems.

The focus of the OKN innovation sprint was to construct a roadmap to building a prototype OKN through the development of compelling use cases to illustrate what might be possible with a fully functional open knowledge network….”

Towards Robust, Reproducible, and Clinically Actionable EEG Biomarkers: Large Open Access EEG Database for Discovery and Out-of-sample Validation – Hanneke van Dijk, Mark Koppenberg, Martijn Arns, 2022

“To aid researchers in development and validation of EEG biomarkers, and development of new (AI) methodologies, we hereby also announce our open access EEG dataset: the Two Decades Brainclinics Research Archive for Insights in Neuroscience (TDBRAIN)….

The whole raw EEG dataset as well as python code to preprocess the raw data is available at www.brainclinics.com/resources and can freely be downloaded using ORCID credentials….”
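The released preprocessing code itself lives at the Brainclinics link above. As a generic, hedged illustration of the kind of step such a pipeline performs – not the TDBRAIN code, and with an illustrative sampling rate and band edges – here is a standard zero-phase band-pass filter applied to a synthetic EEG-like signal with SciPy:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 500.0  # assumed sampling rate in Hz (illustrative, not TDBRAIN's)
t = np.arange(0, 2.0, 1.0 / fs)

# Synthetic "EEG": a 10 Hz alpha-band oscillation plus 50 Hz line noise.
signal = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)

# 4th-order Butterworth band-pass, 1-40 Hz, applied forward and
# backward (filtfilt) so the filter adds no phase distortion.
b, a = butter(4, [1.0, 40.0], btype="bandpass", fs=fs)
filtered = filtfilt(b, a, signal)

# After filtering, the 50 Hz line noise is strongly attenuated
# while the in-band 10 Hz component passes almost unchanged.
```

Real EEG preprocessing typically adds re-referencing, artifact rejection, and epoching on top of filtering; this shows only the filtering step.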

[2208.06178] Mining Legal Arguments in Court Decisions

Abstract:  Identifying, classifying, and analyzing arguments in legal discourse has been a prominent area of research since the inception of the argument mining field. However, there has been a major discrepancy between the way natural language processing (NLP) researchers model and annotate arguments in court decisions and the way legal experts understand and analyze legal argumentation. While computational approaches typically simplify arguments into generic premises and claims, arguments in legal research usually exhibit a rich typology that is important for gaining insights into the particular case and applications of law in general. We address this problem and make several substantial contributions to move the field forward. First, we design a new annotation scheme for legal arguments in proceedings of the European Court of Human Rights (ECHR) that is deeply rooted in the theory and practice of legal argumentation research. Second, we compile and annotate a large corpus of 373 court decisions (2.3M tokens and 15k annotated argument spans). Finally, we train an argument mining model that outperforms state-of-the-art models in the legal NLP domain and provide a thorough expert-based evaluation. All datasets and source codes are available under open licenses at this https URL.
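The corpus's 15k argument spans imply token-level supervision, and a common way to train a span tagger is to project span annotations onto per-token BIO labels. The sketch below does that projection for a toy sentence with a hypothetical label name; the paper's actual ECHR annotation scheme is much richer:

```python
def spans_to_bio(tokens, spans):
    """Project (start, end_exclusive, label) token spans onto BIO tags,
    a standard encoding for training sequence-labeling span taggers."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

tokens = ["The", "applicant", "alleged", "a", "violation", "of", "Article", "6"]
# Hypothetical annotation: tokens 3-7 form one argument span.
# "APPLICANT_ARG" is an invented label, not one from the paper's scheme.
spans = [(3, 8, "APPLICANT_ARG")]
tags = spans_to_bio(tokens, spans)
```

A tagger trained on such labels can then recover argument spans of each type from unseen decisions.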


‘The entire protein universe’: AI predicts shape of nearly every known protein

“From today, determining the 3D shape of almost any protein known to science will be as simple as typing in a Google search.

Researchers have used AlphaFold — the revolutionary artificial-intelligence (AI) network — to predict the structures of some 200 million proteins from 1 million species, covering nearly every known protein on the planet.

The data dump will be freely available on a database set up by DeepMind, Google’s London-based AI company that developed AlphaFold, and the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), an intergovernmental organization near Cambridge, UK….”

How AI could help make Wikipedia entries more accurate

“Automated tools can help identify gibberish or statements that lack citations, but helping human editors determine whether a source actually backs up a claim is a much more complex task — one that requires an AI system’s depth of understanding and analysis.

Building on Meta AI’s research and advancements, we’ve developed the first model capable of automatically scanning hundreds of thousands of citations at once to check whether they truly support the corresponding claims. It’s open-sourced here, and you can see a demo of our verifier here. As a knowledge source for our model, we created a new dataset of 134 million public webpages — an order of magnitude larger and significantly more intricate than ever used for this sort of research. It calls attention to questionable citations, allowing human editors to evaluate the cases most likely to be flawed without having to sift through thousands of properly cited statements. If a citation seems irrelevant, our model will suggest a more applicable source, even pointing to the specific passage that supports the claim. Eventually, our goal is to build a platform to help Wikipedia editors systematically spot citation issues and quickly fix the citation or correct the content of the corresponding article at scale….”
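Meta's verifier is a large neural retrieval-and-entailment model; as a deliberately tiny, hedged stand-in for the idea, the sketch below ranks candidate passages from a source page by word overlap with a claim, so that an editor sees the best-supporting passage first. The claim, passages, and scoring function are all invented illustrations, not Meta's method:

```python
def support_score(claim: str, passage: str) -> float:
    """Fraction of the claim's words that also appear in the passage –
    a crude lexical proxy for the neural verification Meta describes."""
    claim_words = set(claim.lower().split())
    passage_words = set(passage.lower().split())
    return len(claim_words & passage_words) / len(claim_words)

claim = "the bridge opened in 1937"
passages = [
    "construction began after several delays",
    "the bridge finally opened to traffic in 1937",
    "tolls were collected in both directions until 1968",
]

# Rank passages so the one most likely to support the claim comes first;
# a low top score would flag the citation as questionable.
ranked = sorted(passages, key=lambda p: support_score(claim, p), reverse=True)
```

The production system replaces this lexical overlap with learned dense representations over 134 million webpages, but the editor-facing output – a best-supporting passage plus a flag on weak matches – has the same shape.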

Data Protections and Licenses Affecting Text and Data Mining for Machine Learning


Abstract:  Machines don’t read works or data. Machines need to first abstract and then format data for learning and then apply tagging and other metadata to model the data into something the machine can “understand.” Legal protections aren’t purpose-built to allow machines to abstract data from a work, process it, model it, and then re-present it. Most licenses aren’t purpose-built for that either. This document walks the reader through all the known protections and licenses as to whether they cover machine learning practices. It then postulates a proposed license structure for that purpose.

Data solidarity for machine learning for embryo selection: a call for the creation of an open access repository of embryo data – Reproductive BioMedicine Online

Abstract:  The last decade has seen an explosion of machine learning applications in healthcare, with mixed and sometimes harmful results despite much promise and associated hype. A significant reason for the reversal in the reported benefit of these applications is the premature implementation of machine learning algorithms in clinical practice. This paper argues the critical need for ‘data solidarity’ for machine learning for embryo selection. A recent Lancet and Financial Times commission defined data solidarity as ‘an approach to the collection, use, and sharing of health data and data for health that safeguards individual human rights while building a culture of data justice and equity, and ensuring that the value of data is harnessed for public good’ (Kickbusch et al., 2021).




Open data: The building block of 21st century (open) science | Data & Policy | Cambridge Core

Abstract:  This paper identifies the potential benefits of data sharing and open science, supported by artificial intelligence tools and services, and dives into the challenges to make data open and findable, accessible, interoperable, and reusable (FAIR).