sci2sci – “GitHub for scientists” – AI-friendly research data management and publishing platform | Grants | Gitcoin

“At sci2sci, we are building an electronic lab notebook and a publishing platform in one interface. This will allow to store all experimental data and metadata in one place, and quickly release it in public access with one click. 

In a nutshell, we offer full stack data publishing – from the experiment planning through raw data acquisition and analysis to the final research report – all in a single platform, with a number of benefits that cannot be offered by a current journal pdf manuscript:…”

OpenPBTA: An Open Pediatric Brain Tumor Atlas | bioRxiv

Abstract:  Pediatric brain and spinal cancer are the leading disease-related cause of death in children, thus we urgently need curative therapeutic strategies for these tumors. To accelerate such discoveries, the Children’s Brain Tumor Network and Pacific Pediatric Neuro-Oncology Consortium created a systematic process for tumor biobanking, model generation, and sequencing with immediate access to harmonized data. We leverage these data to create OpenPBTA, an open collaborative project which establishes over 40 scalable analysis modules to genomically characterize 1,074 pediatric brain tumors. Transcriptomic classification reveals that TP53 loss is a significant marker for poor overall survival in ependymomas and H3 K28-altered diffuse midline gliomas and further identifies universal TP53 dysregulation in mismatch repair-deficient hypermutant high-grade gliomas. OpenPBTA is a foundational analysis platform actively being applied to other pediatric cancers and inform molecular tumor board decision-making, making it an invaluable resource to the pediatric oncology community.


Everything Hertz: 161: The memo (with Brian Nosek)

“Dan and James are joined by Brian Nosek (Co-founder and Executive Director of the Center for Open Science) to discuss the recent White House Office of Science and Technology Policy memo ensuring free, immediate, and equitable access to federally funded research. They also cover the implications of this memo for scientific publishing, as well as the mechanics of culture change in science….”

Open Science

“For a growing number of scientists, though, the process looks like this:

The data that the scientist collects is stored in an open access repository like figshare or Zenodo, possibly as soon as it’s collected, and given its own Digital Object Identifier (DOI). Or the data was already published and is stored in Dryad.
The scientist creates a new repository on GitHub to hold her work.
As she does her analysis, she pushes changes to her scripts (and possibly some output files) to that repository. She also uses the repository for her paper; that repository is then the hub for collaboration with her colleagues.
When she’s happy with the state of her paper, she posts a version to arXiv or some other preprint server to invite feedback from peers.
Based on that feedback, she may post several revisions before finally submitting her paper to a journal.
The published paper includes links to her preprint and to her code and data repositories, which makes it much easier for other scientists to use her work as starting point for their own research.

This open model accelerates discovery: the more open work is, the more widely it is cited and re-used. However, people who want to work this way need to make some decisions about what exactly “open” means and how to do it. You can find more on the different aspects of Open Science in this book.

This is one of the (many) reasons we teach version control. …”

Sharkipedia: a curated open access database of shark and ray life history traits and abundance time-series | Scientific Data

Abstract:  A curated database of shark and ray biological data is increasingly necessary both to support fisheries management and conservation efforts, and to test the generality of hypotheses of vertebrate macroecology and macroevolution. Sharks and rays are one of the most charismatic, evolutionary distinct, and threatened lineages of vertebrates, comprising around 1,250 species. To accelerate shark and ray conservation and science, we developed Sharkipedia as a curated open-source database and research initiative to make all published biological traits and population trends accessible to everyone. Sharkipedia hosts information on 58 life history traits from 274 sources, for 170 species, from 39 families, and 12 orders related to length (n = 9 traits), age (8), growth (12), reproduction (19), demography (5), and allometric relationships (5), as well as 871 population time-series from 202 species. Sharkipedia relies on the backbone taxonomy of the IUCN Red List and the bibliography of Shark-References. Sharkipedia has profound potential to support the rapidly growing data demands of fisheries management, international trade regulation as well as anchoring vertebrate macroecology and macroevolution.


XCIST – an open access x-ray/CT simulation toolkit – IOPscience

Abstract:  Objective: X-ray-based imaging modalities including mammography and computed tomography (CT) are widely used in cancer screening, diagnosis, staging, treatment planning, and therapy response monitoring. Over the past few decades, improvements to these modalities have resulted in substantially improved efficacy and efficiency, and substantially reduced radiation dose and cost. However, such improvements have evolved more slowly than would be ideal because lengthy preclinical and clinical evaluation is required. In many cases, new ideas cannot be evaluated due to the high cost of fabricating and testing prototypes. Wider availability of computer simulation tools could accelerate development of new imaging technologies. This paper introduces the development of a new open-access simulation environment for X-ray-based imaging. Approach: The X-ray-based Cancer Imaging Simulation Toolkit (XCIST) is developed in the context of cancer imaging, but can more broadly be applied. XCIST is physics-based, written in Python and C/C++, and currently consists of three major subsets: digital phantoms, the simulator itself (CatSim), and image reconstruction algorithms; planned future features include a fast dose-estimation tool and rigorous validation. To enable broad usage and to model and evaluate new technologies, XCIST is easily extendable by other researchers. To demonstrate XCIST’s ability to produce realistic images and to show the benefits of using XCIST for insight into the impact of separate physics effects on image quality, we present exemplary simulations by varying contributing factors such as noise and sampling. Main Results: The capabilities and flexibility of XCIST are demonstrated, showing easy applicability to specific simulation problems. Geometric and X-ray attenuation accuracy are shown, as well as XCIST’s ability to model multiple scanner and protocol parameters, and to attribute fundamental image quality characteristics to specific parameters. 
Significance: This work represents an important first step toward the goal of creating an open-access platform for simulating existing and emerging X-ray-based imaging systems.

Pangeo Forge: Crowdsourcing Open Data in the Cloud :: FOSS4G 2022 general tracks :: pretalx

“Geospatial datacubes–large, complex, interrelated multidimensional arrays with rich metadata–arise in analysis-ready geospatial imagery, level 3/4 satellite products, and especially in ocean / weather / climate simulations and [re]analyses, where they can reach Petabytes in size. The scientific python community has developed a powerful stack for flexible, high-performance analytics of datacubes in the cloud. Xarray provides a core data model and API for analysis of such multidimensional array data. Combined with Zarr or TileDB for efficient storage in object stores (e.g. S3) and Dask for scaling out compute, these tools allow organizations to deploy analytics and machine learning solutions for both exploratory research and production in any cloud platform. Within the geosciences, the Pangeo open science community has advanced this architecture as the “Pangeo platform” (http://pangeo.io/).

However, there is a major barrier preventing the community from easily transitioning to this cloud-native way of working: the difficulty of bringing existing data into the cloud in analysis-ready, cloud-optimized (ARCO) format. Typical workflows for moving data to the cloud currently consist of either bulk transfers of files into object storage (with a major performance penalty on subsequent analytics) or bespoke, case-by-case conversions to cloud optimized formats such as TileDB or Zarr. The high cost of this toil is preventing the scientific community from realizing the full benefits of cloud computing. More generally, the outputs of the toil of preparing scientific data for efficient analysis are rarely shared in an open, collaborative way.

To address these challenges, we are building Pangeo Forge (https://pangeo-forge.org/), the first open-source cloud-native ETL (extract / transform / load) platform focused on multidimensional scientific data. Pangeo Forge consists of two main elements. An open-source python package–pangeo_forge_recipes–makes it simple for users to define “recipes” for extracting many individual files, combining them along arbitrary dimensions, and depositing ARCO datasets into object storage. These recipes can be “compiled” to run on many different distributed execution engines, including Dask, Prefect, and Apache Beam. The second element of Pangeo Forge is an orchestration backend which integrates tightly with GitHub as a continuous-integration-style service….”
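
The "recipe" idea in the excerpt above (extract many files, combine them along a dimension, deposit chunked output) can be illustrated with a small standard-library sketch. This is not the pangeo_forge_recipes API. Every name here (`extract`, `combine`, `chunk`, `run_recipe`, the record layout, and the `chunk-N` keys) is a hypothetical stand-in chosen for illustration, and an in-memory dict plays the role of object storage.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical stand-in for one source file: a time stamp plus a row of values.
Record = Dict[str, object]


def extract(sources: Sequence[Record]) -> List[Record]:
    """Extract step: gather the source records, ordered along the time dimension."""
    return sorted(sources, key=lambda record: record["time"])


def combine(records: Sequence[Record]) -> Tuple[List[int], List[List[float]]]:
    """Transform step: concatenate the records along time into one 2-D array."""
    times = [record["time"] for record in records]
    values = [list(record["values"]) for record in records]
    return times, values


def chunk(times: List[int], values: List[List[float]], chunk_size: int) -> Dict[str, dict]:
    """Load step: split the combined array into fixed-size chunks along time.

    The returned dict is an assumed, simplified analogue of a chunked store.
    """
    return {
        f"chunk-{start // chunk_size}": {
            "times": times[start:start + chunk_size],
            "values": values[start:start + chunk_size],
        }
        for start in range(0, len(times), chunk_size)
    }


def run_recipe(sources: Sequence[Record], chunk_size: int = 2) -> Dict[str, dict]:
    """Run the three steps in order: extract, combine, then write chunks."""
    times, values = combine(extract(sources))
    return chunk(times, values, chunk_size)


# Three out-of-order "files", each holding one time step of a two-point field.
sources = [
    {"time": 2, "values": [3.0, 4.0]},
    {"time": 0, "values": [1.0, 2.0]},
    {"time": 1, "values": [5.0, 6.0]},
]
store = run_recipe(sources, chunk_size=2)
```

In a real deployment the excerpt names Xarray for the data model, Zarr or TileDB for storage, and Dask, Prefect, or Apache Beam for execution. The sketch only mirrors the shape of the pipeline, not those libraries.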

CCC Launches OA Agreement Intelligence | UKSG

“CCC, a provider of Open Access (OA) workflow solutions, has launched OA Agreement Intelligence, an agreement modeling solution that enables publishers to prepare, build, and analyze their OA data so that they can create and communicate sustainable and transparent agreements with their partners. The solution combines sophisticated data preprocessing with easy-to-use analysis and export capabilities….

OA Agreement Intelligence helps publishers achieve scalability, sustainability, and transparency goals for institutional agreements. Pilot participants cited a range of benefits including time savings associated with manual data clean-up, leveraging automated affiliation enrichment through CCC’s recently acquired Ringgold data, accelerating the creation of agreement offers, adjusting deal parameters in real time to drive customer satisfaction, and gaining strategic insights into historical OA business….”

No evidence that mandatory open data policies increase error correction | Nature Ecology & Evolution

Berberi, I., Roche, D.G. No evidence that mandatory open data policies increase error correction. Nat Ecol Evol (2022). https://doi.org/10.1038/s41559-022-01879-9

Preprint: https://doi.org/10.31222/osf.io/k8ver

Abstract: Using a database of open data policies for 199 journals in ecology and evolution, we found no detectable link between data sharing requirements and article retractions or corrections. Despite the potential for open data to facilitate error detection, poorly archived datasets, the absence of open code and the stigma associated with correcting or retracting articles probably stymie error correction. Requiring code alongside data and destigmatizing error correction among authors and journal editors could increase the effectiveness of open data policies at helping science self-correct.



“Open access to shared information is essential for the development and evolution of artificial intelligence (AI) and AI-powered solutions needed to address the complex challenges facing the nation and the world. The Open Knowledge Network (OKN), an interconnected network of knowledge graphs, would provide an essential public-data infrastructure for enabling an AI-driven future. It would facilitate the integration of diverse data needed to develop solutions to drive continued strong economic growth, expand opportunities, and address complex problems from climate change to social equity. The OKN Roadmap describes the key characteristics of the OKN and essential considerations in taking the effort forward in an effective and sustainable manner….”

NSF releases Open Knowledge Network Roadmap report

“The U.S. National Science Foundation today published the Open Knowledge Network Roadmap – Powering the next data revolution report that outlines a strategy for establishing an open and accessible national resource to power 21st century data science and next-generation artificial intelligence. Establishing such a knowledge infrastructure would integrate the diverse data needed to sustain strong economic growth, expand opportunities to engage in data analysis, and address complex national challenges such as climate change, misinformation, disruptions from pandemics, economic equity and diversity….”