Abstract: Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications. We address computational reproducibility at two levels: First, using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks related to publications indexed in PubMed Central. We identified such notebooks by mining the articles' full text, locating them on GitHub and re-running them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. Second, this study represents a reproducibility attempt in and of itself, using essentially the same methodology twice on PubMed Central over two years. Out of 27,271 notebooks from 2,660 GitHub repositories associated with 3,467 articles, 22,578 notebooks were written in Python, including 15,817 that had their dependencies declared in standard requirement files and that we attempted to re-run automatically. For 10,388 of these, all declared dependencies could be installed successfully, and we re-ran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the other notebooks resulted in exceptions. We zoom in on common problems, highlight trends and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.
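The automated re-run step described in the abstract can be sketched roughly as follows. This is not the study's actual pipeline; `build_rerun_commands` is a hypothetical helper that assembles, for one cloned repository, the dependency-installation and headless-execution commands the workflow would need:

```python
from pathlib import Path

def build_rerun_commands(repo_dir, notebook):
    """Assemble shell commands to re-run one notebook: install declared
    dependencies if a standard requirements file exists, then execute
    the notebook headlessly with nbconvert."""
    cmds = []
    req = Path(repo_dir) / "requirements.txt"
    if req.exists():
        cmds.append(["pip", "install", "-r", str(req)])
    cmds.append(["jupyter", "nbconvert", "--to", "notebook",
                 "--execute", "--output", "rerun.ipynb", notebook])
    return cmds
```

In the study's terms, a notebook would only count as fully reproduced if the execute step finishes without exceptions and the re-run output matches the committed one.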
Category Archives: oa.jupyter
Announcing Quarto, a new scientific and technical publishing system – RStudio
“Today we’re excited to announce Quarto, a new open-source scientific and technical publishing system. Quarto is the next generation of R Markdown, and has been re-built from the ground up to support more languages and environments, as well as to take what we’ve learned from 10 years of R Markdown and weave it into a more complete, cohesive whole. While Quarto is a “new” system, it’s important to note that it’s highly compatible with what’s come before. Like R Markdown, Quarto is also based on Knitr and Pandoc, and despite the fact that Quarto does some things differently, most existing R Markdown documents can be rendered unmodified with Quarto. Quarto also supports Jupyter as an alternate computational engine to Knitr, and can also render existing Jupyter notebooks unmodified….
Some highlights and features of note:
Choose from multiple computational engines (Knitr, Jupyter, and Observable), which makes it easy to use Quarto with R, Python, Julia, Javascript, and many other languages.
Author documents as plain text markdown or Jupyter notebooks, using a variety of tools including RStudio, VS Code, Jupyter Lab, or any notebook or text editor you like.
Publish high-quality reports, presentations, websites, blogs, books, and journal articles in HTML, PDF, MS Word, ePub, and more.
Write with scientific markdown extensions, including equations, citations, crossrefs, diagrams, figure panels, callouts, advanced layout, and more….”
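For illustration, a minimal Quarto document combines YAML front matter with an executable code chunk; the filename, title, and chunk contents below are invented, but the structure follows Quarto's documented format, and `quarto render document.qmd` produces the chosen output format:

````markdown
---
title: "Example analysis"
format: html
jupyter: python3
---

## A reproducible figure

```{python}
import matplotlib.pyplot as plt
plt.plot([0, 1, 2], [0, 1, 4])
plt.show()
```
````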
Fernando Perez: Data Science and Open Communities | TED Talk
“Fernando Perez, creator of the open source computing tool Jupyter, highlights how open source software and the ease of access to data will enhance human-centered computing, collaboration, and transparency across many areas of society.”
A FAIR and AI-ready Higgs boson decay dataset | Scientific Data
Abstract: To enable the reusability of massive scientific datasets by humans and machines, researchers aim to adhere to the principles of findability, accessibility, interoperability, and reusability (FAIR) for data and artificial intelligence (AI) models. This article provides a domain-agnostic, step-by-step assessment guide to evaluate whether or not a given dataset meets these principles. We demonstrate how to use this guide to evaluate the FAIRness of an open simulated dataset produced by the CMS Collaboration at the CERN Large Hadron Collider. This dataset consists of Higgs boson decays and quark and gluon background, and is available through the CERN Open Data Portal. We use additional available tools to assess the FAIRness of this dataset, and incorporate feedback from members of the FAIR community to validate our results. This article is accompanied by a Jupyter notebook to visualize and explore this dataset. This study marks the first in a planned series of articles that will guide scientists in the creation of FAIR AI models and datasets in high energy particle physics.
As Project Jupyter Celebrates 20 Years, Fernando Pérez Reflects On How It Started, Open Science’s Impact and the Value of Diversity in Coding | Computing, Data Science, and Society
“Introducing Reproducibility to Citation Analysis” by Samantha Teplitzky, Wynn Tranfield et al.
Abstract: Methods: Replicated methods of a prior citation study provide an updated, transparent, reproducible citation analysis protocol that can be replicated with Jupyter Notebooks.
Results: This study replicated the prior citation study’s conclusions, and also adapted the author’s methods to analyze the citation practices of Earth Scientists at four institutions. We found that 80% of the citations could be accounted for by only 7.88% of journals, a key metric to help identify a core collection of titles in this discipline. We then demonstrated programmatically that 36% of these cited references were available as open access.
Conclusions: Jupyter Notebooks are a viable platform for disseminating replicable processes for citation analysis. A completely open methodology is emerging, and we consider this a step forward. Adherence to the 80/20 rule aligned with institutional research output, but citation preferences are evident. Reproducible citation analysis methods may be used to analyze open access uptake; however, results are inconclusive. It is difficult to determine whether an article was open access at the time of citation or became open access after an embargo.
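The 80/20 finding above can be made concrete with a small sketch. This is not the authors' notebook code, and the journal names in the usage note are invented; given a list of cited journal titles, it computes the smallest fraction of distinct journals that accounts for a target share of all citations:

```python
from collections import Counter

def core_journal_share(cited_journals, coverage=0.80):
    """Return the fraction of distinct journals needed to account for
    `coverage` of all citations, counting journals from most- to
    least-cited (a Bradford-style core-collection measure)."""
    counts = Counter(cited_journals)
    total = sum(counts.values())
    running, used = 0, 0
    for _, n in counts.most_common():
        running += n
        used += 1
        if running / total >= coverage:
            break
    return used / len(counts)
```

For example, if 8 of 10 citations go to one journal and the remaining 2 are split between two others, a single journal (one third of the titles) already covers 80% of citations.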
Constellate
“Learn how to text mine or improve your skills using our self-guided lessons for all experience levels. Each lesson includes video instruction and your own Jupyter notebook — think of it like an executable textbook — ready to run in our Analytics Lab….
Teach text analytics to all skill levels using our library of open education resources, including lesson plans and our suite of Jupyter notebooks. Eliminate setup time by hosting your class in our Analytics Lab….
Create a ready-to-analyze dataset with point-and-click ease from over 30 million documents, including primary and secondary texts relevant to every discipline and perfect for learning text analytics or conducting original research….
Find patterns in your dataset with ready-made visualizations, or conduct more sophisticated text mining in our Analytics Lab using Jupyter notebooks configured for a range of text analytics methods….”
Toward collaborative open data science in metabolomics using Jupyter Notebooks and cloud computing | SpringerLink
Abstract: Background
A lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats, data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however, they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully integrated with the published work. To ensure broad use within the community such a framework also needs to be inclusive and intuitive for both computational novices and experts alike.
Aim of Review
To encourage metabolomics researchers from all backgrounds to take control of their own data science, mould it to their personal requirements, and enthusiastically share resources through open science.
Key Scientific Concepts of Review
This tutorial introduces the concept of interactive web-based computational laboratory notebooks. The reader is guided through a set of experiential tutorials specifically targeted at metabolomics researchers, based around the Jupyter Notebook web application, GitHub data repository, and Binder cloud computing platform.
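The Binder workflow the tutorial describes needs only a dependency file committed alongside the notebooks; a minimal `environment.yml` (the environment name and package list here are illustrative, not from the tutorial) is enough for mybinder.org to build a runnable cloud environment at a URL of the form https://mybinder.org/v2/gh/&lt;user&gt;/&lt;repo&gt;/&lt;branch&gt;:

```yaml
# environment.yml — read by Binder to build the container image
name: metabolomics-demo
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas
  - matplotlib
  - jupyterlab
```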
One Step Closer to the “Paper of the Future” | Research Data Management @Harvard
“As a researcher who is trying to understand the structure of the Milky Way, I often deal with very large astronomical datasets (terabytes of data, representing almost two billion unique stars). Every single dataset we use is publicly available to anyone, but the primary challenge in processing them is just how large they are. Most astronomical data hosting sites provide an option to remotely query sources through their web interface, but it is slow and inefficient for our science….
To circumvent this issue, we download all the catalogs locally to Harvard Odyssey, with each independent survey housed in a separate database. We use a special python-based tool (the “Large-Survey Database”) developed by a former post-doctoral scholar at Harvard, which allows us to perform fast queries of these databases simultaneously using the Odyssey computing cluster….
To extract information from each hdf5 file, we have developed a sophisticated Bayesian analysis pipeline that reads in our curated hdf5 files and outputs best fits for our model parameters (in our case, distances to local star-forming regions near the sun). Led by a graduate student and co-PI on the paper (Joshua Speagle), the python codebase is publicly available on GitHub with full API documentation. In the future, it will be archived with a permanent DOI on Zenodo. Also on GitHub users will find full working examples of the code, demonstrating how users can read in the publicly available data and output the same style of figures seen in the paper. Sample data are provided, and the demo is configured as a jupyter notebook, so interested users can walk through the methodology line-by-line….”
By Jupyter–Is This the Future of Open Science? | Linux Journal
“In a recent article, I explained why open source is a vital part of open science. As I pointed out, alongside a massive failure on the part of funding bodies to make open source a key aspect of their strategies, there’s also a similar lack of open-source engagement with the needs and challenges of open science. There’s not much that the Free Software world can do to change the priorities of funders. But, a lot can be done on the other side of things by writing good open-source code that supports and enhances open science.
People working in science potentially can benefit from every piece of free software code—the operating systems and apps, and the tools and libraries—so the better those become, the more useful they are for scientists. But there’s one open-source project in particular that already has had a significant impact on how scientists work—Project Jupyter….”
The Scientific Paper Is Obsolete. Here’s What’s Next. – The Atlantic
“Perhaps the paper itself is to blame. Scientific methods evolve now at the speed of software; the skill most in demand among physicists, biologists, chemists, geologists, even anthropologists and research psychologists, is facility with programming languages and “data science” packages. And yet the basic means of communicating scientific results hasn’t changed for 400 years. Papers may be posted online, but they’re still text and pictures on a page.
What would you get if you designed the scientific paper from scratch today? …
Software is a dynamic medium; paper isn’t. When you think in those terms it does seem strange that research like Strogatz’s, the study of dynamical systems, is so often being shared on paper …
I spoke to Theodore Gray, who has since left Wolfram Research to become a full-time writer. He said that his work on the notebook was in part motivated by the feeling, well formed already by the early 1990s, “that obviously all scientific communication, all technical papers that involve any sort of data or mathematics or modeling or graphs or plots or anything like that, obviously don’t belong on paper. That was just completely obvious in, let’s say, 1990,” he said. …”