Project Jupyter: A Computer Code that Transformed Science
A look back at Berkeley Lab’s history with IPython and its impact on science and computing
June 14, 2021
By Linda Vu
A computer code co-developed by a scientist from Lawrence Berkeley National Laboratory (Berkeley Lab) and embraced by the global science community over two decades has been hailed by Nature Magazine as one of “ten computer codes that transformed science.”
Twenty years ago, Fernando Pérez was a graduate student pursuing a doctorate in particle physics at the University of Colorado, Boulder. He’d been searching for an open-source, interactive tool to analyze his scientific simulation data. He’d also just learned the Python programming language and was eager to apply it to his workflow.
Then one afternoon, in the throes of procrastination, Pérez — now an associate professor in statistics at the University of California, Berkeley (UC Berkeley) and a faculty scientist in Berkeley Lab’s Computational Research Division — developed his own Python interpreter for interactive scientific and data-intensive computing: IPython. Although IPython was originally developed as a command shell for interactive computing in the Python programming language, it works in multiple programming languages today, offering introspection, rich media, shell syntax, tab completion, and history.
“The birth of IPython was really simply that I wanted to use Python code, but use it in this interactive workflow where I’m running smaller code as I look at my data, as I look at my files, as I’m iterating and exploring my data,” Pérez said. “This process of iteration and exploration is very natural in science. In research, we don’t typically have a preordained set of requirements where someone tells us, ‘This is the software I need, go build me that.’ It’s more like we have a question that we are trying to understand and some data we want to analyze. Because IPython is an open-source code that lets you do that, it rapidly got used by a lot of other scientists.”
Many researchers did more than just use IPython; they also contributed their time and effort to add features that made the tool even better. In 2004, Pérez started collaborating with physicists Brian Granger, Min Ragan-Kelley, and others on the further development of IPython. These collaborative efforts included several prototypes of notebooks for IPython, and in 2011 the team released the first version of what became the successful, browser-based IPython Notebook. This release laid the foundation for today’s Project Jupyter and was the beginning of a revolution in data science.
Like other computational notebooks, IPython Notebook combined code, results, graphics, and text in a single document, but unlike some of its predecessors, it was open-source, inviting contributions from a community of developers. In 2014, IPython Notebook evolved into Project Jupyter and eventually grew to support about 100 languages, allowing users to explore data on remote supercomputers as well as on their laptops.
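For readers curious what “a single document” means in practice, the sketch below builds a tiny notebook by hand using only Python’s standard library. It assumes the publicly documented version 4 notebook file layout, in which a notebook is a JSON file holding a list of cells. The function names (make_notebook, summarize) and the cell contents are hypothetical and chosen only for illustration; this is not how the Jupyter tools themselves are implemented.

```python
import json


def make_notebook():
    """Build a minimal notebook as a plain dictionary.

    Assumes the version 4 notebook JSON layout; the cell contents
    below are illustrative only.
    """
    # A text cell: narrative written in Markdown.
    markdown_cell = {
        "cell_type": "markdown",
        "metadata": {},
        "source": ["# Analysis notes\n", "Text sits alongside the code."],
    }
    # A code cell: the source that was run plus the result it produced.
    code_cell = {
        "cell_type": "code",
        "execution_count": 1,
        "metadata": {},
        "source": ["1 + 1"],
        "outputs": [
            {
                "output_type": "execute_result",
                "execution_count": 1,
                "data": {"text/plain": ["2"]},
                "metadata": {},
            }
        ],
    }
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [markdown_cell, code_cell],
    }


def summarize(notebook):
    """Return the cell types in document order."""
    return [cell["cell_type"] for cell in notebook["cells"]]


if __name__ == "__main__":
    nb = make_notebook()
    # Serializing the dictionary gives the text that would be saved to disk.
    print(json.dumps(nb, indent=1))
    print(summarize(nb))
```

Because text, code, and results all live in one JSON structure, a notebook saved this way can be shared as a single file and reopened with the narrative and outputs intact.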
Going forward, Pérez sees opportunities for open-source tools like Project Jupyter, combined with cloud computing and supercomputing, to empower large-scale “communities of practice” where distributed research collaborations gather and grow to tackle big common problems like climate change or the global COVID-19 pandemic. He believes that the key to unlocking this opportunity is in deploying infrastructures that give access to research communities that haven’t traditionally used these data analysis tools. And he’s been working toward this future with his colleagues at Berkeley Lab since he arrived in 2014.
Jupyter and Systems Biology: A Perfect Match
Early in Pérez’s Berkeley Lab career, he was tapped to work on the Department of Energy’s (DOE’s) Systems Biology Knowledgebase (KBase), which uses Jupyter notebooks. Launched in 2011, KBase aims to accelerate the discovery, prediction, and design of biological functions by providing a collaborative, open environment for the scientific community to access integrative data analysis and modeling tools supported by the DOE’s world-class computing resources. KBase was one of the first big scientific platforms to have Jupyter notebooks at the center of its design, and the impact of this collaboration has reverberated across the field of systems biology research.
“KBase is meant to be a little disruptive in the way that it operates,” said Adam Arkin, KBase co-principal investigator and Berkeley Lab scientist. “We want to make the field of biological systems research as open, transparent, reusable, and interoperable as possible.”
In the field of systems biology, a typical peer-reviewed scientific paper will contain about half a petabyte of data, Arkin noted. This data is heterogeneous: a mishmash of genomic and metagenomic sequencing data, chemistry data, imaging data, and geographic data. It is also “difficult data because methods to measure these things can be very noisy and biased by multiple factors.” To correct the measurements, researchers may rely on one set of algorithms, then use another set of algorithms to analyze the data.
“Reviewing these papers was nearly impossible because of these complex data and the many-layered analyses to get to interpretable information. The authors weren’t sharing their codes and the data was coming from many different places, so it was hard to track everything. The lack of transparency and reusability made it hard to reproduce the work and build on it. Jupyter notebooks and KBase aim specifically at making such work transparent, evaluable, reusable, and provenanced,” said Arkin.
With Pérez’s help, the KBase team was able to realize its vision of collaborative science. By leveraging Jupyter notebooks, the team built a system that packages scientific data and automatically documents all of the code and the order of operations that scientists used to achieve their results, all backed by DOE supercomputing resources. With one click, scientists can publish their notebooks and then request a digital object identifier (DOI).
“Once notebooks are published and shared on our system, we’ve seen people leverage the ecology and data tools from other people to perform analysis that they can’t do anywhere else,” Arkin said.
Once the collaboration had Jupyter notebooks running properly at KBase, the scientific community saw the resource’s potential, he noted. “People could see this idea of collaborative science, learning from one another, building on each other’s work, and they got what we were trying to do,” Arkin said.
Another benefit of KBase’s collaboration with Pérez is that it connects systems biology researchers with Jupyter developers who are building open-source tools and libraries. These tools allow scientists to leverage commercial cloud computing resources to build on Jupyter notebooks published outside of KBase. And in classrooms around the world, Jupyter notebooks have replaced textbooks as the tool of choice to train the next generation of systems biologists.
Arkin credits Pérez’s ability to “humanly interface with both developers and users, especially scientists,” for bridging these communities and making Jupyter tools even better.
“Fernando’s passion for Project Jupyter has a way of making everyone feel like they were part of the effort. You don’t have success without the open part, and he used his passion and openness to bring others along with him,” said Arkin.
NERSC: Enhancing Scientific Supercomputing
As Jupyter notebooks become an increasingly important tool for data science, supercomputing sites around the world are responding to the demand by looking for ways to effectively support them. And the National Energy Research Scientific Computing Center (NERSC), located at Berkeley Lab, has been at the forefront of this effort.
As the primary scientific computing facility for DOE’s Office of Science, NERSC supports more than 8,000 scientists with an assortment of technical skills and at various career stages, as they perform research in disciplines ranging from astrophysics to climate science, biosciences to materials science, and more. Approximately six years ago, the facility’s staff noticed that some users were trying to launch Jupyter notebooks and connect them to Edison, a previous-generation NERSC supercomputer, through their own SSH tunnels. Rather than fight this trend, NERSC staff connected with Pérez and others in the JupyterHub community to discuss expanding the Jupyter ecosystem to include institutional deployments. JupyterHub is a multi-user gateway to the notebook designed for companies, classrooms, and research labs.
“Jupyter notebooks are more than just capturing the output of your code; they are about capturing your interaction with the computer, software, data, and then encapsulating it in some kind of document. We found that this is what users were already doing on their laptops and desktops, and what they wanted to do on supercomputers, so we should figure out how to support it,” said Rollin Thomas, a Big Data Architect at NERSC. “We also saw an opportunity to leverage Jupyter to expose the unique features of supercomputing for those who may have a hard time learning to use these systems.”
With the support of Pérez and the JupyterHub team, NERSC staff engineered a way for users to launch notebooks on a shared node on the facility’s flagship Cori supercomputer in 2016. Since then, the demand has only increased. According to a new report, around 700 unique users each month currently use Jupyter on Cori, a figure that has tripled in the past three years, and about 20-25% of NERSC user interaction now goes through the platform.
“Jupyter has become a primary point of entry to Cori and other systems for a substantial fraction of all NERSC users,” Thomas said. The emergence of artificial intelligence (AI) libraries and tools presents even more exciting opportunities for NERSC and Jupyter, he added.
“It’s become normal for people to encapsulate their AI workflows in Jupyter notebooks and share them that way. Some users have massive datasets stored on the NERSC global filesystem, and we’ve seen them train AI models against those datasets, capture what they’ve done in a notebook, and then hand that over to collaborators,” Thomas said. “This trend is one of the reasons we want to make sure that Jupyter works well on Perlmutter, our next-generation supercomputer. Perlmutter’s GPU nodes are going to allow a lot of users to run AI and data analytics workflows with Jupyter.”
Thomas and his colleagues are also sharing lessons learned with other supercomputing sites run by DOE, NASA, and the National Science Foundation that are interested in supporting this open-source tool. In 2019, NERSC and the Berkeley Institute for Data Science hosted a three-day workshop to discuss how to make Jupyter the pre-eminent interface for managing experimental and observational workflows and data analytics at high-performance computing centers.
“Part of Jupyter’s success comes from its role in educating today’s data scientists,” Thomas said. “If you take a course in statistics, machine learning, math, or even applied math, a lot of those courses are taught using Jupyter notebooks, so people are coming out of school with that as their training. And the fact that JupyterHub is open source enables it to have a very broad, diverse, robust developer community, which is also key to its success.”
From the perspective of a high-performance computing facility with a diverse and demanding user base, this robust and active developer community is what sets Jupyter apart from its competitors, Thomas added. “We know that if we encounter a problem, there’s a community of people who are going to work with us to solve it.”
The Future: Climate Science
Within the past few years, Pérez also began working with climate researchers in Berkeley Lab’s Earth and Environmental Sciences Area to explore ways to build a community of practice to support collaborative climate research.
“In the field of climate science you have scientific challenges of course, but the biggest problem is collective action and agreement — there are people who question the science,” said Pérez. “So my question is how can we use these tools to deploy an infrastructure that gives access to many researchers who may want to combine data from models, from remote sensing sources, and economic indicators in the community to solve this problem.”
He adds that because Project Jupyter is completely open, scientists can still explore their data interactively, but now they can also collaborate and combine their work, then publish entire interactive books that tell the whole story of a problem. He hopes these books can provide a foundation for societal agreement.
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.