Jupyter Notebooks Will Open up New Possibilities on NERSC’s Cori Supercomputer

November 2, 2016

By Jon Bashor

Cori Supercomputer at Berkeley Lab (Photo by Roy Kaltschmidt, Berkeley Lab)

The Department of Energy’s National Energy Research Scientific Computing Center (NERSC) is using Jupyter notebooks to help users more easily access computing resources and will expand these capabilities when the fully deployed Cori supercomputer goes into production this fall.

Although a number of university and NSF-supported computing centers are using Jupyter, NERSC is the first DOE supercomputing center to do so.

NERSC currently offers the JupyterHub notebook for multi-user environments on a “Science Gateway” node. When users log in, they receive a notebook for their projects. The notebook provides access to NERSC services, files and jobs.

“Jupyter has a lot of little details that make it work very well,” said Shreyas Cholia, who gave a presentation on NERSC’s implementation at a JupyterHub workshop held in July. “One nice thing is you can open a terminal session and your entire work platform is pulled into the same environment and you can easily go back and forth between modalities.”

NERSC, which is the production computing and data center for DOE’s Office of Science, is currently installing the second phase of its Cori supercomputer and will run a JupyterHub prototype directly on the system when it goes into production use. According to Cholia, this will provide NERSC’s 6,000 users with three important benefits:

Direct access to large datasets, including terabytes written to the scratch file system. Since JupyterHub will run directly on Cori, users will have direct access to the scratch and project file systems, allowing them to write and view the data.
Access to the job queueing system, allowing users to submit jobs, query the batch systems and look at results.
Easy login with username and password.
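
As a rough illustration of the second benefit, a notebook cell could assemble a batch script and hand it to the scheduler. The sketch below assumes a Slurm-style scheduler, which this article does not specify; the function name `build_batch_script`, the queue name `debug`, the job name and the file names are all hypothetical. It only builds the script text and does not submit anything.

```python
import shlex


def build_batch_script(job_name, command, nodes=1, minutes=30, queue="debug"):
    """Return the text of a Slurm-style batch script (the scheduler is an assumption)."""
    if nodes < 1 or minutes < 1:
        raise ValueError("nodes and minutes must be positive")
    hours, mins = divmod(minutes, 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        f"#SBATCH --partition={queue}",
        "",
        # Quote each argument so paths containing spaces survive the shell.
        "srun " + " ".join(shlex.quote(part) for part in command),
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: run an analysis script on one data file for up to 90 minutes.
script = build_batch_script(
    "msi-check", ["python", "analyze.py", "sample 01.h5"], minutes=90
)
print(script)
```

On a Slurm system, the resulting text could be saved to a file and passed to `sbatch`, with `squeue` used afterward to query the job's status from the same notebook.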

Already, the web-server version of Jupyter has been integrated into six different user projects at NERSC, including OpenMSI, a cloud-based platform hosted at NERSC and built on Jupyter that allows mass spectrometry imaging (MSI) data to be viewed, analyzed and manipulated in a standard web browser. Ben Bowen, a scientist in Berkeley Lab’s Life Sciences Division, is the application lead on OpenMSI.

Bowen said he first experienced IPython notebooks when he read the book “Probabilistic Programming and Bayesian Methods for Hackers,” which was published on the platform. “I thought, ‘Wow, this IPython notebook is pretty awesome,’” Bowen said.

Although he had been programming for decades, Bowen didn’t know Python, “but the notebook made it quick for me to learn” and he found it a nice way to try out a lot of new ideas. It’s also helped others in his group quickly get up to speed in analyzing data and sharing results.

Every day, Trent Northen’s group at the Joint Genome Institute uploads raw data files from their MSI experiments studying microbial metabolisms to better understand how microbes transform their environment. Used in an increasing number of biological applications, MSI datasets typically contain unique, high-resolution mass spectra from tens of thousands of spatial locations, resulting in raw data sizes of tens of gigabytes per sample.

Members of the team then search the data files for an “internal standard” to assess the accuracy of the experiment. “To every sample that is analyzed by mass spectrometry, we add a cocktail of non-biological molecules,” Bowen said. “Their presence and signal-characteristics are used to determine the success or failure of the measurement.”
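
The kind of check Bowen describes can be sketched in a few lines of Python. This is an illustration only, not the OpenMSI team's actual method: the function name `find_internal_standards`, the parts-per-million tolerance, the intensity threshold and all of the numbers are assumptions made up for the example. It looks for each expected mass-to-charge (m/z) value in a list of peaks and reports which standards are missing.

```python
def find_internal_standards(peaks, expected_mz, tolerance_ppm=10.0, min_intensity=0.0):
    """For each expected m/z, return the strongest matching peak, or None.

    peaks: list of (mz, intensity) pairs from one spectrum.
    expected_mz: dict mapping a standard's label to its expected m/z.
    """
    results = {}
    for label, target in expected_mz.items():
        # Convert the relative ppm tolerance into an absolute m/z window.
        window = target * tolerance_ppm / 1e6
        matches = [
            (mz, intensity)
            for mz, intensity in peaks
            if abs(mz - target) <= window and intensity > min_intensity
        ]
        results[label] = max(matches, key=lambda peak: peak[1]) if matches else None
    return results


# Made-up spectrum and made-up standards, for illustration only.
spectrum = [(200.1001, 1500.0), (350.2502, 80.0), (512.3000, 900.0)]
standards = {"std_A": 200.1000, "std_B": 350.2500, "std_C": 600.0000}

found = find_internal_standards(spectrum, standards, min_intensity=100.0)
missing = [label for label, peak in found.items() if peak is None]
print("Missing standards:", missing)
```

A shared notebook could run a check of this kind on every uploaded sample, so that everyone on a project applies the same pass or fail criteria.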

Every person on the project can now use this tool to easily assess the data, Bowen said. “They’re all using the same methods to check the samples and working from the same workflow notebook. Without Jupyter, I would have to walk them through a series of complicated data analysis steps.”

The team can log in from their own computers using the notebook, see all the OpenMSI files, add ions of particular interest (which the notebook then grabs from data stored at NERSC and inserts into the notebook) and then center the samples on a grid for analysis. The grids contain 384 samples, each with a different spectrum. Using an integrated algorithm, the notebook allows the job to be done in under an hour, whereas doing it manually would take a few days.

As datasets become ever-larger, such automated tools will be more critical, Bowen said.

Bowen adds that new users don’t require much training, but that extensive documentation is available if needed. Instead of having to come up with new codes, a process Bowen likens to learning a foreign language, the notebook contains all of the necessary tools, “hiding them under the hood.”

“These notebooks are something everybody now knows about and Fernando Perez has become somewhat famous and very well regarded in the computing world,” Bowen said, adding that many academics want to keep control of their software tools rather than open them up to the community. “We’re going to see computing change a lot because of these notebooks. I hope it’s a success story for others to follow.”

About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.