Shane Canon

Acting Group Lead
Data Science Engagement Group
Phone: (510) 486-7024
Fax: (510) 486-6459
1 Cyclotron Road
Mailstop: 59R4010A
Berkeley, CA 94720, USA

Biographical Sketch

Shane Canon is the acting group lead for the Data Science Engagement Group. He joined NERSC in 2000 to serve as a system administrator for the PDSF cluster. While working with PDSF, he gained experience in cluster administration, batch systems, parallel file systems, and the Linux kernel. In 2005, Shane left Lawrence Berkeley National Laboratory (Berkeley Lab) to take a position as a group leader at Oak Ridge National Laboratory (ORNL). One of his more significant accomplishments at ORNL was architecting the 10-petabyte Spider file system. In 2008, Shane returned to NERSC to lead the Data Systems Group. In 2009, he transitioned to leading a newly created Technology Integration Group to focus on the Magellan Project and other areas of strategic focus. More recently, Shane has focused on enabling data-intensive applications on HPC platforms and engaging with bioinformatics applications, and he joined the Data & Analytics Services group in 2016 to pursue these topics. Shane is also involved in several projects outside of NERSC; he is the Production Lead on the KBase project, which is developing a platform to enable predictive biology. Shane holds a Ph.D. in Physics from Duke University and a B.S. in Physics from Auburn University.

Journal Articles

Antypas, KB and Bard, DJ and Blaschke, JP and Canon, RS and Enders, B and Shankar, M and Somnath, S and Stansberry, D and Uram, TD and Wilkinson, SR, "Enabling discovery data science through cross-facility workflows", 2021 IEEE International Conference on Big Data (Big Data), December 2021, pp. 3671-3680, doi: 10.1109/bigdata52589.2021.9671421

Abe Singer, Shane Canon, Rebecca Hartman-Baker, Kelly L. Rowland, David Skinner, Craig Lant, "What Deploying MFA Taught Us About Changing Infrastructure", HPCSYSPROS19: HPC System Professionals Workshop, November 2019, doi: 10.5281/zenodo.3525375

NERSC is not the first organization to implement multi-factor authentication (MFA) for its users. We had seen multiple talks by other supercomputing facilities that had deployed MFA, but as we planned and deployed our own implementation, we found that nobody had talked about the more interesting and difficult challenges, which were largely social rather than technical. Our MFA deployment was a success, but, more importantly, much of what we learned could apply to any infrastructure change. Additionally, we developed the sshproxy service, a key piece of infrastructure technology that lessens user and staff burden and has made our MFA implementation more amenable to scientific workflows. We found great value in using robust open-source components where we could and developing tailored solutions where necessary.
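
The sshproxy service issues short-lived ssh keys so that users authenticate with MFA once rather than at every connection. As a rough illustration of the pattern, here is a minimal Python sketch of a client for such a service; the endpoint URL, authentication scheme, and response format are assumptions for illustration, not NERSC's actual sshproxy API.

```python
# Hypothetical client for an sshproxy-style key service: authenticate once
# with password+OTP, receive a short-lived private key for subsequent ssh use.
# The endpoint URL and response format are illustrative assumptions.
import getpass
import os
import stat
import urllib.request

SERVICE_URL = "https://sshproxy.example.org/create_pair"  # hypothetical endpoint

def fetch_short_lived_key(user: str, key_path: str = "~/.ssh/proxy_key") -> str:
    password = getpass.getpass("Password: ")
    otp = getpass.getpass("One-time password: ")
    # HTTP Basic auth with password+OTP concatenated, a common MFA-gateway convention.
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, SERVICE_URL, user, password + otp)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
    with opener.open(SERVICE_URL, data=b"") as resp:
        private_key = resp.read()  # assume the body is a PEM-encoded private key
    path = os.path.expanduser(key_path)
    with open(path, "wb") as f:
        f.write(private_key)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # ssh requires 0600 permissions
    return path

if __name__ == "__main__":
    print("Key written to", fetch_short_lived_key(getpass.getuser()))
```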

Adam P Arkin, Robert W Cottingham, Christopher S Henry, Nomi L Harris, Rick L Stevens, Sergei Maslov, Paramvir Dehal, Doreen Ware, Fernando Perez, Shane Canon, Michael W Sneddon, Matthew L Henderson, William J Riehl, Dan Murphy-Olson, Stephen Y Chan, Roy T Kamimura, Sunita Kumari, Meghan M Drake, Thomas S Brettin, Elizabeth M Glass, Dylan Chivian, Dan Gunter, David J Weston, Benjamin H Allen, Jason Baumohl, Aaron A Best, Ben Bowen, Steven E Brenner, Christopher C Bun, John-Marc Chandonia, Jer-Ming Chia, Ric Colasanti, Neal Conrad, James J Davis, Brian H Davison, Matthew DeJongh, Scott Devoid, Emily Dietrich, Inna Dubchak, Janaka N Edirisinghe, Gang Fang, José P Faria, Paul M Frybarger, Wolfgang Gerlach, Mark Gerstein, Annette Greiner, James Gurtowski, Holly L Haun, Fei He, Rashmi Jain, Marcin P Joachimiak, Kevin P Keegan, Shinnosuke Kondo, Vivek Kumar, Miriam L Land, Folker Meyer, Marissa Mills, Pavel S Novichkov, Taeyun Oh, Gary J Olsen, Robert Olson, Bruce Parrello, Shiran Pasternak, Erik Pearson, Sarah S Poon, Gavin A Price, Srividya Ramakrishnan, Priya Ranjan, Pamela C Ronald, Michael C Schatz, Samuel M D Seaver, Maulik Shukla, Roman A Sutormin, Mustafa H Syed, James Thomason, Nathan L Tintle, Daifeng Wang, Fangfang Xia, Hyunseung Yoo, Shinjae Yoo, Dantong Yu, "KBase: The United States Department of Energy Systems Biology Knowledgebase", Nature Biotechnology, July 6, 2018, 36.7, doi: 10.1038/nbt.4163.

Here we present the DOE Systems Biology Knowledgebase (KBase, http://kbase.us), an open-source software and data platform that enables data sharing, integration, and analysis of microbes, plants, and their communities. KBase maintains an internal reference database that consolidates information from widely used external data repositories. This includes over 90,000 microbial genomes from RefSeq, over 50 plant genomes from Phytozome, over 300 Biolog media formulations, and more than 30,000 reactions and compounds from KEGG, BiGG, and MetaCyc. These public data are available for integration with user data where appropriate (e.g., genome comparison or building species trees). KBase links these diverse data types with a range of analytical functions within a web-based user interface. This extensive community resource facilitates large-scale analyses on scalable computing infrastructure and has the potential to accelerate scientific discovery, improve reproducibility, and foster open collaboration.

Lee, Jason R., et al, "Enhancing supercomputing with software defined networking", International Conference on Information Networking (ICOIN), January 10, 2018,

Conference Papers

Alex Gittens et al, "Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies", 2016 IEEE International Conference on Big Data, July 1, 2017,

Shane Canon, Doug Jacobsen, "Shifter: Containers for HPC", Cray User Group, London, England, May 13, 2016,

Container-based computing is rapidly changing the way software is developed, tested, and deployed. This paper builds on previously presented work on a prototype framework for running containers on HPC platforms. We will present a detailed overview of the design and implementation of Shifter, which, in partnership with Cray, has extended the early prototype concepts and is now in production at NERSC. Shifter enables end users to execute containers using images constructed by various methods, including the popular Docker-based ecosystem. We will discuss some of the improvements over the initial prototype, including an improved image manager, integration with SLURM, integration with the burst buffer, and user-controllable volume mounts. In addition, we will discuss lessons learned, performance results, and real-world use cases of Shifter in action. We will also discuss the potential role of containers in scientific and technical computing, including how they complement the scientific process. We will conclude with a discussion about the future directions of Shifter.
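
As an illustration of how a workflow might drive Shifter, the sketch below wraps the shifter command line (its documented --image and --volume flags) from Python; the container image, mount paths, and command shown are hypothetical.

```python
# Sketch of driving Shifter from a Python workflow script. The `shifter` CLI
# and its --image/--volume flags follow Shifter's documented usage; the image
# name, mount paths, and command here are hypothetical examples.
import subprocess

def run_in_shifter(image: str, command: list, volumes: list = ()) -> int:
    argv = ["shifter", f"--image=docker:{image}"]
    for host_dir, container_dir in volumes:
        argv.append(f"--volume={host_dir}:{container_dir}")  # user-controlled mount
    argv += command
    return subprocess.run(argv, check=True).returncode

if __name__ == "__main__":
    # Run a (hypothetical) bioinformatics image against data on the file system.
    run_in_shifter(
        "biocontainers/bwa:latest",
        ["bwa", "index", "/data/genome.fa"],
        volumes=[("/global/cscratch1/sd/user/run1", "/data")],
    )
```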

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat, "H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems", Cray User Group, May 13, 2016,

Tina Declerck, Katie Antypas, Deborah Bard, Wahid Bhimji, Shane Canon, Shreyas Cholia, Helen (Yun) He, Douglas Jacobsen, Prabhat, Nicholas J. Wright, "Cori - A System to Support Data-Intensive Computing", Cray User Group Meeting 2016, London, England, May 2016,

Doug Jacobsen, Shane Canon, "Contain This, Unleashing Docker for HPC", Cray User Group 2015, April 23, 2015,

Justin Blair, Richard S. Canon, Jack Deslippe, Abdelilah Essiari, Alexander Hexemer, Alastair A. MacDowell, Dilworth Y. Parkinson, Simon J. Patton, Lavanya Ramakrishnan, Nobumichi Tamura, Brian L. Tierney, Craig E. Tull, "High performance data management and analysis for tomography", Proc. SPIE 9212, Developments in X-Ray Tomography IX, September 12, 2014,

S. Parete-Koon, B. Caldwell, S. Canon, E. Dart, J. Hick, J. Hill, C. Layton, D. Pelfrey, G. Shipman, D. Skinner, J. Wells, J. Zurawski, "HPC's Pivot to Data", Conference, May 5, 2014,

Computer centers such as NERSC and OLCF have traditionally focused on delivering computational capability that enables breakthrough innovation in a wide range of science domains. Accessing that computational power has required services and tools to move data between input, output, computation, and storage. A pivot to data is occurring in HPC. Data transfer tools and services that were previously peripheral are becoming integral to scientific workflows. Emerging requirements from high-bandwidth detectors, high-throughput screening techniques, highly concurrent simulations, increased focus on uncertainty quantification, and an emerging open-data policy posture toward published research are among the data drivers shaping the networks, file systems, databases, and overall HPC environment. In this paper we explain the pivot to data in HPC through user requirements and the changing resources provided by HPC, with particular focus on data movement. For WAN data transfers, we present the results of a study of network performance between centers.
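
A back-of-envelope calculation of the kind that motivates this focus on data movement: how long it takes to move a dataset over a WAN link at a given sustained rate. The figures below are illustrative, not results from the study.

```python
# Back-of-envelope transfer-time arithmetic for WAN data movement.
# All figures are illustrative assumptions, not taken from the paper.
def transfer_days(dataset_tb: float, rate_gbps: float, efficiency: float = 0.8) -> float:
    """Days to move dataset_tb terabytes at rate_gbps (gigabits/s) sustained."""
    bits = dataset_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (rate_gbps * 1e9 * efficiency)   # sustained effective rate
    return seconds / 86400

# A 100 TB simulation campaign over a 10 Gb/s link at 80% efficiency:
print(f"{transfer_days(100, 10):.1f} days")  # ~1.2 days
```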

Jay Srinivasan, Richard Shane Canon, "Evaluation of A Flash Storage Filesystem on the Cray XE-6", CUG 2013, May 2013,

Flash storage and other solid-state storage technologies are increasingly being considered as a way to address the growing gap between computation and I/O. Flash storage has a number of benefits, such as good random read performance and lower power consumption. However, it has a number of challenges too, such as high cost and high overhead for write operations. There are a number of ways Flash can be integrated into HPC systems. This paper will discuss some of the approaches and show early results for a Flash file system mounted on a Cray XE-6 using high-performance PCIe-based cards. We also discuss some of the gaps and challenges in integrating Flash into HPC systems and potential mitigations, as well as new solid-state storage technologies and their likely role in the future.
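
To make the random-read comparison concrete, below is a minimal sketch of the kind of microbenchmark used to contrast flash with spinning disk: timing small reads at random offsets in a large file. The file path and block size are assumptions, and page-cache effects are ignored in this sketch.

```python
# Minimal random-read microbenchmark of the sort used to compare flash with
# disk: time single-block reads at random offsets in a large file.
# File path and block size are illustrative; no O_DIRECT, so the OS page
# cache is not bypassed in this simplified sketch.
import os
import random
import time

def random_read_latency(path: str, block: int = 4096, samples: int = 1000) -> float:
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(samples):
            offset = random.randrange(0, size - block)
            os.pread(fd, block, offset)   # one 4 KiB read at a random offset
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / samples              # mean seconds per read

if __name__ == "__main__":
    print(f"{random_read_latency('/flash/testfile') * 1e6:.1f} us per 4 KiB read")
```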

You-Wei Cheah, Richard Canon, Beth Plale, Lavanya Ramakrishnan, "Milieu: Lightweight and Configurable Big Data Provenance for Science", IEEE Big Data Congress, 2013, 46-53,

Elif Dede, Zacharia Fadika, Jessica Hartog, Madhusudhan Govindaraju, Lavanya Ramakrishnan, Dan Gunter, Shane Richard Canon, "MARISSA: MApReduce Implementation for Streaming Science Applications", eScience, October 8, 2012, 1-8,

Zacharia Fadika, Madhusudhan Govindaraju, Shane Richard Canon, Lavanya Ramakrishnan, "Evaluating Hadoop for Data-Intensive Scientific Operations", IEEE Cloud 2012, June 24, 2012,

Emerging sensor networks, more capable instruments, and ever-increasing simulation scales are generating data at a rate that exceeds our ability to effectively manage, curate, analyze, and share it. Data-intensive computing is expected to revolutionize the next-generation software stack. Hadoop, an open-source implementation of the MapReduce model, provides a way for large data volumes to be seamlessly processed through the use of large numbers of commodity computers. The inherent parallelization, synchronization, and fault tolerance the model offers make it ideal for highly parallel data-intensive applications. MapReduce and Hadoop have traditionally been used for web data processing and have only recently been applied to scientific applications. There is limited understanding of the performance characteristics that data-intensive scientific applications can obtain from MapReduce and Hadoop. Thus, it is important to evaluate Hadoop specifically for data-intensive scientific operations -- filter, merge, and reorder -- to understand its various design considerations and performance trade-offs. In this paper, we evaluate Hadoop for these data operations in the context of High Performance Computing (HPC) environments to understand the impact of the file system, network, and programming modes on performance.
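
To make the filter operation concrete, the sketch below expresses it as a Hadoop Streaming mapper in Python, paired with an identity reducer; the record format and threshold are illustrative assumptions, not the paper's benchmark code.

```python
# Sketch of the "filter" operation as a Hadoop Streaming mapper: emit only
# records whose measurement field passes a threshold. Record format and
# threshold are illustrative assumptions; run with an identity reducer, e.g.
#   hadoop jar hadoop-streaming.jar -mapper filter_mapper.py -reducer cat ...
import sys

THRESHOLD = 0.5  # hypothetical cutoff

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")   # tab-separated records assumed
    try:
        value = float(fields[2])             # assume third column holds the measurement
    except (IndexError, ValueError):
        continue                             # skip malformed records
    if value >= THRESHOLD:
        sys.stdout.write(line)               # record survives the filter
```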

Jay Srinivasan, Richard Shane Canon, Lavanya Ramakrishnan, "My Cray can do that? Supporting Diverse Workloads on the Cray XE-6", CUG 2012, May 2012,

The Cray XE architecture has been optimized to support tightly coupled MPI applications, but there is an increasing need to run more diverse workloads in the scientific and technical computing domains. These needs are being driven by trends such as the increasing need to process “Big Data”. In the scientific arena, this is exemplified by the need to analyze data from instruments ranging from sequencers to telescopes and X-ray light sources. These workloads are typically throughput oriented and often involve complex task dependencies. Can platforms like the Cray XE line play a role here? In this paper, we will describe tools we have developed to support high-throughput workloads and data-intensive applications on NERSC’s Hopper system. These tools include a custom task farmer framework, tools to create virtual private clusters on the Cray, and the use of Cray’s Cluster Compatibility Mode (CCM) to support more diverse workloads. In addition, we will describe our experience with running Hadoop, a popular open-source implementation of MapReduce, on Cray systems. We will present our experiences with this work, including successes and challenges. Finally, we will discuss future directions and how the Cray platforms could be further enhanced to support this class of workloads.
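
The task-farming pattern the paper describes can be sketched in a few lines of Python: a fixed pool of workers pulls independent tasks from a queue until a sentinel arrives. This is a generic illustration of the pattern, not NERSC's actual task farmer.

```python
# Minimal task-farming pattern: a pool of workers pulls independent tasks
# (here, shell commands) from a shared queue. Generic sketch only.
import multiprocessing as mp
import subprocess

def worker(task_queue: "mp.Queue", results: "mp.Queue") -> None:
    while True:
        cmd = task_queue.get()
        if cmd is None:                      # sentinel: no more work
            break
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results.put((cmd, proc.returncode))

if __name__ == "__main__":
    tasks = [f"echo processing chunk {i}" for i in range(100)]  # stand-in tasks
    task_queue, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(task_queue, results)) for _ in range(8)]
    for p in procs:
        p.start()
    for t in tasks:
        task_queue.put(t)
    for _ in procs:
        task_queue.put(None)                 # one sentinel per worker
    for p in procs:
        p.join()
    failures = [cmd for cmd, rc in (results.get() for _ in tasks) if rc != 0]
    print(f"{len(tasks) - len(failures)} tasks succeeded, {len(failures)} failed")
```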

Devarshi Ghoshal, Richard Shane Canon, Lavanya Ramakrishnan, "Understanding I/O Performance of Virtualized Cloud Environments", The Second International Workshop on Data Intensive Computing in the Clouds (DataCloud-SC11), 2011,

We compare the I/O performance using IOR benchmarks on two cloud computing platforms - Amazon and the Magellan cloud testbed.

Lavanya Ramakrishnan, Richard Shane Canon, Krishna Muriki, Iwona Sakrejda, Nicholas J. Wright, "Evaluating Interconnect and Virtualization Performance for High Performance Computing", Proceedings of the 2nd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS11), 2011,

In this paper we detail benchmarking results that characterize the virtualization overhead and its impact on performance. We also examine the performance of various interconnect technologies with a view to understanding the performance impacts of various choices. Our results show that virtualization can have a significant impact upon performance, with at least a 60% performance penalty. We also show that less capable interconnect technologies can have a significant impact upon the performance of typical HPC applications. Finally, we evaluate the performance of the Amazon Cluster Compute instance and show that it performs approximately equivalently to a 10G Ethernet cluster at low core counts.

Lavanya Ramakrishnan, Piotr T. Zbiegel, Scott Campbell, Rick Bradshaw, Richard Shane Canon, Susan Coghlan, Iwona Sakrejda, Narayan Desai, Tina Declerck, Anping Liu, "Magellan: Experiences from a Science Cloud", Proceedings of the 2nd International Workshop on Scientific Cloud Computing, ACM ScienceCloud '11, Boulder, Colorado, and New York, NY, 2011, 49-58,

Neal Master, Matthew Andrews, Jason Hick, Shane Canon, Nicholas J. Wright, "Performance Analysis of Commodity and Enterprise Class Flash Devices", Petascale Data Storage Workshop (PDSW), November 2010,

Kesheng Wu, Kamesh Madduri, Shane Canon, "Multi-Level Bitmap Indexes for Flash Memory Storage", IDEAS '10: Proceedings of the Fourteenth International Database Engineering and Applications Symposium, Montreal, QC, Canada, 2010,

Lavanya Ramakrishnan, Keith R. Jackson, Shane Canon, Shreyas Cholia, John Shalf, "Defining future platform requirements for e-Science clouds", SoCC, 2010, 101-106,

Keith R. Jackson, Lavanya Ramakrishnan, Krishna Muriki, Shane Canon, Shreyas Cholia, John Shalf, Harvey J. Wasserman, Nicholas J. Wright, "Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud", CloudCom, January 1, 2010, 159-168,

Book Chapters

N.J. Wright, S. S. Dosanjh, A. K. Andrews, K. Antypas, B. Draney, R.S. Canon, S. Cholia, C.S. Daley, K. M. Fagnan, R.A. Gerber, L. Gerhardt, L. Pezzaglia, Prabhat, K.H. Schafer, J. Srinivasan, "Cori: A Pre-Exascale Computer for Big Data and HPC Applications", Big Data and High Performance Computing 26 (2015): 82, (June 2015) doi: 10.3233/978-1-61499-583-8-82

Extreme data science is becoming increasingly important at the U.S. Department of Energy's National Energy Research Scientific Computing Center (NERSC). Many petabytes of data are transferred from experimental facilities to NERSC each year. Applications of importance include high-energy physics, materials science, genomics, and climate modeling, with an increasing emphasis on large-scale simulations and data analysis. In response to the emerging data-intensive workloads of its users, NERSC made a number of critical design choices to enhance the usability of its pre-exascale supercomputer, Cori, which is scheduled to be delivered in 2016. These data enhancements include a data partition, a layer of NVRAM for accelerating I/O, user-defined images, and a customizable gateway for accelerating connections to remote experimental facilities.

Sudip Dosanjh, Shane Canon, Jack Deslippe, Kjiersten Fagnan, Richard Gerber, Lisa Gerhardt, Jason Hick, Douglas Jacobsen, David Skinner, Nicholas J. Wright, "Extreme Data Science at the National Energy Research Scientific Computing (NERSC) Center", Proceedings of International Conference on Parallel Programming – ParCo 2013, ( March 26, 2014)

Lavanya Ramakrishnan, Adam Scovel, Iwona Sakrejda, Susan Coghlan, Shane Canon, Anping Liu, Devarshi Ghoshal, Krishna Muriki, Nicholas J. Wright, "Magellan - A Testbed to Explore Cloud Computing for Science", On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing, (Chapman & Hall/CRC Press: 2013)

Lavanya Ramakrishnan, Adam Scovel, Iwona Sakrejda, Susan Coghlan, Shane Canon, Anping Liu, Devarshi Ghoshal, Krishna Muriki and Nicholas J. Wright, "CAMP", On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing, (Chapman & Hall/CRC Press: January 1, 2013)

Presentation/Talks

Tina Declerck, Katie Antypas, Deborah Bard, Wahid Bhimji, Shane Canon, Shreyas Cholia, Helen (Yun) He, Douglas Jacobsen, Prabhat, Nicholas J. Wright, Cori - A System to Support Data-Intensive Computing, Cray User Group Meeting 2016, London, England, May 12, 2016,

Doug Jacobsen, Shane Canon, Contain This, Unleashing Docker for HPC, NERSC Webcast, May 15, 2015,

David Skinner and Shane Canon, NERSC and High Throughput Computing, February 12, 2013,

Richard Shane Canon, Magellan Project: Clouds for Science?, Coalition for Academic Scientific Computation, February 29, 2012,

This presentation gives a brief overview of the Magellan Project and some of its findings.

Richard Shane Canon, Exploiting HPC Platforms for Metagenomics: Challenges and Opportunities, Metagenomics Informatics Challenges Workshop, October 12, 2011,

Lavanya Ramakrishnan & Shane Canon, NERSC, Hadoop and Pig Overview, October 2011,

The MapReduce programming model and its open-source implementation Hadoop are gaining traction in the scientific community for addressing the needs of data-focused scientific applications. The requirements of these scientific applications are significantly different from the web 2.0 applications that have traditionally used Hadoop. The tutorial will provide an overview of Hadoop technologies, discuss some use cases of Hadoop for science, and present the programming challenges with using Hadoop for legacy applications. Participants will access the Hadoop system at NERSC for the hands-on component of the tutorial.
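
For a flavor of the hands-on component: word count, the canonical first Hadoop Streaming exercise, written as Python mapper and reducer functions. This is a generic tutorial-style sketch, not the actual session material.

```python
# The canonical first Hadoop exercise, word count, as Streaming mapper and
# reducer functions in one file (select with a "map" argument). Generic
# tutorial-style sketch, not the specific material from the NERSC session.
import sys

def mapper() -> None:
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")              # emit (word, 1) pairs

def reducer() -> None:
    current, count = None, 0
    for line in sys.stdin:                   # Streaming delivers input sorted by key
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```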

Shane Canon, Debunking Some Common Misconceptions of Science in the Cloud, ScienceCloud 2011, June 29, 2011,

This presentation addressed five common misconceptions of cloud computing including: clouds are simple to use and don’t require system administrators; my job will run immediately in the cloud; clouds are more efficient; clouds allow you to ride Moore’s Law without additional investment; commercial Clouds are much cheaper than operating your own system.

Richard Shane Canon, Cosmic Computing: Supporting the Science of the Planck Space Based Telescope, LISA 2009, November 5, 2009,

The scientific community is creating data at an ever-increasing rate. Large-scale experimental devices such as high-energy collider facilities and advanced telescopes generate petabytes of data a year. These immense data streams stretch the limits of the storage systems and of their administrators. The Planck project, a space-based telescope designed to study the Cosmic Microwave Background, is a case in point. Launched in May 2009, the Planck satellite will generate a data stream requiring a network of storage and computational resources to store and analyze the data. This talk will present an overview of the Planck project, including the motivation and mission, the collaboration, and the terrestrial resources supporting it. It will describe the data flow and network of computer resources in detail and will discuss how the various systems are managed. Finally, it will highlight some of the present and future challenges in managing a large-scale data system.

Reports

GK Lockwood, D Hazen, Q Koziol, RS Canon, K Antypas, J Balewski, N Balthaser, W Bhimji, J Botts, J Broughton, TL Butler, GF Butler, R Cheema, C Daley, T Declerck, L Gerhardt, WE Hurlbert, KA Kallback-Rose, S Leak, J Lee, R Lee, J Liu, K Lozinskiy, D Paul, Prabhat, C Snavely, J Srinivasan, T Stone Gibbins, NJ Wright, "Storage 2020: A Vision for the Future of HPC Storage", October 20, 2017, LBNL LBNL-2001072,

As the DOE Office of Science's mission computing facility, NERSC will follow this roadmap and deploy these new storage technologies to continue delivering storage resources that meet the needs of its broad user community. NERSC's diverse workflows also encompass significant portions of the open science workload, and the findings presented in this report are intended to be a blueprint for how the evolving storage landscape can be best utilized by the greater HPC community. Executing the strategy presented here will ensure that emerging I/O technologies will be both applicable to and effective in enabling scientific discovery through extreme-scale simulation and data analysis in the coming decade.

Katherine Yelick, Susan Coghlan, Brent Draney, Richard Shane Canon, Lavanya Ramakrishnan, Adam Scovel, Iwona Sakrejda, Anping Liu, Scott Campbell, Piotr T. Zbiegiel, Tina Declerck, Paul Rich, "The Magellan Report on Cloud Computing for Science", U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research (ASCR), December 2011,

Posters

Annette Greiner, Evan Racah, Shane Canon, Jialin Liu, Yunjie Liu, Debbie Bard, Lisa Gerhardt, Rollin Thomas, Shreyas Cholia, Jeff Porter, Wahid Bhimji, Quincey Koziol, Prabhat, "Data-Intensive Supercomputing for Science", Berkeley Institute for Data Science (BIDS) Data Science Faire, May 3, 2016,

Review of current DAS activities for a non-NERSC audience.

Others

Special Issue on Data-Intensive Scalable Computing Systems, Parallel Computing, pp. 1-96, January 31, 2017,

Adam P. Arkin et al, "The DOE Systems Biology Knowledgebase (KBase)", bioRxiv, December 22, 2016, doi: 10.1101/096354