Woo-Sun Yang

Woo-Sun Yang, Ph.D.
HPC Consultant
Phone: (510) 486-5735
Fax: (510) 486-6459
1 Cyclotron Road
Mailstop: 59-4010A
Berkeley, CA 94720 US

Journal Articles

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, "Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores", Concurrency and Computation: Practice and Experience, August 2017, 30, doi: 10.1002/cpe.4291

The newest NERSC supercomputer Cori is a Cray XC40 system consisting of 2,388 Intel Xeon Haswell nodes and 9,688 Intel Xeon‐Phi “Knights Landing” (KNL) nodes. Compared to the Xeon‐based clusters NERSC users are familiar with, optimal performance on Cori requires consideration of KNL mode settings; process, thread, and memory affinity; fine‐grain parallelization; vectorization; and use of the high‐bandwidth MCDRAM memory. This paper describes our efforts preparing NERSC users for KNL through the NERSC Exascale Science Application Program, Web documentation, and user training. We discuss how we configured the Cori system for usability and productivity, addressing programming concerns, batch system configurations, and default KNL cluster and memory modes. System usage data, job completion analysis, programming and running jobs issues, and a few successful user stories on KNL are presented.

Jack Deslippe, Brian Austin, Chris Daley, Woo-Sun Yang, "Lessons learned from optimizing science kernels for Intel's 'Knights-Corner' architecture", CISE, April 1, 2015

Conference Papers

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, "Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores", Cray User Group 2017, Redmond, WA, May 12, 2017. Best Paper First Runner-Up.

Wendy Hwa-Chun Lin, Yun (Helen) He, and Woo-Sun Yang, "Franklin Job Completion Analysis", Cray User Group 2010 Proceedings, Edinburgh, UK, May 2010

The NERSC Cray XT4 machine Franklin has been in production for 3000+ users since October 2007, where about 1800 jobs run each day. There has been an on-going effort to better understand how well these jobs run, whether failed jobs are due to application errors or system issues, and to further reduce system related job failures. In this paper, we talk about the progress we made in tracking job completion status, in identifying job failure root cause, and in expediting resolution of job failures, such as hung jobs, that are caused by system issues. In addition, we present some Cray software design enhancements we requested to help us track application progress and identify errors.

Presentations/Talks

Richard A. Gerber, Helen He, Woo-Sun Yang, Debugging and Optimization Tools, presented at UC Berkeley CS267 class, February 19, 2014

Woo-Sun Yang, Debugging Tools, February 3, 2014

Woo-Sun Yang, Debugging and Performance Analysis Tools at NERSC, BOUT++ 2013 Workshop, September 3, 2013

Yun (Helen) He, Wendy Hwa-Chun Lin, and Woo-Sun Yang, Franklin Job Completion Analysis, Cray User Group Meeting 2010, May 2010

Reports

K. Antypas, B.A. Austin, T.L. Butler, R.A. Gerber, C.L. Whitney, N.J. Wright, W. Yang, Z. Zhao, "NERSC Workload Analysis on Hopper", Report, October 17, 2014, LBNL 6804E