Wendy Hwa-Chun Lin, Yun (Helen) He, and Woo-Sun Yang, "Franklin Job Completion Analysis", Cray User Group 2010 Proceedings, Edinburgh, UK, May 2010.
- Download File: cug2010JobComp.pdf (pdf: 429 KB)
The NERSC Cray XT4 machine Franklin has been in production since October 2007, serving more than 3000 users and running about 1800 jobs each day. There has been an ongoing effort to better understand how well these jobs run, to determine whether failed jobs are due to application errors or system issues, and to further reduce system-related job failures. In this paper, we describe the progress we have made in tracking job completion status, identifying the root causes of job failures, and expediting the resolution of failures caused by system issues, such as hung jobs. In addition, we present some Cray software design enhancements we requested to help us track application progress and identify errors.