NERSC: Powering Scientific Discovery Since 1974

Resolved: Reports of Hanging Jobs on Hopper

March 1, 2012 by Katie Antypas

Issue:

A number of users have reported that large jobs intermittently hang on Hopper. A job appears to start and then hangs shortly afterward, producing no output. The job stops when the wall clock limit is reached.

Status:

Cray has identified a few bad nodes in the system. These nodes have been rebooted, and no new hung jobs have been reported since Mar 12. A new version, xt-mpich2/5.4.4, has been installed and set as the default, along with a system-wide MPI environment setting that aborts a job if it is detected to be hung. A kernel patch was installed on Apr 3 to finally address the issue.