Resolved: Reports of Hanging Jobs on Hopper
March 1, 2012 by Katie Antypas
A number of users have reported large jobs intermittently hanging on Hopper. A job appears to start, then hangs shortly afterward without producing any output. The job stops only when its wall clock limit is reached.
Cray has identified a few bad nodes in the system. These nodes have been rebooted, and no new hung jobs have been reported since Mar 12. A new version, xt-mpich2/5.4.4, has been installed and made the default, and a system-wide MPI environment setting now aborts a job if it is detected to be hung. A kernel patch was installed on Apr 3 as the final fix for the issue.