NERSC: Powering Scientific Discovery Since 1974

Resolved: Reports of Hanging Jobs on Hopper

March 1, 2012 by Katie Antypas

Issue:

A number of users have reported that large jobs intermittently hang on Hopper. An affected job appears to start, then hangs shortly afterward without producing any output. The job ends only when its wall clock limit is reached.

Status:

Cray identified a few bad nodes in the system. These nodes have been rebooted, and no new hung jobs have been reported since Mar 12. A new version, xt-mpich2/5.4.4, has been installed and set as the default, along with a system-wide MPI environment setting that aborts a job if it is detected to be hung. A kernel patch was installed on Apr 3 as the final fix for the issue.
