| Message or Symptom | Fault | Recommendation |
|---|---|---|
| job hit wallclock time limit | user or system | Resubmit the job with a longer wallclock time, or restart from the last checkpoint and resubmit. If your job hung and produced no output, contact the consultants. |
| received node failed or halted event for nid xxxx | system | One of the compute nodes assigned to the job failed. Resubmit the job. |
| PtlNIInit failed : PTL_NOT_REGISTERED | user | The executable is from an XT system (e.g., Franklin or Jaguar) that used Portals. Recompile on Hopper and resubmit. |
| error while loading shared libraries: libxxxx.so: cannot open shared object file: No such file or directory | mostly user, sometimes system | Make sure the environment variable CRAY_ROOTFS is set to DSL, and that the modules loaded when building the dynamic executable are also loaded at run time. Report to the consultants if the problem is still not resolved. |
| error with width parameters to aprun | user | Make sure the #PBS -l mppwidth value matches the aprun -n value. |
| a critical file could not be located | user | Verify that input and output files have the correct paths. |
| no aprun could be found for this job (A parallel job tried to launch without using the aprun command.) | user or system | Make sure the batch script includes the line: aprun -n [numProcs] [executable]. If you can't find anything wrong with your script, contact the consultants to make sure it isn't a system problem. |
| application had a non-zero exit code | unknown | Try again, use a debugger, or contact the consultants for help. Could indicate a system problem. |
| unable to locate aprun for this job (The user may have redirected stdout so that the error message could not be caught by the system.) | unknown | Try again, use a debugger, or contact the consultants for help. |
| disk quota exceeded | system | Since Franklin enforces disk quotas at job submission rather than on running jobs, this error indicates a system problem. Please contact the consultants. |
| segmentation violation | user | Most likely a crash in the code. Try a debugger or contact the consultants. |
| node count exceeds reservation claim | user | The number of cores requested on the aprun -n [numProcs] line is larger than the number requested in the batch script with the #PBS -l mppwidth directive. Modify the script and resubmit. |
| application called MPI_Abort | user or system | Often a user code will call MPI_Abort. It could also indicate a system problem if an MPI call times out. |
| compute node initiated termination | user or system | Look in your standard output file; it will usually give some indication of the problem. If you get this error repeatedly, there could be an undetected bad node in the system. Please contact the consultants. |
| ROMIO-IO level error | user | Most likely an I/O error in user code. Contact the consultants for help. |
| OOM killer terminated this process | user | The application used more memory than is available on a compute node. Examine the memory usage of the code. See Memory Considerations on Hopper. |
| application terminated apparently before initial barrier reached | system | The job appears to hang and produces no output. Try submitting again. Contact the consultants for repeated problems. |
| application exited with non-zero exit code | user | May not be a problem. Check the error code of your application. |
| error obtaining user credentials | system | Resubmit. Contact the consultants for repeated problems. |
| nem_gni_error_handler(): a transaction error was detected | user | This is a general error message indicating something wrong in user-level code. Check carefully for other error messages accompanying this one, such as MPI errors, I/O errors, or segmentation faults, which are better indications of the cause of the job failure. |
| dmapp_dreg.c:391: _dmappi_dreg_register: Assertion `reg_cache_initialized' failed | user | Make sure to unload the xt-shmem module at both compile time and run time if your code uses only MPI; otherwise you may run into dependency issues between the MPI and SHMEM shared libraries. |
| libhugetlbfs: ERROR: RTLD_NEXT used in code not dynamically loaded | user | Make sure to load the craype-hugepages2m module. |
| distributeControlMsg: Apid xxxxxx, write failure to node xxxx | system | Problem related to an inconsistency between Node Health Check and ALPS. Report the problem to the consultants and resubmit the job. |
| MPID_nem_gni_check_localCQ(xxx)...: unrecoverable network error | user | Mostly related to MPI problems in the user code, where at least one MPI rank does not reach MPI_Finalize(). |
| [Nid xxxx] ... MPIU_nem_gni_get_hugepages(): Unable to mmap 4194304 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.xxxx.kvs_14638087, err Cannot allocate memory | system | Problem related to hugepages not being available on the compute nodes. Simply resubmit, or remove "-ss" from the aprun command line and resubmit (this may have a negative performance impact, especially for hybrid MPI/OpenMP applications). |
| [Nid xxxx] ... Apid xxx: Cpuset file /dev/cpuset/xxx/cpus wrote -1 of 5: NIDMSG-D | system | Problem related to an inconsistency between Node Health Check and ALPS. Possibly other processes are running on the node. Report the problem to the consultants and resubmit the job. |