| Message or Symptom | Fault | Recommendation |
|---|---|---|
| job hit wallclock time limit | user or system | Submit the job with a longer time limit, or restart it from the last checkpoint and resubmit. If your job hung and produced no output, contact the consultants. |
| received node failed or halted event for nid xxxx | system | Resubmit the job. |
| error with width parameters to aprun | user | Make sure the #PBS -l mppwidth value matches the aprun -n value. |
| new values for MPICH_UNEX_BUFFER_SIZE are required | user | Increase the MPICH_UNEX_BUFFER_SIZE environment variable. See Running Large Jobs. |
| MPICH_PTL_OTHER_EVENTS possibly too small | user | Try increasing the value of MPICH_PTL_OTHER_EVENTS. See Running Large Jobs. |
| MPICH_PTL_UNEX_EVENTS possibly too small | user | Try increasing the value of MPICH_PTL_UNEX_EVENTS. See Running Large Jobs. |
| a critical file could not be located | user | Verify that input and output files have the correct paths. |
| no aprun could be found for this job (A parallel job was launched without using the aprun command.) | user or system | Make sure the batch script includes the line: aprun -n [numProcs] [executable]. If you can't find anything wrong with your script, contact the consultants to make sure it isn't a system problem. |
| application had a non-zero exit code | unknown | Try again, use a debugger, or contact the consultants for help. Could indicate a system problem. |
| unable to locate aprun for this job (User may have redirected stdout so that the error message could not be caught by the system.) | unknown | Try again, use a debugger, or contact the consultants for help. |
| disk quota exceeded | system | Since Franklin enforces disk quotas on job submission rather than on running jobs, this error indicates a system problem. Please contact the consultants. |
| segmentation violation | user | Most likely a crash in the code. Try a debugger or contact the consultants. |
| node count exceeds reservation claim | user | The number of cores requested in the aprun -n [numProcs] ... line is larger than the number requested in the batch script with the #PBS -l mppwidth directive. Modify the script and resubmit. |
| application called MPI_Abort | user or system | User code often calls MPI_Abort. It could also indicate a system problem if an MPI call times out. |
| compute node initiated termination | user or system | Look in your standard output file. Usually this will give some indication of the problem. If you get this error repeatedly, there could be an undetected bad node in the system. Please contact the consultants. |
| ROMIO-IO level error | user | Most likely an I/O error in user code. Contact the consultants for help. |
| OOM killer terminated this process. | user | The application used more memory than is available on a compute node. Examine the memory usage of the code. See Memory Considerations on Franklin. |
| application terminated apparently before initial barrier reached | system | Job appears to hang and produce no output. Try submitting again. Contact the consultants for repeated problems. |
| application exited with non-zero exit code | user | May not be a problem. Check the error code of your application. |
| error obtaining user credentials | system | Resubmit. Contact the consultants for repeated problems. NERSC and Cray are working on this issue. |
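
Several of the user-side errors above (a width parameter error, a missing aprun line, a node count that exceeds the reservation) come from a mismatch between the #PBS -l mppwidth directive and the aprun -n value. The following is a minimal sketch of a pre-submission check, not a NERSC-provided tool. The function name `check_batch_script`, the sample script contents, and the assumption that the directive is written as `mppwidth=<N>` are all illustrative.

```python
import re

# Hypothetical batch script; job name, core count, and executable are made up.
SAMPLE_SCRIPT = """\
#PBS -N example_job
#PBS -l mppwidth=64
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
aprun -n 64 ./a.out
"""


def check_batch_script(text):
    """Return a list of warnings about aprun / mppwidth usage in a batch script."""
    warnings = []

    # Assumes the directive has the form '#PBS -l ...mppwidth=<N>...'.
    width = re.search(r"^#PBS\s+-l\s+.*?\bmppwidth=(\d+)", text, re.MULTILINE)
    if width is None:
        warnings.append("no '#PBS -l mppwidth=' directive found")

    # Lines that launch a parallel job with aprun.
    aprun_lines = [ln for ln in text.splitlines() if re.match(r"\s*aprun\b", ln)]
    if not aprun_lines:
        warnings.append("no aprun line found")

    for ln in aprun_lines:
        n = re.search(r"\s-n\s+(\d+)", ln)
        if n is None or width is None:
            continue
        requested, reserved = int(n.group(1)), int(width.group(1))
        if requested > reserved:
            warnings.append(f"aprun -n {requested} exceeds mppwidth={reserved}")
        elif requested != reserved:
            warnings.append(f"aprun -n {requested} does not match mppwidth={reserved}")

    return warnings


if __name__ == "__main__":
    for message in check_batch_script(SAMPLE_SCRIPT) or ["no problems found"]:
        print(message)
```

Running the check on a script before submitting it catches these three cases early. It does not detect system-side causes, which still need to be reported to the consultants.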