NERSC: Powering Scientific Discovery Since 1974

Troubleshooting and Error Messages

Error Messages

| Message or Symptom | Fault | Recommendation |
| --- | --- | --- |
| job hit wallclock time limit | user or system | Submit the job with a longer time limit, or restart it from the last checkpoint and resubmit. If your job hung and produced no output, contact the consultants. |
| received node failed or halted event for nid xxxx | system | Resubmit the job. |
| error with width parameters to aprun | user | Make sure the #PBS -l mppwidth value matches the aprun -n value. |
| new values for MPICH_UNEX_BUFFER_SIZE are required | user | Increase the MPICH_UNEX_BUFFER_SIZE environment variable. See Running Large Jobs. |
| MPICH_PTL_OTHER_EVENTS possibly too small | user | Try increasing the value of MPICH_PTL_OTHER_EVENTS. See Running Large Jobs. |
| MPICH_PTL_UNEX_EVENTS possibly too small | user | Try increasing the value of MPICH_PTL_UNEX_EVENTS. See Running Large Jobs. |
| a critical file could not be located | user | Verify that input and output files have the correct paths. |
| no aprun could be found for this job (a parallel job was launched without using the aprun command) | user or system | Make sure the batch script includes the line: aprun -n [numProcs] [executable]. If you cannot find anything wrong with your script, contact the consultants to make sure it is not a system problem. |
| application had a non-zero exit code | unknown | Try again, use a debugger, or contact the consultants for help. This could indicate a system problem. |
| unable to locate aprun for this job (the user may have redirected stdout so that the error message could not be caught by the system) | unknown | Try again, use a debugger, or contact the consultants for help. |
| disk quota exceeded | system | Because Franklin enforces disk quotas on job submission rather than on running jobs, this error indicates a system problem. Please contact the consultants. |
| segmentation violation | user | Most likely a crash in the code. Try a debugger or contact the consultants. |
| node count exceeds reservation claim | user | The number of cores requested in the "aprun -n [numProcs] ..." line is larger than the number requested in the batch script with the #PBS -l mppwidth directive. Modify the script and resubmit. |
| application called MPI_Abort | user or system | User code often calls MPI_Abort itself. It could also indicate a system problem if an MPI call times out. |
| compute node initiated termination | user or system | Look in your standard output file; it usually gives some indication of the problem. If you get this error repeatedly, there could be an undetected bad node in the system. Please contact the consultants. |
| ROMIO-IO level error | user | Most likely an I/O error in user code. Contact the consultants for help. |
| OOM killer terminated this process | user | The application used more memory than is available on a compute node. Examine the memory usage of the code. See Memory Considerations on Franklin. |
| application terminated apparently before initial barrier reached | system | The job appears to hang and produces no output. Try submitting again. Contact the consultants if the problem repeats. |
| application exited with non-zero exit code | user | May not be a problem. Check the error code of your application. |
| error obtaining user credentials | system | Resubmit. Contact the consultants if the problem repeats. NERSC and Cray are working on this issue. |
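
Several of the entries above ("error with width parameters to aprun", "node count exceeds reservation claim") come down to the #PBS -l mppwidth value disagreeing with the aprun -n value. The following is a minimal sketch of a pre-submission check, not a NERSC-provided tool: the function name check_width, the file names, and the sample script contents (including the walltime and MPICH_* values, which are illustrative and not recommendations) are all assumptions made for this example. It only looks at the first mppwidth directive and the first aprun line, so scripts with several aprun calls would need a more careful check.

```shell
#!/bin/bash
# check_width: hypothetical helper (not part of any NERSC or Cray tool).
# Compares the "#PBS -l mppwidth=N" request in a batch script with the
# value passed to "aprun -n" and reports whether they agree.
check_width() {
    local script="$1"
    local width nprocs
    # First mppwidth value found on a #PBS directive line.
    width=$(sed -n 's/^#PBS.*mppwidth=\([0-9][0-9]*\).*/\1/p' "$script" | head -n 1)
    # First -n value on a non-comment line that invokes aprun.
    nprocs=$(grep -v '^[[:space:]]*#' "$script" \
        | sed -n 's/.*aprun.*[[:space:]]-n[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        | head -n 1)
    if [ -z "$width" ] || [ -z "$nprocs" ]; then
        echo "unknown: could not find mppwidth or aprun -n"
        return 2
    fi
    if [ "$width" -eq "$nprocs" ]; then
        echo "ok: mppwidth=$width matches aprun -n $nprocs"
        return 0
    fi
    echo "mismatch: mppwidth=$width but aprun -n $nprocs"
    return 1
}

# Demo: a sample batch script with illustrative values only.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#PBS -l mppwidth=64
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
# Illustrative settings; see Running Large Jobs for actual guidance.
export MPICH_UNEX_BUFFER_SIZE=120000000
export MPICH_PTL_UNEX_EVENTS=40960
aprun -n 128 ./my_app
EOF
# The sample requests 64 cores but launches 128 processes, so the helper
# reports a mismatch; "|| true" keeps the demo from aborting on that result.
check_width "$demo_script" || true
rm -f "$demo_script"
```

Running the helper on a script before qsub gives an early warning for the width mismatch; it does not replace the batch system's own checks.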