NERSC: Powering Scientific Discovery Since 1974

Troubleshooting and Error Messages

Error Messages

Each entry below gives the error message or symptom, the likely fault (user, system, or unknown), and a recommended action.
Message: job hit wallclock time limit
Fault: user or system
Recommendation: Submit the job with a longer wallclock limit, or restart from the last checkpoint and resubmit. If the job hung and produced no output, contact the consultants.
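A job-script sketch for the longer-walltime case. The executable name my_app and the checkpoint file restart.chk are hypothetical placeholders, not from this page; adjust the queue, core count, and walltime to your job:

```shell
#!/bin/bash
# Request a wallclock limit long enough for the run to finish.
#PBS -q regular
#PBS -l mppwidth=24
#PBS -l walltime=06:00:00    # raise this if the job hit the limit

cd $PBS_O_WORKDIR
# If the code supports checkpointing, restart from the last checkpoint
# instead of repeating completed work (restart.chk is illustrative):
aprun -n 24 ./my_app --restart restart.chk
```

On Hopper-era systems the #PBS directives are read by the batch system and aprun launches the parallel application; restarting from a checkpoint avoids redoing work already completed before the limit was hit.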
Message: received node failed or halted event for nid xxxx
Fault: system
Recommendation: One of the compute nodes assigned to the job failed. Resubmit the job.

Message: PtlNIInit failed : PTL_NOT_REGISTERED
Fault: user
Recommendation: The executable was built for an XT system (Franklin or Jaguar) that uses Portals. Recompile on Hopper and resubmit.

Message: error while loading shared libraries: libxxxx.so: cannot open shared object file: No such file or directory
Fault: mostly user, sometimes system
Recommendation: Make sure the environment variable CRAY_ROOTFS is set to DSL, and that the modules loaded when building the dynamic executable are also loaded at run time. Report to the consultants if the problem is still not resolved.
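The shared-library checks can be scripted roughly as follows; my_app and the hdf5 module are placeholders for your own executable and build-time modules:

```shell
# In the batch script, before the aprun line, point the dynamic loader
# at the compute-node root filesystem:
export CRAY_ROOTFS=DSL

# Load the same modules that were loaded when the executable was built
# (hdf5 is only an example):
module load hdf5

# On a login node, verify that every shared library resolves; any line
# printed as "not found" identifies the missing library:
ldd ./my_app
```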
Message: error with width parameters to aprun
Fault: user
Recommendation: Make sure the #PBS -l mppwidth value matches the aprun -n value.
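A minimal sketch of a consistent width request, using a hypothetical executable my_app:

```shell
#!/bin/bash
# The #PBS -l mppwidth value and the aprun -n value must agree.
#PBS -q regular
#PBS -l mppwidth=48
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
aprun -n 48 ./my_app    # -n matches mppwidth above
```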
Message: a critical file could not be located
Fault: user
Recommendation: Verify that input and output files have correct paths.

Message: no aprun could be found for this job (a parallel job was launched without using the aprun command)
Fault: user or system
Recommendation: Make sure the batch script includes the line aprun -n [numProcs] [executable]. If you can't find anything wrong with your script, contact the consultants to make sure it isn't a system problem.

Message: application had a non-zero exit code
Fault: unknown
Recommendation: Try again, use a debugger, or contact the consultants for help. This could indicate a system problem.

Message: unable to locate aprun for this job (the user may have redirected stdout so that the error message could not be caught by the system)
Fault: unknown
Recommendation: Try again, use a debugger, or contact the consultants for help.

Message: disk quota exceeded
Fault: system
Recommendation: Since Franklin enforces disk quotas at job submission rather than on running jobs, this error indicates a system problem. Please contact the consultants.

Message: segmentation violation
Fault: user
Recommendation: Most likely a crash in the code. Try a debugger or contact the consultants.

Message: node count exceeds reservation claim
Fault: user
Recommendation: The number of cores requested on the aprun -n [numProcs] line is larger than the number requested in the batch script with the #PBS -l mppwidth directive. Modify the script and resubmit.

Message: application called MPI_Abort
Fault: user or system
Recommendation: Often the user code itself calls MPI_Abort. It could also indicate a system problem if an MPI call times out.

Message: compute node initiated termination
Fault: user or system
Recommendation: Look in your standard output file; it will usually give some indication of the problem. If you get this error repeatedly, there could be an undetected bad node in the system. Please contact the consultants.

Message: ROMIO-IO level error
Fault: user
Recommendation: Most likely an I/O error in the user code. Contact the consultants for help.

Message: OOM killer terminated this process
Fault: user
Recommendation: The application used more memory than is available on a compute node. Examine the memory usage of the code. See Memory Consideration on Hopper.
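One common workaround for the OOM case is to run fewer tasks per node so each task gets a larger share of node memory. The sketch below assumes Hopper's 24-core nodes and a hypothetical executable my_app; the numbers are illustrative:

```shell
#!/bin/bash
# Reserve 4 nodes' worth of cores (4 x 24 = 96) but place only 12 MPI
# tasks on each node with aprun -N, roughly doubling the memory
# available to each task.
#PBS -q regular
#PBS -l mppwidth=96
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
aprun -n 48 -N 12 ./my_app    # 48 tasks spread over 4 nodes
```

The trade-off is idle cores: half the reserved cores do no work, so the job charges more core-hours per task.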
Message: application terminated apparently before initial barrier reached
Fault: system
Recommendation: The job appears to hang and produces no output. Try submitting again. Contact the consultants for repeated problems.

Message: application exited with non-zero exit code
Fault: user
Recommendation: This may not indicate a problem. Check the exit code of your application.

Message: error obtaining user credentials
Fault: system
Recommendation: Resubmit. Contact the consultants for repeated problems.

Message: nem_gni_error_handler(): a transaction error was detected
Fault: user
Recommendation: This is a general error message indicating something wrong in user-level code. Check carefully for other error messages that accompany this one, such as MPI errors, I/O errors, or a segmentation fault, which are better indicators of the cause of the job failure.

Message: dmapp_dreg.c:391: _dmappi_dreg_register: Assertion `reg_cache_initialized' failed
Fault: user
Recommendation: If your code uses only MPI, make sure to unload the xt-shmem module at both compile time and run time; otherwise you run into dependency issues between the MPI and SHMEM shared libraries.
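The xt-shmem unload has to happen in both environments. A sketch, with my_app.c standing in for your own source:

```shell
# At compile time: unload xt-shmem before building an MPI-only code,
# so the SHMEM library is not linked in.
module unload xt-shmem
cc -o my_app my_app.c

# At run time: unload it again in the batch script, before aprun,
# so the runtime environment matches the build environment.
module unload xt-shmem
aprun -n 24 ./my_app
```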
Message: libhugetlbfs: ERROR: RTLD_NEXT used in code not dynamically loaded
Fault: user
Recommendation: Make sure to load the craype-hugepages2m module.
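As with other Cray modules, craype-hugepages2m should be loaded consistently at build time and at run time; my_app.c is a placeholder:

```shell
# Load the 2 MB huge-page module before compiling:
module load craype-hugepages2m
cc -o my_app my_app.c

# And load it again in the batch script so the run-time environment
# matches the build:
module load craype-hugepages2m
aprun -n 24 ./my_app
```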
Message: distributeControlMsg: Apid xxxxxx, write failure to node xxxx
Fault: system
Recommendation: Caused by an inconsistency between Node Health Check and ALPS. Report the problem to the consultants, and resubmit the job.

Message: MPID_nem_gni_check_localCQ(xxx)...: unrecoverable network error
Fault: user
Recommendation: Mostly related to MPI problems in the user code, where at least one MPI rank does not reach MPI_Finalize().

Message: [Nid xxxx] ... MPIU_nem_gni_get_hugepages(): Unable to mmap 4194304 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.xxxx.kvs_14638087, err Cannot allocate memory
Fault: system
Recommendation: Caused by huge pages not being available on the compute nodes. Simply resubmit, or remove "-ss" from the aprun command line and resubmit (this may have a negative performance impact, especially for hybrid MPI/OpenMP applications).
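A sketch of the -ss workaround; my_app and the task count are illustrative:

```shell
# Original launch line, with -ss restricting each task's memory
# allocations to its local NUMA node:
#   aprun -n 24 -ss ./my_app

# Workaround: drop -ss and resubmit. This may cost performance,
# especially for hybrid MPI/OpenMP codes.
aprun -n 24 ./my_app
```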
Message: [Nid xxxx] ... Apid xxx: Cpuset file /dev/cpuset/xxx/cpus wrote -1 of 5: NIDMSG-D
Fault: system
Recommendation: Caused by an inconsistency between Node Health Check and ALPS; other processes may be running on the node. Report the problem to the consultants, and resubmit the job.

Message: [PE xxx]inet_arp_address_lookup:Failed to read output of /sbin/arp -a -i ipogif0 command. Try rerunning job with CRAY_ROOTFS environment variable set to DSL.
Fault: system
Recommendation: This is a transient issue (root cause under investigation). Resubmitting usually works.