National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Run-Time Errors

Tracking down run-time errors usually accounts for the largest share of debugging time. These errors can be difficult to find because the error messages are very general and the code often crashes before useful data can be saved. Fortunately, most run-time errors can be traced to a few common causes. This section lists those errors and their common causes.

Run-time errors often lack meaningful error explanations, which can make them difficult to understand and debug. Since AIX is basically a standard UNIX implementation, a good reference book on UNIX provides useful information on run-time errors.

One thing commonly associated with run-time errors is a core file. A core file is the image of a terminated process; in other words, it is a dump of the process's memory at the time of the crash.

Many successful programmers never look at core files, while others swear it is the easiest way to track down an error. Each person develops their own debugging style, but it is important not to automatically erase the core file because you think you aren't an advanced enough programmer to understand what's in it. When a code is compiled with the -g option, symbolic debugging is on and the core file is relatively easy to understand.

Operand Range Errors (ORE) / Segmentation Violation or Fault

Often, puzzling errors occur when a program accesses memory that it does not own or uses an index that is out of bounds for a given array. This sometimes leads to a segmentation fault or operand range error (ORE). But sometimes no problem is immediately apparent.

These errors can be difficult to find since the error often shows up in a different location from where it originated; one part of the program corrupts memory but the crash occurs later when that bad memory is accessed, often by a properly written piece of code.

The most important question is: how did the bad data get stored in memory in the first place? The most common cause is writing beyond the bounds of an array. Compilers provide switches that turn on run-time bounds checking for statically allocated arrays.

For IBM, use -C -qextchk. The -C option turns on the run-time bounds checking, which will dramatically slow down execution of your code, so use it only as a debugging tool. Note that -qextchk checks for argument mismatches across subroutine calls; this will result in many warnings for MPI codes, which depend on weak type checking.

The code below has a number of problems. In line 11, the variable m is set to 1, but the subroutine called in line 12 writes data beyond the bounds of its array q, which occupies the same common storage as array n. Depending on how memory is laid out, this may or may not cause the program to crash. However, it will certainly corrupt the variable m, which follows n in the common block, as the output of line 13 shows.

On line 15, memory is allocated for array r. The loop beginning on line 16 writes to out-of-bounds memory locations. Again, the code may or may not crash.

     1  !filename: ore.F
     2
     3   program ore
     4           implicit none
     5           integer:: m, i, errcode
     6           integer, dimension(10):: n
     7           real, allocatable, dimension(:) :: r
     8
     9           common n,m           !n,m are adjacent in memory
    10
    11           m = 1
    12           call sub             !sub has an error that clobbers m
    13           print *,'m:', m      !see the value of m
    14
    15           allocate(r(100))     !allocate a 100-element array
    16           do i=1,200           !write beyond allocated memory
    17                  r(i) = float(i)
    18           end do
    19

    28           deallocate(r)
    29   end program ore
    30
    31
    32   subroutine sub
    33          implicit none
    34           integer:: i
    35           integer, dimension(10):: q
    36
    37           common q
    38
    39           do i=1,50            !Clobber memory by exceeding
    40             q(i) = i           !array bounds
    41           enddo
    42   end subroutine sub

Following are some examples of compiling and running this code.

IBM SP

% xlf90 -o ore ore.F
** ore   === End of Compilation 1 ===
** sub   === End of Compilation 2 ===
1501-510  Compilation successful for file ore.F.
% ./ore
 m: 11
Segmentation fault (core dumped)
% xlf90 -C -qextchk -o ore ore.F
** ore   === End of Compilation 1 ===
** sub   === End of Compilation 2 ===
1501-510  Compilation successful for file ore.F.
% ./ore
Trace/BPT trap (core dumped)

On the IBM SP, a segmentation violation can also occur if your application exceeds the stack memory limit.
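
In addition to compiler bounds checking, a code can guard suspect writes itself by comparing the index against the array bounds. The following is only a minimal sketch and is not part of ore.F; the names bounds_guard, safe_store, and guard_demo are hypothetical and chosen for illustration.

```fortran
module bounds_guard
        implicit none
contains
        ! Store val in a(idx) only if idx lies inside the bounds of a.
        ! Returns .true. on success and .false. if the write was refused.
        logical function safe_store(a, idx, val)
                integer, intent(inout) :: a(:)
                integer, intent(in)    :: idx, val

                safe_store = (idx >= lbound(a,1) .and. idx <= ubound(a,1))
                if (safe_store) a(idx) = val
        end function safe_store
end module bounds_guard

program guard_demo
        use bounds_guard
        implicit none
        integer :: q(10), i, nrefused

        q = 0
        nrefused = 0
        do i = 1, 50                 ! same loop range as subroutine sub
                if (.not. safe_store(q, i, i)) nrefused = nrefused + 1
        end do
        print *, 'writes refused:', nrefused
end program guard_demo
```

With a 10-element array and the loop range used in subroutine sub, the 40 out-of-range writes are refused instead of clobbering neighboring memory. A guard like this costs a comparison on every write, so it is best kept to the places under suspicion.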

Floating Point Exceptions (FPE)

Floating point exceptions (FPEs) are usually easier to debug than OREs, and they are most often caused by a divide by zero. An FPE results when a floating point operation is attempted and one of the operands is not valid. Examples include divide by zero and the square root of a negative number.

The IBM xlf compiler will not trap floating point exceptions by default. See xlf Floating Point Exceptions.
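
Independently of any trapping option, a code can query the floating point exception flags itself where the compiler supports the Fortran 2003 IEEE intrinsic modules. The sketch below assumes that ieee_arithmetic is available and that the processor supports the divide-by-zero flag; the names fpe_flags, checked_divide, and fpe_demo are hypothetical. It is meant to be compiled without aggressive optimization, which might otherwise move the division relative to the flag calls.

```fortran
module fpe_flags
        use, intrinsic :: ieee_arithmetic
        implicit none
contains
        ! Divide x by y and report whether the divide-by-zero flag was raised.
        subroutine checked_divide(x, y, q, div0)
                real, intent(in)     :: x, y
                real, intent(out)    :: q
                logical, intent(out) :: div0

                call ieee_set_flag(ieee_divide_by_zero, .false.)  ! clear flag
                q = x / y
                call ieee_get_flag(ieee_divide_by_zero, div0)     ! query flag
        end subroutine checked_divide
end module fpe_flags

program fpe_demo
        use fpe_flags
        implicit none
        real    :: q
        logical :: div0

        call checked_divide(1.0, 0.0, q, div0)
        print *, 'q =', q, ' divide-by-zero raised:', div0
end program fpe_demo
```

The same pattern should apply to the ieee_invalid flag, which covers operations such as the square root of a negative number.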

A common cause of FPEs is using uninitialized variables in a floating point operation. An option that may be helpful is the IBM xlf -qinitauto option, which initializes automatic variables to a given value. By using -qinitauto=FF and -qflttrap=invalid:enable, you can identify uninitialized floating point variables on the IBM machines, since anything that "touches" the uninitialized variable will become a -NaNQ.

For example, this code does not initialize the variable a.

PROGRAM NOINIT
        IMPLICIT NONE
        
        real:: a,b

        b = 2.0 * a

        print *, b
END PROGRAM NOINIT      

Here are examples of compiling and running using both default compiler options and the ones mentioned above.

IBM SP

% xlf90 -o noinit noinit.f 
** noinit   === End of Compilation 1 ===
1501-510  Compilation successful for file noinit.f.
% ./noinit 
 0.0000000000E+00
% xlf90 -qflttrap=invalid:enable -qinitauto=FF -o noinit noinit.f
** noinit   === End of Compilation 1 ===
1501-510  Compilation successful for file noinit.f.
% ./noinit 
 -NaNQ
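
A similar effect can be had without compiler options by setting suspect variables to a NaN explicitly and testing results with ieee_is_nan. This is a minimal sketch under the assumption that the compiler provides the Fortran 2003 ieee_arithmetic module; the names nan_sentinel, unset, is_unset, and noinit_checked are hypothetical.

```fortran
module nan_sentinel
        use, intrinsic :: ieee_arithmetic
        implicit none
contains
        ! Return a quiet NaN to use as an "unset" marker for real variables.
        real function unset()
                unset = ieee_value(1.0, ieee_quiet_nan)
        end function unset

        ! True if x is NaN, i.e. it was never given a real value or was
        ! computed from something that was not.
        logical function is_unset(x)
                real, intent(in) :: x
                is_unset = ieee_is_nan(x)
        end function is_unset
end module nan_sentinel

program noinit_checked
        use nan_sentinel
        implicit none
        real :: a, b

        a = unset()          ! stands in for the missing initialization
        b = 2.0 * a          ! NaN propagates through the arithmetic
        if (is_unset(b)) print *, 'b depends on an unset variable'
end program noinit_checked
```

Because the NaN propagates through arithmetic, the check can be placed well downstream of the variable that was never set.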

Even when the FPE is not due to an uninitialized variable, it is usually easy to find: recompile your code with symbolic debugging on and use TotalView to locate the error. Opening the resulting core file in TotalView will usually put you exactly at the line where the FPE occurred. Use totalview executable -c core.

You can also use brute-force methods to find FPEs, such as stepping through the code and looking at variable values from within TotalView, or including checks in your code that print variables when their values exceed certain limits. The latter suggestion is sometimes good for general debugging of run-time or logic errors, whereas stepping through the code with TotalView gets tedious even when the crash occurs relatively early. Don't be afraid of the core file: playing around with TotalView and becoming familiar with its capabilities is the key.
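
The idea of printing variables when their values exceed certain limits can be sketched as a small routine called at suspect points in the code. The names range_check, out_of_range, and range_demo, as well as the limits used, are hypothetical and only illustrate the approach.

```fortran
module range_check
        implicit none
contains
        ! Print a warning if x falls outside [lo, hi] and return .true.
        ! in that case. The label identifies the call site in the output.
        ! A NaN compares false against both limits, so it is not caught here.
        logical function out_of_range(label, x, lo, hi)
                character(len=*), intent(in) :: label
                real, intent(in)             :: x, lo, hi

                out_of_range = (x < lo .or. x > hi)
                if (out_of_range) print *, 'RANGE WARNING at ', label, ': x =', x
        end function out_of_range
end module range_check

program range_demo
        use range_check
        implicit none
        real    :: t
        integer :: i, nbad

        nbad = 0
        t = 1.0
        do i = 1, 5
                t = t * 100.0        ! grows quickly past the upper limit
                if (out_of_range('loop body', t, 0.0, 1.0e5)) nbad = nbad + 1
        end do
        print *, 'values out of range:', nbad
end program range_demo
```

The first warning printed marks the earliest point at which a value left its expected range, which is often closer to the real cause than the eventual crash.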

Other Errors

This last category covers any other general error that prevents the code from completing normally. Such errors might not properly be termed "run-time" errors, but they are presented in this section anyway.

Loader Errors

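The most common load-time error is missing or unsatisfied externals. This is usually due to a missing library or to undimensioned arrays: a reference such as a(i) to an array that was never dimensioned is treated as a call to an external function a, which the loader then cannot find.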


Page last modified: Fri, 29 Feb 2008 23:57:57 GMT
Page URL: http://www.nersc.gov/nusers/help/tutorials/debug/runtime_errors.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
