Run-Time Errors

The largest share of debugging time usually goes to tracking down run-time errors. They can be difficult to track down because the error messages are very general and the code crashes before useful data can be saved. Surprisingly, though, most run-time errors can be traced to a few common causes, which are listed here. Since AIX is basically a standard UNIX implementation, a good reference book on UNIX provides useful information on run-time errors.

One thing commonly associated with run-time errors is a core file. A core file is the image of a terminated process; in other words, a dump of everything in memory at the time of the crash.
Many successful programmers never look at core files, while others
swear it is the easiest way to track down an error. Each person develops their
own debugging style, but it is important not to automatically erase the core
file because you think you aren't an advanced enough programmer to understand
what's in it. When a code is compiled with the

Operand Range Errors (ORE) / Segmentation Violation or Fault

Puzzling errors often occur when a program accesses memory that it does not own or uses an index that is out of bounds for a given array. This sometimes leads to a segmentation fault or operand range error (ORE), but sometimes no problem is immediately apparent. These errors can be difficult to find because the error often shows up in a different location from where it originated: one part of the program corrupts memory, but the crash occurs later when that bad memory is accessed, often by a properly written piece of code. The most important question is: how did the bad data get stored in memory in the first place?

The most common cause is writing beyond the bounds of an array. Compilers provide switches that allow you to turn on run-time bounds checking for statically allocated arrays. For IBM use -C -qextchk. This run-time checking will dramatically slow down execution of your code, so use it only as a debugging tool. Note that -qextchk checks for argument mismatches across subroutine calls; this will result in many warnings for MPI codes, which depend on weak type checking.
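Compiler bounds checking aside, one way to reduce the chance of overrunning an array is to take loop limits from the array itself rather than from a hard-coded constant. The following is a minimal sketch of that idea and is not part of the tutorial's own code; the names safe_fill_mod, safe_fill, and bounds_sketch are hypothetical and chosen only for illustration.

```fortran
! Sketch only: module, routine, and program names are illustrative assumptions.
module safe_fill_mod
  implicit none
contains
  ! An assumed-shape dummy argument carries its own extent, so the loop
  ! limit cannot drift out of step with the caller's declaration.
  subroutine safe_fill(a)
    integer, dimension(:), intent(out) :: a
    integer :: i
    do i = 1, size(a)
       a(i) = i
    end do
  end subroutine safe_fill
end module safe_fill_mod

program bounds_sketch
  use safe_fill_mod
  implicit none
  integer :: i, errcode
  integer, dimension(10) :: n
  real, allocatable, dimension(:) :: r

  call safe_fill(n)              ! fills exactly the 10 elements of n

  ! Check the allocation status before touching the memory.
  allocate(r(100), stat=errcode)
  if (errcode /= 0) then
     print *, 'allocation of r failed, stat =', errcode
     stop
  end if

  ! The loop limit is the allocated size, not a separate constant.
  do i = 1, size(r)
     r(i) = real(i)
  end do

  print *, 'n(10) =', n(10)
  print *, 'r(size(r)) =', r(size(r))
  deallocate(r)
end program bounds_sketch
```

Because the extent travels with the array, changing a declaration or an allocate statement does not leave a stale loop limit behind. This does not replace run-time bounds checking; it only removes one common way for the two to disagree.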
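The code below has a number of problems. In line 11, the variable m is set to 1, but the call to sub on line 12 overwrites it by writing past the end of array q, which shares the common block with n and m. On line 15 memory is allocated for array r with 100 elements, but the loop on lines 16 to 18 writes 200 elements.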
 1  !filename: ore.F
 2
 3  program ore
 4    implicit none
 5    integer:: m, i, errcode
 6    integer, dimension(10):: n
 7    real, allocatable, dimension(:) :: r
 8
 9    common n,m               !n,m are adjacent in memory
10
11    m = 1
12    call sub                 !sub has an error that clobbers m
13    print *,'m:', m          !see the value of m
14
15    allocate(r(100))         !allocate a 100-element array
16    do i=1,200               !write beyond allocated memory
17      r(i) = float(i)
18    end do
19
28    deallocate(r)
29  end program ore
30
31
32  subroutine sub
33    implicit none
34    integer:: i
35    integer, dimension(10):: q
36
37    common q
38
39    do i=1,50                !Clobber memory by exceeding
40      q(i) = i               !array bounds
41    enddo
42  end subroutine sub
Following are some examples of compiling and running this code.

IBM SP

% xlf90 -o ore ore.F
** ore === End of Compilation 1 ===
** sub === End of Compilation 2 ===
1501-510 Compilation successful for file ore.F.
% ./ore
m: 11
Segmentation fault (core dumped)
% xlf90 -C -qextchk -o ore ore.F
** ore === End of Compilation 1 ===
** sub === End of Compilation 2 ===
1501-510 Compilation successful for file ore.F.
% ./ore
Trace/BPT trap (core dumped)

On the IBM SP, a segmentation violation can also occur if your application exceeds the stack memory limit.

Floating Point Exceptions (FPE)

Floating point exceptions (FPEs) are usually easier to debug than OREs, and they are almost always caused by a divide by zero. FPEs result when a floating point operation is attempted and one of the operands is not valid. Examples include divide by zero or square root of a negative number. The IBM xlf compiler will not trap floating point exceptions by default. See xlf Floating Point Exceptions.

A common cause of FPEs is using uninitialized variables in a floating point operation. An option that may be helpful is the IBM xlf -qinitauto option, which initializes variables. By using -qinitauto=FF and -qflttrap=invalid:enable, you can identify uninitialized floating point variables on the IBM machines, since anything that "touches" the uninitialized variable will become a -NaNQ. For example, this code does not initialize the variable a.
PROGRAM NOINIT
  IMPLICIT NONE
  real:: a,b
  b = 2.0 * a
  print *, b
END PROGRAM NOINIT
Here are examples of compiling and running using both default compiler options and the ones mentioned above.

IBM SP

% xlf90 -o noinit noinit.f
** noinit === End of Compilation 1 ===
1501-510 Compilation successful for file noinit.f.
% ./noinit
0.0000000000E+00
% xlf90 -qflttrap=invalid:enable -qinitauto=FF -o noinit noinit.f
** noinit === End of Compilation 1 ===
1501-510 Compilation successful for file noinit.f.
% ./noinit
-NaNQ

Even when the FPE is not due to an uninitialized variable, it is still usually easy to find by recompiling your code with symbolic debugging on and using Totalview to help you find the error. Opening the resulting core file in Totalview will usually put you exactly at the line where the FPE occurred. Use totalview executable -c core.

You can also use brute force methods to find FPEs, like looking at variable values from within Totalview or including flags in your code that print variables when their values exceed certain limits. The latter suggestion is sometimes good for general debugging of run-time or logic errors. Stepping through the code with Totalview, however, gets tedious even when the crash occurs relatively early. Don't be afraid of the core: playing around with Totalview and becoming familiar with its capabilities is the key.

Other Errors

This last category is for any other general error encountered which prevents the code from completing normally. They might not properly be termed "run-time" errors, but they will be presented in this section anyway.
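One technique mentioned earlier applies to errors of any category: including flags in your code that print variables when their values exceed certain limits. A minimal sketch of such a check might look like the following. The names limit_check_mod, out_of_range, and limit_sketch, along with the threshold 1.0e10, are assumptions made for illustration and do not come from the tutorial.

```fortran
! Sketch only: all names and the threshold value are illustrative assumptions.
module limit_check_mod
  implicit none
contains
  ! Returns .true. and prints a message when x is NaN or when its
  ! magnitude exceeds the given limit.
  logical function out_of_range(label, x, limit)
    character(len=*), intent(in) :: label
    real, intent(in) :: x, limit
    ! The comparison x /= x is true only when x is NaN.
    out_of_range = (x /= x) .or. (abs(x) > limit)
    if (out_of_range) then
       print *, 'CHECK: ', label, ' = ', x, ' (limit ', limit, ')'
    end if
  end function out_of_range
end module limit_check_mod

program limit_sketch
  use limit_check_mod
  implicit none
  real :: a, b
  logical :: flagged

  a = 1.0e-20
  b = 2.0 / a                    ! very large, but still finite
  flagged = out_of_range('b', b, 1.0e10)
  if (flagged) print *, 'b was flagged before it could propagate further'
end program limit_sketch
```

Assign the result to a logical variable, as above, rather than calling the function inside the item list of a print statement, since the function itself performs output. Aggressive floating point optimization settings may remove the x /= x test, so checks like this are best used in debugging builds.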
Page last modified: Fri, 29 Feb 2008 23:57:57 GMT
Page URL: http://www.nersc.gov/nusers/help/tutorials/debug/runtime_errors.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov