Working notes from NERSC repo incite1.
In a synchronizing parallel code, a prerequisite to achieving good parallel efficiency is load balance. Time spent doing scalar work (orange) is compared to time spent synchronizing (green/blue) for 128 way runs below.
Possible focus areas:
- dtrsv solves in determinant calculation
- matrix reference in psi_adjusment
- molib
- coulomb divides replaced by MASS vdiv?
Optimization work : determinants
- Working on bringing ESSL to GSL calls --> won't work directly. The GSL BLAS API takes enums in place of chars, so a CBLAS wrapper library would need to be written for GSL.
- Recode zori for ESSL directly --> most library time is in LU for determinant. Good results. 35% speedup for hexatriene. See profiles below.
- light blue : molib and matrix adjustments
- dark blue : determinant calculation for alpha electrons
- pink : determinant calculation for beta electrons
Profiles: before | after
accuracy, source snippet
The parallel cross section and wallclock time spent doing determinant calculations are much improved by using ESSL. Keep in mind this is a scalar nonsynchronizing portion of the code, so the disorder amongst tasks is a good thing. For large chemical systems the benefit of this optimization is likely to be even greater, since the matrices for which determinants are required will grow.
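For reference, the determinant-via-LU step discussed above can be sketched in portable C. This is only an illustration of the technique, not the zori or ESSL code: the function name `lu_determinant` and the row-major storage are assumptions made for the example. In practice the factorization loop is the part a tuned library LU routine would replace.

```c
#include <assert.h>
#include <math.h>

/* Determinant of an n x n row-major matrix via LU factorization with
   partial pivoting. The input array is overwritten by its factors.
   Hypothetical helper for illustration only. */
double lu_determinant(double *a, int n)
{
    assert(a != 0 && n > 0);
    double det = 1.0;
    for (int k = 0; k < n; k++) {
        /* choose the largest-magnitude pivot in column k */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i * n + k]) > fabs(a[p * n + k]))
                p = i;
        if (a[p * n + k] == 0.0)
            return 0.0;                     /* singular matrix */
        if (p != k) {
            /* swap rows k and p; each swap flips the sign */
            for (int j = 0; j < n; j++) {
                double t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
            det = -det;
        }
        det *= a[k * n + k];                /* product of the pivots */
        /* eliminate column k below the pivot */
        for (int i = k + 1; i < n; i++) {
            double m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= m * a[k * n + j];
        }
    }
    return det;
}
```

The cost is O(n^3) in the matrix dimension, which is why the benefit of a tuned LU is expected to grow with system size.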
Optimization work : coulombic divides
Profile of time spent in potential.c:V_coul (brown)
Look into using MASS vdiv to optimize the potential evaluation.
Analysis from gprof/tprof above shows that the hot lines attributable to GSL come from matrix-matrix
reductions like:
if(GET_LAPLACIAN) {
  det->lapln=0;
  for (i=0;i<det->size;i++) {
    for (j=0;j<det->size;j++) {
      det->lapln=det->lapln+gsl_matrix_get(det->laplacianorb,i,j)*
                 gsl_matrix_get(det->inverse,j,i);
    }
  }
}
These reductions could be mapped onto ESSL, though not by improving the GSL library itself.
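The reduction above is the sum over i,j of A(i,j)*B(j,i), i.e. the trace of the product A*B. A minimal sketch on raw row-major arrays, avoiding a function call per element, might look like the following. The name `trace_of_product` and the flat array layout are assumptions for illustration, not the zori data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Sum over i,j of a[i][j] * b[j][i], which equals trace(A*B),
   for n x n matrices stored row-major in flat arrays.
   Hypothetical helper for illustration only. */
double trace_of_product(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double *arow = a + i * n;     /* row i of A, unit stride */
        for (size_t j = 0; j < n; j++)
            sum += arow[j] * b[j * n + i];  /* column i of B, stride n */
    }
    return sum;
}
```

Each pass of the inner loop is a dot product of a unit-stride vector with a stride-n vector, which is the shape a BLAS-style ddot call takes, so one plausible mapping is one ddot per row.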
I/O Performance
In order to improve I/O performance, which became an issue around 512 way,
we wrote a parallel HDF5 interface which saves the simulation state in
a single file. This avoids the file-per-task slowdown and can be called
multiple times as the calculation progresses in order to make visualizations
of the walkers as they march through space.
The HDF code put into zori uses the methods shown in this code
Rough notes
---------------------------------------------------------------------
For hexatriene the LU decomposition is for ~20x20 matrices. What controls
this dimension?
---------------------------------------------------------------------
potential.c (11% of wallclock) n2 recip force ---> vdiv in temp
w->pot = 0.0;
// electron-electron potential
for (i=0; i<w->psianti.electrons; i++) {
  for (j=0; j<...; j++) {
    w->pot = w->pot + 1.0/ee_distance(w,i,j);
  }
}
---------------------------------------------------------------------
gcc
-Wl,option
Pass option as an option to the linker. If option
contains commas, it is split into multiple options at
the commas.
-brename:.cblas_dtrsv,.esvdtrsv
---------------------------------------------------------
The inline keyword is not part of ANSI C and the library does not export
any inline function definitions by default. However, the library
provides optional inline versions of performance-critical functions by
conditional compilation. The inline versions of these functions can be
included by defining the macro HAVE_INLINE when compiling an application.
gcc -c -DHAVE_INLINE app.c
If you use autoconf this macro can be defined automatically. The
following test should be placed in your `configure.in' file,
AC_C_INLINE
if test "$ac_cv_c_inline" != no ; then
AC_DEFINE(HAVE_INLINE,1)
AC_SUBST(HAVE_INLINE)
fi
and the macro will then be defined in the compilation flags or by
including the file `config.h' before any library headers. If you do not
define the macro HAVE_INLINE then the slower non-inlined versions of the
functions will be used instead.
Note that the actual usage of the inline keyword is extern inline, which
eliminates unnecessary function definitions in GCC. If the form extern
inline causes problems with other compilers a stricter autoconf test can
be used, see section Autoconf Macros.