Performance and scaling in the ZORI Code

Working notes from NERSC repo incite1.


Load Balance

In a synchronizing parallel code, load balance is a prerequisite for good parallel efficiency. Time spent doing scalar work (orange) is compared to time spent synchronizing (green/blue) for 128-way runs below.

[Figure: per-task time profiles for two 128-way runs. Panels: unbalanced, ~balanced. Key: orange = work, green/blue = communication.]
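
The comparison above can be reduced to a single number. The following is a minimal sketch, not taken from ZORI: it assumes the scalar work time of each task has already been measured, and the function name balance_efficiency is hypothetical. Because every task waits for the slowest one at a synchronization point, the ratio of mean to maximum work time bounds the achievable parallel efficiency.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on parallel efficiency implied by load imbalance.
 * work[t] is the measured scalar work time of task t (assumed input).
 * Every task waits for the slowest one, so the bound is mean / max. */
double balance_efficiency(const double *work, size_t ntasks)
{
    assert(work != NULL && ntasks > 0);
    double sum = 0.0;
    double max = work[0];
    for (size_t t = 0; t < ntasks; t++) {
        sum += work[t];
        if (work[t] > max)
            max = work[t];
    }
    if (max == 0.0)
        return 1.0;   /* no work anywhere: trivially balanced */
    return (sum / (double)ntasks) / max;
}
```

A value near 1.0 corresponds to the ~balanced case; four tasks with work times 4, 2, 1, 1 give 0.5.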

Scalar Performance

Initial Profiling
Possible focus areas:
  • dtrsv solves in determinant calculation
  • matrix reference in psi_adjusment
  • molib
  • coulomb divides replaced by MASS vdiv?
Optimization work : determinants
  • light blue : molib and matrix adjustments
  • dark blue : determinant calculation for alpha electrons
  • pink : determinant calculation for beta electrons
[Figure: task profiles before and after the optimization. Links: accuracy, source snippet.]
The parallel cross section and wallclock time spent doing determinant calculations are much improved by using ESSL. Keep in mind that this is a scalar, nonsynchronizing portion of the code, so the disorder among tasks is a good thing. For large chemical systems the benefit of this optimization is likely to be even greater, since the matrices whose determinants are required will be larger.
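
For reference, the kind of kernel being discussed can be sketched as follows. This is not ZORI's routine, which goes through GSL and ESSL; it is a plain C illustration, with the hypothetical name lu_determinant, of how a determinant is obtained from an LU factorization with partial pivoting: the product of the pivots, with one sign flip per row swap. The elimination work grows as n^3, which is why larger matrices make an optimized library more valuable.

```c
#include <assert.h>
#include <stddef.h>

static double absval(double x) { return x < 0.0 ? -x : x; }

/* Determinant of an n x n row-major matrix by in-place LU factorization
 * with partial pivoting. The contents of a are overwritten.
 * Returns 0.0 when a zero pivot column is found (singular matrix). */
double lu_determinant(double *a, size_t n)
{
    assert(a != NULL && n > 0);
    double det = 1.0;
    for (size_t k = 0; k < n; k++) {
        /* choose the largest entry in column k at or below the diagonal */
        size_t p = k;
        for (size_t i = k + 1; i < n; i++)
            if (absval(a[i*n + k]) > absval(a[p*n + k]))
                p = i;
        if (a[p*n + k] == 0.0)
            return 0.0;
        if (p != k) {
            for (size_t j = 0; j < n; j++) {
                double tmp = a[k*n + j];
                a[k*n + j] = a[p*n + j];
                a[p*n + j] = tmp;
            }
            det = -det;   /* each row swap flips the sign */
        }
        det *= a[k*n + k];
        /* eliminate the entries below the pivot */
        for (size_t i = k + 1; i < n; i++) {
            double f = a[i*n + k] / a[k*n + k];
            for (size_t j = k + 1; j < n; j++)
                a[i*n + j] -= f * a[k*n + j];
        }
    }
    return det;
}
```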


Optimization work : coulombic divides
profile of time spent in potential.c:V_coul (brown)
Look into using MASS vdiv to optimize the potential evaluation

Analysis from gprof/tprof above shows that the hot lines attributable to GSL come from matrix-matrix reductions like:

    if (GET_LAPLACIAN) {
        det->lapln = 0;
        for (i = 0; i < det->size; i++) {
            for (j = 0; j < det->size; j++) {
                det->lapln = det->lapln + gsl_matrix_get(det->laplacianorb, i, j) *
                    gsl_matrix_get(det->inverse, j, i);
            }
        }
    }

These reductions could be mapped onto ESSL calls in the application itself; improving the GSL library would not help.
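
One possible mapping, sketched under assumptions: the double loop computes the sum over i,j of A[i][j]*B[j][i], which is the trace of the product AB. For row-major storage this is a sum of n dot products, row i of A (stride 1) against column i of B (stride n), which is the shape a BLAS-style ddot accepts. The helper strided_dot below is a hypothetical stand-in written in plain C so the example compiles without ESSL; the names and the row-major layout are assumptions, not ZORI code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a BLAS-style strided dot product:
 * returns the sum over k of x[k*incx] * y[k*incy]. */
static double strided_dot(size_t n, const double *x, size_t incx,
                          const double *y, size_t incy)
{
    double s = 0.0;
    for (size_t k = 0; k < n; k++)
        s += x[k*incx] * y[k*incy];
    return s;
}

/* Sum over i,j of A[i][j] * B[j][i] for row-major n x n matrices,
 * expressed as n dot products instead of n*n element lookups. */
double trace_of_product(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += strided_dot(n, a + i*n, 1, b + i, n);   /* row i of A, column i of B */
    return s;
}
```

Replacing the stand-in with a library dot product keeps the outer loop unchanged and removes the per-element accessor calls from the inner loop.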

I/O Performance

To improve I/O performance, which became an issue at around 512-way, we wrote a parallel HDF5 interface that saves the simulation state in a single file. This avoids the file-per-task slowdown and can be called multiple times as the calculation progresses, in order to make visualizations of the walkers as they march through space.

The HDF5 code put into ZORI uses the methods shown in this code.
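
The following is not that code. It is a minimal sketch of one bookkeeping step that a single-file layout typically needs, under the assumption that each task holds a different number of walkers and writes them into its own contiguous slice of one shared dataset. The function name walker_offsets is hypothetical, and the MPI exchange of counts and the HDF5 calls themselves are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive prefix sum over per-task walker counts.
 * offset[t] is the index of task t's first walker in the shared dataset.
 * Returns the total number of walkers, i.e. the dataset length. */
size_t walker_offsets(const size_t *count, size_t *offset, size_t ntasks)
{
    assert(count != NULL && offset != NULL);
    size_t total = 0;
    for (size_t t = 0; t < ntasks; t++) {
        offset[t] = total;
        total += count[t];
    }
    return total;
}
```

With disjoint slices the tasks never write to the same region of the file, and a task that currently holds no walkers simply gets an empty slice.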

Rough notes
---------------------------------------------------------------------
For hexatriene the LU decomposition is done on ~20x20 matrices. What controls
this dimension?
---------------------------------------------------------------------

potential.c  (11% of wallclock) n^2 recip force  ---> vdiv in temp

    w->pot = 0.0;
    // electron-electron potential
    for (i=0; i<w->psianti.electrons; i++) {
        for (j=0; j<i; j++) {   // inner bound assumed: each pair (i,j) counted once
            w->pot = w->pot + 1.0/ee_distance(w,i,j);
        }
    }
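
The "vdiv in temp" idea can be sketched as two passes: store the pair distances in a temporary array, take all the reciprocals in one vector call, then reduce. The sketch below is an illustration under assumptions, not ZORI code. The helper vec_recip is a hypothetical plain C stand-in for a vector routine such as MASS vdiv, and the distances are assumed to have been gathered into dist beforehand.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a vector reciprocal routine: y[k] = 1 / x[k].
 * A tuned vector library call would replace this loop. */
static void vec_recip(double *y, const double *x, size_t n)
{
    for (size_t k = 0; k < n; k++)
        y[k] = 1.0 / x[k];
}

/* Sum of 1/r over a list of pair distances.
 * dist holds npairs distances; tmp is scratch space of the same length. */
double coulomb_sum(const double *dist, double *tmp, size_t npairs)
{
    assert(dist != NULL && tmp != NULL);
    vec_recip(tmp, dist, npairs);        /* pass 1: all divides at once */
    double pot = 0.0;
    for (size_t k = 0; k < npairs; k++)  /* pass 2: plain reduction */
        pot += tmp[k];
    return pot;
}
```

The design trades a small temporary buffer for divides that are issued as one batch instead of one at a time inside the pair loop.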

---------------------------------------------------------------------
gcc

          -Wl,option
              Pass option as an option to the linker.  If option
              contains commas, it is split into multiple options at
              the commas.



-brename:.cblas_dtrsv,.esvdtrsv



---------------------------------------------------------

The inline keyword is not part of ANSI C and the library does not export
any inline function definitions by default. However, the library
provides optional inline versions of performance-critical functions by
conditional compilation. The inline versions of these functions can be
included by defining the macro HAVE_INLINE when compiling an application.

gcc -c -DHAVE_INLINE app.c

If you use autoconf this macro can be defined automatically. The
following test should be placed in your `configure.in' file,

AC_C_INLINE

if test "$ac_cv_c_inline" != no ; then
  AC_DEFINE(HAVE_INLINE,1)
  AC_SUBST(HAVE_INLINE)
fi

and the macro will then be defined in the compilation flags or by
including the file `config.h' before any library headers. If you do not
define the macro HAVE_INLINE then the slower non-inlined versions of the
functions will be used instead.

Note that the actual usage of the inline keyword is extern inline, which
eliminates unnecessary function definitions in GCC. If the form extern
inline causes problems with other compilers a stricter autoconf test can
be used, see section Autoconf Macros.
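
The conditional-compilation pattern described in the excerpt above can be illustrated with a small stand-alone example. This is a sketch of the general idea only, not GSL's headers: the type vec_t and the functions vec_get and vec_get_slow are hypothetical, and static inline is used here instead of extern inline to keep the example portable.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t n; double *data; } vec_t;   /* hypothetical type */

/* Out-of-line accessor with a bounds check: always available. */
double vec_get_slow(const vec_t *v, size_t i)
{
    assert(v != NULL && i < v->n);
    return v->data[i];
}

#ifdef HAVE_INLINE
/* Inline accessor, compiled only when the application defines HAVE_INLINE
 * (for example with -DHAVE_INLINE on the compiler command line). */
static inline double vec_get(const vec_t *v, size_t i)
{
    return v->data[i];
}
#else
/* Without HAVE_INLINE, calls fall back to the out-of-line version. */
#define vec_get vec_get_slow
#endif
```

Application code calls vec_get in both cases; only the compile flags decide which version is used.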