National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory

Using ACML on Hopper and Franklin

The AMD Core Math Library (ACML) is available on Hopper and Franklin. It provides highly optimized BLAS, LAPACK, and FFT routines, as well as a Random Number Generator Suite.

To use ACML, load the ACML module with "module load acml"; the compiler wrappers (ftn, CC, cc) then link to the ACML library automatically.

For details, see the ACML User's Guide and the BLAS and LAPACK man pages.

Note: The BLAS and LAPACK routines in the Cray LibSci library are the default and are preferred over the corresponding ACML routines, so we advise against loading the ACML module unless you need functionality specific to ACML.

Example Fortran and MPI Code using ACML

Below, a Fortran MPI code is compiled and run. It uses the ACML random number generator routines to generate distinct arrays of random values on different processors, then calls the ACML BLAS routine ddot to compute their dot products.

% cat dp1.f
C     compile:   ftn -fastsse -o dp1 dp1.f
C     run:       qsub rundp
 
      IMPLICIT NONE
 
      INCLUDE 'mpif.h'
  
      integer, parameter :: statesize=16, naggen=1, seedsize=1
      INTEGER :: myPE, totPEs, ierr
      integer :: info, i, n, nskip1, nskip2, imag
      integer :: state1(statesize), state2(statesize), seed(seedsize)
      real :: dpsend
      real, allocatable :: dpset(:)
      double precision :: dp, ddot
      double precision, parameter :: a=0.d0, b=1.0d0
      double precision, allocatable :: x(:), y(:)
  
      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, myPE, ierr )
      CALL MPI_COMM_SIZE( MPI_COMM_WORLD, totPEs, ierr )

      allocate(dpset(totPEs))
 
C   all processors working from the same random sequence, same seed
 
      seed(1) = 1234
      call drandinitialize(naggen,1,seed,1,state1,statesize,info)
      state2 = state1
 
c   pull vectors out of sequence based on processor number
 
      n = 1000000
      nskip1 = n * myPE
      nskip2 = nskip1 + (n * totPEs)
      call drandskipahead(nskip1,state1,info)
      call drandskipahead(nskip2,state2,info)
 
      allocate(x(n), y(n))
      call dranduniform(n,a,b,state1,x,info)
      call dranduniform(n,a,b,state2,y,info)
 
c   compute dot product of two vectors from random distribution
 
      dp = ddot(n,x,1,y,1)
 
      deallocate(x,y)
 
c    gather normalized dot products
 
      dpsend = 4*(dp/n)
 
      CALL MPI_GATHER(dpsend, 1, MPI_REAL, dpset, 1, MPI_REAL,
     .                0, MPI_COMM_WORLD, ierr) 
 
      if (myPE==0) then
         print *,"dpset = ",dpset
         print *,"PE's, average = ", totPEs, (sum(dpset)/totPEs)
      endif
 
      deallocate(dpset)

      CALL MPI_FINALIZE(ierr)
  
      END

% cat rundp
#PBS -N dpjob
#PBS -q debug
#PBS -l mppwidth=14
#PBS -l walltime=00:10:00
#PBS -e dpjob.out
#PBS -j eo

cd $PBS_O_WORKDIR
module load acml
ftn -o dp1 -fastsse dp1.f

aprun -n 2 ./dp1
aprun -n 4 ./dp1
aprun -n 8 ./dp1

% cat dpjob.out
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/opt/xt-pe/2.0.24b/bin/snos64/ftn: INFO: linux target is being used
dp1.f:
 dpset =     1.000160       0.9977519    
 PE's, average =             2   0.9989561    
Application 178114 resources: utime 0, stime 0
 dpset =    0.9997419       0.9983472       0.9999773        1.000023    
 PE's, average =             4   0.9995222    
Application 178115 resources: utime 0, stime 0
 dpset =     1.000837       0.9985554       0.9998037       0.9999567     
   0.9991672       0.9996513       0.9990580       0.9994583    
 PE's, average =             8   0.9995610    
Application 178116 resources: utime 0, stime 0

Example C++ and MPI Code calling Fortran using ACML

Below, a C++ MPI code is compiled and run. It calls a Fortran function that uses the ACML random number generator routines to generate distinct arrays of random values on different processors, then calls the ACML BLAS routine ddot to compute their dot products.

Mixed-language programming requires that the Fortran runtime libraries be named explicitly on the C++ link line (here -lpgf90, -lpgf902, and -lpgf90_rpm1); otherwise the CC wrapper does not know to link them.

/scratchdir => cat dp1.C
 
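#include <mpi.h>
#include <iostream>
#include <vector>
using namespace std;
 
// Fortran function fortran_sub in dp1s.f (default REAL result); the PGI compiler
// exports it as fortran_sub_, and Fortran passes arguments by reference
extern "C" { extern float fortran_sub_(int *, int *); }
 
int main(int argc, char* argv[])
{
 int myid, numprocs, i;
 float dpsend;
 
 MPI_Init(&argc,&argv);
 MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
 MPI_Comm_rank(MPI_COMM_WORLD,&myid); 

 // one receive slot per process, so any job size works
 vector<float> dpset(numprocs);
 
 dpsend=fortran_sub_(&myid,&numprocs);
  
 // MPI_FLOAT is the C/C++ datatype for float; MPI_REAL is the Fortran REAL type
 MPI_Gather(&dpsend, 1, MPI_FLOAT, &dpset[0], 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
 
 if (myid==0) {
      for(i= 0; i < numprocs; i++)
         cout << "dpset[" << i << "] = " << dpset[i] << endl;
 }
 
 MPI_Finalize();
 return 0;
}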


/scratchdir => cat dp1s.f
      function fortran_sub(myPE,totPEs)
  
      integer, parameter :: statesize=16, naggen=1, seedsize=1
      INTEGER :: myPE, totPEs, ierr
      integer :: info, i, n, nskip1, nskip2, imag
      integer :: state1(statesize), state2(statesize), seed(seedsize)
c     explicit result type; matches the float return declared in dp1.C
      real :: fortran_sub
      double precision :: dp, ddot
      double precision, parameter :: a=0.d0, b=1.0d0
      double precision, allocatable :: x(:), y(:)
  
C   all processors working from the same random sequence, same seed
 
      seed(1) = 1234
      call drandinitialize(naggen,1,seed,1,state1,statesize,info)
      state2 = state1
 
c   pull vectors out of sequence based on processor number
 
      n = 100000
      nskip1 = n * myPE
      nskip2 = nskip1 + (n * totPEs)
      call drandskipahead(nskip1,state1,info)
      call drandskipahead(nskip2,state2,info)
 
      allocate(x(n), y(n))
      call dranduniform(n,a,b,state1,x,info)
      call dranduniform(n,a,b,state2,y,info)
 
c   compute dot product of two vectors from random distribution
 
      dp = ddot(n,x,1,y,1)
 
      deallocate(x,y)
 
c    return the normalized dot product; the C++ caller gathers it
 
      fortran_sub = 4*(dp/n)
 
      end


/scratchdir => cat rundpmixed
#PBS -N dpmixedjob 
#PBS -q debug
#PBS -l mppwidth=8
#PBS -l walltime=00:05:00
#PBS -e dpmixedjob.out
#PBS -j eo
 
cd $PBS_O_WORKDIR
module load acml 
ftn -fastsse -c dp1s.f      #compile the Fortran function

#compile C++ and link in fortran
CC -o dp1C dp1s.o dp1.C -lpgf90 -lpgf902 -lpgf90_rpm1 

aprun -n 8 ./dp1C         #launch parallel job on compute nodes



/scratchdir => cat dpmixedjob.out
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/opt/xt-pe/1.5.34a/bin/snos64/ftn: INFO: catamount target is being used
/opt/xt-pe/1.5.34a/bin/snos64/CC: INFO: catamount target is being used
dp1.C:
dpset[0] = 0.997962
dpset[1] = 0.998256
dpset[2] = 1.00215
dpset[3] = 0.995581
dpset[4] = 0.999859
dpset[5] = 0.994864
dpset[6] = 0.997083
dpset[7] = 1.00007

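As expected, the second example, in which C++ calls Fortran to use the ACML routines, produces results consistent with the first example above: every normalized dot product is close to 1. The individual values differ because dp1s.f uses n = 100000 rather than the 1000000 of dp1.f, so each processor draws different vectors from the random sequence.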
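The reason the values cluster around 1 follows from the normalization 4*(dp/n) used in both codes. Treating the entries x_i and y_i as independent samples from the uniform distribution on [0, 1] (an idealization of the generator output), with dp = x_1 y_1 + ... + x_n y_n:

```latex
\begin{aligned}
\mathbb{E}[x_i] &= \tfrac{1}{2}, \qquad \mathbb{E}[x_i^2] = \tfrac{1}{3},\\
\mathbb{E}[x_i y_i] &= \mathbb{E}[x_i]\,\mathbb{E}[y_i] = \tfrac{1}{4},
  \qquad \operatorname{Var}(x_i y_i) = \mathbb{E}[x_i^2]\,\mathbb{E}[y_i^2] - \tfrac{1}{16}
  = \tfrac{1}{9} - \tfrac{1}{16} = \tfrac{7}{144},\\
\mathbb{E}\!\left[\frac{4\,dp}{n}\right] &= 1, \qquad
\sigma\!\left(\frac{4\,dp}{n}\right) = \frac{4}{\sqrt{n}}\sqrt{\frac{7}{144}} = \frac{\sqrt{7}}{3\sqrt{n}}.
\end{aligned}
```

This gives a spread of about 8.8 × 10^-4 for n = 1000000 (dp1.f) and about 2.8 × 10^-3 for n = 100000 (dp1s.f), consistent with the wider scatter of the values in the second run.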


Page last modified: Wed, 18 Nov 2009 19:03:20 GMT
Page URL: http://www.nersc.gov/nusers/systems/franklin/software/acml.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
