NERSC: Powering Scientific Discovery Since 1974

OpenMP Tasking Example: Jacobi

Task Dependencies with a Jacobi Iterative Stencil

We start with the serial version of Jacobi. It requires two arrays, since each updated point depends on its neighbors' values from the previous iteration. Writing into a second array prevents an update from overwriting a value that its neighbors still need during the current iteration.

for(int i=0; i<iterations; ++i) {
    // src and dest are row-major n x n grids; row y starts at index y*n.
    for(int y=1; y<(n-1); ++y) {
        for(int x=(y*n)+1; x<(y*n)+n-1; ++x) {
            dest[x]=(src[x-n]+src[x+n]+src[x]+src[x-1]+src[x+1])*0.2;
        }
    }
    std::swap(dest, src);
}

Parallelizing this with a parallel for loop is straightforward.

for(int i=0; i<iterations; ++i) {
    #pragma omp parallel for
    for(int y=1; y<(n-1); ++y) {
        for(int x=(y*n)+1; x<(y*n)+n-1; ++x) {
            dest[x]=(src[x-n]+src[x+n]+src[x]+src[x-1]+src[x+1])*0.2;
        }
    }
    std::swap(dest, src);
}

Adding task dependencies is slightly more involved, but notice that there is now no synchronization between iterations: a task for the next iteration can start as soon as the three rows it reads have been produced.

#pragma omp parallel
{
#pragma omp single
{
for(int i=0; i<iterations; i++) {
    for(int y=1; y<(n-1); ++y) {
        // Use the first element of each row as the dependence sentinel: this
        // task reads rows y-1, y, y+1 of src and writes row y of dest.
        // firstprivate(src, dest) captures the pointer values at task
        // creation, so the swap below cannot race with deferred tasks.
        #pragma omp task firstprivate(src, dest) \
            depend(in: src[(y-1)*n], src[y*n], src[(y+1)*n]) \
            depend(out: dest[y*n])
        for(int x=(y*n)+1; x<(y*n)+n-1; ++x) {
            dest[x]=(src[x-n]+src[x+n]+src[x]+src[x-1]+src[x+1])*0.2;
        }
    }
    std::swap(dest, src);
}
}
}


On larger problem sizes, one task per row creates too much scheduling overhead, so grouping several rows into each task yields better performance. There is no benefit to dividing the matrix into two-dimensional blocks rather than row chunks, since Jacobi has little data reuse.

#pragma omp parallel
{
#pragma omp single
{
for(int i=0; i<iterations; i++) {
    for(int y=1; y<(n-1); y+=chunk_size) {
        // Dependence sentinels are the start rows of the neighboring chunks,
        // clamped at the boundaries so they match the out sentinels of the
        // tasks that produced those rows in the previous iteration.
        int prev = (y-chunk_size >= 1) ? y-chunk_size : y;
        int next = (y+chunk_size < n-1) ? y+chunk_size : y;
        #pragma omp task firstprivate(src, dest) \
            depend(in: src[prev*n], src[y*n], src[next*n]) \
            depend(out: dest[y*n])
        for(int row=y; row < y+chunk_size && row < n-1; row++) {
            for(int x=(row*n)+1; x<(row*n)+n-1; ++x) {
                dest[x]=(src[x-n]+src[x+n]+src[x]+src[x-1]+src[x+1])*0.2;
            }
        }
    }
    std::swap(dest, src);
}
}
}