
OpenMP Tasking Example: LU

Task Dependencies with LU Decomposition

Starting with the serial version of LU, shown below, we can see that each diagonal element divides every element in the rest of its row, and the resulting changes are then propagated to each of the rows below it. This repeats for each diagonal element in turn.

for(int i=0; i<size; i++) {
    for(int j=i+1; j<size; j++) {
        A[j*size+i] /= A[i*size+i];
        for(int k=i+1; k<size; k++) {
            A[j*size+k] -= A[j*size+i] * A[i*size+k];
        }
    }
}

This can be broken into four functions and applied to a blocked version of the matrix. The first is the same as the original algorithm above, and it is applied to the diagonal blocks. The changes from the diagonal block are then propagated to the blocks in the same column and the blocks in the same row. A fourth function propagates these changes to the rest of the remaining matrix, and then the process is repeated on that remaining inner matrix. Breaking the matrix into blocks not only enables parallelism, but also improves cache usage.

for(int i=0; i<num_blocks; i++) {
    diag_func( block_list[i][i] );
    for(int j=i+1; j<num_blocks; j++) {
        row_func( block_list[i][j], block_list[i][i] );
    }
    for(int j=i+1; j<num_blocks; j++) {
        col_func( block_list[j][i], block_list[i][i] );
        for(int k=i+1; k<num_blocks; k++) {
            inner_func( block_list[j][k], block_list[i][k],
                        block_list[j][i] );
        }
    }
}
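
For reference, here is a minimal sketch of what the four block kernels might look like, assuming each block is a contiguous BS x BS row-major array; the block size BS, the exact signatures, and the loop orders are illustrative assumptions rather than the only valid choices. After diag_func runs, both factors are stored in place: L in the strict lower triangle (with an implicit unit diagonal) and U in the upper triangle.

#define BS 64  /* block size; chosen here only for illustration */

/* Unblocked LU on a diagonal block -- the same loops as the
   serial version above, applied to one BSxBS block. */
void diag_func( double *D ) {
    for(int i=0; i<BS; i++) {
        for(int j=i+1; j<BS; j++) {
            D[j*BS+i] /= D[i*BS+i];
            for(int k=i+1; k<BS; k++) {
                D[j*BS+k] -= D[j*BS+i] * D[i*BS+k];
            }
        }
    }
}

/* Propagate the diagonal block's L factor to a block R in the
   same row of blocks. */
void row_func( double *R, const double *D ) {
    for(int i=0; i<BS; i++) {
        for(int j=i+1; j<BS; j++) {
            for(int k=0; k<BS; k++) {
                R[j*BS+k] -= D[j*BS+i] * R[i*BS+k];
            }
        }
    }
}

/* Propagate the diagonal block's U factor to a block C in the
   same column of blocks. */
void col_func( double *C, const double *D ) {
    for(int i=0; i<BS; i++) {
        for(int j=0; j<BS; j++) {
            C[j*BS+i] /= D[i*BS+i];
            for(int k=i+1; k<BS; k++) {
                C[j*BS+k] -= C[j*BS+i] * D[i*BS+k];
            }
        }
    }
}

/* Trailing update of an inner block: I -= C * R, where R is the
   updated block from the same row and C from the same column. */
void inner_func( double *I, const double *R, const double *C ) {
    for(int j=0; j<BS; j++) {
        for(int i=0; i<BS; i++) {
            for(int k=0; k<BS; k++) {
                I[j*BS+k] -= C[j*BS+i] * R[i*BS+k];
            }
        }
    }
}

Most of the runtime goes to inner_func, which is a plain matrix-matrix update on cache-sized blocks; this is where the blocked form gets its improved cache usage.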

This can be parallelized with a parallel for loop, but requires some additional changes beyond adding a pragma.

#pragma omp parallel
{
    for(int i=0; i<num_blocks; i++) {
        #pragma omp single
        diag_func( block_list[i][i] );
        #pragma omp for
        for(int j=i+1; j<num_blocks; j++) {
            row_func( block_list[i][j], block_list[i][i] );
        }
        #pragma omp for
        for(int j=i+1; j<num_blocks; j++) {
            col_func( block_list[j][i], block_list[i][i] );
            for(int k=i+1; k<num_blocks; k++) {
                inner_func( block_list[j][k], block_list[i][k],
                            block_list[j][i] );
            }
        }
    }
}

The outer loop cannot be parallelized because each operation on a block depends on the output of the previous iteration. The single is needed because only one thread should execute the diag call, and the implicit barrier at the end of the single keeps the rest of the threads from moving on to the next loop. Once the diag call is done, the remainder of the row is split up and processed. The next loop splits up the remainder of the matrix below the diagonal block. Once this is done, the next iteration of the outer loop starts on a matrix that is one block smaller in each dimension, and the process repeats.

The OpenMP task version without dependencies looks very similar:

#pragma omp parallel
{
    #pragma omp single
    {
        for( int i=0; i<num_blocks; i++ ) {
            diag_func( block_list[i][i] );
            for( int j=i+1; j<num_blocks; j++ ) {
                #pragma omp task
                row_func( block_list[i][j], block_list[i][i] );
                #pragma omp task
                col_func( block_list[j][i], block_list[i][i] );
            }
            #pragma omp taskwait

            for( int j=i+1; j<num_blocks; j++ ) {
                for( int k=i+1; k<num_blocks; k++ ) {
                    #pragma omp task
                    inner_func( block_list[j][k], block_list[i][k],
                                block_list[j][i] );
                }
            }
            #pragma omp taskwait
        }
    }
}

This version breaks the work into a task per block: instead of one chunk consisting of a column block followed by an entire row of inner calls, each inner update becomes its own task. As a result, the column blocks need to be handled before the inner loop, and since they are independent of the row calls, they can execute in parallel with them. Like the parallel for version, this one synchronizes after each call to diag, after the row/col loop, and then again at the end of each iteration.

By using task dependencies, the portions of the code where most of the threads wait at a taskwait or barrier can be removed, exposing enough tasks to saturate the machine.

#pragma omp parallel
{
    #pragma omp single
    {
        for(int i=0; i<num_blocks; i++) {
            #pragma omp task depend( inout: block_list[i][i] )
            diag_func( block_list[i][i] );
            for(int j=i+1; j<num_blocks; j++) {
                #pragma omp task depend( in: block_list[i][i] ) \
                                 depend( inout: block_list[i][j] )
                row_func( block_list[i][j], block_list[i][i] );
            }
            for(int j=i+1; j<num_blocks; j++) {
                #pragma omp task depend( in: block_list[i][i] ) \
                                 depend( inout: block_list[j][i] )
                col_func( block_list[j][i], block_list[i][i] );
                for(int k=i+1; k<num_blocks; k++) {
                    #pragma omp task depend( in: block_list[i][k], block_list[j][i] ) \
                                     depend( inout: block_list[j][k] )
                    inner_func( block_list[j][k], block_list[i][k],
                                block_list[j][i] );
                }
            }
        }
    }
}

This version of LU with task dependencies has no taskwait or barrier in it; instead, each task waits only on its specified dependencies.
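
To experiment with the dependency version, a small driver along the following lines can be used. This is only a sketch under the same assumptions as the kernels above: NB and BS are illustrative sizes, each block is a separately allocated contiguous array, and the diagonal is inflated so that LU without pivoting stays numerically stable.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define BS 64   /* block size; an illustrative choice */
#define NB 16   /* number of blocks per dimension; also illustrative */

/* The four block kernels sketched earlier. */
void diag_func( double *D );
void row_func( double *R, const double *D );
void col_func( double *C, const double *D );
void inner_func( double *I, const double *R, const double *C );

int main( void ) {
    static double *block_list[NB][NB];

    /* Allocate each block as a contiguous BSxBS row-major array. */
    for(int i=0; i<NB; i++) {
        for(int j=0; j<NB; j++) {
            block_list[i][j] = malloc( sizeof(double)*BS*BS );
            for(int e=0; e<BS*BS; e++) {
                block_list[i][j][e] = (double)rand() / RAND_MAX;
            }
        }
    }
    /* Inflate the diagonal so LU without pivoting stays stable. */
    for(int i=0; i<NB; i++) {
        for(int e=0; e<BS; e++) {
            block_list[i][i][e*BS+e] += NB*BS;
        }
    }

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            for(int i=0; i<NB; i++) {
                #pragma omp task depend( inout: block_list[i][i] )
                diag_func( block_list[i][i] );
                for(int j=i+1; j<NB; j++) {
                    #pragma omp task depend( in: block_list[i][i] ) \
                                     depend( inout: block_list[i][j] )
                    row_func( block_list[i][j], block_list[i][i] );
                }
                for(int j=i+1; j<NB; j++) {
                    #pragma omp task depend( in: block_list[i][i] ) \
                                     depend( inout: block_list[j][i] )
                    col_func( block_list[j][i], block_list[i][i] );
                    for(int k=i+1; k<NB; k++) {
                        #pragma omp task \
                            depend( in: block_list[i][k], block_list[j][i] ) \
                            depend( inout: block_list[j][k] )
                        inner_func( block_list[j][k], block_list[i][k],
                                    block_list[j][i] );
                    }
                }
            }
        }
    } /* implicit barrier when the parallel region ends */
    printf( "blocked LU of a %dx%d matrix took %.3f s\n",
            NB*BS, NB*BS, omp_get_wtime() - t0 );
    return 0;
}

Compiled with OpenMP enabled (for example, gcc -fopenmp), the body of the parallel region can be swapped for either of the other two parallel variants above to compare how they scale.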