National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
IBM SP Parallel Scaling Overview - Introduction


Capability Computing

Finding the level of parallelism best suited to a given problem can be challenging. In general, it involves careful consideration of the nature and size of the problem to be solved, the properties of the compute hardware and interconnect, the algorithm, and the time scale on which a solution is required.

Consider, for example, dense matrix multiplication, C(n,n) = A(n,n)*B(n,n), encountered in LAPACK, ESSL, ScaLAPACK and PESSL as a DGEMM routine. The level of parallelism which provides the fastest solution depends on the scale of the problem. For small problems, the setup and overhead of parallelism outweigh any benefit that parallel computing brings. Beyond the problem size at which parallelism begins to pay off lies the regime of capability computing.
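The trade-off between overhead and benefit can be illustrated with a toy cost model. This is only a sketch: the flop rate, the fixed overhead, and the function names are assumptions chosen for illustration, not seaborg measurements, and real parallel overhead is not a single constant.

```python
# Toy cost model. All constants are illustrative assumptions,
# not measurements from seaborg.
FLOP_RATE = 1.0e9     # assumed sustained flop/s per CPU
OVERHEAD_S = 1.0e-3   # assumed fixed cost of going parallel, in seconds


def serial_time(n):
    """Time for an n x n dense multiply on one CPU (about 2*n**3 flops)."""
    return 2.0 * n**3 / FLOP_RATE


def parallel_time(n, cpus):
    """Work divided ideally across cpus, plus the fixed overhead."""
    return serial_time(n) / cpus + OVERHEAD_S


def crossover_n(cpus):
    """Smallest n for which the parallel run beats the serial run."""
    n = 1
    while parallel_time(n, cpus) >= serial_time(n):
        n += 1
    return n


if __name__ == "__main__":
    for cpus in (2, 16):
        print(cpus, "CPUs: parallel wins from n =", crossover_n(cpus))
```

Below the crossover size the fixed overhead dominates and the serial run is faster; above it the divided work wins. Where the crossover falls depends entirely on the assumed constants.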

[Figure: Scaling of NxN matrix-matrix multiply]

Parallel computing means more than providing a faster solution. For problems of sufficient scale, capability computing is a requirement for any solution at all. For example, in most cases there is no practical way to extend a solution on a single 32-bit CPU beyond the 2 GB address space, which constrains N < 10^4 for such an approach. Likewise, shared memory solutions cannot be extended beyond the memory limits of a single SMP node. Capability computing makes possible certain classes of problems which are unapproachable on a single node, and it ultimately necessitates hardware such as high bandwidth, low latency interconnects and parallel filesystems.
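The bound on N follows from simple arithmetic. The sketch below assumes that all three matrices (A, B and C) are held in memory at once as 8-byte double precision values and that the whole 2 GB address space is available for data; in practice less is usable, so the real limit is somewhat lower.

```python
import math

BYTES_PER_DOUBLE = 8
ADDRESS_SPACE = 2 * 1024**3   # 2 GB address space of a 32-bit process
MATRICES = 3                  # A, B and C resident together (assumption)


def max_n(address_space=ADDRESS_SPACE, matrices=MATRICES):
    """Largest n such that `matrices` n x n double arrays fit in memory."""
    return math.isqrt(address_space // (matrices * BYTES_PER_DOUBLE))


print(max_n())  # 9459
```

Under these assumptions the largest square problem is N = 9459, just under 10^4.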

Specific properties of the compute hardware are also demonstrated above. If the nodes had 8 CPUs instead of 16, the crossover point between a single-node and a multiple-node solution would be at a different problem size. On seaborg's 16-way SMP nodes, the threaded and MPI-based matrix multiplies show asymptotically identical performance. Both use shared memory and avoid switch communication.

Dense matrix multiplication is extremely simple from an algorithmic perspective, and not too much should be inferred from the above scaling data with regard to other algorithms or computations. As a rough sketch, however, it does represent how the scaling of problem size and machine capabilities affects the optimal solution strategy. Other algorithms will have different scaling properties, but the overall trends and the transition to capability computing occur similarly.

It's worth mentioning that not all tasks, scientific or otherwise, need massive parallelism. Post-processing of data, debug work, and data workup are important parts of scientific computing and are often achievable on a single node or CPU. As such, not every job run on a machine such as seaborg will be a capability job.

For the class of scientific problems which do require the capabilities offered by large scale parallel computation, seaborg provides development and production environments for implementing and solving scientific problems of scale.

This document and the consultancy resources available within the NERSC User Services Group can provide answers to researchers about scaling and optimal implementation of scientific codes on NERSC hardware.

Constraints to Scaling

As a way of setting boundary conditions, it is useful to lay out the architectural constraints that the IBM SP places on running in parallel at large scale. These limitations are intrinsic to the machine and are mentioned only briefly (along with some notes on how they are mitigated) before moving on to application level issues.

4096-way MPI:

Currently the MPI implementation on seaborg supports up to 4096 tasks, or 256 fully packed 16-way nodes. Higher concurrency is achievable only by mixing threads with MPI tasks (e.g., OpenMP threads within each MPI task). When approaching this upper bound on MPI tasks, issues involving performance and memory arise.
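The arithmetic behind the task limit, and how threads extend the reachable concurrency, can be sketched as follows. The helper name hybrid_layout and the 320-node example are hypothetical and used only for illustration; the 16 CPUs per node and the 4096-task limit come from the text above.

```python
CPUS_PER_NODE = 16     # width of a seaborg SMP node
MAX_MPI_TASKS = 4096   # MPI task limit quoted above


def hybrid_layout(nodes, threads_per_task):
    """Return (mpi_tasks, total_cpus) for fully packed nodes.

    Raises ValueError if the thread count does not divide the node width
    or if the layout needs more MPI tasks than the limit allows.
    """
    if CPUS_PER_NODE % threads_per_task != 0:
        raise ValueError("threads per task must divide the CPUs per node")
    tasks = nodes * (CPUS_PER_NODE // threads_per_task)
    if tasks > MAX_MPI_TASKS:
        raise ValueError("layout exceeds the MPI task limit")
    return tasks, nodes * CPUS_PER_NODE


if __name__ == "__main__":
    print(hybrid_layout(256, 1))   # pure MPI on 256 nodes
    print(hybrid_layout(320, 2))   # hypothetical 320 nodes, 2 threads per task
```

With one thread per task, 256 nodes is the ceiling. With two threads per task, the hypothetical 320-node job uses 5120 CPUs while staying within the task limit at 2560 MPI tasks.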

Process Scheduling:

As with most types of cluster computing, there is no fine-grained synchronization between nodes. This means synchronizing becomes more difficult at higher concurrency. NERSC does its best to deal with this by enforcing resource uniformity, e.g., by not sharing nodes between user jobs and by automatically detecting and dealing with errant processes on every node. Synchronization becomes increasingly important as jobs scale up and is a common bottleneck when scaling up parallel codes.
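One way to see why synchronization gets harder with concurrency is a small simulation in which every task suffers a random delay and a barrier completes only when the slowest task arrives. This is a toy model: the exponential delay distribution, its mean of 1% of a step, and the function name are assumptions made for illustration, not a description of seaborg's behavior.

```python
import random


def mean_barrier_wait(tasks, trials=200, seed=0):
    """Average time for the slowest of `tasks` tasks to reach a barrier.

    Each task takes 1.0 time units plus a random delay drawn from an
    exponential distribution with mean 0.01 (an assumed jitter model).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(1.0 + rng.expovariate(100.0) for _ in range(tasks))
    return total / trials


if __name__ == "__main__":
    for tasks in (16, 256, 1024):
        print(tasks, round(mean_barrier_wait(tasks), 4))
```

Although the average delay per task is the same at every scale, the time to the barrier grows with the number of tasks, because a larger job is more likely to contain at least one badly delayed task.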

Job Scheduling:

Scheduling a job requires waiting for a number of resources proportional to its concurrency to become available. The lack of checkpoint/restart or gang scheduling capabilities with performance sufficient to deal with large scale parallel jobs means that the scheduling of small, long running jobs will be at cross purposes with the scheduling of large scale parallel jobs. NERSC elevates the priority of larger concurrency jobs to enhance their throughput and has regular NERSC Users Group (NUG) discussions about queue structure.

Scaling up a parallel application is largely about avoiding constraints and bottlenecks. Aside from the unavoidable constraints above, many parts of the code itself may come under algorithmic strain as concurrency is increased. Knowing the constraints of the chosen algorithms and their alternatives is of great benefit in avoiding bottlenecks.

Methods of dealing with these constraints and bottlenecks are provided in the next two sections.



Page last modified: Mon, 11 Jan 2010 22:15:23 GMT
Page URL: http://www.nersc.gov/news/reports/technical/seaborg_scaling/intro.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov