HPF for Fortran Users - Productivity Gains Using HPF on the T3E
High Performance Fortran (HPF) provides a relatively easy means for Fortran programmers to use parallel computers. It is essentially Fortran 90/95 augmented with compiler directives and an HPF library of utility routines. Considering that many codes consist of a large amount of "bells and whistles," along with computational kernels consisting of data-parallel and non-parallel operations, HPF allows one to easily port everything but the non-parallel operations. The latter can, in theory, be treated using utility routines, scientific software libraries, or other routines written, e.g., using message passing. HPF is ideal for porting legacy Fortran codes that are structured and cleanly written (aren't they all?). In this talk I will describe the use of HPF on the T3E. In addition to the basic concepts, three specific examples will be given: (1) a particle simulation code for modeling a charged particle beam in a solenoid, (2) a direct solver for modeling waves on a string, and (3) a split-operator spectral code for solving the time-dependent Schrödinger equation. Besides HPF, I will also discuss the thought process that goes into writing parallel code. For example, unlike programming vector supercomputers, where one typically asks "how can I vectorize this loop?", on parallel computers one usually asks "how can I distribute my data to minimize communications?" Questions such as this are crucial to using parallel computers effectively, regardless of the programming paradigm. [Los Alamos National Laboratory] [Download not available]
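The data-distribution question raised in the abstract can be illustrated with a small sketch. It is not taken from the talk: the function names are hypothetical, the array is one-dimensional, and the only "communication" counted is a nearest-neighbour reference a(i+1) whose owner differs from the owner of a(i). The block mapping assumes contiguous blocks of size ceil(n/p); the cyclic mapping deals elements out round-robin.

```python
import math


def block_owner(i, n, p):
    """Owner (0-based) of 0-based index i when n elements are split into
    p contiguous blocks of size ceil(n / p). Assumed mapping, for illustration."""
    block = math.ceil(n / p)
    return i // block


def cyclic_owner(i, n, p):
    """Owner of index i when elements are dealt out round-robin to p processors."""
    return i % p


def remote_neighbor_refs(owner, n, p):
    """Count indices i whose right neighbour i + 1 lives on another processor.
    This is a crude proxy for the communication a stencil a(i) + a(i+1) needs."""
    return sum(1 for i in range(n - 1) if owner(i, n, p) != owner(i + 1, n, p))
```

For 16 elements on 4 processors, `remote_neighbor_refs(block_owner, 16, 4)` counts 3 boundary crossings, while `remote_neighbor_refs(cyclic_owner, 16, 4)` counts 15. For this access pattern the block layout needs far less communication; other access patterns can favour other layouts.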
We are developing and analyzing highly scalable equation solvers for finite element method (FEM) matrices on unstructured meshes. We maintain a minimal interface with the FEM implementation by constructing the coarse grids and coarse grid matrices automatically via maximal independent sets, Delaunay tessellations, and Galerkin coarse grid operators. We work within a parallel computing environment and with numerical primitives provided by the Portable, Extensible Toolkit for Scientific Computation (PETSc) from Argonne National Laboratory. PETSc is written in ANSI C with a strict object-oriented program architecture, which allows for highly portable and extensible program development. Our code is written in C++; we also use ParMetis (C) from the University of Minnesota as our mesh partitioner and FEAP (FORTRAN) from U.C. Berkeley for our FEM implementation. We have used the T3E at NERSC as our primary platform and have to date been able to solve problems with up to 4.3e6 equations in linear elasticity with large jumps in material coefficients. [UC Berkeley]
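As a rough illustration of the first step mentioned above, a maximal independent set of mesh vertices can be chosen greedily: pick a vertex, exclude its neighbours, and repeat. The sketch below is a serial, textbook version under stated assumptions (the mesh is given as a hypothetical adjacency dictionary and vertices are visited in sorted order). It is not the authors' parallel implementation, and the Delaunay and Galerkin steps are not shown.

```python
def maximal_independent_set(adjacency):
    """Greedy maximal independent set of an undirected graph.

    adjacency maps each vertex to an iterable of its neighbours.
    Returns a set of vertices, no two adjacent, to which no further
    vertex can be added (maximal, not necessarily maximum).
    """
    selected = set()
    excluded = set()
    for v in sorted(adjacency):      # visiting order is an arbitrary choice
        if v in excluded:
            continue
        selected.add(v)              # v becomes a coarse-grid vertex
        excluded.add(v)
        excluded.update(adjacency[v])  # its neighbours can no longer be chosen
    return selected
```

On a path graph with vertices 0 to 4, `maximal_independent_set({0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]})` selects every other vertex, which is the usual picture of coarsening a one-dimensional grid.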
Shmem and Synchronization Primitives on the Cray T3E
The task-asynchronous programming model and one-sided communication protocols will be briefly surveyed. The SHMEM library and symmetric memory on the Cray T3E will be discussed in detail. Other specialized synchronization primitives will also be touched on. [NERSC, Scientific Computing Group] [Download not available]
PVM is public domain software that enables a collection of heterogeneous computer systems to be used as a parallel virtual machine to solve problems concurrently. The Cray implementation of PVM for the T3E is based on the public domain PVM, version 3.3.10, and is extended in several ways to support its MPP architecture. It operates in two modes, standalone and distributed. This lecture presents an overview of the T3E PVM and introduces how to use PVM in the two different modes on the T3E for both parallel and distributed computing. [NERSC, Scientific Computing Group]
In this talk I will describe the T3E processor, the DEC Alpha EV5, and its local memory and cache. I will describe some techniques to take advantage of the T3E architecture to achieve faster single node performance on applications. I will also discuss some of the more useful compiler options for both the f90 and C/C++ compilers. [SGI/Cray Research]
T3E Multiprocessor Optimization and Debugging
In order to develop parallel applications, users need to debug them efficiently. I will focus my talk on debugging parallel applications using TOTALVIEW. A few tips will also be given on how to optimize parallel applications on the T3E. [NERSC, User Services Group] [Download not available]
I/O on the T3E
I will describe the Input/Output environment on the NERSC Cray T3Es. In particular this talk will:
• Introduce the basic design of the T3E and how it is connected to disks.
• Summarize some of the basic strategies for dealing with I/O in a parallel programming environment.
• Present the concept of Cray's "layers" to control and format I/O.
• Emphasize that use of the Cray Global I/O layer is the only safe way to write to a single file from multiple processors.
• Describe how to increase I/O performance by using disk striping.
[NERSC, User Services Group]
In this presentation we show the various stages of converting a program from PVP to MPP. After producing a basic message passing program, we optimize both the communications and the computational kernel. After some work we can achieve a factor-of-two speedup over a dedicated C90. [NERSC, User Services Group]
Portable lattice QCD Code on the T3E
We have developed and used a set of codes for simulating quantum chromodynamics, the theory of the strong interaction, on a variety of parallel machines. I will describe the code briefly, emphasizing the features that make it portable, and then show some benchmarks on the T3E. After discussing where we think the bottlenecks are, I will plead for help from the audience.
[Download not available]
Global Arrays: A Portable Shared Memory Programming Environment for Massively Parallel Computers
This presentation addresses the issue of how to program scalable multiprocessor systems. As we are witnessing a transition from distributed-memory message-passing to scalable shared-memory nonuniform memory access (NUMA) architectures, it becomes clear that the traditional shared-memory uniform memory access (UMA) programming model with a flat memory hierarchy is not sufficient to achieve high performance and good scalability for many applications. The Global Array (GA) toolkit provides an efficient and portable "shared-memory" programming interface for massively parallel systems. It combines advantages of the message passing model, such as explicit control of data locality, with convenient one-sided access to distributed data structures in the spirit of the shared-memory model. GA has been adopted by many large applications in computational chemistry, molecular dynamics, graphics, and financial security forecasting. It is currently being extended as a part of the DoE-2000 Advanced Computational Testing and Simulation (ACTS) project. [Pacific Northwest National Laboratory] [Download not available]
The parallel implementation of a plasma fluid turbulence model, appropriate for the study of fluctuations at the core of fusion devices, on the CRAY T3E at NERSC will be described. PVM has been adopted for message passing. The serial code is replicated on all processors used. Only matrix operations for the time-implicit linear terms and convolutions for the time-explicit nonlinear part of the calculation are distributed to multiple processors. For matrix operations, parallelization is done over the number of Fourier harmonics in which all physical quantities in the problem are expanded. For the convolutions, parallelization is done over the number of radial grid points. In addition to parallelization, optimization strategies and timing results will be described. This work is part of ORNL's contribution to the Numerical Tokamak Turbulence Project, one of the US DoE's Phase II Grand Challenges. [Oak Ridge National Laboratory]
Application of the PVODE Solver to Parallelize the Fluid Transport Code UEDGE
[Lawrence Livermore National Laboratory] [Download not available]
Overview of the ACTS Toolkit: A Set of Tools That Make It Easier to Write Parallel Programs
The ACTS (Advanced Computational Testing and Simulation) Toolkit is a set of DOE-developed tools that make it easier to develop parallel programs. These tools include PETSc, Aztec, PVODE, TAU, ScaLAPACK, and several others. NERSC is starting a program to evaluate these tools and, with some limits, to support them on NERSC systems. I'll give an overview of the ACTS toolkit components, and describe the support available at NERSC. [NERSC, Future Technologies Group] [Download not available]
ScaLAPACK is a library of linear algebra routines for distributed-memory MIMD computers. It contains routines for solving dense, band, and tridiagonal systems of linear equations, least squares problems, and eigenvalue problems. In this talk, we will give an overview of the functionality, the software infrastructure, the data distribution, and examples of how to use the library. [NERSC, Scientific Computing Group]
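The data distribution mentioned above is usually described as two-dimensional block-cyclic: rows and columns are cut into blocks, and the blocks are dealt out cyclically over a process grid. The sketch below shows only the index arithmetic, in plain Python with hypothetical function names. It assumes 0-based indices and that the first block lives on process 0, and it does not call ScaLAPACK itself.

```python
def block_cyclic_map(g, nb, nprocs):
    """Map a 0-based global index g to (process, local index) for a
    one-dimensional block-cyclic layout with block size nb over nprocs
    processes, assuming the first block is stored on process 0."""
    block = g // nb                       # which block g falls in
    proc = block % nprocs                 # blocks are dealt out cyclically
    local = (block // nprocs) * nb + g % nb
    return proc, local


def owner_2d(i, j, mb, nb, prow, pcol):
    """Process-grid coordinates owning matrix entry (i, j) when rows use
    block size mb over prow process rows and columns use nb over pcol."""
    return block_cyclic_map(i, mb, prow)[0], block_cyclic_map(j, nb, pcol)[0]
```

With a block size of 2 on 3 processes, global indices 0 and 1 land on process 0, indices 2 and 3 on process 1, indices 4 and 5 on process 2, and index 6 wraps around to process 0 as its local element 2. Applying the same rule independently to rows and columns gives the two-dimensional layout.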
AZTEC is a package for solving large sparse linear systems arising in scientific and engineering applications. It contains a highly developed matrix-vector multiplication routine for general sparse matrices. It is suited to solving large linear systems in massively parallel environments. [NERSC, Scientific Computing Group]
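To make the central kernel concrete, here is a minimal serial sketch of a sparse matrix-vector product. It assumes the common compressed sparse row (CSR) layout, chosen here only for illustration; it is not AZTEC's routine or necessarily its storage format, and the function name is hypothetical.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in values
    row_ptr : row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    x       : dense input vector
    """
    y = []
    for row in range(len(row_ptr) - 1):
        total = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_idx[k]]  # only stored nonzeros are touched
        y.append(total)
    return y
```

For the 3-by-3 matrix with rows (4, 0, 1), (0, 3, 0), and (2, 0, 5), the CSR arrays are `values = [4, 1, 3, 2, 5]`, `col_idx = [0, 2, 1, 0, 2]`, and `row_ptr = [0, 2, 3, 5]`; multiplying by `x = [1, 2, 3]` gives `[7.0, 6.0, 17.0]`. In a parallel setting each processor would typically hold a subset of rows and fetch the entries of `x` it needs from other processors.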
Remote Visualization at NERSC
As part of the Visualization Group's efforts to help remote NERSC users, NERSC recently purchased a remote visualization server. Remote visualization is an ongoing research topic in the field of visualization. Typical remote visualization techniques have been of the brute-force variety, with consequently poor results. A new range of applications and techniques is under development that will bring users better results and higher interactivity rates. We will discuss the capabilities of the NERSC visualization server, how to gain access to it, and how to use it. We will discuss traditional methods of remote visualization and some new ideas and techniques for remote visualization under investigation at NERSC. We will also demonstrate some prototype applications, still in the developmental phase, that will enable remote NERSC users to gain access to higher-end visualization capabilities. [NERSC, Graphics and Visualization Group] [Download not available]
TAU is a program and performance analysis tool framework developed over the last six years for parallel object-oriented language systems. TAU provides a framework for integrating program and performance analysis tools and components. A core tool component for parallel performance evaluation is a profile measurement and analysis package. The TAU portable profiling package was developed jointly by the University of Oregon and Los Alamos National Laboratory for profiling and tracing parallel C++ programs. The TAU profiling and tracing instrumentation is supported through an Application Programmer's Interface (API) that can be used at the library or application level. The API features the ability to capture performance data for C++ function, method, basic block, and statement execution, as well as template instantiation. The TAU profiling and tracing package has been integrated in the ACTS Toolkit. In addition, it is available to be used with other C++ libraries. Further information about the TAU framework can be found at:
Department of Computer and Information Science,
University of Oregon.
As part of the Grand Challenge Application entitled "Computational Chemistry for Nuclear Waste Characterization and Processing: Relativistic Quantum Chemistry of Actinides," we have developed a parallel version of the sequential spin-orbit configuration interaction (SOCI) program found in the freely distributed COLUMBUS Program System of electronic structure codes. This program, called PSOCI, takes advantage of the massive memory, disk space, and CPU cycles of large parallel computers such as the Cray T3E. PSOCI determines the ab initio electronic structure of molecules using a nonperturbative inclusion of spin-orbit (SO) interactions among valence electrons in the presence of spin-orbit-coupled relativistic effective core pseudopotentials. Spin-orbit and relativistic effects are most important in the actinide portion of the periodic table. Their inclusion complicates an already computationally intensive electronic structure problem (basically a large, sparse eigenvalue problem). Effective parallelism is achieved by the use of explicit distributed data structures, application-based disk I/O prefetching, when possible, and a static load-balancing scheme. Modifications are implemented, primarily, using the Global Arrays package for distributed memory management and ChemIO for handling the massive parallel I/O requirements. PSOCI speeds the solution to complex SOCI problems, increases by an order of magnitude the size of problems that can be addressed, and enables the solution of a new class of very large problems involving actinides. Here we present scalability and time-to-solution behavior for selected problems involving these heavy elements. [Argonne National Laboratory]