National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory

Programming on Franklin

1. Overview

Code development on Franklin is typically performed on a Franklin "login node" under a standard SUSE Linux shell environment. See Franklin User Environment for more information about the interactive shell environment.

The login nodes run a fully functional instance of SUSE Linux. The compute nodes, where all parallel jobs are executed, run an operating system known as CLE (Cray Linux Environment). CLE is designed to support high performance computing applications without the overhead of a full Linux distribution. CLE has only a limited number of system calls and Cray does not support run-time dynamic libraries under CLE.

1.1 Parallel Codes

Most parallel codes running on Franklin are written in Fortran, C, or C++ to run in SPMD (Single Program, Multiple Data) mode, with explicit calls to the MPI libraries to communicate among tasks. Codes are typically compiled and linked on the login nodes.

Cray provides a convenient set of commands that should be used in almost all cases for compiling and linking parallel programs:

  • ftn - Use for Fortran source code
  • cc - Use for C source code
  • CC - Use for C++ source code

If you invoke the compilers and linker with these names, you do not need to explicitly link with the MPI libraries or other Cray system software libraries. All the MPI and Cray system include directories are also transparently imported. There are man pages for ftn, cc, and CC on Franklin.

The Cray compiler commands ultimately invoke a third-party compiler suite to build your code. The Portland Group compilers are used by default. Pathscale, GNU, and Cray CCE compilers (including UPC compilers) are also available. To change the underlying compiler, use the module command.

franklin% module swap PrgEnv-pgi PrgEnv-pathscale   !For Pathscale
franklin% module swap PrgEnv-pgi PrgEnv-gnu         !For GNU
franklin% module swap PrgEnv-pgi PrgEnv-cray        !For Cray and UPC

You must use these module commands to change the default base compiler both at compile time and at run time in your batch script.

WARNING: Do not set the environment variables MPICH_CC, MPICH_F90, or MPICH_F77. Doing so will put the compilers into an infinite loop.

1.2 Serial Codes

You may need to build small serial codes meant to execute on the login nodes, or serially in your batch script. These executables should also be compiled with the ftn, cc, or CC wrappers.

Serial codes that run on the login nodes or serially in your batch script should be short (less than 5 minutes), with a small memory footprint (less than 1 GB). The nodes that run these codes are shared among many users and cannot support production computing runs. Serial codes that require more time or more memory should be run on the compute nodes; see Running Jobs on Franklin for more information.

2. Simple Examples

Here is a basic example of how to compile a Fortran 90 MPI code into a parallel executable on Franklin.

franklin% ftn -fast -o simple.x simple.f90

This command compiles the source code (which includes MPI calls) and links with system libraries to produce a parallel executable named simple.x, which is ready to run on the Franklin compute nodes. The PGI compiler option -fast enables the basic optimization level recommended by NERSC. (Please see PGI Fortran Compiler on Franklin for more information on compiler options for PGI as well as other compilers.)

The analogous examples for C and C++ are:

franklin% cc -fast -o simple.x simple.c
franklin% CC -fast -o simple.x simple.C

3. Compilers

As described in the overview, you should use these commands to build parallel code: ftn, cc, and/or CC.

In the default environment, ftn, cc, and CC invoke the Portland Group (PGI) compilers. You can change the underlying compiler as described in the overview above. For more information, explore the following links:

4. MPI

Cray's MPI library is based on MPICH-2 and implements the MPI 2.0 Standard, except for dynamic process spawn functions. MPI is implemented on top of Cray's Portals low-level communications interface.

C++ codes must include "mpi.h" before any other include files.

For more information:

4.1 MPI Rank Assignments

The distribution of MPI ranks on the nodes can be written to the standard output file by setting environment variable PMI_DEBUG to 1. Users can control the distribution of MPI tasks on the nodes using the environment variable MPICH_RANK_REORDER_METHOD. See MPI Task Distribution on Nodes and the "intro_mpi" man page for more information.

4.2 Some XT-specific tuning for MPI programs

  • XT is optimized for message preposting. Posting receive calls first can improve performance.
  • Avoid MPI_(I)probe, which eliminates many of the advantages of the Portals network protocol stack.
  • Aggregate very small messages into a larger message.
  • XT has limited optimization for non-contiguous MPI derived data types. In contrast to some platforms, it may be better to do multiple transfers of contiguous data rather than sending and receiving non-contiguous data types.
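
The last point can be illustrated without any MPI calls. The following is a minimal sketch, assuming a row-major matrix of doubles: it copies one strided column into a contiguous buffer, which could then be handed to a single ordinary contiguous send instead of a non-contiguous derived data type. The function name pack_column and its parameters are hypothetical and are not part of MPI or any Cray library.

```c
#include <stddef.h>

/* Hypothetical helper (not part of MPI or any Cray library).
 * Copies column `col` of a row-major matrix with `nrows` rows and
 * `ncols` columns into the contiguous buffer `out`, which must have
 * room for at least `nrows` doubles. */
void pack_column(const double *matrix, size_t nrows, size_t ncols,
                 size_t col, double *out)
{
    for (size_t i = 0; i < nrows; ++i) {
        /* Consecutive column elements are `ncols` doubles apart. */
        out[i] = matrix[i * ncols + col];
    }
}
```

Whether packing by hand actually beats a derived data type depends on the message size and access pattern, so it is worth timing both variants in the real application.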

5. SHMEM Programming

The Cray SHared, distributed MEMory access (SHMEM) library is a set of logically shared, distributed memory access routines. Cray SHMEM library routines are similar to MPI library routines in that they both pass data among a set of parallel processors. SHMEM routines use one-sided put and get communications to remote address spaces. Cray SHMEM is implemented on top of the Portals low-level message passing scheme.

As with MPI, a header file is required:

! For Fortran
include 'mpp/shmem.fh'

/* For C/C++ */
#include <mpp/shmem.h>

Compiler wrappers will automatically link the SHMEM libraries:


% ftn shmem_program.f 
% cc shmem_program.c 
% CC shmem_program.C 

Please refer to the intro_shmem man page for more information about SHMEM.

5.1 Some XT-specific tuning for SHMEM programs

  • Use non-blocking SHMEM operations if possible.
  • SHMEM barriers are performed in software and have high overhead, so use them with care.
  • Use shmem_fence rather than shmem_quiet where possible.
  • Don't use strided SHMEM operations in performance-critical sections of the application.

6. Executable File Sizes and Compile Times

Consider the following 33-byte Fortran source program:

/scratchdir => cat hello.f
      print *,"Hello!"
      end

When this code is compiled for serial execution on the login nodes under a standard Linux environment that supports dynamic loading, the executable size is 2.2 MB using the PGI compilers and 26.4 KB using the GNU compilers. However, when the same source code is compiled with the cross-compiling wrapper ftn for the microkernel environment on the compute nodes, where static loading is required, the executable size is 13.0 MB using the PGI compilers and 11.1 MB using the GNU compilers. Executables for the parallel, compute node environment are larger because of static linking.

If an attempt is made to statically link together an executable in excess of 2 GB, the linker will produce a truncation error message such as the following:

...
: relocation truncated to fit: ...

It is then generally necessary for the user to reduce large static arrays in the code, replacing them with dynamically allocated arrays. This problem is more common in older codes that use large static arrays (or Fortran common arrays) as a user-managed dynamic memory area shared in various ways by subroutines.
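
As a minimal sketch of that change in C, a work area that was previously declared as a large static array can instead be allocated on the heap at run time, so it no longer contributes to the size of the statically linked executable. The helper names alloc_work_array and free_work_array are hypothetical and chosen only for illustration.

```c
#include <stdlib.h>

/* Before: a large static work area whose size is fixed at link time
 * and counts toward the size of the executable image, e.g.
 *
 *     static double work[N];
 *
 * After: request the same storage from the heap at run time. */

/* Hypothetical helper: allocate a zero-filled work array of `n`
 * doubles. Returns NULL if the allocation fails. */
double *alloc_work_array(size_t n)
{
    return calloc(n, sizeof(double));
}

/* Hypothetical helper: release a work array obtained from
 * alloc_work_array. Passing NULL is harmless. */
void free_work_array(double *work)
{
    free(work);
}
```

The caller should check the returned pointer for NULL before use. In Fortran 90 the corresponding change would be to declare the array allocatable and size it with an allocate statement.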

Compile times may be significantly longer when cross-compiling for the static linking environment on the compute nodes because of the added I/O time required to make static copies of library routines.

The object mode on Franklin is 64-bit, which means that all executables will run in 64-bit address mode.

6.1 Memory Considerations

Each quad-core node has about 7.38 GB of user-accessible memory. When running in the default quad-core mode with 4 MPI tasks per node, each MPI task will have access to about 1.85 GB of memory. Running with one or two MPI tasks per node will allow each MPI task to use 7.38 or 3.69 GB of user memory, respectively. Memory use by the MPI or SHMEM layer may grow as you move to higher processor counts. See Memory Usage Consideration on Franklin for more details.
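
The per-task figures quoted above follow from dividing the node's user-accessible memory evenly among the MPI tasks on that node. A minimal sketch of that arithmetic is shown below; the macro FRANKLIN_NODE_MEM_GB and the function mem_per_task_gb are hypothetical names, and the calculation ignores any memory consumed by the MPI or SHMEM layer itself.

```c
/* User-accessible memory per quad-core node in GB, as quoted in the
 * text. The macro name is an assumption made for this sketch. */
#define FRANKLIN_NODE_MEM_GB 7.38

/* Hypothetical helper: approximate memory in GB available to each MPI
 * task when `tasks_per_node` tasks share one node evenly.
 * Returns 0.0 for a non-positive task count. */
double mem_per_task_gb(int tasks_per_node)
{
    if (tasks_per_node <= 0) {
        return 0.0;
    }
    return FRANKLIN_NODE_MEM_GB / tasks_per_node;
}
```

For example, mem_per_task_gb(4) evaluates to roughly 1.845, which is the "about 1.85 GB" figure given for the default mode.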

6.2 Debugging and Optimization

The basic debugging tool on Franklin is the Distributed Debugging Tool (DDT) from Allinea Software.

The Multi Core Report, jointly produced by Cray, NERSC, and AMD, presents dual-core and quad-core processor architectures, analyzes the impact of multi-core processors on the performance of selected micro and application benchmarks, and discusses compiler options and software optimization techniques. Please also refer to Important Portland Group Compiler Options for basic tuning with compiler option choices.

Here is a collection of papers written by Stephen Whalen from Cray on optimizing the NPB benchmarks for multi-core AMD Opteron microprocessors. Many of the techniques described in these papers could be used to optimize general applications.

7. Further Information

Page last modified: Wed, 09 Sep 2009 23:28:36 GMT
Page URL: http://www.nersc.gov/nusers/resources/franklin/programming/
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
