NERSC: Powering Scientific Discovery Since 1974

Profiling Your Application

Introduction

By quantifying the performance of your application on present-day architectures, you will be better able to prioritize, plan, and implement code changes that will enable good performance on Cori. Here, we provide general background on application profiling, as well as links to resources and tools available at NERSC to assist you in this effort.

Background

When preparing your profiling runs, focus on the main computation of your application and omit initialization steps, which may otherwise clutter the profiling results. In general, compile the code with the same optimization flags used in production. Occasionally, however, compilers optimize code so aggressively that profiling tools cannot pinpoint exactly which code was executed. In that case, it may be valuable to reduce the optimization level relative to typical production runs and to enable debugging symbols (-g in many compilers), which allows the profiler to report more informative messages about code regions.
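
As a minimal sketch of separating initialization from the main computation, the C fragment below times the two phases independently, so the measured region covers only the work of interest. The functions initialize, main_computation, and timed_run are hypothetical stand-ins for your own code, and the timer uses the C11 timespec_get call; a monotonic clock or a profiler's own region markers may be preferable in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds, using the C11 timespec_get call. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Hypothetical initialization step: kept outside the measured region. */
void initialize(double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (double)(i % 10);
}

/* Hypothetical main computation: the only part we want to profile. */
double main_computation(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sum;
}

typedef struct {
    double init_s;     /* seconds spent in initialization   */
    double compute_s;  /* seconds spent in main computation */
    double result;     /* value returned by the computation */
} run_timing;

/* Run both phases and record their elapsed times separately. */
run_timing timed_run(size_t n)
{
    run_timing t = {0.0, 0.0, 0.0};
    double *x = malloc(n * sizeof *x);
    assert(x != NULL);

    double t0 = wall_seconds();
    initialize(x, n);
    double t1 = wall_seconds();
    t.result = main_computation(x, n);
    double t2 = wall_seconds();

    t.init_s = t1 - t0;
    t.compute_s = t2 - t1;
    free(x);
    return t;
}
```

Calling timed_run(n) from a small driver and printing compute_s next to init_s gives a quick indication of how much the initialization phase would distort a whole-program profile.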

What to Collect

In this context, profiling can be roughly separated into two categories: general application profiling and deep analysis of on-node performance. In many ways, closely examining the relevant metrics in the general profiling category, and remedying any significant performance issues found there, is a prerequisite to deeper on-node profiling. The latter step is particularly critical for Cori, where achieving good on-node performance through vectorization and threading will be key.

Here, we provide example metrics under each category, as well as potential applications thereof in optimizing your application.

General application profiling

Careful assessment of these topic areas at the whole-application scale can provide valuable insight to guide subsequent deep inspection of on-node performance relevant to Cori, as well as enabling better application performance on present-day architectures.

  • General time breakdowns (e.g. via sampling and / or tracing)
    • Discover "hotspot" routines, code regions, and loops for detailed analysis later on
  • Time spent in (MPI) communication vs. computation
    • Large MPI time fraction? Explore algorithmic changes to reduce communication and / or unnecessary synchronization, improve load balance, or implement non-blocking communication techniques
  • Time spent in I/O operations
    • Large I/O time fraction? Reassess access patterns and the model of access (prefer high-level libraries and optimizations in middleware)
  • Per-thread memory usage (as a function of time, as well as high-water marks)
    • Can help in planning how to use Cori's high-bandwidth on-package memory
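
For the memory high-water mark in particular, a rough first estimate can be obtained without a dedicated tool. The sketch below uses the POSIX getrusage call to read the peak resident set size. Note the assumptions: this value is per process rather than per thread, its units differ between platforms (kilobytes on Linux, bytes on macOS), the 4096-byte page stride is a guess at the page size, and the helper names are hypothetical. It gives no information about usage as a function of time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Peak resident set size of the whole process, in KiB (-1 on failure). */
long peak_rss_kib(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
#ifdef __APPLE__
    return ru.ru_maxrss / 1024;  /* macOS reports bytes */
#else
    return ru.ru_maxrss;         /* Linux reports kilobytes */
#endif
}

/* Allocate `bytes` and write one byte per assumed 4096-byte page so the
 * pages become resident. Returns the number of pages touched, or -1 if
 * the allocation fails. The volatile pointer keeps the compiler from
 * optimizing the writes (and the allocation) away. */
long touch_memory(size_t bytes)
{
    unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return -1;
    volatile unsigned char *p = buf;
    long pages = 0;
    for (size_t i = 0; i < bytes; i += 4096) {
        p[i] = 1;
        pages++;
    }
    free(buf);
    return pages;
}

/* Growth of the high-water mark (KiB) caused by touching `bytes`. */
long high_water_growth_kib(size_t bytes)
{
    long before = peak_rss_kib();
    (void)touch_memory(bytes);
    long after = peak_rss_kib();
    assert(after >= before);  /* a peak value never decreases */
    return after - before;
}
```

Sampling peak_rss_kib() at a few points in a run (for example after setup and after the main loop) indicates which phase sets the high-water mark, which is the kind of information needed when deciding what to place in high-bandwidth memory.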

On-node performance

In this next step, the goal is to focus on critical "hotspot" routines discovered through general application profiling and collect performance-relevant low-level metrics. Additional preparation may be required, in that hotspot routines may need to be extracted from the full application as "kernels" and appropriate driver codes written to feed realistic data to them.

  • Memory bandwidth
    • Is the code memory bandwidth bound? If so, is this intrinsic to the algorithm used or can it be mitigated with locality improvements? (see next)
  • Cache and TLB miss rates
    • High miss rates? Examine classic locality improvement techniques: cache blocking, reordering of loops for stride-1 access
  • Code vectorization
    • Is the code vectorized effectively by the compiler? If not, examine compiler vectorizer logs to determine why, then implement code changes to enable vectorization (remove conditional loop exit or cycle statements, remove unnecessary loop dependencies, add directives, etc.)
  • FLOP rates and cycles-per-instruction (CPI)
    • Low FLOP rates and / or high CPI? Examine both vectorization and locality improvements
  • Detailed OpenMP performance analysis (or other supported threading model)
    • Poor thread scaling? Look for load imbalance, contention for shared resources, and overhead from repeated thread team fork / join
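
To illustrate the locality techniques named above (loop reordering for stride-1 access and cache blocking), here is a small C sketch built around a matrix multiply kernel. It is not taken from any NERSC code: the function names, the row-major square layout, and the block size are all assumptions chosen for illustration. Whether the reordered or blocked variant is actually faster depends on the compiler, problem size, and cache sizes, which is exactly what profiling should confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Baseline: the inner k loop reads b with stride n (row-major storage). */
void matmul_ijk(size_t n, const double *restrict a,
                const double *restrict b, double *restrict c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Reordered: the inner j loop is stride-1 in both b and c. */
void matmul_ikj(size_t n, const double *restrict a,
                const double *restrict b, double *restrict c)
{
    memset(c, 0, n * n * sizeof *c);  /* all-bits-zero is 0.0 in IEEE 754 */
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked: works on bs-by-bs tiles so each tile can be reused from cache. */
void matmul_blocked(size_t n, size_t bs, const double *restrict a,
                    const double *restrict b, double *restrict c)
{
    assert(bs > 0);
    memset(c, 0, n * n * sizeof *c);
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs)
                for (size_t i = ii; i < min_size(ii + bs, n); i++)
                    for (size_t k = kk; k < min_size(kk + bs, n); k++) {
                        const double aik = a[i * n + k];
                        for (size_t j = jj; j < min_size(jj + bs, n); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}

/* Fill with small integer values so all products and sums are exact. */
static void fill_pattern(size_t len, double *x, size_t mult)
{
    for (size_t i = 0; i < len; i++)
        x[i] = (double)((i * mult + 1) % 7);
}

/* Returns 1 if all three variants produce identical results, else 0. */
int variants_agree(size_t n, size_t bs)
{
    const size_t len = n * n;
    double *a = malloc(len * sizeof *a), *b = malloc(len * sizeof *b);
    double *c0 = malloc(len * sizeof *c0), *c1 = malloc(len * sizeof *c1);
    double *c2 = malloc(len * sizeof *c2);
    assert(a && b && c0 && c1 && c2);

    fill_pattern(len, a, 3);
    fill_pattern(len, b, 5);
    matmul_ijk(n, a, b, c0);
    matmul_ikj(n, a, b, c1);
    matmul_blocked(n, bs, a, b, c2);

    int ok = 1;
    for (size_t i = 0; i < len; i++)
        if (c0[i] != c1[i] || c0[i] != c2[i])
            ok = 0;

    free(a); free(b); free(c0); free(c1); free(c2);
    return ok;
}
```

A kernel extracted this way, with a driver such as variants_agree to check correctness, is also a convenient unit to run under a profiler when comparing cache miss rates or vectorization reports between variants.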

Tools Available at NERSC

NERSC provides a number of tools for application profiling on our systems, along with extensive documentation intended to help you get started. For the type of general application profiling discussed above, we recommend using either CrayPat or Allinea MAP, which are both available on Edison and Cori. For deeper examination of on-node performance, we recommend using Intel VTune.

Summary

Performance analysis is neither an end goal, nor a single isolated step in preparing your application for Cori. Instead, the goal is observable performance improvement and it is realized through repeated iterations of the profile-optimize-profile cycle. As such, the tools you use to profile certain aspects of your application will likely change as the code itself evolves and your focus narrows to address lower-level optimization opportunities.

Further, it is likely that the code changes that you implement in preparation for Intel Xeon Phi on Cori will also benefit performance on present-day multicore architectures (see, for example, these application-readiness case studies).

If you have questions about the performance analysis tools available at NERSC, please contact the HPC Consultants.