Time to Start Getting Ready for Cori
February 4, 2015 by Richard Gerber
Cori is coming and it’s time to start getting ready. Yes, NERSC’s Intel Xeon Phi-based system is still more than a year away, but if you’re not already thinking about how you’re going to use it, you need to get started. That’s because getting your codes to run well (or maybe at all) on NERSC’s first “many-core” system is going to take more than a simple recompile.
It's no surprise that NERSC is getting a system like Cori; the HPC community has known for years what was coming. Driven by the limits of physics and technology, as well as the cost of power and cooling, future HPC systems are going to get most of their processing power from energy-efficient many-core processors like GPUs and Intel Xeon Phis. These chips contain tens to hundreds of relatively slow processing cores, meaning that performance gains are only going to be achieved by codes that can use many of these cores in parallel effectively.
Many-core systems already exist. The Titan supercomputer at the Oak Ridge Leadership Computing Facility has nodes that contain “traditional heavy-weight processors” with attached NVIDIA GPU “accelerators.” Similarly, the “Stampede” system at the Texas Advanced Computing Center has traditional CPUs accelerated by Intel Xeon Phi “Knights Corner” coprocessors. On both these systems, codes execute on the traditional “host” processor, then “offload” computationally intensive routines to the accelerator. Some codes perform well under this paradigm, but others don’t. For those that don’t, it’s because they don’t have enough fine-grained parallelism to keep all the cores busy and/or they have to frequently move data structures from the host processor to the accelerator, which is a very slow process. If the amount of accelerated computing is not enough to overcome the penalty of moving the data, then you’re dead, performance-wise. And even for those that do run well, it took a lot of effort to refactor the code to get performance.
And that’s where the problem lies, at least from an application developer’s point of view.
Programming for Cori will be a little different. Each Cori node will have one Intel Xeon Phi “Knights Landing” (KNL) processor, running in a “self-hosted” mode, meaning that there will be no host/traditional processor. Everything, including the operating system, will run on KNL. So while you won’t have to worry about moving data to a coprocessor, you'll still have to worry about data locality (see below) and about finding and expressing enough fine-grained parallelism to keep the 60+ KNL cores busy.
“Data locality” is something you’re going to be hearing a lot about. That’s partly because each KNL processor will have up to 16 GB of “on-package” or “High Bandwidth Memory” (HBM), which has extremely high bandwidth and is potentially very good for performance. However, if not all of your data structures fit in the HBM, you will have to use some of each node’s traditional DRAM, and there will be a price to pay to bring data from there to the compute units. So you’ll want to do as much computing as possible using data that resides in the HBM.
At the beginning of this piece, I said that you needed to start getting ready now. So, what can you do? First off, start thinking about how your code could be threaded and how each thread might be able to minimize accessing data that won’t fit in a compact data layout.
If you’re just getting started, the easiest way to proceed is to keep the high-level coarse-grained parallelism you’re already expressing with MPI and try using OpenMP to thread (parallelize) compute-intensive loops in your code. At the same time, do what you can to make those loops vectorize, something I haven’t mentioned yet, but that you should read about on this page.
Try the following on Edison: find out where your code spends a lot of its time (see Profiling Your Application). Then extract one of your compute-heavy loops, put in OpenMP directives, run using 1 OpenMP thread and measure how long it takes. Then see if you can get it to run faster with 2 OpenMP threads, or 4, or 8, etc. If you’re not having any success, maybe your loops are too short and each thread isn’t getting enough work. Maybe the loop is accessing array indices (memory locations) that are widely scattered and that’s killing your performance. Or maybe it’s something else. In any case, NERSC’s consultants are here to help, so don’t hesitate to contact them for assistance.
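To make that experiment concrete, here is a minimal sketch in C of the kind of test harness you might build around an extracted loop. The kernel (a simple y = a*x + y update) and the names axpy, wall_seconds and time_axpy are made up for illustration; they are stand-ins for whatever loop dominates your own profile, not anything taken from a NERSC code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for a compute-heavy loop pulled out of a real code:
   y[i] = a * x[i] + y[i], with the iterations shared among OpenMP threads. */
void axpy(size_t n, double a, const double *x, double *y)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Wall-clock seconds when OpenMP is enabled; falls back to CPU time
   from clock() when the code is compiled without OpenMP. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Run the kernel `repeats` times on arrays of length n and return the
   elapsed seconds, or -1.0 if the arrays could not be allocated. */
double time_axpy(size_t n, int repeats)
{
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    if (!x || !y) {
        free(x);
        free(y);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    double t0 = wall_seconds();
    for (int r = 0; r < repeats; r++) {
        axpy(n, 0.5, x, y);
    }
    double t1 = wall_seconds();

    free(x);
    free(y);
    return t1 - t0;
}
```

Call time_axpy from a small driver, build it with your compiler's OpenMP option (for example -fopenmp with GCC), and run it with the standard OMP_NUM_THREADS environment variable set to 1, 2, 4, 8 and so on, comparing the times. A loop this simple is limited by memory bandwidth, so don't be surprised if the speedup levels off well before you run out of cores; that is the kind of behavior this experiment is meant to expose.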
If you are more advanced and already have OpenMP working well in your code, work on making sure your loops vectorize. Compiler reports can help. Try using
with the Intel compiler. Think about your data structures and how the threaded portions of your code are using them. Are your inner loops operating on contiguous regions of memory (good) or dealing with data scattered all over memory (bad)? Are your inner loops working on the columns of multidimensional arrays (good in Fortran, bad in C)? Do you have arrays of structures (bad) or structures of arrays (good)? Is there anything you can do about how you are accessing diverse memory locations? You might have to refactor the data layout in your code to make it perform well. While this might sound painful, our experience has been that when you put in the effort, your code will run faster on Edison today. Often, a LOT faster.
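If the layout questions above sound abstract, the following C sketch shows what they look like in code. The particle types and function names are hypothetical, chosen only to illustrate the two layouts and the loop order that suits C's row-major arrays.

```c
#include <stddef.h>

/* Array of structures (hypothetical type): the x, y and z of one particle
   sit next to each other, so a loop that only needs x walks through memory
   in strides of sizeof(struct particle_aos). */
struct particle_aos {
    double x, y, z;
};

/* Structure of arrays (hypothetical type): all x values are contiguous,
   and likewise all y and all z values. */
struct particles_soa {
    double *x;
    double *y;
    double *z;
};

/* Strided access: only one double in every three loaded is used. */
double sum_x_aos(const struct particle_aos *p, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += p[i].x;
    }
    return s;
}

/* Unit-stride access: every double loaded is used. */
double sum_x_soa(const struct particles_soa *p, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += p->x[i];
    }
    return s;
}

/* C stores a 2D array row by row, so the inner loop runs over the last
   (column) index to touch consecutive memory. In Fortran, which stores
   arrays column by column, the loop nest would be the other way around. */
double sum_row_major(size_t rows, size_t cols, const double *a)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            s += a[i * cols + j];
        }
    }
    return s;
}
```

Both sum_x functions return the same answer; what differs is how much memory traffic it takes to get there and how easily the compiler can vectorize the loop. How much that matters for your code is something only a measurement can tell you.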
If you’re interested in seeing how this might work, you could take part in NERSC’s first “hack-a-thon,” which we’re holding on February 25th at our Oakland Scientific Facility. You can bring your own code loop or small kernel, or use one of ours, to try out these concepts in real time with NERSC consultants on hand to advise you. You can register for the hack-a-thon. (We’ll have a remote broadcast set up so you can work remotely and chat questions to the consultants, but I don’t know how effective that is going to be for people online.)
I think this is enough to get you started pondering what you might have to do to get your code ready to run on Cori. There are lots of possibilities I haven’t mentioned, like libraries, new languages, extensions to existing languages, etc. You’ll hear more about these over the coming months and years. But for now, getting a code working well on Edison using MPI for high-level parallelism and OpenMP for fine-grained parallelism over vectorized loops, while accessing memory in contiguous chunks, is an excellent way to start preparing for Cori.