Big performance gains at NERSC hack-a-thon

February 27, 2015 by Jack Deslippe & Richard Gerber

At this year's NERSC user group meeting we tried something new: a code optimization "hack-a-thon." Thanks to HPC experts from NERSC and Intel and about 20 enthusiastic code developers, it was a great success! Everyone who showed up at NERSC's Oakland Scientific Facility had fun, learned some new things, and got some big performance improvements in their code.

NERSC's Jack Deslippe was one of the lead coordinators and wrote this very nice summary:

NERSC Hack-a-Thon a Big Success

By Jack Deslippe
NERSC HPC Consulting Group

Yesterday’s hack-a-thon was a big success (much more so than I was expecting), and I personally had a lot of fun.

I want to thank everyone involved, and, in particular, Zhengji (Zhao) for tirelessly fighting to get VTune working on Edison and succeeding in getting everything up and running not a day too soon! Without this, the hack-a-thon would have been very different, and not in a good way. When we asked at the end how many people would use VTune on their own in the future, I believe all attendees raised their hands.

I also want to thank Scott (French) and Woo-Sun (Yang) for putting together and delivering great tools introductions, Rebecca (Hartman-Baker) for putting together the VH1 code, Helen (He) for organizing Babbage (NERSC KNC testbed) access, and Richard (Gerber) for planning/suggesting the event and providing lunch when we were running late. Also, big thanks to Intel for sending staff to help out with the hack-a-thon.

Here is a summary of a few particular outcomes I captured:

Case Study #1

We achieved > 30% speedup on real code. The user used VTune to identify a hot loop in his code (and also to determine that he is not memory-bandwidth bound). The loop contained a large number of flops and instructions but was not being vectorized due to a loop-carried dependence. It had the following form:

for (many iterations) {
    … many flops …
    et = exp(outcome1)
    tt = pow(outcome2, 3)
    IN = IN * et + tt
}

Scott and I helped him restructure the code so that the flop-intensive loop no longer contains the dependence on the variable IN. We did this by storing the per-iteration values in temporary arrays et(:) and tt(:), and then quickly computing the value of IN in a separate loop. In this way, the flop-heavy loop can vectorize. This change sped up his entire application run by > 30%.

Case Study #2

We worked with Intel for most of the day on a real application brought by a user. By the end of the day, they had achieved a 2X improvement in code runtime (in serial, at least) by identifying and refactoring hotspots in the code. A significant part of the improvement appears to have come from identifying and removing unnecessary initialization of several large arrays, some of which were never touched afterward.
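As a generic illustration of what "unnecessary initialization" can look like (the application's actual code is not shown here, so the function names and the array contents below are hypothetical), consider a buffer that is zero-filled and then completely overwritten:

```c
#include <stdlib.h>
#include <string.h>

/* Redundant version: zero-fills the array, then overwrites every element. */
double *fill_squares_redundant(size_t n) {
    double *v = malloc(n * sizeof *v);
    if (!v) return NULL;
    memset(v, 0, n * sizeof *v);            /* extra pass over memory, never read */
    for (size_t i = 0; i < n; i++)
        v[i] = (double)i * (double)i;
    return v;
}

/* Leaner version: every element is written exactly once. */
double *fill_squares(size_t n) {
    double *v = malloc(n * sizeof *v);
    if (!v) return NULL;
    for (size_t i = 0; i < n; i++)
        v[i] = (double)i * (double)i;
    return v;
}
```

For large arrays, the extra pass costs memory bandwidth without changing the result. The saving is larger still when an initialized array is never used at all.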

Case Study #3

We used VTune’s advanced hotspot feature on a user's real application to identify its hottest loop. He quickly discovered that the loop contained unnecessary branches: if statements that are true only in the first and last iterations of the loop. Removing these gave a quick 10% improvement in overall code performance.
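One common way to remove such branches is to "peel" the first and last iterations out of the loop. The sketch below shows the idea on a made-up three-point smoothing loop. The user's real loop is not shown in this post, so the function names and the computation are assumptions chosen only for illustration.

```c
#include <stddef.h>

/* Branchy version: the boundary tests run on every iteration, although
 * they are true only for i == 0 and i == n - 1. */
void smooth_branchy(const double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double left  = (i == 0)     ? x[i] : x[i - 1];
        double right = (i == n - 1) ? x[i] : x[i + 1];
        y[i] = (left + x[i] + right) / 3.0;
    }
}

/* Peeled version: the first and last elements are handled outside the loop,
 * so the loop body is branch-free. */
void smooth_peeled(const double *x, double *y, size_t n) {
    if (n == 0) return;
    if (n == 1) { y[0] = (x[0] + x[0] + x[0]) / 3.0; return; }

    y[0] = (x[0] + x[0] + x[1]) / 3.0;                  /* first iteration */
    for (size_t i = 1; i + 1 < n; i++)                  /* interior points */
        y[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0;
    y[n - 1] = (x[n - 2] + x[n - 1] + x[n - 1]) / 3.0;  /* last iteration */
}
```

The peeled version needs explicit handling for very short inputs, which is the price of taking the special cases out of the hot loop.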

Canned kernels

Some participants took part in the hack-a-thon competition using our canned kernels. Balint Joo from Jefferson National Accelerator Laboratory was the winner. Balint identified the hotspot in the unoptimized bgw.f90 kernel, added OpenMP, and used VTune’s bandwidth collection capability to determine that the source of imperfect OpenMP scaling was poor memory locality. He then reordered loops (which involves a bit of book-keeping) and eventually got a roughly 12X speedup over the original (non-threaded) code in a short time. His final code is about 2X-3X slower than code that I spent much more time optimizing (meaning he got *most* of the way to the answer). At the end of the day, Balint said it was unusual for him to work on code using tools, but he seemed to enjoy the experience and his new-found knowledge of VTune. Other folks looking at the kernel were able to identify the hotspot and add OpenMP. The canned code is available on GitHub at https://github.com/NERSC/training.git
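The effect of loop order on memory locality can be illustrated with a small sketch in C. This is not the bgw.f90 kernel, and the function names are hypothetical. It assumes a matrix stored row-major and simply sums the squares of its elements, once with a strided inner loop and once with a unit-stride inner loop. The OpenMP pragmas take effect only when the code is compiled with OpenMP enabled.

```c
#include <stddef.h>

/* `a` holds an n_rows x n_cols matrix in row-major order: a[i * n_cols + j]. */

/* Column-first traversal: each inner-loop step jumps n_cols elements,
 * so consecutive accesses land on different cache lines. */
double sum_sq_strided(const double *a, long n_rows, long n_cols) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long j = 0; j < n_cols; j++)
        for (long i = 0; i < n_rows; i++)
            total += a[i * n_cols + j] * a[i * n_cols + j];
    return total;
}

/* Reordered traversal: the inner loop walks memory contiguously. */
double sum_sq_contiguous(const double *a, long n_rows, long n_cols) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n_rows; i++)
        for (long j = 0; j < n_cols; j++)
            total += a[i * n_cols + j] * a[i * n_cols + j];
    return total;
}
```

Both functions compute the same sum, up to floating-point rounding from the different summation order. In a real kernel the reordering usually involves more book-keeping than this, because the loop body often depends on the order in which indices are visited.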

Thanks again to everyone who contributed!

-Jack