HMMER3 Case Study
HMMER3  is a bioinformatics application used to quickly identify protein sequences which are statistically similar to a desired pattern. This functionality is used at scale in genome sequencing pipelines to annotate the content of new genomes; the Joint Genome Institute uses 7 million CPU hours annually to run HMMER3 applications.
Under the hood, HMMER3 uses dynamic programming to solve a sequence alignment problem in a similar manner to the Smith-Waterman algorithm. A Hidden Markov Model (HMM) is constructed to describe an interesting protein sequence feature, and each candidate sequence is tested against this model up to the scale of millions of sequences searched against tens of thousands of HMMs.
HMMER3 uses a pipeline of filters to avoid computing the expensive full size dynamic programming table whenever possible. Initially, each sequence is processed with the Single Segment Viterbi filter, which only considers high scoring diagonals in the DP table. When the SSV score meets a statistical threshold the sequence is passed to the more complex Viterbi filter which considers a wider range of possible sequence variations. A sequence that passes the Viterbi filter is processed by the Forward-Backward algorithm which builds the full DP table in maximum detail before generating the final output.
HMMER3 is heavily optimized for the personal computing hardware of 5 years ago. Data access is organized such that performance is sacrificed to keep memory usage low. All pipeline filters use SSE intrinsic instructions to directly implement dynamic programming with a complex striping pattern. Threading is implemented with pthreads using a custom queue structure to dispatch sequence data to threads for processing. An MPI implementation is provided which does not support threading but decomposes data with the same pattern as the pthread queue.
Initial Performance of HMMER3
Many aspects of HMMER3 are unique in the context of NESAP applications, making standard analysis techniques more difficult or less informative. The dominant HMMER3 kernel is located in the SSV filter; it consists of a 128 bit vector memory read, an SSE byte subtraction, and an SSE byte maximum. Performance models which depend on a measurement of FLOPS such as the roofline model aren’t directly applicable. Data movement in HMMER3 follows a distinct pattern: instead of iterative computation on a large working set such that memory bandwidth is a dominant concern, HMMER3 sits in a tiny memory footprint and never reuses data after it has been examined. The usage of HMMER3 is also unusual; future scaling will manifest as the need to process many more genomes, but the size of individual data sets will not change significantly. As a result there is no user demand for an MPI version of HMMER3 to scale one problem to multiple nodes.
To determine a performance baseline, hmmsearch provided by the HMMER3.1b2 release with --cpu 3 was used to search the latest release of the uniprot-swissprot protein sequence database  against the Pfam 29.0 database of functional protein domains . This job was run on the Genepool cluster operated by NERSC for JGI and scheduled to execute on 4 cores of a Xeon E5-2670 processor. Total wall time for the search was 8 hours and 13 seconds. This particular task was chosen to approximate the following description of the typical JGI usage of HMMER: “It [number of input sequences] ranges from a few hundred or low thousands to several millions [to be searched against Pfam].”
This same test case was run on a 68 core KNL node in quad-flat memory mode with --cpu 67 (67 worker threads plus 1 master thread). Computation did not finish within the 4 hour time limit but processing for 2599 out of 16295 models was completed. Naive extrapolation suggests full solution would run for 25 hours.
The critical bottleneck of HMMER3 is the failure to scale when using additional threads:
Vtune was used to examine the cause of the poor thread scaling in HMMER3, though initial tests all yielded uninformative results. Memory bandwidth was tested by reducing the processor clock speed; wall time changed in a directly proportional manner suggesting a CPU bound and no memory bandwidth limitations. The possibility of an I/O bottleneck was investigated by using staged input files on the SSD Burst Buffer or tmp filesystem in RAM, but these variations also produced no significant wall time differences.
The source of the bottleneck was identified using the following observation:
25% of executed instructions occur during input sequence parsing functions. This is a direct consequence of the data access pattern in the hmmsearch top level application; though only a small overhead is needed to parse and error check a sequence read from disk, all such work is discarded and repeated for each additional model which slows the rate that new sequences can be added to the thread dispatching queue… to the point that only 4 threads can be supplied with data and any additional threads will find the work queue empty.
The first major modification was to change data access such that input data is not discarded but retained in memory. Buffers now store sequences and models such that each pass through the sequence file provides the data used to search multiple models. Thus, a new tradeoff is available by using a modest amount of additional memory to reduce the number of times the full sequence database file must be read and parsed.
With the expense of a few extra gigabytes of RAM usage, the amount of CPU used to parse and error check sequence file data can be reduced by a factor of 20 or more.
The second major change to the HMMER3 application was to replace all pthread code with an OpenMP implementation utilizing task directives. At the top level of organization, HMMER3 control flow is directed by file access states, which is an excellent match for the use of OpenMP tasking and was implemented as follows:
Two buffers each containing model or sequence data are created (affectionately referred to as “flip” and “flop”). The “flop” buffers are filled and emptied (disk access) via an OpenMP task while worker tasks are computing searches using data in the “flip” buffers. A taskwait synchronizes progress before the sequence flip and flop buffers exchange. This swap moves finished sequences into position to write output and the next block of unprocessed work into place. After a full scan through the sequence database, the hmm buffers are swapped in a similar manner and the sequence file is rewound. In this way I/O happens in parallel with worker processing, workers do not need to wait for disk access as their data has been loaded in advance, and if file access thread finishes before processing is complete it will automatically convert itself into a worker thread. This implementation has an additional performance gain because with appropriate buffer sizes it must synchronize threads much less frequently.
HMMER3 is a very conditional code where the distribution of search hits makes work balancing impossible to predict in advance. An OpenMP taskgroup was used to implement a work stealing policy and increase utilization. All worker tasks are issued within an omp taskgroup pragma while the number of outstanding tasks is tracked; when that number falls below the number of threads then workers with a sufficient amount of unprocessed sequences will create a new task and pass half their remaining work to it.
After modification, HMMER3 is significantly faster and better able to utilize NERSC resources:
The production scale search described earlier (that HMMER3.1b2 completes in 8 hours) executes using the current OpenMP HMMER3 in 27 minutes.
The best wall time achieved running the production test on Knights Landing hardware is 38 minutes. Disappointingly, though OpenMP HMMER3 outperforms HMMER3.1b2 when run on KNL by a factor of 40x, performance has not reached parity with Haswell for a number of reasons. The striped encoding used to directly vectorize dynamic programming tables implemented in HMMER3 is completely invulnerable to compiler auto-vectorization. Any modification would need to be done at the granularity of vector intrinsic instructions. Two filters in HMMER3 use SSE byte and word type instructions to increase throughput; because KNL hardware does not support the AVX-512 BW instruction set, these filters cannot be directly ported in a way that both fully utilizes the hardware and is true to the form of the original algorithm. Tight coupling of data structures between pipeline filters, HMM profile encoding, and hit processing, make an upgrade to even the available AVX2 instruction set a labor intensive all-or-nothing proposition. Finally, the primary HMMER development team has already solved these vectorization concerns; any further work in this area will become obsolete with the eventual release of HMMER4.
 The Pfam protein families database: towards a more sustainable futureR.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, G.A. Salazar, J. Tate, A. Bateman: Nucleic Acids Research (2016) Database Issue 44:D279-D285