Researchers Use Edison to Improve Performance, Energy Efficiency of Bioinformatics Application
September 13, 2016
Contact: Jon Bashor, firstname.lastname@example.org, 510-486-5849
A team of computer scientists and geneticists from Iowa State University, the University of Maryland and the University of Arkansas has demonstrated significant speedups of the epiSNP bioinformatics program using the Edison supercomputer at the Department of Energy’s National Energy Research Scientific Computing Center. The team used the application to perform pairwise comparisons of genotypes and how they affect traits such as height, weight or makeup of body tissues.
Such pairwise comparisons are typically “computationally expensive,” consuming both additional computing resources and the extra electricity needed to power those resources. The more combinations that are calculated, the higher the computational cost. Finding out how to speed up the computation lowers these costs. To achieve their results, the scientists used Coarray Fortran, which allowed them to write parallel programs using the Partitioned Global Address Space (PGAS) programming model.
The researchers published their results in the July 12, 2016 issue of the International Journal of High Performance Computing Applications.
“We think this may be the first published use of Coarray Fortran in the field of bioinformatics and one of the first in the field to use a PGAS language,” said Nathan Weeks, who led the project as part of his research while pursuing a Ph.D. in computer science at Iowa State.
According to co-author James Reecy of the Department of Animal Science, Iowa State University, gaining a better understanding of how pieces of an individual’s genome affect its traits can make biology more predictive. In humans, this can help determine predisposition to disease. In crops like wheat, it could lead to greater production using less land or water. In this case, the researchers were studying beneficial fatty acids in Angus-sired beef cattle to inform breeding decisions that could make meat healthier.
This research starts with single-base variations in individual genomes, called Single Nucleotide Polymorphisms (SNPs). One person, plant or animal might have an "A" at a certain spot in its genome, while another individual might have a "T" at that same spot. Traits that are expressed as a number, such as height, weight or amount of fatty acids, are known as quantitative traits.
Sometimes a single SNP is associated with an observable trait. However, sometimes a trait isn't determined by a single SNP, but by a combination of multiple SNPs, a condition called epistasis. The method of statistically associating SNPs with traits is called a Genome-Wide Association Study (GWAS).
epiSNP is a program for identifying SNP epistasis in quantitative-trait GWAS. A parallel MPI version (EPISNPmpi) was created in 2008 to address this computationally expensive analysis on large data sets with many individuals, quantitative traits, and SNP markers, and the team initially used this version in their research. However, detecting epistasis in quantitative-trait GWAS becomes much more computationally expensive as more SNP combinations are investigated. This computational complexity is the main bottleneck for epistasis testing in large-scale GWAS.
But the resulting information has real-world applications such as determining what genetic factors contribute to agriculturally important traits.
“If we think about how we are going to feed the world, we need to improve the efficiency of agriculture,” said co-author James Koltes of the Department of Animal Science, University of Arkansas. “But we also want to provide the best possible health effects, which are affected by a lot of genetic interactions.”
While the data analysis is computationally expensive, “the falling cost of genotyping has led to an explosion of large-scale GWAS data sets that challenge EPISNPmpi’s ability to compute results in a reasonable amount of time,” the team wrote in their paper. “Therefore, we optimized epiSNP for modern multi-core and highly parallel many-core processors to efficiently handle these large data sets.”
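The scaling pressure described here is easy to make concrete: a pairwise scan over n SNPs covers n(n-1)/2 distinct pairs, so the work grows roughly with the square of the number of markers. The short Python sketch below only counts pairs; the marker counts are illustrative assumptions rather than figures from the paper, and `pairwise_tests` is a hypothetical helper name.

```python
from math import comb

def pairwise_tests(num_snps: int) -> int:
    """Return the number of distinct unordered SNP pairs among num_snps markers."""
    return comb(num_snps, 2)

# Marker counts are illustrative assumptions, not data set sizes from the study.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} SNPs -> {pairwise_tests(n):,} pairs")
```

Ten times as many markers means roughly a hundred times as many pairs to test, which is why larger genotyping data sets quickly outgrow a fixed amount of computing time.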
The team first ran their application on Stampede, a Dell supercomputer at the Texas Advanced Computing Center at the University of Texas at Austin. By applying a number of serial and parallel optimizations so that epiSNP could handle larger GWAS data sets, the team achieved a 38.43× speedup over EPISNPmpi on 126 nodes of Stampede. But several considerations led them to request an allocation to run on Edison, a Cray supercomputer at NERSC.
First, it was Cray that originally developed the coarray technology for Fortran, and Weeks said that Cray systems provide the most robust implementation. Coarrays allowed the team to easily move from static load balancing, which attempts to assign each processor the same amount of work at the beginning, to dynamic load balancing, which assigns work to processors as the program executes, distributing the workload more efficiently to speed up the time to solution and reduce the computing costs.
But the code didn’t immediately run as well as expected.
“We ran into a performance issue using coarrays, which the NERSC staff diagnosed,” Weeks said. “The problem occurred only for the data type we used to represent the SNP genotypes. The staff not only characterized the communication performance behavior and notified Cray, but also identified a work-around that resolved the performance issue. NERSC has very good support staff and this is one of the key factors for our success, as far as I’m concerned.”
One discovery made by using NERSC resources is that epiSNP responds very well to Hyper-Threading, which allows two tasks to execute concurrently on a single Intel processor core. While Hyper-Threading can improve the performance of some applications, it can actually hinder the performance of others. Consequently, Hyper-Threading is often disabled on high-performance computing systems, including Stampede. However, the Cray environment on Edison gives users the flexibility to determine whether or not they want to use Hyper-Threading when their application is run.
“One of our goals was to demonstrate the suitability of PGAS languages for problems with this computational pattern,” Weeks wrote. “We showed that the Coarray version performs competitively with the MPI version on Edison and the corresponding code was much more readable. The new Knights Landing generation of the Intel Xeon Phi has special hardware instructions that should speed up the main computational bottleneck in epiSNP, so we’re looking forward to running this on the Knights Landing processors newly installed in NERSC’s Cori supercomputer.”
Weeks added that the team originally used OpenMP to do dynamic load balancing within a node, but using the PGAS model they were able to implement dynamic load balancing between nodes as well in a relatively straightforward manner, keeping OpenMP for intra-node load balancing.
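The difference between static and dynamic load balancing can be illustrated with a small simulation. The Python sketch below is not the team's Coarray Fortran or OpenMP code. It assumes, purely for illustration, that the work is organized as rows of a pair triangle, where row i holds the pairs between SNP i and every later SNP, so that rows get steadily cheaper. The function names and the tiny problem size are hypothetical.

```python
import heapq

def row_costs(num_snps):
    """Cost of row i = number of pairs (i, j) with j > i (an assumed work layout)."""
    return [num_snps - 1 - i for i in range(num_snps)]

def static_makespan(costs, workers):
    """Static balancing: hand each worker an equal-sized contiguous block up front."""
    block = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))

def dynamic_makespan(costs, workers):
    """Dynamic balancing: the next task goes to whichever worker frees up first."""
    finish_times = [0] * workers  # min-heap of per-worker finish times
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)

costs = row_costs(8)  # [7, 6, 5, 4, 3, 2, 1, 0]
print("static :", static_makespan(costs, 4))   # 13
print("dynamic:", dynamic_makespan(costs, 4))  # 7
```

With equal-sized blocks, the worker holding the two most expensive rows finishes last at 13 time units, while handing out rows on demand finishes in 7, the ideal for 28 units of work on four workers. In the real application, per the article, this kind of on-demand assignment was handled by OpenMP within a node and by coarrays between nodes.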
Other co-authors are Glenn R. Luecke, Brandon M. Groth, Marina Kraeva, Luke M. Kramer, and James Reecy, all of Iowa State University; James Koltes of the University of Arkansas; and Li Ma of the University of Maryland.
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 6,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. DOE Office of Science.