Large-Scale PCA for Climate
The most widely used tool for extracting important patterns from measurements of atmospheric and oceanic variables is the Empirical Orthogonal Function (EOF) technique. EOFs are popular because of their simplicity and their ability to reduce large, nonlinear, high-dimensional systems to a few dimensions while preserving the most important patterns of variation in the measurements. Because EOFs are a particular instance of the classical Principal Component Analysis (PCA) matrix decomposition, mature, numerically stable algorithms exist for computing them on datasets that fit on one machine.
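As a concrete illustration of the EOF/PCA connection (a minimal single-machine sketch on synthetic data, not the tool described below), the EOFs of a small anomaly matrix can be read directly off its SVD:

```python
import numpy as np

# Toy anomaly matrix: 5 grid points (rows) by 100 time steps (columns).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 100))
A -= A.mean(axis=1, keepdims=True)  # remove each grid point's time mean

# The spatial EOFs are the left singular vectors of the centered matrix;
# the temporal principal components are the right singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
variance_fraction = s**2 / np.sum(s**2)  # variance explained by each mode
```

The singular values immediately give the fraction of variance each mode explains, which is how dominant EOF modes are ranked in practice.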
Climatologists are comfortable using tools like Matlab, R, and IDL that are suitable for analyzing small datasets on the order of several gigabytes, but these do not scale to the terabyte range. Traditional HPC solutions for working with large-scale matrices (PARPACK, ScaLAPACK, and the like) are not readily accessible to climatologists because of the time and personnel investment required to deploy them. As a consequence, climate EOFs have typically been calculated for surface fields or other 2-D slices, owing to the restrictions on the size of matrices that popular tools can handle. However, the EOF modes extracted from such data contain only the projections onto the surface of complex 3-D patterns involving deeper layers of the ocean. A better understanding of the dynamics of large-scale modes of climate variability may be gained from full 3-D EOFs that include many layers of the ocean, the atmosphere, or both.
We implemented a large-scale PCA algorithm in the Apache Spark environment and applied it to the analysis of ocean temperature fields from the CFSR product. By providing our solution within the widely used and quite intuitive Spark framework, we offer a more approachable alternative to traditional HPC solutions for the large-scale analyses required by climatologists. The input is expected in the form of an HDF5 file, and the EOFs are returned in one of two formats: a native Spark RDD suitable for further processing within the Spark ecosystem, or an HDF5 file. Standard preprocessing options are available to users, including reweighting observations by their latitude or depth.
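The latitude reweighting can be sketched as follows (a hypothetical helper illustrating the standard square-root-of-cosine area-weighting convention; the tool's exact interface may differ):

```python
import numpy as np

def latitude_weight(rows, lats_deg):
    """Scale each row (grid point) of the data matrix by sqrt(cos(lat)),
    so that variance is weighted by grid-cell area rather than biased
    toward the densely gridded high latitudes.

    rows: (n_points, n_times) array; lats_deg: latitude of each row."""
    w = np.sqrt(np.cos(np.deg2rad(lats_deg)))
    return rows * w[:, None]

# Example: equatorial rows keep full weight; polar rows shrink toward zero.
data = np.ones((3, 4))
weighted = latitude_weight(data, np.array([0.0, 60.0, 90.0]))
```

The square root is applied to the rows so that the variance (which is quadratic in the data) ends up weighted by cos(latitude), proportional to grid-cell area.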
The climate field to be analyzed is provided in the form of a tall rectangular matrix A, whose rows correspond to the observation grid points and whose columns correspond to the observation time intervals. The rows of A are processed to have zero mean, so that A represents the deviations of the climate field from its mean. A is stored in row-partitioned form to map onto Spark's RDD abstraction. The top k left and right singular vectors of A correspond to the dominant spatial and temporal EOFs of the climate field, respectively. The right singular vectors are computed using the implicitly restarted Arnoldi method (provided by the netlib-java ARPACK binding); this is an iterative process that requires (distributed) multiplications against A^T A. Once the right singular vectors are computed, the left singular vectors and the corresponding singular values are obtained by performing a (distributed) multiplication of the right singular vectors against A and taking a (local) SVD of the resulting matrix.
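The two distributed kernels can be sketched on a single machine by treating a list of row blocks as the partitions of A (an illustrative sketch under that assumption, not the production Spark code):

```python
import numpy as np

def multiply_gramian(partitions, v):
    """Apply A^T A to a vector without ever forming A^T A: each row
    block A_i contributes A_i^T (A_i v), and the contributions are
    summed. In Spark this is a map over RDD partitions followed by a
    reduce; ARPACK's Arnoldi iteration drives repeated calls to it."""
    return sum(Ai.T @ (Ai @ v) for Ai in partitions)

def left_factors(partitions, V):
    """Given the top-k right singular vectors V (as columns), recover
    the left singular vectors and singular values: form A V with a
    distributed multiply, then take a small local SVD of the tall
    n-by-k result, which is cheap since k is small."""
    AV = np.vstack([Ai @ V for Ai in partitions])
    U, s, _ = np.linalg.svd(AV, full_matrices=False)
    return U, s
```

Because A V equals U diag(s) when V holds exact right singular vectors, the local SVD of the tall skinny product recovers both the left factors and the singular values at once.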
The publicly available Climate Forecast System Reanalysis (CFSR) dataset contains 31 years of ocean surface and subsurface temperature readings, collected at 6-hour intervals over 41 depth levels. We extracted these ocean temperature measurements to compute temperature EOFs; the input data is a 6,349,676-by-46,715 double-precision matrix that occupies 2.3 TB of RAM.
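The quoted memory footprint follows directly from the matrix dimensions (a quick arithmetic check, taking 1 TB as 10^12 bytes):

```python
n_rows, n_cols, bytes_per_double = 6_349_676, 46_715, 8
size_bytes = n_rows * n_cols * bytes_per_double
print(f"{size_bytes / 1e12:.2f} TB")  # prints "2.37 TB"
```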
Our analyses provide greater insight into Kelvin and Rossby waves and other components of large-scale modes of oscillation such as ENSO and the PDO. Further, they provide information about the most influential parts of the ocean, which can lie below the surface.
Our current implementation satisfies the goal of providing a user-friendly (viz., Spark-based) tool for EOF extraction. The remaining concern is scalability: the current Spark EOF implementation scales poorly compared to a benchmark MPI implementation of the same algorithm. This is due to the heavy communication required during the distributed computation, which incurs high overhead in Spark. Work is underway on a different EOF implementation that incurs a lower communication cost. A separate thread of work is investigating efficient ways to provide an interface between Spark and traditional MPI-based HPC libraries.
- Gittens, A., N. Cavanaugh, K. Kashinath, T. A. O'Brien, Prabhat, and M. Mahoney, "Large-scale Parallelized EOF Computation on the CSFR Ocean Temperature Field," AMS Annual Meeting, New Orleans, LA, January 2016.
- Gittens, A., A. Devarakonda, E. Racah, M. Ringenburg, L. Gerhardt, et al., "Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies," arXiv:1607.01335 (preprint), July 2016.