
Hadoop

Background

Google proposed the MapReduce model to address the challenges of processing large volumes of Internet data. The MapReduce model quickly gained popularity and is often used to process indexes, logs and user behavior data at global Internet companies including Google, Yahoo!, Facebook and Twitter. The MapReduce model was designed to be deployed on very large clusters of commodity machines. It was originally developed, and is still heavily used, for massive data analysis tasks. Google's MapReduce was designed to work in conjunction with its proprietary distributed file system, the Google File System (GFS). GFS supports several features that are specifically suited to MapReduce workloads, such as data locality and data replication.

The MapReduce programming model consists of a map function that transforms the input records into intermediate output data, grouped by a user-specified key, and a reduce function that processes the intermediate data to generate a final result. Both the map and reduce functions consume and generate key/value pairs. The reduce phase receives the intermediate key/value pairs from the MapReduce framework sorted by key. The model takes data locality into account in the map stage, i.e., it moves the computation closer to the data, and it provides fault-tolerance mechanisms to manage the inherent failure rates of commodity computers.
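As a concrete illustration of the model, the following Python sketch (a hypothetical example, not code from the Hadoop distribution) expresses a word count as map and reduce functions operating on key/value pairs:

    # map: (offset, line of text) -> intermediate (word, 1) pairs
    def map_func(key, value):
        for word in value.split():
            yield (word, 1)

    # reduce: (word, iterable of counts) -> (word, total count)
    def reduce_func(key, values):
        yield (key, sum(values))

    # The framework shuffles and sorts the intermediate pairs by key
    # before handing each group to reduce_func.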

Apache Hadoop is an open-source distributed computing platform that implements the MapReduce model and has gained significant traction in the commercial sector over the last few years. Hadoop consists of two core components: the job management framework that handles the map and reduce tasks, and the Hadoop Distributed File System (HDFS). Hadoop's job management framework is highly reliable and available, using techniques such as replication and automated restart of failed tasks. The framework also has optimizations for heterogeneous environments and workloads, e.g., speculative (redundant) execution that reduces delays due to stragglers.

HDFS is a highly scalable, fault-tolerant file system modeled after the Google File System. The data locality features of HDFS are used by the Hadoop scheduler to schedule the I/O-intensive map computations closer to the data. The scalability and reliability characteristics of Hadoop suggest that it could be used as an engine for running scientific ensembles; however, the target application for Hadoop is very loosely coupled analysis of huge data sets. HDFS fundamentally differs from parallel file systems in its underlying storage model: HDFS relies on local storage on each node, while parallel file systems are typically served from a set of dedicated I/O servers. Most parallel file systems use the POSIX interface, which enables applications to be portable across different file systems on various HPC systems. HDFS, on the other hand, has its own interface, requiring applications to be rewritten to leverage HDFS features. FUSE provides a POSIX interface on top of HDFS, but with a performance penalty.

Hadoop is used by a number of global Internet companies, and an entire open-source ecosystem of supporting software components has evolved around the core MapReduce and HDFS framework. Native Hadoop provides a Java API, allowing map and reduce functions to be written as Java programs. Hadoop Pipes and Hadoop Streaming, add-on components to Hadoop, enable applications written in C++ and other languages to be plugged in as maps and reduces. Dumbo is a Python library built on streaming. Additionally, higher-level languages have evolved, such as Pig, a data-flow language, and Jaql, a JSON (JavaScript Object Notation) based semi-structured query processing language.

The Hadoop ecosystem is rapidly growing, and we mention some popular components here. HBase and Cassandra provide database implementations on top of Hadoop. Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying. Mahout is a machine learning library, and Giraph provides large-scale graph processing on Hadoop. Oozie is a workflow tool that allows chaining of multiple MapReduce stages. There are a number of other building blocks, such as ZooKeeper, a high-performance coordination service that provides semantics for concurrent access. Chukwa and Scribe are data collection and general log aggregation frameworks. If you are interested in exploring one or more of the tools in the Hadoop ecosystem, contact NERSC support.

Use for Scientific Applications

A number of scientific applications have characteristics in common with MapReduce/Hadoop jobs: applications that employ a high degree of parallelism or need to operate on large volumes of data may benefit from MapReduce and Hadoop.

Apache Hadoop and a number of other open-source MapReduce implementations are written in Java, while scientific codes are often written in Fortran, C or C++, or use languages such as Python for analysis. The Hadoop streaming model allows one to create MapReduce jobs with any executable or script as the mapper and/or the reducer. This is the most suitable model for scientific applications that have years of code in place capturing complex scientific processes. For an application to use this model, it needs to read input through standard input and write output through standard output. Thus, legacy applications are limited to the streaming model, which may not harness the full benefits of the MapReduce framework. Below we discuss some other characteristics that affect scientific application development in Hadoop.
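For example, a streaming mapper can be a small wrapper script that feeds each input record to an existing binary and emits its results as key/value pairs on standard output. The script below is a minimal, hypothetical sketch; the executable name ./legacy_app and the tab-separated output convention are assumptions for illustration.

    #!/usr/bin/env python
    # Hypothetical Hadoop streaming mapper wrapping a legacy executable.
    import subprocess
    import sys

    for line in sys.stdin:
        record = line.strip()
        if not record:
            continue
        # Run the (assumed) legacy binary on one input record.
        out = subprocess.Popen(["./legacy_app", record],
                               stdout=subprocess.PIPE).communicate()[0]
        # Emit a tab-separated key/value pair for the shuffle and reduce phases.
        sys.stdout.write("%s\t%s\n" % (record, out.decode().strip()))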

File System. The Hadoop Distributed File System (HDFS) does not have a POSIX-compliant interface, which severely restricts its adoption by legacy applications. Some projects have implemented native Hadoop applications that take advantage of the full capabilities of the MapReduce model by using HDFS. For example, the authors of CloudBurst have implemented short-read mapping for genomic sequence data in Hadoop. HDFS's data locality features can be useful to applications that need to process large volumes of data. However, Hadoop considers data locality only for a single file and does not handle applications that might have multiple input sets: each map task operates on a single, independent piece of data, and thus only the locality of that single file is considered.

Data Formats. Apache Hadoop treats inputs as blocks of data, where each map task gets a block. Scientific applications often work with files, where the logical division of work is per file. Apache Hadoop has built-in support for text data; new file formats require additional Java programming to define the format, the appropriate split for a single map task, and a reader that loads the data and converts it to key/value pairs.

Diverse Tasks. Traditionally, all mapper and all reducer tasks are considered identical in function and are assumed to work on roughly equal-sized workloads. Implementing different mappers or reducers requires logic in the tasks themselves to differentiate the functionality, since there is no easy way to specify it in the higher-level programming model. In addition, differences in inputs or algorithms can cause worker processing times to vary widely, which can result in timeouts and restarted tasks due to Hadoop's speculative execution. Large differences in processing time between tasks also cause load imbalance, which may dramatically impact horizontal scalability.
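One common workaround, sketched below with hypothetical record tags, is to embed the dispatch logic in the task itself, for example by branching on a tag carried in each input record:

    # Hypothetical streaming mapper that handles two kinds of records,
    # distinguished by a tag in the first tab-separated field.
    import sys

    for line in sys.stdin:
        tag, _, payload = line.rstrip("\n").partition("\t")
        if tag == "SEQUENCE":
            # placeholder for sequence-specific processing
            sys.stdout.write("seq\t%d\n" % len(payload))
        elif tag == "SPECTRUM":
            # placeholder for spectrum-specific processing
            sys.stdout.write("spec\t%d\n" % len(payload))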

Hadoop Benchmarking for Data-Intensive Operations: We evaluated the suitability of MapReduce/Hadoop for specific data-intensive operations (filter, merge, reorder). While the performance an application can obtain is largely workload dependent, some key factors stand out.

    • For the same amount of work performed on an equivalent data volume, filter is the least expensive operation due to minimal writes, whereas merge and reorder are expensive due to the output writes.
    • High-performance file systems (e.g., GPFS in our case) do better than HDFS with writes at lower concurrencies, but HDFS does significantly better at higher concurrencies. HDFS also does better for read-intensive applications at larger data volumes due to data locality. The ability to use existing file systems available at HPC centers provides a number of advantages to scientific applications: it allows users to apply MapReduce functions to existing data, use legacy application binaries and operate through familiar HPC environments.
    • Our benchmarks show minimal impact from high-performance, low-latency interconnects. To truly leverage the underlying network, Hadoop's shuffle algorithm would need to be changed. Mellanox recently announced an Unstructured Data Accelerator (UDA) software plugin that allows Hadoop frameworks to leverage RDMA (Remote Direct Memory Access) and modifies the shuffle algorithm to fully exploit the high-speed network.
    • Replication, and the resulting data locality, can improve performance for read-intensive applications (e.g., filter). For operations such as merge and reorder, the majority of the time is spent writing the output, and the advantage of replication can be insignificant.
    • Hadoop's streaming mode allows legacy applications to be plugged in as maps and reduces. The streaming mode adds an overhead that increases with data size compared to using the native Java Hadoop interface.

Use Cases  

A number of applications from diverse communities have experimented with Hadoop at NERSC, including climate analysis and bioinformatics applications. Bioinformatics and biomedical algorithms often consist of large numbers of high-throughput jobs with minimal or no coordination required between individual jobs, making them suitable for MapReduce/Hadoop. We outline three case studies that show some of the uses of Hadoop at NERSC.

BioPig

BioPig is a framework that works on top of Apache Pig to enable biologists to create simple scripts that perform sophisticated analysis of genomic data. Pig is a framework and language that layers on top of the Apache Hadoop MapReduce framework. The Pig framework allows a user to express an operation, such as a filter, that is then converted into a set of map and reduce steps executed in Hadoop. This allows users to exploit the capabilities of Hadoop without learning the details of how to program for it. BioPig extends Pig with user-defined functions (UDFs). The extensions provide common operations that are performed on genomic data, such as converting sequencer reads to k-mer space.

One of the goals of using BioPig is to enable analysis that is difficult to tackle with existing tools. For example, one goal was to find new cellulases (enzymes that can break down cellulose) in a very large (497 GB) metagenomic data set extracted from a cow's rumen. Researchers were also interested in using the framework for more routine tasks such as k-mer histogram generation. Attempts had already been made to use serial and MPI programming techniques: with serial programming, the applications ran out of memory except on a 1 TB RAM machine; the MPI program worked well but required specific knowledge of MPI and was relatively hard to debug. The promise of BioPig is that it could achieve some of the scaling benefits of MPI without requiring special training.
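K-mer histogram generation maps naturally onto MapReduce: each map task emits the k-mers found in its reads, and the reduce step sums the counts per k-mer. The following Python functions are a simplified, hypothetical illustration of the idea, not BioPig code:

    # Simplified k-mer counting expressed as map/reduce functions.
    K = 21  # example k-mer length

    def map_kmers(read_id, sequence):
        # Emit (k-mer, 1) for every k-mer in the read.
        for i in range(len(sequence) - K + 1):
            yield (sequence[i:i + K], 1)

    def reduce_counts(kmer, counts):
        # Sum the occurrences of each k-mer across all reads.
        yield (kmer, sum(counts))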

The main challenges in using Hadoop involved debugging. The Hadoop job tracker website shows the progress of jobs graphically and numerically and allows the user to drill down to the individual log files for each job; a BioPig user who writes their own user-defined functions will need to use these tools for debugging. Another challenge was that, even though the framework is designed to scale to very large datasets, users still need to be careful with how they use it. For example, in one case runs created an index structure that filled the entire Hadoop file system (HDFS) space; the runs were adjusted to work around this by applying an encoding method to the indexed data. Steps were also required to control the amount of RAM used by the mappers and reducers, which amounts to setting the right maximum for the Java virtual machine's heap size. Thus, significant parameter tuning was required to run these applications effectively in Hadoop.
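In Hadoop 1.x, one common way to cap the per-task heap is to pass JVM options to the child tasks at job submission via the mapred.child.java.opts property; the value below is only an illustrative example and must be matched to the memory available per node:

    -D mapred.child.java.opts=-Xmx2048m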

Bioinformatics and Biomedical Algorithms

A group at Washington State University is currently developing MapReduce algorithms and implementations for some of the emerging data-intensive problems in bioinformatics and computational biology. The group started using the NERSC Hadoop cluster in November 2010. The scientific and computational problems addressed (in either completed or ongoing efforts) are as follows:

Identifying peptides from mass spectrometry data obtained from environmental microbial communities.

The group developed a simple MapReduce implementation, called MR-MSPolygraph, for parallelizing a novel serial hybrid search method developed by researchers at Pacific Northwest National Laboratory. The data-intensive nature of the problem, coupled with built-in features of Hadoop such as load balancing, makes Hadoop an appealing platform for this application. Due to the inherent data-parallel nature of the underlying approach, the input could be divided into smaller chunks and efficiently parallelized. However, the main challenges involved managing the workload imbalances across chunks (e.g., the same chunk size may lead to variable workloads) and implementing the necessary file I/O operations during processing of each chunk. These challenges were overcome by conducting parametric studies (e.g., finding the empirically optimal work granularity to ensure load balance) and by resorting to GPFS file reads during processing.
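In this pattern, the map input records are lightweight descriptions of work (for example, a file path and a range of spectra), and the bulk data is read from the shared file system inside the task. The sketch below is a hypothetical illustration of that idea, not MR-MSPolygraph code; the record format and path layout are made up:

    #!/usr/bin/env python
    # Hypothetical streaming mapper: each input line names a chunk of spectra
    # stored on GPFS, and the actual data is read from the shared file system.
    import sys

    for line in sys.stdin:
        # Assumed record format: <GPFS path> <first spectrum> <last spectrum>
        path, start, end = line.split()
        start, end = int(start), int(end)
        with open(path) as f:  # GPFS is mounted as an ordinary POSIX path
            spectra = f.readlines()[start:end]
        # Placeholder for scoring the spectra against the peptide database.
        sys.stdout.write("%s:%d-%d\t%d\n" % (path, start, end, len(spectra)))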

Experimental results showed that the implementation scales linearly, giving a peak speedup of 398x on 400 map tasks. More importantly, 64,000 experimental spectra were analyzed in six hours on 400 cores, reducing the time to solution drastically (a serial run would have needed more than 2,000 CPU hours). An applications note on this research was accepted in Bioinformatics.

Protein sequence clustering.

The goal of this project is to identify clusters within graphs built out of known or predicted protein sequences from public metagenomic repositories. Among other applications, clustering can be useful for identifying protein families represented in environmental communities, but the volume of sequence information and the complexity of the clustering process necessitate a high degree of parallelism. The group is developing a new MapReduce algorithm for clustering massive graphs and has been testing it on the NERSC Hadoop cluster.

Sequence indexing.

In another ongoing effort, a new MapReduce algorithm for building the suffix array data structure for DNA and protein sequences is being developed. Once fully developed, this tool can help index public sequence repositories for efficiently supporting pattern matching and sequence searching operations.

Numerical Linear Algebra

The QR factorization is an important decomposition in scientific computing and numerical linear algebra. In particular, tall and skinny (TS) matrices, where the number of rows is much larger than the number of columns, arise in several applications. Communication-avoiding algorithms for QR decompositions of TS matrices have been developed that lend themselves well to the MapReduce programming model. Previously, Constantine and Gleich demonstrated how to effectively use MapReduce to perform TSQR. One component missing from MapReduce TSQR is how to generate Q explicitly and effectively (i.e., with speed, accuracy, and a small amount of memory). Users have looked at iterative refinement as a method for generating Q. Another goal was to compare MapReduce implementations of TSQR to a Cholesky QR implementation.
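For reference, Cholesky QR computes R from the Cholesky factorization of A^T A (A^T A = R^T R) and then recovers Q as A R^{-1}; it is fast, but its accuracy degrades for ill-conditioned matrices. A small NumPy sketch of the idea (illustrative only, not the implementation used in the study):

    import numpy as np

    A = np.random.rand(10000, 10)           # a tall-and-skinny matrix
    L = np.linalg.cholesky(np.dot(A.T, A))  # A^T A = L L^T, so R = L^T
    R = L.T                                 # upper-triangular factor with A = Q R
    Q = np.linalg.solve(L, A.T).T           # Q = A R^{-1}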

Apache Hadoop was used as the MapReduce architecture, and experiments were conducted on the Hadoop testbed at NERSC. Dumbo provided the Python wrappers around the Hadoop environment and was used to deploy jobs on the NERSC Hadoop testbed. The map and reduce tasks were implemented in Python 2.6, and Hadoop streaming was used to plug the Python tasks into Hadoop.
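A minimal sketch of the TSQR map and reduce steps in this style is shown below (an illustration under the assumption that NumPy is available to the tasks and that each map value arrives as a block of rows; it is not the project's code). Each mapper computes a local QR of its row block and emits only the small R factor; a single reducer stacks the R factors and factors them again to obtain the R of the full matrix.

    import numpy as np

    def mapper(key, rows):
        # rows: one block of rows of the tall-and-skinny matrix
        r_block = np.linalg.qr(np.asarray(rows))[1]
        yield ("R", r_block.tolist())

    def reducer(key, r_blocks):
        # Stack the R factors from all map tasks and factor again.
        stacked = np.vstack([np.asarray(b) for b in r_blocks])
        yield (key, np.linalg.qr(stacked)[1].tolist())

    if __name__ == "__main__":
        import dumbo  # assumed available, as used in the study
        dumbo.run(mapper, reducer)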

Resources