High Throughput Computing and Workflow Tools
NERSC recognizes that an increasing number of scientific data problems rely on orchestrating and managing large numbers of tasks through complex workflows. An increasingly important case is high throughput computing, where an analysis must perform a very large number of independent tasks, each of which uses only a few cores. Managing these workloads via traditional batch jobs is inefficient for both humans and the batch queuing system. We use the term workflow system to mean software that can coordinate and manage a related set of jobs.
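The high throughput pattern described above can be sketched with the standard library alone: many small, independent tasks are dispatched from a single script instead of thousands of separate batch jobs. The `analyze` function here is a hypothetical stand-in for a real per-task analysis.

```python
# Sketch of the high throughput pattern: many small independent tasks,
# each using few cores, run concurrently from one driver script.
from concurrent.futures import ProcessPoolExecutor

def analyze(task_id):
    # Placeholder for a small, independent analysis task.
    return task_id * task_id

if __name__ == "__main__":
    # Fan the independent tasks out across a local worker pool.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(analyze, range(100)))
    print(results[:5])  # → [0, 1, 4, 9, 16]
```

A real workflow system adds persistence, dependency tracking, and error recovery on top of this basic fan-out.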
While NERSC does not officially recommend any one specific workflow system, we would like to point our users to tools being used by other NERSC projects to see if these can meet their needs.
Fireworks - Fireworks is a job workflow manager designed for high throughput compute pipelines, with support for job dependencies and error handling/restart. It depends on MongoDB and is used by the Materials Project to manage its many-task computing setup. To set up Fireworks for your workflow at NERSC, go here.
Galaxy - Galaxy is a web-based workflow tool used by the genomics community. It allows you to define job dependencies via its web-based workflow engine.
qdo - This is a very lightweight tool designed by the astrophysics community to manage queues of high throughput jobs. It supports task dependencies, priorities, management of tasks in aggregate, and flexibility such as adding additional tasks to a queue even after the batch worker jobs have started. For more information on qdo, please contact the Data and Analytics Services Group.
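The queue semantics qdo provides can be illustrated with a minimal standard-library sketch: tasks carry priorities, and new tasks can be added at any time, even while workers are draining the queue. This illustrates the pattern only; it is not qdo's actual API, and the class and command strings are hypothetical.

```python
# A toy priority task queue illustrating the qdo-style pattern.
import heapq
import itertools

class TaskQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def add(self, command, priority=0):
        # Lower number = higher priority. Tasks may be added at any
        # time, even after workers have begun pulling from the queue.
        heapq.heappush(self._heap, (priority, next(self._counter), command))

    def pop(self):
        # Return the next command, or None when the queue is empty.
        return heapq.heappop(self._heap)[2] if self._heap else None

queue = TaskQueue()
queue.add("process chunk 1")
queue.add("urgent reprocessing", priority=-1)
queue.add("process chunk 2")
print(queue.pop())  # → urgent reprocessing
```

qdo itself adds persistence and multi-worker coordination, so batch jobs on different nodes can safely pull from the same queue.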
Hadoop - Hadoop is a popular framework for data-parallel problems that fit the map-reduce paradigm: problems that employ a high degree of parallelism to operate on large volumes of data. More information on using Hadoop at NERSC can be found here.
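The map-reduce paradigm can be shown with a toy word count: the map phase emits (key, value) pairs and the reduce phase aggregates the values per key. This standard-library sketch runs on one machine; Hadoop's value is distributing these same phases across many nodes.

```python
# Toy word count illustrating the map-reduce paradigm.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the values for each key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(map_phase(lines)))  # "the" is counted twice
```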
MySGE - MySGE allows users to create a private Sun Grid Engine cluster on large parallel systems like Hopper. Once the cluster is started, users can submit serial jobs, array jobs, and other throughput-oriented workloads to the personal SGE scheduler, and the jobs then run within the user's private cluster.
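An array job, the main throughput-oriented construct mentioned above, is sketched below using standard Grid Engine directives. The job name, task range, and script path are placeholders for illustration.

```shell
#!/bin/bash
# Hypothetical Grid Engine array job: the scheduler runs this script
# once per task ID in the range given by -t, setting SGE_TASK_ID
# (a standard Grid Engine variable) so each task picks its own input.
#$ -N my_throughput_job
#$ -t 1-100
./analyze_one_task input.${SGE_TASK_ID}
```

Submitting this script to the personal scheduler turns 100 small tasks into a single submission that the private cluster drains on its own.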
For documentation on specific workflow tools, please refer to the workflow software pages.