Data-centric science often involves moving data across file systems, multi-stage analytics, and visualization. Workflow technologies can improve the productivity and efficiency of data-centric science by orchestrating and automating these steps. NERSC provides support for the TaskFarmer, Swift and Fireworks tools. We also maintain other packages, such as Tigres, that can help users build workflows.
This page describes the current set of tools and services in the workflow ecosystem at NERSC. The best tool for your workflow will depend strongly on your personal preference, but we make the following general recommendations:
- If you need to write parallel scripts that run many copies of ordinary programs concurrently in various workflow patterns, consider using Swift or TaskFarmer.
- If you have a large number of single- or multi-core jobs that need to run in parallel and may have varying wall times, consider TaskFarmer.
- If you have a large number of MPI jobs to orchestrate, consider using Swift or Fireworks.
- If you need to run a long-term campaign over diverse compute resources, consider Fireworks.
- If you need a complex or dynamic workflow (e.g. a dependent chain of tasks), consider Fireworks.
Many workflow tools need to run on a login node for long periods of time while they monitor job execution and throughput. Long-running jobs are discouraged on the login nodes, so on Cori NERSC has provisioned a set of dedicated workflow nodes for this kind of application. These nodes are still subject to system-wide outages, but they usually see less resource contention than a login node. To request access to a workflow node, please email email@example.com.
TaskFarmer is a utility developed in-house at NERSC to distribute the execution of tasks across compute nodes in a single large batch job. Tasks can be single- or multi-core, but an individual task cannot span nodes (for example, a multi-node MPI job is not supported). TaskFarmer tracks which tasks have completed successfully, and allows straightforward re-submission of failed or un-run tasks from a task list.
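The task-list pattern described above can be illustrated with a minimal Python sketch. This is not TaskFarmer's interface or implementation: the function names (`load_done`, `run_task`, `farm`) and the progress-file format are hypothetical, and the sketch runs tasks with a thread pool on a single machine rather than across compute nodes. It only shows the idea of recording successful tasks so that a second pass re-runs just the failed or un-run ones.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_done(progress_file):
    """Return the set of task indices already recorded as successful."""
    path = Path(progress_file)
    if not path.exists():
        return set()
    return {int(line) for line in path.read_text().split()}


def run_task(index, command):
    """Run one shell command; report success if it exits with status 0."""
    result = subprocess.run(command, shell=True)
    return index, result.returncode == 0


def farm(commands, progress_file, workers=4):
    """Run every task not yet marked done; return indices that failed this pass.

    Hypothetical helper: each line of the task list is one shell command,
    and successful task indices are appended to progress_file.
    """
    done = load_done(progress_file)
    pending = [(i, cmd) for i, cmd in enumerate(commands) if i not in done]
    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_task, i, cmd) for i, cmd in pending]
        with open(progress_file, "a") as log:
            for future in futures:
                index, ok = future.result()
                if ok:
                    # Record success immediately so an interrupted run can resume.
                    log.write(f"{index}\n")
                    log.flush()
                else:
                    failed.append(index)
    return sorted(failed)
```

Calling `farm(commands, "progress.txt")` a second time with the same progress file skips every task already recorded as successful, which mirrors the re-submission behavior described above in a simplified form.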