Checkpoint and Restart Considerations
On April 12, 2007, checkpoint/restart capability was enabled by default for all LoadLeveler jobs. The ability to checkpoint jobs allows NERSC to suspend jobs and resume them later, rather than kill them, if the need arises.
This change will be transparent to most jobs. However, there are
some imposed limitations, which are detailed below. If you need
to change the default behavior you may disable checkpoint/restart
by specifying the checkpoint=no LoadLeveler directive in your
batch script, or by unsetting the CHECKPOINT environment
variable in your script.
We have had one report of a Python script failing when checkpoint/restart is enabled.
Checkpoint and restart limitations
Use of the checkpoint and restart function has certain limitations.
If planning to use the checkpoint and restart function, you need to be aware
of the types of programs that cannot be checkpointed. You also need to be
aware of certain program, operating system, mode, and other restrictions.
Programs that cannot be checkpointed
The following programs cannot be checkpointed:
- Programs that do not have the environment variable CHECKPOINT set to yes.
- Programs that are being run under:
  - The dynamic probe class library (DPCL).
  - Any debugger that is not checkpoint/restart-capable.
- Processes that use:
  - Extended shmat support
  - Pinned shared memory segments
- Sets of processes in which any process is running a setuid program when a
  checkpoint occurs.
- Jobs for which POE input or output is a pipe.
- Jobs for which POE input or output is redirected, unless the job is submitted
  from a shell that had the CHECKPOINT environment variable set to yes before
  the shell was started. If POE is run from inside a shell script and is run in
  the background, the script must be started from a shell started in the same
  manner for the job to be able to be checkpointed.
- Jobs that are run using the switch or network table sample programs.
- Interactive POE jobs for which the su command was used prior to checkpointing
  or restarting the job.
- User space programs that are not run under a resource manager that
  communicates with POE (for example, LoadLeveler®).
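The first item in the list above depends on the value of the CHECKPOINT environment variable. As a minimal sketch, a program could inspect that variable at startup to see whether it is a candidate for checkpointing. The helper name checkpoint_enabled is hypothetical, and the exact, case-sensitive comparison against "yes" is an assumption rather than documented behavior.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PE or LoadLeveler).
 * Returns 1 if CHECKPOINT is set to exactly "yes" in this process's
 * environment, and 0 if it is unset or has any other value.
 * Assumption: the value is compared case-sensitively.
 */
int checkpoint_enabled(void)
{
    const char *value = getenv("CHECKPOINT");

    return value != NULL && strcmp(value, "yes") == 0;
}
```

A program might call checkpoint_enabled() once during initialization and log the result, which can help when diagnosing why a job was or was not checkpointed.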
Program restrictions
For any program that meets both of these criteria:
- is compiled with one of the threaded compile scripts provided by PE
- may be checkpointed prior to its main() function being invoked
you must wait for the 0031-114 message to appear in POE's STDERR
before issuing the checkpoint of the parallel job. Otherwise, a subsequent
restart of the job may fail.
Note:
The MP_INFOLEVEL environment variable, or the -infolevel command-line option,
must be set to a value of at least 2 for this message to appear.
For any program that meets both of these criteria:
- is compiled with one of the threaded compile scripts provided by PE
- may be checkpointed immediately after the parallel job is restarted
you must wait for the 0031-117 message to appear in POE's STDERR
before issuing the checkpoint of the restarted job. Otherwise, the checkpoint
of the job may fail.
Note:
The MP_INFOLEVEL environment variable, or the -infolevel command-line option,
must be set to a value of at least 2 for this message to appear.
AIX function restrictions
The following AIX functions will fail, with an errno of ENOTSUP, if the CHECKPOINT environment variable is set to yes in the environment
of the calling program:
- clock_getcpuclockid()
- clock_getres()
- clock_gettime()
- clock_nanosleep()
- clock_settime()
- mlock()
- mlockall()
- mq_close()
- mq_getattr()
- mq_notify()
- mq_open()
- mq_receive()
- mq_send()
- mq_setattr()
- mq_timedreceive()
- mq_timedsend()
- mq_unlink()
- munlock()
- munlockall()
- nanosleep()
- pthread_barrierattr_init()
- pthread_barrierattr_destroy()
- pthread_barrierattr_getpshared()
- pthread_barrierattr_setpshared()
- pthread_barrier_destroy()
- pthread_barrier_init()
- pthread_barrier_wait()
- pthread_condattr_getclock()
- pthread_condattr_setclock()
- pthread_getcpuclockid()
- pthread_mutexattr_getprioceiling()
- pthread_mutexattr_getprotocol()
- pthread_mutexattr_setprioceiling()
- pthread_mutexattr_setprotocol()
- pthread_mutex_getprioceiling()
- pthread_mutex_setprioceiling()
- pthread_mutex_timedlock()
- pthread_rwlock_timedrdlock()
- pthread_rwlock_timedwrlock()
- pthread_setschedprio()
- pthread_spin_destroy()
- pthread_spin_init()
- pthread_spin_lock()
- pthread_spin_trylock()
- pthread_spin_unlock()
- sched_getparam()
- sched_get_priority_max()
- sched_get_priority_min()
- sched_getscheduler()
- sched_rr_get_interval()
- sched_setparam()
- sched_setscheduler()
- sem_close()
- sem_destroy()
- sem_getvalue()
- sem_init()
- sem_open()
- sem_post()
- sem_timedwait()
- sem_trywait()
- sem_unlink()
- sem_wait()
- shm_open()
- shm_unlink()
- timer_create()
- timer_delete()
- timer_getoverrun()
- timer_gettime()
- timer_settime()
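Because the functions listed above fail with ENOTSUP when CHECKPOINT is set to yes, code that calls them may need a fallback path. The following is a minimal sketch for one case, reading the wall-clock time. The helper name wall_clock_seconds is hypothetical. The sketch assumes that gettimeofday(), which does not appear in the list above, remains usable in a checkpointable program.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical helper: stores the wall-clock time in seconds in *out.
 * Tries clock_gettime() first; if that fails with ENOTSUP, falls back
 * to gettimeofday() and sets *used_fallback to 1.
 * Returns 0 on success and -1 on any other failure.
 */
int wall_clock_seconds(double *out, int *used_fallback)
{
    struct timespec ts;
    struct timeval tv;

    *used_fallback = 0;
    if (clock_gettime(CLOCK_REALTIME, &ts) == 0) {
        *out = (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
        return 0;
    }
    if (errno != ENOTSUP)
        return -1;              /* unrelated error: report it */

    /* Assumption: gettimeofday() is not restricted under checkpointing. */
    *used_fallback = 1;
    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    *out = (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
    return 0;
}
```

The same pattern (call, test errno against ENOTSUP, then switch to an alternative) could be applied to other functions in the list, provided a suitable alternative exists.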
Node restrictions
The node on which a process is restarted must have:
- The same operating system level (including PTFs). In addition, a restarted
process may not load a module that requires a system call from a kernel extension
that was not present at checkpoint time.
- The same switch type as the node where the checkpoint occurred.
- The capabilities in /etc/security/user that were enabled for that user on
the node where the checkpoint operation was performed.
If any threads in the parallel task were bound to a specific processor
ID at checkpoint time, that processor ID must exist on the node where that
task is restarted.
Task-related restrictions
- The number of tasks and the task geometry (the tasks that are common within
a node) must be the same on a restart as they were when the job was checkpointed.
- Any regular file open in a parallel task when that task is checkpointed
must be present on the node where that task is restarted, including the executable
and any dynamically loaded libraries or objects.
- If any task within a parallel application
uses sockets or pipes, user callbacks should be registered to save data that
may be in transit when a checkpoint occurs, and to restore the data when the
task is resumed after a checkpoint or restart. Similarly, any user shared
memory should be saved and restored.
Pthread and atomic lock restrictions
- A checkpoint operation will not begin on a parallel task until each user
  thread in that task has released all pthread locks, if held. This can
  potentially cause a significant delay from the time a checkpoint is issued
  until the checkpoint actually occurs. Also, any thread of a process that is
  being checkpointed that does not hold any pthread locks and tries to acquire
  one will be stopped immediately. There are no similar actions performed for
  atomic locks (_check_lock and _clear_lock, for example).
- Atomic locks must be used in such a way that they do not prevent the
  releasing of pthread locks during a checkpoint. For example, if a checkpoint
  occurs and thread 1 holds a pthread lock and is waiting for an atomic lock,
  and thread 2 tries to acquire a different pthread lock (and does not hold
  any other pthread locks) before releasing the atomic lock that thread 1 is
  waiting for, the checkpoint will hang.
- If a pthread lock is held when a parallel task creates a new process (either
  implicitly using popen, for example, or explicitly using fork or exec) and
  the releasing of the lock is contingent on some action of the new process,
  the CHECKPOINT environment variable must be set to no before causing the new
  process to be created. Otherwise, the parent process may be checkpointed
  (but not yet stopped) before the creation of the new process, which would
  result in the new process being checkpointed and stopped immediately.
- A parallel task must not hold a pthread lock when creating a new process
  (either implicitly using popen, for example, or explicitly using fork) if
  the releasing of the lock is contingent on some action of the new process.
  Otherwise, a checkpoint could occur that would cause the child process to be
  stopped before the parent could release the pthread lock, causing the
  checkpoint operation to hang.
- The checkpoint operation may hang if any user pthread locks are held across:
  - Any collective communication calls in MPI (or in LAPI, if LAPI is being
    used in the application).
  - Calls to mpc_init_ckpt or mp_init_ckpt.
  - Any blocking MPI call that returns only after action on some other task.
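One of the restrictions above requires that CHECKPOINT be set to no before a new process is created when the release of a held pthread lock depends on that process. The following is a minimal sketch of the environment step only, using popen. The helper name popen_without_checkpoint is hypothetical, and the sketch leaves CHECKPOINT set to no in the parent afterwards, since the text above does not say whether it should be restored.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PE or LoadLeveler).
 * Sets CHECKPOINT=no, runs cmd through popen(), and copies the first
 * line of the command's output into line (at most len - 1 characters).
 * Returns 0 on success and -1 on failure.
 * Note: CHECKPOINT stays set to "no" in the calling process.
 */
int popen_without_checkpoint(const char *cmd, char *line, size_t len)
{
    FILE *fp;

    if (cmd == NULL || line == NULL || len == 0)
        return -1;
    line[0] = '\0';

    /* Must happen before the child exists so the child inherits it. */
    if (setenv("CHECKPOINT", "no", 1) != 0)
        return -1;

    fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;

    if (fgets(line, (int)len, fp) == NULL)
        line[0] = '\0';         /* command produced no output */

    return pclose(fp) == -1 ? -1 : 0;
}
```

In a threaded program, the caller would hold its pthread lock around this call and release it once the child's output has been read. The simpler design, where possible, is to release the lock before creating the child at all.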
Other restrictions
- Processes cannot be profiled at the time a checkpoint is taken.
- There can be no devices other than TTYs or /dev/null open at
the time a checkpoint is taken.
- Open files must either have an absolute pathname that is less than or
equal to PATHMAX in length, or must have a relative pathname that is less
than or equal to PATHMAX in length from the current directory at the time
they were opened. The current directory must have an absolute pathname that
is less than or equal to PATHMAX in length.
- Semaphores or message queues that are used within the set of processes
  being checkpointed must only be used by processes within the set of processes
  being checkpointed. This condition is not verified when a set of processes
  is checkpointed. The checkpoint and restart operations will succeed, but
  inconsistent results can occur after the restart.
- The processes that create shared
memory must be checkpointed with the processes using the shared memory if
the shared memory is ever detached from all processes being checkpointed.
Otherwise, the shared memory may not be available after a restart operation.
- The ability to checkpoint and restart a process is not supported for B1
and C2 security configurations.
- A process can checkpoint another process only if it can send a signal
  to the process. In other words, the privilege checking for checkpointing
  processes is identical to the privilege checking for sending a signal to
  the process. A privileged process (the effective user ID is 0) can
  checkpoint any process. A set of processes can only be checkpointed if each
  process in the set can be checkpointed.
- A process can restart another process only if it can change its entire
privilege state (real, saved, and effective versions of user ID, group ID,
and group list) to match that of the restarted process.
- A set of processes can be restarted only if each process in the set can
be restarted.
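The path-length restriction on open files can be checked before a file is opened. The following is a minimal sketch under the assumption that the PATHMAX limit named above corresponds to the PATH_MAX constant from <limits.h>. The helper names path_length_ok and cwd_length_ok are hypothetical, and the fallback value of 1024 is an assumption used only when the header does not define PATH_MAX.

```c
#define _XOPEN_SOURCE 700
#include <limits.h>
#include <string.h>
#include <unistd.h>

#ifndef PATH_MAX
#define PATH_MAX 1024   /* assumption: fallback when <limits.h> omits it */
#endif

/*
 * Hypothetical helper: returns 1 if path (absolute, or relative to the
 * current directory) is no longer than PATH_MAX characters, else 0.
 */
int path_length_ok(const char *path)
{
    return path != NULL && strlen(path) <= (size_t)PATH_MAX;
}

/*
 * Hypothetical helper: returns 1 if the current directory's absolute
 * pathname fits in PATH_MAX characters. getcwd() fails when the name,
 * plus its terminating null byte, does not fit in the buffer.
 */
int cwd_length_ok(void)
{
    char buf[PATH_MAX + 1];

    return getcwd(buf, sizeof buf) != NULL;
}
```

A program could call cwd_length_ok() once at startup and path_length_ok() before each open, and report a warning when either returns 0.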