11.6.2. Checkpoint and restart of parallel jobs

Old versions of Open MPI (starting from v1.3 series) had support for the transparent, coordinated checkpointing and restarting of MPI processes (similar to LAM/MPI).

Open MPI supported both the the BLCR checkpoint/restart system and a “self” checkpointer that allows applications to perform their own checkpoint/restart functionality while taking advantage of the Open MPI checkpoint/restart infrastructure. For both of these, Open MPI provides a coordinated checkpoint/restart protocol and integration with a variety of network interconnects including shared memory, Ethernet, and InfiniBand.

The implementation introduced a series of new frameworks and components designed to support a variety of checkpoint and restart techniques. This enabled support for the methods described above (application-directed, BLCR, etc.) as well as other kinds of checkpoint/restart systems (e.g., Condor, libckpt) and protocols (e.g., uncoordinated, message induced).

Note

The checkpoint/restart support was last included as part of the v1.6 series. It was removed from Open MPI after failing to gain adoption and lack of maintenance.