A Survey of Checkpointing Strategies for Shared-Memory HPC Applications
Ana Gainaru, Aparna Sasidharan, Jorge Mario Cara Carmona
University of Illinois, Urbana-Champaign
April 25, 2012

1 Introduction

Fault-tolerant protocols have always been a major research topic for the HPC community. Harnessing microprocessors to solve large computational problems has required the use of many microprocessors in a single system. Whereas the large server machines in the business sector today may have as many as 32 processors, large supercomputers can have thousands or tens of thousands of processors in a single machine. While this approach has proven highly effective in expanding the limits of computational capability, it has also brought to the foreground new challenges that did not arise in smaller systems. Fault tolerance is one such critical challenge.

The problem of fault tolerance in modern systems arises from two important HPC trends. First is the rising frequency of faults in systems. Second is the increasing size and running time of the applications running on these systems, which makes them more vulnerable to these faults. HPC systems are vulnerable to faults for three major reasons. First, whereas older machines were built from custom-made, high-quality components, modern systems use commodity components that were designed and built for a less reliability-aware market. Second, as modern systems are made from more and more components, the probability that one of them fails becomes quite large, even if the individual components are reliable. Finally, as circuit feature sizes become smaller, circuits become increasingly vulnerable to soft, hard and intermittent errors caused by ambient radiation, temperature fluctuation and other everyday phenomena. The end result is that the largest HPC systems today, such as the ASC systems, have mean times between faults on the order of hours to days. At the same time, many applications being run on these systems have running times on the order of days, weeks and months. At exascale, some predictions are that the mean time between failures will range between a few hours and one day.

Designing a suitable fault-tolerance protocol for HPC applications is a challenge because it should be energy efficient without affecting the overall performance of the application. Replication is too costly, since it duplicates the workload, and can therefore be ruled out. Any fault-tolerant protocol we design for HPC applications should therefore be based on process checkpointing. While it is possible to design applications to be insensitive to system faults, this requires a significant amount of programmer effort and can be extremely difficult for general algorithms. A more general and simpler solution is rollback restart, a technique in which the state of the application is periodically saved to stable storage. In the event of a system fault the application is aborted and restarted on functioning hardware as if nothing happened. While less efficient than application-specific approaches, rollback restart has become popular because its simplicity leads to a lower time-to-solution than competing fault-tolerance techniques.

2 Fault-tolerance techniques for Message Passing Applications

A message passing system consists of a fixed number of processes that communicate only through messages. Processes cooperate to execute a distributed application program and interact with the outside world by receiving and sending input and output messages, respectively. A global state of a message passing system is a collection of the individual states of all participating processes and of the states of the communication channels. A consistent global state is one that can occur during a correct, failure-free execution of the application. For example, if a process's state reflects a message receipt, then the state of the corresponding sender reflects sending that message. A rollback-recovery protocol for message passing systems must bring the system to a consistent state in the case of a failure. Checkpoint-based rollback-recovery techniques can be classified as follows:

• Uncoordinated Checkpointing - Each process takes a checkpoint without coordinating with the other processes, at whatever time is most convenient for it. This may lead to the domino effect [1], in which case all the processes may roll back to the beginning of the computation. A process may also record checkpoints that do not store useful information, which will need to be removed by garbage collection.

• Coordinated Checkpointing - This technique does not lead to the domino effect and reduces storage overhead, since only one checkpoint per process is saved. There is no need for garbage collection to remove obsolete checkpoints. The disadvantage is the latency involved in taking a checkpoint and therefore in committing the output of the execution. Since the checkpoint is global, a barrier needs to be executed before taking a checkpoint, and this may block communication. There are techniques to improve this, for example non-blocking checkpoint coordination [2] and the use of synchronised checkpoint clocks [3].

• Communication-induced Checkpointing - In this method, a process can take checkpoints independently (local checkpoints), while it may also have to take certain checkpoints to ensure the progress of the recovery line (forced checkpoints).

[TODO:Add something on message logging protocols] [TODO:Add diagrams]

Coordinated checkpointing is the most popular protocol for HPC right now because recovery is simple and garbage collection is efficient (only the last checkpoint is needed). However, it has severe drawbacks, since all processes have to roll back to their previous checkpoint. Moreover, all processes writing their checkpoints at the same time creates burst accesses to the I/O system that may slow down execution. Uncoordinated checkpointing, on the other hand, does not require any synchronisation between the processes at checkpoint time, so it can avoid the problem caused by burst accesses to the I/O system.

One of the grounds on which existing fault-tolerant protocols differ is the amount of determinism they assume of applications. Both uncoordinated (without message logging) and coordinated checkpointing work on the assumption that all processes are non-deterministic. On the occurrence of a fault, these protocols have to restart all processes, because the execution after a restart may be entirely different from that before the fault. Message logging protocols are a slight improvement in this regard because they assume some amount of determinism. They assume that the code is piecewise deterministic, i.e., the state of a process is influenced only by deterministic local actions and message receptions. However, message logging protocols in turn have high overheads, making them unsuitable for HPC applications. In short, if the communication in HPC applications were deterministic, we could reduce the overheads of fault-tolerant protocols. Uncoordinated checkpointing can be combined with message logging to avoid the domino effect and limit the number of processes that must roll back in the event of a failure, if we assume that the application is piecewise deterministic. A new property called send determinism has been shown to be common to many MPI HPC applications. In a send-deterministic application, given a set of input parameters, the sequence of message emissions of any process is the same in any correct execution [4].
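As a concrete illustration of the coordinated scheme, the following C sketch shows a blocking coordinated checkpoint over MPI. It is our own minimal example, not a protocol from the literature above; the State type, the file naming and the placement of the call are illustrative assumptions.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical application state; the contents are illustrative. */
    typedef struct {
        double data[1024];
        int iteration;
    } State;

    /* Blocking coordinated checkpoint: all ranks synchronize so that no
     * application message is in flight, each rank writes its state to
     * stable storage, and the checkpoint is considered committed only
     * after every rank has finished writing. */
    void coordinated_checkpoint(const State *s, int rank)
    {
        MPI_Barrier(MPI_COMM_WORLD);   /* quiesce communication */

        char path[64];
        snprintf(path, sizeof path, "ckpt_rank%d.bin", rank);
        FILE *f = fopen(path, "wb");
        if (f != NULL) {
            fwrite(s, sizeof *s, 1, f);
            fclose(f);
        }

        MPI_Barrier(MPI_COMM_WORLD);   /* commit: everyone has written */
    }

The second barrier is what turns the set of per-rank files into a single consistent global state; it is also the source of the latency and burst-I/O costs discussed above.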

3 Fault-tolerance techniques for Shared Memory

In the shared memory programming model, the tasks of an application have direct access to each other's memories, allowing them to interact without explicitly sending and receiving messages. However, since multiple threads simultaneously access the shared address space, additional primitives are needed for thread coordination, such as locks, semaphores, critical sections, critical regions and monitors. Depending on the shared memory API chosen, all or a subset of these primitives may be available. Shared memory APIs also provide additional functionality, such as the data distribution primitives in High Performance Fortran [5] and the worksharing constructs in OpenMP [6] and UPC [7]. OpenMP, UPC, Co-Array Fortran [8] and Global Arrays [9] are some of the shared memory APIs commonly used in high performance computing. Fault-tolerance techniques for shared memory applications differ from those for message passing ones because of the underlying memory model. In addition to the common shared address space, each thread may have a memory that only it can access. It is therefore important to understand the memory model used by an API.

3.1 Memory Models for Shared Memory

A memory model is a description of how read and write operations executed by the same and different threads behave relative to each other. The sequential memory model is as follows: a read of a given variable must return the value written by the most recent write to this variable. The simplest shared memory model is sequential consistency [10]. The intuition behind sequential consistency is that the execution of the application must correspond to some interleaving of the operations of different threads executing on the same processor and accessing the same memory. In other words, the results of all reads performed by the application's threads must correspond to some total order on all read and write operations, such that a given read of a variable returns the value of the write to this variable that was most recent according to the total order. However, enforcing sequential consistency on a multi-processor system requires a significant amount of stalling and additional communication, and prevents the hardware and the compiler from employing optimizations such as instruction reordering. This results in reduced performance for systems that guarantee sequential consistency.

Since sequential consistency is expensive, a number of weaker memory models have been developed, such as processor consistency [11], weak ordering [12], release consistency [13], total store ordering [14], the Digital Equipment Alpha memory model [15] and the PowerPC memory model [16]. These models are designed to allow additional hardware and compiler optimizations, such as instruction reordering, the use of non-FIFO networks and having multiple processors simultaneously write to the same variable stored in their private caches.

Since we are dealing with OpenMP, it makes sense to discuss its memory model in some detail. The OpenMP memory model is most closely related to weak ordering. It separates memory accesses into data and synchronization accesses. No ordering is provided for data accesses: the shared memory implementation is allowed to reorder operations on different variables. Data writes are guaranteed to be atomic: no thread may see the partial result of a data write. Synchronization accesses are provided with much stronger guarantees. First, any interactions between synchronization accesses are guaranteed to be sequentially consistent. Second, all data accesses that precede a given synchronization access in the program's source code must complete before this access may begin. Similarly, data accesses that follow a given synchronization access in the program's source code may not begin until this access has completed. The intuition behind weak ordering is that most of each thread's memory accesses will be to variables that no other thread is touching at the same time. In this case it can use the highly optimized data accesses, which in the absence of multithreaded data races behave as if they were sequentially consistent. However, threads occasionally need to synchronize their executions in order to ensure that data accesses do not race with each other. Synchronization accesses can be used for this task. Although their guarantee of sequential consistency makes synchronization accesses slower, they do not occur frequently in an OpenMP program. We have analysed the NAS and SPEC OpenMP benchmarks for the number of synchronisation points in them. The results we obtained are tabulated below. [TODO:include table]
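As a small illustration of the data/synchronization distinction (our own example, not taken from the OpenMP specification), the critical construct below acts as a synchronization access: the flushes implied on its entry and exit are what make the plain write to data visible to the reading thread.

    #include <omp.h>
    #include <stdio.h>

    int data = 0;    /* plain data access: may be reordered or cached */
    int ready = 0;   /* accessed only inside critical sections */

    int main(void)
    {
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {
                data = 42;                /* data access */
                #pragma omp critical      /* synchronization access */
                { ready = 1; }
            } else {
                int seen = 0;
                while (!seen) {
                    #pragma omp critical  /* ordered w.r.t. the writer */
                    { seen = ready; }
                }
                /* The implied flushes order the write to data before
                 * the write to ready, so 42 is guaranteed here. */
                printf("data = %d\n", data);
            }
        }
        return 0;
    }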

4 Protocols

There has been a variety of work on developing rollback restart solutions for shared memory. The space of possible solutions is strongly influenced by the way shared memory is commonly implemented. Applications are written to work with some shared memory API and memory model, where one API may support one or more memory models while each memory model may be supported by one or more APIs (while a memory model defines the behavior of reads and writes, a full API provides additional functionality such as synchronization functions and data parallel primitives). Each API is implemented by one or more shared memory implementations in hardware, software or a combination of both. Each implementation may use one or more consistency protocols, and a given protocol may be used by one or more implementations. Shared memory implementations run on one or more network fabrics that at a low level provide a message passing model of communication, which the implementation must use to provide the application with the abstraction of shared memory.

Since all shared memory implementations are ultimately based on message passing, it is theoretically possible to use the techniques described in the previous sections for message passing applications to provide rollback restart for any shared memory implementation. However, in practice this can be unnecessarily expensive, since shared memory implementations send many small messages, many of which are not relevant for restart. Moreover, they perform many non-deterministic actions (every read may be a non-deterministic event), further complicating this approach and reducing its efficiency. As such, typical solutions must tailor themselves to the details of a particular implementation, memory model or consistency protocol. The fact that shared memory protocols have the same types of interprocessor interactions as their message passing counterparts suggests that we can organize them into a similar taxonomy. This presentation divides shared memory rollback restart protocols into the following types: coordinated, uncoordinated, quasi-synchronous and message logging. Protocols in each of these families closely resemble their message passing counterparts but focus more tightly on the internal details of their target protocols.

1. Coordinated Checkpointing: SafetyNet [17] is a framework for adding checkpointing to hardware multiprocessors for the purpose of surviving transient faults. It focuses on the sequential consistency memory model and augments existing protocols with checkpointing functionality. SafetyNet relies on the availability of some sort of logical time maintained by the system's processors, such that no message can arrive at an earlier logical time than the time when it was sent. All processors checkpoint themselves at the same logical time, once every 5,000-1,000,000 processor cycles. A processor's registers are checkpointed directly to a nearby checkpoint buffer, while its cache and memory are saved lazily in a copy-on-write fashion. In order to make sure that the ownership of each memory block is known at the time of a checkpoint and is not in some intermediate state, a given checkpoint is not committed until all processors have determined that all their memory block ownership requests issued before the checkpoint have been completed. SafetyNet shows very low overheads because of its relatively rare (on a processor time scale) checkpoints and the fact that it logs only 0.1% of all cache accesses. However, it is also quite limited in the severity of the faults that it can deal with. In particular, it cannot recover from processor failures or cache corruptions due to soft errors.

Revive [18] is another example of a hardware coordinated checkpointing protocol, except that it uses blocking coordination instead of the non-blocking protocol used by SafetyNet. Revive targets off-the-shelf processors with caches and modifies the networking components of common hardware shared memory implementations, including network cards, directory controllers and memory controllers. To record a checkpoint, all processors synchronize using a 2-phase commit protocol and flush their pipelines and caches, causing the main memories to contain a consistent state of the shared address space. Since main memories are still vulnerable to failure, their state is encoded using a single-bit parity error correcting code, with extra memory modules dedicated to holding the parity bits of the other modules. The memory controllers are modified to ensure that the parity module is always up to date with any changes to the other memory modules. Whenever a memory location is written to for the first time after a checkpoint, its original value is copied out into a special log in a copy-on-write fashion before the write is allowed to proceed. This ensures that the state of the memory at the time of the checkpoint is not lost. On restart, the state of the shared address space is recreated from the current memory state and the overwritten values stored in the log. The result is that Revive has a time overhead of 6% and a memory overhead of 14% while taking 10 checkpoints per second.

Coordinated checkpointing is extended to software shared memory by [19], which presents a protocol closely related to sync-and-stop, where processors checkpoint themselves inside application barriers. Because at barrier time all shared memory state is consistent and there is no in-flight communication, it is possible to simply save the state of all pages and all directories in a checkpoint and directly restore them on restart. However, the fact that the application's own barriers are used for synchronization means that no additional coordination cost is incurred.
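The copy-on-write logging at the heart of Revive can be mimicked in software. The sketch below is our own illustrative reconstruction in C (word granularity, fixed-size arrays), not Revive's hardware design:

    #include <stdbool.h>
    #include <string.h>

    #define MEM_WORDS 1024

    static long mem[MEM_WORDS];        /* the "main memory" */
    static long undo_val[MEM_WORDS];   /* undo log: saved old values */
    static int  undo_idx[MEM_WORDS];   /* undo log: their locations */
    static int  undo_len = 0;
    static bool dirty[MEM_WORDS];      /* written since last checkpoint? */

    void checkpoint(void)              /* commit: discard the undo log */
    {
        undo_len = 0;
        memset(dirty, 0, sizeof dirty);
    }

    void logged_write(int i, long v)   /* write with copy-on-write log */
    {
        if (!dirty[i]) {               /* first write since checkpoint:
                                        * save the old value first */
            undo_idx[undo_len] = i;
            undo_val[undo_len] = mem[i];
            undo_len++;
            dirty[i] = true;
        }
        mem[i] = v;
    }

    void rollback(void)                /* restore the checkpointed image */
    {
        while (undo_len > 0) {
            undo_len--;
            mem[undo_idx[undo_len]] = undo_val[undo_len];
        }
        memset(dirty, 0, sizeof dirty);
    }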

[20] presents a software-based checkpointer for SMP systems. In particular, their approach works at the user level, above PThreads. As such, they do not have direct control over the implementation of the API and, in particular, over the operating system's kernel threads. Their checkpointing protocol is initiated by a single master thread that sends a signal to all other threads. Since PThreads does not define what happens when a thread that is blocked on a mutex receives a signal, it is possible for such blocked threads to never be informed that a checkpoint has begun. As such, the master thread, along with any threads that have begun their checkpoint, release all their mutexes so that blocked threads may acquire them and be informed that a checkpoint has begun. In order to ensure that threads that have been released in this fashion do not continue executing application code, potentially corrupting application state, the master thread sets a prevent-locks flag to inform forcefully released threads that they should not continue executing. When all threads have been informed of the checkpoint, they block and the master thread saves the state of the process to disk. It then informs all the other threads, which reacquire their released mutexes and resume executing application code, which may involve re-blocking on mutexes from which they were forcefully released during the checkpoint. This solution has the advantage of high portability, since it works above the PThreads implementation rather than inside of it. The disadvantage is that at checkpoint time it saves all application state without differentiating whether it belongs to the application or to the PThreads implementation. This makes restart difficult, since the PThreads implementation may have pre-restart information that may not remain valid on restart.

BLCR [21] takes a different approach toward checkpointing PThreads applications by embedding this functionality inside the Linux kernel. While this approach sacrifices portability, kernel-level solutions have much more power over the PThreads implementation than their user-level counterparts. In BLCR's checkpointing protocol the master thread sends a signal to each application thread. When this signal is delivered to a thread, the signal handler makes a system call, informs the master thread and blocks the thread at a barrier. When the master thread has been informed that all threads are waiting for it, it releases them from the barrier and all threads participate in saving the application's state. A checkpoint is terminated via another barrier, after which point all threads return from the system call and resume executing application code. The relative simplicity of working at the kernel level has helped to make BLCR more functional but has limited its portability.
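The signal-and-barrier pattern common to [20] and BLCR can be sketched at user level with POSIX threads. This is our own simplified illustration (spin-waiting, no state actually saved), not the code of either system; a production checkpointer must worry about async-signal-safety, which is exactly why [20] releases mutexes and BLCR moves into the kernel.

    #include <pthread.h>
    #include <signal.h>
    #include <semaphore.h>

    static sem_t arrived;                         /* counts parked threads */
    static volatile sig_atomic_t resume_flag = 0;

    /* Delivered to each worker: announce arrival, spin until the master
     * has written the checkpoint, then announce resumption. sem_post is
     * async-signal-safe. */
    static void ckpt_handler(int sig)
    {
        (void)sig;
        sem_post(&arrived);
        while (!resume_flag)
            ;                                     /* parked */
        sem_post(&arrived);
    }

    static void install_handler(void)
    {
        struct sigaction sa;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sa.sa_handler = ckpt_handler;
        sigaction(SIGUSR1, &sa, NULL);
        sem_init(&arrived, 0, 0);
    }

    /* Master-side protocol: stop every worker, save state, release. */
    static void take_checkpoint(pthread_t *workers, int n)
    {
        resume_flag = 0;
        for (int i = 0; i < n; i++)
            pthread_kill(workers[i], SIGUSR1);
        for (int i = 0; i < n; i++)
            sem_wait(&arrived);                   /* all workers stopped */

        /* ... save the process state to stable storage here ... */

        resume_flag = 1;                          /* release the workers */
        for (int i = 0; i < n; i++)
            sem_wait(&arrived);                   /* all workers resumed */
    }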

2. Uncoordinated Checkpointing: In uncoordinated checkpointing individual threads are allowed to independently checkpoint their own state without any additional synchronization or communication between them. While this minimizes the regular-execution time overhead of checkpointing, on restart it can cause threads to roll back far into the past via the domino effect, causing both a time overhead and a space overhead for additional checkpoint storage. The extent of the domino effect depends directly on the frequency of communication in a given application. [TODO:Add details about rebound here]

3. Quasi-Synchronous Checkpointing: The approach in [22] focuses on providing fault tolerance for distributed memory implementations of shared memory. The primary feature of such implementations is that each processor has its own memory (not just a cache) that is not accessible by other processors. In particular, it extends invalidation-based protocols to support checkpointing. This protocol uses the disk to hold the contents of the entire application address space, with processors holding copies of some of these pages in their memories. Multiple processors may have a read-only copy of a page in their memory, but if a processor wishes to write to this page it must first invalidate all the read-only copies by sending messages to their host processors. If later on some processor wants to read from or write to this page, it must ask for the modified copy to be sent over and, if it is trying to read it, change the page's status to read-only. In this protocol, checkpointing is achieved in the following manner:

• A processor saves its state to disk.

• It flushes to disk all the dirty pages it has modified since the last checkpoint.

On restart, all of the memory pages of a processor are initially invalid. It can then reacquire any pages it needs by loading them from other processors or the disk whenever the application tries to read from or write to them, via the same shared memory mechanisms that are used for normal read and write requests.

One complication is that allowing processors to checkpoint independently can cause cascading rollbacks. These rollbacks may force the system to roll back to the start of the application's execution. In order to avoid this, they use a quasi-synchronous method where a process is forced to take a checkpoint when another process asks it for a page that it has modified since its last checkpoint.

4. Message Logging: The protocol presented in [23] is based on Home-based Lazy Release Consistency (HLRC) and works as follows. Processors checkpoint themselves independently. Each processor's checkpoint contains:

• the processor's internal state, including any non-shared memory;

• the data of any pages that it is home to (every page is assigned to a home processor, which maintains the most recent copy of the page).

Since these checkpoints are uncoordinated, they are vulnerable to cascading rollbacks. This protocol deals with this via sender-based message logging. A message logging protocol must do three things:

• record the outcomes of all non-deterministic events;

• record the data of any communication from processors that did not roll back to the one that did (so that it can be re-sent on restart);

• suppress any communication from the rolled-back processor to the others until it has finished restarting.

Non-determinism appears in HLRC in the behavior of locks: after a given processor has released a lock, the choice of the processor that will acquire it next is non-deterministic. As such, whenever a processor acquires a lock, both it and the lock's releaser record this event in logs that they keep in their volatile memories. The only forms of communication in HLRC are processors sending page modifications during release operations and home nodes sending page copies to nodes that are performing acquire operations. Both of these communications are logged at their senders. Writer processors log the modifications that they send to home nodes, and home nodes log old versions of their pages.
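For instance, the lock-handoff determinants can be kept in a simple in-memory log. The following C fragment is our own illustration of what the releaser side of [23]'s logging might record; the structure and names are assumptions:

    #define MAX_EVENTS 4096

    /* One determinant per lock handoff: which processor acquired lock
     * lock_id after this processor released it, and in what order. */
    typedef struct {
        int  lock_id;
        int  acquirer;   /* rank of the next owner (non-deterministic) */
        long seq;        /* position in this processor's release order */
    } lock_event_t;

    static lock_event_t lock_log[MAX_EVENTS];  /* kept in volatile memory */
    static int log_len = 0;

    /* Called by the releaser when the next owner becomes known. On a
     * rollback of `acquirer`, these entries are replayed to it so that
     * it re-acquires its locks in the original order. */
    void log_lock_handoff(int lock_id, int acquirer, long seq)
    {
        if (log_len < MAX_EVENTS) {
            lock_log[log_len].lock_id  = lock_id;
            lock_log[log_len].acquirer = acquirer;
            lock_log[log_len].seq      = seq;
            log_len++;
        }
    }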

When some processor rolls back, it reloads its state and the states of all of its pages from its checkpoint. It then receives from other processors the details of when it acquired its locks, the data of any pages that it received from other processors in the course of its execution, and the updates sent to it by other processors when they modified the pages that it is home to. It uses this information to locally recompute its pre-rollback state, including the state of the pages that it was home to, without doing any additional communication. This method has overheads caused by the cost of checkpointing and the cost of logging. The protocol tries to minimise the overheads by removing outdated log entries and checkpoints via garbage collection. Applications with considerable synchronisation and load imbalance can still cause significant overheads in this protocol.

5. Some additional protocols: [24] discusses an extension of the Cache-Aided Rollback Error Recovery (CARER) technique. CARER maintains exactly one checkpoint at all times. Processor state is saved by using a set of backup registers into which the values of the regular registers are copied. Memory is checkpointed by flushing all dirty cache lines to main memory. After this has been done, main memory corresponds to its state at checkpoint time, and any writes performed by the processor are kept in dirty cache lines, which are not flushed to main memory until the next checkpoint. In the shared memory context, they propose three checkpointing protocols for coherence protocols that work in the following scenario:

• processors have their own caches;

• processors are connected by a bus;

• they are connected to a single common main memory.

The first protocol has all processors checkpoint at the same time, and when one processor rolls back, so does every other (i.e., it is a coordinated protocol). The second protocol tracks the read/write interactions between processors; when a processor takes a checkpoint, it only forces other processors to checkpoint if it has participated in interactions since its last checkpoint, and only processors that have interacted with others need to checkpoint (i.e., it is a quasi-synchronous protocol). The same logic is used to decide which processors will roll back. The third protocol forces processors to checkpoint whenever they participate in interactions with other processors (again, a quasi-synchronous protocol). The advantage is that when a single processor wants to restart, no other processor is forced to restart as well.

Transactional memory is a variant of shared memory in which certain pieces of code are identified as transactions. Code in these transactions can access a shared address space, and the underlying runtime system is responsible for guaranteeing the illusion that these transactions are executed in some serial order. Although this is trivial to guarantee on sequential systems, on parallel systems it is typically done by optimistically running multiple transactions at the same time and dynamically detecting whether one transaction writes to a variable that is accessed by a concurrently executing transaction. If a conflict is detected where two transactions cannot be sequentially ordered relative to each other, one is rolled back and retried. This method of rollback-restart can be used for fault tolerance in the transactional memory model [25,105].
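The optimistic execute/detect/retry cycle can be illustrated with a toy version counter in C11 (a single shared variable guarded seqlock-style). This is our own illustration of the idea, not a real transactional memory API:

    #include <stdatomic.h>

    static atomic_long version = 0;  /* even: idle, odd: commit in flight */
    static long shared_value = 0;    /* the "transactional" datum */

    /* Toy transaction: speculatively compute a new value, then commit
     * only if no other transaction committed since we started; on a
     * conflict the speculative result is discarded (rollback) and the
     * transaction retries. */
    void transact_add(long delta)
    {
        for (;;) {
            long v0 = atomic_load(&version);
            if (v0 & 1)
                continue;                         /* a commit is in flight */

            long speculative = shared_value + delta;

            long expected = v0;
            if (atomic_compare_exchange_strong(&version, &expected, v0 + 1)) {
                shared_value = speculative;       /* exclusive writer */
                atomic_store(&version, v0 + 2);   /* commit */
                return;
            }
            /* conflict detected: roll back (drop speculative) and retry */
        }
    }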

5 Application Level Checkpointing

5.1 Need for Application Level Checkpointing

Despite the great variety of rollback restart protocols for shared memory systems, they are almost uniformly consistent on one point: no protocol applies to more than a small fraction of the shared memory solution space. As shown in the figure below [TODO:add figure], shared memory systems are created via a layering of APIs, memory models, shared memory implementations, consistency protocols and network interfaces. Each protocol focuses on a fraction of this space: a particular memory model, a type of network interface or a type of consistency protocol. As such, no protocol can hope to provide rollback restart functionality for even the majority of applications. One option would be to take a high-performance implementation of shared memory that works on a wide variety of platforms and supports a variety of shared memory APIs, and augment it with rollback restart functionality. But the fact is that there is no one shared memory implementation that both works on and offers high performance for the majority of shared memory systems and the majority of shared memory applications. Another option would be to create an application-level rollback restart protocol that would work with any shared memory implementation, consistency protocol and memory model. Such a protocol could then be applied to any shared memory API to create a full rollback restart solution for shared memory applications. As an application-level approach, it would require one round of effort to apply this protocol to each shared memory API of interest. This is in contrast to lower-level solutions, which would require a round of implementation effort for each shared memory implementation, consistency protocol and/or memory model, which are far more numerous than shared memory APIs.

We studied the application-level checkpointing protocol proposed by Pingali et al. [26] in detail, along with its implementation for OpenMP. The next section discusses this protocol. We also used BLCR to measure the overheads caused by user-level checkpointing in HPC applications. The benchmarks and the results obtained are shown in section X. Based on these results we propose a suitable checkpointing strategy for these HPC benchmarks in section Y.

5.2 Application Level Checkpointing for OpenMP

In this method, the checkpointer sits as a thin layer between the application and the underlying shared memory system. It intercepts all the function calls that the application may issue, for example thread creation and resource acquisition. The job of the checkpointing layer is to coordinate the state-saving activities of all threads such that the checkpoint saved on stable storage can be used to restart the application when needed. It is assumed that there exists some way of recording the state of the application. In particular, the authors assume there is a save-state function which, when called by all the threads, is guaranteed to save the private and shared state of all threads to stable storage. The basic protocol used is the following:

    Global_Barrier
    Save_State
    Global_Barrier

The two barriers ensure that the application is not running while the checkpoint is being recorded. Furthermore, the first barrier ensures that any writes to shared data that were initiated before the checkpoint will have completed by the time the first barrier returns. This makes the protocol applicable both to sequential consistency and to any relaxed memory model, since in virtually any memory model barriers force memory to be consistent at that point in time.
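Rendered in OpenMP, the protocol might look as follows. This is our own minimal sketch: the save_state function and the per-thread file naming are illustrative assumptions, and the real checkpointing layer does considerably more.

    #include <omp.h>
    #include <stdio.h>

    /* Illustrative stand-in for Save_State: each thread writes its
     * share of the application state to stable storage. */
    static void save_state(int tid)
    {
        char path[64];
        snprintf(path, sizeof path, "ckpt_thread%d.bin", tid);
        FILE *f = fopen(path, "wb");
        /* ... write this thread's private state and its share of the
         * shared state ... */
        if (f != NULL)
            fclose(f);
    }

    /* Must be called by every thread from within the parallel region. */
    void checkpoint(void)
    {
        #pragma omp barrier          /* Global_Barrier: pre-checkpoint */
        save_state(omp_get_thread_num());
        #pragma omp barrier          /* Global_Barrier: post-checkpoint */
    }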

Integrating this protocol into an application presents certain difficulties, since the program uses synchronisation to coordinate its own activities; naively inserting checkpoints can lead to deadlocks which did not exist earlier. We will look at these issues in detail in the rest of this section. This discussion will bring to light some of the methods for application-level checkpointing in shared memory programs. Suppose the shared memory API allows the application to acquire and release mutually exclusive resources. A problem occurs when one thread tries to checkpoint while holding a resource that another thread is waiting to acquire: this can lead to a deadlock. Such a deadlock can be removed by forcing the thread waiting for the resource to participate in the checkpoint and to resume acquiring the resource after the checkpoint has completed. The approach can be summarised as follows:

• Before the checkpoint begins, ensure that all threads are awake and not blocked on resources, by releasing all the resources.

• Before the checkpoint completes, each thread reacquires its released resources.

• Immediately after the checkpoint completes, put each thread that was forcefully woken back into a blocked state by having it try to reacquire the resource it was blocked on.

This strategy can be extended to all the synchronisation constructs of the shared memory API; a sketch of a checkpoint-aware lock wrapper appears at the end of this section.

1. Locks and critical sections

• Before the checkpoint begins, the thread holding a resource must release it.

• Before the checkpoint completes, each owner thread must reacquire its resource.

• After the checkpoint completes, each thread that was forced to wake up should try to reacquire the resource it was blocked on.

2. Barriers

• Before the checkpoint begins, all threads that are not waiting on a barrier must execute a barrier in order to release the threads blocked at that barrier.

• After the checkpoint completes, each forcefully awoken thread goes back to being blocked at the barrier.

3. Semaphores

• Before the checkpoint begins, the semaphore counter must be decremented n times, where n > 0 and n ≤ the value of the semaphore counter at the time of the checkpoint.

• Before the checkpoint completes, the semaphore's counter is n less than it was when the checkpoint began, so it must be signalled n times to restore the counter to its pre-checkpoint state.

• After the checkpoint completes, each forcefully awoken thread tries to wait on the semaphore again.

4. Condition variables

• Before the checkpoint begins, each condition variable must be repeatedly notified for as long as it is not known that all threads have begun their checkpoints.

• After the checkpoint completes, each forcefully awoken thread tries to wait on the condition variable again.

While applying the above protocol to OpenMP, we need to be aware of its additional constructs, such as parallel loops, where multiple threads may execute the iterations of a given for loop, and ordered regions, which work like critical sections except that they must be executed in order. Pingali et al. adapted the protocol to support all features of OpenMP except nested parallelism. The protocol was implemented as part of the Cornell Checkpointing Compiler [27] system. We found this protocol attractive due to its portability and flexibility, since the user can decide the location of potential checkpoints.
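As promised above, here is a sketch of what a checkpoint-aware lock wrapper could look like. It shows only the ownership bookkeeping that lets the layer release and reacquire locks around a checkpoint; the names and structure are our own assumptions, not the C3 implementation:

    #include <omp.h>

    /* A lock wrapper that records its owner, so the checkpoint layer
     * knows which locks to release before a checkpoint and reacquire
     * afterwards. */
    typedef struct {
        omp_lock_t lock;
        int owner;               /* thread id of the holder, -1 if free */
    } ckpt_lock_t;

    void ckpt_lock_init(ckpt_lock_t *l)
    {
        omp_init_lock(&l->lock);
        l->owner = -1;
    }

    void ckpt_lock_acquire(ckpt_lock_t *l)
    {
        omp_set_lock(&l->lock);
        l->owner = omp_get_thread_num();   /* bookkeeping for the layer */
    }

    void ckpt_lock_release(ckpt_lock_t *l)
    {
        l->owner = -1;
        omp_unset_lock(&l->lock);
    }

    /* At checkpoint time the layer has each owner call
     * ckpt_lock_release on the locks it holds (waking threads blocked
     * in ckpt_lock_acquire so they can join the protocol) and
     * ckpt_lock_acquire again before the checkpoint completes. */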

5.3 Architecture

The architecture of the OpenMP checkpointer is given below. [TODO:include diagram here]

The OpenMP API is defined as a set of compiler #pragma directives and several functions that describe how a given program should be parallelized. As such, the only way to insert the protocol layer between the application and the OpenMP implementation is to perform compiler transformations on the source code that convert the original OpenMP directives into modified versions. The C3 pre-compiler reads the source files of the OpenMP application, written in C, and instruments them to execute the checkpoint coordination protocol and to perform application-level saving of shared and private state. The C3 runtime is then placed as a layer between this modified executable and the native OpenMP library, performing additional checkpoint coordination and state-saving functionality. This structure makes checkpointing a property of the application rather than of the underlying system, making it available on any platform on which the application is executed. The only modification programmers make to source files is to insert calls to the function potential checkpoint at program points where checkpoints may be taken. These points should ideally have minimal live state and need to be executed at least as often as the desired checkpoint interval. In practice, placing potential checkpoint locations at the top of the application's main computation loop usually satisfies these conditions. Checkpoints need not be taken at every potential checkpoint call; instead, the choice can be based on a timer or an adaptive mechanism. The output of the pre-compiler is compiled with the native compiler on the hardware platform and linked with a library that implements the above protocol for generating consistent snapshots of the state of the computation. This layer sits between the application and the OpenMP runtime layer, and intercepts all calls from the instrumented application program to the OpenMP library.
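In code, the instrumentation contract might look like the following; the names potential_checkpoint, max_iter and update are our own placeholders for illustration:

    /* The programmer marks the top of the main computation loop, a
     * point with minimal live state that executes at least once per
     * desired checkpoint interval; the runtime decides (e.g., by
     * timer) whether a checkpoint is actually taken at a given call. */
    void potential_checkpoint(void);   /* provided by the runtime */

    void compute(double *a, int n, int max_iter)
    {
        for (int iter = 0; iter < max_iter; iter++) {
            potential_checkpoint();    /* checkpoints may happen here */

            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                a[i] = a[i] * 0.5 + 1.0;   /* illustrative work */
        }
    }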

5.4 What needs to be saved?

OpenMP processes have the following types of state:

• Application state - Includes application variables (global and local), heap objects and the stack activation frames that exist at checkpoint time.

• Hidden state - Any state inside the OpenMP runtime system that must be recreated on restart. This includes the locations of privatised and reduction variables, the binding between stack regions and thread numbers, and the scheduling of worksharing constructs.

• Synchronisation state - The state of any synchronisation objects each thread held or was blocked on at checkpoint time. These include barriers, locks, critical sections and ordered regions in worksharing constructs.

Depending on the degree of access a checkpointing protocol has to the various portions of the application state, it may have to use all or some of the following methods to restore the state during restart.

• Restore - State that can be directly manipulated by the application and can therefore be directly saved at checkpoint time and restored on restart. Example: global and local variables.

• Replay - State that cannot be directly manipulated by the application but can be recreated using a sequence of deterministic operations. On restart it can be regenerated by replaying these operations. Example: OpenMP locks, since their state is inaccessible to the application (the implementation details of the omp_lock_t structure are unknown) but can be recreated using lock routines such as omp_init_lock and omp_set_lock.

• Reimplement - State that cannot be recreated by replaying operations. In this case the operations that create and maintain this state need to be reimplemented so that the state can be saved and restored. Example: heap objects in C, since on restart it is not possible to force malloc to place these objects at their original locations.

• Restrict - State that cannot be recreated even by reimplementing the operations. This type of state cannot be supported, and the application must be restricted from using it, or allowed to use it only in a restricted manner. Example: application-implemented synchronization can cause the checkpointing protocol to deadlock, so its use must be restricted.

The authors of C3 used a combination of these techniques to instrument and transform the original OpenMP application to achieve rollback and restart. The C3 system uses Restore to recreate the values of variables and heap objects, Replay to recreate activation frames and Reimplement to recreate the heap. On restart, it is necessary that a thread be reassigned the same stack region that it had in the original execution. Since OpenMP provides no guarantees about the locations of stacks, the authors relaxed this condition and allowed stacks on restart to move relative to the original stacks by a small amount. The stack contents are recreated via replay, by re-issuing the function calls and re-entering the OpenMP directives that were on the stack at checkpoint time. Privatised and reduction variables are allocated by reimplementing their address assignment. Worksharing constructs are handled by a mix of Replay and Reimplement. On restart, each thread must resume executing the same work unit it had when the checkpoint was taken, and it must also make sure that the remaining unexecuted work units are scheduled to run on some threads. This is done by rewriting the OpenMP worksharing construct. The OpenMP synchronisation constructs also needed rewriting in order to support the explicit acquisition and release of resources on checkpoint.
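As a concrete illustration of Replay (our own sketch, not C3 code): the internal state of an OpenMP lock cannot be copied, but if the checkpoint records which thread held it, the lock can be rebuilt on restart by replaying its creation and acquisition:

    #include <omp.h>

    /* What the checkpoint records about a lock: not its internal
     * state (omp_lock_t is opaque) but the operations needed to
     * recreate it. */
    typedef struct {
        omp_lock_t lock;   /* reinitialized on restart, not restored */
        int held_by;       /* thread id recorded at checkpoint time,
                            * -1 if the lock was free */
    } saved_lock_t;

    /* Called by every thread during restart. */
    void replay_lock(saved_lock_t *s)
    {
        #pragma omp single
        omp_init_lock(&s->lock);        /* replay the creation once */

        /* implicit barrier at the end of single: lock now exists */

        if (s->held_by == omp_get_thread_num())
            omp_set_lock(&s->lock);     /* replay the acquisition */
    }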

6 Measurements and Analysis

Although we wanted to measure the cost of checkpointing on the C3 system, it was not available to us, so we used BLCR instead to measure the overheads of checkpointing HPC benchmarks. BLCR is a system-level checkpoint/restart system developed mainly for HPC applications. The checkpointing overheads measured with BLCR will not match those produced by C3, but we wanted to study the nature of the benchmarks in order to propose a better solution, and the overheads produced by C3 will be at least as high as those generated by BLCR. To simulate the user-initiated checkpointing supported by C3, we used the callback utility of BLCR to initiate checkpoints from within the OpenMP application. Before we proceed to the results, we give a short introduction to BLCR. [TODO:Add note on BLCR]

The experiments were conducted on [TODO:add details and architecture of the cluster used]

The overheads we measured were the following: [TODO:Add list of overheads measured] [TODO:Add graphs]
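Registering a checkpoint callback from within the application might look roughly like the following. This sketch is written from memory of the libcr interface (cr_init, cr_register_callback, cr_checkpoint); treat the exact signatures and flag names as assumptions to be checked against the BLCR headers.

    #include <libcr.h>

    /* Runs at checkpoint time in the registered context; the return
     * value of cr_checkpoint distinguishes continuation from restart
     * (assumed semantics). */
    static int ckpt_callback(void *arg)
    {
        (void)arg;
        /* ... application-specific pre-checkpoint work ... */
        int rc = cr_checkpoint(0);    /* the checkpoint is taken here */
        if (rc > 0) {
            /* ... we are restarting from the saved checkpoint ... */
        }
        /* ... post-checkpoint work on normal continuation ... */
        return 0;
    }

    void register_blcr_callback(void)
    {
        cr_init();
        cr_register_callback(ckpt_callback, NULL, CR_THREAD_CONTEXT);
    }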

7 Benchmarks

7.1 NAS Parallel Benchmarks

The NAS Parallel Benchmarks (NPB) are derived from CFD codes. They are designed to compare the performance of parallel computers and are widely recognized as a standard indicator of computer performance. NPB consists of five kernels and three simulated CFD applications derived from important classes of aerophysics applications. The five kernels mimic the computational cores of five numerical methods used by CFD applications. We analyzed the OpenMP implementations of the following applications:

• BT is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional (3-D) compressible Navier-Stokes equations. The finite-difference solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal with 5x5 blocks and are solved sequentially along each dimension.

• SP is a simulated CFD application that has a structure similar to BT. The finite-difference solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions. The resulting system has scalar pentadiagonal bands of linear equations that are solved sequentially along each dimension.

• LU is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system resulting from finite-difference discretization of the Navier-Stokes equations in 3-D, by splitting it into block lower and upper triangular systems.

• FT contains the computational kernel of a 3-D fast Fourier Transform (FFT)-based spectral method. FT performs three one-dimensional (1-D) FFTs, one for each dimension.

• MG uses a V-cycle multigrid method to compute the solution of the 3-D scalar Poisson equation. The algorithm works continuously on a set of grids ranging between coarse and fine. It tests both short- and long-distance data movement.

• CG uses a Conjugate Gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications, using a matrix with randomly generated locations of entries.

• EP is an Embarrassingly Parallel benchmark. It generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish the reference point for the peak performance of a given platform.

• IS is a large integer sort operation testing both integer computation speed and interprocessor communication. This kernel stresses the integer performance of the underlying node. More information about these benchmarks can be found in [1].

7.2 SPEC OMP Benchmarks

The SPEC OMP benchmark suite was developed by the SPEC High Performance Group for performance testing of shared memory systems. It uses OpenMP versions of SPEC CPU2000 benchmarks and candidates. The suite has OMPM2001 for medium and OMPL2001 for large compute-intensive applications. OMPM2001 has 9 applications written in Fortran and 2 written in C. The applications are listed below:

• 310.wupwise_m - Quantum chromodynamics

• 312.swim_m - Shallow water modelling

• 314.mgrid_m - Multi-grid solver in a 3-D potential field

• 316.applu_m - Parabolic/elliptic partial differential equations

• 318.galgel_m - Analysis of oscillatory instability (fluid dynamics)

• 320.equake_m - Finite element simulation for earthquake modelling

• 324.apsi_m - Distribution of various pollutants

• 326.gafort_m - Genetic algorithm

• 328.fma3d_m - Finite element crash simulation

• 330.art_m - Neural network simulation

• 332.ammp_m - Computational chemistry

8 Pattern analysis

By investigating the code manually, we extracted possible memory access patterns from all 8 applications and kernels. We measured the number of synchronisation and worksharing constructs in the applications.

1. Loop parallelization. The normal loop parallelizations, for and do, where each thread operates on a subset of entries from an array. Here the program is non-deterministic within the loop, since the threads may execute in any order. However, the parallel for and do constructs in OpenMP have an implicit barrier at their exit point. This barrier ensures that the results written to shared data are consistent across different iterations of the loop. We intend to explore the behaviour of this construct further. A typical instance (with an illustrative loop body) is:

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        /* each thread updates a disjoint subset of the array */
        a[i] = compute(i);
    }
