Abstract
Shared memory multiprocessors require cache coherence to keep cached values up to date while performing operations. Snooping and directory-based protocols are the two well-known approaches to cache coherence, but both have problems. The snooping protocol is not scalable and is only suitable for systems of 2 to 8 processors, whereas the directory-based protocol incurs memory overhead when a particular block has many sharers. We propose a novel cache coherence protocol that exploits private blocks of memory: the coherence protocol is invoked only for shared data blocks. This reduces network and storage overhead without compromising scalability.
Introduction
A shared memory multiprocessor has multiple processors, each with its own cache, and a global memory. Memory is connected to the processors and a global address space is maintained. When a block of data is cached by only one processor, it is said to be private. The block is called shared if more than one processor caches it. In the latter case, a read by any processor must return the value most recently written; maintaining this guarantee is called cache coherence. The snooping protocol and the directory-based protocol are the two de facto standards for cache coherence. The snooping protocol performs well in a bus-based shared memory model, but it is not scalable: it is only suitable for small systems with no more than 8 processors. The increasing demand for processing power requires systems to be scalable, so the bus in an SMP is replaced with a scalable network. The snooping protocol performs poorly in network-based SMPs: whenever a processor performs a write, a snoop is sent to all caches to invalidate or update the shared block, which can cause significant communication overhead when there are many sharers. This problem is resolved in the directory-based protocol, where every cache looks up a directory of blocks with status bits. The status bits keep track of all sharers and the corresponding state of each block, and a snoop is sent only when a shared block is required for a read. This overcomes the drawback of the snooping protocol. However, the directory-based protocol has its own disadvantages:
i. Directory storage
Each block in the L2 cache maintains a directory of sharers. A bit is reserved in the block for each sharer, so the storage grows linearly with the number of sharers.
ii. Indirection
Multiple messages are exchanged over the network before a coherence operation is deemed complete. For example, when a processor performs a write, the directory first sends invalidations to the sharers, and the write is performed only after acknowledgements are received from all of them.
iii. Complexity
Directory-based coherence protocols are notoriously error-prone, and entire research communities are tackling their efficient design with formal verification.
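To make the directory storage disadvantage concrete, the sketch below estimates the per-block cost of a full bit-vector directory, where one presence bit is kept per core. The 64-byte block size and the sample core counts are illustrative assumptions, not parameters from this work.

```python
def directory_overhead(num_cores: int, block_bytes: int = 64) -> float:
    """Fraction of cache data capacity spent on a full bit-vector directory.

    A full-map directory keeps one presence bit per core for every block,
    so the per-block cost grows linearly with the number of potential sharers.
    """
    directory_bits = num_cores          # one presence bit per core
    data_bits = block_bytes * 8         # the cached data itself
    return directory_bits / data_bits

# Overhead grows linearly with core count (64 B blocks assumed):
for cores in (16, 64, 256):
    print(f"{cores:4d} cores: {directory_overhead(cores):.1%}")
```

At 16 cores the directory costs about 3% of the data capacity, but at 256 cores it already consumes 50%, which is why full bit-vector directories scale poorly.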
Many of the applications running on today's multi-core machines are still single-threaded applications that do not explicitly rely on cache coherence. Further, future many-cores in servers and datacenters will likely execute multiple VMs (each possibly executing a multi-programmed workload), with no data sharing between VMs, again removing the need for cache coherence.
Moreover, some systems share data through message passing, in which case no cache coherence protocol is needed.
Figure 1 indicates that very little data is actually shared by two or more cores; on average 77.0% of all memory locations are touched by only a single processor.
Based on the above arguments, we claim that the fraction of shared memory multi-threaded execution that actually needs cache coherence is much smaller than what traditional hardware cache-coherent multiprocessors provide for.
Proposed Solution (SWEL)
The proposed solution is based on the premise that most blocks are either private to a core or read-only, and hence, do not require coherence. The basic claim is this: (i) many blocks do not need coherence and can be freely placed in L1 caches; (ii) blocks that would need coherence if placed in L1 are only placed in L2. Given this claim, it appears that the coherence protocol is all but eliminated.
This is only partially true, as some book-keeping is now required to identify which of the two categories a block falls into. If a cache block is either private or read-only, it can be safely cached in the L1 without ever needing coherence. If the block is both shared (not private) and written (not read-only), it must never exist in the L1; instead it must reside at the lowest common level of the cache hierarchy, where all cores have equal access to it without ever requiring additional coherence operations. If a block is mis-categorized as read-only or private, it must be invalidated and evicted from all L1 caches, and it then resides permanently in the lowest common level of cache.
a. States of SWEL
Every block in the L2 carries 3 state bits, and each block in the L1 carries 1. The three bits are Shared (S), Written (W) and Exclusive Level (EL). The S bit records whether the block is shared, the W bit whether it has been written, and the EL bit which cache has exclusive access to the block; EL is also the single bit kept in the L1. We now explain how a block moves through these states over its lifetime.
b. Initial Access
When a data block is first read by a processor, it is brought into the L2 and also into that processor's L1. It enters the Private Read state, with the EL bit set in the L1. When the processor evicts this block, it clears the EL bit and the block enters the L2 Only state: non-shared, non-written, and exclusive to the L2.
c. Writes
When a block is first written by a processor, it enters the Private R/W state. As part of the write-through policy, a message is sent to the L2 so that it can set the W bit in its state. This message also carries a bit indicating whether the writing processor holds the EL token for the block, so the L2 knows not to transition to a shared state. From the Private R/W state, an eviction of the block sends a message to the L2 returning the EL bit, and the block goes back to the L2 Only state. Private data spends all of its time in these three states: L2 Only, Private Read and Private R/W.
d. Shared State
If a processor reads or writes a block that is exclusive to another processor, the S bit of that block is set in the L2, and the block enters the Shared Read or Shared Write state. Once the S bit is set in the L2, it never changes unless the block is evicted from the L2.
e. Requirements for Sequential Consistency
A sequentially consistent (SC) implementation requires that each core preserve program order and that each write happen atomically in some program-interleaved order [1]. In a traditional invalidation-based directory MESI protocol, the atomic-write condition can be simplified to: "a cached value cannot be propagated to other readers until the previous write to that block has invalidated all other cached copies" [1]. This condition is trivially met by the SWEL protocol: when the above situation arises, the block is relegated to the L2 after a broadcast that invalidates all other cached copies. As a performance optimization, processors can issue reads speculatively, but must then re-issue the read at instruction commit to verify that the result matches an SC implementation. Such an optimization applies to the SWEL protocol just as it does to the baseline MESI protocol.
Figure 2 State diagram of SWEL protocol
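The transitions described above (Initial Access, Writes, Shared State) can be condensed into a small state machine over the S, W and EL bits. The sketch below is an illustrative model of those rules, not the hardware implementation; the class structure, method names and the returned state labels are assumptions for readability.

```python
class SWELBlock:
    """Per-block L2 state in the SWEL protocol: the S, W and EL bits.

    S  - block has been touched by more than one core
    W  - block has been written
    EL - id of the cache holding the block exclusively (None = L2 only)
    """
    def __init__(self):
        self.s, self.w, self.el = False, False, None

    def access(self, core: int, is_write: bool) -> str:
        # A second core touching an exclusively held block sets S forever.
        if self.el is not None and self.el != core:
            self.s = True
            self.el = None              # block is relegated to the L2
        elif self.el is None and not self.s:
            self.el = core              # first access: private to this core
        if is_write:
            self.w = True               # write-through keeps W up to date
        # Blocks that are both shared and written may live only in the L2.
        if self.s and self.w:
            return "L2 only (shared R/W)"
        if self.s:
            return "Shared Read"
        return "Private R/W" if self.w else "Private Read"

    def evict(self, core: int):
        # Eviction returns the EL token; the block falls back to L2 Only.
        if self.el == core:
            self.el = None

blk = SWELBlock()
print(blk.access(core=0, is_write=False))   # Private Read
print(blk.access(core=0, is_write=True))    # Private R/W
print(blk.access(core=1, is_write=False))   # L2 only (shared R/W)
```

The final access illustrates the key SWEL invariant: once a block is both shared and written, it is permanently relegated to the lowest common cache level.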
Optimizations of SWEL
The SWEL protocol evicts data from the L1 once a block enters the shared state; there is no way to return to a private state, so the data remains in the L2, which increases latency. We therefore extend SWEL so that evicted data can be reconstituted, i.e. cached in the L1 again, if it becomes private again. On its own, this would cause data to keep bouncing between the L1 and L2, much as in a directory-based protocol. RSWEL therefore reconstitutes data only after a certain period of time, and only if it is unlikely to cause further coherence operations. To achieve this, we add a 2-bit counter to each block. The counter is initialized at 2 and incremented every time a block is written while in the shared state; it can approach a maximum value N and is then reset. RSWEL with N = 0 behaves the same as a directory-based protocol, and with N = infinity it behaves like the baseline SWEL protocol.
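One way to read the mechanism above is as a timer that delays reconstitution: a relegated block must survive N quiet periods without conflicting shared writes before it may re-enter the L1. The sketch below is an illustrative interpretation of that idea, not the exact hardware; the epoch/tick granularity and the reset-on-write policy are assumptions.

```python
class ReconstitutionCounter:
    """Illustrative RSWEL reconstitution timer (an interpretation, not RTL).

    When a block is relegated to the L2, the counter starts at N. Each
    quiet epoch decrements it; a shared write resets it. The block may
    return to the L1s only once the counter reaches zero, i.e. after N
    epochs without conflicting writes. N = 0 reconstitutes immediately
    (directory-like bouncing); a huge N never reconstitutes (plain SWEL).
    """
    def __init__(self, n: int):
        self.n = n
        self.remaining = n

    def shared_write(self):
        self.remaining = self.n        # recent conflict: restart the wait

    def tick(self) -> bool:
        """Advance one epoch; True when the block may re-enter the L1."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0

c = ReconstitutionCounter(n=2)
print(c.tick())            # False: only one quiet epoch so far
c.shared_write()           # conflicting write restarts the countdown
print(c.tick(), c.tick())  # False, then True after two quiet epochs
```

Under this reading, frequently written shared blocks never satisfy the quiet-period condition and simply stay in the L2, which is exactly the behaviour RSWEL wants to preserve.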
Dynamically Tuned RSWEL
The RSWEL optimization assumes a fixed N throughout execution. However, depending on the observed state transitions, one value of N may be preferable to another. Dynamically tuned RSWEL measures performance for different values of N and picks the best one each time; the value is re-chosen whenever a program phase change is detected.
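The selection step can be sketched as follows. This is a schematic illustration only: the candidate values of N, the use of IPC as the figure of merit, and the sampling-window scheme are all assumptions, not details given in the text.

```python
def pick_n(phase_stats: dict) -> int:
    """Pick the reconstitution period N with the best observed metric.

    phase_stats maps each candidate N (tried in earlier sampling windows
    of the current program phase) to a measured figure of merit, e.g.
    instructions per cycle. On a detected phase change, the table would
    be cleared and the search restarted.
    """
    return max(phase_stats, key=phase_stats.get)

# Hypothetical IPC samples gathered while cycling through candidate Ns:
samples = {0: 0.92, 4: 1.05, 16: 1.01}
print(pick_n(samples))   # 4: best IPC among the sampled settings
```

The design point worth noting is that the tuner only ever compares settings within one program phase, since a value of N that is good for one sharing pattern may be poor for the next.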
Implementation
The SWEL protocol was tested using the Virtutech Simics [2] full-system simulator, version 3.0.x. We model a Sunfire 6500 machine architecture modified to support a 16-way CMP of in-order UltraSPARC III cores, with system parameters as shown in Table 1.
Experiment and Results
Five metrics were considered when comparing SWEL with other protocols: performance, L1 miss rate per instruction, L1 eviction rate per instruction, network load, and energy consumption. The comparison is depicted in the following graphs.
Figure 3 L1 eviction per kilo instructions
Figure 4 Average flits in the grid network at a given time
Figure 5 L1 misses per kilo instructions
Figure 6 Coherence operations per kilo instructions
Figure 7 Energy consumption profile
Conclusion
We have devised a novel protocol for shared memory multiprocessors. It takes advantage of private blocks of memory by keeping them in the L1 while shared written blocks reside only in the L2, updating the EL token after every operation accordingly. Since data is rarely shared by many processors simultaneously, the SWEL protocol overcomes the drawbacks of the traditional directory-based and snooping protocols.
We further optimized SWEL so that data can be brought into the L1 again if it becomes private after being shared, checking the status of each block after a fixed period governed by N. Dynamically tuned RSWEL chooses the value of N based on the program's measured performance.
References
[1] D. E. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1999.
[2] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, February 2002.