Distributed Shared Memory: A Survey of Issues and Algorithms - Bill Nitzberg and Virginia Lo
Disclaimer: These notes were taken while attending the COL 733: Cloud Computing course at IIT Delhi and the 6.5840: Distributed Systems course offered at MIT by Robert Morris.
Why use Multiprocessors?
We are reaching the limits of single processors and shared memory, so it is useful to have multiprocessors to increase computing power.
Two kinds of parallel processors:
Tightly coupled shared-memory multiprocessors
Distributed-memory multiprocessors
Tightly Coupled Multiprocessor System
- Consists of multiple CPUs and a single global physical memory.
- Easy to program because it is a natural extension of a single-CPU system.
- Bottleneck: main memory is accessed via a common bus –> a serialization point –> this limits the system size to tens of processors.
Distributed-Memory Multiprocessors
- These do not suffer from the same drawback as tightly coupled systems.
- Consist of a collection of independent computers connected by a high-speed interconnection network.
- If the right network topology is chosen –> the system can contain many orders of magnitude more processors than a tightly coupled system.
Earlier systems were limited to the message-passing paradigm (a basic approach for inter-process communication (IPC) in distributed systems).
Newer systems implement an abstraction on top of the message-passing paradigm called the shared-memory abstraction –> it gives programmers the illusion of physically shared memory and lets them program against it.
DSM - Why??
- It provides a virtual address space shared among processes on loosely coupled processors.
- Advantages:
- Ease of programming
- Portability due to the shared memory paradigm
- Low cost of distributed-memory machines.
- Scalability due to absence of hardware bottlenecks.
- Low Runtime Overhead
DSM - GOALS!!
- Overcome architectural limitations (memory size)
- Support a better programming paradigm
DSM - Implementation??
- Implemented using 3 approaches (some systems use more than one approach).
- Hardware implementations –> extend traditional caching techniques to work in scalable architectures.
- Operating system and library implementations –> achieve sharing and coherence through virtual memory-management mechanisms.
- Compiler implementations –> automatically convert shared data accesses into synchronized operations.
DSM - Issues!!
- Memory Coherence
- Design Choices
- Implementation Methods
Design Choices ??
System Designer should make choices regarding:
- Structure
- Granularity
- Access
- Coherence Semantics
- Scalability
- Heterogeneity
Structure and Granularity:
- Structure and Granularity are closely related
- Structure refers to the layout of the shared data in memory. (Most DSM systems do not structure memory –> it is just a linear array of words.)
- Some structure the data as objects or as types in the source language.
- Granularity refers to the size of the unit of sharing: byte, word, page, or complex data structure.
- Shared-memory programs exhibit locality of reference: a process is likely to access a large region of its shared address space in a small amount of time. So, larger page sizes reduce paging overhead.
- But sharing may also cause contention –> the larger the page size, the greater the chance that more than one process will require access to the same page. Smaller pages reduce the possibility of false sharing (see the sketch after this list).
- Another factor affecting page size is the need to keep directory information about pages: the smaller the page size, the larger the directory.
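A minimal sketch of false sharing under page-granularity sharing (my own illustration; the page size, block size, and addresses are made-up values, not from the paper): two unrelated variables that land on the same page force the whole page to bounce between the nodes writing them.

```python
PAGE_SIZE = 4096   # hypothetical DSM sharing unit (bytes)
BLOCK_SIZE = 64    # a much finer hypothetical sharing unit

def unit_of(addr: int, unit_size: int) -> int:
    """Index of the sharing unit that contains this address."""
    return addr // unit_size

# Two logically unrelated variables placed 128 bytes apart by the allocator:
# node 0 only ever writes `a`, node 1 only ever writes `b`.
a_addr = 0x1000          # 4096
b_addr = 0x1080          # 4224

# With page-sized units they share a unit, so every write by one node
# invalidates (or updates) the other node's copy of the whole page ->
# false sharing and possible ping-ponging of the page.
assert unit_of(a_addr, PAGE_SIZE) == unit_of(b_addr, PAGE_SIZE)

# With 64-byte units they fall in different units and never conflict,
# at the cost of a larger directory to track the extra units.
assert unit_of(a_addr, BLOCK_SIZE) != unit_of(b_addr, BLOCK_SIZE)
print("page index:", unit_of(a_addr, PAGE_SIZE), unit_of(b_addr, PAGE_SIZE))
```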
How to structure the shared memory?
One approach is to structure it by data type: shared memory is organized as objects in a distributed object-oriented system, or as variables of the source language. Another way is to structure shared memory like a database, ordering it as an associative memory called a tuple space.
Coherence semantics.
- Programmers need to understand how parallel memory updates are propagated throughout the system.
- Strict consistency - the most intuitive semantics for memory coherence –> a read operation returns the most recently written value.
- In distributed systems, "most recent" is an ambiguous concept –> so DSM systems provide only a reduced form of memory coherence –> relaxed consistency (consistency is guaranteed only for certain accesses).
- Plus provides processor consistency.
- Dash provides release consistency.
Strict and Sequential Consistency
Strict Consistency - Every time you read data, you get the most recent update.
Sequential Consistency - Operations appear in a sequence as if they were executed one after another, maintaining a logical order.
Relaxed Consistency Models
Processor Consistency - Writes from each node are observed in the order that node issued them, but different nodes might not agree on the ordering of writes from different nodes.
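A worked example (my own illustration, not from the paper) of the difference: two nodes each write one shared variable and then read the other. Sequential consistency requires at least one node to observe the other's write; processor consistency permits both reads to return the old values, because the nodes need not agree on a single global interleaving.

```python
from itertools import permutations

# Node 1's program and Node 2's program: each writes its own variable,
# then reads the other's. Both variables start at 0.
node1 = [("W", "x"), ("R", "y")]
node2 = [("W", "y"), ("R", "x")]

def sequentially_consistent_outcomes():
    """All (r1, r2) results reachable under sequential consistency, i.e.
    under some single interleaving preserving each node's program order."""
    ops = node1 + node2
    results = set()
    for order in permutations(range(len(ops))):
        # keep only interleavings that preserve per-node program order
        if order.index(0) > order.index(1) or order.index(2) > order.index(3):
            continue
        mem = {"x": 0, "y": 0}
        reads = {}
        for op, var in (ops[i] for i in order):
            if op == "W":
                mem[var] = 1
            else:
                reads[var] = mem[var]
        results.add((reads["y"], reads["x"]))   # (r1, r2)
    return results

# Sequential consistency never allows (0, 0): some write must precede the
# other node's read. Processor consistency (and weaker models) do allow
# (0, 0), since each read may complete before the remote write is visible.
print(sequentially_consistent_outcomes())   # {(0, 1), (1, 0), (1, 1)}
```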
Weak and Release Consistency:
Weak Consistency - Consistency is enforced only at explicit synchronization operations, which control when data updates become visible.
Release Consistency - A type of weak consistency where the synchronization operations (acquire and release) ensure that updates are seen in a consistent order.
DSM systems
| System | Details and Features |
|---|---|
| Agora | Heterogeneous DSM system that allows data structures to be shared across machines. Supports weak consistency. (Bisiani and Forin, Carnegie Mellon University, 1987- ) |
| Amber | Object-based DSM system where sharing is performed by migrating processes to data as well as data to processes. (Chase, Feeley, and Levy, University of Washington, 1988- ) |
| Capnet | An extension of DSM to a wide area network. (Tam and Farber, University of Delaware, 1990- ) |
| Choices | DSM incorporated into a hierarchical object-oriented distributed operating system. (Johnston and Campbell, University of Illinois, 1988- ) |
| Clouds | Object-oriented distributed operating system where objects can migrate. (Ramachandran and Khalidi, Georgia Institute of Technology, 1987- ) |
| Dash | Hierarchical implementation of DSM with a directory-based coherence protocol. Supports release consistency. (Lenoski, Laudon, Gharachorloo, Gupta, and Hennessy, Stanford University, 1988- ) |
| Emerald | An object-oriented language and system that indirectly supports DSM through object mobility. (Jul, Levy, Hutchinson, and Black, University of Washington, 1986-1988) |
| Ivy | Early page-oriented DSM on a network of Apollo workstations. (Li, Yale University, 1984-1986) |
| Linda | Shared associative object memory with access functions. Can be implemented for many languages and machines. (Carriero and Gelernter, Yale University, 1982- ) |
| Memnet | Hardware implementation of DSM on a 200-Mbps token ring used to broadcast invalidates and read requests. (Delp and Farber, University of Delaware, 1986-1988) |
| Mermaid | Heterogeneous DSM system where the compiler forces shared pages to contain a single data type. Type conversion is performed on reference. (Stumm, Zhou, Li, and Wortman, University of Toronto and Princeton University, 1988-1991) |
| Mether | Transparent DSM built on SunOS 4.0. Allows applications to access an inconsistent state for efficiency. (Minnich and Farber, Supercomputing Research Center, Bowie, MD, 1990- ) |
| Mirage | Kernel-level implementation of DSM. Reduces thrashing by prohibiting a page from being stolen until a certain time interval has elapsed. (Fleisch and Popek, University of California at Los Angeles, 1987-1989) |
| Munin | Object-based DSM system that investigates type-specific coherence protocols. (Bennett, Carter, and Zwaenepoel, Rice University, 1989- ) |
| Plus | Hardware implementation of DSM. Uses a write-once protocol and performs replication only by program request. (Bisiani and Ravishankar, Carnegie Mellon University, 1988- ) |
| Shiva | Ivy-like DSM system for the Intel iPSC/2 hypercube. (Li and Schaefer, Princeton University, 1988- ) |
| Shared Data-Object Model | Implementation of DSM on top of the Amoeba distributed operating system. (Bal, Kaashoek, and Tanenbaum, Vrije University, The Netherlands, 1988- ) |
Design Choices
Scalability
- DSM systems scale better than tightly coupled shared-memory multiprocessors.
- DSM scales well because it reduces the two factors that limit scalability:
  - Central bottlenecks (such as the bus of a tightly coupled shared-memory multiprocessor).
  - Global common knowledge operations and storage (such as broadcast messages or full directories, whose sizes are proportional to the number of nodes).
Heterogeneity
Sharing memory between two machines with different architectures may seem impossible,
- because they may not use the same representation for basic data types (integers, floats, etc.).
- It is easier if the DSM is structured as variables or objects in the source language, because the compiler can then insert conversion code into shared-memory accesses.
- For example, in Mermaid, memory is shared in pages and a page can contain only one type of data. If a page is moved between machines with different architectures, a conversion routine converts the data in the page to the appropriate format (see the sketch after this list).
- Drawback: the conversion overhead may outweigh the benefits of sharing.
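A minimal sketch (my own illustration, not Mermaid's actual code) of the kind of per-page conversion routine such a system needs, assuming a page known to contain only 32-bit integers and nodes that differ in byte order:

```python
import struct

def convert_int32_page(page: bytes, src_order: str, dst_order: str) -> bytes:
    """Convert a page containing only 32-bit integers between byte orders.

    src_order / dst_order are '<' (little-endian) or '>' (big-endian).
    Because a Mermaid-style page holds a single data type, the routine
    knows how to reinterpret every word on the page.
    """
    count = len(page) // 4
    values = struct.unpack(f"{src_order}{count}i", page)
    return struct.pack(f"{dst_order}{count}i", *values)

# Example: a little-endian node ships a page to a big-endian node.
little_endian_page = struct.pack("<4i", 1, 2, 3, -4)
big_endian_page = convert_int32_page(little_endian_page, "<", ">")
assert struct.unpack(">4i", big_endian_page) == (1, 2, 3, -4)
```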
Implementation
A DSM system must automatically transform shared-memory accesses into interprocess communication.
- This requires algorithms to locate and access shared data, maintain coherence, and replace data.
- The algorithms should also support process synchronization and memory management.
Data Location and Access
To share data in a DSM system, a program must be able to find and retrieve the data it needs.
- If data does not move around the system –> it resides in a static location –> finding it is easy.
  - The Linda implementation uses hashing on tuples to distribute data statically –> simple and fast, but may cause a bottleneck if the data is not distributed properly.
- If data moves around the system –> the data can be redistributed dynamically to wherever it is being used –> finding it is harder.
  - One way to locate such data is a centralized server that keeps track of all shared data. Two drawbacks of a centralized server:
    - The server serializes location queries –> reducing parallelism.
    - The server may become heavily loaded –> slowing the entire system.
  - Requests for data could instead be broadcast, but broadcasting does not scale well.
  - Another approach is an owner-based distributed scheme: each piece of data has an owner. As the data migrates through the system, the owner (node) keeps changing. When another node needs a copy, it sends a request to the last known owner; if that owner has passed the data on to some other node, it forwards the request to the new owner (see the sketch after this list).
    - Drawback: a request may be forwarded many times before reaching the current owner.
  - Replicated data: the system must keep track of all replicated copies. A directory records whether data is stored locally, shared remotely, or shared dirty; if shared remotely, the directory entry also indicates the location of the replicated copies.
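A minimal sketch (my own illustration, not from the paper) of the owner-based scheme: each node remembers only a probable owner for each page, and a request is forwarded along the chain of hints until it reaches the current owner.

```python
# Each node keeps a "probable owner" hint per page. Hints go stale as pages
# migrate, which is why a request may be forwarded several times (the
# drawback noted above) before it reaches the current owner.
class Node:
    def __init__(self, node_id, pages_owned=()):
        self.node_id = node_id
        self.pages_owned = set(pages_owned)   # pages this node currently owns
        self.probable_owner = {}              # page id -> last known owner id

    def locate(self, page, nodes, hops=0):
        """Follow the probable-owner chain until the real owner is found."""
        if page in self.pages_owned:
            return self.node_id, hops
        hint = self.probable_owner[page]      # stale hint -> keep forwarding
        return nodes[hint].locate(page, nodes, hops + 1)

# Page 7 migrated from node 0 to node 1 to node 2, but node 0's hint still
# points at node 1, so a lookup starting at node 0 takes two forwarding hops.
nodes = {0: Node(0), 1: Node(1), 2: Node(2, pages_owned=[7])}
nodes[0].probable_owner[7] = 1
nodes[1].probable_owner[7] = 2
print(nodes[0].locate(7, nodes))   # (2, 2): owner is node 2, found in 2 hops
```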
Coherence Protocol.
All DSM systems provide some type of memory coherence.
- If the shared data is not replicated:
  - Enforcing memory coherence is trivial.
  - The network automatically serializes requests in the order they occur –> ensures strict memory consistency –> the strongest form of coherence.
  - But serializing data access creates a bottleneck –> makes parallelism impossible.
- If the shared data is replicated:
  - Parallelism increases –> e.g., multiple reads can be performed in parallel.
  - But replication complicates the coherence protocol.
Two protocols handle replication:
- Write-Invalidate –> there can be many copies of a read-only piece of data, but only one copy of a writable piece of data. –> All copies of the data except one are invalidated before a write can proceed. (Most DSM systems use this; see the sketch after this list.)
  - Each piece of data has a status tag indicating whether the data is valid, whether it is shared, and whether it is read-only or writable.
  - For a read:
    - If the local data is valid –> it is returned immediately.
    - If the local data is not valid –> a read request is sent to the location of a valid copy, and a copy of the data is returned.
    - If the data was writable on another node –> this read request causes it to become read-only there. –> The copy remains valid until an invalidate request is received.
  - For a write:
    - If the data is valid and writable locally –> the request is satisfied immediately.
    - If the data is not writable –> the directory controller sends out an invalidate request, along with a request for a copy of the data if the local copy is not valid. –> When the invalidate completes, the data is valid locally and writable, and the original write request may complete.
- Write-Update –> a write updates all copies of a piece of data. –> Can be implemented efficiently in hardware DSM implementations.
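A minimal directory sketch (my own illustration, not any particular system's protocol) of the write path under write-invalidate: before a node may write, every other copy is invalidated, leaving the writer with the only valid, writable copy.

```python
# One directory entry per page: which nodes hold a copy, and the page state.
# States (as in the notes): "read_only" (possibly many copies) or
# "writable" (exactly one copy).
class Directory:
    def __init__(self):
        self.copies = {}   # page id -> set of node ids holding a valid copy
        self.state = {}    # page id -> "read_only" | "writable"

    def read(self, page, node):
        """Grant a read-only copy; any writable copy elsewhere is downgraded."""
        self.state[page] = "read_only"
        self.copies.setdefault(page, set()).add(node)

    def write(self, page, node):
        """Invalidate all other copies, then let `node` write."""
        for other in self.copies.get(page, set()) - {node}:
            self.invalidate(page, other)          # send invalidate message
        self.copies[page] = {node}
        self.state[page] = "writable"

    def invalidate(self, page, node):
        # In a real system this is a message to `node`; here we just log it.
        print(f"invalidate page {page} at node {node}")

d = Directory()
d.read(7, node=1)
d.read(7, node=2)
d.write(7, node=3)   # prints invalidations for nodes 1 and 2
```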
### Replacement Strategy
When the system allows data to move around, two problems arise once the available space for "caching" shared data fills up:
- Which data should be replaced to free space?
- Where should the replaced data go?
In choosing a victim, DSM systems differentiate data items by their status and prioritize them:
- Priority is given to replacing shared items over exclusively owned items (a shared item has other valid copies in the system, so it can simply be dropped).
The DSM system must also make sure that replaced data is not lost. In a multiprocessor caching system, replaced data would simply be written back to main memory. DSM options:
- Some systems take a similar approach: they transfer replaced data items to a home node that has statically allocated space (perhaps on disk) to store data when it is not needed. –> Simple to implement, but wastes a lot of memory.
- An improvement is to have the node that wants to evict the item page it out to its own disk. –> Does not waste memory, but is time consuming.
- A better way (as in Shiva) is to keep track of free memory in the system and simply page the item out to a node with space available (see the sketch below).
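A minimal sketch (my own illustration; the status names are hypothetical) of a victim-selection policy along these lines: prefer evicting copies that exist elsewhere, and only ship truly exclusive data to another node or to disk.

```python
# Hypothetical per-page status kept by the local DSM runtime.
# Lower number = cheaper to evict.
EVICTION_PRIORITY = {
    "read_only_replica": 0,   # other copies exist -> just drop it
    "shared_clean": 1,        # another node also holds a copy
    "exclusive_dirty": 2,     # only copy -> must be shipped somewhere first
}

def choose_victim(pages):
    """pages: dict of page id -> status string. Return the page to evict."""
    return min(pages, key=lambda p: EVICTION_PRIORITY[pages[p]])

def evict(page, status, nodes_with_free_memory):
    if status != "exclusive_dirty":
        return f"drop page {page}"                                   # nothing lost
    if nodes_with_free_memory:
        return f"send page {page} to node {nodes_with_free_memory[0]}"  # Shiva-style
    return f"write page {page} to local disk"

pages = {1: "exclusive_dirty", 2: "read_only_replica", 3: "shared_clean"}
victim = choose_victim(pages)
print(victim, "->", evict(victim, pages[victim], nodes_with_free_memory=[4]))
# -> 2 -> drop page 2
```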
### Thrashing
DSM systems are particularly prone to thrashing. For example:
- Two nodes compete for write access to a single data item.
- The item may be transferred back and forth at such a high rate that no real work gets done (a ping-pong effect).
Munin and Mirage attack this problem in different ways.
- Munin allows programmers to associate types with shared data:
  - write-once, write-many, producer-consumer, private, migratory, result, read-mostly, synchronization, and general read/write.
  - Shared data of different types gets different coherence protocols.
  - To avoid thrashing, the programmer can declare the data write-many and the system will use a delayed-write policy.
  - But programmers are often bad at predicting the behaviour of their programs, so this method may not be very helpful.
  - Also, since the type remains static once specified, Munin cannot dynamically adjust to an application's changing behaviour.
- Mirage specifically targets cases where nodes might compete for access to the same page.
  - To stop the ping-pong effect –> it adds a dynamically tunable parameter to the coherence protocol.
  - This parameter determines the minimum amount of time ($\Delta$) a page will be available at a node.
  - Once a node performs a write to a shared page, the page remains writable on that node for $\Delta$ time.
  - This way, a process can complete a write run (or read run) before losing access to the page (see the sketch after this list).
  - $\Delta$ is similar to a time slice in a multitasking OS,
  - but in Mirage it can be dynamically adjusted to meet an application's specific needs.
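A minimal sketch (my own illustration, not Mirage's actual mechanism) of the $\Delta$ time-window idea: the coherence layer refuses to give up a page until $\Delta$ seconds have elapsed since the page arrived at the current holder.

```python
import time

DELTA = 0.010  # hypothetical minimum residence time for a page, in seconds

class PageHolder:
    """Tracks when each page last arrived at this node."""
    def __init__(self):
        self.acquired_at = {}   # page id -> timestamp of arrival

    def acquire(self, page):
        self.acquired_at[page] = time.monotonic()

    def may_release(self, page):
        """A remote invalidate/transfer request is honoured only after DELTA."""
        held_for = time.monotonic() - self.acquired_at[page]
        return held_for >= DELTA

holder = PageHolder()
holder.acquire(7)
print(holder.may_release(7))      # False: still inside the Δ window
time.sleep(DELTA)
print(holder.may_release(7))      # True: the page may now be stolen
```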
## Related Algorithms
To support DSM systems, synchronization operations and memory management must be specially tuned.
### Synchronization
Semaphores are typically implemented on shared-memory systems using spin locks, but spin locks in a DSM system can cause heavy thrashing, since multiple nodes repeatedly access the same shared data.
Clouds provides semaphore operations by grouping semaphores into centrally managed segments.
Munin supports its synchronization memory type with distributed locks.
Plus has a variety of synchronization instructions and supports delayed execution, in which a synchronization operation can be initiated and later tested for completion.
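A minimal sketch (my own illustration, not any of these systems' actual code) of why a naive spin lock thrashes over DSM, and the centrally managed alternative in spirit: keep the lock state at one manager node and send it a small message instead of spinning on a migrating page.

```python
# Naive spin lock on a shared DSM word: every test-and-set forces the page
# holding the lock word to migrate to the spinning node, so two contending
# nodes bounce the page back and forth (thrashing):
#
#   while not test_and_set(lock_word):   # each attempt pulls the page here
#       pass
#
# Centrally managed alternative: the lock lives on one manager node and
# waiters send a request message instead of repeatedly faulting a page in.
import queue

class LockManager:
    def __init__(self):
        self.held_by = None
        self.waiters = queue.Queue()

    def acquire(self, node):
        if self.held_by is None:
            self.held_by = node
            return "granted"
        self.waiters.put(node)
        return "queued"

    def release(self, node):
        assert self.held_by == node
        self.held_by = None if self.waiters.empty() else self.waiters.get()

mgr = LockManager()
print(mgr.acquire(1))   # granted
print(mgr.acquire(2))   # queued
mgr.release(1)
print(mgr.held_by)      # 2
```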
### Memory Management
- Typically, as with malloc(), memory is allocated out of a common pool, which is searched each time a request is made. –> In a DSM system this approach can be expensive.
- A better approach is to partition available memory into private buffers on each node and allocate memory from the global buffer space only when the private buffer is empty.
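A minimal sketch (my own illustration) of that allocation strategy: each node serves requests from its private free list first and falls back to the global pool only when the private buffer is empty, keeping the common path node-local.

```python
class NodeAllocator:
    """Per-node allocator: private free list first, global pool as fallback."""
    def __init__(self, private_blocks, global_pool):
        self.private_free = list(private_blocks)  # block ids owned by this node
        self.global_pool = global_pool            # shared, contended free list

    def alloc(self):
        if self.private_free:                     # fast, node-local path
            return self.private_free.pop()
        if self.global_pool:                      # slow path: touch shared state
            return self.global_pool.pop()
        raise MemoryError("out of shared memory")

    def free(self, block):
        self.private_free.append(block)           # return blocks to local buffer

global_pool = list(range(100, 110))
node0 = NodeAllocator(private_blocks=range(0, 4), global_pool=global_pool)
blocks = [node0.alloc() for _ in range(5)]        # 4 local blocks, then 1 global
print(blocks)   # [3, 2, 1, 0, 109]
```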
DSM - CHALLENGES ??!
- High cost of communication.
- Key factors that affect performance: granularity and the replacement algorithm.
## Conclusion
- Performance of DSM is greatly affected by memory-access patterns and replication of shared data.
- Hardware implementations reduce communication latency
- DSM effectively supports parallel processing.