Distributed Shared Memory: A Survey of Issues and Algorithms - Bill Nitzberg and Virginia Lo
Disclaimer: These notes were taken while attending the COL 733: Cloud Computing course at IIT Delhi and the 6.5840: Distributed Systems course offered at MIT by Robert Morris.
Why use Multiprocessors?
We are reaching the limits of single processors and shared memory, so it is useful to have multiprocessors to increase computing power.
Two kinds of parallel processors:
Tightly coupled shared-memory multiprocessors
Distributed-memory multiprocessors
Tightly Coupled Multiprocessor System
- Consists of multiple CPUs and a single global physical memory.
- Easy to program because it is a natural extension of a single-CPU system.
- Bottleneck: main memory is accessed via a common bus –> a serialization point –> this limits the system size to tens of processors.
Distributed-Memory Multiprocessors
- These do not suffer from the same drawback as tightly coupled systems.
- Consist of a collection of independent computers connected by a high-speed interconnection network.
- If the right network topology is chosen –> the system can contain many orders of magnitude more processors than a tightly coupled system.
Earlier systems were limited to the message-passing paradigm (a basic approach for inter-process communication (IPC) in distributed systems).
Newer systems implement an abstraction on top of the message-passing paradigm called the shared-memory abstraction –> it gives programmers the illusion of physically shared memory and lets them program against it.
DSM - Why??
- It provides a virtual address space shared among processes on loosely coupled processors.
- Advantages:
- Ease of programming
- Portability due to the shared memory paradigm
- Low cost of distributed-memory machines.
- Scalability due to absence of hardware bottlenecks.
- Low Runtime Overhead
DSM - GOALS!!
- Overcome architectural limitations (memory size)
- Support a better programming paradigm
DSM - Implementation??
- Implemented using 3 approaches (some systems use more than one approach).
- Hardware implementations –> extend traditional caching techniques to work in scalable architectures.
- Operating system and library implementations –> achieve sharing and coherence through virtual memory-management mechanisms.
- Compiler implementations –> automatically convert shared data accesses into synchronized operations.
DSM - Issues!!
- Memory Coherence
- Design Choices
- Implementation Methods
Design Choices ??
System Designer should make choices regarding:
- Structure
- Granularity
- Access
- Coherence Semantics
- Scalability
- Heterogeneity
Structure and Granularity:
- Structure and Granularity are closely related
- Structure refers to the layout of the shared data in memory. (Most DSM systems do not structure memory –> it is just a linear array of words.)
- Some structure the data as objects or as types in the source language.
- Granularity refers to the size of the unit of sharing: byte, word, page, or complex data structure.
- Shared-memory programs exhibit locality of reference: a process is likely to access a large region of its shared address space in a small amount of time. So, larger page sizes reduce paging overhead.
- But sharing may also cause contention –> the larger the page size, the greater the chance that more than one process will require access to the same page. Smaller pages reduce the possibility of false sharing (see the sketch after this list).
- Another factor affecting page size is the need to keep directory information about pages: the smaller the page size, the larger the directory.
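A minimal sketch of false sharing under page-granularity sharing (my own illustration; the page size, block size, and addresses are made-up values, not from the paper): two unrelated variables that land on the same page force the whole page to bounce between the nodes writing them.

```python
PAGE_SIZE = 4096   # hypothetical DSM sharing unit (bytes)
BLOCK_SIZE = 64    # a much finer hypothetical sharing unit

def unit_of(addr: int, unit_size: int) -> int:
    """Index of the sharing unit that contains this address."""
    return addr // unit_size

# Two logically unrelated variables placed 128 bytes apart by the allocator:
# node 0 only ever writes `a`, node 1 only ever writes `b`.
a_addr = 0x1000          # 4096
b_addr = 0x1080          # 4224

# With page-sized units they share a unit, so every write by one node
# invalidates (or updates) the other node's copy of the whole page ->
# false sharing and possible ping-ponging of the page.
assert unit_of(a_addr, PAGE_SIZE) == unit_of(b_addr, PAGE_SIZE)

# With 64-byte units they fall in different units and never conflict,
# at the cost of a larger directory to track the extra units.
assert unit_of(a_addr, BLOCK_SIZE) != unit_of(b_addr, BLOCK_SIZE)
print("page index:", unit_of(a_addr, PAGE_SIZE), unit_of(b_addr, PAGE_SIZE))
```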
How to structure the shared memory?
One approach is to structure it by data type: shared memory is organized as objects in a distributed object-oriented system, or as variables of the source language. Another way is to structure shared memory like a database, ordering it as an associative memory called a tuple space.
Coherence semantics.
- Programmers need to understand how parallel memory updates are propagated throughout the system.
- Strict consistency - the most intuitive semantics for memory coherence –> a read operation returns the most recently written value.
- In distributed systems, "most recent" is an ambiguous concept –> so DSM systems provide only a reduced form of memory coherence –> relaxed consistency (consistency is guaranteed only for certain accesses).
- Plus provides processor consistency.
- Dash provides release consistency.
Strict and Sequential Consistency
Strict Consistency - Every time you read data, you get the most recent update.
Sequential Consistency - Operations appear in a sequence as if they were executed one after another, maintaining a logical order.
Relaxed Consistency Models
Processor Consistency - Writes from each node are observed in the order that node issued them, but different nodes might not agree on the ordering of writes from different nodes.
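A worked example (my own illustration, not from the paper) of the difference: two nodes each write one shared variable and then read the other. Sequential consistency requires at least one node to observe the other's write; processor consistency permits both reads to return the old values, because the nodes need not agree on a single global interleaving.

```python
from itertools import permutations

# Node 1's program and Node 2's program: each writes its own variable,
# then reads the other's. Both variables start at 0.
node1 = [("W", "x"), ("R", "y")]
node2 = [("W", "y"), ("R", "x")]

def sequentially_consistent_outcomes():
    """All (r1, r2) results reachable under sequential consistency, i.e.
    under some single interleaving preserving each node's program order."""
    ops = node1 + node2
    results = set()
    for order in permutations(range(len(ops))):
        # keep only interleavings that preserve per-node program order
        if order.index(0) > order.index(1) or order.index(2) > order.index(3):
            continue
        mem = {"x": 0, "y": 0}
        reads = {}
        for op, var in (ops[i] for i in order):
            if op == "W":
                mem[var] = 1
            else:
                reads[var] = mem[var]
        results.add((reads["y"], reads["x"]))   # (r1, r2)
    return results

# Sequential consistency never allows (0, 0): some write must precede the
# other node's read. Processor consistency (and weaker models) do allow
# (0, 0), since each read may complete before the remote write is visible.
print(sequentially_consistent_outcomes())   # {(0, 1), (1, 0), (1, 1)}
```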
Weak and Release Consistency:
Weak Consistency - Consistency is enforced only at explicit synchronization operations, which control when data updates become visible.
Release Consistency - A type of weak consistency where the synchronization operations (acquire and release) ensure that updates are seen in a consistent order.
DSM systems
| System | Details and Features |
|---|---|
| Agora | Heterogeneous DSM system that allows data structures to be shared across machines. Supports weak consistency. (Bisiani and Forin, Carnegie Mellon University, 1987- ) |
| Amber | Object-based DSM system where sharing is performed by migrating processes to data as well as data to processes. (Chase, Feeley, and Levy, University of Washington, 1988- ) |
| Capnet | An extension of DSM to a wide area network. (Tam and Farber, University of Delaware, 1990- ) |
| Choices | DSM incorporated into a hierarchical object-oriented distributed operating system. (Johnston and Campbell, University of Illinois, 1988- ) |
| Clouds | Object-oriented distributed operating system where objects can migrate. (Ramachandran and Khalidi, Georgia Institute of Technology, 1987- ) |
| Dash | Hierarchical implementation of DSM with a directory-based coherence protocol. Supports release consistency. (Lenoski, Laudon, Gharachorloo, Gupta, and Hennessy, Stanford University, 1988- ) |
| Emerald | An object-oriented language and system that indirectly supports DSM through object mobility. (Jul, Levy, Hutchinson, and Black, University of Washington, 1986-1988) |
| Ivy | Early page-oriented DSM on a network of Apollo workstations. (Li, Yale University, 1984-1986) |
| Linda | Shared associative object memory with access functions. Can be implemented for many languages and machines. (Carriero and Gelernter, Yale University, 1982- ) |
| Memnet | Hardware implementation of DSM on a 200-Mbps token ring used to broadcast invalidates and read requests. (Delp and Farber, University of Delaware, 1986-1988) |
| Mermaid | Heterogeneous DSM system where the compiler forces shared pages to contain a single data type. Type conversion is performed on reference. (Stumm, Zhou, Li, and Wortman, University of Toronto and Princeton University, 1988-1991) |
| Mether | Transparent DSM built on SunOS 4.0. Allows applications to access an inconsistent state for efficiency. (Minnich and Farber, Supercomputing Research Center, Bowie, MD, 1990- ) |
| Mirage | Kernel-level implementation of DSM. Reduces thrashing by prohibiting a page from being stolen until a certain time interval has elapsed. (Fleisch and Popek, University of California at Los Angeles, 1987-1989) |
| Munin | Object-based DSM system that investigates type-specific coherence protocols. (Bennett, Carter, and Zwaenepoel, Rice University, 1989- ) |
| Plus | Hardware implementation of DSM. Uses a write-once protocol and performs replication only by program request. (Bisiani and Ravishankar, Carnegie Mellon University, 1988- ) |
| Shiva | Ivy-like DSM system for the Intel iPSC/2 hypercube. (Li and Schaefer, Princeton University, 1988- ) |
| Shared Data-Object Model | Implementation of DSM on top of the Amoeba distributed operating system. (Bal, Kaashoek, and Tanenbaum, Vrije University, The Netherlands, 1988- ) |
Design Choices
Scalability
- DSM systems scale better than tightly coupled shared-memory multiprocessors.
- DSM scales well because it reduces the two factors that limit scalability:
  - Central bottlenecks (such as the bus of a tightly coupled shared-memory multiprocessor).
  - Global common knowledge operations and storage (such as broadcast messages or full directories, whose sizes are proportional to the number of nodes).
Heterogeneity
Sharing memory between two machines with different architectures may seem impossible,
- because they may not use the same representation for basic data types (integers, floats, etc.).
- It is easier if the DSM is structured as variables or objects in the source language, because the compiler can then insert conversion code into shared-memory accesses.
- For example, in Mermaid, memory is shared in pages and a page can contain only one type of data. If a page is moved between machines with different architectures, a conversion routine converts the data in the page to the appropriate format (see the sketch after this list).
- Drawback: the conversion overhead may outweigh the benefits of sharing.
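A minimal sketch (my own illustration, not Mermaid's actual code) of the kind of per-page conversion routine such a system needs, assuming a page known to contain only 32-bit integers and nodes that differ in byte order:

```python
import struct

def convert_int32_page(page: bytes, src_order: str, dst_order: str) -> bytes:
    """Convert a page containing only 32-bit integers between byte orders.

    src_order / dst_order are '<' (little-endian) or '>' (big-endian).
    Because a Mermaid-style page holds a single data type, the routine
    knows how to reinterpret every word on the page.
    """
    count = len(page) // 4
    values = struct.unpack(f"{src_order}{count}i", page)
    return struct.pack(f"{dst_order}{count}i", *values)

# Example: a little-endian node ships a page to a big-endian node.
little_endian_page = struct.pack("<4i", 1, 2, 3, -4)
big_endian_page = convert_int32_page(little_endian_page, "<", ">")
assert struct.unpack(">4i", big_endian_page) == (1, 2, 3, -4)
```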
Implementation
A DSM system must automatically transform shared-memory accesses into interprocess communication.
- This requires algorithms to locate and access shared data, maintain coherence, and replace data.
- The algorithms should also support process synchronization and memory management.
Data Location and Access
To share data in a DSM system, a program must be able to find and retrieve the data it needs.
- If data does not move around the system –> it resides in a static location –> finding it is easy.
  - The Linda implementation uses hashing on tuples to distribute data statically –> simple and fast, but may cause a bottleneck if the data is not distributed properly.
- If data moves around the system –> the data can be redistributed dynamically to wherever it is being used –> finding it is harder.
  - One way to locate such data is a centralized server that keeps track of all shared data. Two drawbacks of a centralized server:
    - The server serializes location queries –> reducing parallelism.
    - The server may become heavily loaded –> slowing the entire system.
  - Requests for data could instead be broadcast, but broadcasting does not scale well.
  - Another approach is an owner-based distributed scheme: each piece of data has an owner. As the data migrates through the system, the owner (node) keeps changing. When another node needs a copy, it sends a request to the last known owner; if that owner has passed the data on to some other node, it forwards the request to the new owner (see the sketch after this list).
    - Drawback: a request may be forwarded many times before reaching the current owner.
  - Replicated data: the system must keep track of all replicated copies. A directory records whether data is stored locally, shared remotely, or shared dirty; if shared remotely, the directory entry also indicates the location of the replicated copies.
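A minimal sketch (my own illustration, not from the paper) of the owner-based scheme: each node remembers only a probable owner for each page, and a request is forwarded along the chain of hints until it reaches the current owner.

```python
# Each node keeps a "probable owner" hint per page. Hints go stale as pages
# migrate, which is why a request may be forwarded several times (the
# drawback noted above) before it reaches the current owner.
class Node:
    def __init__(self, node_id, pages_owned=()):
        self.node_id = node_id
        self.pages_owned = set(pages_owned)   # pages this node currently owns
        self.probable_owner = {}              # page id -> last known owner id

    def locate(self, page, nodes, hops=0):
        """Follow the probable-owner chain until the real owner is found."""
        if page in self.pages_owned:
            return self.node_id, hops
        hint = self.probable_owner[page]      # stale hint -> keep forwarding
        return nodes[hint].locate(page, nodes, hops + 1)

# Page 7 migrated from node 0 to node 1 to node 2, but node 0's hint still
# points at node 1, so a lookup starting at node 0 takes two forwarding hops.
nodes = {0: Node(0), 1: Node(1), 2: Node(2, pages_owned=[7])}
nodes[0].probable_owner[7] = 1
nodes[1].probable_owner[7] = 2
print(nodes[0].locate(7, nodes))   # (2, 2): owner is node 2, found in 2 hops
```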
Coherence Protocol.
All DSM systems provide some type of memory coherence.
- If the shared data is not replicated:
  - Enforcing memory coherence is trivial.
  - The network automatically serializes requests in the order they occur –> ensures strict memory consistency –> the strongest form of coherence.
  - But serializing data access creates a bottleneck –> makes parallelism impossible.
- If the shared data is replicated:
  - Parallelism increases –> e.g., multiple reads can be performed in parallel.
  - But replication complicates the coherence protocol.
Two protocols handle replication:
- Write-Invalidate –> there can be many copies of a read-only piece of data, but only one copy of a writable piece of data. –> All copies of the data except one are invalidated before a write can proceed. (Most DSM systems use this; see the sketch after this list.)
  - Each piece of data has a status tag indicating whether the data is valid, whether it is shared, and whether it is read-only or writable.
  - For a read:
    - If the local data is valid –> it is returned immediately.
    - If the local data is not valid –> a read request is sent to the location of a valid copy, and a copy of the data is returned.
    - If the data was writable on another node –> this read request causes it to become read-only there. –> The copy remains valid until an invalidate request is received.
  - For a write:
    - If the data is valid and writable locally –> the request is satisfied immediately.
    - If the data is not writable –> the directory controller sends out an invalidate request, along with a request for a copy of the data if the local copy is not valid. –> When the invalidate completes, the data is valid locally and writable, and the original write request may complete.
- Write-Update –> a write updates all copies of a piece of data. –> Can be implemented efficiently in hardware DSM implementations.
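A minimal directory sketch (my own illustration, not any particular system's protocol) of the write path under write-invalidate: before a node may write, every other copy is invalidated, leaving the writer with the only valid, writable copy.

```python
# One directory entry per page: which nodes hold a copy, and the page state.
# States (as in the notes): "read_only" (possibly many copies) or
# "writable" (exactly one copy).
class Directory:
    def __init__(self):
        self.copies = {}   # page id -> set of node ids holding a valid copy
        self.state = {}    # page id -> "read_only" | "writable"

    def read(self, page, node):
        """Grant a read-only copy; any writable copy elsewhere is downgraded."""
        self.state[page] = "read_only"
        self.copies.setdefault(page, set()).add(node)

    def write(self, page, node):
        """Invalidate all other copies, then let `node` write."""
        for other in self.copies.get(page, set()) - {node}:
            self.invalidate(page, other)          # send invalidate message
        self.copies[page] = {node}
        self.state[page] = "writable"

    def invalidate(self, page, node):
        # In a real system this is a message to `node`; here we just log it.
        print(f"invalidate page {page} at node {node}")

d = Directory()
d.read(7, node=1)
d.read(7, node=2)
d.write(7, node=3)   # prints invalidations for nodes 1 and 2
```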
### Replacement Strategy
When the system allows data to move around, two problems arise once the available space for "caching" shared data fills up:
- Which data should be replaced to free space?
- Where should the replaced data go?
In choosing a victim, DSM systems differentiate data items by their status and prioritize them:
- Priority is given to replacing shared items over exclusively owned items (a shared item has other valid copies in the system, so it can simply be dropped).
The DSM system must also make sure that replaced data is not lost. In a multiprocessor caching system, replaced data would simply be written back to main memory. DSM options:
- Some systems take a similar approach: they transfer replaced data items to a home node that has statically allocated space (perhaps on disk) to store data when it is not needed. –> Simple to implement, but wastes a lot of memory.
- An improvement is to have the node that wants to evict the item page it out to its own disk. –> Does not waste memory, but is time consuming.
- A better way (as in Shiva) is to keep track of free memory in the system and simply page the item out to a node with space available (see the sketch below).
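A minimal sketch (my own illustration; the status names are hypothetical) of a victim-selection policy along these lines: prefer evicting copies that exist elsewhere, and only ship truly exclusive data to another node or to disk.

```python
# Hypothetical per-page status kept by the local DSM runtime.
# Lower number = cheaper to evict.
EVICTION_PRIORITY = {
    "read_only_replica": 0,   # other copies exist -> just drop it
    "shared_clean": 1,        # another node also holds a copy
    "exclusive_dirty": 2,     # only copy -> must be shipped somewhere first
}

def choose_victim(pages):
    """pages: dict of page id -> status string. Return the page to evict."""
    return min(pages, key=lambda p: EVICTION_PRIORITY[pages[p]])

def evict(page, status, nodes_with_free_memory):
    if status != "exclusive_dirty":
        return f"drop page {page}"                                   # nothing lost
    if nodes_with_free_memory:
        return f"send page {page} to node {nodes_with_free_memory[0]}"  # Shiva-style
    return f"write page {page} to local disk"

pages = {1: "exclusive_dirty", 2: "read_only_replica", 3: "shared_clean"}
victim = choose_victim(pages)
print(victim, "->", evict(victim, pages[victim], nodes_with_free_memory=[4]))
# -> 2 -> drop page 2
```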
### Thrashing
DSM systems are particularly prone to thrashing. For example:
- Two nodes compete for write access to a single data item.
- The item may be transferred back and forth at such a high rate that no real work gets done (a ping-pong effect).
Munin and Mirage attack this problem in different ways.
- Munin allows programmers to associate types with shared data:
  - write-once, write-many, producer-consumer, private, migratory, result, read-mostly, synchronization, and general read/write.
  - Shared data of different types gets different coherence protocols.
  - To avoid thrashing, the programmer can declare the data write-many and the system will use a delayed-write policy.
  - But programmers are often bad at predicting the behaviour of their programs, so this method may not be very helpful.
  - Also, since the type remains static once specified, Munin cannot dynamically adjust to an application's changing behaviour.
- Mirage specifically targets cases where nodes might compete for access to the same page.
  - To stop the ping-pong effect –> it adds a dynamically tunable parameter to the coherence protocol.
  - This parameter determines the minimum amount of time ($\Delta$) a page will be available at a node.
  - Once a node performs a write to a shared page, the page remains writable on that node for $\Delta$ time.
  - This way, a process can complete a write run (or read run) before losing access to the page (see the sketch after this list).
  - $\Delta$ is similar to a time slice in a multitasking OS,
  - but in Mirage it can be dynamically adjusted to meet an application's specific needs.
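A minimal sketch (my own illustration, not Mirage's actual mechanism) of the $\Delta$ time-window idea: the coherence layer refuses to give up a page until $\Delta$ seconds have elapsed since the page arrived at the current holder.

```python
import time

DELTA = 0.010  # hypothetical minimum residence time for a page, in seconds

class PageHolder:
    """Tracks when each page last arrived at this node."""
    def __init__(self):
        self.acquired_at = {}   # page id -> timestamp of arrival

    def acquire(self, page):
        self.acquired_at[page] = time.monotonic()

    def may_release(self, page):
        """A remote invalidate/transfer request is honoured only after DELTA."""
        held_for = time.monotonic() - self.acquired_at[page]
        return held_for >= DELTA

holder = PageHolder()
holder.acquire(7)
print(holder.may_release(7))      # False: still inside the Δ window
time.sleep(DELTA)
print(holder.may_release(7))      # True: the page may now be stolen
```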
## Related Algorithms
To support DSM systems, synchronization operations and memory management must be specially tuned.
### Synchronization
Semaphores are typically implemented on shared-memory systems using spin locks, but spin locks in a DSM system can cause heavy thrashing, since multiple nodes repeatedly access the same shared data.
Clouds provides semaphore operations by grouping semaphores into centrally managed segments.
Munin supports its synchronization memory type with distributed locks.
Plus has a variety of synchronization instructions and supports delayed execution, in which a synchronization operation can be initiated and later tested for completion.
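A minimal sketch (my own illustration, not any of these systems' actual code) of why a naive spin lock thrashes over DSM, and the centrally managed alternative in spirit: keep the lock state at one manager node and send it a small message instead of spinning on a migrating page.

```python
# Naive spin lock on a shared DSM word: every test-and-set forces the page
# holding the lock word to migrate to the spinning node, so two contending
# nodes bounce the page back and forth (thrashing):
#
#   while not test_and_set(lock_word):   # each attempt pulls the page here
#       pass
#
# Centrally managed alternative: the lock lives on one manager node and
# waiters send a request message instead of repeatedly faulting a page in.
import queue

class LockManager:
    def __init__(self):
        self.held_by = None
        self.waiters = queue.Queue()

    def acquire(self, node):
        if self.held_by is None:
            self.held_by = node
            return "granted"
        self.waiters.put(node)
        return "queued"

    def release(self, node):
        assert self.held_by == node
        self.held_by = None if self.waiters.empty() else self.waiters.get()

mgr = LockManager()
print(mgr.acquire(1))   # granted
print(mgr.acquire(2))   # queued
mgr.release(1)
print(mgr.held_by)      # 2
```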
### Memory Management
- Typically, as with malloc(), memory is allocated out of a common pool, which is searched each time a request is made. –> In a DSM system this approach can be expensive.
- A better approach is to partition available memory into private buffers on each node and allocate memory from the global buffer space only when the private buffer is empty.
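A minimal sketch (my own illustration) of that allocation strategy: each node serves requests from its private free list first and falls back to the global pool only when the private buffer is empty, keeping the common path node-local.

```python
class NodeAllocator:
    """Per-node allocator: private free list first, global pool as fallback."""
    def __init__(self, private_blocks, global_pool):
        self.private_free = list(private_blocks)  # block ids owned by this node
        self.global_pool = global_pool            # shared, contended free list

    def alloc(self):
        if self.private_free:                     # fast, node-local path
            return self.private_free.pop()
        if self.global_pool:                      # slow path: touch shared state
            return self.global_pool.pop()
        raise MemoryError("out of shared memory")

    def free(self, block):
        self.private_free.append(block)           # return blocks to local buffer

global_pool = list(range(100, 110))
node0 = NodeAllocator(private_blocks=range(0, 4), global_pool=global_pool)
blocks = [node0.alloc() for _ in range(5)]        # 4 local blocks, then 1 global
print(blocks)   # [3, 2, 1, 0, 109]
```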
DSM - CHALLENGES ??!
- High cost of communication.
- Key factors that affect performance: granularity and the replacement algorithm.
## Conclusion
- Performance of DSM is greatly affected by memory-access patterns and replication of shared data.
- Hardware implementations reduce communication latency
- DSM effectively supports parallel processing.