Learn Zig Series (#66) - Shared Memory and Semaphores

[IMAGE: https://images.hive.blog/DQmaHuB6qTWHaSpJHQ1S8FCCRmNQuUxTcPZdU4yKHsJ7vEP/zig-banner.png]

What will I learn

How POSIX shared memory works with shm_open and mmap for inter-process data sharing;
How to design shared data structures with careful memory layout;
How semaphores coordinate access to shared resources between processes;
How to implement the producer-consumer pattern using shared memory;
How memory barriers and atomic operations prevent data races in shared regions;
When to choose shared memory over message passing (and vice versa);
How to properly clean up shared memory segments with shm_unlink.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Zig 0.14+ distribution (download from ziglang.org);
The ambition to learn Zig programming.

Difficulty

Intermediate

Curriculum (of the `Learn Zig Series`):

Learn Zig Series (#66) - Shared Memory and Semaphores

Solutions to Episode 65 Exercises

Exercise 1: Build a "tee" command

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();
    _ = allocator;

    const args = std.process.argsAlloc(std.heap.page_allocator) catch std.process.exit(1);
    defer std.process.argsFree(std.heap.page_allocator, args);

    if (args.len < 2) {
        const stderr = std.io.getStdErr().writer();
        try stderr.print("usage: tee <file>\n", .{});
        std.process.exit(1);
    }

    const output_path = args[1];

    // create a pipe for forwarding data to the file-writing child
    const pipe_fds = try std.posix.pipe();

    const pid = try std.posix.fork();
    if (pid == 0) {
        // child: read from pipe, write to file
        std.posix.close(pipe_fds[1]);

        const file = std.fs.cwd().createFile(output_path, .{}) catch std.process.exit(1);
        defer file.close();

        var buf: [4096]u8 = undefined;
        while (true) {
            const n = std.posix.read(pipe_fds[0], &buf) catch break;
            if (n == 0) break;
            file.writeAll(buf[0..n]) catch break;
        }
        std.posix.close(pipe_fds[0]);
        std.process.exit(0);
    }

    // parent: read stdin, write to both stdout AND pipe
    std.posix.close(pipe_fds[0]);

    const stdin = std.io.getStdIn();
    const stdout = std.io.getStdOut();
    var buf: [4096]u8 = undefined;

    while (true) {
        const n = stdin.read(&buf) catch break;
        if (n == 0) break;
        stdout.writeAll(buf[0..n]) catch {};
        _ = std.posix.write(pipe_fds[1], buf[0..n]) catch break;
    }

    std.posix.close(pipe_fds[1]);
    _ = std.posix.waitpid(pid, 0);
}

The parent reads stdin and writes each chunk to both stdout (the terminal) and the pipe's write end. The child reads from the pipe's read end and writes to the file. When stdin hits EOF, the parent closes the pipe write end, the child sees EOF and exits.

Exercise 2: Bi-directional IPC echo server

const std = @import("std");

const Message = struct {
    payload: []const u8,

    fn encode(self: Message, writer: anytype) !void {
        const len: u32 = @intCast(self.payload.len);
        const len_bytes = std.mem.toBytes(std.mem.nativeToBig(u32, len));
        try writer.writeAll(&len_bytes);
        try writer.writeAll(self.payload);
    }

    fn decode(allocator: std.mem.Allocator, reader: anytype) !?Message {
        var len_bytes: [4]u8 = undefined;
        const n = reader.readAll(&len_bytes) catch return null;
        if (n < 4) return null;
        const len = std.mem.bigToNative(u32, std.mem.bytesToValue(u32, &len_bytes));
        if (len == 0) return null;

        const buf = try allocator.alloc(u8, len);
        const read_n = reader.readAll(buf) catch |err| {
            allocator.free(buf);
            return err;
        };
        if (read_n < len) {
            allocator.free(buf);
            return null;
        }
        return .{ .payload = buf };
    }
};

fn toUpper(buf: []u8) void {
    for (buf) |*c| {
        if (c.* >= 'a' and c.* <= 'z') c.* -= 32;
    }
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();
    const stdout = std.io.getStdOut().writer();

    // two pipes: parent->child and child->parent
    const to_child = try std.posix.pipe();
    const to_parent = try std.posix.pipe();

    const pid = try std.posix.fork();
    if (pid == 0) {
        // child: echo server (uppercase)
        std.posix.close(to_child[1]);
        std.posix.close(to_parent[0]);

        const reader_file = std.fs.File{ .handle = to_child[0] };
        const writer_file = std.fs.File{ .handle = to_parent[1] };
        const reader = reader_file.reader();
        const writer = writer_file.writer();

        while (true) {
            const msg = Message.decode(allocator, reader) catch break;
            if (msg == null) break;
            const m = msg.?;

            // copy payload so we can mutate it
            const response = allocator.dupe(u8, m.payload) catch break;
            allocator.free(@constCast(m.payload));
            toUpper(response);

            const resp_msg = Message{ .payload = response };
            resp_msg.encode(writer) catch break;
            allocator.free(response);
        }

        std.posix.close(to_child[0]);
        std.posix.close(to_parent[1]);
        std.process.exit(0);
    }

    // parent: send messages and read responses
    std.posix.close(to_child[0]);
    std.posix.close(to_parent[1]);

    const writer_file = std.fs.File{ .handle = to_child[1] };
    const reader_file = std.fs.File{ .handle = to_parent[0] };
    const writer = writer_file.writer();
    const reader = reader_file.reader();

    const test_messages = [_][]const u8{
        "hello world",
        "zig is great",
        "pipes are cool",
        "inter-process communication",
        "final message",
    };

    for (test_messages) |text| {
        const msg = Message{ .payload = text };
        try msg.encode(writer);

        const resp = try Message.decode(allocator, reader);
        if (resp) |r| {
            try stdout.print("sent: '{s}' -> got: '{s}'\n", .{ text, r.payload });
            allocator.free(@constCast(r.payload));
        }
    }

    std.posix.close(to_child[1]);
    std.posix.close(to_parent[0]);
    _ = std.posix.waitpid(pid, 0);
}

Two pipes create a full-duplex channel. The child reads from one pipe, uppercases the payload, and writes the response back through the other pipe. The parent sends all 5 messages and reads each response in order.

Exercise 3: Parallel command executor with ordered output

const std = @import("std");
const linux = std.os.linux;

const CommandResult = struct {
    output: std.ArrayList(u8),
    done: bool,
    read_fd: std.posix.fd_t,
    pid: std.posix.pid_t,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();
    const stdout = std.io.getStdOut().writer();

    const commands = [_][]const []const u8{
        &.{ "sh", "-c", "sleep 0.3 && echo 'command 1 line 1' && echo 'command 1 line 2'" },
        &.{ "sh", "-c", "echo 'command 2 line 1' && echo 'command 2 line 2' && echo 'command 2 line 3'" },
        &.{ "sh", "-c", "sleep 0.1 && echo 'command 3 done'" },
    };

    var results = try allocator.alloc(CommandResult, commands.len);
    defer {
        for (results) |*r| r.output.deinit();
        allocator.free(results);
    }

    // spawn all commands with piped stdout
    for (commands, 0..) |cmd, i| {
        const pipe_fds = try std.posix.pipe();

        const pid = try std.posix.fork();
        if (pid == 0) {
            std.posix.close(pipe_fds[0]);
            std.posix.dup2(pipe_fds[1], 1) catch std.process.exit(1);
            std.posix.close(pipe_fds[1]);

            const argv = @as([*:null]const ?[*:0]const u8, @ptrCast(cmd.ptr));
            _ = std.posix.execvpeZ(@ptrCast(cmd[0]), argv, @ptrCast(std.c.environ));
            std.process.exit(127);
        }

        std.posix.close(pipe_fds[1]);
        results[i] = .{
            .output = std.ArrayList(u8).init(allocator),
            .done = false,
            .read_fd = pipe_fds[0],
            .pid = pid,
        };
    }

    // poll all pipes concurrently, buffer per-command
    var buf: [4096]u8 = undefined;
    while (true) {
        var open_count: usize = 0;
        for (results) |r| {
            if (!r.done) open_count += 1;
        }
        if (open_count == 0) break;

        var pollfds = try allocator.alloc(linux.pollfd, results.len);
        defer allocator.free(pollfds);

        for (results, 0..) |r, i| {
            pollfds[i] = .{
                .fd = if (r.done) -1 else r.read_fd,
                .events = linux.POLL.IN,
                .revents = 0,
            };
        }

        const ready = linux.poll(pollfds.ptr, @intCast(pollfds.len), 1000);
        if (@as(isize, @bitCast(@as(usize, ready))) <= 0) continue;

        for (results, 0..) |*r, i| {
            if (r.done) continue;
            if (pollfds[i].revents & linux.POLL.IN != 0) {
                const n = std.posix.read(r.read_fd, &buf) catch 0;
                if (n == 0) {
                    std.posix.close(r.read_fd);
                    r.done = true;
                    _ = std.posix.waitpid(r.pid, 0);
                } else {
                    try r.output.appendSlice(buf[0..n]);
                }
            }
            if (pollfds[i].revents & linux.POLL.HUP != 0 and
                pollfds[i].revents & linux.POLL.IN == 0)
            {
                std.posix.close(r.read_fd);
                r.done = true;
                _ = std.posix.waitpid(r.pid, 0);
            }
        }
    }

    // print in order: command 1 first, then 2, then 3
    for (results, 0..) |r, i| {
        try stdout.print("=== Command {d} ===\n{s}\n", .{ i + 1, r.output.items });
    }
}

All commands run simultaneously, and poll reads from whichever pipes have data. But each command's output is buffered separately in an ArrayList(u8). After all commands finish, we print the buffers in order. This prevents a fast command's output from interleaving with a slow command's.

Last episode we built pipes -- the oldest Unix IPC mechanism. Unidirectional byte streams between related processes, named FIFOs for unrelated ones, poll for multiplexing. Pipes are great for streaming data, but they have a fundamental limitation: every byte has to be copied. The producer writes bytes into the kernel pipe buffer, the consumer reads them out. For small messages that's fine, but when you need two processes to work on the same large dataset -- say a shared counter, a ring buffer, a lookup table with millions of entries -- copying everything through a pipe is wasteful.

Shared memory flips the model. Instead of copying data between address spaces, you map the SAME physical memory pages into multiple processes. Both processes read and write the same bytes at the same virtual addresses (well, potentially different virtual addresses pointing to the same physical pages). Zero copy. But with great power comes great responsibility -- you now have two processes touching the same memory without the kernel serializing their access. That's where semaphores come in.

Here we go!

POSIX shared memory: shm_open and mmap

POSIX shared memory uses two calls together: shm_open creates (or opens) a named shared memory object, and mmap maps it into the process's address space. On Linux the object lives in /dev/shm/, which is backed by a tmpfs filesystem, meaning it lives in RAM (not on disk). This makes it extremely fast.

Since Zig doesn't have high-level wrappers for shm_open, we call into libc via @cImport. We already covered C interop in episodes 27 and 28, so this should feel familiar:

const std = @import("std");
const c = @cImport({
    @cInclude("sys/mman.h");
    @cInclude("fcntl.h");
    @cInclude("unistd.h");
});

const SharedCounter = extern struct {
    value: i64,
    write_count: u64,
};

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    const shm_name: [*:0]const u8 = "/zig_shm_demo";

    // create a shared memory object
    const fd = c.shm_open(
        shm_name,
        c.O_CREAT | c.O_RDWR,
        0o666,
    );
    if (fd < 0) {
        try stdout.print("shm_open failed\n", .{});
        return;
    }

    // set the size to hold our struct
    const size = @sizeOf(SharedCounter);
    if (c.ftruncate(fd, @intCast(size)) != 0) {
        try stdout.print("ftruncate failed\n", .{});
        return;
    }

    // map it into our address space
    const ptr = c.mmap(
        null,
        size,
        c.PROT_READ | c.PROT_WRITE,
        c.MAP_SHARED,
        fd,
        0,
    );
    if (ptr == c.MAP_FAILED) {
        try stdout.print("mmap failed\n", .{});
        return;
    }

    const counter: *SharedCounter = @ptrCast(@alignCast(ptr));

    // initialize
    counter.value = 0;
    counter.write_count = 0;

    const pid = try std.posix.fork();
    if (pid == 0) {
        // child: increment the counter 1000 times
        var i: usize = 0;
        while (i < 1000) : (i += 1) {
            counter.value += 1;
            counter.write_count += 1;
        }
        std.process.exit(0);
    }

    // parent: also increment the counter 1000 times
    var i: usize = 0;
    while (i < 1000) : (i += 1) {
        counter.value += 1;
        counter.write_count += 1;
    }

    _ = std.posix.waitpid(pid, 0);

    try stdout.print("Final value: {d} (expected 2000)\n", .{counter.value});
    try stdout.print("Write count: {d}\n", .{counter.write_count});

    // cleanup
    _ = c.munmap(ptr, size);
    _ = c.close(fd);
    _ = c.shm_unlink(shm_name);
}

If you run this, the final value will almost certainly NOT be 2000. It might be 1200, 1847, 1993 -- different every time. That's a data race. Both processes read the value, add 1, and write it back, but they can interleave those steps. Process A reads 42, process B reads 42, both write 43, and you've lost an increment. We covered this exact problem with threads in episode 30 (atomics), and the situation is identical with shared memory between processes -- same physical memory, same race conditions.
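The lost update can be replayed by hand. The sketch below is single-threaded and only simulates the interleaving described above; the function name lostUpdate and the locals a_local and b_local are made up for illustration.

```zig
const std = @import("std");

/// Replays the "both read 42, both write 43" interleaving step by step.
/// Returns the final value of the simulated shared counter.
fn lostUpdate(start: i64) i64 {
    var shared: i64 = start; // stands in for counter.value

    const a_local = shared; // process A loads the value
    const b_local = shared; // process B loads before A has stored

    shared = a_local + 1; // A stores its result
    shared = b_local + 1; // B stores too, overwriting A's increment
    return shared;
}

pub fn main() void {
    // Two increments were attempted, but only one survives.
    std.debug.print("42 after two increments: {d}\n", .{lostUpdate(42)});
}
```

In the real program the scheduler picks the interleaving, which is why the final count differs from run to run.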

The extern struct keyword is important here. Regular Zig structs may have padding and field ordering chosen by the compiler. extern struct guarantees C-compatible layout with fields in declaration order. When two processes (or a C and Zig program) share memory, they must agree on the exact byte layout.
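To see the difference on your own machine, a small sketch like the following prints both layouts. Plain and Fixed are hypothetical types that exist only for this comparison; the size printed for Plain is whatever your compiler version happens to choose.

```zig
const std = @import("std");

// Same fields, two kinds of struct (names are illustrative only).
const Plain = struct { flag: bool, counter: u64, status: u8 };
const Fixed = extern struct { flag: bool, counter: u64, status: u8 };

pub fn main() void {
    // Plain: no layout guarantee, the result may differ between compiler versions.
    std.debug.print("Plain size: {d}\n", .{@sizeOf(Plain)});

    // Fixed: C ABI layout, fields stay in declaration order.
    std.debug.print("Fixed size: {d}, flag at {d}, counter at {d}, status at {d}\n", .{
        @sizeOf(Fixed),
        @offsetOf(Fixed, "flag"),
        @offsetOf(Fixed, "counter"),
        @offsetOf(Fixed, "status"),
    });
}
```

Only the numbers printed for Fixed are something another process (or a C program) can rely on.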

Notice how the shared memory object has a name (/zig_shm_demo). Any process that knows this name can open it -- they don't need to be parent and child. This is similar to named pipes (FIFOs), but instead of a byte stream you get random-access memory. The name must start with / and contain no other slashes (POSIX requirement). On Linux the actual file lives at /dev/shm/zig_shm_demo.

Designing shared data structures with careful layout

When you put a struct in shared memory, you have to think about things that normally the compiler handles for you. Alignment, padding, field sizes -- all of it matters because both processes must interpret the bytes identically:

const std = @import("std");
const c = @cImport({
    @cInclude("sys/mman.h");
    @cInclude("fcntl.h");
    @cInclude("unistd.h");
});

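// WRONG: a plain struct has no guaranteed layout
// const BadShared = struct {
//     flag: bool,      // 1 byte (+ 7 bytes padding if kept in this order)
//     counter: u64,    // 8 bytes
//     status: u8,      // 1 byte (+ 7 bytes padding if kept in this order)
// };
// in declaration order: 24 bytes with gaps; the compiler may also reorder
// the fields, so neither side can rely on any particular layout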

// RIGHT: extern struct with explicit sizes, naturally aligned
const SharedHeader = extern struct {
    magic: u32,          // identifies this as our shared region
    version: u32,        // protocol version
    producer_pid: i32,   // who created this
    consumer_count: i32, // how many consumers attached
    data_offset: u32,    // byte offset to start of data area
    data_size: u32,      // size of data area in bytes
    sequence: u64,       // monotonically increasing write counter
};

const SharedRingBuffer = extern struct {
    head: u64,           // write position (producer advances)
    tail: u64,           // read position (consumer advances)
    capacity: u64,       // total buffer size
    element_size: u64,   // size of each element
    // data follows immediately after this header
};

fn verifyLayout() void {
    const stdout = std.io.getStdOut().writer();

    stdout.print("SharedHeader:\n", .{}) catch {};
    stdout.print("  size: {d} bytes\n", .{@sizeOf(SharedHeader)}) catch {};
    stdout.print("  magic offset: {d}\n", .{@offsetOf(SharedHeader, "magic")}) catch {};
    stdout.print("  version offset: {d}\n", .{@offsetOf(SharedHeader, "version")}) catch {};
    stdout.print("  sequence offset: {d}\n", .{@offsetOf(SharedHeader, "sequence")}) catch {};

    stdout.print("SharedRingBuffer:\n", .{}) catch {};
    stdout.print("  size: {d} bytes\n", .{@sizeOf(SharedRingBuffer)}) catch {};
    stdout.print("  head offset: {d}\n", .{@offsetOf(SharedRingBuffer, "head")}) catch {};
    stdout.print("  tail offset: {d}\n", .{@offsetOf(SharedRingBuffer, "tail")}) catch {};
}

pub fn main() !void {
    verifyLayout();

    const stdout = std.io.getStdOut().writer();

    // demonstrate creating and initializing a shared region
    const shm_name: [*:0]const u8 = "/zig_layout_demo";
    const total_size = @sizeOf(SharedHeader) + @sizeOf(SharedRingBuffer) + 4096;

    const fd = c.shm_open(shm_name, c.O_CREAT | c.O_RDWR, 0o666);
    if (fd < 0) return;
    defer _ = c.close(fd);
    defer _ = c.shm_unlink(shm_name);

    _ = c.ftruncate(fd, @intCast(total_size));

    const ptr = c.mmap(null, total_size, c.PROT_READ | c.PROT_WRITE, c.MAP_SHARED, fd, 0);
    if (ptr == c.MAP_FAILED) return;
    defer _ = c.munmap(ptr, total_size);

    // initialize the header
    const header: *SharedHeader = @ptrCast(@alignCast(ptr));
    header.magic = 0xDEADBEEF;
    header.version = 1;
    header.producer_pid = @intCast(std.os.linux.getpid());
    header.consumer_count = 0;
    header.data_offset = @sizeOf(SharedHeader);
    header.data_size = @intCast(total_size - @sizeOf(SharedHeader));
    header.sequence = 0;

    // initialize the ring buffer after the header
    const ring_ptr: [*]u8 = @ptrCast(ptr);
    const ring: *SharedRingBuffer = @ptrCast(@alignCast(ring_ptr + @sizeOf(SharedHeader)));
    ring.head = 0;
    ring.tail = 0;
    ring.capacity = 4096;
    ring.element_size = 64;

    try stdout.print("Shared region initialized: {d} bytes total\n", .{total_size});
    try stdout.print("Header: magic=0x{X}, pid={d}\n", .{ header.magic, header.producer_pid });
    try stdout.print("Ring buffer: capacity={d}, element_size={d}\n", .{ ring.capacity, ring.element_size });
}

A few rules of thumb for shared memory structures:

- Always use extern struct -- guarantees C layout, no compiler surprises.
- Keep fields naturally aligned -- a u64 at offset 0 or 8, a u32 at offset 0 or 4. Misaligned access is slow on x86 and may crash on ARM.
- Include a magic number -- when a consumer opens the shared region, it checks the magic to verify it's looking at the right thing and not some random leftover from a previous run.
- Include a version field -- so you can change the layout in the future without both sides silently misinterpreting each other's data.
- Document the exact byte layout -- future you (or a C program connecting to your shared memory) needs to know exactly what byte goes where.
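The magic and version rules can be sketched as a consumer-side check. This is a minimal illustration, not part of the programs above: validateHeader, report, AttachError, EXPECTED_MAGIC and SUPPORTED_VERSION are hypothetical names, and the header is built on the stack here instead of being read from a mapped region.

```zig
const std = @import("std");

const SharedHeader = extern struct {
    magic: u32,
    version: u32,
    producer_pid: i32,
    consumer_count: i32,
    data_offset: u32,
    data_size: u32,
    sequence: u64,
};

const EXPECTED_MAGIC: u32 = 0xDEADBEEF; // the value the producer writes
const SUPPORTED_VERSION: u32 = 1;

const AttachError = error{ BadMagic, UnsupportedVersion, BadDataRange };

/// Hypothetical check a consumer would run right after mmap.
/// `region_size` is the length of the mapping in bytes.
fn validateHeader(header: *const SharedHeader, region_size: usize) AttachError!void {
    if (header.magic != EXPECTED_MAGIC) return error.BadMagic;
    if (header.version != SUPPORTED_VERSION) return error.UnsupportedVersion;
    // the data area must start after the header and end inside the mapping
    const end = @as(u64, header.data_offset) + header.data_size;
    if (header.data_offset < @sizeOf(SharedHeader) or end > region_size) return error.BadDataRange;
}

fn report(header: *const SharedHeader, region_size: usize) void {
    if (validateHeader(header, region_size)) |_| {
        std.debug.print("header ok\n", .{});
    } else |err| {
        std.debug.print("refusing to attach: {s}\n", .{@errorName(err)});
    }
}

pub fn main() void {
    const region_size = @sizeOf(SharedHeader) + 4096;
    var header = SharedHeader{
        .magic = EXPECTED_MAGIC,
        .version = SUPPORTED_VERSION,
        .producer_pid = 0,
        .consumer_count = 0,
        .data_offset = @sizeOf(SharedHeader),
        .data_size = 4096,
        .sequence = 0,
    };
    report(&header, region_size);

    header.magic = 0; // simulate a stale or foreign region
    report(&header, region_size);
}
```

Failing early with a named error is usually easier to debug than reading garbage out of a region some other program left behind.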

The @offsetOf builtin is your friend here. It tells you the exact byte offset of each field, so you can verify the layout matches what you expect. If a field ends up at an unexpected offset, you've got a padding issue.
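Printing offsets is useful while exploring, but the same builtins can also guard the layout at compile time. The following is a sketch of that idea, assuming the SharedHeader definition from this episode; the error messages are made up for illustration.

```zig
const std = @import("std");

const SharedHeader = extern struct {
    magic: u32,
    version: u32,
    producer_pid: i32,
    consumer_count: i32,
    data_offset: u32,
    data_size: u32,
    sequence: u64,
};

// Compile-time layout guard: the build fails if a field moves or the size changes.
comptime {
    if (@offsetOf(SharedHeader, "magic") != 0) @compileError("magic moved");
    if (@offsetOf(SharedHeader, "version") != 4) @compileError("version moved");
    if (@offsetOf(SharedHeader, "data_offset") != 16) @compileError("data_offset moved");
    if (@offsetOf(SharedHeader, "sequence") != 24) @compileError("sequence moved");
    if (@sizeOf(SharedHeader) != 32) @compileError("SharedHeader is no longer 32 bytes");
}

pub fn main() void {
    std.debug.print("SharedHeader layout verified at compile time ({d} bytes)\n", .{@sizeOf(SharedHeader)});
}
```

If someone later inserts a field in the middle of the struct, the program stops compiling instead of silently disagreeing with older binaries that map the same region.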

Semaphores: coordinating access to shared resources

We saw that unsynchronized shared memory access produces garbage results. Semaphores fix this. A POSIX named semaphore is a kernel-managed counter: sem_wait decrements it (blocking if the count is zero), and sem_post increments it. When used as a mutex (initial count = 1), it ensures only one process touches the shared data at a time:

const std = @import("std");
const c = @cImport({
    @cInclude("sys/mman.h");
    @cInclude("fcntl.h");
    @cInclude("unistd.h");
    @cInclude("semaphore.h");
});

const SharedData = extern struct {
    counter: i64,
    iterations: u64,
};

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    const shm_name: [*:0]const u8 = "/zig_sem_demo";
    const sem_name: [*:0]const u8 = "/zig_sem_lock";

    // create shared memory
    const fd = c.shm_open(shm_name, c.O_CREAT | c.O_RDWR, 0o666);
    if (fd < 0) return;
    _ = c.ftruncate(fd, @intCast(@sizeOf(SharedData)));

    const ptr = c.mmap(
        null, @sizeOf(SharedData),
        c.PROT_READ | c.PROT_WRITE,
        c.MAP_SHARED, fd, 0,
    );
    if (ptr == c.MAP_FAILED) return;
    const data: *SharedData = @ptrCast(@alignCast(ptr));
    data.counter = 0;
    data.iterations = 0;

    // create a named semaphore with initial value 1 (mutex)
    // sem_open is variadic in C, so the mode literal needs a fixed-size type as well
    const sem = c.sem_open(sem_name, c.O_CREAT, @as(c_uint, 0o666), @as(c_uint, 1));
    if (sem == c.SEM_FAILED) {
        try stdout.print("sem_open failed\n", .{});
        return;
    }

    const iterations: usize = 100_000;

    const pid = try std.posix.fork();
    if (pid == 0) {
        // child: increment with semaphore protection
        var i: usize = 0;
        while (i < iterations) : (i += 1) {
            _ = c.sem_wait(sem);
            data.counter += 1;
            data.iterations += 1;
            _ = c.sem_post(sem);
        }
        std.process.exit(0);
    }

    // parent: also increment with semaphore protection
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        _ = c.sem_wait(sem);
        data.counter += 1;
        data.iterations += 1;
        _ = c.sem_post(sem);
    }

    _ = std.posix.waitpid(pid, 0);

    try stdout.print("Counter: {d} (expected {d})\n", .{ data.counter, iterations * 2 });
    try stdout.print("Iterations: {d}\n", .{data.iterations});

    // cleanup
    _ = c.munmap(ptr, @sizeOf(SharedData));
    _ = c.close(fd);
    _ = c.shm_unlink(shm_name);
    _ = c.sem_close(sem);
    _ = c.sem_unlink(sem_name);
}

NOW the counter will be exactly 200,000 every time. The semaphore serializes access -- only one process is inside the critical section at any moment. sem_wait blocks if the semaphore's value is 0 (meaning the other process is currently holding it), and sem_post releases it.

Named semaphores (created with sem_open) work across unrelated processes, just like named shared memory. They live in /dev/shm/sem.zig_sem_lock on Linux. There are also unnamed semaphores (initialized with sem_init) that you can place directly inside shared memory -- but named ones are simpler to manage for cross-process use.

Having said that, semaphore-based mutual exclusion is SLOW compared to what we had with threads. In episode 30 we used @atomicRmw for lock-free atomic increments within a single process. With shared memory across processes, we have the same option -- and it's much faster than semaphore round-trips to the kernel. We'll get to that in a moment.

The producer-consumer pattern with shared memory

The classic use of shared memory + semaphores is the producer-consumer pattern. One process produces data, another consumes it, and they coordinate through a shared ring buffer with two semaphores -- one counting empty slots, one counting filled slots:

const std = @import("std");
const c = @cImport({
    @cInclude("sys/mman.h");
    @cInclude("fcntl.h");
    @cInclude("unistd.h");
    @cInclude("semaphore.h");
    @cInclude("string.h");
});

const SLOT_SIZE = 64;
const NUM_SLOTS = 16;

const SharedQueue = extern struct {
    head: u32,             // producer writes here
    tail: u32,             // consumer reads here
    produced_count: u64,   // stats
    consumed_count: u64,   // stats
    slots: [NUM_SLOTS][SLOT_SIZE]u8,
};

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    const shm_name: [*:0]const u8 = "/zig_prodcon";
    const sem_empty_name: [*:0]const u8 = "/zig_sem_empty";
    const sem_full_name: [*:0]const u8 = "/zig_sem_full";
    const sem_mutex_name: [*:0]const u8 = "/zig_sem_mutex";

    // create shared memory
    const fd = c.shm_open(shm_name, c.O_CREAT | c.O_RDWR, 0o666);
    if (fd < 0) return;
    _ = c.ftruncate(fd, @intCast(@sizeOf(SharedQueue)));

    const ptr = c.mmap(
        null, @sizeOf(SharedQueue),
        c.PROT_READ | c.PROT_WRITE,
        c.MAP_SHARED, fd, 0,
    );
    if (ptr == c.MAP_FAILED) return;
    const queue: *SharedQueue = @ptrCast(@alignCast(ptr));
    queue.head = 0;
    queue.tail = 0;
    queue.produced_count = 0;
    queue.consumed_count = 0;
    @memset(std.mem.asBytes(&queue.slots), 0);

    // semaphores: empty starts at NUM_SLOTS, full starts at 0
    // sem_open is variadic in C, so the mode literal needs a fixed-size type as well
    const sem_empty = c.sem_open(sem_empty_name, c.O_CREAT, @as(c_uint, 0o666), @as(c_uint, NUM_SLOTS));
    const sem_full = c.sem_open(sem_full_name, c.O_CREAT, @as(c_uint, 0o666), @as(c_uint, 0));
    const sem_mutex = c.sem_open(sem_mutex_name, c.O_CREAT, @as(c_uint, 0o666), @as(c_uint, 1));
    if (sem_empty == c.SEM_FAILED or sem_full == c.SEM_FAILED or sem_mutex == c.SEM_FAILED) return;

    const num_items: usize = 30;

    const pid = try std.posix.fork();
    if (pid == 0) {
        // child = consumer
        var i: usize = 0;
        while (i < num_items) : (i += 1) {
            _ = c.sem_wait(sem_full);    // wait for a filled slot
            _ = c.sem_wait(sem_mutex);   // lock

            const slot = &queue.slots[queue.tail];
            const msg_end = std.mem.indexOfScalar(u8, slot, 0) orelse SLOT_SIZE;
            const stderr = std.io.getStdErr().writer();
            stderr.print("[consumer] item {d}: {s}\n", .{ i, slot[0..msg_end] }) catch {};
            queue.tail = (queue.tail + 1) % NUM_SLOTS;
            queue.consumed_count += 1;

            _ = c.sem_post(sem_mutex);   // unlock
            _ = c.sem_post(sem_empty);   // signal an empty slot
        }
        std.process.exit(0);
    }

    // parent = producer
    var i: usize = 0;
    while (i < num_items) : (i += 1) {
        _ = c.sem_wait(sem_empty);   // wait for an empty slot
        _ = c.sem_wait(sem_mutex);   // lock

        var msg_buf: [SLOT_SIZE]u8 = [_]u8{0} ** SLOT_SIZE;
        // SLOT_SIZE bytes is always enough for "message-N" here, so this cannot fail
        _ = std.fmt.bufPrint(&msg_buf, "message-{d}", .{i}) catch unreachable;
        @memcpy(&queue.slots[queue.head], &msg_buf);
        queue.head = (queue.head + 1) % NUM_SLOTS;
        queue.produced_count += 1;

        _ = c.sem_post(sem_mutex);   // unlock
        _ = c.sem_post(sem_full);    // signal a filled slot
    }

    _ = std.posix.waitpid(pid, 0);

    try stdout.print("Produced: {d}, Consumed: {d}\n", .{
        queue.produced_count, queue.consumed_count,
    });

    // cleanup
    _ = c.munmap(ptr, @sizeOf(SharedQueue));
    _ = c.close(fd);
    _ = c.shm_unlink(shm_name);
    _ = c.sem_close(sem_empty);
    _ = c.sem_close(sem_full);
    _ = c.sem_close(sem_mutex);
    _ = c.sem_unlink(sem_empty_name);
    _ = c.sem_unlink(sem_full_name);
    _ = c.sem_unlink(sem_mutex_name);
}

Three semaphores work together:
- sem_empty counts available (empty) slots. Producer waits on this before writing. Starts at NUM_SLOTS.
- sem_full counts filled slots. Consumer waits on this before reading. Starts at 0.
- sem_mutex protects the actual read/write of head/tail pointers. Classic binary semaphore.

The beauty of this design is that the producer automatically blocks when the buffer is full (sem_empty hits zero), and the consumer blocks when it's empty (sem_full hits zero). No busy-waiting, no polling, no wasted CPU. The kernel handles the scheduling. This is the same backpressure mechanism we saw with pipes in episode 65, but now you control the buffer layout and can share complex data structures instead of just byte streams.
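The bookkeeping behind that backpressure can be modelled without any processes at all. The sketch below is a single-threaded model, not the real thing: plain integers stand in for the sem_empty and sem_full counts, and the Model type with its produce and consume methods is hypothetical. A return value of false marks the point where the real call would block.

```zig
const std = @import("std");

const NUM_SLOTS = 16;

/// Single-threaded model of the queue bookkeeping.
const Model = struct {
    head: u32 = 0,
    tail: u32 = 0,
    empty: u32 = NUM_SLOTS, // stands in for the sem_empty count
    full: u32 = 0, // stands in for the sem_full count

    /// Returns false where the real producer would block in sem_wait(sem_empty).
    fn produce(self: *Model) bool {
        if (self.empty == 0) return false;
        self.empty -= 1;
        self.head = (self.head + 1) % NUM_SLOTS; // wrap around the ring
        self.full += 1;
        return true;
    }

    /// Returns false where the real consumer would block in sem_wait(sem_full).
    fn consume(self: *Model) bool {
        if (self.full == 0) return false;
        self.full -= 1;
        self.tail = (self.tail + 1) % NUM_SLOTS;
        self.empty += 1;
        return true;
    }
};

pub fn main() void {
    var m = Model{};
    var produced: u32 = 0;
    while (m.produce()) produced += 1; // fill until the producer would block
    std.debug.print("produced {d} items, head is now {d}\n", .{ produced, m.head });

    _ = m.consume(); // one consumed item frees exactly one slot
    std.debug.print("empty={d} full={d} tail={d}\n", .{ m.empty, m.full, m.tail });
}
```

Whenever neither side is in the middle of an operation, empty + full equals NUM_SLOTS; the two counting semaphores in the real program maintain the same invariant.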

Avoiding data races: memory barriers and atomic operations

Semaphores work, but they're heavyweight -- a sem_wait that has to block (and the sem_post that wakes it) goes through the kernel, and even the uncontended path costs more than a single atomic instruction. For simple counters and flags, atomic operations are much faster. We covered @atomicRmw and @atomicLoad for threads in episode 30 -- the exact same operations work for shared memory between processes, because the hardware atomic instructions operate on physical memory addresses regardless of which process issues them:

const std = @import("std");
const c = @cImport({
    @cInclude("sys/mman.h");
    @cInclude("fcntl.h");
    @cInclude("unistd.h");
});

const AtomicShared = extern struct {
    counter: i64 align(8),
    flag: u32 align(4),
    sequence: u64 align(8),
};

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    const shm_name: [*:0]const u8 = "/zig_atomic_demo";

    const fd = c.shm_open(shm_name, c.O_CREAT | c.O_RDWR, 0o666);
    if (fd < 0) return;
    _ = c.ftruncate(fd, @intCast(@sizeOf(AtomicShared)));

    const ptr = c.mmap(
        null, @sizeOf(AtomicShared),
        c.PROT_READ | c.PROT_WRITE,
        c.MAP_SHARED, fd, 0,
    );
    if (ptr == c.MAP_FAILED) return;
    const shared: *AtomicShared = @ptrCast(@alignCast(ptr));

    // initialize with atomic stores
    @atomicStore(i64, &shared.counter, 0, .seq_cst);
    @atomicStore(u32, &shared.flag, 0, .seq_cst);
    @atomicStore(u64, &shared.sequence, 0, .seq_cst);

    const iterations: usize = 500_000;

    const pid = try std.posix.fork();
    if (pid == 0) {
        // child: atomic increment
        var i: usize = 0;
        while (i < iterations) : (i += 1) {
            _ = @atomicRmw(i64, &shared.counter, .Add, 1, .seq_cst);
            _ = @atomicRmw(u64, &shared.sequence, .Add, 1, .seq_cst);
        }

        // signal done
        @atomicStore(u32, &shared.flag, 1, .release);
        std.process.exit(0);
    }

    // parent: atomic increment
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        _ = @atomicRmw(i64, &shared.counter, .Add, 1, .seq_cst);
        _ = @atomicRmw(u64, &shared.sequence, .Add, 1, .seq_cst);
    }

    _ = std.posix.waitpid(pid, 0);

    const final = @atomicLoad(i64, &shared.counter, .seq_cst);
    const seq = @atomicLoad(u64, &shared.sequence, .seq_cst);

    try stdout.print("Counter: {d} (expected {d})\n", .{ final, iterations * 2 });
    try stdout.print("Sequence: {d}\n", .{seq});

    _ = c.munmap(ptr, @sizeOf(AtomicShared));
    _ = c.close(fd);
    _ = c.shm_unlink(shm_name);
}

This will print exactly 1,000,000 for the counter. Every. Single. Time. No semaphores, no syscalls for synchronization -- just hardware-level atomic instructions. The lock prefix on x86 (which is what @atomicRmw compiles down to) ensures the read-modify-write cycle is indivisible even when two CPUs are accessing the same cache line.

The .seq_cst ordering is the strongest (sequentially consistent) -- it guarantees that all atomic operations appear to happen in a single global order, visible to all processes. For simple counters this is fine. For more complex lock-free data structures you might use .acquire/.release ordering for better performance, but .seq_cst is the safe default.
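As a rough illustration of the .acquire/.release pairing, here is a sketch that publishes a plain value behind an atomic flag. To stay self-contained it uses two threads and global variables instead of two processes and a mapped region; the names payload, ready, producer and consume are made up for this example, and the same pairing applies to fields inside a shared mapping.

```zig
const std = @import("std");

// `payload` is ordinary data, `ready` is the flag that publishes it.
var payload: u64 = 0;
var ready: u32 = 0;

fn producer() void {
    payload = 1234; // ordinary store
    // release: the store to payload above becomes visible to
    // whoever observes ready == 1 with an acquire load
    @atomicStore(u32, &ready, 1, .release);
}

fn consume() u64 {
    // acquire: pairs with the release store in producer()
    while (@atomicLoad(u32, &ready, .acquire) == 0) {
        std.atomic.spinLoopHint();
    }
    return payload; // safe to read once the flag has been observed
}

pub fn main() !void {
    const t = try std.Thread.spawn(.{}, producer, .{});
    const seen = consume();
    t.join();
    std.debug.print("consumer saw payload {d}\n", .{seen});
}
```

The spin loop is only acceptable here because the wait is very short; a consumer that may wait for a long time is better served by a semaphore, which lets the kernel put it to sleep.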

The alignment annotations (align(8), align(4)) matter here. Atomic operations on misaligned addresses can be non-atomic on some architectures (and will crash on others). An extern struct follows the C ABI, so the compiler already inserts padding to keep each field naturally aligned (here, 4 bytes between flag and sequence); the explicit annotations document that requirement and keep it in place if the field types change later.

Shared memory vs message passing: when to use each

After covering both pipes (episode 65) and shared memory (this episode), let's compare them properly. Here's a practical benchmark that shows the performance difference:

const std = @import("std");
const c = @cImport({
    @cInclude("sys/mman.h");
    @cInclude("fcntl.h");
    @cInclude("unistd.h");
});

const BenchResult = struct {
    elapsed_ns: u64,
    ops: u64,
};

fn benchPipe(iterations: usize) !BenchResult {
    const pipe_fds = try std.posix.pipe();

    const pid = try std.posix.fork();
    if (pid == 0) {
        std.posix.close(pipe_fds[0]);
        var val: [8]u8 = undefined;
        var i: usize = 0;
        while (i < iterations) : (i += 1) {
            std.mem.writeInt(u64, &val, @intCast(i), .little);
            _ = std.posix.write(pipe_fds[1], &val) catch break;
        }
        std.posix.close(pipe_fds[1]);
        std.process.exit(0);
    }

    std.posix.close(pipe_fds[1]);
    // Timer.read() takes a mutable pointer, so the timer must be a var
    var timer = try std.time.Timer.start();

    var buf: [8]u8 = undefined;
    var count: u64 = 0;
    while (true) {
        const n = std.posix.read(pipe_fds[0], &buf) catch break;
        if (n == 0) break;
        count += 1;
    }

    const elapsed = timer.read();
    std.posix.close(pipe_fds[0]);
    _ = std.posix.waitpid(pid, 0);

    return .{ .elapsed_ns = elapsed, .ops = count };
}

fn benchSharedMemory(iterations: usize) !BenchResult {
    const shm_name: [*:0]const u8 = "/zig_bench_shm";
    const SharedBench = extern struct {
        counter: u64 align(8),
        done: u32 align(4),
        _pad: [4]u8,
    };

    const fd = c.shm_open(shm_name, c.O_CREAT | c.O_RDWR, 0o666);
    if (fd < 0) return error.ShmOpenFailed;
    _ = c.ftruncate(fd, @intCast(@sizeOf(SharedBench)));

    const ptr = c.mmap(
        null, @sizeOf(SharedBench),
        c.PROT_READ | c.PROT_WRITE,
        c.MAP_SHARED, fd, 0,
    );
    if (ptr == c.MAP_FAILED) return error.MmapFailed;
    const bench: *SharedBench = @ptrCast(@alignCast(ptr));
    @atomicStore(u64, &bench.counter, 0, .seq_cst);
    @atomicStore(u32, &bench.done, 0, .seq_cst);

    const pid = try std.posix.fork();
    if (pid == 0) {
        var i: usize = 0;
        while (i < iterations) : (i += 1) {
            _ = @atomicRmw(u64, &bench.counter, .Add, 1, .seq_cst);
        }
        @atomicStore(u32, &bench.done, 1, .release);
        std.process.exit(0);
    }

    // Timer.read() takes a mutable pointer, so the timer must be a var
    var timer = try std.time.Timer.start();

    // parent also increments
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        _ = @atomicRmw(u64, &bench.counter, .Add, 1, .seq_cst);
    }

    _ = std.posix.waitpid(pid, 0);
    const elapsed = timer.read();

    const final = @atomicLoad(u64, &bench.counter, .seq_cst);

    _ = c.munmap(ptr, @sizeOf(SharedBench));
    _ = c.close(fd);
    _ = c.shm_unlink(shm_name);

    return .{ .elapsed_ns = elapsed, .ops = final };
}

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    const iters: usize = 100_000;

    try stdout.print("Benchmarking {d} operations...\n\n", .{iters});

    const pipe_result = try benchPipe(iters);
    try stdout.print("Pipe:          {d} ops in {d} us ({d} ops/sec)\n", .{
        pipe_result.ops,
        pipe_result.elapsed_ns / 1000,
        pipe_result.ops * 1_000_000_000 / pipe_result.elapsed_ns,
    });

    const shm_result = try benchSharedMemory(iters);
    try stdout.print("Shared memory: {d} ops in {d} us ({d} ops/sec)\n", .{
        shm_result.ops,
        shm_result.elapsed_ns / 1000,
        shm_result.ops * 1_000_000_000 / shm_result.elapsed_ns,
    });

    if (pipe_result.elapsed_ns > 0 and shm_result.elapsed_ns > 0) {
        const ratio = pipe_result.elapsed_ns / shm_result.elapsed_ns;
        try stdout.print("\nShared memory is ~{d}x faster for this workload\n", .{ratio});
    }
}

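On a typical machine, shared memory with atomics can be 10-50x faster than pipes for small messages, though the exact ratio varies with hardware and load. Every pipe transfer costs a write syscall and a read syscall plus a copy into and out of a kernel buffer, while shared memory is direct access to the same physical pages. Keep in mind that this benchmark compares 8-byte pipe transfers with atomic increments, so treat the ratio as a rough indication rather than a like-for-like measurement.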

But that doesn't mean shared memory is always better. Here's when to use each:

Use pipes when:
- You need streaming data (logs, command output, process chains)
- The data flows one direction -- producer to consumer
- You want the kernel to handle flow control and buffering
- Simplicity matters more than raw speed
- The communicating programs might be written in different languages

Use shared memory when:
- Multiple processes need random access to the same data structure
- Performance is critical (microsecond-scale operations)
- You need zero-copy data sharing for large datasets
- You're building a high-performance IPC protocol (like database shared buffers)

Use sockets when:
- Processes might be on different machines
- You need bidirectional communication
- You want a clean client-server protocol

In practice, many real systems combine these. PostgreSQL uses shared memory for its buffer pool (the shared cache of database pages) plus Unix domain sockets and TCP for client connections. Redis, by contrast, keeps its dataset in the server process's own memory and talks to clients over TCP or Unix domain sockets. The right choice depends on your access pattern.

Cleanup: unlinking shared memory segments

On Linux, shared memory objects and named semaphores show up under /dev/shm/ and persist until they are explicitly removed or the machine reboots. If your program crashes without cleanup, they stick around. Good hygiene means: always unlink when done, and handle the case where a stale segment from a previous crash already exists:

const std = @import("std");
const c = @cImport({
    @cInclude("sys/mman.h");
    @cInclude("fcntl.h");
    @cInclude("unistd.h");
    @cInclude("semaphore.h");
});

const ShmRegion = struct {
    name: [*:0]const u8,
    fd: c_int,
    ptr: ?*anyopaque,
    size: usize,

    fn create(name: [*:0]const u8, size: usize) ShmRegion {
        // try to unlink first -- removes stale segments from previous runs
        _ = c.shm_unlink(name);

        var region = ShmRegion{
            .name = name,
            .fd = -1,
            .ptr = null,
            .size = size,
        };

        region.fd = c.shm_open(name, c.O_CREAT | c.O_RDWR | c.O_EXCL, 0o666);
        if (region.fd < 0) {
            // O_EXCL failed -- someone else created it between unlink and open
            // try without O_EXCL
            region.fd = c.shm_open(name, c.O_CREAT | c.O_RDWR, 0o666);
            if (region.fd < 0) return region;
        }

        if (c.ftruncate(region.fd, @intCast(size)) != 0) {
            _ = c.close(region.fd);
            _ = c.shm_unlink(name);
            region.fd = -1;
            return region;
        }

        region.ptr = c.mmap(
            null, size,
            c.PROT_READ | c.PROT_WRITE,
            c.MAP_SHARED,
            region.fd, 0,
        );
        if (region.ptr == c.MAP_FAILED) {
            region.ptr = null;
            _ = c.close(region.fd);
            _ = c.shm_unlink(name);
            region.fd = -1;
        }

        return region;
    }

    fn open(name: [*:0]const u8, size: usize) ShmRegion {
        var region = ShmRegion{
            .name = name,
            .fd = -1,
            .ptr = null,
            .size = size,
        };

        region.fd = c.shm_open(name, c.O_RDWR, 0);
        if (region.fd < 0) return region;

        region.ptr = c.mmap(
            null, size,
            c.PROT_READ | c.PROT_WRITE,
            c.MAP_SHARED,
            region.fd, 0,
        );
        if (region.ptr == c.MAP_FAILED) {
            region.ptr = null;
            _ = c.close(region.fd);
            region.fd = -1;
        }

        return region;
    }

    fn destroy(self: *ShmRegion) void {
        if (self.ptr) |p| {
            _ = c.munmap(p, self.size);
            self.ptr = null;
        }
        if (self.fd >= 0) {
            _ = c.close(self.fd);
            self.fd = -1;
        }
        _ = c.shm_unlink(self.name);
    }

    fn isValid(self: *const ShmRegion) bool {
        return self.fd >= 0 and self.ptr != null;
    }
};

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();

    var region = ShmRegion.create("/zig_cleanup_demo", 4096);
    if (!region.isValid()) {
        try stdout.print("Failed to create shared memory\n", .{});
        return;
    }
    defer region.destroy();

    try stdout.print("Created shared memory: fd={d}, size={d}\n", .{ region.fd, region.size });

    // a second process would use open() instead of create()
    const pid = try std.posix.fork();
    if (pid == 0) {
        var child_region = ShmRegion.open("/zig_cleanup_demo", 4096);
        if (child_region.isValid()) {
            const stderr = std.io.getStdErr().writer();
            stderr.print("[child] attached to shared memory: fd={d}\n", .{child_region.fd}) catch {};

            // write something
            const data: *[4096]u8 = @ptrCast(child_region.ptr.?);
            @memcpy(data[0..12], "hello parent");

            // note: child does NOT destroy/unlink -- only the creator should unlink
            _ = c.munmap(child_region.ptr.?, child_region.size);
            _ = c.close(child_region.fd);
        }
        std.process.exit(0);
    }

    _ = std.posix.waitpid(pid, 0);

    // read what the child wrote
    const data: *[4096]u8 = @ptrCast(region.ptr.?);
    const msg_end = std.mem.indexOfScalar(u8, data, 0) orelse 4096;
    try stdout.print("[parent] child wrote: '{s}'\n", .{data[0..msg_end]});

    // region.destroy() called by defer -- unlinks the shared memory
    try stdout.print("Cleaning up...\n", .{});
}

Important cleanup rules:

Only the creator should shm_unlink -- other processes should munmap and close, but NOT unlink. Unlinking removes the name (the entry in /dev/shm/ on Linux) but doesn't destroy the memory until every mapping is unmapped and every descriptor is closed. Still, unlinking too early breaks any process that has yet to open the segment by name.
Unlink before create for robustness -- if a previous run crashed, the old segment might still exist. Unlink it first, then create fresh. The O_EXCL flag with O_CREAT fails if the object already exists -- useful for detecting stale segments.
Check /dev/shm/ after crashes (Linux) -- run ls /dev/shm/ and clean up any zig_* or sem.zig_* entries from crashed programs. They don't take much memory but they can confuse the next run.

Practical example: shared scoreboard between worker processes

Let's put it all together with something useful -- a shared scoreboard where multiple worker processes report their progress, and a monitor process reads the scoreboard to display live stats:

const std = @import("std");
const c = @cImport({
    @cInclude("sys/mman.h");
    @cInclude("fcntl.h");
    @cInclude("unistd.h");
});

const MAX_WORKERS = 8;

const WorkerStats = extern struct {
    pid: i32,
    items_processed: u64 align(8),
    errors: u64 align(8),
    status: u32,        // 0=idle, 1=running, 2=done
    _pad: [4]u8,
};

const Scoreboard = extern struct {
    magic: u32,
    num_workers: u32,
    start_time: i64,
    workers: [MAX_WORKERS]WorkerStats,
};

fn workerMain(board: *Scoreboard, worker_id: usize, work_items: usize) void {
    const slot = &board.workers[worker_id];
    @atomicStore(i32, &slot.pid, @intCast(std.os.linux.getpid()), .release);
    @atomicStore(u32, &slot.status, 1, .release); // running

    var i: usize = 0;
    while (i < work_items) : (i += 1) {
        // simulate work
        std.time.sleep(5 * std.time.ns_per_ms);
        _ = @atomicRmw(u64, &slot.items_processed, .Add, 1, .seq_cst);

        // simulate occasional errors
        if (i % 17 == 0) {
            _ = @atomicRmw(u64, &slot.errors, .Add, 1, .seq_cst);
        }
    }

    @atomicStore(u32, &slot.status, 2, .release); // done
}

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    const shm_name: [*:0]const u8 = "/zig_scoreboard";

    _ = c.shm_unlink(shm_name);
    const fd = c.shm_open(shm_name, c.O_CREAT | c.O_RDWR, 0o666);
    if (fd < 0) return;
    _ = c.ftruncate(fd, @intCast(@sizeOf(Scoreboard)));

    const ptr = c.mmap(
        null, @sizeOf(Scoreboard),
        c.PROT_READ | c.PROT_WRITE,
        c.MAP_SHARED, fd, 0,
    );
    if (ptr == c.MAP_FAILED) return;
    const board: *Scoreboard = @ptrCast(@alignCast(ptr));

    // initialize
    board.magic = 0x5C08E;
    board.num_workers = 4;
    board.start_time = std.time.milliTimestamp();
    @memset(std.mem.asBytes(&board.workers), 0);

    // spawn workers
    const num_workers: usize = 4;
    const work_per_worker: usize = 50;
    var child_pids: [MAX_WORKERS]std.posix.pid_t = undefined;

    for (0..num_workers) |w| {
        const pid = try std.posix.fork();
        if (pid == 0) {
            workerMain(board, w, work_per_worker);
            std.process.exit(0);
        }
        child_pids[w] = pid;
    }

    // monitor loop -- check the scoreboard periodically
    var all_done = false;
    while (!all_done) {
        std.time.sleep(100 * std.time.ns_per_ms);

        try stdout.print("\n--- Scoreboard ---\n", .{});
        all_done = true;
        var total_items: u64 = 0;
        var total_errors: u64 = 0;

        for (0..num_workers) |w| {
            const slot = &board.workers[w];
            const status = @atomicLoad(u32, &slot.status, .acquire);
            const items = @atomicLoad(u64, &slot.items_processed, .acquire);
            const errors = @atomicLoad(u64, &slot.errors, .acquire);
            const pid = @atomicLoad(i32, &slot.pid, .acquire);

            const status_str: []const u8 = switch (status) {
                0 => "idle",
                1 => "running",
                2 => "done",
                else => "???",
            };

            try stdout.print("  Worker {d} (pid {d}): {s} - {d}/{d} items, {d} errors\n", .{
                w, pid, status_str, items, work_per_worker, errors,
            });

            total_items += items;
            total_errors += errors;
            if (status != 2) all_done = false;
        }

        const elapsed_ms = std.time.milliTimestamp() - board.start_time;
        try stdout.print("  Total: {d} items, {d} errors, {d}ms elapsed\n", .{
            total_items, total_errors, elapsed_ms,
        });
    }

    // wait for all children
    for (0..num_workers) |w| {
        _ = std.posix.waitpid(child_pids[w], 0);
    }

    try stdout.print("\nAll workers finished.\n", .{});

    _ = c.munmap(ptr, @sizeOf(Scoreboard));
    _ = c.close(fd);
    _ = c.shm_unlink(shm_name);
}

This pattern is used everywhere in production systems. Database connection pools have a shared status table showing which connections are active. Web servers have a shared scoreboard (Apache's mod_status literally calls it a "scoreboard") showing which worker is handling which request. Monitoring daemons read these scoreboards to generate metrics dashboards.

The key insight is that the monitor process never blocks -- it just reads the atomic values and prints them. The worker processes never block either (they only write atomics). There are no semaphores, no mutexes, no syscalls for synchronization. The only coordination overhead is cache coherence between CPU cores, which the hardware handles transparently. For a read-mostly monitoring scenario like this, it's about as fast as IPC can get.

NB: the magic number 0x5C08E is just a cute "SCOBE" in hex (sort of) -- in real code you'd pick something more distinctive like 0xDEADBEEF or 0xCAFEBABE ;-)
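A common use for such a magic value, which the scoreboard program above does not actually need because it forks after initializing, is to let an independently started process check that the segment has been set up before trusting its contents. The sketch below assumes a trimmed-down, hypothetical Header struct and a hypothetical isInitialized helper; the creator writes the magic last with .release so that a reader who sees it also sees the other fields.

```zig
const std = @import("std");

const MAGIC: u32 = 0x5C08E;

// Hypothetical header, reduced to the fields this sketch needs.
const Header = extern struct {
    magic: u32,
    num_workers: u32,
};

// Returns true once the creator has published the magic value.
fn isInitialized(hdr: *const Header) bool {
    return @atomicLoad(u32, &hdr.magic, .acquire) == MAGIC;
}

pub fn main() void {
    // Stand-in for a freshly created segment, which starts out zero-filled.
    var hdr = std.mem.zeroes(Header);
    std.debug.print("before init: {}\n", .{isInitialized(&hdr)});

    hdr.num_workers = 4;
    // Publish the magic last so readers never see a half-initialized header.
    @atomicStore(u32, &hdr.magic, MAGIC, .release);
    std.debug.print("after init: {}\n", .{isInitialized(&hdr)});
}
```

An attaching process would call isInitialized on the mapped pointer and retry or bail out if it returns false.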

Exercises

Build a shared memory ring buffer that supports multiple producers. Create 3 child processes that each write 100 messages into a shared ring buffer, and a parent process that reads them all. Use an atomic head index (claimed with @atomicRmw .Add, or with a @cmpxchgWeak loop if you prefer compare-and-swap) so producers don't need a mutex -- each producer atomically claims a slot by incrementing head, then writes its data. Verify that all 300 messages are received and none are corrupted or lost.
Create a "process-safe" shared hash map. Map a large shared memory region (e.g. 1 MB) and implement a fixed-size hash table inside it with open addressing (linear probing). Use a per-bucket atomic spinlock (a u32 flag with @atomicRmw .Xchg) to protect individual buckets rather than locking the whole table. Spawn 2 child processes: one inserts 1000 key-value pairs, the other reads them concurrently. Verify that all inserted pairs can be retrieved correctly after both processes finish.
Implement a shared memory "blackboard" where processes publish and subscribe to named channels. The blackboard has a fixed-size header with channel metadata (name, offset, size, last-update sequence number) and a data region. A publisher writes data to a channel and increments its sequence number atomically. Subscribers poll the sequence number and only read when it changes. Test with 2 channels ("temperature" and "pressure") and 3 processes: one publisher writing both channels, two subscribers each watching one channel. Use .acquire/.release ordering instead of .seq_cst and explain in a comment why that's sufficient.
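As a starting point for the first two exercises, here is a rough sketch of the two atomic building blocks they rely on: claiming a slot with fetch-add, and a test-and-set spinlock built on .Xchg. It runs three threads in one process to stay self-contained; SLOTS, State, claimSlot, lock, unlock and worker are all hypothetical names, and a real solution would place the state in a shared mapping and fork instead.

```zig
const std = @import("std");

const SLOTS = 8; // hypothetical ring capacity

// Claim the next slot: fetch-add returns the previous head value, so every
// caller receives a distinct ticket even under contention.
fn claimSlot(head: *u64) usize {
    const ticket = @atomicRmw(u64, head, .Add, 1, .monotonic);
    return @intCast(ticket % SLOTS);
}

// Minimal test-and-set spinlock on a u32 flag (0 = free, 1 = held).
fn lock(flag: *u32) void {
    // .Xchg returns the old value; 0 means this caller took the lock.
    while (@atomicRmw(u32, flag, .Xchg, 1, .acquire) != 0) {
        std.atomic.spinLoopHint();
    }
}

fn unlock(flag: *u32) void {
    @atomicStore(u32, flag, 0, .release);
}

const State = struct { head: u64 = 0, flag: u32 = 0, guarded: u64 = 0 };

fn worker(state: *State) void {
    for (0..1000) |_| {
        _ = claimSlot(&state.head);
        lock(&state.flag);
        state.guarded += 1; // plain increment, protected by the spinlock
        unlock(&state.flag);
    }
}

pub fn main() !void {
    var state = State{};
    var threads: [3]std.Thread = undefined;
    for (&threads) |*t| t.* = try std.Thread.spawn(.{}, worker, .{&state});
    for (threads) |t| t.join();
    std.debug.print("tickets issued: {d}, guarded counter: {d}\n", .{ state.head, state.guarded });
    // prints: tickets issued: 3000, guarded counter: 3000
}
```

Note that claiming a slot only reserves it. A complete ring buffer also needs a way for the consumer to know when the producer has finished writing the slot, for example a per-slot ready flag stored with .release.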

What we learned

POSIX shared memory (shm_open + mmap) maps the same physical pages into multiple processes -- zero-copy data sharing, much faster than pipes for structured data
Shared data structures must use extern struct for guaranteed C-compatible layout, with careful attention to alignment (misaligned atomics can be non-atomic or crash)
Named semaphores (sem_open) work as cross-process mutexes and counting semaphores -- the classic producer-consumer pattern uses three: empty-count, full-count, and a mutex
Atomic operations (@atomicRmw, @atomicLoad, @atomicStore) work across process boundaries because they operate on physical memory -- no syscall overhead, hardware handles coherence
Shared memory shines for random-access data structures (scoreboards, hash tables, ring buffers) while pipes are better for streaming byte data between process chains
Cleanup is your responsibility: shm_unlink removes the name, munmap unmaps the memory, sem_unlink removes semaphores -- crashed programs leave stale entries in /dev/shm/
The scoreboard pattern (atomic writes from workers, lock-free reads from monitor) is used in production by databases, web servers, and monitoring systems for near-zero-overhead status reporting

Shared memory is the fastest IPC mechanism available on a single machine. It's also the most dangerous -- you're sharing raw memory between processes with no kernel safety net. The next episodes in this OS programming arc will look at how processes interact with the kernel itself -- signal delivery, Unix domain sockets for local networking, and the machinery that lets a process detach from its terminal and run as a background service. Each of these builds on the foundation we've laid with fork, pipes, and shared memory.

Thanks for reading!

@scipio