Ross Bencina February 2016

Benefits & drawbacks of as-needed conditional std::atomic_thread_fence acquire?

The code below shows two ways of acquiring shared state via an atomic flag. The reader thread calls poll1() or poll2() to check whether the writer has signaled the flag.

Poll Option #1:

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

Poll Option #2:

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

Note that option #1 was presented in an earlier question, and option #2 is similar to example code at cppreference.com.

Assuming that the reader agrees to only examine the shared state if the poll function returns true, are the two poll functions both correct and equivalent?

Does option #2 have a standard name?

What are the benefits and drawbacks of each option?

Is option #2 likely to be more efficient in practice? Is it possible for it to be less efficient?

Here is a full working example:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int x; // regular variable, could be a complex data structure

std::atomic<int> flag { 0 };

void writer_thread() {
    x = 42;
    // release value x to reader thread
    flag.store(1, std::memory_order_release);
}

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

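A minimal sketch of how the two poll functions might be exercised end to end. The run_example driver and its use_fence_variant parameter are hypothetical names, not part of the original listing; the reader simply spins until the chosen poll function returns true and only then touches x.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                     // regular variable, could be a complex data structure
std::atomic<int> flag{0};

void writer_thread() {
    x = 42;
    // release value x to reader thread
    flag.store(1, std::memory_order_release);
}

bool poll1() {
    return flag.load(std::memory_order_acquire) == 1;
}

bool poll2() {
    if (flag.load(std::memory_order_relaxed) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

// Hypothetical driver (an assumption, not from the question):
// runs one writer and polls from the calling thread.
int run_example(bool use_fence_variant) {
    // Reset shared state before the writer exists, so there is no race here.
    x = 0;
    flag.store(0, std::memory_order_relaxed);

    std::thread writer(writer_thread);
    while (!(use_fence_variant ? poll2() : poll1())) {
        std::this_thread::yield();
    }
    // Only examined after a successful poll, as the question stipulates.
    int seen = x;
    assert(seen == 42);
    writer.join();
    return seen;
}
```

Calling run_example(false) exercises option #1 and run_example(true) exercises option #2; on typical toolchains the program needs to be linked with thread support (for example -pthread with g++).
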
Answers


Cameron February 2016

I think I can answer most of your questions.

Both options are certainly correct, but they are not quite equivalent, because a stand-alone fence applies slightly more broadly. They are equivalent in terms of what you want to accomplish, but the stand-alone fence is not tied to the load of flag and could technically apply to other atomic operations around it as well (imagine this code being inlined). How a stand-alone fence differs from an acquire or release operation on a single atomic load or store is explained in this post by Jeff Preshing.
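
To illustrate the broader reach of a stand-alone fence, here is a small sketch in which a single acquire fence covers two relaxed loads at once. The names flag_a, flag_b, data_a, data_b, poll_both and read_sum are made up for this example and do not appear in the question.

```cpp
#include <atomic>
#include <thread>

int data_a = 0, data_b = 0;            // plain payloads, one per writer
std::atomic<int> flag_a{0}, flag_b{0};

void writer_a() {
    data_a = 1;
    flag_a.store(1, std::memory_order_release);
}

void writer_b() {
    data_b = 2;
    flag_b.store(1, std::memory_order_release);
}

bool poll_both() {
    // Two relaxed loads, neither of which orders anything by itself.
    int a = flag_a.load(std::memory_order_relaxed);
    int b = flag_b.load(std::memory_order_relaxed);
    if (a == 1 && b == 1) {
        // One fence synchronizes with both release stores, because both
        // relaxed loads are sequenced before it and read the stored values.
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

// Hypothetical driver: starts both writers and reads both payloads
// only after poll_both() has returned true.
int read_sum() {
    std::thread ta(writer_a), tb(writer_b);
    while (!poll_both()) {
        std::this_thread::yield();
    }
    int sum = data_a + data_b;
    ta.join();
    tb.join();
    return sum;
}
```

With acquire loads, each of the two loads would carry its own ordering; here the one fence does the job for both, which is the sense in which it "applies to other things as well".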

The check-then-fence pattern in option #2 does not have a name as far as I know. It's not uncommon, though.

In terms of performance, with my g++ 4.8.1 on x64 (Linux) the assembly generated by both options boils down to a single load instruction. This is hardly surprising given that on x86(-64) every load already has acquire semantics and every store has release semantics at the hardware level (x86 is known for its quite strong memory model).

For ARM, though, where memory barriers compile down to actual individual instructions, the following output is produced (using gcc.godbolt.com with -O3 -DNDEBUG):

For while (!poll1());:

.L25:
    ldr     r0, [r2]
    movw    r3, #:lower16:.LANCHOR0
    dmb     sy
    movt    r3, #:upper16:.LANCHOR0
    cmp     r0, #1
    bne     .L25

For while (!poll2());:

.L29:
    ldr     r0, [r2]
    movw    r3, #:lower16:.LANCHOR0
    movt    r3, #:upper16:.LANCHOR0
    cmp     r0, #1
    bne     .L29
    dmb     sy

You can see that the only difference is where the synchronization instruction (dmb) is placed -- inside the loop for poll1, and after it for poll2. So poll2 really is more efficient in this real-world case :-)
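
At the source level, the poll2 loop behaves roughly like the sketch below: spin on relaxed loads, then issue one acquire fence once the flag has been seen. The names ready, payload, wait_until_ready and produce_and_consume are assumptions for illustration only.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;               // plain data published by the producer
std::atomic<int> ready{0};

void wait_until_ready() {
    // No barrier is needed while the flag is still unset.
    while (ready.load(std::memory_order_relaxed) != 1) {
        std::this_thread::yield();
    }
    // A single barrier after the loop, matching where dmb ends up for poll2.
    std::atomic_thread_fence(std::memory_order_acquire);
}

// Hypothetical driver: one producer thread, the caller acts as the consumer.
int produce_and_consume() {
    payload = 0;
    ready.store(0, std::memory_order_relaxed);
    std::thread producer([] {
        payload = 7;
        ready.store(1, std::memory_order_release);
    });
    wait_until_ready();
    int seen = payload;        // safe: ordered after the acquire fence
    producer.join();
    assert(seen == 7);
    return seen;
}
```

Whether this saves anything depends on the target: on x86 the fence costs no instruction either way, while on ARM it keeps the barrier out of the spin loop.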
