Anatomy of an MVar operation : ezyang’s blog

Anatomy of an MVar operation

Adam Belay (of Dune fame) was recently wondering why Haskell’s MVars are so slow. “Slow?” I thought, “aren’t Haskell’s MVars supposed to be really fast?” So I did some digging around how MVars worked, to see if I could explain.

Let’s consider the operation of the function takeMVar in Control.Concurrent.MVar. This function is very simple, it unpacks MVar to get the underlying MVar# primitive value, and then calls the primop takeMVar#:

takeMVar :: MVar a -> IO a
takeMVar (MVar mvar#) = IO $ \ s# -> takeMVar# mvar# s#

Primops result in the invocation of stg_takeMVarzh in PrimOps.cmm, which is where the magic happens. For simplicity, we consider only the multithreaded case.

The first step is to lock the closure:

("ptr" info) = ccall lockClosure(mvar "ptr");

Objects on the GHC heap have an info table header which indicates what kind of object they are, by pointing to the relevant info table for the object. These headers are also used for synchronization: since they are word-sized, they can be atomically swapped for other values. lockClosure is in fact a spin-lock on the info table header:

EXTERN_INLINE StgInfoTable *lockClosure(StgClosure *p)
{
    StgWord info;
    do {
        nat i = 0;
        do {
            info = xchg((P_)(void *)&p->header.info, (W_)&stg_WHITEHOLE_info);
            if (info != (W_)&stg_WHITEHOLE_info) return (StgInfoTable *)info;
        } while (++i < SPIN_COUNT);
        yieldThread();
    } while (1);
}

lockClosure is used for some other objects, namely thread state objects (stg_TSO_info, via lockTSO) and thread messages i.e. exceptions (stg_MSG_THROWTO_info, stg_MSG_NULL_info).

The next step is to apply a GC write barrier on the MVar:

if (info == stg_MVAR_CLEAN_info) {
    ccall dirty_MVAR(BaseReg "ptr", mvar "ptr");
}

As I’ve written before, as the MVar is a mutable object, it can be mutated to point to objects in generation 0; thus, when a mutation happens, it has to be added to the root set via the mutable list. Since mutable is per capability, this boils down into a bunch of pointer modifications, and does not require any synchronizations. Note that we will need to add the MVar to the mutable list, even if we end up blocking on it, because the MVar is a retainer of the thread (TSO) which is blocked on it! (However, I suspect in some cases we can get away with not doing this.)

Next, we case split depending on whether or not the MVar is full or empty. If the MVar is empty, we need to block the thread until the MVar is full:

/* If the MVar is empty, put ourselves on its blocking queue,
 * and wait until we're woken up.
 */
if (StgMVar_value(mvar) == stg_END_TSO_QUEUE_closure) {

    // We want to put the heap check down here in the slow path,
    // but be careful to unlock the closure before returning to
    // the RTS if the check fails.
    ALLOC_PRIM_WITH_CUSTOM_FAILURE
        (SIZEOF_StgMVarTSOQueue,
         unlockClosure(mvar, stg_MVAR_DIRTY_info);
         GC_PRIM_P(stg_takeMVarzh, mvar));

    q = Hp - SIZEOF_StgMVarTSOQueue + WDS(1);

    SET_HDR(q, stg_MVAR_TSO_QUEUE_info, CCS_SYSTEM);
    StgMVarTSOQueue_link(q) = END_TSO_QUEUE;
    StgMVarTSOQueue_tso(q)  = CurrentTSO;

    if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
        StgMVar_head(mvar) = q;
    } else {
        StgMVarTSOQueue_link(StgMVar_tail(mvar)) = q;
        ccall recordClosureMutated(MyCapability() "ptr",
                                         StgMVar_tail(mvar));
    }
    StgTSO__link(CurrentTSO)       = q;
    StgTSO_block_info(CurrentTSO)  = mvar;
    StgTSO_why_blocked(CurrentTSO) = BlockedOnMVar::I16;
    StgMVar_tail(mvar)             = q;

    jump stg_block_takemvar(mvar);
}

A useful thing to know when decoding C-- primop code is that StgTSO_block_info(...) and its kin are how we spell field access on objects. C-- doesn’t know anything about C struct layout, and so these “functions” are actually macros generated by utils/deriveConstants. Blocking a thread consists of three steps:

We have to add the thread to the blocked queue attached to the MVar (that’s why blocking on an MVar mutates the MVar!) This involves performing a heap allocation for the linked list node as well as mutating the tail of the old linked list.
We have to mark the thread as blocked (the StgTSO modifications).
We need to setup a stack frame for the thread so that when the thread wakes up, it performs the correct action (the invocation to stg_block_takemvar). This invocation is also responsible for unlocking the closure. While the machinery here is pretty intricate, it’s not really in scope for this blog post.

If the MVar is full, then we can go ahead and take the value from the MVar.

/* we got the value... */
val = StgMVar_value(mvar);

But that’s not all. If there are other blocked putMVars on the MVar (remember, when a thread attempts to put an MVar that is already full, it blocks until the MVar empties out), then we should immediately unblock one of these threads so that the MVar can always be left in a full state:

    q = StgMVar_head(mvar);
loop:
    if (q == stg_END_TSO_QUEUE_closure) {
        /* No further putMVars, MVar is now empty */
        StgMVar_value(mvar) = stg_END_TSO_QUEUE_closure;
        unlockClosure(mvar, stg_MVAR_DIRTY_info);
        return (val);
    }
    if (StgHeader_info(q) == stg_IND_info ||
        StgHeader_info(q) == stg_MSG_NULL_info) {
        q = StgInd_indirectee(q);
        goto loop;
    }

There is one interesting thing about the code that checks for blocked threads, and that is the check for indirectees (stg_IND_info). Under what circumstances would a queue object be stubbed out with an indirection? As it turns out, this occurs when we delete an item from the linked list. This is quite nice, because on a singly-linked list, we don't have an easy way to delete items unless we also have a pointer to the previous item. With this scheme, we just overwrite out the current item with an indirection, to be cleaned up next GC. (This, by the way, is why we can't just chain up the TSOs directly, without the extra linked list nodes. [1])

When we find some other threads, we immediately run them, so that the MVar never becomes empty:

// There are putMVar(s) waiting... wake up the first thread on the queue

tso = StgMVarTSOQueue_tso(q);
StgMVar_head(mvar) = StgMVarTSOQueue_link(q);
if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
    StgMVar_tail(mvar) = stg_END_TSO_QUEUE_closure;
}

ASSERT(StgTSO_why_blocked(tso) == BlockedOnMVar::I16); // note: I16 means this is a 16-bit integer
ASSERT(StgTSO_block_info(tso) == mvar);

// actually perform the putMVar for the thread that we just woke up
W_ stack;
stack = StgTSO_stackobj(tso);
PerformPut(stack, StgMVar_value(mvar));

There is one detail here: PerformPut doesn’t actually run the thread, it just looks at the thread’s stack to figure out what it was going to put. Once the MVar is put, we then wake up the thread, so it can go on its merry way:

// indicate that the MVar operation has now completed.
StgTSO__link(tso) = stg_END_TSO_QUEUE_closure;

// no need to mark the TSO dirty, we have only written END_TSO_QUEUE.

ccall tryWakeupThread(MyCapability() "ptr", tso);

unlockClosure(mvar, stg_MVAR_DIRTY_info);
return (val);

To sum up, when you takeMVar, you pay the costs of:

one spinlock,
on order of several dozen memory operations (write barriers, queue twiddling), and
when the MVar is empty, a (small) heap allocation and stack write.

Adam and I puzzled about this a bit, and then realized the reason why the number of cycles was so large: our numbers are for roundtrips, and even with such lightweight synchronization (and lack of syscalls), you still have to go through the scheduler when all is said and done, and that blows up the number of cycles.

[1] It wasn’t always this way, see:

commit f4692220c7cbdadaa633f50eb2b30b59edb30183
Author: Simon Marlow <marlowsd@gmail.com>
Date:   Thu Apr 1 09:16:05 2010 +0000

    Change the representation of the MVar blocked queue

    The list of threads blocked on an MVar is now represented as a list of
    separately allocated objects rather than being linked through the TSOs
    themselves.  This lets us remove a TSO from the list in O(1) time
    rather than O(n) time, by marking the list object.  Removing this
    linear component fixes some pathalogical performance cases where many
    threads were blocked on an MVar and became unreachable simultaneously
    (nofib/smp/threads007), or when sending an asynchronous exception to a
    TSO in a long list of thread blocked on an MVar.

    MVar performance has actually improved by a few percent as a result of
    this change, slightly to my surprise.

    This is the final cleanup in the sequence, which let me remove the old
    way of waking up threads (unblockOne(), MSG_WAKEUP) in favour of the
    new way (tryWakeupThread and MSG_TRY_WAKEUP, which is idempotent).  It
    is now the case that only the Capability that owns a TSO may modify
    its state (well, almost), and this simplifies various things.  More of
    the RTS is based on message-passing between Capabilities now.

jberryman says:

March 5, 2014 at 7:29 pm

Thanks for writing this up. FWIW with the improvements in MVar operations in ghc 7.8 I measure a put+takeMVar at 14.3ns vs 11ns for a write+readIORef, with creation being about equal; so not currently particularly slow in comparison to anything capable of replacing it (of course my measurements may be nonsense).

Demi Marie says:

October 14, 2016 at 8:19 pm

I wonder how much of the delay is for the GC write barrier