Like A Girl

Pushing the conversation on gender equality.

Code Like A Girl

Treacherous shared counters

When several threads run concurrently on the cores of a multicore processor they may need to share data. In a write-sharing mode, where one thread reads and another one writes, data needs to be protected in order to avoid race conditions, which may result in lost or corrupted data. This excellent article by Hans Boehm and Sarita Adve goes in depth about all the bad things that can happen if you don’t properly synchronize your accesses to shared variables.

Most programmers are aware that synchronization, typically performed with atomic instructions wrapped in locks or mutexes, is costly. Few, however, realize that even if shared variables are left unprotected, accessing them can become extremely costly on modern multicore machines. Let’s understand why.

First, let’s see why anyone would ever want to use unprotected shared variables in the first place. While most programmers know that this is not safe in general, there are situations where they believe shared accesses to be so uncommon occasional race conditions could be tolerated. A typical example is a shared statistics counter.

volatile int ops_completed;
...
static void increment_ops() {
ops_completed++;
}

If multiple threads update this counter infrequently, a programmer might reason, the value is likely to be correct. If there is an occasional race we may lose an update or two (imagine two different threads simultaneously reading the counter into a register, incrementing the value in the register, and then writing their values back — this is how the function above will likely be compiled). But that’s not a big deal; after all, we are keeping track of statistics, not relying on accurate counter values for program correctness, so an approximation is good enough.

If races are in fact very rare, the programmer is probably right. With older machines and compilers we could get a counter totally corrupted if writing a single large counter (e.g., 64 bits) required two registers (e.g., 32 bits each), but with modern machines and compilers this situation is unlikely.

That being said, programmers greatly underestimate the likelihood of races on modern machines with dozens of cores. This has undesired consequences for both performance and accuracy of the resulting values. We will talk about accuracy later; for now, let’s focus on performance.

The following chart show performance of a workload retrieving database records from MongoDB’s WiredTiger key-value store circa 2013.

LevelDB dbbench executing read sequential workload on the WiredTiger key-value store circa 2013.

At that time statistics counters in the key-value store were unprotected (this is no longer the case). The x-axis shows the number of threads used to read the records, which in our case is equivalent to the number of cores given to the workload. The workload is read-only, sequential and memory resident. There is no synchronization whatsoever and we expect it to be blazingly fast. Why is it then that the system configured with Statistics ON performs FOUR TIMES WORSE than the system configured with Statistics OFF on 16 cores?

To find the answer we need to understand a little about how modern multicore hardware works. Even if the programmer is not invoking any synchronization primitives, the hardware is doing synchronization anyway! Modern systems are built with coherent caches. Each core has its own private cache — a small piece of fast memory that keeps a copy of a small portion of data from main memory. (Usually there are also shared caches on multicore systems, but that’s not important here). Roughly speaking, when a core writes a value into a memory location, it gets put into its private cache. The actual details are slightly more complicated, but again, that’s not important. If a copy of the same memory location happens to live in caches of other cores, the hardware will make sure that these copies have consistent values. So it will send messages to other caches to either invalidate the old copy or to forward the new value.

This whole process, called coherency protocol, is very expensive on modern multicore systems and is the culprit of the performance drop we are observing. Modern systems increasingly look like distributed systems — networks of interconnected nodes. For example, here is a schematic overview of one large AMD Bulldozer system in our lab (similar to the one used in the reported experiment).

Image courtesy of Baptiste Lepers

You see eight nodes, each containing a bunch of cores with private and shared caches and its own chunk of main memory (not shown), connected by an intricate network. Sending messages across this or other similar networks becomes expensive and slows down the cores. These deleterious performance effects can be observed on different kinds of hardware, not just my weird lab machine. I measured similar effects on AWS Intel nodes. For an excellent piece of scientific evidence, take a look at this excellent paper on statistics counters written by my friends from Oracle labs.

The first take-away from this experience is that performance can really suffer due to shared variables even if the software is not invoking any synchronization.

What about data accuracy? I have not done any measurements myself, but that same paper from my Oracle friends extensively explored this question. The take-away is that even with very infrequent writes to a shared counter (1% of all the work) we can lose between 10%, 75% or 99% (!) of all updates if we run with 8, 64 or 256 threads.These shared counters are very treacherous indeed!

So be careful if you choose to elide synchronization, make sure you know what you are doing. Again, I highly recommend Hans’ and Sarita’s excellent article.

What about WiredTiger? We reimplemented shared statistics counters to accumulate values per-thread and then safely synchronize them once in a while — an application of a classical reduction optimization. So this workload and other similar ones scale beautifully with statistics or without them.