Multithreaded simple data type access and atomic variables

Table of contents

IntroductionBACK TO TOC

In this article I would like to continue subject I started in my previous two posts (post 1 and post2). Question I am trying to answer is what is the most efficient, yet safe way of accessing simple data type variables from two or more threads. I.e. how to change a variable from two threads at the same time, without screwing its value.

In my first post I’ve shown how easy it is to turn variable value into garbage by modifying it from two or more threads. In my second post I am talking about spinlocks, a recent addition into pthread library. Spinlocks indeed can help to solve the problem. Yet spinlocks more suitable for protecting small data structures rather than simple data types such as int and long. Atomic variables, on the other hand, are perfect for the later task.

Key thing about atomic variables is that once someone starts reading or writing it, nothing else cannot interrupt the process and come in the middle. I.e. nothing can split the process of accessing atomic variable into two. This is why they called atomic.

On the practical side, atomic variables are the best solution for the problem of simultaneous access to a simple variable from two or more threads.

How atomic variables workBACK TO TOC

This is actually quiet simple. Intel x86 and x86_64 processor architectures (as well as vast majority of other modern CPU architectures) has instructions that allow one to lock FSB, while doing some memory access. FSB stands for Front Serial Bus. This is the bus that processor use to communicate with RAM. I.e. locking FSB will prevent from any other processor (core), and process running on that processor, from accessing RAM. And this is exactly what we need to implement atomic variables.

Atomic variables being widely used in kernel, but from some reason no-one bothered to implement them for user-mode folks. Until gcc 4.1.2.

Atomic variables size limitationsBACK TO TOC

From practical considerations, gurus at Intel did not implement FSB locking for every possible memory access. For instance, for quiet some time, Intel processors allow memcpy() and memcmp() implementation with one processor instruction. But locking FSB while copying large memory buffer can be too expensive.

In practice you can lock FSB while accessing 1, 2, 4 and 8 byte long integers. Almost transparently, gcc allows you to do atomic operations on int‘s, long‘s and long long‘s (and their unsigned counterparts).

Use casesBACK TO TOC

Incrementing a variable and knowing that no-one else screws its value is nice, but not enough. Consider following piece of pseudo-code.

decrement_atomic_value();
if (atomic_value() == 0)
    fire_a_gun();

Let us imagine that the value of an atomic variable is 1. What happens if two threads of execution try to execute this piece of pseudo-C simultaneously?

Back to our simulation. It is possible that thread 1 will execute line 1 and stop, while thread 2 will execute line 1 and continue executing line 2. Later thread 1 will wake up and execute line 2.

two_threads

When this happens, no one of the threads will run fire_a_gun() routine (line 3). This is obviously wrong behavior and if we were protecting this piece of code with a mutex or a spinlock this would not have happened.

In case you’re wondering how likely something like this to happen, be sure that this is very likely. When I first started working with multithreaded programing I was amazed to find out that despite our intuition tells us that scenario I described earlier is unlikely, it happens overwhelmingly often.

As I mentioned, we could solve this problem by giving up on atomic variables and using spinlock or mutex instead. Luckily, we can still use atomic variables. gcc developers have thought about our needs and this particular problem and offered a solution. Lets see actual routines that operate atomic variables.

The real thing…BACK TO TOC

There are several simple functions that do the job. First of all, there are twelve (yes, twelve – 12) functions that do atomic add, substitution, and logical atomic or, and, xor and nand. There are two functions for each operation. One that returns value of the variable before changing it and another that returns value of the variable after changing it.

Here are the actual functions:

type __sync_fetch_and_add (type *ptr, type value);
type __sync_fetch_and_sub (type *ptr, type value);
type __sync_fetch_and_or (type *ptr, type value);
type __sync_fetch_and_and (type *ptr, type value);
type __sync_fetch_and_xor (type *ptr, type value);
type __sync_fetch_and_nand (type *ptr, type value);

These are functions that return value of the variable before changing it. Following functions, on the other hand, return value of the variable after changing it.

type __sync_add_and_fetch (type *ptr, type value);
type __sync_sub_and_fetch (type *ptr, type value);
type __sync_or_and_fetch (type *ptr, type value);
type __sync_and_and_fetch (type *ptr, type value);
type __sync_xor_and_fetch (type *ptr, type value);
type __sync_nand_and_fetch (type *ptr, type value);

type in each of the expressions can be one of the following:

  • int
  • unsigned int
  • long
  • unsigned long
  • long long
  • unsigned long long

These are so called built-in functions, meaning that you don’t have to include anything to use them.

Time to see some actionBACK TO TOC

Back to the example I started in the first post I mentioned earlier.

To remind you, this small program opens several of threads. Number of threads is as number of CPUs in the computer. Then it binds each one of the threads to one of the CPUs. Finally each thread runs a loop and increments a global integer 1 million times.

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <linux/unistd.h>
#include <sys/syscall.h>
#include <errno.h>

#define INC_TO 1000000 // one million...

int global_int = 0;

pid_t gettid( void )
{
	return syscall( __NR_gettid );
}

void *thread_routine( void *arg )
{
	int i;
	int proc_num = (int)(long)arg;
	cpu_set_t set;

	CPU_ZERO( &set );
	CPU_SET( proc_num, &set );

	if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
	{
		perror( "sched_setaffinity" );
		return NULL;
	}

	for (i = 0; i < INC_TO; i++)
	{
//		global_int++;
		__sync_fetch_and_add( &global_int, 1 );
	}

	return NULL;
}

int main()
{
	int procs = 0;
	int i;
	pthread_t *thrs;

	// Getting number of CPUs
	procs = (int)sysconf( _SC_NPROCESSORS_ONLN );
	if (procs < 0)
	{
		perror( "sysconf" );
		return -1;
	}

	thrs = malloc( sizeof( pthread_t ) * procs );
	if (thrs == NULL)
	{
		perror( "malloc" );
		return -1;
	}

	printf( "Starting %d threads...\n", procs );

	for (i = 0; i < procs; i++)
	{
		if (pthread_create( &thrs[i], NULL, thread_routine,
			(void *)(long)i ))
		{
			perror( "pthread_create" );
			procs = i;
			break;
		}
	}

	for (i = 0; i < procs; i++)
		pthread_join( thrs[i], NULL );

	free( thrs );

	printf( "After doing all the math, global_int value is: %d\n",
		global_int );
	printf( "Expected value is: %d\n", INC_TO * procs );

	return 0;
}

To compile and run, throw this snippet into a file and run:

gcc -pthread "file name"

Then run ./a.out to execute the program.

Note lines 36 and 37. Instead of simply incrementing the variable, I use built-in function __sync_fetch_and_add(). Running this code obviously produces expected results – i.e. value of global_int is 4,000,000 as expected (number of CPUs in the machine multiply 1 million – in my case this is a 4 core machine). Remember that when I ran this code snippet leaving line 36 as is, the result was 1,908,090 and not 4,000,000 as we’d expect.

PrecautionsBACK TO TOC

When using atomic variables, some extra precautions have to be taken. One serious problem with atomic variable implementation in gcc is that it allows you to do atomic operations on regular variables. I.e. there is no clear distinction between atomic variables and regular variables. There is nothing that prevents you from incrementing value of the atomic variable with __sync_fetch_and_add() as I just demonstrated and later in the code doing same thing with regular ++ operator.

Obviously this might be a serious problem. Things tend to be forgotten and it is a matter of time until someone in your project or even you yourself will modify value of the variable using regular operators, instead of atomic functions that gcc has.

To address this problem, I strongly suggest wrapping around atomic functions and variables with either ADT in C or C++ class.

ConslusionBACK TO TOC

This article concludes a series or articles and posts where we investigate and study newest techniques in the world of multithreaded programming for Linux. Hope you’ll find these posts and article useful. As usual, in case you have further questions please don’t hesitate to email me to alexander.sandler@gmail.com.

Did you know that you can receive periodical updates with the latest articles that I write right into your email box? Alternatively, you subscribe to the RSS feed!

Want to know how? Check out
Subscribe page

46 Comments

  1. Andrey says:

    Hi ! May I ask you an unrelated question ?
    I ran into a strange behavior compiling your example code:

    gcc -lpthread -o atomic atomic.c

    produces the following error message:
    /tmp/ccCFT2Dq.o: In function `thread_routine’:
    atomic.c:(.text+0x3e): undefined reference to `CPU_ZERO’
    atomic.c:(.text+0×52): undefined reference to `CPU_SET’
    collect2: ld returned 1 exit status

    g++ -lpthread -o atomic atomic.c

    produces no errors.
    The results are just like you describe.

    What is wrong, and why do I experience this kind of difference between GCC and G++ ?

    Thank you.

  2. @Andrey
    I guess I have to apologize for not telling this. This code has to be compiled with -D_GNU_SOURCE compilation option. Otherwise, sched_setaffinity() and friends won’t compile.
    See this post as an example: http://www.alexandersandler.net/do-you-need-mutex-to-protect-int
    This post has same piece of code and also has a Makefile that compiles it.

  3. arnab says:

    Can the atomic variales be used in application space in arm 9 processor ?

  4. @arnab
    Honestly I don’t know and unfortunately I don’t have ARM system at hand to test. Anyway, you can try it yourself. Try calling one of those __sync_fetch_and_add() functions and see if gcc complains about them. You don’t have to include anything to use them, so if gcc complains, this means that atomic variables aren’t supported.
    Hope it helps.

  5. arnab says:

    @Alexander
    Thanks for your reply.
    i tried with a simple code in i386 machine with 2.6.18-1.2798.fc6 OS.

    The code is as follows

    #include
    void main() {
    atomic_t v;
    atomic_set(&v, 5); /* v = 5 (atomically) */
    atomic_add(3, &v); /* v = v + 3 (atomically) */
    atomic_dec(&v); /* v = v – 1 (atomically) */
    printf(“This will print 7: %d\n”, atomic_read(&v));
    }

    and when i compile , i get the following error….
    atomic.c: In function âmainâ:
    atomic.c:5: error: âatomic_tâ undeclared (first use in this function)
    atomic.c:5: error: (Each undeclared identifier is reported only once
    atomic.c:5: error: for each function it appears in.)
    atomic.c:5: error: expected â;â before âvâ
    atomic.c:6: error: âvâ undeclared (first use in this function)

    from the above observation it seems user space cannot support atomic variable. Is my understanding correct or am missing any thing ?

  6. @arnab
    It doesn’t work because compiler doesn’t know what atomic_t is. Try using __sync_fetch_and_add() on int.

  7. David Brown says:

    You can use the “atomic_t” type if you #include

    The __sync_XXX functions are not implemented on all targets supported by gcc – you basically have to try them and see. The “atomic_XXX()” functions in linux *will* work on all supported platforms, and will typically generate the same code as the corresponding __sync_XXX functions. For some targets that don’t have processor support for atomic RMW sequences, the linux functions will disable interrupts during the access – gcc built-ins can’t do that.

    It’s best to look at the generated assembly for some test code for your target, so that you understand exactly what’s going on.

    And watch out for “memory barriers”, or lack thereof – getting these wrong can lead to difficult bugs.

  8. Priyank says:

    Thanks again for the post, I was long looking for the InterlockedIncrement (http://msdn.microsoft.com/en-us/library/ms683614%28VS.85%29.aspx) equivalent API in Linux.

  9. @Priyank
    Sure. Thanks for visiting and please visit my web-site again :-)

  10. @David Brown
    First of all, let me apologize for the name of the header file. The web-site’s program interpreted the name of the header file as an embedded HTML of some sort and stripped it off. I suppose you meant to include asm/atomic.h

    To be honest I didn’t know that atomic_t can be used in user-mode. I’ll investigate it and perhaps write a new article comparing both gcc’s built-in functions and atomic_t. One thing is obvious already: atomic_t will work on any Linux platform, but then gcc works on many other platforms.

    As for memory barriers, I don’t think anyone should be worried about them too much. gcc supposed to take care of this.

    Thanks for your comment. It is most useful.

  11. David Brown says:

    Originally Posted By Alexander Sandler@David Brown
    First of all, let me apologize for the name of the header file. The web-site’s program interpreted the name of the header file as an embedded HTML of some sort and stripped it off. I suppose you meant to include asm/atomic.h

    To be honest I didn’t know that atomic_t can be used in user-mode. I’ll investigate it and perhaps write a new article comparing both gcc’s built-in functions and atomic_t. One thing is obvious already: atomic_t will work on any Linux platform, but then gcc works on many other platforms.

    As for memory barriers, I don’t think anyone should be worried about them too much. gcc supposed to take care of this.

    Thanks for your comment. It is most useful.

    I don’t know if atomic_t can be used in user-mode programs either, actually – I can only remember seeing it in kernel level code. I’m a beginner at low-level Linux programming, having worked mostly with smaller embedded systems.

    It’s true that gcc is available on more targets than Linux, but that doesn’t mean the gcc __sync functions are implemented on all of these. In particular, for platforms that don’t have any sort of hardware locking or atomic access support, there will be no __sync functions. However, you can simulate atomic access if you disable interrupts – that’s something gcc cannot do, but Linux’s atomic functions and macros *can* do.

    You are wrong about ignoring memory barriers, however. gcc *may* have barriers implicit in the __sync builtins, and various library functions *may* include them, but you have to check. In particular, many people incorrectly assume that “volatile” implies a memory barrier, and get caught out when their code apparently misbehaves. For example, consider this code:

    volatile int vi;
    void bar(void) {
    vi = 1;
    foo();
    vi = 0;
    }

    It’s often assumed that vi will be set to 1 before calling foo(), and cleared afterwards. In fact, you have no such guarantee – the compiler can happily move foo() around with respect to the “vi =” statements, as long as it knows that foo() does not make other volatile accesses. The correct way to ensure this code works is to use a memory barrier such as “asm volatile (“” ::: “memory”) before and after foo().

    When working on low-level code, it’s also important to remember that even this sort of code won’t guarantee that the processor will execute code as you want (due to out-of-order execution), it won’t store data in any given order (due to write buffers and caches), and it won’t read data in a given order (due to caches). These are issues that are not directly part of atomic access issues, but they are related, and can turn up in similar code. If you want to stray from the well-worn paths of using standard OS locks (and paying the price in run-time, as you’ve discussed), you need to be aware of these sorts of thing and how they may affect the correctness of your code. After all, correctness is more important than any speed factor!

  12. Andre Goddard Rosa says:

    @David Brown/Alexander: for userspace, we have sig_atomic_t for atomic variables, which resembles an int normally. And the restrictions you raised about RMW (read, modify, write) surely apply to it too.

  13. Cyril says:

    sig_atomic_t is only atomic with respect to signal. Don’t expect:

    // sig_atomic_t a;
    a++;

    To be atomic. If you want this operation to be atomic, you must use
    __sync_fetch_and_add(&a, 1);
    BTW, you’ll get in “a” either the previous value or the expected value depending on when you’ll look at a.

  14. Uday says:

    hi

    is there any way by which we can apply similar locking on structure or arrays (pointers) ???

    i dont want to use mutex… b’use it does not give performance….

    thnks

  15. @Uday
    Well, I am afraid there’s nothing similar that protects structures. One thing does work better than muteces is spinlock. But by all means study the subject well before using them. I’ve written a couple of articles on the subject. Here is one of them: http://www.alexonlinux.com/pthread-spinlocks

  16. @Cyril
    @Andre Goddard Rosa
    Guys, thanks for making this clear. I already saw sig_atomic_t, but didn’t have time to dig into all the details.

  17. Angelo Borsotti says:

    Hi Alexander,

    it seems that the __synch_xxx builtins have problems. See:
    http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html

    Bohem provides here an alternative implementation that
    claims to be correct: http://bdwgc.cvs.sourceforge.net/viewvc/bdwgc/bdwgc/libatomic_ops/

    -Angelo

  18. chenghuige says:

    Thanks a lot. I am new and interested in multi threading, your articles greatly help me! Especially when I read the gcc 4.4.2 string code.

  19. [...] I have seen no less than four questions this week on Stack Overflow regarding facilities for atomic ops on Linux (and portably speaking, anything else that hosts GCC). Yes, gcc does have some facilities for this, take a look at this rather good overview. [...]

  20. Shirish says:

    Excellent. I have been searching for ways to implement atomic variables in C++ user-level code. I never knew about these __sync_ methods.

    My workaround was to use atomic_t that was defined in . As “arnab” and “David Brown” mentioned, the header is not available for user-level code. I copied the header from the kernel source and used it as a normal user-defined header file. It worked fine. Each atomic operation is implemented as an assembly instruction along with memory barriers, whenever required.

  21. @Shirish
    Great. I am happy to hear you found it useful. Please visit my web-site again :-)

  22. [...] [8] http://www.newsmth.net/bbscon.php?bid=335&id=186723 [9] Multithreaded simple data type access and atomic variables [...]

  23. chinmoy says:

    Hi, thanks for nice article. i have a simple program that works on atomic locks.actually i want to simulate a program where 2 process will work on atomic variable, to show that it prevents race condition.

    /* Includes */
    #include /* Symbolic Constants */
    #include /* Primitive System Data Types */
    #include /* Errors */
    #include /* Input/Output */
    #include /* General Utilities */
    #include /* POSIX Threads */
    #include /* String handling */
    #include /* Semaphore */
    #include

    atomic_t counter = ATOMIC_INIT(0);

    int main()
    {
    int val;

    /* START CRITICAL REGION */

    printf(“Now in critical region…\n”);

    atomic_add( 1, &counter );
    atomic_inc( &counter );
    atomic_sub( 1, &counter );
    atomic_dec( &counter );
    val = atomic_read( &counter );

    printf(“Counter Value: %d\n”, val);
    printf(“Exiting critical region…\n”);
    /* END CRITICAL REGION */

    /* exit */
    exit(0);

    } /* main() */

    here is my program. but i could not compile it in linux using gcc. how can i compile it? btw, with 2 process what i mean is that say my src file name is test.c, so i will run the same program twice in two sperate console but will use atomic variable counter as a shared variable. how to do it?

    Best Regards.
    Chinmoy

  24. chinmoy says:

    #include (unistd.h)
    #include (sys/types.h)
    #include (errno.h)
    #include (stdio.h)
    #include (stdlib.h)
    #include (pthread.h)
    #include (string.h)
    #include (semaphore.h)
    #include (asm/atomic.h)

    sorry include tag is not posted correctly. so i replace lt and gt by brakcet. hope you understand.
    sorry i forgot to mention my error while compling to gcc. it says
    “atomic_t undeclared first use in this function”

    thanks again

  25. @chinmoy
    The problem with your code is that you’re using kernel atomic variables. Since your program is a user-mode program, asm/atomic.h doesn’t provide you with atomic_t type.
    I’d suggest you to use atomic variables the way I described them in the article.

  26. chinmoy says:

    Hi Alex, thanks for your suggestion. and really sorry to reply late. ya i have done it the way you suggested in your article. now the program compiles and runninng well.
    thanks again.

  27. Manju says:

    Hello Alex,
    Thanks for informative article and interesting conversations following that.
    I have two queries.
    1. I need to do a automic_read which of the __sync shd I use.
    __sync_fetch_and_add(pts, 0) be fine ? or is there any other specific function for read

    2. Are sync* function interrupt safe ? i.e. will gcc take care of disabling interrupts while performing these operations ?

    Thanks

  28. Manju says:

    One more
    3. whats the equivalent of atomic_set ?
    I see in atomic.h that atomic_set is defined as
    #define atomic_read(v) ((v)->counter)
    So a normal assignment statement should be safe here ?

  29. @Manju
    1. __sync_fetch_and_add(pts, 0) is the way!
    2. The question of interrupt safety is irrelevant here because all these functions translate into single processor instruction – interrupts land between instructions, not in the middle of an instruction.
    3. There are couple of solutions for this. Most of them are not pretty. First there’s a __sync_lock_test_and_set(). gcc documentation says that one should be careful when using this function because it is not implemented on all gcc platforms. Should be safe on Intel though.
    Another option is to do assignments before you use the variable atomically and then to do relative adjustments only (those you can do atomically). This is not very convenient, but this is all gcc got.
    Hope it helps.

  30. Michael Cole says:

    Exceptionally beneficial bless you, It looks like your trusty readers would most likely want far more writing of this nature carry on the excellent work.

  31. Mohammed Shahid says:

    Wouldn’t __sync_fetch_and_sub correspond to fetch-and-subtract rather than fetch-and-substitute ?

  32. Zamir says:

    Alex,
    This is a wonderful piece. Thank you for helping us understand atomic operations in C!

  33. This and the two before are great articles, useful and pleasant to read.
    Thanks

  34. yt says:

    Thanks Alex,I has found we can implement these function by gcc asm,like __sync_fetch_and_add,I write this:
    typedef atomic_t int;
    static inline atomic_t fetch_and_add(atomic_t *value,atomic_t key)
    {
    asm volatile(“lock;addl %1,%0″
    :”+m”(*value)
    :”ir”(key)
    :”memory”);
    return *value;
    }
    but it’s limited for intel x86,reference this in detail:
    http://lxr.linux.no/linux+v2.6.26.4/include/asm-x86/atomic_32.h

  35. @yt
    Yes, your implementation uses i386 assembler. Whenever you try to do something like this on some other platform, it probably will not compile.

  36. __sync_fetch_and_add | Asterisk FAQs says:

    [...] An excellent article about Multithreaded simple data type access and atomic variables, by Alexander Sandler,  is available here. [...]

  37. Angel Genchev says:

    FSB actually stands for Front SIDE Bus, not serial. Since P6 (Intel pentium Pro) some cpus have a back side bus which allows aynchronous operation with L2 while the front side bus is busy.
    Nothing should be worshiped. To me it`s interesting what happens in an 64-core NUMA system (possible 4x AMD 16-core cpus) when a thread wants to execute an atomic operation.
    If the instruction stalls all 48 cores in the rest (than current) CPUs, than … it`s evil.

  38. Jose Fonseca says:

    Does atomic functions work in signal handler functions?

  39. surya says:

    Articles are really good, advanced concepts are easier to understand. Thanks for your time to write such concepts.

  40. surya says:

    I know how IoCompletion port works in Windows, is there any equivalent implementation in unix, if you provide how select and pool works in socket programming where single server like stock exchnage sends same data to multiple clients in TCP environment.

  41. barfat says:

    Greeting!!

    I used your sample code to test cpu performance , as link showes :

    http://stackoverflow.com/questions/14392656/sched-setaffinity-cpu-affinity-in-linux

    Since this server has 1 socket cpu with 4 cores, I think it should
    make no big differences which 2 cores to run the 2 threads,
    but I am wrong , May I ask what cause this result ?

  42. kappen says:

    I have search what is “Front Serial Bus”,
    does FSB stand for Front Side Bus?
    you have said when access the atomic variables,may lock the FSB,
    this may cost a lot.
    i hava test the sync_add_and_fetch,
    one operatiron takes 13ns in my machine, a memory access takes 100ns,so i do not agree with you

Leave a Reply


nine − 7 =