Do you need a mutex to protect an int?

Recently I ran into a few pieces of code here and there that assumed int is an atomic type; that is, that when you modify the value of a variable from two or more different threads at the same time, all of the changes you’ve made to the value will remain intact.

But really, can you modify variables of basic types (integers, floats, etc.) from two or more threads at the same time without corrupting their values?

Until now I’ve been very cautious with questions of this kind. I’ve spent an enormous amount of time solving synchronization problems. So whenever I have a variable that is modified or accessed from two or more threads, I always put a mutex or semaphore (a spinlock in the kernel) around it, no questions asked.

Still, I decided to check things out one more time. I wanted to see what happens to a variable when two or more threads modify it simultaneously, so I wrote a short program that demonstrates exactly that.

Here’s the code. See my comments below to understand what it does and how it does it.

#define _GNU_SOURCE /* for cpu_set_t, CPU_ZERO/CPU_SET and sched_setaffinity() */
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <linux/unistd.h>
#include <sys/syscall.h>
#include <errno.h>

#define INC_TO 1000000 // one million...

int global_int = 0;

pid_t gettid( void )
{
	return syscall( __NR_gettid );
}

void *thread_routine( void *arg )
{
	int i;
	int proc_num = (int)(long)arg;
	cpu_set_t set;

	CPU_ZERO( &set );
	CPU_SET( proc_num, &set );

	if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
	{
		perror( "sched_setaffinity" );
		return NULL;
	}

	for (i = 0; i < INC_TO; i++)
	{
		global_int++;
	}

	return NULL;
}

int main()
{
	int procs = 0;
	int i;
	pthread_t *thrs;

	// Getting number of CPUs
	procs = (int)sysconf( _SC_NPROCESSORS_ONLN );
	if (procs < 0)
	{
		perror( "sysconf" );
		return -1;
	}

	thrs = malloc( sizeof( pthread_t ) * procs );
	if (thrs == NULL)
	{
		perror( "malloc" );
		return -1;
	}

	printf( "Starting %d threads...\n", procs );

	for (i = 0; i < procs; i++)
	{
		if (pthread_create( &thrs[i], NULL, thread_routine, (void *)(long)i ))
		{
			perror( "pthread_create" );
			procs = i;
			break;
		}
	}

	for (i = 0; i < procs; i++)
		pthread_join( thrs[i], NULL );

	free( thrs );

	printf( "After doing all the math, global_int value is: %d\n", global_int );
	printf( "Expected value is: %d\n", INC_TO * procs );

	return 0;
}

You can download the code and a Makefile here.

The setup is simple. The number of threads equals the number of CPUs in your machine (this includes cores and even hyper-threaded semi-cores). Each thread is pinned to a particular core with sched_setaffinity(), i.e. the scheduler is configured to run each thread on its own core, to make sure that all of the cores access the variable. Each thread then increments a global integer named global_int one million times.

In the meantime, the main thread waits for the worker threads to do their job. Once they’re done, it prints the value of global_int.

And the result is:

Starting 4 threads...
After doing all the math, global_int value is: 1908090
Expected value is: 4000000

I guess the numbers speak for themselves.

In case you still have some doubts, consider this: when I compile the code with the -O2 compiler option (enabling optimization), the final value of global_int is 4000000, as you may have expected. So it is not really black and white, and things may be different on different computers.
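
As it turns out (see the comments below), with -O2 gcc optimizes the whole loop away: each thread ends up performing a single addition instead of one million increments. Here is a rough C-level sketch of what the counting part of thread_routine() effectively becomes; the function name is purely illustrative, and note that even this single addition is still a non-atomic read-modify-write, so the correct result under -O2 is a matter of timing rather than atomicity.

/* Sketch (illustration only): under -O2 the counting loop collapses
 * into a single addition to the shared counter. */
void *thread_routine_optimized( void *arg )
{
	(void)arg;               /* CPU affinity setup omitted from this sketch */
	global_int += INC_TO;    /* one load, one add, one store */
	return NULL;
}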

However, the bottom line is that you want to be on the safe side. The reason the count gets corrupted is that global_int++ is not a single operation: it is a read, an increment and a write, and two threads can interleave between those steps and lose updates (see the discussion in the comments below). This little program is a simple proof that simultaneous access to even a basic variable can leave its value in an ambiguous state. So, to be on the safe side, protect your shared variables, whether they are complex data structures or simple integers.
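
For illustration, here is a minimal sketch of one way to protect the counter with a pthread mutex. The names count_with_mutex and global_int_lock are mine, and this is just one possible approach; atomic operations are another, as discussed in the comments.

#include <pthread.h>

extern int global_int;                    /* the shared counter from the program above */

/* A single mutex guards every access to global_int. */
static pthread_mutex_t global_int_lock = PTHREAD_MUTEX_INITIALIZER;

/* The counting loop from thread_routine(), with the increment protected. */
void count_with_mutex( void )
{
	int i;

	for (i = 0; i < INC_TO; i++)
	{
		pthread_mutex_lock( &global_int_lock );
		global_int++;                 /* the read-modify-write now happens exclusively */
		pthread_mutex_unlock( &global_int_lock );
	}
}

With this change the final value always comes out to INC_TO multiplied by the number of threads, regardless of the optimization level; the price is a lock/unlock pair on every single increment.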

If you have further questions, comment here or send me an email at alex@alexonlinux.com.

PS: I think this issue deserves a full-scale article. What do you think?


26 Comments

  1. Stavros says:

    Hello Alexander, I was looking around your site,
    the articles I read were very useful.

    I should mention that the above program can work
    without mutexes or locks.
    For such quantities usually
    the “inc” or “add” instruction is used,
    which all recent processors have.
    So if the “global_int++” is replaced with
    __asm__ __volatile__("inc %0" : "=m" (global_int) : "m" (global_int));
    it should work. Sorry for the gcc__IDIOT__assembly, I never learned
    its syntax completely…

  2. @Stavros
    Actually, no, it will not work. This is exactly what I’ve been trying to say in this article. Try running it on a multi-core system and you will see it for yourself: replace global_int++ with the pieces of assembly below and see the result.

    This is the version that should produce the correct result:
    __asm__ __volatile__("lock; incl %0;" : "=m"(global_int) : );

    This is the one you’ve suggested (fixed assembly :-)). It will not work.
    __asm__ __volatile__("incl %0;" : "=m"(global_int) : );
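
    A portable alternative, assuming GCC 4.1 or later, is to drop the inline assembly altogether and use the compiler’s atomic builtin. A minimal sketch of the counting loop with __sync_fetch_and_add (on x86 it compiles down to a lock-prefixed instruction):

    for (i = 0; i < INC_TO; i++)
    {
        __sync_fetch_and_add( &global_int, 1 );   /* atomic increment, no mutex or asm needed */
    }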

    BTW, I figured out why it produced the correct result when compiled with -O2. It appears that with -O2 gcc gets rid of the entire loop and simply adds INC_TO to global_int: it doesn’t do 1,000,000 increments, but adds 1,000,000 to the variable :-)

  3. David Brown says:

    There are two separate issues here. Read or write access to an int *is* atomic (on an x86 – if you are dealing with different sized processors, or different sized data, things can get more complicated). Reads and writes that are bigger than the processor’s size, such as accessing a 64-bit long long on a 32-bit x86, are *not* atomic (though accessing a 64-bit double *might* be atomic).

    However, incrementing an int consists of three operations – read the old value, increment it, and write the new value. Each of these individually is atomic, but the combination is not. Even if a single assembly instruction is used, the three parts are still broken up in the processor, and by the bus logic.
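
    In C terms, the increment behaves roughly like this sketch (in reality the temporary lives in a register, not in memory):

    int tmp = global_int;   /* 1. read   - atomic on its own */
    tmp = tmp + 1;          /* 2. modify - happens inside the CPU */
    global_int = tmp;       /* 3. write  - atomic on its own */
    /* Another thread can run between any of these steps, so two threads
       can read the same old value and one of the increments gets lost. */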

    But you can still use unlocked ints for atomic access, if you are careful about exactly what you do with them. For example, replace the for() loop in thread_routine with this code:

    while (*((volatile int *) &global_int) != proc_num) ;
    global_int = proc_num + 1;

    (The “volatile” hack is to limit the compiler’s optimisation – check the generated assembly, especially with -O2, to see why it’s needed.)

    Here you are only requiring atomic reads and writes, and the code will work properly.

  4. David, I am a little confused with some of the things in your comment.

    Originally Posted By David Brown: There are two separate issues here. Read or write access to an int *is* atomic (on an x86 – if you are dealing with different sized processors, or different sized data, things can get more complicated). Reads and writes that are bigger than the processor’s size, such as accessing a 64-bit long long on a 32-bit x86, are *not* atomic (though accessing a 64-bit double *might* be atomic).

    Here you say that accessing int is an atomic operation.

    However, incrementing an int consists of three operations – read the old value, increment it, and write the new value. Each of these individually is atomic, but the combination is not. Even if a single assembly instruction is used, the three parts are still broken up in the processor, and by the bus logic.

    Here you say that accessing int is not an atomic operation.

    But you can still use unlocked ints for atomic access, if you are careful about exactly what you do with them. For example, replace the for() loop in thread_routine with this code:

    while (*((volatile int *) &global_int) != proc_num) ;
    global_int = proc_num + 1;

    (The “volatile” hack is to limit the compiler’s optimisation – check the generated assembly, especially with -O2, to see why it’s needed.)

    I don’t understand the code. I didn’t try it, but I think I understand the volatile trick. Yet the rest of it is quite confusing. What is proc_num? Why is there a semicolon at the end of the while loop?

    • BigYellow says:

      for (i = 0; i < INC_TO; i++)
      {
          // procs = sysconf( _SC_NPROCESSORS_ONLN );
          // #define INC_TO 100000 ( 5 zeros )

          while ( (*(volatile int *)(&global_int)) % procs != proc_num );

          global_int++;
      }

  5. David Brown says:

    Originally Posted By Alexander Sandler: David, I am a little confused with some of the things in your comment.

    Sorry – perhaps I was a bit too concise! My background is in embedded programming (mostly on systems far too small to run Linux), so things like “volatile” in C, and how memory bus cycles work, are second nature to me. I haven’t done much of this sort of thing under Linux, however, so your articles are very informative.

    Originally Posted By David Brown: There are two separate issues here. Read or write access to an int *is* atomic (…).

    Here you say that accessing int is an atomic operation.

    However, incrementing an int consists of three operations – read the old value, increment it, and write the new value. Each of these individually is atomic, but the combination is not. Even if a single assembly instruction is used, the three parts are still broken up in the processor, and by the bus logic.

    Here you say that accessing int is not an atomic operation.

    Accessing an int (or another type that the processor can read or write in a single access) is atomic – it’s a single bus cycle that can’t be broken up. But a read-modify-write operation consists of three parts, and on a powerful processor (such as a modern x86 cpu), these *can* be broken up. If one thread starts a read-modify-write sequence, a second thread can break into it (on a processor like the x86, with an “inc” instruction, this can only happen with multiple processors – on RISC processors, it can happen with a single processor). However, the “read” part can’t be broken, nor can the “write” part. It’s the sequence that can be broken.

    But you can still use unlocked ints for atomic access, if you are careful about exactly what you do with them. For example, replace the for() loop in thread_routine with this code:

    while (*((volatile int *) &global_int) != proc_num) ;
    global_int = proc_num + 1;

    (The “volatile” hack is to limit the compiler’s optimisation – check the generated assembly, especially with -O2, to see why it’s needed.)

    I don’t understand the code. I didn’t try it, but I think I understand the volatile trick. Yet the rest of it is quite confusing. What is proc_num? Why is there a semicolon at the end of the while loop?

    The semicolon at the end of the while loop is because the while loop’s body is empty – it’s a busy-waiting loop. The “volatile” access typecast is necessary because otherwise the compiler’s optimiser would simply read “global_int” once, compare it to “proc_num”, and then either skip the loop or loop forever – “volatile” forces the compiler to re-read “global_int” for each iteration. Finally, “proc_num” is the processor number, taken from the argument given to thread_routine() – it’s in your code.

    I’d highly recommend you look at the generated assembly code for examples such as these, also using different optimisations (-O0, -O1 and -O2). It would give an insight into exactly what’s happening, and among other things show the necessity of “volatile”. There is a general rule that if code does not function identically (except for time and code size) with and without optimisation, then the code is wrong.

  6. Originally Posted By David Brown
    The semicolon at the end of the while loop is because the while loop’s body is empty – it’s a busy-waiting loop. The “volatile” access typecast is necessary because otherwise the compiler’s optimiser would simply read “global_int” once, compare it to “proc_num”, and then either skip the loop or loop forever – “volatile” forces the compiler to re-read “global_int” for each iteration. Finally, “proc_num” is the processor number, taken from the argument given to thread_routine() – it’s in your code.

    I’d highly recommend you look at the generated assembly code for examples such as these, also using different optimisations (-O0, -O1 and -O2). It would give an insight into exactly what’s happening, and among other things show the necessity of “volatile”. There is a general rule that if code does not function identically (except for time and code size) with and without optimisation, then the code is wrong.

    I can see that the semicolon at the end of the loop is because it’s a busy loop. What I am wondering is what the loop is needed for. I mean, if you replace the for loop in the program with the while loop you’ve suggested, the program will never leave the loop. It will spin infinitely, waiting for global_int to become proc_num. This will never happen, because while the threads wait in the busy loop, the value of global_int remains unchanged. Unless I’m missing something, I’d say that the code you’ve suggested is broken.

    As for volatile, I assume you mean the fact that I didn’t know why the program was working with -O2. Later I found that with -O2 gcc optimizes the entire loop out, leaving a simple: global_int += INC_TO;
    I suppose you wanted to say that the program behaves differently when compiled with and without optimization. You’re right of course, and I will fix it as you suggest.

  7. David Brown says:

    Originally Posted By Alexander Sandler

    I can see that the semicolon at the end of the loop is because it’s a busy loop. What I am wondering is what the loop is needed for. I mean, if you replace the for loop in the program with the while loop you’ve suggested, the program will never leave the loop. It will spin infinitely, waiting for global_int to become proc_num. This will never happen, because while the threads wait in the busy loop, the value of global_int remains unchanged. Unless I’m missing something, I’d say that the code you’ve suggested is broken.

    As for volatile, I assume you mean the fact that I didn’t know why the program was working with -O2. Later I found that with -O2 gcc optimizes the entire loop out, leaving a simple: global_int += INC_TO;
    I suppose you wanted to say that the program behaves differently when compiled with and without optimization. You’re right of course, and I will fix it as you suggest.

    It’s certainly possible that I’ve made a mistake in the code, as I haven’t tried it at all. The idea, however, is that the threads will busy-wait until “global_int” is equal to their “proc_num” argument, passed at the thread creation. Since “global_int” is initialised to 0, the first thread (with a “proc_num” of 0) will break out of its loop, and set “global_int” to 1. This will then free the next thread’s loop, and so on. In practice, you’d probably find the threads starting and exiting before the next thread has even been fully created, making it a bit boring – you could create the threads in reverse order to give them all a chance to run.

    Regarding optimisation, I was just making the point that optimisation does not affect the functionality of a program unless the code is incorrect. This is probably no surprise to you, but it does come as a surprise to many people dealing with these issues. In support lists for embedded development tools, I regularly see people complaining that “the compiler is broken” because their code works with no optimisation (or when using an inferior compiler), but fails when optimised. “Volatile” is one of the tools a programmer has for making sure the program does what he wants it to do, rather than just hoping the compiler will read his mind!

    But “volatile” and its connection with optimisation is straying a bit from your topic here. It’s perhaps worth an article of its own one day, if you are interested in such low-level topics.

  8. Sauvaget says:

    Alexander Sandler,

    This is a good article that gives a good understanding of a common problem with simple C++ types manipulated in threads. Good job!
    I was wondering what __sync_fetch_and_add() does if the “int global_int” data is unaligned? I guess it is better to force alignment with a compilation option. However, I haven’t tested anything yet. You do not mention any specific compilation option for this.

    Nicolas.

  9. Kd says:

    Thank you, David, for clearing up the biggest misunderstanding in this post: that an increment is a multi-step operation. The read and the write individually are atomic.

  10. Andre Goddard Rosa says:

    Thank you all for the insights and comments.

    @David Brown: Do you have any specific reference that you would recommend? I’d like to learn more about the memory barriers you’ve commented on before. Would you mind sharing how you got your background in this area?

  11. Angelo Borsotti says:

    Hi Alexander,

    according to your knowledge, are there any elementary data types
    (e.g. integers, etc.) for which assignment is guaranteed to be
    atomic on all architectures? I tried to have a look, e.g. to the
    ISO C standard, but have found nothing so far.
    Thanks
    -Angelo

  12. David Brown says:

    Originally Posted By Angelo Borsotti: Hi Alexander,

    according to your knowledge, are there any elementary data types
    (e.g. integers, etc.) for which assignment is guaranteed to be
    atomic on all architectures? I tried to have a look, e.g. to the
    ISO C standard, but have found nothing so far.
    Thanks
    -Angelo

    There are no data types that are guaranteed to be atomic on all architectures. However, you can get close if you limit yourself to some architectures. In particular, any 32-bit (or more) cpu will be able to do a simple read or write to an aligned 32-bit int atomically (and ints will typically be aligned unless you specifically avoid alignment). And most devices (but not all) are able to do atomic writes to an 8-bit byte.

    Don’t bother trying to look in the standards for this sort of thing – you won’t find anything relevant.

  16. […] this article I would like to continue subject I started in my previous two posts (post 1 and post2). Question I am trying to answer is what is the most efficient, yet safe way of accessing […]

  18. spriha says:

    I didn’t understand how the value of global_int ends up different from the expected value… please explain briefly…

  23. […] is a follow up on my previous post about need to protect access to even simplest variables when working in multi-threaded environment. […]
