Alex on Linux

Why you need a mutex to protect an int

Alexander Sandler — Tue, 05 Jul 2016 02:49:56 +0000

This is a follow up on my previous post about need to protect access to even simplest variables when working in multi-threaded environment. In this post I would like to explain what’s going on under the hood and why you actually need some protection here.

As in many other computer architectures, in x86 memory access is fairly slow. To give you some perspective, accessing register in CPU takes roughly .5 nano-seconds. Changing value in main memory takes roughly 100 nano-seconds. That’s 200 times slower.

There is number of reasons for that. Perhaps the simplest reason is that typical memory works at much lower frequencies compared to CPU. Typical CPU works at frequencies 2.5-3.5GHz while typical DRAM memory works at 1.333GHz. In addition to that, typical DRAM has to recharge it’s memory cells once in awhile. So typical DRAM doesn’t actually work some of these 1.333GHz.

Another reason for slow memory access is the fact that simple memory access require multiple operations. First of all CPU has to translate virtual address to physical address. Then it has to calculate memory bank that has specific memory cell. Then it has to send a request to read the memory. Somewhere along the way it has to gain access to Front Side Bus – bus that CPUs use to access RAM in x86.

Here’s another reason. CPU is 20-30cm away of memory. It takes roughly 1 nano-second for electrical signal to travel 30cm.

Typical x86 processor tries to solve the problem of slow memory using several different techniques:

Memory works in units of 32 and sometimes even 64 bytes called memory lines. I.e. when CPU reads memory cell, it typically reads 32 byte line.
Multiple levels of cache make sure that memory access is effective. Typically there are 3 levels of cache named L1, L2 and L3. Sometimes there are four levels. Sometimes there are two. With three levels of cache, L1 usually is local to CPU core, L2 local to physical CPU and L3 shared between several CPUs.
Special cache called instruction cache caches program’s instructions.
CPU does not immediately writes changes to main memory. Instead CPU tries to delay updating main memory hoping there will be additional changes in same memory line. Changes stored in so called Store Buffer.

Store buffer is the key to what happens with plain integer when multiple threads try to change it at the same time.

x86 maintains so called cache coherency. I.e. when one of the processors modifies one of the memory lines, all other processors become aware of that and modify their own cache. So it is impossible for one processor to modify some memory region, without other processors being aware of this.

Processors use variation of a protocol called MESI (stands for Modified, Exclusive, Shared, Invalid, denoting four states of a state machine representing state of a memory line from single processor’s point of view) to synchronize ownership of memory lines. All processors listen to Front Side Bus at all times and synchronize their state of the memory line based on what other processors send over FSB.

So, if x86 is cache coherent, when one of the threads updates an integer, all other threads should become aware of that, right? Well, store buffer really messes up with this process. When one CPU updates an integer, the update is not written to main memory. It is stored in store buffer until CPU decides to flush the buffer. Only then variable value gets written to main memory.

When one thread increments the variable, it assumes current value of that variable based on what is in main memory. It may increment the variable once or twice before writing it to main memory. However, value that it writes is value of variable some time ago, plus few increments it performed. At the same time other processors do exactly the same. This causes both of them to miss some of the increments they performed.

Cautionary tale about using threads and fork()

Alexander Sandler — Mon, 25 May 2015 03:34:31 +0000

I ran into one interesting problem. I found a process stuck on mutex. This sounds like a common deadlock, but it wasn’t that obvious. Thread that was stuck is the only thread in the process.

I clearly saw from printouts that that mutex was locked in the past by some other thread that no longer exist. Sometimes you can get into this kind of problems when you are not handling exceptions properly. The code is written in C++.

Unfortunately for me, that mutex is always locked using guard variable – guard variable locks the mutex in constructor and unlocks it in destructor. So any exception thrown while mutex locked, wouldn’t have caused the problem – when guard variable goes out of scope, it’s destructor would unlock the lock.

My first hunch was signals. One very common behavior when working with signals is when you are locking a mutex and then receiving a signal. If signal handler tries to lock same mutex you end up with deadlock. Fortunately for me the process was still alive and I was able to check the backtrace. Backtrace showed that thread is not stuck in the signal handler.

And this is it. I did not have second hunch. It took me a while to realize what really happened.

This process expected to run in the background. So the most natural thing for it to do was to fork() right after starting. A moment before that it was launching a thread that handle some asynchronous tasks. When process forks, the child process inherits memory and file descriptors from parent process. One thing that it is not inheriting is its threads.

In my particular case parent created a thread. Thread started running, locking the mutex. At this exact moment parent process called fork() and child process was born with that mutex locked.

To conclude, it never stops to amaze me how simple and yet how complicated multi-threaded programming can be. As a precaution, try not to mix multi-processed and multi-threaded programs. You may get surprising results.

Spam

Alexander Sandler — Sun, 27 Apr 2014 19:34:17 +0000

Folks, there seems to be some technical issues with “Notify me when new comments arrive” feature. The feature allows you to post a comment on a alexonlinux.com and receive email notifications when someone responds.

I got few complains about spam comments getting to people’s mailbox. And now it sends multiple emails too. So I decided to disable this feature. Please accept for apologies for the inconvenience. Please let me know if you see other problems.

Best regards,
Alex.

Bloom filters

Alexander Sandler — Wed, 04 Sep 2013 03:08:24 +0000

Bloom filter is a data structure that contains set of elements. Unlike regular data structures it cannot contain data that is associated with certain key. Neither it can contain keys themselves. The only type of information it can contain is whether certain key belongs to a set or not.

You must be wondering what it is useful for. Here’s typical scenario for using bloom filter. Lets say you have large data structure and you often have to check if particular member is in the data structure. For example, lets say you have large binary tree and you often query the tree if it contains some element.

Since actual data is in large binary tree, checking if some element in the tree takes logarithmic time. For a tree that contains one million objects, that would be 20 levels tall tree. To optimize access time, you may want to check if certain entry is in the data structure before you search the tree. Lets say when adding a new entry, you may want to check if that entry already in the tree. When deleting, you may also verify that element you’re trying to delete is there.

I came across bloom filters through disk based data structures. Accessing some data structure in logarithmic time may be good enough even for a very large data structures. But not when data structure is on disk. For just any large data structure that lies on disk, I/O operations is a scarce resource. So, reducing number I/O operation can be a game changer.

Representing large data structures is where bloom filters shines. When you insert an element into your data structure, you first insert it into bloom filter. Bloom filter resides in memory and so its easy to check before you check your main data structure. This is why it is called filter – it actually filters out unnecessary look-ups on main data structure.

Another interesting property of the bloom filter is that it often consumes less space per element than element’s key. In fact it can perform fairly well with just 10 bits per element it stores (more on this later). However, this does not come for free. Query on bloom filter can return false positive result on some of the queries. This is why it is called probabilistic data structure. Probability of false positive result can be controlled and calculated.

So, the way to use bloom filter is this. If bloom filter says entry is not in the set, then it is for sure not in the set. If bloom filter says entry is in the set, then check large data structure for the entry. Most likely it will be there.

Now lets talk about structure of the bloom filter. Bloom filter is just a set of bits. Each bit represents a sub-set of elements in the element universe. Am I talking like a mathematician here? Oh well… if you have a bunch of possible values for your object, one bit of bloom filter represents not just one, but some of them.

When one of the elements present in the set, bit that represents it set to one. This also means that if we query about some other element that by chance represented by the same bit, we will get a positive result, despite the element may not be in the set – this is where probabilistic property of bloom filters come from.

Controlling probabilities

First of all, lets see what happens when you have large number of entries per bloom filter bucket. Lets say your bloom filter represents entries in the large data structure. Lets also assume that bloom filter has bits.

There is an interesting problem in probability theory called birthdays problem. The problem asks following question: how many people should work in one company so that probability that two of them have birthday on the same day is above 50%. Interestingly enough, the answer is approximately square root of all possibilities. There are 365 options for birthdays, so the answer is 19.

This problem often called paradox because the right answer is not intuitive. One would expect the answer to be much bigger.

Birthdays paradox applies here as well. It is enough to insert entries into bloom filter for probability of single bit representing two or more elements to be more than 50%. This is a serious problem when working with bloom filter.

Luckily there is a way out. For simplicity, lets say we just 10 bits in the bloom filter. We added one element into bloom filter, so one of the bits is on. Now, as you remember, when querying for element that is in the filter, we will get 100% accurate result. It is when we query for element that is not in the filter where we get false positives. Lets see when we get false positive for element that is not in the filter.

With just 1 element inside of 10 bit long bloom filter, probability of getting false positive result is . This is so because we assume equal distribution of the hash function that convert element into right bit in the bloom filter. So, one element inside, means one bit is set and querying for different element can accidentally hit this bit with probability of .

Now lets see how we make this probability smaller. Lets say that when inserting element into bloom filter, instead of setting one bit, we will set two bits. So now, to query if element is in the bloom, we will check two bits. Its important that for every entry, it will be two different bits. If we find that both bits set, we will conclude that element is in the filter.

In this case, probability of false positive will be . This is because probability of hitting first set bit is and probability if hitting second set bit is . So, we clearly see that probability of hitting drops when we use several bits.

However, we cannot increase number of bits per element forever. After all, our entire bloom filter is 10 bits so using 10 bits for single element will make false positive probability of 100% after inserting just one element. In fact, jump to 100% chance of false positive is gradual. Using more bits per element decreases number of elements we can safely insert into bloom filter. Lets see how.

With just one bit per element, probability of hitting false positive answer for querying is . With two bits inside, probability of hitting false positive when querying for third element is . This is obviously lower than .

With two bits per element, probability of hitting false positive is . Adding second element and querying for third, makes probability of false positive . So, its clear that probability of false positive when querying for third element decreased from to simply because we’re using one more bit per element.

In fact, as we will see in later, bloom filter is a function of four parameters: overall number of bits in the bloom, number of bits per element, probability of false positive and how many elements are in the bloom.

Fine tuning bloom filter

Like I said, bloom filter is a function of four different variables. Lets name them:

– number of bits in bloom filter
– number of bits that will represent single entry
– number of entries in the bloom filter when calculating probability of false positive
– probability of false positive answer

As we just saw, calculating probabilities for large bloom filter can be quiet complicated task. Luckily there are some approximations that can help us to get a clue of what’s going on.

Normally, when calculating various properties of the bloom filter, we can start with making an assumption on maximum number of elements in the filter (). Also, you can decide what is the probability of false positive you can tolerate. With this in hand, lets see how we choose . I’ll spare you the task of calculating probabilities and how they affect our and give you the final formula:

Chart below shows number of bits in bloom filter per element as a function of probability of getting false positive.

As you can see, with probability around 0.01%, you will need 19 bits per element and with probability of 5% you can settle with just 6 bits per element. And with 9 bits per element, you will get 1% chance of getting false positive.

Now lets see how many bits in the set will represent single entry in the filter. Here comes another useful approximation.

So, with you will set 7 bits per element in the bloom. This will give us 1% chance of false positive. From here, the math is fairly simple, so you can handle it yourself.

Deleting entries from bloom filter

As long as you set single bit per every element that you insert into bloom filter, you can always reset that bit to delete the entry. However, once you start using multiple bits per element, you cannot delete entries anymore. This is because when you reset a bit, same bit can represent other entry in the bloom filter. Bloom filters have this property of returning 100% correct answer when entry is in the bloom filter. When you zero a bit, you break this property.

In fact, you cannot delete entries from bloom filter. You can only insert new entries. This means that when implementing bloom filter and its use, it might be wise to add mechanisms that will expire it. I.e. delete entire bloom filter once in a while.

There is a way to implement bloom filter so that you’d be able to delete its entries. However, this comes with a price.

Counting bloom filter

The idea is this. Instead of having array of bits in the bloom filter, lets have an array of small counters. For instance, you may want to use 4 bit counter instead of just one bit. This obviously increases bloom filter size 4 times, but it brings a useful property.

With counter instead of plain bit, you can actually count how many times you’ve set this “bit”. I.e. instead of setting the bit, you’ll increment the counter. This allows you to delete entries. To delete an entry, decrement all counters that represent that entry.

There is one caveat however. You have to be careful with not overflowing the counter. If it overflows and becomes 0 again, this will break “true positive” property of the bloom filter. I.e. checking if entry that is in the filter is indeed in the filter will return negative result.

So, to keep the counter from overflowing, when incrementing, you should check if you’ve reached maximum value. Once bloom filter reaches it maximum value, you cannot change it anymore. If you increase it, it will overflow and you will break the filter. If you decrease it, well, its value will no longer represent correct number of times you’ve incremented it and you won’t know that. So the only way around this problem is to keep the counter at its maximum value and never change it again.

To conclude, obviously this solution has two major drawbacks:

Bloom filter size increases significantly.
This approach does not solve the problem completely because if you reach maximum value of your counter, you cannot change it anymore, thus you revert to original bloom filter design (one with one bit), at least for one entry in the filter.

However, with careful design, you can use this approach to solve the problem of deletes from bloom filter using this approach, while keeping memory consumption under control.

Hash function

One last topic that I would like to touch is hash function. As we saw, we need a hash function that will receive an object as its input and produce number of buckets as its output. Obviously we can use one of the well known cryptographically secure hash functions such as SHA1 or MD5.

One problem with these hash functions is that they require more processing and may be an overkill. Another problem is that it is not obvious how to get two, three or even may-be four different hash values for the same input. One way might be running SHA1 on input, then running it on output of the first run, then running on output of the second run, etc.

While searching for various hash functions for bloom filter, I ran into one particular implementation that I liked. Its called murmur and it is available here.

printf() vs stream IO in C++

Alexander Sandler — Tue, 14 Aug 2012 19:10:51 +0000

Before joining Dell I was mostly working in kernel writing in C programming language. At Dell I still work on mostly low level stuff, but this time it is user-mode, so I am not tied up to C anymore. We’re writing in C++ and I am learning C++. One of the less appealing things for me in C++ was streaming support and the way its input/output implemented. In particular I got used to printf() functions family and leaving those in favor of streams and cout was tough. What really strikes me is the fact that no C++ book explains this stuff. All C++ books just tell you – so, my dear, this is how this stuff is done in C++. It took me some time to realize how C++ style input/output is much more convenient and powerful than printf() family of functions. Here’s why.

The C++ way

So, In C++ you normally work with stream of some sort. Stream is an abstraction that may hide anything in the world. It can be string stream that writes things to string or debug output stream that sends debug printouts to some remote debug log server. Once you have a stream, you redefine operator << for various classes to write into stream instead of doing bit-shift. For input streams you define operator >> that reads data from string or from socket and initializes object of some class with values it read. All those operator << and >> seems kind of weird. And in fact, this can be any of the operators. For some reason, standard library uses bit-shift operators and this turned into a standard. Boost, popular C++ extensions library uses operator% to implement similar functionality, although in slightly different way.

Convenience

At first, defining operator << for every class I write seemed like a tedious task. But you just do it once and then you can use it forever. Moreover, you can use classes that already have it to implement more stuff. So you just do it once really.

It makes much more sense in object oriented language

Since we’re talking object oriented here, C++ way to output stuff actually makes much more sense. The way object represented as a string is a property is the object itself.

Performance

Here is the interesting part. In printf(), the standard C library has to parse format string every time you print something. So, at runtime, every time you print something, standard C library parses your format string to figure out what and how many parameters you passed to printf(). This is important for high performance systems. Intensively printing, just anything, whether this is a debug log or part of some communication protocol rises CPU consumption of your program to the roof.

I worked for one company where they came up with a really neat trick to overcome the problem of high CPU consumption. They had to implement debug logging in C for their high performance product. So, instead of formatting the string every time there is a printout, they saved format string and then every printout they just saved the arguments in their binary form. To read the debug log, they had to read format strings that were periodically saved in the debug log file. Then they read binary arguments for the debug printout and finally formatted the message. So instead of formatting the printouts at runtime, they formatted the printouts when debug log is read.

Another problem occurs when you by mistake pass wrong data type to printf() function and compiler misses it. Another problem is when you pass a string that is not properly null terminated. Etc. Etc. C++ way solves all these problems. First, data type resolving happens during compilation. It is up to compiler to detect what type of data you’re trying to print and to call right operator<<. This improves CPU consumption situation. For the same reason, it is much harder to print something that is of unexpected type. Compiler is usually pretty good at resolving data types.

Wrapping it up

So, it seems that C++ way of producing output both better conforms object oriented way of programming and yields higher quality results. The only problem is that it requires you to get used to, but then, you do it just once.

gcc macro language extensions

Alexander Sandler — Wed, 08 Feb 2012 22:28:41 +0000

One of the great things about gcc and in particular its C/C++ preprocessor is various extensions that it has. In this post I would like to briefly describe three of them. One allows to turn C/C++ token into a string. Here token is anything that you can pass as an argument to a macro. Second allows you concatenate two tokens to create new expression. The last one allows C/C++ macros with variable number of arguments.

Stringifying a token

Its amazing how useful this is. Take following code for example.

std::cout << "obj.member1: " << obj.member1 << std::endl;
std::cout << ", obj.member2: " << obj.member2 << std::endl;
std::cout << ", obj.member3: " << obj.member3 << std::endl;
std::cout << ", obj.member4: " << obj.member4 << std::endl;
std::cout << ", obj.member5: " << obj.member5 << std::endl;
std::cout << ", obj.member6: " << obj.member6 << std::endl;
std::cout << ", obj.member7: " << obj.member7 << std::endl;
std::cout << ", obj.member8: " << obj.member8 << std::endl;
std::cout << ", obj.member9: " << obj.member9 << std::endl;
std::cout << ", obj.member10: " << obj.member10 << std::endl;
std::cout << ", obj.member11: " << obj.member11 << std::endl;
std::cout << ", obj.member12: " << obj.member12 << std::endl;
std::cout << ", obj.member13: " << obj.member13 << std::endl;
std::cout << ", obj.member14: " << obj.member14 << std::endl;

Wouldn’t you give a kidney just not to write name of every single member of obj twice? Well, it appears that this can be done. Watch this:

#define PMEM(mem) #mem ": " << mem
#define PCMEM(mem) ", " #mem ": " << mem

Now you can do the following:

std::cout << PMEM(obj.member1) << std::endl;
std::cout << PCMEM(obj.member2) << std::endl;
std::cout << PCMEM(obj.member3) << std::endl;
std::cout << PCMEM(obj.member4) << std::endl;
std::cout << PCMEM(obj.member5) << std::endl;
std::cout << PCMEM(obj.member6) << std::endl;
std::cout << PCMEM(obj.member7) << std::endl;
std::cout << PCMEM(obj.member8) << std::endl;
std::cout << PCMEM(obj.member9) << std::endl;
std::cout << PCMEM(obj.member10) << std::endl;
std::cout << PCMEM(obj.member11) << std::endl;
std::cout << PCMEM(obj.member12) << std::endl;
std::cout << PCMEM(obj.member13) << std::endl;
std::cout << PCMEM(obj.member14) << std::endl;

These two macros will do most of the job for you. Unfortunately, they cannot write the code for you, so you will have to write names of members of obj at least once. # operator does one simple thing. Whatever you use it on turns into a string. Just in case you’re wondering, I am using here another gcc’s feature – string concatenation. gcc allows you to take two immediate strings and concatenate them. First I turned expression obj.member1 into a string using # operator # and then I concatenated it with ": ". Note that stringification of tokens only works inside of macro. Writing something like this:

std::cout << #some_token << std::endl;

will produce compilation error and for a good reason. Another interesting thing is the fact that you can turn anything into a string, even if it is not a valid C/C++ expression. Take a look at the code below:

#define DPRINT(a) #a
std::cout << DPRINT(a + b) << std::endl;
std::cout << DPRINT(hello world) << std::endl;

This code will print two strings, first is a + b and second is hello world. This is despite the fact that hello world is not a valid C/C++.

Token concatenation

Using this feature you can construct new C/C++ tokens using existing tokens. For instance, if you have a large structure and you want to write a function for every member of the structure. One way to do that is by writing the code manually. But I guess you don’t need me for that.

struct some_struct {
    int member1;
    bool member2;
    unsigned long member3;
}

#define ADD_GETTER(TYPE, MEMBER) \
    TYPE get_ ## MEMBER(struct some_struct& st) { \
        return st.MEMBER; \
}

ADD_GETTER(int, member1);
ADD_GETTER(bool, member2);
ADD_GETTER(unsigned long, member3);

Lets analyze this piece of code for second. First I defined a structure called some_struct. Next, I wanted a macro that defines getter function for every member of some_struct. I added ADD_GETTER macro for that. Then I called it three times in a row providing type of the field in some_struct and name of the member.

Calling a macro for member1 expanded to following piece of code:

int get_member1(struct some_struct& st) {
    return st.member1;
}

Notice how it created name of the function. This is concatenation operation in action. ## makes gcc and g++ preprocessor concatenate two tokens, get_ and member1 into single token. ## operator removes all space characters between two tokens. Another thing that it does is eliminating white space and punctuation characters between two tokens. This is especially useful when implementing macros with variable number of arguments.

Macros with variable number of arguments

You can define a macro with variable number of arguments following way:

#define VMACRO(argument1, argument2, ...) do_something()

The three dots as last argument of the macro tells compiler that this is a variadic macro. I.e. this is a macro that receives variable number of arguments. To get access to arguments, you have to use special keyword __VA_ARGS__. Like this:

#define VMACRO(argument1, argument2, ...) do_something(__VA_ARGS__)

In this example I am ignoring argument1 and argument2 and passing remaining arguments to do_something() routine. When I first learned about this feature, I immediately tried to use it for debug printouts macros. This is the code that I’ve written.

#include 

#define DPRINT(format, ...) printf("DEBUG: " format, __VA_ARGS__) 

int main()
{
    DPRINT("hello world");
}

Note that strings have to be immediate values. For instance calling DPRINT(format, "..."); where format is pointer to string will not work because gcc cannot concatenate format with “DEBUG” string. Anyway, I wanted to address something different. You will be surprised to learn that this code doesn’t compile. This is because after preprocessing this code turns into something that is not valid C/C++. This is how main will look like after preprocessing:

int main()
{
    printf("DEBUG: " "hello world", );
}

Note the comma character after “hello world”. The thing is that empty token is valid token in gcc, so passing nothing as argument translates into nothing. There is a workaround for this problem. That is using concatenation operation. Lets change our implementation of DPRINT a little.

#include 

#define DPRINT(format, ...) printf("DEBUG: " format, ##__VA_ARGS__) 

int main()
{
    DPRINT("hello world");
}

Note concatenation operator before __VA_ARGS__. I already mentioned that concatenation operator gets rid of white space and punctuation characters between two tokens. This is exactly what it does in this case – it removes comma between format and empty token leaving clean printf("DEBUG: " "hello world"); This is exactly what we needed.

UML cheatsheet

Alexander Sandler — Mon, 19 Sep 2011 16:48:44 +0000

Every once in awhile, I have to draw a UML diagram. I rarely do serious designs with UML, however sometimes I do need to depict some piece of code in a diagram and UML seems to be the best notation around.

Unfortunately, various sources of information on UML tend to over-complicate things. I am not software architect and drawing UMLs is not my job. So my UML skills are poor by definition. Moreover, I am happy with this situation and don’t see it changing in the future (even if I get promoted ).

So from time to time I need a simple UML reference card. Simple search finds references like this one, which are excellent if you are serious about UML, and I am not.

Eventually, I decided to write a short UML class diagram reference card for myself. I hope you will enjoy it as well.

Inheritance

So, this is how classes inherit one from another. Here Child class inherits from Parent Class.

Use

This is how User class uses Resource class.

Contains and manages

Here, Whole class contains and manages Part class. This type of relation can be extended to one of the following:

One to One
One to Many
Many to One
Many to Many

References

Here, Whole class references to Part class, but does not manage it. Again, can be extended with:

One to One
One to Many
Many to One
Many to Many

This is enough information for now. I’ll probably extend it over time. In any case, please post your corrections and suggestions.

Making writes durable – is your data on disk?

Alexander Sandler — Mon, 19 Sep 2011 16:26:03 +0000

Here is an interesting article written by Evan Jones. The article explains how you can be guaranteed when your data is on disk.

In case you’re wondering, when write(), fwrite() or any other library call that writes data to disk reports success you are not guaranteed that the data is actually on the disk. In fact, in Linux, write() reports success when data is in dirty cache. Then, special kernel thread kicks in and makes sure that the data is on disk.

Depending on circumstances, it may take some time until writer kernel thread will finish writing. Anyway, in his post Evan talks about how to make sure that the data is actually stable on disk.

Models for multithreaded applications

Alexander Sandler — Thu, 21 Apr 2011 22:00:18 +0000

As you know, I changed a couple of workplaces during my career. Long story short, one interesting thing that I noticed in different companies is various models for multi-threaded programs (mostly for large embedded systems).

Lets say we have a multi-threaded program. The program should handle heterogeneous tasks. On a very basic level, our program probably receives some data, processes it and produces an output. Question of how to manage threads usually arises in performance critical applications. Otherwise, it is quiet simple – we squeeze into single thread as many tasks as possible.

Actual data processing may take several steps. We may want to split it to several tasks, each handled by thread of its own.

Thread per task model

When using this model, we allocate single thread per task. Since we want to pass work from one thread to another, we will need some synchronization mechanism.

I suppose we’ll need some queues to pass things between threads. Also, we may need muteces or semaphores to synchronize access to queues and other shared objects.

One obvious weakness of this approach is performance. When moving task objects from one thread to another we often move them from CPU to CPU. Modern CPUs heavily deploy caching mechanisms. So keeping objects local to one CPU brings significant performance boost. By moving task from one CPU to another we heart caching and cause performance degradation.

Another performance related problem is that while your task object moves from one thread to another it should undergo at least one context switch. Depending on overall number of tasks, time it takes to process single task can be greater. Another cause of context switches is mutual exclusion mechanism that have to protect the data structure used to communicate objects between threads.

One last significant performance issue that I would like to mention is in the core of this approach. Each thread is bound by processing power of a single core. But what if one of the tasks require more processing power. The computer may have enough processing power. This is especially true with recent multi-core trend. But when a single thread runs on one core at a time inevitably each thread processing the request, is a potential bottle-neck.

Advantage of this approach is in its simplicity. You don’t have to develop significant infrastructure to make it work. All you need is pthreads and a couple of classes that do the job. All tools that you will use are available out of the box. Because it is so common, it will be easier to find skilled and experienced engineers to implement and use with this model.

Worker threads

In this model, each task split into multiple jobs. Each job being handled by single thread. There is number of worker threads each taking care of at most single job at any particular moment. To improve performance, system tries to handle jobs that belong to same task on the same core. This means that there might be number of threads per single core.

Unlike thread per task model, this model is much more difficult to implement. It implies some constraints on the tasks:

Engineer designing such system should be able to split tasks into multiple jobs. Alternatively, tasks can be very short.
Jobs should be more or less equal in size in terms of required processing resources.

In addition, this approach requires smart task management. Such system should implement scheduler that spreads tasks between cores. A serios problem arises when number of running tasks becomes larger than number of cores in the system. When it happens, scheduler should start two or more tasks on single core. To do it wisely, it should be aware of how much processing power required by tasks. It should manage number of tasks running on particular core and assign new task to least busy core.

Needless to say that this approach is way more complex than naive thread per task model. It requires more skilled engineers. Moreover, these engineers should spend their time developing something that is not directly related to the matters of the project – that is various infrastructures for the project.

On the other hand, this approach allows engineers to squeeze every bit of performance out of the machine. This approach will take the hardware to its limit.

Other approaches

There are many other approaches of course. Most of them however, are an attempt to merge between two models I described. For instance, one company started with thread per task model for their embedded system and quickly ran into performance problems. So they tried to solve problems by splitting busiest tasks into multiple threads, basically changing threading model for one particular thread into worker threads model.

Recently I ran into another model that is somewhat unique. This model implements so called user-mode threads. Entire program runs in a single thread. Like worker threads model, in this model tasks split into smaller jobs. However, instead of letting the operating system to schedule between job execution, this model implements a user-mode scheduler.

The actual scheduling is made using setjmp() and longjmp() calls. One obvious constrain that this model presents is that if thread goes to sleep in operating system terms it will send entire system to sleep. As a result, the system should have very tight control over what threads does and in particular how it sleeps. I.e. threads are not allowed to do regular I/O. Instead they should do I/O in a centralized way.

For example, thread that wishes to read from a file-descriptor should register this file-descriptor with scheduler. Scheduler will select() or poll() on all registered file-descriptors. Depending on what file-descriptor requires attention, scheduler will schedule thread that takes care of that file-descriptor.

In this model you can run only a single instance of scheduler on single processor. So this model is not entierly optimized for MP machines. If you still want to run multiple schedulers on single computer then you can either run them in separate processes or multiple threads of the same process. In both cases communication between threads that belong to different schedulers is cumbersome. For instance if run multiple schedulers in multiple threads, you cannot use mutexes to synchronize between threads that belong to various schedulers because locking such mutex will send entire system to sleep. Instead you should use some sort of I/O based IPC.

Despite number of disadvantages that this approach has, there are couple of good things into it. Synchronizing between threads that belong to same instance of scheduler is much easier because there is, at most, one thread that runs at any particular moment. So, if, for instance, you have some simple data structure that being used by multiple threads there is no need to protect it from simultaneous access.

Python for bash replacement

Alexander Sandler — Sun, 26 Dec 2010 16:25:37 +0000

When I started learning Python, I was looking for a programming language that would replace BASH, AWK and SED. I am a C/C++ programmer and as such I better invest my time into studying C and C++. Instead, every time I needed some complex script I opened up a book on BASH and refreshed my knowledge. And since bumping into boundaries of what BASH can do is relatively easy, I always opened awk/sed book few minutes later.

Actually, this is quiet common. Once in a while I see my colleagues, just like myself, open up a book on BASH. The problem is that because we don’t actively program BASH, the knowledge and experience that we gain from this experience wear out over time. So next time we approach, so we have to repeatedly study BASH stuff over and over again. And again, this is not only BASH I am talking about, but also AWK and SED.

It is utterly broken state of affairs and I wish there was a solution. Unfortunately there is no solution yet. The good thing is that with some effort the solution may arise. I am talking about Python programming language.

Needless to say that Python can do everything that BASH can do. However, it was never designed as shell environment and as such it is hardly convenient.

For instance, running external commands is easy. You should use subprocess module to run the command and there you go, you have both return value and the output. But this is miles away from convenience of simply typing the command and hitting enter.

Another issue is input/output redirections and piping. Regular shell does this with ease. Python obviously does this as well, but not without a hassle of importing the right module and creating the right object, etc.

On the other hand things are not that bad. For instance implementing command line should be somewhat easy in Python because it has native support for completion (which it uses in its own command line interface).

There is a project that makes few steps in the right direction. I am talking about IPython of course. Unfortunately, I doubt that IPython will ever be able to replace BASH. It was never their goal. Even though IPython developers implemented some features that make IPython command line a little down to earth, I am in doubt they will continue moving in this direction. Also, I must say that IPython evolves quiet slowly.

What do you think? Will Python ever be able to replace BASH?