Hex dump functions

This post starts a new category of posts called Code library. In this category I will post useful code snippets. I hope you’ll find them as useful as I do.

I use the two functions below quite often. Both produce a hexadecimal dump of a given buffer. The first function is in C; the second is in Python. Enjoy.


New article – C/C++ reference counting with atomic variables and gcc

This article explains how to implement performance-critical reference counting in a C/C++ program using atomic variables and gcc. Enjoy it.

Read the article here.

C/C++ reference counting with atomic variables and gcc

Introduction

Let's say we have a data structure that manages objects, and we would like to manipulate the data structure and the objects from two or more threads. To achieve maximum performance, we have to distinguish between the mechanism we use to protect the data structure itself and the mechanism we use to protect the actual objects.

What is reference counting needed for?

To explain why two separate mechanisms provide the best performance and how atomic variables aid us, consider the following scenario.

The first thread (the manipulator) has to find a particular object in the data structure and change it. The second thread (the eraser) deletes expired objects. From time to time, the manipulator tries to work on an object that the eraser is trying to delete. This is a classical scenario that can be found in almost any multithreaded program.

Obviously, the two threads have to use some sort of mutual exclusion mechanism. Let's say we protect the entire data structure with a single mutex. In this case, the manipulator thread has to hold the mutex locked for as long as it takes it to find an object and manipulate it. This means the eraser thread won't be able to access the data structure for as long as the manipulator thread works. If all the manipulator thread does is search for entries and manipulate them, the eraser thread simply never gets access to the data structure.

Depending on the circumstances, this can be quite devastating for the system. And, as I stated earlier, the only way to solve this problem is to separate the mechanism that protects the data structure from the mechanism that protects the actual objects.

If we do so, the manipulator thread will hold the mutex only while it searches for the object. Once it has the object, it can release the mutex that protects the data structure, allowing the eraser thread to do its job.

But then how can we make sure that the eraser thread does not delete the very same object that the manipulator thread is modifying?

We could use another mutex to protect the contents of the actual object. But there is one important question we have to ask ourselves here: how many mutexes do we need to protect the objects' contents?

If we use one mutex for all objects, we run into the same bottleneck we've just avoided. If, on the other hand, we use a mutex per object, we run into slightly different trouble.

Assume we've created a mutex per object. How do we manage the mutexes? We can put the mutex into the object itself, but this creates another problem. Suppose the eraser thread decides to delete the object, but a moment before it does, the manipulator thread decides to check the object and tries to lock its mutex. Since the mutex has already been taken by the eraser thread, the manipulator falls asleep. Yet right after the manipulator falls asleep, the eraser deletes the object and its mutex, leaving the manipulator sleeping on a non-existent object. Ouch.

There are a couple of things we can do, actually. One of them is implementing reference counting on the objects. Yet we have to protect the reference counter itself. Luckily, thanks to atomic variables, we don't need a mutex to protect the counter.

This is how we will use atomic variables to count references to objects

First, we initialize the object's reference counter to 1. When the manipulator thread begins using the object, it increases the reference counter by one. When it has finished with the object, it decreases the reference counter. That is, for as long as the manipulator uses the object, it keeps the reference counter incremented.

Once the eraser thread decides to delete a certain object, it first removes the object from the data structure; it has to hold the mutex that protects the data structure to do that. Next, it decreases the object's reference counter by one.

Now remember that we initialized the reference counter to 1. So if the object is not in use, its reference counter should drop to zero. If it is zero, we can safely delete the object: since the object is no longer in the data structure, we can be sure the manipulator thread won't use it anymore.
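In its simplest form, this scheme can be sketched in C with the gcc atomic builtins described later in this article. The names obj_t, obj_new, obj_get and obj_put are mine, not part of any standard API, and in this minimal variant whoever drops the last reference frees the object immediately (deferred deletion comes next):

```c
#include <stdlib.h>

/* Illustrative object with an embedded reference counter. */
typedef struct obj {
    int refcnt;   /* starts at 1: the data structure's own reference */
    /* ... object's payload goes here ... */
} obj_t;

obj_t *obj_new(void)
{
    obj_t *o = malloc(sizeof(*o));
    if (o)
        o->refcnt = 1;   /* the reference held by the data structure */
    return o;
}

/* The manipulator calls this right after finding the object,
 * while it still holds the data-structure mutex. */
void obj_get(obj_t *o)
{
    __sync_add_and_fetch(&o->refcnt, 1);
}

/* The manipulator calls this when it is done with the object;
 * the eraser calls it after unlinking the object from the data
 * structure. Whoever drops the last reference frees the object. */
void obj_put(obj_t *o)
{
    if (__sync_sub_and_fetch(&o->refcnt, 1) == 0)
        free(o);
}
```

Note that the manipulator takes its reference while still holding the data-structure mutex; this is what guarantees the eraser cannot free the object between lookup and obj_get().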

But what if, after we decrement it, the reference counter is still bigger than zero? In that case we should wait until the object's reference counter becomes zero. But how?

Answering this question can be quite tricky. I usually consider one of two approaches: the naive approach, or an approach I call the RCU approach.

The naive approach

The naive approach goes like this. We maintain a list of objects that we want to delete. Every time the eraser thread wakes up, it runs through the list and deletes every object whose reference counter is zero.

Depending on the circumstances, there may be a problem with this solution. The list of objects we want to delete may grow, and running through the entire list can hog the CPU. Depending on how often we delete objects, this can be a good enough solution; if it's not, we have to consider the RCU approach.
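One possible shape of a single eraser pass under the naive scheme is sketched below. The struct and function names are my own, and the reference counter is read directly for brevity; a real implementation would use the atomic builtins described later:

```c
#include <stdlib.h>

/* Node of the to-delete list. Illustrative, not from any real API. */
struct dead_obj {
    struct dead_obj *next;
    int refcnt;   /* already decremented once by the eraser */
};

/* Walk the ENTIRE list, freeing every object that is no longer in use.
 * Returns the new list head. This full scan is exactly what can hog
 * the CPU once the list grows. */
struct dead_obj *reap_dead_list(struct dead_obj *head)
{
    struct dead_obj **pp = &head;

    while (*pp) {
        if ((*pp)->refcnt == 0) {
            struct dead_obj *dead = *pp;
            *pp = dead->next;    /* unlink... */
            free(dead);          /* ...and reclaim */
        } else {
            pp = &(*pp)->next;   /* still referenced, re-check next pass */
        }
    }
    return head;
}
```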

The RCU approach

RCU is another synchronization mechanism, used primarily in the Linux kernel. You can read about it here. I call this approach the RCU approach because it has some similarity to how RCU, the synchronization mechanism, works.

The idea is to assume that it takes the manipulator thread a bounded amount of time to work with the object. That is, the manipulator keeps using the object for at most a certain period of time. After that period, the object should no longer be in use, and the eraser thread should be able to delete it.

Let's say it takes the manipulator thread at most one minute to do its manipulations; I am exaggerating on purpose here. When the eraser thread wants to delete an object, all it has to do is grab the mutex or spinlock that protects the data structure containing the objects and remove the object from the data structure.

Next, it gets the current time and saves it in the object; the object should have a method that allows it to remember its deletion time. Then the eraser thread appends the object to the end of the deleted-objects list. This is a special list containing all the objects we want to delete, just as in the naive approach.

[Figure: the deleted-objects list; each entry holds an object and its deletion timestamp]

Note that appending objects to the end of the list keeps it sorted by deletion time: objects earlier in the list are those with earlier deletion times.

Next, the eraser thread goes back to the beginning of the list, grabs an object, and checks whether one minute has passed since it was removed from the data structure. If so, it checks the object's reference counter to make sure it is indeed 0. If the counter is not 0, it updates the object's deletion time to the current time and appends it back to the end of the list; note that normally this shouldn't happen. If the reference counter is 0, the thread removes the object from the list and deletes it.

To demonstrate how this works, take a look at the figure above. Let's say the time now is 15:35:12. This is more than one minute after we deleted the first object. Therefore, once the eraser wakes up and checks the list, it will instantly see that the first object in the list was deleted more than a minute ago. This is the time for it to delete the object from the list and check the next object.

It will check the second object, see that it has been in the list for less than one minute, and keep it there. Now, here is the interesting property of this approach: the eraser thread doesn't have to check the rest of the list. Because we always append objects to the end of the list, the list is sorted, and the eraser thread can be sure that the rest of the objects shouldn't be deleted either.

So instead of passing through the entire list, as we did in the naive approach, we can use timestamps to build a sorted list of objects and check only a few objects at its beginning. Depending on the circumstances, this can save a huge amount of processing time.
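The head-only check can be sketched like this. Again, the names are my own, and the still-referenced case, which should normally never occur, is only noted in a comment rather than implemented:

```c
#include <stdlib.h>
#include <time.h>

#define GRACE_PERIOD 60   /* seconds; the exaggerated "one minute" */

/* Node of the timestamped deletion list. Appending in deletion order
 * keeps the list sorted by deleted_at, oldest entries first. */
struct dead_obj {
    struct dead_obj *next;
    time_t deleted_at;    /* when the object left the data structure */
    int refcnt;
};

/* Free expired entries at the head of the list, and stop at the first
 * entry that is still inside the grace period: every entry after it is
 * younger, so nothing else needs to be examined. */
struct dead_obj *reap_head(struct dead_obj *head, time_t now)
{
    while (head &&
           now - head->deleted_at >= GRACE_PERIOD &&
           head->refcnt == 0) {
        /* If refcnt were nonzero here (abnormal), we would instead
         * re-stamp deleted_at = now and move the node to the tail. */
        struct dead_obj *dead = head;
        head = head->next;
        free(dead);
    }
    return head;
}
```

Because the loop bails out at the first entry younger than the grace period, each eraser pass touches only the handful of expired entries at the head, no matter how long the list is.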

Where do atomic variables come from?

Starting with version 4.1.2, gcc has built-in support for atomic variables. They work on the majority of architectures, but before using them, please check that your architecture supports them.

There’s a set of functions that manipulate atomic variables.

type __sync_fetch_and_add (type *ptr, type value);
type __sync_fetch_and_sub (type *ptr, type value);
type __sync_fetch_and_or (type *ptr, type value);
type __sync_fetch_and_and (type *ptr, type value);
type __sync_fetch_and_xor (type *ptr, type value);
type __sync_fetch_and_nand (type *ptr, type value);
type __sync_add_and_fetch (type *ptr, type value);
type __sync_sub_and_fetch (type *ptr, type value);
type __sync_or_and_fetch (type *ptr, type value);
type __sync_and_and_fetch (type *ptr, type value);
type __sync_xor_and_fetch (type *ptr, type value);
type __sync_nand_and_fetch (type *ptr, type value);

There is no need to include any header file to use these functions. If your architecture doesn't support them, you will get a linker error, as if you were calling a non-existent function.

Each of these functions manipulates a variable of a certain type. type can be char, short, int, long, long long, or one of their unsigned counterparts.
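The difference between the two groups is the return value: the __sync_fetch_and_OP functions return the value the variable had before the operation, while the __sync_OP_and_fetch functions return the value after it. A quick check (the function name is mine):

```c
/* Returns 1 if the builtins behave as described above, 0 otherwise. */
int check_builtin_return_values(void)
{
    int counter = 5;

    /* fetch_and_add yields the old value; counter becomes 8 */
    int old_val = __sync_fetch_and_add(&counter, 3);

    /* add_and_fetch yields the new value; counter becomes 11 */
    int new_val = __sync_add_and_fetch(&counter, 3);

    return old_val == 5 && new_val == 11 && counter == 11;
}
```

This is why reference counting uses __sync_sub_and_fetch: we want to compare the counter's new value with zero in one atomic step.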

It is a good idea to wrap these functions with something that hides the implementation details. This can be a class that represents an atomic variable, or an abstract data type.
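In C, such a wrapper might look like the sketch below. The type and function names loosely mimic the Linux kernel's atomic_t API but are my own here:

```c
/* Minimal atomic counter wrapper hiding the gcc builtins.
 * Modeled loosely on the kernel's atomic_t; names are illustrative. */
typedef struct {
    volatile int value;
} atomic_t;

static inline void atomic_set(atomic_t *a, int v) { a->value = v; }
static inline int  atomic_read(const atomic_t *a) { return a->value; }

/* Both return the counter's new value, like __sync_add_and_fetch. */
static inline int atomic_inc(atomic_t *a)
{
    return __sync_add_and_fetch(&a->value, 1);
}

static inline int atomic_dec(atomic_t *a)
{
    return __sync_sub_and_fetch(&a->value, 1);
}
```

If you later move to C11 atomics or a different compiler, only this wrapper has to change, not every call site.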

Finally, I have a nice example of how to use atomic variables in my multithreaded simple data type access and atomic variables article.

Conclusion

I hope you've found this interesting. In case you have further questions, please email me at alex@alexonlinux.com.

What kept me from sticking to Ubuntu as a desktop solution

There were a couple of things, actually, but most of them were solvable one way or another. There was one problem, though, that I could not solve.

I have an N95 cell phone. There are a couple of applications I use on it, but one I use the most. It is called Handy Safe Pro, an encrypted database of personal information.

I have my whole life in this database: all my credit card numbers with their PINs, bank account access information, passwords for various services, the password for this blog, etc.

I am not really afraid to lose it, because this particular program comes with a version for Windows. I actively work with two different computers, and both computers and the cell phone all have an up-to-date copy of the database. Oh, and because the database is encrypted with Blowfish using a 448-bit key, I am not really afraid that someone may steal it.

I rarely change the database using the phone itself. Instead, the Windows version allows you to modify the database and sync it.

I use this application more often than Word, SlickEdit, or any other program in my life. Well, perhaps the web browser is the one program I use more often, but it is the only one. Because the application is so important to me, I went searching for a Linux alternative. Unfortunately, I didn't find one.

There are personal information managers for Linux, of course, but I didn't find one that syncs with a cell phone. Very unfortunate. Then I thought that perhaps I should write one myself :-)

pthread mutex vs pthread spinlock

Update 06/16/2009: On several occasions, commenters on this article have pointed out that this post is somewhat incomplete. Therefore, before I continue, I would like to make some things clear.

Before considering the code below, please note that spinlocks are totally useless on uni-processor computers. This is due to the nature of spinlocks, which I will describe in future posts.

Also, it is not the intent of this article to show that spinlocks are better than mutexes in all circumstances. Spinlocks perform better than mutexes in this particular setup, on a multi-processor computer. It does not mean that spinlocks are good for you. Apply your own judgment when considering whether the material presented in this article fits your needs.

In this article I would like to demonstrate how spinlocks can be more efficient than mutexes. I'll present a program that does the following.


Is desktop Linux too fragmented to succeed?

This question is especially relevant after yesterday's fiasco. I ran into an article named, precisely, Is desktop Linux too fragmented to succeed?

The article argues that fragmentation is what keeps the Linux desktop from beating Mac OS and even Windows. Because the effort to create a Linux desktop is scattered across multiple distributions, it cannot become a serious player in desktop solutions.

Toward its end, the article offers an optimistic prospect: there are some efforts to bring the Linux desktop community to common ground, though so far they have gone unanswered.


Ubuntu 9.04 with Wubi – Failure

Once every year or so, I get so frustrated with the Windows desktop that I decide to install Linux. I am a big fan of Ubuntu Linux. I use it for many things, including as the server platform for this web-site. So the most natural choice for me was to try Ubuntu 9.04, the latest version. I installed it on my laptop using a program called Wubi.

Wubi is a program that installs Ubuntu Linux on top of an existing Windows installation. I already had Windows on my laptop and didn't want to destroy it; this is why I chose Wubi. It places the root file-system in a file on a Windows partition. Thanks to this feature, you don't even need a separate partition to install Ubuntu.


Help me to find interesting blogs

Not that I am bored, but when I am looking for something to read on the internet, I usually end up reading Google News. Enough is enough.

I am looking for interesting blogs about Linux, Linux administration and programming. So if you know one (or two, or three, etc), please help me by posting a link in the comments.

Thanks!

A new kind of virtualization

There are plenty of virtualization technologies. Companies and projects like VMware, VirtualBox, and Xen offer virtualization that allows one to run several virtual computers on a single physical computer.

I worked at a company called ScaleMP. ScaleMP’s technology, vSMP, turns multiple physical computers into one large computer.

Today I was looking for something different: a technology that turns a multi-core physical computer into a virtual computer with a single core whose power is the combined power of all the physical CPUs. That is, take two CPUs with four cores each, every core doing X flops, and turn them into a single CPU with a single core that does 8X flops.

In case you wonder why on earth someone would need such a program, here is what I was thinking about. Imagine you have a single-threaded program. Let's also assume that the program is difficult to change (or you only have its binary).

Single-threaded implies a scalability problem. This is because the only thing you can do to improve the program's performance is make the CPU it runs on faster. But modern CPUs have reached their limit in terms of single-core processing power. To make CPUs faster, processor vendors add more cores, splitting the CPU in two, four, eight, etc. Yet they cannot make a single core significantly faster. Hence we're stuck.

Such super-CPU virtualization technology could solve the problem, but it appears that no such technology exists. There is not a single firm doing anything that even closely resembles it.

Then I realized that such a company actually cannot exist. It would require a combination of technological innovation and development cost that simply cannot coexist.

Think about it. The technology behind a virtualized CPU would have to split a single stream of execution into several streams. But this is exactly what modern CPUs already do. Processors that utilize branch prediction try to predict the most probable path of execution of a thread and execute several chunks of code simultaneously.

In case you're wondering whether the processor in your laptop does branch prediction, the answer is probably yes.

Because CPUs already use branch prediction, and perhaps other technologies that speed up code execution, making our super-CPU do the job even faster would be a very difficult task. A task so difficult that it would probably require some of the best minds in the world to join up. And the best minds in the world cost a lot of money.

On the other hand, the market for such a product is relatively small. There are not many programs out there that cannot be optimized to run on SMP, so the income from such a product wouldn't be great. Eventually, such a company would go bankrupt.

Unless, of course, there are other possible uses for such technology.

Crisis, over?

After almost a month of attempts to bring things back on track, I think I can say that it is almost over. It seems I won't be able to bring things back to exactly where they were, because this new domain is newer than alexandersandler.net, and it seems that search engines care about domain age.

On the other hand, most of the visitors are back. Welcome back, and I hope you'll find the site as interesting as before :D