SMP affinity and proper interrupt handling in Linux

Introduction

Hardware interrupts has always been expensive. Somehow these small pieces of software consume so much CPU power and hardware and software engineers has always been trying to change this state of affairs. Some significant progress has been made. Still hardware interrupts consume lots of CPU power.

You will rarely see effects of interrupt handling on desktop systems. Take a look at your /proc/interrupts file. This file enlists all of your hardware devices and how many interrupts received by each and one of them on each CPU. If you are on a regular desktop system, you will see that number of interrupts that your computer handles is relatively small. Even powerful servers handling millions of packets per second handle only tens of thousands of interrupts per second. Yet these interrupts consume CPU power and handling them properly undoubtedly helps to improve system’s performance.

But really, what can we do about interrupts?

There are many things that can be done. Many Linux distributions ship with kernel that include modifications that significantly improve the situation. Technologies, such as NAPI, reduce number of interrupts and interrupt handling overhead so dramatically, that modern server probably wont be able to sustain a 1Gbps Ethernet link. NAPI is part of kernel for quiet some time. Other things include interrupt coalescence.

In this article I would like to address one of the most powerful techniques to optimize interrupt handling.

SMP affinity

The SMP affinity or processor affinity term has quiet broad meaning and requires an explanation. The word affinity addresses proximity of a certain task to certain processor within multi-processor system. I.e. when processor X runs process Y, they are affine to each other. The processor has parts of process’s memory in cache, thus constantly moving the process to different processor when scheduling it, would probably mean less effective scheduling.

As far as interrupts concerned, SMP affinity refers to a question what processor handles certain interrupt. On the contrary to the processes, binding interrupts to certain CPU will most likely cause performance degradation and here’s why. Interrupt handlers are usually very small in size. Interrupt’s memory footprint is relatively small, thus keeping interrupt on certain CPU will not improve cache hits. Instead, multiple interrupts will keep one of the cores overloaded while others remain relatively free. Scheduler has no idea about this state of affairs. It assumes that our interrupt handling core is as busy as any other core. As a result, you may face bottle necks as one of the processes or threads will occasionally work on core that has only 90% of its power available.

Things may be even worse because often core 0 by default handles all interrupts. On busy systems all interrupts may consume as much as 30% of core’s 0 power. Because we assume that all cores are equally powerful, we may find ourselves in a situation where our software system will effectively use only 70% of total CPU power.

Who’s responsible

APIC or Advanced Programmable Interrupt Controller has been integral part of all modern x86 based systems for many years – both SP (single-processor) and MP. This component is responsible for delivering interrupts. It also decides what interrupt goes where, in terms of cores.

By default APIC delivers ALL interrupts to core 0.This is the reason why /proc/interrupts will look like this on vast majority of modern Linux systems:

         CPU0     CPU1     CPU2     CPU3
  0:   123357        0        0        0   IO-APIC-edge  timer
  8:        0        0        0        0   IO-APIC-edge  rtc
 11:        0        0        0        0  IO-APIC-level  acpi
169:        0        0        0        0  IO-APIC-level  uhci_hcd:usb1
177:        0        0        0        0  IO-APIC-level  qla2xxx
185:        0        0        0        0  IO-APIC-level  qla2xxx
193:    12252        0        0        0  IO-APIC-level  ioc0
209:        0        0        0        0  IO-APIC-level  uhci_hcd:usb2
217:      468        0        0        0  IO-APIC-level  eth0
225:      285        0        0        0  IO-APIC-level  eth1
NMI:      120       66       76       45
LOC:   123239   123220   123187   123065
ERR:        0
MIS:        0

See anything suspicious? Well, CPU0 handling all hardware interrupts. All of them. This is the situation that you see on a system with misconfigured interrupt SMP affinity.

Simple solution for the problem

Solution for this problem has been around pretty much since the introduction of the APIC. It has several interrupt delivery and destination modes. Physical and logical. Fixed and low priority. Etc. The important fact is that it is capable of delivering interrupts to any of the cores and even do load balancing between them.

Its configuration is limited to first eight cores. I.e. if you have more than eight cores, don’t expect any core higher than 7 to receive interrupts.

By default it operates in physical/fixed. This means that it will deliver certain interrupt to certain core. You already know that by default it is core 0. The thing is that you can easily change core that receives certain interrupt.

For each and every IRQ number in the first column in /proc/interrupts file, there’s a sub-directory in /proc/irq/. That directory contains a file named smp_affinity. Using this file you can change what core handles that interrupt. Reading from this file produces a hexadecimal number which is a bitmask with a single bit for each core. When certain bit is set, APIC will deliver the interrupt to corresponding core.

Let’s see an example…

#
# cat /proc/interrupts
         CPU0     CPU1     CPU2     CPU3
  0: 19599546        0        0        0   IO-APIC-edge  timer
  8:        0        0        0        0   IO-APIC-edge  rtc
 11:        0        0        0        0  IO-APIC-level  acpi
169:        0        0        0        0  IO-APIC-level  uhci_hcd:usb1
177:        0        0        0        0  IO-APIC-level  qla2xxx
185:        0        0        0        0  IO-APIC-level  qla2xxx
193:    95337        0        0        0  IO-APIC-level  ioc0
209:        0        0        0        0  IO-APIC-level  uhci_hcd:usb2
217:   100778        0        0        0  IO-APIC-level  eth0
225:    56651        0        0        0  IO-APIC-level  eth1
NMI:      466      393      422      372
LOC: 19600453 19600434 19600401 19600279
ERR:        0
MIS:        0
#
#
# echo "2" > /proc/irq/217/smp_affinity
# cat /proc/interrupts
         CPU0     CPU1     CPU2     CPU3
  0: 19606722        0        0        0   IO-APIC-edge  timer
  8:        0        0        0        0   IO-APIC-edge  rtc
 11:        0        0        0        0  IO-APIC-level  acpi
169:        0        0        0        0  IO-APIC-level  uhci_hcd:usb1
177:        0        0        0        0  IO-APIC-level  qla2xxx
185:        0        0        0        0  IO-APIC-level  qla2xxx
193:    95349        0        0        0  IO-APIC-level  ioc0
209:        0        0        0        0  IO-APIC-level  uhci_hcd:usb2
217:   101027       49        0        0  IO-APIC-level  eth0
225:    56655        0        0        0  IO-APIC-level  eth1
NMI:      466      393      422      372
LOC: 19607629 19607610 19607577 19607455
ERR:        0
MIS:        0
#

As we can see, once we enter the magical command, CPU1 begins receiving interrupts from eth0, instead of CPU0. The echo command that changed the state of affairs is especially interesting. It is “2″ that we’re echoing into the file. Writing “4″ to the file, would cause eth0 interrupt be handled by CPU2, instead of CPU1. As I already mentioned, it is a bitmask where one bit correspond to single CPU.

How about writing “3″ into the file. In theory, this should cause APIC to divert interrupts to CPU0 and CPU1. Unfortunately, things are a little more complicated here. It all depends on whether APIC works in physical “destination mode” and low priority “delivery mode”. If it is so, than you most likely would not be seeing CPU0 handling all interrupts. This is because when kernel configures APIC to work in physical/low priority modes, it automatically tells APIC to load balance interrupts between first eight cores.

So if on your system CPU0 handles all interrupts by default, this probably means that APIC configured ambiguously.

Ultimate solution

First of all, unfortunately there is no choice but to replace the kernel. Software that configures APIC is part of the kernel and if we want to change things we have no choice but to fix things in kernel. Things related to APIC are not configurable, so we have absolutely no choice. The only question is, replace kernel with what?

I tested this with OpenSuSE 10.2 that comes with kernel 2.6.18. Installing kernel 2.6.24.3 (the latest at the moment) with OpenSuSE’s default kernel configuration (/proc/config.gz) fixes the problem. With this kernel, things look like this, right from the start:

# cat /proc/interrupts
         CPU0     CPU1     CPU2     CPU3
  0:   728895   728796   728624   728895  IO-APIC-edge     timer
  8:        0        0        0        0  IO-APIC-edge     rtc
 11:        0        0        0        0  IO-APIC-fasteoi  acpi
 16:        0        0        0        0  IO-APIC-fasteoi  uhci_hcd:usb1
 19:        0        0        0        0  IO-APIC-fasteoi  uhci_hcd:usb2
 24:    14090    14090    14327    14056  IO-APIC-fasteoi  ioc0
 49:        7        9        7        8  IO-APIC-fasteoi  qla2xxx
 50:        8       12       11       10  IO-APIC-fasteoi  qla2xxx
 77:     2849     2759     2841     2827  IO-APIC-fasteoi  eth0
 78:    25072    25138    24996    24980  IO-APIC-fasteoi  eth1
NMI:        0        0        0        0
LOC:  2915270  2915256  2915228  2915092
ERR:        0

Looks good isn’t it? All cores handle interrupts, thus working with maximum efficiency. Now how about getting this result with just any kernel version? It appears to be doable.

There’s a kernel configuration option that stands in our way and once removed you will get similar situation with probably any kernel  newer than 2.6.10. The option is CONFIG_HOTPLUG_CPU. It adds support for hotplugable CPUs. It appears that having this option off, makes kernel configure APIC properly.

Actually  it is quiet understandable. You see, APIC has to be told what processors should receive interrupts. You need additional piece of code that tells APIC how to handle processor removals – processor removal is one of the things that CONFIG_HOTPLUG_CPU allows you to do. I assume that this functionality was missing from earlier kernel and got inside in 2.6.24.3.

Conclusion

We saw that we can achieve really nice results by doing some modifications to kernel configuration. On a very busy system, doing this small configuration change can boost server’s productivity by large margin.

I hope you will find this information useful and use techniques I described in this article.

Did you know that you can receive periodical updates with the latest articles that I write right into your email box? Alternatively, you subscribe to the RSS feed!

Want to know how? Check out
Subscribe page

64 Comments

  1. sravan says:

    Any one knows hoe to disable the CONFIG_HOTPLUG_CPU in 3.8 kernel

  2. sravan says:

    Any one knows how to disable the CONFIG_HOTPLUG_CPU in 3.8 kernel

Leave a Reply