SMP affinity and proper interrupt handling in Linux

Introduction

Hardware interrupts has always been expensive. Somehow these small pieces of software consume so much CPU power and hardware and software engineers has always been trying to change this state of affairs. Some significant progress has been made. Still hardware interrupts consume lots of CPU power.

You will rarely see effects of interrupt handling on desktop systems. Take a look at your /proc/interrupts file. This file enlists all of your hardware devices and how many interrupts received by each and one of them on each CPU. If you are on a regular desktop system, you will see that number of interrupts that your computer handles is relatively small. Even powerful servers handling millions of packets per second handle only tens of thousands of interrupts per second. Yet these interrupts consume CPU power and handling them properly undoubtedly helps to improve system’s performance.

But really, what can we do about interrupts?

There are many things that can be done. Many Linux distributions ship with kernel that include modifications that significantly improve the situation. Technologies, such as NAPI, reduce number of interrupts and interrupt handling overhead so dramatically, that modern server probably wont be able to sustain a 1Gbps Ethernet link. NAPI is part of kernel for quiet some time. Other things include interrupt coalescence.

In this article I would like to address one of the most powerful techniques to optimize interrupt handling.

SMP affinity

The SMP affinity or processor affinity term has quiet broad meaning and requires an explanation. The word affinity addresses proximity of a certain task to certain processor within multi-processor system. I.e. when processor X runs process Y, they are affine to each other. The processor has parts of process’s memory in cache, thus constantly moving the process to different processor when scheduling it, would probably mean less effective scheduling.

As far as interrupts concerned, SMP affinity refers to a question what processor handles certain interrupt. On the contrary to the processes, binding interrupts to certain CPU will most likely cause performance degradation and here’s why. Interrupt handlers are usually very small in size. Interrupt’s memory footprint is relatively small, thus keeping interrupt on certain CPU will not improve cache hits. Instead, multiple interrupts will keep one of the cores overloaded while others remain relatively free. Scheduler has no idea about this state of affairs. It assumes that our interrupt handling core is as busy as any other core. As a result, you may face bottle necks as one of the processes or threads will occasionally work on core that has only 90% of its power available.

Things may be even worse because often core 0 by default handles all interrupts. On busy systems all interrupts may consume as much as 30% of core’s 0 power. Because we assume that all cores are equally powerful, we may find ourselves in a situation where our software system will effectively use only 70% of total CPU power.

Who’s responsible

APIC or Advanced Programmable Interrupt Controller has been integral part of all modern x86 based systems for many years – both SP (single-processor) and MP. This component is responsible for delivering interrupts. It also decides what interrupt goes where, in terms of cores.

By default APIC delivers ALL interrupts to core 0.This is the reason why /proc/interrupts will look like this on vast majority of modern Linux systems:

         CPU0     CPU1     CPU2     CPU3
  0:   123357        0        0        0   IO-APIC-edge  timer
  8:        0        0        0        0   IO-APIC-edge  rtc
 11:        0        0        0        0  IO-APIC-level  acpi
169:        0        0        0        0  IO-APIC-level  uhci_hcd:usb1
177:        0        0        0        0  IO-APIC-level  qla2xxx
185:        0        0        0        0  IO-APIC-level  qla2xxx
193:    12252        0        0        0  IO-APIC-level  ioc0
209:        0        0        0        0  IO-APIC-level  uhci_hcd:usb2
217:      468        0        0        0  IO-APIC-level  eth0
225:      285        0        0        0  IO-APIC-level  eth1
NMI:      120       66       76       45
LOC:   123239   123220   123187   123065
ERR:        0
MIS:        0

See anything suspicious? Well, CPU0 handling all hardware interrupts. All of them. This is the situation that you see on a system with misconfigured interrupt SMP affinity.

Simple solution for the problem

Solution for this problem has been around pretty much since the introduction of the APIC. It has several interrupt delivery and destination modes. Physical and logical. Fixed and low priority. Etc. The important fact is that it is capable of delivering interrupts to any of the cores and even do load balancing between them.

Its configuration is limited to first eight cores. I.e. if you have more than eight cores, don’t expect any core higher than 7 to receive interrupts.

By default it operates in physical/fixed. This means that it will deliver certain interrupt to certain core. You already know that by default it is core 0. The thing is that you can easily change core that receives certain interrupt.

For each and every IRQ number in the first column in /proc/interrupts file, there’s a sub-directory in /proc/irq/. That directory contains a file named smp_affinity. Using this file you can change what core handles that interrupt. Reading from this file produces a hexadecimal number which is a bitmask with a single bit for each core. When certain bit is set, APIC will deliver the interrupt to corresponding core.

Let’s see an example…

#
# cat /proc/interrupts
         CPU0     CPU1     CPU2     CPU3
  0: 19599546        0        0        0   IO-APIC-edge  timer
  8:        0        0        0        0   IO-APIC-edge  rtc
 11:        0        0        0        0  IO-APIC-level  acpi
169:        0        0        0        0  IO-APIC-level  uhci_hcd:usb1
177:        0        0        0        0  IO-APIC-level  qla2xxx
185:        0        0        0        0  IO-APIC-level  qla2xxx
193:    95337        0        0        0  IO-APIC-level  ioc0
209:        0        0        0        0  IO-APIC-level  uhci_hcd:usb2
217:   100778        0        0        0  IO-APIC-level  eth0
225:    56651        0        0        0  IO-APIC-level  eth1
NMI:      466      393      422      372
LOC: 19600453 19600434 19600401 19600279
ERR:        0
MIS:        0
#
#
# echo "2" > /proc/irq/217/smp_affinity
# cat /proc/interrupts
         CPU0     CPU1     CPU2     CPU3
  0: 19606722        0        0        0   IO-APIC-edge  timer
  8:        0        0        0        0   IO-APIC-edge  rtc
 11:        0        0        0        0  IO-APIC-level  acpi
169:        0        0        0        0  IO-APIC-level  uhci_hcd:usb1
177:        0        0        0        0  IO-APIC-level  qla2xxx
185:        0        0        0        0  IO-APIC-level  qla2xxx
193:    95349        0        0        0  IO-APIC-level  ioc0
209:        0        0        0        0  IO-APIC-level  uhci_hcd:usb2
217:   101027       49        0        0  IO-APIC-level  eth0
225:    56655        0        0        0  IO-APIC-level  eth1
NMI:      466      393      422      372
LOC: 19607629 19607610 19607577 19607455
ERR:        0
MIS:        0
#

As we can see, once we enter the magical command, CPU1 begins receiving interrupts from eth0, instead of CPU0. The echo command that changed the state of affairs is especially interesting. It is “2” that we’re echoing into the file. Writing “4” to the file, would cause eth0 interrupt be handled by CPU2, instead of CPU1. As I already mentioned, it is a bitmask where one bit correspond to single CPU.

How about writing “3” into the file. In theory, this should cause APIC to divert interrupts to CPU0 and CPU1. Unfortunately, things are a little more complicated here. It all depends on whether APIC works in physical “destination mode” and low priority “delivery mode”. If it is so, than you most likely would not be seeing CPU0 handling all interrupts. This is because when kernel configures APIC to work in physical/low priority modes, it automatically tells APIC to load balance interrupts between first eight cores.

So if on your system CPU0 handles all interrupts by default, this probably means that APIC configured ambiguously.

Ultimate solution

First of all, unfortunately there is no choice but to replace the kernel. Software that configures APIC is part of the kernel and if we want to change things we have no choice but to fix things in kernel. Things related to APIC are not configurable, so we have absolutely no choice. The only question is, replace kernel with what?

I tested this with OpenSuSE 10.2 that comes with kernel 2.6.18. Installing kernel 2.6.24.3 (the latest at the moment) with OpenSuSE’s default kernel configuration (/proc/config.gz) fixes the problem. With this kernel, things look like this, right from the start:

# cat /proc/interrupts
         CPU0     CPU1     CPU2     CPU3
  0:   728895   728796   728624   728895  IO-APIC-edge     timer
  8:        0        0        0        0  IO-APIC-edge     rtc
 11:        0        0        0        0  IO-APIC-fasteoi  acpi
 16:        0        0        0        0  IO-APIC-fasteoi  uhci_hcd:usb1
 19:        0        0        0        0  IO-APIC-fasteoi  uhci_hcd:usb2
 24:    14090    14090    14327    14056  IO-APIC-fasteoi  ioc0
 49:        7        9        7        8  IO-APIC-fasteoi  qla2xxx
 50:        8       12       11       10  IO-APIC-fasteoi  qla2xxx
 77:     2849     2759     2841     2827  IO-APIC-fasteoi  eth0
 78:    25072    25138    24996    24980  IO-APIC-fasteoi  eth1
NMI:        0        0        0        0
LOC:  2915270  2915256  2915228  2915092
ERR:        0

Looks good isn’t it? All cores handle interrupts, thus working with maximum efficiency. Now how about getting this result with just any kernel version? It appears to be doable.

There’s a kernel configuration option that stands in our way and once removed you will get similar situation with probably any kernel  newer than 2.6.10. The option is CONFIG_HOTPLUG_CPU. It adds support for hotplugable CPUs. It appears that having this option off, makes kernel configure APIC properly.

Actually  it is quiet understandable. You see, APIC has to be told what processors should receive interrupts. You need additional piece of code that tells APIC how to handle processor removals – processor removal is one of the things that CONFIG_HOTPLUG_CPU allows you to do. I assume that this functionality was missing from earlier kernel and got inside in 2.6.24.3.

Conclusion

We saw that we can achieve really nice results by doing some modifications to kernel configuration. On a very busy system, doing this small configuration change can boost server’s productivity by large margin.

I hope you will find this information useful and use techniques I described in this article.

Did you know that you can receive periodical updates with the latest articles that I write right into your email box? Alternatively, you subscribe to the RSS feed!

Want to know how? Check out
Subscribe page

71 Comments

  1. Anonymous says:

    @Gustav

    sudo will give you trouble with echo since the redirect to the smp_affinity file is handled by your user’s shell interpreter, not by a sudo’ed root shell interpreter. To use sudo, try something like this:

    sudo bash -c ‘echo 2 > /proc/irq/xxx/smp_affinity’

  2. […] on by default, and with the hotplug_cpu compiler option on, all interrupts are bound to one core (see here). After emerging irqbalance & recompiling the kernel with hotplug off, I saw another 15% on the […]

  3. […] feature because it messes up APIC and interrupts can't be delivered to all CPUs, as suggested in http://www.alexonlinux.com/smp-affin…dling-in-linux. However, when I run make menuconfig, I find I can't disable this feature directly. The line in […]

  4. […] search on the internet and found some good information on http://tldp.org/LDP/tlk/tlk-toc.html and http://www.alexonlinux.com/smp-affin…dling-in-linux . Most information and articles I found say that interrupts ( also a particular interrupt ) should […]

  5. […] the process for managing multi cpu threading. See http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux for the answers you seek on how to lower it down, but basically its the way the system handles […]

  6. […] 先来看前两种情况。其实就是 RR(round-robin) 的方式,或许能解决一些 bottleneck,但是更严重的问题来了,尤其对于网卡这种中断很频繁的设备来说,会造成 cpu cache missed,造成影响是直接访问内存比访问 cache 里面的内容慢 30x;并且对于流媒体来说,还可能造成 congestion collapse。另外,由于 APIC 的问题,可能无法实现上面的要求,比如,我想将 irq 60 绑定到 core 0 以及  core 1 上,或者想把 irq 67 分布到所有的 core 中,将 3 写入到 60 的 smp_affinity 不就可以了吗?实事是做不到的,并且上面已经说明了这样做没有意义的原因。将 CONFIG_HITPLUG_CPU 由原先的 yes 设置成 no 可能会解决问题,不过我并没有试过,尤其明白了上面将 irq 分布到多个 core 的不合理之处之后更没有必要这样做了。 […]

  7. […] Привязка к процессору и обработка прерываний […]

  8. sravan says:

    Any one knows hoe to disable the CONFIG_HOTPLUG_CPU in 3.8 kernel

  9. […] smp affinity for interrupts it is possible to boost the routing power from 500Mbits to […]

  10. What’s up to all, for the reason that I am really keen of reading this weblog’s post to be updated regularly.
    It contains fastidious stuff.

  11. […] the process for managing multi cpu threading. See http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux for the answers you seek on how to lower it down, but basically its the way the system handles […]

  12. […] the process for managing multi cpu threading. See http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux for the answers you seek on how to lower it down, but basically its the way the system handles […]

  13. […] the process for managing multi cpu threading. See http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux for the answers you seek on how to lower it down, but basically its the way the system handles […]

  14. […] the process for managing multi-CPU threading. See SMP Affinity and Proper Interrupt Handling in Linux for the answers you seek on how to lower it down, but basically its the way the system handles […]

  15. […] serverfault article and this offsite article that it references are also […]

Leave a Reply

Prove you are not a computer or die *