SMP affinity and proper interrupt handling in Linux
Hardware interrupts have always been expensive. Somehow these small pieces of software consume a great deal of CPU power, and hardware and software engineers have long been trying to change this state of affairs. Significant progress has been made, yet hardware interrupts still consume a lot of CPU power.
You will rarely see the effects of interrupt handling on desktop systems. Take a look at your /proc/interrupts file. It lists all of your hardware devices and how many interrupts each of them has raised on each CPU. On a regular desktop system, the number of interrupts that the computer handles is relatively small. Even powerful servers handling millions of packets per second handle only tens of thousands of interrupts per second. Yet these interrupts consume CPU power, and handling them properly undoubtedly helps to improve the system’s performance.
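To get a quick feel for how interrupts are spread over the CPUs, the counters in /proc/interrupts can be summed per CPU column. The following is only a minimal sketch: irq_totals_per_cpu is a hypothetical helper name, and it assumes the usual layout in which the first line is the CPU header and each numbered IRQ line carries one counter per CPU.

```shell
#!/bin/sh
# irq_totals_per_cpu is a hypothetical helper, not a standard tool.
# It sums the counters of all numbered IRQ lines, per CPU column, in a
# /proc/interrupts-style file (default: /proc/interrupts itself).
irq_totals_per_cpu() {
    awk '
        NR == 1 { ncpu = NF; next }        # header line: CPU0 CPU1 ...
        $1 ~ /^[0-9]+:$/ {                 # skip NMI, LOC, ERR and similar
            for (i = 1; i <= ncpu; i++) total[i] += $(i + 1)
        }
        END {
            for (i = 1; i <= ncpu; i++) printf "CPU%d %.0f\n", i - 1, total[i]
        }
    ' "${1:-/proc/interrupts}"
}

# Demo on a few lines in the /proc/interrupts format.
sample=$(mktemp)
cat > "$sample" <<'EOF'
           CPU0       CPU1
  0:     123357          0   IO-APIC-edge   timer
217:        468          0   IO-APIC-level  eth0
LOC:     123239     123220
EOF
irq_totals_per_cpu "$sample"
rm -f "$sample"
```

Called without an argument, the function reads the live /proc/interrupts file. Running it twice a few seconds apart and comparing the totals gives a rough idea of the interrupt rate.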
But really, what can we do about interrupts?
There are many things that can be done. Many Linux distributions ship kernels that include modifications that significantly improve the situation. Technologies such as NAPI reduce the number of interrupts and the interrupt handling overhead so dramatically that, without them, a modern server probably would not be able to sustain a 1Gbps Ethernet link. NAPI has been part of the kernel for quite some time. Other techniques include interrupt coalescing.
In this article I would like to address one of the most powerful techniques to optimize interrupt handling.
The term SMP affinity, or processor affinity, has quite a broad meaning and requires an explanation. The word affinity refers to the proximity of a certain task to a certain processor within a multi-processor system. That is, when processor X runs process Y, they are affine to each other. The processor holds parts of the process’s memory in its cache, so constantly moving the process to a different processor when scheduling it would probably mean less effective scheduling.
As far as interrupts are concerned, SMP affinity refers to the question of which processor handles a certain interrupt. Unlike with processes, binding interrupts to a certain CPU will most likely cause performance degradation, and here’s why. Interrupt handlers are usually very small. Their memory footprint is relatively small, so keeping an interrupt on a certain CPU will not improve cache hits. Instead, multiple interrupts will keep one of the cores overloaded while the others remain relatively free. The scheduler has no idea about this state of affairs. It assumes that our interrupt handling core is as available as any other core. As a result, you may face bottlenecks, because one of the processes or threads will occasionally run on a core that has only 90% of its power available.
Things may be even worse because core 0 often handles all interrupts by default. On busy systems, interrupts may consume as much as 30% of core 0’s power. Because we assume that all cores are equally powerful, we may find ourselves in a situation where our software system effectively uses only 70% of the total CPU power.
The APIC, or Advanced Programmable Interrupt Controller, has been an integral part of all modern x86-based systems for many years, both SP (single-processor) and MP (multi-processor). This component is responsible for delivering interrupts. It also decides which interrupt goes to which core.
By default the APIC delivers ALL interrupts to core 0. This is the reason why /proc/interrupts looks like this on the vast majority of modern Linux systems:
           CPU0       CPU1       CPU2       CPU3
  0:     123357          0          0          0   IO-APIC-edge     timer
  8:          0          0          0          0   IO-APIC-edge     rtc
 11:          0          0          0          0   IO-APIC-level    acpi
169:          0          0          0          0   IO-APIC-level    uhci_hcd:usb1
177:          0          0          0          0   IO-APIC-level    qla2xxx
185:          0          0          0          0   IO-APIC-level    qla2xxx
193:      12252          0          0          0   IO-APIC-level    ioc0
209:          0          0          0          0   IO-APIC-level    uhci_hcd:usb2
217:        468          0          0          0   IO-APIC-level    eth0
225:        285          0          0          0   IO-APIC-level    eth1
NMI:        120         66         76         45
LOC:     123239     123220     123187     123065
ERR:          0
MIS:          0
See anything suspicious? Well, CPU0 is handling all hardware interrupts. All of them. This is what you see on a system with misconfigured interrupt SMP affinity.
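Spotting this pattern by eye gets tedious on machines with many devices. As a rough sketch, the check can be automated by listing every numbered IRQ that has fired on CPU0 but never on any other CPU. The name irqs_only_on_cpu0 is hypothetical, and the script assumes that the last field of each IRQ line is the device name, as in the listing above.

```shell
#!/bin/sh
# irqs_only_on_cpu0 is a hypothetical helper, not a standard tool.
# It prints "IRQ device" for every numbered IRQ whose counter is non-zero
# on CPU0 and zero on all other CPUs.
irqs_only_on_cpu0() {
    awk '
        NR == 1 { ncpu = NF; next }            # header line: CPU0 CPU1 ...
        $1 ~ /^[0-9]+:$/ && $2 > 0 {           # active numbered IRQs only
            others = 0
            for (i = 3; i <= ncpu + 1; i++) others += $i
            if (others == 0) { sub(":", "", $1); print $1, $NF }
        }
    ' "${1:-/proc/interrupts}"
}

# Demo on a few lines in the /proc/interrupts format.
sample=$(mktemp)
cat > "$sample" <<'EOF'
           CPU0       CPU1
  0:     123357          0   IO-APIC-edge     timer
  8:          0          0   IO-APIC-edge     rtc
 77:       2849       2759   IO-APIC-fasteoi  eth0
EOF
irqs_only_on_cpu0 "$sample"
rm -f "$sample"
```

IRQs that never fired at all are skipped, since they say nothing about where interrupts are delivered. On a single-CPU machine every active IRQ is listed, which is expected.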
Simple solution for the problem
A solution for this problem has been around pretty much since the introduction of the APIC. It has several destination and delivery modes: physical and logical, fixed and low priority, and so on. The important fact is that it is capable of delivering interrupts to any of the cores and can even load balance between them.
Its configuration is limited to the first eight cores. That is, if you have more than eight cores, don’t expect any core numbered higher than 7 to receive interrupts.
By default it operates in physical/fixed mode. This means that it delivers a certain interrupt to a certain core. You already know that by default this is core 0. The thing is that you can easily change which core receives a certain interrupt.
For each IRQ number in the first column of the /proc/interrupts file, there is a sub-directory in /proc/irq/. That directory contains a file named smp_affinity. Using this file you can change which core handles that interrupt. Reading the file produces a hexadecimal number, which is a bitmask with a single bit for each core. When a certain bit is set, the APIC will deliver the interrupt to the corresponding core.
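Reading the bitmask takes a little practice, so here is a small sketch that decodes it into a list of CPU numbers. The function name mask_to_cpus is hypothetical. The sketch assumes a mask that fits into the shell’s integer arithmetic; it strips the commas that separate 32-bit groups on larger machines, but it is not meant for systems with more CPUs than the shell’s integers have bits.

```shell
#!/bin/sh
# mask_to_cpus is a hypothetical helper, not a standard tool.
# It turns a hexadecimal smp_affinity value into a list of CPU numbers.
# Assumption: the mask fits into the shell's integer arithmetic.
mask_to_cpus() (
    hex=$(printf '%s' "$1" | tr -d ',')    # drop group separators, if any
    mask=$((0x$hex))
    cpu=0
    out=""
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="$out${out:+ }$cpu"        # bit set: this CPU is allowed
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "$out"
)

mask_to_cpus 1     # only CPU0
mask_to_cpus f     # CPU0 through CPU3

# On a live system (217 is the eth0 IRQ from the listings in this article):
#   mask_to_cpus "$(cat /proc/irq/217/smp_affinity)"
```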
Let’s see an example…
# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:   19599546          0          0          0   IO-APIC-edge     timer
  8:          0          0          0          0   IO-APIC-edge     rtc
 11:          0          0          0          0   IO-APIC-level    acpi
169:          0          0          0          0   IO-APIC-level    uhci_hcd:usb1
177:          0          0          0          0   IO-APIC-level    qla2xxx
185:          0          0          0          0   IO-APIC-level    qla2xxx
193:      95337          0          0          0   IO-APIC-level    ioc0
209:          0          0          0          0   IO-APIC-level    uhci_hcd:usb2
217:     100778          0          0          0   IO-APIC-level    eth0
225:      56651          0          0          0   IO-APIC-level    eth1
NMI:        466        393        422        372
LOC:   19600453   19600434   19600401   19600279
ERR:          0
MIS:          0
# echo "2" > /proc/irq/217/smp_affinity
# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:   19606722          0          0          0   IO-APIC-edge     timer
  8:          0          0          0          0   IO-APIC-edge     rtc
 11:          0          0          0          0   IO-APIC-level    acpi
169:          0          0          0          0   IO-APIC-level    uhci_hcd:usb1
177:          0          0          0          0   IO-APIC-level    qla2xxx
185:          0          0          0          0   IO-APIC-level    qla2xxx
193:      95349          0          0          0   IO-APIC-level    ioc0
209:          0          0          0          0   IO-APIC-level    uhci_hcd:usb2
217:     101027         49          0          0   IO-APIC-level    eth0
225:      56655          0          0          0   IO-APIC-level    eth1
NMI:        466        393        422        372
LOC:   19607629   19607610   19607577   19607455
ERR:          0
MIS:          0
#
As we can see, once we enter the magical command, CPU1 begins receiving interrupts from eth0 instead of CPU0. The echo command that changed the state of affairs is especially interesting. It is “2” that we’re echoing into the file. Writing “4” to the file would cause the eth0 interrupt to be handled by CPU2 instead of CPU1. As I already mentioned, it is a bitmask where each bit corresponds to a single CPU.
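The value to write is simply 1 shifted left by the CPU number, printed in hexadecimal. As a small sketch, a helper can build the mask from a list of CPU numbers; the name cpus_to_mask is hypothetical and not part of any standard tool.

```shell
#!/bin/sh
# cpus_to_mask is a hypothetical helper, not a standard tool.
# It builds the hexadecimal smp_affinity bitmask for the given CPU numbers.
cpus_to_mask() (
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))    # set the bit that belongs to this CPU
    done
    printf '%x\n' "$mask"
)

cpus_to_mask 1     # mask that selects CPU1
cpus_to_mask 2     # mask that selects CPU2

# Applying it needs root (217 is the eth0 IRQ from the listing above):
#   cpus_to_mask 2 > /proc/irq/217/smp_affinity
```

Keep in mind that the file expects hexadecimal, so CPU4 is written as 10 rather than 16.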
How about writing “3” into the file? In theory, this should cause the APIC to divert interrupts to both CPU0 and CPU1. Unfortunately, things are a little more complicated here. It all depends on whether the APIC works in physical “destination mode” and low priority “delivery mode”. If it does, then you most likely would not be seeing CPU0 handling all interrupts in the first place. This is because when the kernel configures the APIC to work in physical/low priority mode, it automatically tells the APIC to load balance interrupts between the first eight cores.
So if CPU0 handles all interrupts by default on your system, this probably means that the APIC is not configured to balance them.
Unfortunately, there is no choice but to replace the kernel. The software that configures the APIC is part of the kernel, and the APIC settings in question are not configurable at run time, so the only way to change them is to change the kernel. The only question is: replace the kernel with what?
I tested this with OpenSuSE 10.2, which comes with kernel 2.6.18. Installing kernel 220.127.116.11 (the latest at the moment) with OpenSuSE’s default kernel configuration (/proc/config.gz) fixes the problem. With this kernel, things look like this right from the start:
# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:     728895     728796     728624     728895   IO-APIC-edge     timer
  8:          0          0          0          0   IO-APIC-edge     rtc
 11:          0          0          0          0   IO-APIC-fasteoi  acpi
 16:          0          0          0          0   IO-APIC-fasteoi  uhci_hcd:usb1
 19:          0          0          0          0   IO-APIC-fasteoi  uhci_hcd:usb2
 24:      14090      14090      14327      14056   IO-APIC-fasteoi  ioc0
 49:          7          9          7          8   IO-APIC-fasteoi  qla2xxx
 50:          8         12         11         10   IO-APIC-fasteoi  qla2xxx
 77:       2849       2759       2841       2827   IO-APIC-fasteoi  eth0
 78:      25072      25138      24996      24980   IO-APIC-fasteoi  eth1
NMI:          0          0          0          0
LOC:    2915270    2915256    2915228    2915092
ERR:          0
Looks good, doesn’t it? All cores handle interrupts, so the load is spread evenly between them. Now how about getting this result with just any kernel version? It appears to be doable.
There is a kernel configuration option that stands in our way, and once it is removed you will get a similar result with probably any kernel newer than 2.6.10. The option is CONFIG_HOTPLUG_CPU. It adds support for hot-pluggable CPUs. It appears that turning this option off makes the kernel configure the APIC properly.
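Before rebuilding anything, it is worth checking whether the running kernel has the option enabled. Here is a minimal sketch, assuming the kernel exposes its configuration through /proc/config.gz, which is the case only when the kernel was built with that feature. The function name hotplug_cpu_state is hypothetical.

```shell
#!/bin/sh
# hotplug_cpu_state is a hypothetical helper, not a standard tool.
# It reads kernel configuration text on standard input and reports
# whether CONFIG_HOTPLUG_CPU is switched on.
hotplug_cpu_state() {
    if grep -q '^CONFIG_HOTPLUG_CPU=y'; then
        echo enabled
    else
        echo disabled
    fi
}

# Demo on two small configuration fragments.
printf '%s\n' 'CONFIG_SMP=y' 'CONFIG_HOTPLUG_CPU=y' | hotplug_cpu_state
printf '%s\n' 'CONFIG_SMP=y' '# CONFIG_HOTPLUG_CPU is not set' | hotplug_cpu_state

# On a live system (assumes /proc/config.gz exists):
#   zcat /proc/config.gz | hotplug_cpu_state
```

When /proc/config.gz is missing, many distributions install the same configuration text as a file under /boot, which can be fed to the function in the same way.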
Actually, it is quite understandable. You see, the APIC has to be told which processors should receive interrupts. With hot-pluggable CPUs, you need an additional piece of code that tells the APIC how to handle processor removals, and processor removal is one of the things that CONFIG_HOTPLUG_CPU allows you to do. I assume that this functionality was missing from earlier kernels and was added in 18.104.22.168.
We saw that we can achieve really nice results by making some modifications to the kernel configuration. On a very busy system, this small configuration change can boost the server’s throughput by a large margin.
I hope you will find this information useful and will use the techniques I described in this article.