SMP affinity and proper interrupt handling in Linux

Posted on April 15, 2008, 3:11 pm, by Alexander Sandler, under Programming Articles, System Administrator Articles.

Introduction

Hardware interrupts have always been expensive. Somehow these small pieces of software consume a lot of CPU power, and hardware and software engineers have always been trying to change this state of affairs. Some significant progress has been made. Still, hardware interrupts consume lots of CPU power.

You will rarely see the effects of interrupt handling on desktop systems. Take a look at your /proc/interrupts file. This file lists all of your hardware devices and how many interrupts each one of them has raised on each CPU. If you are on a regular desktop system, you will see that the number of interrupts that your computer handles is relatively small. Even powerful servers handling millions of packets per second handle only tens of thousands of interrupts per second. Yet these interrupts consume CPU power, and handling them properly undoubtedly helps to improve the system’s performance.

But really, what can we do about interrupts?

There are many things that can be done. Many Linux distributions ship with kernels that include modifications that significantly improve the situation. Technologies such as NAPI reduce the number of interrupts and the interrupt handling overhead so dramatically that without them a modern server probably would not be able to sustain a 1Gbps Ethernet link. NAPI has been part of the kernel for quite some time. Other things include interrupt coalescing.

In this article I would like to address one of the most powerful techniques to optimize interrupt handling.

SMP affinity

The term SMP affinity, or processor affinity, has quite a broad meaning and requires an explanation. The word affinity refers to the proximity of a certain task to a certain processor within a multi-processor system. I.e. when processor X runs process Y, they are affine to each other. The processor has parts of the process’s memory in its cache, so constantly moving the process to a different processor when scheduling it would probably mean less effective scheduling.

As far as interrupts are concerned, SMP affinity refers to the question of which processor handles a certain interrupt. Contrary to processes, binding interrupts to a certain CPU will most likely cause performance degradation, and here’s why. Interrupt handlers are usually very small. An interrupt’s memory footprint is relatively small, so keeping an interrupt on a certain CPU will not improve cache hits. Instead, multiple interrupts will keep one of the cores overloaded while the others remain relatively free. The scheduler has no idea about this state of affairs. It assumes that our interrupt handling core is as busy as any other core. As a result, you may face bottlenecks, as one of the processes or threads will occasionally run on a core that has only 90% of its power available.

Things may be even worse because core 0 often handles all interrupts by default. On busy systems, interrupts may consume as much as 30% of core 0’s power. Because we assume that all cores are equally powerful, we may find ourselves in a situation where our software system effectively uses only 70% of the total CPU power.

Who’s responsible

The APIC, or Advanced Programmable Interrupt Controller, has been an integral part of all modern x86 based systems for many years, both SP (single-processor) and MP (multi-processor). This component is responsible for delivering interrupts. It also decides which interrupt goes where, in terms of cores.

By default the APIC delivers ALL interrupts to core 0. This is the reason why /proc/interrupts will look like this on the vast majority of modern Linux systems:

         CPU0     CPU1     CPU2     CPU3
  0:   123357        0        0        0   IO-APIC-edge  timer
  8:        0        0        0        0   IO-APIC-edge  rtc
 11:        0        0        0        0  IO-APIC-level  acpi
169:        0        0        0        0  IO-APIC-level  uhci_hcd:usb1
177:        0        0        0        0  IO-APIC-level  qla2xxx
185:        0        0        0        0  IO-APIC-level  qla2xxx
193:    12252        0        0        0  IO-APIC-level  ioc0
209:        0        0        0        0  IO-APIC-level  uhci_hcd:usb2
217:      468        0        0        0  IO-APIC-level  eth0
225:      285        0        0        0  IO-APIC-level  eth1
NMI:      120       66       76       45
LOC:   123239   123220   123187   123065
ERR:        0
MIS:        0

See anything suspicious? Well, CPU0 is handling all hardware interrupts. All of them. This is the situation that you see on a system with misconfigured interrupt SMP affinity.
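To check for this condition without eyeballing the table, the per-CPU counters can be summed. What follows is only a minimal sketch in Python (the article itself uses nothing but the shell). The function name per_cpu_irq_totals and the shortened SAMPLE text are made up for the illustration, and only the numbered IRQ rows are counted.

```python
def per_cpu_irq_totals(text):
    """Sum the counters of numbered (hardware) IRQ rows for each CPU column."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()                  # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue                         # skip NMI, LOC, ERR, MIS and similar rows
        for i, value in enumerate(rest.split()[:len(cpus)]):
            if not value.isdigit():
                break                        # reached the controller/device names
            totals[i] += int(value)
    return dict(zip(cpus, totals))


# Shortened copy of the listing above, used as sample input (illustration only).
SAMPLE = """\
         CPU0     CPU1     CPU2     CPU3
  0:   123357        0        0        0   IO-APIC-edge  timer
193:    12252        0        0        0  IO-APIC-level  ioc0
217:      468        0        0        0  IO-APIC-level  eth0
225:      285        0        0        0  IO-APIC-level  eth1
NMI:      120       66       76       45
LOC:   123239   123220   123187   123065
ERR:        0
"""

if __name__ == "__main__":
    print(per_cpu_irq_totals(SAMPLE))
```

On a live system the same function could be fed with the contents of /proc/interrupts, e.g. open("/proc/interrupts").read(). If one CPU holds nearly the whole total, you are most likely looking at the situation described here.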

Simple solution for the problem

A solution for this problem has been around pretty much since the introduction of the APIC. It has several interrupt delivery and destination modes. Physical and logical. Fixed and low priority. Etc. The important fact is that it is capable of delivering interrupts to any of the cores and even of load balancing between them.

Its configuration is limited to the first eight cores. I.e. if you have more than eight cores, don’t expect any core higher than 7 to receive interrupts.

By default it operates in physical/fixed mode. This means that it will deliver a certain interrupt to a certain core. You already know that by default it is core 0. The thing is that you can easily change the core that receives a certain interrupt.

For each and every IRQ number in the first column of the /proc/interrupts file, there’s a sub-directory in /proc/irq/. That directory contains a file named smp_affinity. Using this file you can change which core handles that interrupt. Reading from this file produces a hexadecimal number, which is a bitmask with a single bit for each core. When a certain bit is set, the APIC will deliver the interrupt to the corresponding core.

Let’s see an example…

#
# cat /proc/interrupts
         CPU0     CPU1     CPU2     CPU3
  0: 19599546        0        0        0   IO-APIC-edge  timer
  8:        0        0        0        0   IO-APIC-edge  rtc
 11:        0        0        0        0  IO-APIC-level  acpi
169:        0        0        0        0  IO-APIC-level  uhci_hcd:usb1
177:        0        0        0        0  IO-APIC-level  qla2xxx
185:        0        0        0        0  IO-APIC-level  qla2xxx
193:    95337        0        0        0  IO-APIC-level  ioc0
209:        0        0        0        0  IO-APIC-level  uhci_hcd:usb2
217:   100778        0        0        0  IO-APIC-level  eth0
225:    56651        0        0        0  IO-APIC-level  eth1
NMI:      466      393      422      372
LOC: 19600453 19600434 19600401 19600279
ERR:        0
MIS:        0
#
#
# echo "2" > /proc/irq/217/smp_affinity
# cat /proc/interrupts
         CPU0     CPU1     CPU2     CPU3
  0: 19606722        0        0        0   IO-APIC-edge  timer
  8:        0        0        0        0   IO-APIC-edge  rtc
 11:        0        0        0        0  IO-APIC-level  acpi
169:        0        0        0        0  IO-APIC-level  uhci_hcd:usb1
177:        0        0        0        0  IO-APIC-level  qla2xxx
185:        0        0        0        0  IO-APIC-level  qla2xxx
193:    95349        0        0        0  IO-APIC-level  ioc0
209:        0        0        0        0  IO-APIC-level  uhci_hcd:usb2
217:   101027       49        0        0  IO-APIC-level  eth0
225:    56655        0        0        0  IO-APIC-level  eth1
NMI:      466      393      422      372
LOC: 19607629 19607610 19607577 19607455
ERR:        0
MIS:        0
#

As we can see, once we enter the magical command, CPU1 begins receiving interrupts from eth0 instead of CPU0. The echo command that changed the state of affairs is especially interesting. It is “2” that we’re echoing into the file. Writing “4” to the file would cause the eth0 interrupt to be handled by CPU2 instead of CPU1. As I already mentioned, it is a bitmask where each bit corresponds to a single CPU.
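The mask arithmetic is easy to get wrong by hand, so here is a small sketch of it in Python. The helper names cpus_to_mask and mask_to_cpus are my own invention for the illustration, not part of any tool, and the sketch assumes a machine with no more than 32 CPUs, so that the whole mask fits in one hexadecimal group.

```python
def cpus_to_mask(cpus):
    """Return the hexadecimal smp_affinity string for the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu                 # bit N stands for CPU N
    return format(mask, "x")


def mask_to_cpus(mask):
    """Return the CPU numbers whose bits are set in a hexadecimal mask string."""
    # Larger systems print the mask as comma-separated 32-bit groups;
    # dropping the commas leaves one plain hexadecimal number.
    value = int(mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    print(cpus_to_mask([1]))             # the value echoed in the example above
    print(mask_to_cpus("4"))
```

The string returned by cpus_to_mask is what you would echo into /proc/irq/<irq number>/smp_affinity as root.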

How about writing “3” into the file? In theory, this should cause the APIC to divert interrupts to CPU0 and CPU1. Unfortunately, things are a little more complicated here. It all depends on whether the APIC works in logical “destination mode” and low priority “delivery mode”. If it does, then you most likely would not be seeing CPU0 handling all interrupts in the first place. This is because when the kernel configures the APIC to work in logical/low priority modes, it automatically tells the APIC to load balance interrupts between the first eight cores.

So if on your system CPU0 handles all interrupts by default, this probably means that the APIC is not configured this way.

Ultimate solution

First of all, unfortunately there is no choice but to replace the kernel. The software that configures the APIC is part of the kernel, and the things related to the APIC are not configurable, so if we want to change them we have no choice but to fix them in the kernel. The only question is, replace the kernel with what?

I tested this with OpenSuSE 10.2, which comes with kernel 2.6.18. Installing kernel 2.6.24.3 (the latest at the moment) with OpenSuSE’s default kernel configuration (/proc/config.gz) fixes the problem. With this kernel, things look like this right from the start:

# cat /proc/interrupts
         CPU0     CPU1     CPU2     CPU3
  0:   728895   728796   728624   728895  IO-APIC-edge     timer
  8:        0        0        0        0  IO-APIC-edge     rtc
 11:        0        0        0        0  IO-APIC-fasteoi  acpi
 16:        0        0        0        0  IO-APIC-fasteoi  uhci_hcd:usb1
 19:        0        0        0        0  IO-APIC-fasteoi  uhci_hcd:usb2
 24:    14090    14090    14327    14056  IO-APIC-fasteoi  ioc0
 49:        7        9        7        8  IO-APIC-fasteoi  qla2xxx
 50:        8       12       11       10  IO-APIC-fasteoi  qla2xxx
 77:     2849     2759     2841     2827  IO-APIC-fasteoi  eth0
 78:    25072    25138    24996    24980  IO-APIC-fasteoi  eth1
NMI:        0        0        0        0
LOC:  2915270  2915256  2915228  2915092
ERR:        0

Looks good, doesn’t it? All cores handle interrupts, thus working with maximum efficiency. Now how about getting this result with just any kernel version? It appears to be doable.

There’s a kernel configuration option that stands in our way, and once it is removed you will get a similar result with probably any kernel newer than 2.6.10. The option is CONFIG_HOTPLUG_CPU. It adds support for hot-pluggable CPUs. It appears that turning this option off makes the kernel configure the APIC properly.
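Before rebuilding anything, it may be worth checking how the running kernel was configured. The sketch below is in Python and assumes the kernel exposes its configuration in /proc/config.gz, which not every kernel does. The helper names config_option and read_running_config are hypothetical, chosen for this illustration only.

```python
import gzip


def config_option(config_text, name):
    """Return the value of a kernel config option.

    Gives "y", "m" or another value for options that are set, "n" for
    options listed as "is not set", and None when the option is absent.
    """
    for line in config_text.splitlines():
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
        if line == f"# {name} is not set":
            return "n"
    return None


def read_running_config(path="/proc/config.gz"):
    """Read the configuration of the running kernel (assumes the file exists)."""
    with gzip.open(path, "rt") as config:
        return config.read()


if __name__ == "__main__":
    sample = "CONFIG_ACPI=y\n# CONFIG_HOTPLUG_CPU is not set\n"
    print(config_option(sample, "CONFIG_HOTPLUG_CPU"))
```

On a real machine you would call config_option(read_running_config(), "CONFIG_HOTPLUG_CPU") instead of using the sample text.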

Actually it is quite understandable. You see, the APIC has to be told which processors should receive interrupts. You need an additional piece of code that tells the APIC how to handle processor removals, and processor removal is one of the things that CONFIG_HOTPLUG_CPU allows you to do. I assume that this functionality was missing from earlier kernels and made it in by 2.6.24.3.

Conclusion

We saw that we can achieve really nice results by making some modifications to the kernel configuration. On a very busy system, this small configuration change can boost the server’s productivity by a large margin.

I hope you will find this information useful and use the techniques I described in this article.

71 Comments

Jim M says:

September 21, 2008 at 8:46 pm

How about the MSI/MSI-X interrupts? Is the information you provided only for “INTA”? Isn’t the goal of the MSI/MSI-X to transfer the irqs around a server?
Thanks,
Jim

- Alexander Sandler says:
  
  September 22, 2008 at 2:44 pm
  
  @Jim M
  Jim, to be honest I don’t know. MSI interrupts are delivered directly to the local APIC (which sits right in your core) and you can configure where each interrupt goes. Apparently this has nothing to do with the interrupt load balancing that the IO-APIC does. On the other hand, I saw that once the IO-APIC load balancing works well, MSI interrupts are load balanced too.
  I think my best advice to you would be to go on and check.
  
Harm-Jan says:

September 26, 2008 at 10:00 pm

I saw in another post that irqbalanced can also solve this problem, as it has on my SMP system. How do you stand with respect to this tool? (see: http://www.uwsg.iu.edu/hypermail/linux/kernel/0302.2/1861.html)
Should I use this tool, or rebuild my kernel?

- Alexander Sandler says:
  
  September 27, 2008 at 11:56 am
  
  @Harm
  I am familiar with this tool. It uses the same technique as I described in the article, i.e. it writes into /proc/irq/<irq number>/smp_affinity. As far as I know, and please don’t try to catch me here because things evolve and I could have missed something, this tool is good when you have a power save mode that turns off a certain CPU, so that interrupts have to be routed to the remaining CPUs. I.e. this tool is more for desktop machines than for servers. As such, I believe it does its job well. However, because it uses the same technique as I described, it may need the CONFIG_HOTPLUG_CPU kernel configuration trick to work.
  
Pavel says:

March 7, 2009 at 12:57 pm

Thank you very much for your article. I replaced 2.6.22 kernel with 2.6.28 and suddenly all interrupts were processed by CPU0 and performance was drastically degraded. While googling for the cause I found your article and now things are clear for me.

However, I have a question:
It is possible to disable CONFIG_HOTPLUG_CPU only if power management support is disabled as well. This means disabling ACPI support too. Is it a correct thing to do?

Thanks in advance.

Alexander Sandler says:

March 9, 2009 at 10:20 am

@Pavel
I am surprised to hear that you have problems with 28. I had kernel 24 working well, so I expected 28 to work as well. Can you please post here if you managed to solve this problem and if so, how? Thanks!
As for your question, ACPI and CPU hotplug are almost entirely unrelated to each other so you can safely disable one and keep the other.

Pavel says:

March 9, 2009 at 8:15 pm

Well, I didn’t try irqbalance or the /proc/irq/<irq number>/smp_affinity trick on 28, I’ll give it a try tomorrow.

About kernel config – there is no option in menuconfig to enable ACPI and disable HOTPLUG_CPU
As soon as I enable “power management” section under which ACPI resides, HOTPLUG_CPU is enabled automatically and cannot be deselected

Alexander Sandler says:

March 9, 2009 at 11:05 pm

@Pavel
Please give me a day or two to check this out. Is this a vanilla kernel 2.6.28 we’re talking about? I mean official Linux kernel you get from http://www.kernel.org?

Pavel says:

March 10, 2009 at 9:27 pm

Thank you for your readiness to spend some time on my problem! Please don’t be in a hurry, this is not an urgent question

Yes, I’m using vanilla kernel:
http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.28.7.tar.bz2

Thanks in advance.

Alexander Sandler says:

March 17, 2009 at 4:33 pm

@Pavel
Pavel, I’ve just installed kernel 2.6.28-7 and it works well for me. I am using the default OpenSuSE 10.2 configuration (copied it from /proc/config.gz). I have both ACPI settings and HOTPLUG_CPU enabled. Interrupt sharing works properly, i.e. all CPUs are handling all interrupts.

Pavel says:

March 17, 2009 at 10:50 pm

Thank you very much for your tests.
Alexander, are you sure your system is not running irqbalance(d) and interrupts are distributed by kernel?
If irqbalance or irqbalanced are not running indeed, could you please share your kernel config with me?

Alexander Sandler says:

March 18, 2009 at 11:09 am

@Pavel
I explicitly disabled irqbalance. Grab the config file here: http://www.alexonlinux.com/config-2.6.28.7
Note that it has both CONFIG_ACPI and CONFIG_HOTPLUG_CPU set to ‘y’. Please let me know how it is going.

Pavel says:

March 20, 2009 at 11:48 am

Once again, thank you
I hope to be able to try new kernel this weekend.

Pavel says:

March 29, 2009 at 2:27 pm

Alexander, with your config the kernel is distributing interrupts over different CPUs. I guess I’ll have to examine my kernel config closely to find what I did wrong.

Thank you!

Alexander Sandler says:

March 29, 2009 at 3:32 pm

@Pavel
I am glad to hear that. Please post here your findings – this might help someone in the future

yuanbor says:

April 17, 2009 at 3:36 am

I use the same Linux-2.6.25 kernel and the same config file. But the behavior is not the same on the AMD platform and on Intel. On the Intel machine, interrupts are distributed over all CPUs, while on the AMD machine they go only to the first one. I don’t know the reason.

Alexander Sandler says:

April 19, 2009 at 9:03 pm

@yuanbor
To be honest I don’t know how to help you. As far as I know, interrupt load balancing does work in kernel 2.6.25. I suggest that you ask vendor of the AMD platform if they know anything about this issue.

Praveen says:

August 20, 2009 at 1:58 pm

Hi
My kernel is 2.6.21 and the driver is e1000. The adapter is an Intel 82542.
I have disabled NAPI in my driver, and now I find that all the interrupts (eth0) are falling on a single CPU. If I try to divert them to other CPUs by writing to /proc/irq/n/smp_affinity, the system crashes. The default value is ffff, but the interrupts reach only one CPU at a time. I tried your option but it did not work. In my config file I have CONFIG_HOTPLUG, so I turned it off. Let me know, is this normal behaviour?

Praveen says:

August 20, 2009 at 2:03 pm

I think two interrupt handlers for one interface running on many CPUs when NAPI is disabled can create some contention, so I think all the interrupts are intentionally sent to a single CPU. Is this a correct interpretation?

Alexander Sandler says:

August 23, 2009 at 2:07 pm

Originally Posted By Praveen
Hi
My kernel is 2.6.21 and the driver is e1000. The adapter is an Intel 82542.
I have disabled NAPI in my driver, and now I find that all the interrupts (eth0) are falling on a single CPU. If I try to divert them to other CPUs by writing to /proc/irq/n/smp_affinity, the system crashes. The default value is ffff, but the interrupts reach only one CPU at a time. I tried your option but it did not work. In my config file I have CONFIG_HOTPLUG, so I turned it off. Let me know, is this normal behaviour?

No. It is not normal behavior. Yet I cannot tell you for sure because I’ve never worked with e1000.
In my opinion you should leave everything as is. Remember that serving interrupts on multiple cores causes cache misses which can be very expensive. So, I honestly don’t think you’ll get any performance boost by configuring this round-robin interrupt delivery scheme.
This entire thing is more suitable for servers.

As for your second question, the answer is no. NAPI has nothing to do with which core receives the interrupts.

Praveen says:

August 24, 2009 at 8:41 am

Thanks for the reply.
If one packet is handled by one core and another packet by another core, then each core will hold the data of the packet it handled. With my limited knowledge I do not see how a cache miss will occur: core1 will contain packet1 data, core2 will contain packet2 data and so on, so the upper layer will receive data from the corresponding cores. Please let me know the cache miss scenario in this case.

In fact this issue is for servers where one core is getting all the interrupts. If the driver uses NAPI then I understand the reason, but when it is disabled, why are all the interrupts reaching a single CPU? That is what bothers me. I assume that one packet is processed by one CPU and the next one should go to a different CPU, and in this way the load should be spread. But in my case, when I try to distribute the load, the system crashes.

Alexander Sandler says:

August 25, 2009 at 9:16 pm

@Praveen
First, I owe you an apology. I confused e1000 with eeePC (the netbook). I do work with e1000 and am well familiar with it. My suggestion to leave this whole thing alone was also a result of the same misunderstanding.

Anyway,
1. There is no problem with two completely unrelated packets. But what if two packets belong to the same TCP connection? Then two cores will attempt to access a single connection descriptor. This will cause cache misses.
How severe are the cache misses? It is a question of how much traffic your computer receives and what else your computer does. If it handles a moderate number of packets and is not very busy, then it is better to leave things as they are (even though this is not a netbook LOL). If you are really pushing it in terms of CPU consumption, IO and the rest, then it is better to find a way to spread those interrupts among all cores.

2. NAPI is a feature that reduces the number of interrupts. When your computer receives X packets per second, without NAPI it will have to handle X interrupts. With NAPI the number of interrupts will be smaller. I.e. without NAPI, a single interrupt delivers a single packet. With NAPI, a single interrupt delivers many packets. So, when all interrupts land on CPU0, it does not matter if you have NAPI or not. This is because even with NAPI, CPU0 will have to handle the packets. Bottom line is that NAPI does not affect how your interrupts are delivered. It only affects the number of interrupts.

As for the issue itself. First of all, e1000 has nothing to do with how interrupts are delivered. The component that decides which CPU should handle a certain interrupt is called the IO-APIC. It is on your motherboard. The ability of your computer to spread interrupts among all cores depends on the IO-APIC.
I can’t tell you why the trick I described in the article does not work for you. One of the possible reasons is that you have too many cores. When the IO-APIC is configured to spread interrupts among all cores, it can handle up to eight cores. If you have more than eight cores, the kernel will not configure the IO-APIC to spread interrupts. Thus the trick I described in the article will not work.
Otherwise it may be caused by a buggy BIOS or even buggy hardware.

nitr0 says:

September 10, 2009 at 2:53 pm

2.6.30 kernel: disabling HOTPLUG_CPU does nothing for me. Almost all interrupts are on cpu1 (LAN uses MSI), so the chance that is still left to use both cores in my situation is to use bonding.
IRQ stats on test machine:

# cat /proc/interrupts
           CPU0       CPU1
  0:      13320      33974   IO-APIC-edge      timer
  1:          1          7   IO-APIC-edge      i8042
  4:          0          2   IO-APIC-edge
  7:          1          0   IO-APIC-edge
  8:          1        111   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 14:        188       3289   IO-APIC-edge      pata_amd
 15:          0          0   IO-APIC-edge      pata_amd
 17:         88        750   IO-APIC-fasteoi   eth1
 20:          0          0   IO-APIC-fasteoi   ohci_hcd:usb3
 21:          0          0   IO-APIC-fasteoi   ehci_hcd:usb2
 22:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 23:          0          0   IO-APIC-fasteoi   ohci_hcd:usb4
 26:          0          0   PCI-MSI-edge      ahci
 27:        952    1733073   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
LOC:      16823      17504   Local timer interrupts
SPU:          0          0   Spurious interrupts
RES:       5590       4581   Rescheduling interrupts
CAL:         39         20   Function call interrupts
TLB:        810        784   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts

Why interrupt affinity with multiple cores is not such a good thing - Alex on Linux says:

September 17, 2009 at 2:44 pm

[…] written an article explaining this mechanism in greater detail. Yet let me remind you how it works in two […]

Alexander Sandler says:

September 17, 2009 at 2:47 pm

@nitr0
Here is a new post that may help you to decide what to do: http://www.alexonlinux.com/why-interrupt-affinity-with-multiple-cores-is-not-such-a-good-thing

nitr0 says:

September 17, 2009 at 3:21 pm

Yes, I know about cache misses and other bad things. But I really need to use all cores to process interrupts, just because there is a shaper and a pptp server on that machine, and both of them need enough CPU resources (I have up to 50 kpps on receive and on transmit).

P.S. And how about 2.4 kernels?

Alexander Sandler says:

September 28, 2009 at 11:55 am

@nitr0
Actually, I am writing another post that presents another alternative to smp affinity, in case it doesn’t work. So stay tuned.

I don’t know about kernel 2.4, but I doubt that something that doesn’t work with kernel 2.6 would work in 2.4. Give it a shot and see if it helps.

ntop » IRQ Balancing says:

December 13, 2009 at 3:56 pm

[…] SMP affinity and proper interrupt handling in Linux […]

Chris says:

January 18, 2010 at 2:43 pm

The kernel option which appears to disable IRQ sharing among CPUs is CONFIG_CGROUPS

Setting it off allows IRQs to be shared.

I have not investigated much further than that yet.

allexch says:

January 20, 2010 at 9:57 am

Hi, I’ve got a question about “Its configuration is limited to first eight cores. I.e. if you have more than eight cores, don’t expect any core higher than 7 to receive interrupts.”

I have an HP server with 16 cores (2 processors * 8 cores each) and 8 NICs. I am able to assign interrupts from all of these 8 NICs to 8 cores, so at least 6 cores are not used. I tried to find out how to overcome this restriction, but I was not successful. So I am really interested whether there is any way to start using the other cores in my OS.
——————————
HP proliant Dl360 G6
Linux version 2.6.31-gentoo-r6
——————————

Chris says:

January 20, 2010 at 10:03 am

You can set affinity to cpus higher than 8:

root@clusterdb-tp-02 ~ # egrep 'CPU|66:' /proc/interrupts
        CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7    CPU8    CPU9   CPU10   CPU11   CPU12   CPU13   CPU14   CPU15
 66: 1194464       0       0       0       0       0       0       0      42       0       0       0       0       0       0    1323   IO-APIC-level  aacraid

Chris says:

January 20, 2010 at 10:03 am

@Chris – echo 8000 > /proc/irq/66/smp_affinity

Alexander Sandler says:

January 20, 2010 at 10:17 am

@allexch
It depends on the mode in which the io-apic works. There are two modes. In logical mode, the io-apic can spread interrupts among multiple cores. This mode is limited to 8 cores. So, if you have more than 8 cores, the io-apic will not work in logical mode and will switch to physical mode.
In physical mode, the io-apic will deliver every interrupt to a single core. In this mode, the io-apic is limited to 256 cores.

Alexander Sandler says:

January 20, 2010 at 10:20 am

@Chris
I see three of your comments now, but I struggle to understand how any of them relates to the other questions in the comments and the theme of this article. Can you please elaborate a bit?

allexch says:

January 20, 2010 at 10:37 am

So if I am going to install 2*10G NICs I will certainly have problems? Because in logical mode 4 cores for one 10G NIC is too few, and in that case there’s no sense in physical mode, am I right?

Alexander Sandler says:

January 20, 2010 at 11:44 am

Originally Posted By allexch

So if I am going to install 2*10G NICs I will certainly have problems?

No. It doesn’t matter how many NICs you have. It’s all about number of cores.

Because in logical mode 4 cores for one 10G NIC is too few, and in that case there’s no sense in physical mode, am I right?

In your place I would not be concerned about the io-apic at all. 10G NICs tend to support MSI-X. This means you will have multiple interrupt vectors per NIC. So, stick to physical mode and redirect each interrupt vector to a different core. This way you’ll spread the weight of interrupt handling among multiple cores.
For more details, see this:
http://www.alexonlinux.com/msi-x-the-right-way-to-spread-interrupt-load
and this:
http://www.alexonlinux.com/why-interrupt-affinity-with-multiple-cores-is-not-such-a-good-thing
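
The redirection described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names plan_affinity and apply_plan are hypothetical, the IRQ numbers in the usage line are made up, and each vector is simply given the next core in turn, wrapping around when the vectors outnumber the cores. It also assumes no more than 32 cores, so that a mask fits in one hexadecimal group.

```python
def plan_affinity(irqs, cpu_count):
    """Map each IRQ number to a single-CPU hex mask, cycling through the CPUs."""
    return {irq: format(1 << (index % cpu_count), "x")
            for index, irq in enumerate(irqs)}


def apply_plan(plan, proc_irq="/proc/irq"):
    """Write every mask to <proc_irq>/<irq>/smp_affinity (needs root)."""
    for irq, mask in plan.items():
        with open(f"{proc_irq}/{irq}/smp_affinity", "w") as affinity:
            affinity.write(mask)


if __name__ == "__main__":
    # Hypothetical IRQ numbers of one NIC's MSI-X vectors on a 4-core machine.
    print(plan_affinity([77, 78, 79, 80, 81], 4))
```

Take the real IRQ numbers from the first column of /proc/interrupts, and call apply_plan only once the printed plan looks right.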

allexch says:

January 20, 2010 at 11:57 am

Oh. Thanks!!!

pkun says:

May 17, 2010 at 10:32 am

Hello. Thanks for the article.
We have a mystery with IRQ balancing. We checked several computers for IRQ balancing. The same Linux distribution (Ubuntu server edition, Ubuntu desktop) does IRQ balancing on some computers and doesn’t on others. We cannot find fundamental differences between these platforms. For example, we have two computers with Gigabyte motherboards (different models with AWARD BIOS) with the same APIC controller (v01 GBT GBTUACPI 42302E31 GBTU 01010101). IRQ balancing works on the first computer and doesn’t work on the other. The kernels are the same (without any changes in .config).

Then we tried to build a custom vanilla kernel 2.6.33.3. We set the maximum CPU number to 8 (the kernel checks whether the maximum CPU number is bigger than 8 and, if so, sets physical APIC mode), disabled CGROUPs, and disabled HOTPLUG_CPU (and ACPI too, it is a hard dependency), and it was useless. dmesg says the APIC is in flat mode, but there is no IRQ balancing.

Another computer has 8 CPUs and an ASUSTEK motherboard with AMI BIOS (different manufacturer and BIOS), and it doesn’t have IRQ balancing either.

Is it such a widespread problem with BIOS or hardware? We have no other ideas about it.

allexch says:

May 17, 2010 at 11:11 am

to pkun
Hi. I have got the same problem, I can’t spread interrupts between cores.
Instances:
1)
Xeon 5550, 2 CPUs with 4 cores each, and NC522SFP+ NIC (10Ge)
2.6.31-gentoo-r10
2.6.32-gentoo-r7
2.6.34-rc7-git2

2)
Core i7 with 4 cores, Intel 82576 NIC with igb 2.1.9 driver
2.6.32-gentoo-r7
None of these instances allows me to spread interrupts. I tried legacy mode, MSI and MSI-X.

pkun says:

May 17, 2010 at 12:02 pm

It’s bad that there is the same problem on a modern system (i7). One of my thoughts was that it’s a problem of ancient hardware/BIOS and that it was fixed later.

Today I tried Ubuntu with the 2.6.28 kernel and Debian with the 2.6.26 kernel. It was useless.

The experiments (on our hardware) don’t show any dependence on MSI or non-MSI interrupt sources. The systems with good IRQ balance used MSI devices successfully. On the bad systems (all interrupts on CPU0) we disabled MSI, and the MSI devices became devices with the common interrupt mechanism. So it was useless for IRQ balance too…

Probably I’ll try the 2.6.24 vanilla kernel. Alexander Sandler used it for the article.

allexch says:

May 17, 2010 at 12:24 pm

to pkun
I have 3 more HP servers with Xeon 5550 running 2.6.31-r6, and smp_affinity works there.

I tried this (2.6.31-r6) on the new servers mentioned above, and it does not work. I tried plenty of patches like https://patchwork.kernel.org/patch/73766/, and they did not help me. So I am full of curiosity: what is wrong? Is it the new motherboard, an outdated BIOS or what? (I hardly believe now that it depends on the kernel version.) But I am looking forward to your test results with the 2.6.24 vanilla kernel.

pkun says:

May 18, 2010 at 8:51 am

Yes, 2.6.24.7 doesn’t work. This kernel is rather old so I had some problems with compilation and testing (environment is newer). But I get system with loaded drivers and /proc/interrupts. There is no changes in IRQ balancing.

So I think it’s not a kernel problem or it’s very very internal kernel problem don’t dependent on CONFIG_…
I use apic=debug show_lapic=all kernel parameters and cannot find fundamental differencies in dmesg on “good” and “bad” systems. The both use IOAPIC for interrupt routing. Both use flat APIC mode, the loaded routing tables is very similar.

At the same time it’s very strange for me so many platforms have hardware/bios/apic bugs. It’s about 1/2 or 1/3 of all systems. I understand our tests and allexch’s tests is not anough for statistics but there is too many “broken” systems. The worst we have not criterion to choose “good” platform to buy. It’a a lottery.

Is it a Linux problem? Have you tried other operating systems on the broken platforms? Maybe BSD or even Windows.

allexch says:

May 18, 2010 at 10:29 am

I guess it is not a Linux problem but a hardware one! Maybe there are some problems with interrupts on the motherboard slot, but I have no idea how to investigate it.

Alexander Sandler says:

May 27, 2010 at 1:33 pm

@pkun
@allexch
Hi Guys.

Sorry it took me some time to get back to you – vacation.
There’s very little that can be done to address the problem you’re facing. If it helps, I can find where the kernel decides how to handle interrupts, but IMHO, debugging this thing is by far the worst way to address the problem. I’d go to the mobo vendor’s technical support, but the chances they’ll help with a problem of this kind are slim.
IMHO the only thing you can do is try another way to spread the interrupt load – I have a couple of posts on the subject in Related Posts above.

@pkun
Note that i7 comes with a new version of the APIC controller, called x2APIC. See here: http://www.intel.com/Assets/pdf/manual/318148.pdf
It has some advantages over xAPIC – it can direct interrupts to up to 32 cores simultaneously. But it has to be supported by the kernel (recent kernels support it, so no problems here) and enabled.
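One way to check the CPU side of this on a given machine is to look for the x2apic flag in /proc/cpuinfo. The sketch below assumes a Linux system; the function name cpu_has_x2apic is made up for illustration. Note that the flag only shows that the processor supports x2APIC. Whether the kernel actually enabled it is usually visible in the boot log, for example with dmesg | grep -i x2apic.

```shell
# Hypothetical helper: succeed if the first "flags" line of a
# cpuinfo-style file lists x2apic (CPU support only, not kernel state).
# Usage: cpu_has_x2apic [file]   (defaults to /proc/cpuinfo)
cpu_has_x2apic() {
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" | grep -qw x2apic
}

# Example:
# cpu_has_x2apic && echo "CPU supports x2APIC"
```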

Douglas says:

March 1, 2011 at 10:45 pm

Hello Alexander,

Great Post! It really answered a lot of questions that I had.

Thanks,

Douglas M.

Balancing Your Interrupts says:

July 23, 2011 at 4:02 am

[…] going to be loaded up on CPU0 and that was probably the source of my issue. He pointed me at this article to help translate to my […]

What’s the cause of high LOC interrupts? - Admins Goodies says:

August 21, 2011 at 2:48 am

[…] the process for managing multi cpu threading. See http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux for the answers you seek on how to lower it down, but basically its the way the system handles […]

Han says:

October 12, 2011 at 5:14 am

Great post. It helped me understand the logic of IRQ balancing a lot.

There is still a problem with your solution.
I have an 8-core machine. When running a web performance test tool, all interrupts on eth0 (which has IO-APIC support) went to CPU0, so I applied your solution. It does work: the interrupts are distributed to all CPUs. But the distribution is not equal; most of the interrupts still go to one core, which on my server is the last CPU core. Any clue?

Thanks,
Han

one question about local timer interrupt on tickless system says:

November 15, 2011 at 1:15 pm

[…] …. Yes, I believe it's doable via what is called SMP affinity. Try to read: http://www.alexonlinux.com/smp-affin…dling-in-linux Good […]

Sysadmin says:

January 24, 2012 at 1:24 pm

@Harm-Jan – Yes, you are right. I confirm that the irqbalance daemon does its job very well. I was having trouble with high CPU load on one of my servers until I noticed that irqbalance had died. Looking into /proc/interrupts, I saw that almost all interrupts were on core0 (out of 16 cores in total). When I restarted irqbalance, all the trouble went away.
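A quick way to spot this kind of imbalance is to total the counters in each CPU column of /proc/interrupts. A minimal sketch, assuming the usual layout (a header row of CPU names followed by one row per interrupt source); irq_totals is a hypothetical helper name, not a standard tool:

```shell
# Hypothetical helper: sum the per-CPU counters of a /proc/interrupts-style file.
# Usage: irq_totals [file]   (defaults to /proc/interrupts)
irq_totals() {
    awk '
        NR == 1 { ncpu = NF; next }   # header row: CPU0 CPU1 ...
        NF < ncpu + 1 { next }        # skip short rows such as ERR: or MIS:
        {
            # Fields 2 .. ncpu+1 hold the counters for CPU0 .. CPU(ncpu-1).
            for (i = 2; i <= ncpu + 1; i++)
                total[i - 2] += $i
        }
        END {
            for (c = 0; c < ncpu; c++)
                printf "CPU%d %.0f\n", c, total[c]
        }
    ' "${1:-/proc/interrupts}"
}

# Example:
# irq_totals
```

If one CPU's total dwarfs the others, it is worth checking whether irqbalance is still running.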

Alexander Sandler says:

February 5, 2012 at 10:38 pm

@Han
Note that it depends on the load on your network cards. More packets arriving means more interrupts, so the load may not be even. It also depends on the type of traffic. What matters here is the number of packets, not necessarily the volume of incoming traffic.
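Since the counters in /proc/interrupts are cumulative since boot, one way to see how a device's interrupts are being distributed right now is to compare two snapshots taken a few seconds apart. A rough sketch, assuming both snapshots have the same CPU columns; irq_delta is a hypothetical helper name and the IRQ number in the example is a placeholder:

```shell
# Hypothetical helper: per-CPU increase of one interrupt line between
# two saved copies of /proc/interrupts.
# Usage: irq_delta BEFORE_FILE AFTER_FILE IRQ_LABEL   (e.g. 24 or NMI)
irq_delta() {
    awk -v label="$3:" '
        FNR == 1 { file++; if (file == 1) ncpu = NF; next }   # header rows
        $1 != label { next }                                  # other IRQs
        file == 1 {
            # First snapshot: remember the counters.
            for (i = 1; i <= ncpu; i++) before[i] = $(i + 1)
            next
        }
        {
            # Second snapshot: print the difference per CPU.
            for (i = 1; i <= ncpu; i++)
                printf "CPU%d %.0f\n", i - 1, $(i + 1) - before[i]
        }
    ' "$1" "$2"
}

# Example:
# cat /proc/interrupts > /tmp/before; sleep 10; cat /proc/interrupts > /tmp/after
# irq_delta /tmp/before /tmp/after 24
```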

Gustav says:

February 17, 2012 at 11:34 am

Hi Alexander,

I have a server with 24 cores and several network cards. The NICs are in PCI-MSI-edge mode. All interrupts are handled by CPU0. The kernel is 2.6.32-32-preempt. I want to redirect the interrupts caused by eth1 to CPU1, eth2 to CPU2, eth3 to CPU3 and so on. But when I run “sudo echo 2 > smp_affinity”, permission is denied, and when I try to edit the smp_affinity file with vim I get “smp_affinity” E667: Fsync failed (and I do have enough disk space). I don’t get it.

regards
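For readers who hit the same error: the likely cause is that in sudo echo 2 > smp_affinity the redirection is performed by the calling (unprivileged) shell, not by the command run under sudo. A minimal sketch of one workaround, assuming sudo and tee are available; cpu_mask and pin_irq are hypothetical helper names, and the IRQ number in the example is a placeholder to be looked up in /proc/interrupts:

```shell
# Hypothetical helper: hex smp_affinity mask selecting a single CPU.
# Only meant for CPU numbers below 32.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# Hypothetical helper: pin IRQ $1 to CPU $2. The write is done by tee
# running under sudo, so no redirect is left to the unprivileged shell.
pin_irq() {
    cpu_mask "$2" | sudo tee "/proc/irq/$1/smp_affinity" > /dev/null
}

# Example (61 is a placeholder IRQ number):
# pin_irq 61 1    # writes mask 2, i.e. CPU1
```

An alternative with the same effect is to run the whole command, redirect included, in a root shell, for example with sudo sh -c.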
