MSI-X – the right way to spread interrupt load

When considering ways to spread interrupts from one device among multiple cores, I can’t help but mention MSI-X. The thing is that MSI-X is actually the right way to do the job.

Interrupt affinity, which I discussed here and here, has a fundamental problem: inevitable CPU cache misses. To see why, think about what happens when your computer receives a packet from the network. The packet belongs to some connection. With interrupt affinity the packet would land on core X, while chances are that the previous packet on the same TCP connection landed on core Y (X ≠ Y).

Handling the packet would require the kernel to load the TCP connection object into X’s cache. But this is so inefficient. After all, the TCP connection object is already in Y’s cache. Wouldn’t it be better to handle the second packet on core Y as well?

This is the problem with interrupt affinity. On the one hand, we want to spread interrupts to even out the load on the cores. On the other hand, simple round robin isn’t enough. The little fella that decides where each interrupt goes should be able to look into the packet and, depending on which TCP connection it belongs to, send the interrupt to the core that handles all packets that belong to this connection.

Ideally, NICs should be able to:

  1. Look into packets and identify connections.
  2. Direct the interrupt to the core that handles the connection.
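To make the difference concrete, here is a minimal sketch in Python that compares round robin with per-connection steering. It is an illustration only: the core count, the sample addresses and the use of CRC32 as the hash are assumptions of mine, not what any particular NIC does.

```python
import zlib
from itertools import cycle

NUM_CORES = 4  # assumed core count, for illustration only


def core_for_flow(src_ip, src_port, dst_ip, dst_port, num_cores=NUM_CORES):
    """Pick a core from the connection 4-tuple, so one flow always maps to one core."""
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    # CRC32 is a stand-in; real hardware uses its own hash function.
    return zlib.crc32(key) % num_cores


def cores_touched(packets, choose_core):
    """Return {flow: set of cores that handled at least one packet of that flow}."""
    touched = {}
    for flow in packets:
        touched.setdefault(flow, set()).add(choose_core(flow))
    return touched


# Two hypothetical TCP connections whose packets arrive interleaved.
flows = [("10.0.0.1", 40000, "10.0.0.2", 80),
         ("10.0.0.3", 40001, "10.0.0.2", 80)]
packets = flows * 4  # 8 packets: flow 0, flow 1, flow 0, flow 1, ...

next_core = cycle(range(NUM_CORES))
round_robin = cores_touched(packets, lambda flow: next(next_core))
by_flow = cores_touched(packets, lambda flow: core_for_flow(*flow))

for flow in flows:
    print(flow, "round robin:", sorted(round_robin[flow]),
          "per-flow:", sorted(by_flow[flow]))
```

With round robin every connection ends up being handled on more than one core, so its state keeps moving between caches. With per-flow steering each connection stays on a single core.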

Apparently, this functionality is already here. Devices that support MSI-X do exactly this.

Meet MSI-X

MSI-X is an extension to MSI. MSI replaces the good old pin-based interrupt delivery mechanism.

With pin-based delivery, each IO-APIC chip (x86 permits up to 5) has 24 legs, each connected to one or more devices. When the IO-APIC receives an interrupt, it redirects the interrupt to one of the local APICs. Each local APIC is connected to a core, and that core receives the interrupt.

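MSI provides a kind of protocol for interrupt delivery. Instead of raising a signal on a pin, a PCI card sends a message: a memory write to a special address, which reaches the local APIC without going through the IO-APIC. In theory this means that each device can have a number of interrupt vectors. In practice plain MSI is of little help here: it allows at most 32 vectors per device and is often limited to a single one. MSI-X allows up to 2048 vectors, each with its own target.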

Modern high-end network cards that support MSI-X implement multiple tx and rx queues. Each queue is tied to an interrupt vector, and each NIC has plenty of them. I checked Intel’s 82575 chipset. With the igb driver compiled properly, it has up to eight queues, four rx and four tx. Broadcom’s 5709 chipset provides eight queues (and eight interrupt vectors), each handling both rx and tx.

In kernel 2.6.24, kernel developers introduced a new member of struct sk_buff called queue_mapping. This member tells the NIC driver which queue to use when transmitting the packet.

Before transmitting a packet, the kernel decides which queue to use for it (net/core/dev.c:dev_queue_xmit()). It has two ways to do so. First, the kernel can ask the NIC driver to provide a queue number for the packet. This functionality, however, is optional in NIC drivers, and at the moment neither the Intel nor the Broadcom driver provides it. Otherwise, the kernel uses a simple hashing algorithm that produces a 16-bit number from the two IP addresses and (in case of TCP or UDP) the two port numbers. All this happens in a function named simple_tx_hash() in net/core/dev.c.
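The decision described above can be sketched in Python roughly as follows. This is not the kernel’s code: flow_hash16(), pick_tx_queue(), the dictionary that stands for a packet and the CRC32-based hash are all hypothetical stand-ins, chosen only to show the shape of the logic.

```python
import ipaddress
import zlib


def flow_hash16(src_ip, dst_ip, src_port=0, dst_port=0):
    """16-bit hash of two IP addresses and two ports (ports stay 0 for non-TCP/UDP).

    CRC32 is a stand-in here; the kernel uses its own hash function.
    """
    data = (ipaddress.ip_address(src_ip).packed
            + ipaddress.ip_address(dst_ip).packed
            + src_port.to_bytes(2, "big")
            + dst_port.to_bytes(2, "big"))
    return zlib.crc32(data) & 0xFFFF


def pick_tx_queue(packet, num_queues, driver_select=None):
    """Choose a tx queue: ask the driver if it offers a hook, otherwise hash the flow."""
    if driver_select is not None:
        # Optional driver-provided selection (hypothetical callback).
        return driver_select(packet)
    h = flow_hash16(packet["src_ip"], packet["dst_ip"],
                    packet.get("src_port", 0), packet.get("dst_port", 0))
    # Fold the hash into the range of available queues.
    return h % num_queues


# Hypothetical packet of a TCP connection.
pkt = {"src_ip": "192.0.2.1", "dst_ip": "192.0.2.2",
       "src_port": 12345, "dst_port": 80}
print("tx queue:", pick_tx_queue(pkt, num_queues=8))
```

The important property is that the result depends only on the connection’s addresses and ports, so every packet of a given connection is transmitted through the same queue.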

When receiving packets, things are even easier, because the NIC firmware and the driver decide which queue to use to hand the packet to the kernel.

Using this simple technique, the kernel and modern NICs can ensure that packets that belong to a certain connection land on a certain queue. Using interrupt affinity binding techniques you can bind a certain interrupt vector to a certain core (writing to smp_affinity, etc). Thus you can spread interrupts among multiple cores and still avoid the cache misses described above.
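As a rough sketch of that last step, the Python below finds the interrupt vectors of a multi-queue NIC in the text of /proc/interrupts and computes which mask to write to which smp_affinity file. The interface name, the queue names (eth0-rx-0 and so on) and the sample text are assumptions; real names depend on the driver. The sketch only builds a plan. Applying it means writing each mask to its file as root.

```python
def cpu_mask(core):
    """Hex bitmask with only bit `core` set, in comma-separated 32-bit groups."""
    mask = 1 << core
    groups = []
    while mask:
        groups.append(format(mask & 0xFFFFFFFF, "08x"))
        mask >>= 32
    # Most significant group first; drop leading zeros of the whole string.
    return ",".join(reversed(groups)).lstrip("0")


def queue_irqs(interrupts_text, ifname):
    """Map queue name to IRQ number for lines whose last column starts with ifname."""
    irqs = {}
    for line in interrupts_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].rstrip(":").isdigit():
            continue  # header line, or a named line such as NMI
        if fields[-1].startswith(ifname):
            irqs[fields[-1]] = int(fields[0].rstrip(":"))
    return irqs


def affinity_plan(irqs, num_cores):
    """Spread the queues over the cores: {smp_affinity path: mask to write}."""
    return {f"/proc/irq/{irq}/smp_affinity": cpu_mask(i % num_cores)
            for i, (_, irq) in enumerate(sorted(irqs.items()))}


# Hypothetical excerpt of /proc/interrupts on a two-core machine.
SAMPLE = """\
           CPU0       CPU1
 45:       1000          0   PCI-MSI-edge      eth0-rx-0
 46:          0       2000   PCI-MSI-edge      eth0-rx-1
NMI:          0          0   Non-maskable interrupts
"""

plan = affinity_plan(queue_irqs(SAMPLE, "eth0"), num_cores=2)
for path, mask in plan.items():
    print(f"echo {mask} > {path}")
```

On a real machine you would read /proc/interrupts instead of SAMPLE. Keep in mind that a daemon such as irqbalance may rewrite the affinity afterwards.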


18 Comments

  1. telenne barz says:

    Hi Alex !

    Once again, here’s a nice article… Thanks for sharing your knowledge.

    For outbound packets, the kernel builds a hash based on IP addresses and port numbers (source & destination, I suppose?) in order to bind the corresponding flow to a given TX queue. I was wondering if the hash is built in the same manner for inbound packets / RX queues?

    What I understand is that the driver is in charge of binding a given ingress flow to a given RX queue. Does that mean that the sysadmin cannot configure it a posteriori (with ethtool for instance) ?

    “Using interrupt affinity binding techniques you can bind certain interrupt vector to certain core”: can you please give us further details on how to set that up? Does that mean that each queue will appear as a particular device under /proc/interrupts?

    Finally, did you hear about the TNAPI and PF_RING patches of Luca Deri (http://www.ntop.org/TNAPI.html)? If the MSI-X feature is already implemented in the concerned drivers (Intel igb, igbx), I don’t see what the benefit of the TNAPI patch is. What is your opinion about this?

    Telenn

  2. ninez says:

    MSI-X is great, and is also now used by default in the linux kernel.

    2.6.33 and onwards!

    no more sharing irq on my laptop.

    great article!

  3. @ninez
    Thanks for sharing your experience and for a warm comment. Please come again! :-)

  4. marvniek says:

    Great article Alex, thanks a lot

  5. I have not checked in here for a while as I thought it was getting boring, but the last few posts are good quality so I guess I’ll add you back to my everyday bloglist. You deserve it friend :)

  6. Are you aware if they make any plugins to help with Web optimization? I’m trying to get my weblog to rank for some targeted search phrases but I’m not seeing encouraging gains. In case you know of any make sure you let me know. It would mean a lot

  7. Nice post. I was checking continuously this blog and I am impressed! Extremely helpful info particularly the last part :) I care for such info much. I was looking for this certain information for a very long time. Thank you and best of luck.

  8. David Gray says:

    It’s a shame you don’t have a donate button! I’d most certainly donate to this outstanding blog! I guess for now i’ll settle for book-marking and adding your RSS feed to my Google account. I look forward to fresh updates and will talk about this blog with my Facebook group. Chat soon!

  9. สิว says:

    Thank for your articles, I will share your link in my facebook.
    This is very nice for me.

  10. Mitra (India) says:

    Thanks for the knowledge, god bless you.

  11. Anil says:

    Quite an informative article. Thanks!
    While the network stack has been modified to take advantages of the multiple queues provided by devices, is there something similar planned for storage side traffic as well?
    Something similar to what other OSes (VMWare, Windows have to offer).

