                The MSI Driver Guide HOWTO
        Tom L Nguyen tom.l.nguyen@intel.com
                        10/03/2003
        Revised Feb 12, 2004 by Martine Silbermann
                email: Martine.Silbermann@hp.com
        Revised Jun 25, 2004 by Tom L Nguyen

1. About this guide

This guide describes the basics of Message Signaled Interrupts (MSI),
the advantages of using MSI over traditional interrupt mechanisms,
and how to enable your driver to use MSI or MSI-X. Also included is
a list of Frequently Asked Questions.

2. Copyright 2003 Intel Corporation

3. What is MSI/MSI-X?

Message Signaled Interrupt (MSI), as described in the PCI Local Bus
Specification Revision 2.3 or later, is an optional feature for PCI
devices and a required feature for PCI Express devices. MSI enables a
device function to request service by sending an Inbound Memory Write
on its PCI bus to the FSB as a Message Signaled Interrupt transaction.
Because MSI is generated in the form of a Memory Write, all transaction
conditions, such as a Retry, Master-Abort, Target-Abort or normal
completion, are supported.

A PCI device that supports MSI must also support the pin IRQ assertion
interrupt mechanism to provide backward compatibility for systems that
do not support MSI. In systems that support MSI, the bus driver is
responsible for initializing the message address and message data of
the device function's MSI/MSI-X capability structure during initial
device configuration.

An MSI capable device function indicates MSI support by implementing
the MSI/MSI-X capability structure in its PCI capability list. The
device function may implement both the MSI capability structure and
the MSI-X capability structure; however, the bus driver should not
enable both.

The MSI capability structure contains the Message Control register,
the Message Address register and the Message Data register. These
registers provide the bus driver control over MSI. The Message Control
register indicates the MSI capability supported by the device. The
Message Address register specifies the target address and the Message
Data register specifies the characteristics of the message. To request
service, the device function writes the content of the Message Data
register to the target address. The device and its software driver
are prohibited from writing to these registers.
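
As a rough illustration of what the Message Control register conveys,
the sketch below decodes its fields in plain C. The bit positions are
taken from the PCI specification's MSI capability layout rather than
from this guide, and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the 16-bit MSI Message Control register.
 * Names are illustrative only; they are not kernel definitions. */
struct msi_ctrl {
    bool     enabled;          /* bit 0: MSI Enable */
    unsigned vectors_capable;  /* bits 3:1 hold log2 of vectors the function can use */
    unsigned vectors_enabled;  /* bits 6:4 hold log2 of vectors software enabled */
    bool     addr64;           /* bit 7: 64-bit message address capable */
    bool     per_vector_mask;  /* bit 8: per-vector masking capable */
};

static struct msi_ctrl msi_ctrl_decode(uint16_t reg)
{
    struct msi_ctrl c;

    c.enabled         = (reg & 0x0001) != 0;
    /* Encodings 0..5 mean 1..32 vectors; 6 and 7 are reserved. */
    c.vectors_capable = 1u << ((reg >> 1) & 0x7);
    c.vectors_enabled = 1u << ((reg >> 4) & 0x7);
    c.addr64          = (reg & 0x0080) != 0;
    c.per_vector_mask = (reg & 0x0100) != 0;
    return c;
}
```

A driver normally never decodes this register itself; the sketch only
shows which capabilities the bus driver can read from it.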

The MSI-X capability structure is an optional extension to MSI. It
uses an independent and separate capability structure. There are
some key advantages to implementing the MSI-X capability structure
over the MSI capability structure as described below.

        - Support a larger maximum number of vectors per function.

        - Provide the ability for system software to configure
        each vector with an independent message address and message
        data, specified by a table that resides in Memory Space.

        - MSI and MSI-X both support per-vector masking. Per-vector
        masking is an optional extension of MSI but a required
        feature for MSI-X. Per-vector masking provides the kernel
        the ability to mask/unmask MSI when servicing its software
        interrupt service routine handler. If per-vector masking is
        not supported, then the device driver should provide the
        hardware/software synchronization to ensure that the device
        generates MSI when the driver wants it to do so.
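
The table in Memory Space and the per-vector mask can be pictured with
a small sketch. The 16-byte entry layout and the position of the mask
bit come from the PCI specification, not from this guide; the names
are hypothetical, and the code operates on ordinary memory rather than
on a real device's MMIO region.

```c
#include <stdint.h>

/* One MSI-X table entry: independent address and data per vector. */
struct msix_table_entry {
    uint32_t msg_addr_lo;   /* message address, lower 32 bits */
    uint32_t msg_addr_hi;   /* message address, upper 32 bits */
    uint32_t msg_data;      /* message data */
    uint32_t vector_ctrl;   /* bit 0 is the per-vector mask bit */
};

#define MSIX_VECTOR_CTRL_MASKBIT 0x1u

static void msix_mask(struct msix_table_entry *e)
{
    e->vector_ctrl |= MSIX_VECTOR_CTRL_MASKBIT;
}

static void msix_unmask(struct msix_table_entry *e)
{
    e->vector_ctrl &= ~MSIX_VECTOR_CTRL_MASKBIT;
}

static int msix_is_masked(const struct msix_table_entry *e)
{
    return (e->vector_ctrl & MSIX_VECTOR_CTRL_MASKBIT) != 0;
}
```

Masking one entry leaves every other entry untouched, which is what
lets the kernel mask a single vector while its handler runs.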

4. Why use MSI?

MSI simplifies board design: it allows board designers to remove
out-of-band interrupt routing. MSI is another step towards a
legacy-free environment.

Due to increasing pressure on chipset and processor packages to
reduce pin count, the need for interrupt pins is expected to
diminish over time. Devices, due to pin constraints, may implement
messages to increase performance.

PCI Express endpoints use INTx emulation (in-band messages) instead
of IRQ pin assertion. Using INTx emulation requires interrupt
sharing among devices connected to the same node (PCI bridge), while
MSI is unique (non-shared) and does not require BIOS configuration
support. As a result, the PCI Express technology requires MSI
support for better interrupt performance.

Using MSI enables the device functions to support two or more
vectors, which can be configured to target different CPUs to
increase scalability.

5. Configuring a driver to use MSI/MSI-X

By default, the kernel will not enable MSI/MSI-X on all devices that
support this capability. The CONFIG_PCI_MSI kernel option
must be selected to enable MSI/MSI-X support.

5.1 Including MSI/MSI-X support into the kernel

To allow MSI/MSI-X capable device drivers to selectively enable
MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described
below), the VECTOR based scheme needs to be enabled by setting
CONFIG_PCI_MSI during kernel config.

Since the target of the inbound message is the local APIC,
CONFIG_X86_LOCAL_APIC must be enabled along with CONFIG_PCI_MSI.

5.2 Configuring for MSI support

Due to the non-contiguous fashion in vector assignment of the
existing Linux kernel, this version does not support multiple
messages, regardless of whether a device function is capable of
supporting more than one vector. Enabling MSI on a device function's
MSI capability structure requires a device driver to call the
function pci_enable_msi() explicitly.

5.2.1 API pci_enable_msi

int pci_enable_msi(struct pci_dev *dev)

With this new API, any existing device driver that wants MSI enabled
on its device function must call this API to enable MSI. A successful
call will initialize the MSI capability structure with ONE vector,
regardless of whether a device function is capable of supporting
multiple messages. This vector replaces the pre-assigned dev->irq
with a new MSI vector. To avoid a conflict between the newly assigned
vector and the existing pre-assigned vector, a device driver must
call this API before calling request_irq().

5.2.2 API pci_disable_msi

void pci_disable_msi(struct pci_dev *dev)

This API should always be used to undo the effect of pci_enable_msi()
when a device driver is unloading. This API restores dev->irq with
the pre-assigned IOAPIC vector and switches a device's interrupt
mode to PCI pin-irq assertion/INTx emulation mode.

Note that a device driver should always call free_irq() on the MSI
vector it has done request_irq() on before calling this API. Failure
to do so results in a BUG_ON(), and the device will be left with MSI
enabled and will leak its vector.
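
The call ordering required by 5.2.1 and 5.2.2 can be sketched in
userspace as follows. Everything here is a stand-in: the mock_*
functions, the reduced struct, the vector number 193 and the demo_*
entry points are assumptions made for illustration and are not the
kernel's definitions.

```c
/* Userspace stand-in for struct pci_dev; NOT the kernel structure. */
struct mock_pci_dev {
    unsigned int irq;        /* vector the driver passes to request_irq() */
    unsigned int saved_irq;  /* pre-assigned vector kept by the "subsystem" */
    int msi_on;              /* 1 while in MSI mode */
    int handlers;            /* handlers currently registered */
};

static int mock_pci_enable_msi(struct mock_pci_dev *dev)
{
    dev->saved_irq = dev->irq;  /* save the pre-assigned vector */
    dev->irq = 193;             /* hypothetical newly assigned MSI vector */
    dev->msi_on = 1;
    return 0;
}

static void mock_pci_disable_msi(struct mock_pci_dev *dev)
{
    if (dev->handlers > 0)
        return;                 /* the real kernel would BUG_ON() here */
    dev->irq = dev->saved_irq;  /* restore the pre-assigned vector */
    dev->msi_on = 0;
}

static int mock_request_irq(struct mock_pci_dev *dev)
{
    dev->handlers++;
    return 0;
}

static void mock_free_irq(struct mock_pci_dev *dev)
{
    dev->handlers--;
}

/* Load path: enable MSI first, then register the handler on dev->irq. */
static int demo_probe(struct mock_pci_dev *dev)
{
    mock_pci_enable_msi(dev);   /* on failure dev->irq keeps the legacy vector */
    return mock_request_irq(dev);
}

/* Unload path: free the handler first, then leave MSI mode. */
static void demo_remove(struct mock_pci_dev *dev)
{
    mock_free_irq(dev);
    mock_pci_disable_msi(dev);
}
```

Swapping the two calls in demo_remove() leaves the mock device in MSI
mode, which mirrors the leaked-vector case described above.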

5.2.3 MSI mode vs. legacy mode diagram

The diagram below shows the events that switch the interrupt
mode on the MSI-capable device function between MSI mode and
PIN-IRQ assertion mode.

         ------------   pci_enable_msi   ------------------------
        |            | <=============== |                        |
        | MSI MODE   |                  | PIN-IRQ ASSERTION MODE |
        |            | ===============> |                        |
         ------------   pci_disable_msi  ------------------------


Figure 1.0 MSI Mode vs. Legacy Mode

In Figure 1.0, a device operates by default in legacy mode. Legacy
in this context means PCI pin-irq assertion or PCI-Express INTx
emulation. A successful MSI request (using pci_enable_msi()) switches
a device's interrupt mode to MSI mode. The pre-assigned IOAPIC vector
stored in dev->irq will be saved by the PCI subsystem and a newly
assigned MSI vector will replace dev->irq.

To return to its default mode, a device driver should always call
pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a
device driver should always call free_irq() on the MSI vector it has
done request_irq() on before calling pci_disable_msi(). Failure to do
so results in a BUG_ON(), and the device will be left with MSI enabled
and will leak its vector. Otherwise, the PCI subsystem restores a
device's dev->irq with the pre-assigned IOAPIC vector and marks the
released MSI vector as unused.

Once a vector is marked as unused, there is no guarantee that the PCI
subsystem will reserve this MSI vector for a device. Depending on
the availability of current PCI vector resources and the number of
MSI/MSI-X requests from other drivers, this MSI vector may be
re-assigned.

If the PCI subsystem has re-assigned this MSI vector to another
driver, a request to switch back to MSI mode may result in being
assigned a different MSI vector, or in a failure if no more vectors
are available.

5.3 Configuring for MSI-X support

Because system software can configure each vector of the MSI-X
capability structure with an independent message address and message
data, the non-contiguous fashion in vector assignment of the existing
Linux kernel has no impact on supporting multiple messages on an
MSI-X capable device function. Enabling MSI-X on a device function's
MSI-X capability structure requires its device driver to call the
function pci_enable_msix() explicitly.

The function pci_enable_msix(), once invoked, enables either
all or nothing, depending on the current availability of PCI vector
resources. If the PCI vector resources are available for the number
of vectors requested by a device driver, this function will configure
the MSI-X table of the MSI-X capability structure of a device with
the requested messages. For example, a device may be capable of
supporting a maximum of 32 vectors while its software driver may
request only 4 vectors. It is recommended that the device driver
call this function once during its initialization phase.

Unlike the function pci_enable_msi(), the function pci_enable_msix()
does not replace the pre-assigned IOAPIC dev->irq with a new MSI
vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
into the field vector of each element contained in the second argument.
Note that the pre-assigned IOAPIC dev->irq is valid only if the device
operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt by the
device driver to use dev->irq to request interrupt service may result
in unpredictable behavior.

For each MSI-X vector granted, a device driver is responsible for
calling other functions like request_irq(), enable_irq(), etc. to
enable this vector with its corresponding interrupt service handler.
It is a device driver's choice to assign all vectors the same
interrupt service handler or each vector a unique interrupt service
handler.

5.3.1 Handling MMIO address space of MSI-X Table

The PCI 3.0 specification has implementation notes stating that the
MMIO address space for a device's MSI-X structure should be isolated
so that system software can set different pages for controlling
accesses to the MSI-X structure. The implementation of the MSI patch
requires the PCI subsystem, not a device driver, to maintain full
control of the MSI-X table/MSI-X PBA and the MMIO address space of the
MSI-X table/MSI-X PBA. A device driver is prohibited from requesting
the MMIO address space of the MSI-X table/MSI-X PBA. Otherwise, the
PCI subsystem will fail to enable MSI-X on its hardware device when
the driver calls the function pci_enable_msix().

5.3.2 Handling MSI-X allocation

The number of MSI-X vectors allocated to a function depends on the
number of MSI capable devices and MSI-X capable devices populated in
the system. The policy of allocating MSI-X vectors to a function is
defined as follows:

#of MSI-X vectors allocated to a function = (x - y)/z where

x =     The number of available PCI vector resources by the time
        the device driver calls pci_enable_msix(). The PCI vector
        resources are the sum of the number of unassigned vectors
        (new) and the number of released vectors when any MSI/MSI-X
        device driver switches its hardware device back to a legacy
        mode or is hot-removed. The number of unassigned vectors
        may exclude some vectors reserved, as defined in parameter
        NR_HP_RESERVED_VECTORS, for the case where the system is
        capable of supporting hot-add/hot-remove operations. Users
        may change the value defined in NR_HP_RESERVED_VECTORS to
        meet their specific needs.

y =     The number of MSI capable devices populated in the system.
        This policy ensures that each MSI capable device has its
        vector reserved to avoid the case where some MSI-X capable
        drivers may attempt to claim all available vector resources.

z =     The number of MSI-X capable devices populated in the system.
        This policy ensures that maximum (x - y) is distributed
        evenly among MSI-X capable devices.

Note that the PCI subsystem scans y and z during a bus enumeration.
When the PCI subsystem completes configuring the MSI/MSI-X capability
structure of a device as requested by its device driver, y or z is
decremented accordingly.
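
The policy above amounts to a single division, sketched below. The
function name is hypothetical, and both the rounding down and the
handling of z = 0 are assumptions; the guide does not state how the
PCI subsystem treats a remainder.

```c
/* Vectors granted to one MSI-X function under the (x - y)/z policy.
 *   x: available PCI vector resources at the time of the call
 *   y: MSI capable devices populated in the system
 *   z: MSI-X capable devices populated in the system
 */
static int msix_vectors_per_function(int x, int y, int z)
{
    if (z <= 0 || x <= y)
        return 0;        /* assumed: nothing to hand out */
    return (x - y) / z;  /* assumed: integer division, rounding down */
}
```

With the made-up figures x = 100, y = 4 and z = 6, each MSI-X function
would be allotted (100 - 4)/6 = 16 vectors.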

5.3.3 Handling MSI-X shortages

When fewer MSI-X vectors are allocated to a function than requested,
the function pci_enable_msix() will return the maximum number of
MSI-X vectors available to the caller. A device driver may re-send
its request with a number of vectors less than or equal to the value
returned. For example, if a device driver requests 5 vectors, but
only 3 vectors are available, a value of 3 will be returned by the
pci_enable_msix() call. A function could be designed for its driver
to use only 3 MSI-X table entries in different combinations such as
ABC--, A-B-C, A--CB, etc. Note that this patch does not support
multiple entries with the same vector. Any attempt by a device driver
to use 5 MSI-X table entries with 3 vectors as ABBCC, AABCC, BCCBA,
etc. will result in a failure of the function pci_enable_msix().
Below are the reasons why supporting multiple entries with the same
vector is an undesirable solution.

        - The PCI subsystem cannot determine which entry generated
          the message in order to mask/unmask MSI while handling
          the software driver ISR. Attempting to walk through all
          MSI-X table entries (2048 max) to mask/unmask any matching
          vector is an undesirable solution.

        - Walking through all MSI-X table entries (2048 max) to handle
          SMP affinity of any matching vector is an undesirable solution.
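
The retry pattern described at the start of this section can be
sketched as follows. mock_pci_enable_msix() only imitates the return
convention (zero, positive count, negative error) and takes a made-up
"avail" argument; it is not the kernel function, and the helper name
is hypothetical.

```c
/* Mock of the return convention only; not the kernel API.
 * avail: hypothetical number of vectors that can be granted. */
static int mock_pci_enable_msix(int avail, int nvec)
{
    if (nvec <= 0 || avail <= 0)
        return -1;     /* failure: nothing can be granted */
    if (nvec > avail)
        return avail;  /* shortage: report what is available */
    return 0;          /* success: all nvec vectors enabled */
}

/* Returns the number of vectors finally enabled, or a negative value. */
static int enable_msix_with_retry(int avail, int wanted)
{
    int nvec = wanted;

    for (;;) {
        int rc = mock_pci_enable_msix(avail, nvec);

        if (rc == 0)
            return nvec;  /* request granted in full */
        if (rc < 0)
            return rc;    /* hard failure, give up */
        nvec = rc;        /* shortage: retry with the reported count */
    }
}
```

A driver that asks for 5 vectors when only 3 are available ends up
with 3 after one retry, matching the example in the text.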

5.3.4 API pci_enable_msix

int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec)

This API enables a device driver to request that the PCI subsystem
enable MSI-X messages on its hardware device. Depending on
the availability of PCI vector resources, the PCI subsystem enables
either all or nothing.

Argument dev points to the device (pci_dev) structure.

Argument entries is a pointer of unsigned integer type. The number of
elements is indicated in argument nvec. The content of each element
will be mapped to the following struct defined in drivers/pci/msi.h.

struct msix_entry {
        u16     vector; /* kernel uses to write alloc vector */
        u16     entry; /* driver uses to specify entry */
};

A device driver is responsible for initializing the field entry of
each element with a unique entry supported by the MSI-X table.
Otherwise, -EINVAL will be returned as a result. A successful return
of zero indicates that the PCI subsystem has completed initializing
each of the requested entries of the MSI-X table with message address
and message data. Last but not least, the PCI subsystem will write
the 1:1 vector-to-entry mapping into the field vector of each element.
A device driver is responsible for keeping track of allocated MSI-X
vectors in its internal data structure.

Argument nvec is an integer indicating the number of messages
requested.

A return of zero indicates that the requested number of MSI-X vectors
was successfully allocated. A return of greater than zero indicates
an MSI-X vector shortage. A return of less than zero indicates
a failure. This failure may be a result of duplicate entries
specified in the second argument, or a result of no available vector,
or a result of failing to initialize MSI-X table entries.
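
The uniqueness rule for the field entry can be checked before calling
the API. The sketch below is one possible pre-check in plain C; the
struct is a userspace copy of the layout shown above, and the function
name and the table_size parameter are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace copy of the two-field layout, for illustration only. */
struct msix_entry_demo {
    uint16_t vector;  /* written by the kernel on success */
    uint16_t entry;   /* chosen by the driver */
};

/* Returns 0 when every 'entry' is unique and below table_size,
 * otherwise -EINVAL, mirroring the rule stated in the text. */
static int check_msix_entries(const struct msix_entry_demo *e,
                              int nvec, int table_size)
{
    for (int i = 0; i < nvec; i++) {
        if (e[i].entry >= table_size)
            return -EINVAL;       /* entry not supported by the table */
        for (int j = i + 1; j < nvec; j++)
            if (e[i].entry == e[j].entry)
                return -EINVAL;   /* duplicate entry */
    }
    return 0;
}
```

The quadratic scan is adequate for the handful of vectors a driver
typically requests; a bitmap would suit larger requests better.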

5.3.5 API pci_disable_msix

void pci_disable_msix(struct pci_dev *dev)

This API should always be used to undo the effect of pci_enable_msix()
when a device driver is unloading. Note that a device driver should
always call free_irq() on all MSI-X vectors it has done request_irq()
on before calling this API. Failure to do so results in a BUG_ON(),
and the device will be left with MSI-X enabled and will leak its
vectors.

5.3.6 MSI-X mode vs. legacy mode diagram

The diagram below shows the events that switch the interrupt
mode on the MSI-X capable device function between MSI-X mode and
PIN-IRQ assertion mode (legacy).

         ------------   pci_enable_msix(,,n) ------------------------
        |            | <===============     |                        |
        | MSI-X MODE |                      | PIN-IRQ ASSERTION MODE |
        |            | ===============>     |                        |
         ------------   pci_disable_msix     ------------------------

Figure 2.0 MSI-X Mode vs. Legacy Mode

In Figure 2.0, a device operates by default in legacy mode. A
successful MSI-X request (using pci_enable_msix()) switches a
device's interrupt mode to MSI-X mode. The pre-assigned IOAPIC vector
stored in dev->irq will be saved by the PCI subsystem; however,
unlike MSI mode, the PCI subsystem will not replace dev->irq with an
assigned MSI-X vector because the PCI subsystem already writes the 1:1
vector-to-entry mapping into the field vector of each element
specified in the second argument.

To return to its default mode, a device driver should always call
pci_disable_msix() to undo the effect of pci_enable_msix(). Note that
a device driver should always call free_irq() on all MSI-X vectors it
has done request_irq() on before calling pci_disable_msix(). Failure
to do so results in a BUG_ON(), and the device will be left with MSI-X
enabled and will leak its vectors. Otherwise, the PCI subsystem
switches a device function's interrupt mode from MSI-X mode to legacy
mode and marks all allocated MSI-X vectors as unused.

Once vectors are marked as unused, there is no guarantee that the PCI
subsystem will reserve these MSI-X vectors for a device. Depending on
the availability of current PCI vector resources and the number of
MSI/MSI-X requests from other drivers, these MSI-X vectors may be
re-assigned.

If the PCI subsystem has re-assigned these MSI-X vectors to other
drivers, a request to switch back to MSI-X mode may result in being
assigned another set of MSI-X vectors, or in a failure if no more
vectors are available.

5.4 Handling function implementing both MSI and MSI-X capabilities

For the case where a function implements both MSI and MSI-X
capabilities, the PCI subsystem enables a device to run either in MSI
mode or MSI-X mode but not both. A device driver determines whether it
wants MSI or MSI-X enabled on its hardware device. Once a device
driver requests MSI, for example, it is prohibited from requesting
MSI-X; in other words, a device driver is not permitted to ping-pong
between MSI mode and MSI-X mode at run time.

5.5 Hardware requirements for MSI/MSI-X support

MSI/MSI-X support requires support from both system hardware and
individual hardware device functions.

5.5.1 System hardware support

Since the target of the MSI address is the CPU's local APIC, enabling
MSI/MSI-X support in the Linux kernel depends on whether the existing
system hardware supports the local APIC. Users should verify that
their system runs with CONFIG_X86_LOCAL_APIC=y.

In an SMP environment, CONFIG_X86_LOCAL_APIC is automatically set;
however, in a UP environment, users must manually set
CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
CONFIG_PCI_MSI enables the VECTOR based scheme and
the option for MSI-capable device drivers to selectively enable
MSI/MSI-X.

Note that the CONFIG_X86_IO_APIC setting is irrelevant because an
MSI/MSI-X vector is newly allocated at run time and MSI/MSI-X support
does not depend on BIOS support. This key independence enables
MSI/MSI-X support on future IOxAPIC-free platforms.

5.5.2 Device hardware support

The hardware device function supports MSI by indicating the
MSI/MSI-X capability structure on its PCI capability list. By
default, this capability structure will not be initialized by
the kernel to enable MSI during the system boot. In other words,
the device function is running in its default pin assertion mode.
Note that in many cases hardware supporting MSI has bugs,
which may result in a system hang. The software driver of specific
MSI-capable hardware is responsible for deciding whether to call
pci_enable_msi() or not. A return of zero indicates that the kernel
successfully initialized the MSI/MSI-X capability structure of the
device function. The device function is now running in MSI/MSI-X mode.

5.6 How to tell whether MSI/MSI-X is enabled on device function

At the driver level, a return of zero from the function call of
pci_enable_msi()/pci_enable_msix() indicates to a device driver that
its device function is initialized successfully and ready to run in
MSI/MSI-X mode.

At the user level, users can use the command 'cat /proc/interrupts'
to display the vector allocated for a device and its interrupt
MSI/MSI-X mode ("PCI MSI"/"PCI MSIX"). The output below shows MSI
mode enabled on a SCSI Adaptec 39320D Ultra320.

           CPU0       CPU1
  0:     324639          0    IO-APIC-edge  timer
  1:       1186          0    IO-APIC-edge  i8042
  2:          0          0          XT-PIC  cascade
 12:       2797          0    IO-APIC-edge  i8042
 14:       6543          0    IO-APIC-edge  ide0
 15:          1          0    IO-APIC-edge  ide1
169:          0          0   IO-APIC-level  uhci-hcd
185:          0          0   IO-APIC-level  uhci-hcd
193:        138         10         PCI MSI  aic79xx
201:         30          0         PCI MSI  aic79xx
225:         30          0   IO-APIC-level  aic7xxx
233:         30          0   IO-APIC-level  aic7xxx
NMI:          0          0
LOC:     324553     325068
ERR:          0
MIS:          0
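
A script or small tool can make the same check by looking for the mode
labels in each line. The sketch below assumes the labels are exactly
"PCI MSI" and "PCI MSIX" as quoted above; the enum and function names
are hypothetical.

```c
#include <string.h>

enum irq_mode { IRQ_MODE_OTHER, IRQ_MODE_MSI, IRQ_MODE_MSIX };

/* Classify one line of /proc/interrupts by its interrupt-type label. */
static enum irq_mode classify_irq_line(const char *line)
{
    /* Test the longer label first, since "PCI MSI" is a prefix of it. */
    if (strstr(line, "PCI MSIX") != NULL)
        return IRQ_MODE_MSIX;
    if (strstr(line, "PCI MSI") != NULL)
        return IRQ_MODE_MSI;
    return IRQ_MODE_OTHER;
}
```

Applied to the listing above, the two aic79xx lines classify as MSI
and every other line as neither MSI nor MSI-X.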

6. FAQ

Q1. Are there any limitations on using MSI?

A1. If the PCI device supports MSI and conforms to the
specification and the platform supports the APIC local bus,
then using MSI should work.

Q2. Will it work on all the Pentium processors (P3, P4, Xeon,
AMD processors)? In P3, IPIs are transmitted on the APIC local
bus, and in P4 and Xeon they are transmitted on the system
bus. Are there any implications with this?

A2. MSI support enables a PCI device to send an inbound
memory write (0xfeexxxxx as target address) on its PCI bus
directly to the FSB. Since the message address has the
redirection hint bit cleared, it should work.
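
For readers curious about the 0xfeexxxxx address, the sketch below
builds one in plain C. The field positions (destination ID in bits
19:12, redirection hint in bit 3, destination mode in bit 2) follow
the x86 architecture manuals, not this guide, and all macro and
function names are hypothetical.

```c
#include <stdint.h>

#define MSI_ADDR_BASE        0xfee00000u  /* bits 31:20 are fixed at 0xfee */
#define MSI_ADDR_DEST_SHIFT  12           /* destination APIC ID, bits 19:12 */
#define MSI_ADDR_RH_BIT      (1u << 3)    /* redirection hint */
#define MSI_ADDR_DM_BIT      (1u << 2)    /* destination mode */

/* Build a message address for one APIC ID with the redirection hint
 * cleared, as the answer above describes. */
static uint32_t msi_addr_compose(uint8_t dest_apic_id)
{
    return MSI_ADDR_BASE | ((uint32_t)dest_apic_id << MSI_ADDR_DEST_SHIFT);
}

/* True when addr falls in the 0xfeexxxxx window. */
static int msi_addr_in_window(uint32_t addr)
{
    return (addr & 0xfff00000u) == MSI_ADDR_BASE;
}
```

The bus driver, not the device driver, programs this value into the
Message Address register.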

Q3. The target address 0xfeexxxxx will be translated by the
Host Bridge into an interrupt message. Are there any
limitations on the chipsets such as Intel 8xx, Intel e7xxx,
or VIA?

A3. If these chipsets support an inbound memory write with
the target address set as 0xfeexxxxx, in conformance with PCI
specification 2.3 or later, then it should work.

Q4. From the driver point of view, if the MSI is lost because
errors occur during the inbound memory write, then the driver
may wait forever. Is there a mechanism for it to recover?

A4. Since the target of the transaction is an inbound memory
write, all transaction termination conditions (Retry,
Master-Abort, Target-Abort, or normal completion) are
supported. A device sending an MSI must abide by all the PCI
rules and conditions regarding that inbound memory write. So,
if a retry is signaled it must retry, etc. We believe that
the recommendation for Abort is also a retry (refer to PCI
specification 2.3 or later).