Saturday, January 10, 2015

[System] Emulate a PCI device with Qemu


I wanted to learn more about Linux drivers. To write your own driver, though, you need to own the device you want to drive, and if it doesn't exist (or if it's not stable yet), you are basically screwed. But do not panic, Qemu is here for you: you just need to create the device yourself.

I will cover the following things:
  • A PCI device supporting interrupts, MMIO and PIO regions, DMA
  • A Linux PCI driver supporting this device (kernel 3.2) 
The code of the device can be found at https://github.com/grandemk/qemu_devices/blob/master/hello_tic.c

The code of the driver can be found at https://github.com/grandemk/qemu_devices/blob/master/driver_pci.c

The Qemu PCI Device

 

Here is the flow of a Qemu device's code (see [1] for a good explanation).



The first thing Qemu does is register our device in its core. This is done with the type_init macro, whose generated code runs before Qemu's main. What's important here is that the TypeInfo struct describing our device is passed along.
static const TypeInfo pci_hello_info = {
    .name           = TYPE_PCI_HELLO_DEV,
    .parent         = TYPE_PCI_DEVICE,
    .instance_size  = sizeof(PCIHelloDevState),
    .class_init     = pci_hellodev_class_init,
};
As our device is a PCI peripheral, it extends the PCIDevice struct. The name, parent and instance_size fields exist for inheritance purposes and are implementation details.
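To make the roles of name, parent and instance_size more concrete, here is a minimal standalone sketch of the idea. It is not the real QOM API: every Toy* name is hypothetical, and the constructor registers the type directly, whereas Qemu (as far as I understand) only queues the registration at that point. It assumes GCC or Clang for the constructor attribute.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for Qemu's types; this is not the real QOM API. */
typedef struct ToyTypeInfo {
    const char *name;
    const char *parent;
    size_t instance_size;
} ToyTypeInfo;

typedef struct ToyPCIDevice {
    unsigned char config[256];
} ToyPCIDevice;

typedef struct ToyHelloDev {
    ToyPCIDevice parent_obj; /* first member: a ToyHelloDev * is also a ToyPCIDevice * */
    int id;
} ToyHelloDev;

#define TOY_MAX_TYPES 8
static const ToyTypeInfo *toy_registry[TOY_MAX_TYPES];
static int toy_registry_len;

static void toy_type_register(const ToyTypeInfo *info)
{
    if (toy_registry_len < TOY_MAX_TYPES)
        toy_registry[toy_registry_len++] = info;
}

static const ToyTypeInfo toy_hello_info = {
    .name          = "toy-hello",
    .parent        = "toy-pci-device",
    .instance_size = sizeof(ToyHelloDev),
};

/* GCC/Clang extension: this function runs before main(). */
__attribute__((constructor))
static void toy_hello_register_types(void)
{
    toy_type_register(&toy_hello_info);
}

/* Allocate an instance by type name; the core only knows the size, not the layout. */
static ToyPCIDevice *toy_device_new(const char *name)
{
    for (int i = 0; i < toy_registry_len; i++) {
        if (strcmp(toy_registry[i]->name, name) == 0)
            return calloc(1, toy_registry[i]->instance_size);
    }
    return NULL;
}
```

The core hands back a pointer to the parent type, and the device code casts it to its own struct, which is roughly what the PCI_HELLO_DEV() cast does in the real device.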

The exciting part here is the class_init function. It is called when the class of the device is initialized. Its goal is to let you override the virtual methods of PCIDeviceClass with those of your device and to set the PCI configuration bytes that describe your device.

The PCI configuration space is used by the BIOS, the bootloader or the kernel, depending on your setup, to ease the probe phase of the device. It tells the world how your device works and where to write in order to communicate with it.



static void pci_hellodev_class_init(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
    k->init = pci_hellodev_init;
    k->exit = pci_hellodev_uninit;
    /* this identify our device */
    k->vendor_id = 0x1337;
    k->device_id = 0x0001;
    k->class_id = PCI_CLASS_OTHERS;
    set_bit(DEVICE_CATEGORY_MISC, dc->categories);

    k->revision  = 0x00;
    dc->desc = "PCI Hello World";
    /* qemu user things */
    dc->props = hello_properties;
    dc->reset = qdev_pci_hellodev_reset;
}
The vendor_id and device_id fields identify the PCI device. The Linux PCI code uses them to decide which driver is chosen to operate your device. PCIDeviceClass inherits from DeviceClass, the base class that represents a device and the functions it provides to the Qemu core.

Init, exit and reset are the basic functions you will need to write for your device. Init is called when your device is plugged in and exit when it's unplugged.

Exit and reset aren't really important functions for us; exit is basically used to free the memory of your device. Init is the function you want to look at.
static int pci_hellodev_init(PCIDevice *pci_dev)
{
    /* init the internal state of the device */
    PCIHelloDevState *d = PCI_HELLO_DEV(pci_dev);
    /* print the address of the device state, not of the local pointer */
    printf("d=%p\n", (void *) d);
    d->dma_size = 0x1ffff * sizeof(char);
    d->dma_buf = malloc(d->dma_size);
    d->id = 0x1337;
    d->threw_irq = 0;
    uint8_t *pci_conf;

    /* create the memory region representing the MMIO and PIO
     * of the device
     */

    memory_region_init_io(&d->mmio, OBJECT(d), &hello_mmio_ops, d, "hello_mmio", HELLO_MMIO_SIZE);
    memory_region_init_io(&d->io, OBJECT(d), &hello_io_ops, d, "hello_io", HELLO_IO_SIZE);

    /*
     * See linux device driver (Edition 3) for the definition of a bar
     * in the PCI bus.
     */
    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_IO, &d->io);
    pci_register_bar(pci_dev, 1, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->mmio);

    pci_conf = pci_dev->config;

    /* also in ldd, a pci device has 4 pins for interrupts;
     * here we use pin B.
     */
    pci_conf[PCI_INTERRUPT_PIN] = 0x02;

    /* this device support interrupt */
    // d->irq = pci_allocate_irq(pci_dev);

    printf("Hello World loaded\n");
    return 0;
}

You can see a cast to PCIHelloDevState at the beginning. This is our own extension of the PCIDevice struct, where we store the logic of our device. The first lines just initialize the internal state of the dummy PCI device we are writing.

pci_conf[] represents the configuration space of your PCI device. We just need to declare that our device supports interrupts on some pin (here pin B). The Linux kernel PCI bus driver will read this information for us.

You can check it with the command lspci -v. If you want more information, you can also look at the /sys filesystem or read the PCI chapter in [0].
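As an illustration of what those configuration bytes look like, here is a small standalone sketch that fills a 256-byte array the way our device does. The offsets are the standard PCI header offsets; the CFG_* names and helper functions are made up for this example (the kernel and Qemu headers use PCI_VENDOR_ID, PCI_INTERRUPT_PIN and friends).

```c
#include <stdint.h>
#include <string.h>

/* Standard offsets in the PCI configuration header (names are hypothetical). */
enum {
    CFG_VENDOR_ID     = 0x00, /* 16 bits, little endian */
    CFG_DEVICE_ID     = 0x02, /* 16 bits, little endian */
    CFG_REVISION_ID   = 0x08, /*  8 bits */
    CFG_INTERRUPT_PIN = 0x3d, /* 0 = none, 1 = INTA# ... 4 = INTD# */
    CFG_SPACE_SIZE    = 256,
};

static void cfg_set_word(uint8_t *conf, unsigned off, uint16_t v)
{
    conf[off]     = (uint8_t) (v & 0xff);
    conf[off + 1] = (uint8_t) (v >> 8);
}

static uint16_t cfg_get_word(const uint8_t *conf, unsigned off)
{
    return (uint16_t) (conf[off] | (conf[off + 1] << 8));
}

/* Same values as in pci_hellodev_class_init and pci_hellodev_init. */
static void hello_fill_config(uint8_t *conf)
{
    memset(conf, 0, CFG_SPACE_SIZE);
    cfg_set_word(conf, CFG_VENDOR_ID, 0x1337);
    cfg_set_word(conf, CFG_DEVICE_ID, 0x0001);
    conf[CFG_REVISION_ID]   = 0x00;
    conf[CFG_INTERRUPT_PIN] = 0x02;
}

/* Translate the interrupt pin byte into the letter lspci -v shows. */
static char cfg_interrupt_pin_name(const uint8_t *conf)
{
    uint8_t pin = conf[CFG_INTERRUPT_PIN];
    return (pin >= 1 && pin <= 4) ? (char) ('A' + pin - 1) : '-';
}
```

In the real device, Qemu fills the ID fields from the PCIDeviceClass for you; only the interrupt pin is written by hand.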

Qemu represents MMIO and PIO regions as MemoryRegion structs. When the driver accesses a memory region, the callbacks we registered with memory_region_init_io are called.

static const MemoryRegionOps hello_mmio_ops = {
    .read = hello_mmioread,
    .write = hello_mmiowrite,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
};

If the CPU tries to read the MemoryRegion, the read callback is called. One more step is necessary for our MemoryRegion to be accessible from outside the device: we have to register it in a PCI BAR.

A PCI device implements up to six I/O address regions. Each region consists of either memory or I/O locations. A BAR (base address register) is the PCI word for these regions. We call pci_register_bar for this purpose.

You can request an irq with pci_allocate_irq and get an irq object, or you can use the PCI helper functions for irqs directly.

static void hello_iowrite(void *opaque, hwaddr addr, uint64_t value, unsigned size)
{
    int i;
    PCIHelloDevState *d = (PCIHelloDevState *) opaque;
    PCIDevice *pci_dev = (PCIDevice *) opaque;
    /* PRIu64 keeps the format correct wherever uint64_t is not unsigned long */
    printf("Write Ordered, addr=%x, value=%" PRIu64 ", size=%u\n", (unsigned) addr, value, size);
    switch (addr) {
        case 0:
            if (value) {
                /* throw an interrupt */
                printf("irq assert\n");
                d->threw_irq = 1;
                pci_irq_assert(pci_dev);
            } else {
                /* ack interrupt */
                printf("irq deassert\n");
                pci_irq_deassert(pci_dev);
                d->threw_irq = 0;
            }
            break;
        case 4:
            /* throw a random DMA */
            for ( i = 0; i < d->dma_size; ++i)
                d->dma_buf[i] = rand();
            cpu_physical_memory_write(0xa0000, (void *) d->dma_buf, d->dma_size); 
            break;
        default:
            printf("Io not used\n");
    }
}

The iowrite and mmiowrite functions have the same prototype. opaque is our PCIHelloDevState structure, addr is the offset the driver accessed inside the region, value is what was written and size is the granularity of the write: 1, 2 or 4 bytes. ioread and mmioread follow the same principles.
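The callback mechanism can be sketched outside of Qemu. The toy dispatcher below is only an approximation of what the memory core does: it checks that an access falls inside the region and respects the declared access sizes, then calls the callback with an offset relative to the start of the region. All Toy* and toy_* names are hypothetical, and placing the id register at offset 4 is an assumption based on what the driver reads.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_hwaddr;

typedef struct ToyRegionOps {
    uint64_t (*read)(void *opaque, toy_hwaddr addr, unsigned size);
    void (*write)(void *opaque, toy_hwaddr addr, uint64_t value, unsigned size);
    unsigned min_access_size;
    unsigned max_access_size;
} ToyRegionOps;

typedef struct ToyRegion {
    const ToyRegionOps *ops;
    void *opaque;      /* handed back to every callback */
    toy_hwaddr base;   /* where the guest mapped the region */
    toy_hwaddr size;
} ToyRegion;

typedef struct ToyHelloState {
    uint32_t id;
} ToyHelloState;

/* Callbacks only ever see offsets inside the region, never guest addresses. */
static uint64_t toy_mmio_read(void *opaque, toy_hwaddr addr, unsigned size)
{
    ToyHelloState *d = opaque;
    (void) size;
    return addr == 4 ? d->id : 0;
}

static void toy_mmio_write(void *opaque, toy_hwaddr addr, uint64_t value, unsigned size)
{
    ToyHelloState *d = opaque;
    (void) size;
    if (addr == 4)
        d->id = (uint32_t) value;
}

static const ToyRegionOps toy_mmio_ops = {
    .read = toy_mmio_read,
    .write = toy_mmio_write,
    .min_access_size = 4,
    .max_access_size = 4,
};

static bool toy_access_ok(const ToyRegion *r, toy_hwaddr guest_addr, unsigned size)
{
    return guest_addr >= r->base
        && guest_addr - r->base + size <= r->size
        && size >= r->ops->min_access_size
        && size <= r->ops->max_access_size;
}

static bool toy_guest_write(ToyRegion *r, toy_hwaddr guest_addr, uint64_t value, unsigned size)
{
    if (!toy_access_ok(r, guest_addr, size))
        return false;
    r->ops->write(r->opaque, guest_addr - r->base, value, size);
    return true;
}

static bool toy_guest_read(ToyRegion *r, toy_hwaddr guest_addr, unsigned size, uint64_t *out)
{
    if (!toy_access_ok(r, guest_addr, size))
        return false;
    *out = r->ops->read(r->opaque, guest_addr - r->base, size);
    return true;
}
```

This is why the switch in hello_iowrite tests addr against 0 and 4 rather than against a full port number.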

You can use pci_irq_assert and pci_irq_deassert to raise and lower an interrupt. It's very easy.

You can use cpu_physical_memory_write to access guest physical memory directly. This simulates a DMA but bypasses the IOMMU emulation of Qemu. Its use is now deprecated, but if you want to simulate a system without an IOMMU, it is the fastest way.

I will cover the IOMMU DMA in an update of this article.

Linux PCI driver


The code of the driver can be found at https://github.com/grandemk/qemu_devices/blob/master/driver_pci.c

Most of the time, to understand a driver, you need to begin from the end of the file. So let's begin with the last lines.

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("hello world");
MODULE_AUTHOR("Kevin Grandemange");

These macros describe the module. The license is not a trivial thing because some kernel APIs are only available to GPL-licensed drivers, since they are exported with EXPORT_SYMBOL_GPL. Moreover, non-GPL drivers are not likely to be accepted in the mainline Linux tree.


module_pci_driver(hello_tic);
MODULE_DEVICE_TABLE(pci, hello_tic_ids);

There are two types of drivers in Linux: bus drivers and device drivers. What we commonly call drivers are the device drivers. A bus driver provides an API that device drivers use to communicate with the hardware over a specific bus, and PCI is one such bus. Here, module_pci_driver registers our driver with the bus driver that handles all the low-level PCI stuff. MODULE_DEVICE_TABLE exports the table of devices we are able to drive, so that userspace tools can load the module automatically when a matching device shows up. When such a device is found, our driver is called to handle the communication.

/* vendor and device (+ subdevice and subvendor) 
 * identifies a device we support
 */
static struct pci_device_id hello_tic_ids[] = {
    { PCI_DEVICE(PCI_TIC_VENDOR, PCI_TIC_DEVICE) },
    { 0, }, /* sentinel */
};
/* id_table describe the device this driver support
 * probe is called when a device we support exist and
 * when we are chosen to drive it.
 * remove is called when the driver is unloaded or
 * when the device disappears
 */
static struct pci_driver hello_tic = {
    .name = "hello_tic",
    .id_table = hello_tic_ids,
    .probe = hello_tic_probe,
    .remove = hello_tic_remove,
};

A driver is a service-based program. The pci_driver struct we hand to the PCI bus driver is the set of functions we support. Probe is called when the PCI bus driver finds your device (PCI is really easy to handle: you don't need to search for your device, the BIOS or its equivalent has already put everything in its right place).

This function is used to read the PCI configuration space of the device and to request all the resources your driver will need (interrupts, virtual addresses for the MMIO zone, etc.).

Remove is the opposite of probe and is called when the device disappears. The id table is a list of structs describing the devices we support. Devices are identified by their vendor id and device id.

You can use the PCI_DEVICE macro to ease the structure construction. You can also specify the subvendor and subdevice field for more precision but we don't need that here.
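Here is a rough standalone sketch of how such a table can be walked to decide whether a driver supports a device. It is a simplification, not the kernel's matching code: the toy_* and TOY_* names are hypothetical, TOY_ANY_ID plays the role of PCI_ANY_ID, and the class fields are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ANY_ID 0xffffffffu /* wildcard, plays the role of PCI_ANY_ID */

struct toy_device_id {
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
};

/* Like PCI_DEVICE(): fix vendor and device, accept any subsystem ids. */
#define TOY_DEVICE(vend, dev) { (vend), (dev), TOY_ANY_ID, TOY_ANY_ID }

static const struct toy_device_id toy_hello_ids[] = {
    TOY_DEVICE(0x1337, 0x0001),
    { 0, 0, 0, 0 }, /* sentinel */
};

static int toy_field_matches(uint32_t wanted, uint32_t actual)
{
    return wanted == TOY_ANY_ID || wanted == actual;
}

/* Return the first table entry matching the device, or NULL if none does. */
static const struct toy_device_id *
toy_match_id(const struct toy_device_id *ids, const struct toy_device_id *dev)
{
    for (; ids->vendor != 0; ids++) {
        if (toy_field_matches(ids->vendor, dev->vendor) &&
            toy_field_matches(ids->device, dev->device) &&
            toy_field_matches(ids->subvendor, dev->subvendor) &&
            toy_field_matches(ids->subdevice, dev->subdevice))
            return ids;
    }
    return NULL;
}
```

The sentinel entry is what lets the loop stop without knowing the length of the table.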

static int hello_tic_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
    struct tic_info *info;
    info = kzalloc(sizeof(struct tic_info), GFP_KERNEL);
    if (!info)
        return -ENOMEM;

    if (pci_enable_device(dev))
        goto out_free;

    pr_alert("enabled device\n");

    if (pci_request_regions(dev, "hello_tic"))
        goto out_disable;

    pr_alert("requested regions\n");

    /* BAR 0 has IO */
    info->port[0].name = "tic-io";
    info->port[0].start = pci_resource_start(dev, 0);
    info->port[0].size = pci_resource_len(dev, 0);

    /* BAR 1 has MMIO */
    info->mem[0].name = "tic-mmio";
    info->mem[0].start = pci_ioremap_bar(dev, 1);
    info->mem[0].size = pci_resource_len(dev, 1);

    if (!info->mem[0].start)
        goto out_unrequest;   

    pr_alert("remaped addr for kernel uses\n");
    /* get device irq number */
    if (pci_read_config_byte(dev, PCI_INTERRUPT_LINE, &info->irq))
        goto out_iounmap;
    /* request irq */
    if (devm_request_irq(&dev->dev, info->irq, hello_tic_handler, IRQF_SHARED, hello_tic.name, (void *) info))
        goto out_iounmap;

    /* get a mmio reg value and change it */
    pr_alert("device id=%x\n", ioread32(info->mem[0].start + 4));
    iowrite32(0x4567, info->mem[0].start + 4);
    pr_alert("modified device id=%x\n", ioread32(info->mem[0].start + 4));

    /* assert an irq */
    outb(1, info->port[0].start);
    /* try dma without iommu */
    outl(1, info->port[0].start + 4);

    pci_set_drvdata(dev, info);

    return 0;

out_iounmap:
    pr_alert("tic:probe_out:iounmap\n");
    iounmap(info->mem[0].start);
out_unrequest:
    pr_alert("tic:probe_out_unrequest\n");
    pci_release_regions(dev);
out_disable:
    pr_alert("tic:probe_out_disable\n");
    pci_disable_device(dev);
out_free:
    pr_alert("tic:probe_out_free\n");
    kfree(info);
    return -ENODEV;
}


The first thing to do in a PCI driver is to call the two functions pci_enable_device and pci_request_regions.

The first one asks the low-level code to enable I/O and memory, and the second one reserves all these I/O and memory regions for our driver. Both functions need to succeed before you can continue.

We then get the PIO base with pci_resource_start on BAR 0 (remember how we declared those in the Qemu device).

The MMIO part is trickier because the CPU only uses virtual addresses, so we need to remap the physical I/O address. This is done with the ioremap function. Here the pci_ioremap_bar helper does everything you need in one call.

In order to know which interrupt our device has been assigned, we use pci_read_config_byte to read the interrupt line from the configuration space (the PCI core also stores it in dev->irq). After that we request the interrupt, which registers our handler for that interrupt line.

After this call, if the interrupt fires, all the interrupt handlers linked to it will be called.

One important thing to notice is that the interrupt you are using can be shared. This means you need to verify by some means that your device really triggered the interrupt. If it didn't, you tell the kernel that you're not interested in this interrupt.
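The sharing logic can be simulated in plain C. In this sketch (all toy_* names are hypothetical and the dispatch loop is a simplification of what the kernel does), every handler registered on the line is called, and each one checks its own device before claiming the interrupt.

```c
#include <stddef.h>

typedef enum { TOY_IRQ_NONE = 0, TOY_IRQ_HANDLED = 1 } toy_irqreturn_t;
typedef toy_irqreturn_t (*toy_handler_t)(int irq, void *dev_id);

/* One entry per driver that requested the shared line. */
struct toy_irqaction {
    toy_handler_t handler;
    void *dev_id;
};

struct toy_device {
    int irq_pending;   /* stands for a status register on the device */
    int handled_count;
};

static toy_irqreturn_t toy_device_handler(int irq, void *dev_id)
{
    struct toy_device *dev = dev_id;

    (void) irq;
    if (!dev->irq_pending)
        return TOY_IRQ_NONE;  /* not ours: someone else shares the line */
    dev->irq_pending = 0;     /* acknowledge the interrupt on the device */
    dev->handled_count++;
    return TOY_IRQ_HANDLED;
}

/* Call every handler on the line and count how many claimed the interrupt. */
static int toy_dispatch_shared_irq(int irq, const struct toy_irqaction *actions, size_t n)
{
    int claimed = 0;

    for (size_t i = 0; i < n; i++) {
        if (actions[i].handler(irq, actions[i].dev_id) == TOY_IRQ_HANDLED)
            claimed++;
    }
    return claimed;
}
```

The dev_id pointer is what lets one handler function serve several devices, which is the role of the last argument of devm_request_irq.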

If you need some locking mechanism, better initialize it before calling this function, because after that your interrupt handler can be called whenever the device feels like it.

The devm part of devm_request_irq means that this function is device managed: the irq is automagically freed when your driver is detached from the device.

You can play with the mmio remapped memory with the function iowrite{8,16,32} and ioread{8,16,32} and talk with the ports of the device with the out{b,w,l} and in{b,w,l} functions.

These sizes are the same as the access sizes seen by the MMIO operations of the Qemu device. This means you can do some weird stuff in Qemu where reading an 8-bit value does something radically different from reading the same address with 32-bit granularity.

pci_set_drvdata allows us to store some relevant data inside the pci_dev struct for later use.

static void hello_tic_remove(struct pci_dev *dev)
{
    /* we get back the driver data we stored in
     * the pci_dev struct */
    struct tic_info *info = pci_get_drvdata(dev);
    /* let's clean a little, in the reverse order of probe */
    iounmap(info->mem[0].start);
    pci_release_regions(dev);
    pci_disable_device(dev);
    kfree(info);
}

The remove function mirrors the initialization done in the probe function: resources are released in the reverse order of their acquisition.
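The goto labels at the end of probe follow the same reverse-order idea. Below is a small standalone sketch of that unwinding pattern; the toy_* names and the trace log are hypothetical and only record which step was acquired or released, in which order.

```c
/* Steps in the order probe acquires them (values start at 1). */
enum toy_step { TOY_ENABLE = 1, TOY_REGIONS, TOY_IOREMAP, TOY_IRQ };

struct toy_trace {
    int log[8]; /* +step on acquire, -step on release */
    int len;
};

/* Pretend to acquire a resource; fails without side effect when step == fail_at. */
static int toy_acquire(struct toy_trace *t, int step, int fail_at)
{
    if (step == fail_at)
        return -1;
    t->log[t->len++] = step;
    return 0;
}

static void toy_release(struct toy_trace *t, int step)
{
    t->log[t->len++] = -step;
}

/* Each label undoes exactly what succeeded before the failing step. */
static int toy_probe(struct toy_trace *t, int fail_at)
{
    if (toy_acquire(t, TOY_ENABLE, fail_at))
        goto out;
    if (toy_acquire(t, TOY_REGIONS, fail_at))
        goto out_disable;
    if (toy_acquire(t, TOY_IOREMAP, fail_at))
        goto out_release;
    if (toy_acquire(t, TOY_IRQ, fail_at))
        goto out_iounmap;
    return 0;

out_iounmap:
    toy_release(t, TOY_IOREMAP);
out_release:
    toy_release(t, TOY_REGIONS);
out_disable:
    toy_release(t, TOY_ENABLE);
out:
    return -1;
}

/* Remove releases everything, last acquired first. */
static void toy_remove(struct toy_trace *t)
{
    toy_release(t, TOY_IRQ);
    toy_release(t, TOY_IOREMAP);
    toy_release(t, TOY_REGIONS);
    toy_release(t, TOY_ENABLE);
}
```

Passing 0 as fail_at makes every step succeed, since the steps are numbered from 1.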

static irqreturn_t hello_tic_handler(int irq, void *dev_info)
{
    struct tic_info *info = (struct tic_info *) dev_info;
    pr_alert("IRQ %d handled\n", irq);
    /* is it our device throwing an interrupt ? */
    if (inl(info->port[0].start)) {
        /* deassert it */
        outl(0, info->port[0].start );
        return IRQ_HANDLED;
    } else {
        /* not mine, we have done nothing */
        return IRQ_NONE;
    }
}

This function is called when the interrupt we registered for fires. It is called in interrupt context, which means we may not sleep inside this function. If you need to sleep, you can defer the work to a workqueue or use a threaded interrupt handler, both of which are easy to use.

You can see that dev_info contains the structure we passed to devm_request_irq. Notice the IRQ_NONE return value, which tells the kernel that this interrupt was not for us.

Have fun ;)

 

Hope you found it useful. I plan to create devices and drivers for other buses as well (platform driver, ISA, USB, etc.).

 Sources


[0] Linux Device Drivers (3rd Edition)
[1] http://ilevex.eu/post/88944209761/how-to-create-a-custom-pci-device-in-qemu
[3] Ctags of qemu and git log ;)
[4] http://free-electrons.com/doc/training/linux-kernel/linux-kernel-slides.pdf

10 comments:

  1. Excellent article. Thanks you very much. It helped me a lot in understanding qemu and virtio framework.

  2. Thanks, I'm also working on a sysbus device + driver article as things are kinda different in the arm world.

  3. Could you attach complete sources code please?

  4. You can find the source code at
    https://github.com/grandemk/qemu_devices

  5. Thanks, how I can generate pci interrupt in the host side for guest?

    Replies
    1. the function pci_irq_assert(pci_dev) will throw an interrupt, so the qemu side of this is easy. Now you need to have a way from the host to use this code. My first try would be to look at the monitor code of qemu to see if you can cram your fonctionality there. Else you can have a thread in your device that read command from somewhere / can communicate with another program.

  6. Thanks for wonderful article,

    Need to know how to emulate instruction set, RAM [2GB], interrupts.
    Architecture is like that PCI device is coprocessor connected to host machine(x86) with its own instruction set. So need to simulate instruction execution when host machine copy cross compiler binary on its RAM

    Replies
    1. what you would like to do is a qemu in a qemu basically.
      Using shared memory and named pipes
      One qemu for x86
      One qemu for coprocessor
      writing to your qemu device in x86 would trigger a write to a shared memory. To send interrupt back, you would have to apply basic shared memory sync algorithm with your own protocol.

      You can also try to run your own emulation of the coprocessor inside qemu, but qemu is not good for that (it's not a library and all of what you would need is tangle with other stuff). You can try to look at http://www.unicorn-engine.org/ for a library to create the emulation of your coprocessor.

  7. Hi, Thank you for this excellent article. Could you also please give the steps for compiling and running this to be a device on qemu? I tried compiling hello_tic.c along with the qemu source. I then tried to run the device by launching qemu with the -device pci-hellodev option. But I get the error: 'pci-hellodev' is not a valid device model name. Please help.
