Tic le Polard: 2015

Saturday, 6 June 2015

[android] Building the android kernel for the x86_64 emulator

See previous post for more details: Here

my avd is Nexus_5_API_22

export my_avd=Nexus_5_API_22

What is the version of linux running on the x86_64 emulator ?

adb shell dmesg| grep "Linux version"

<5>[    0.000000] Linux version 3.10.0+ (tic@debian) (gcc version 4.9.2 (Debian 4.9.2-10) ) #2 PREEMPT Fri Jun 5 15:07:50 CEST 2015

Oh it's a more recent one than the i386 version!

git checkout remotes/origin/android-goldfish-3.10 -b emu_x86_64
ls arch/x86/configs
i386_defconfig      i386_ranchu_defconfig  x86_64_emu_defconfig
i386_emu_defconfig  x86_64_defconfig       x86_64_ranchu_defconfig

There are more configs than before. Great, it means it is actively developed :)

Ranchu is the successor of goldfish, based on a more recent version of qemu. The Linaro team is working on it, and it should replace goldfish in due time. (It even supports the android arm64 platform :) )

But for now the emulator which ships with android is goldfish so let's stay with that.

export ARCH=x86_64
make x86_64_emu_defconfig
make -j`nproc`

Give it a try:

emulator -kernel arch/x86/bzImage -avd $my_avd -qemu -enable-kvm

I couldn't get it to work with this exact command line, so I used:

emulator64-x86 -kernel arch/x86/boot/bzImage -avd $my_avd -gpu off -qemu --enable-kvm

That's all! Now develop your own modules ;)

[android] Building the android kernel for the i386 emulator

To build for x86-64, check this article:
http://tic-le-polard.blogspot.fr/2015/06/android-building-android-kernel-for_6.html

Get the goldfish kernel.

git clone https://android.googlesource.com/kernel/goldfish

The master branch is empty. You need to find a commit where arch/x86/configs/goldfish_defconfig exists, because it contains the CONFIG options you need to build the kernel correctly for the emulation platform.
Let's list all the branches of the kernel repository:

git branch -a 
 
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/android-3.10
  remotes/origin/android-3.4
  remotes/origin/android-goldfish-2.6.29
  remotes/origin/android-goldfish-3.10
  remotes/origin/android-goldfish-3.4
  remotes/origin/linux-goldfish-3.0-wip
  remotes/origin/master

Now check what version of the kernel the emulator is running:
Follow the android guide to get android studio : http://developer.android.com/tools/studio/index.html

Create a hello world app and run it in the emulator and create an x86 avd.
(You can also directly use the sdk tools to create them, but here everything is pretty much automated and the configuration is done for you.)

Use adb to connect to the emulator.
Add to your path the platform-tools and tools directory from the sdk directory.

adb devices # list all emulator instances
adb shell dmesg| grep "Linux version"

<5>[    0.000000] Linux version 3.4.67+ (digit@tyrion.par.corp.google.com) (gcc version 4.8 (GCC) ) #3 PREEMPT Tue Sep 16 19:46:22 CEST 2014

Let's get the 3.4 version of the goldfish kernel. It is proven to work :)
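The branch choice follows mechanically from the dmesg banner. Here is a minimal Python sketch of that mapping, assuming the banner always contains "Linux version X.Y"; the helper name goldfish_branch is made up for illustration, and the result still has to be checked against the output of git branch -a.

```python
import re

def goldfish_branch(dmesg_line):
    """Map a 'Linux version X.Y...' banner to the matching goldfish branch name."""
    match = re.search(r"Linux version (\d+)\.(\d+)", dmesg_line)
    if match is None:
        raise ValueError("no kernel version found in: %r" % dmesg_line)
    major, minor = match.groups()
    return "android-goldfish-%s.%s" % (major, minor)

banner = "<5>[    0.000000] Linux version 3.4.67+ (digit@tyrion.par.corp.google.com)"
print(goldfish_branch(banner))
```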

git checkout remotes/origin/android-goldfish-3.4 -b emu_x86

ls arch/x86/configs
goldfish_defconfig  i386_defconfig  x86_64_defconfig

Yeah, there is a goldfish config

export ARCH=x86
make goldfish_defconfig
make -j`nproc`

That's it, you should now have a working kernel for the emulator. To test it:

emulator -list-avds
emulator -kernel arch/x86/boot/bzImage -avd $your_avd -qemu -enable-kvm

You are now using your own kernel, you can compile your own modules for it, etc.

Saturday, 18 April 2015

[qemu] Spice your qemu up

Spice is a free and open-source alternative to the VNC protocol. It is actually faster and safer. It lets you attach a remote viewer to the display of your virtual machine.

In the host:

apt-get install virt-viewer

This will install remote-viewer. Don't use spicec, as it is deprecated.

In the guest VM:

apt-get install spice-vdagent

Now let's boot our VM using spice for the display.

qemu-system-x86_64 -drive file=debian_bootstrap2.qcow2,cache=none,if=virtio \
--enable-kvm \
-net nic,model=virtio,macaddr=52:54:00:12:34:58 \
-net vde,sock=/tmp/switch1.ctl -cpu host \
-smp 4 -m 2G \
-vga qxl -spice port=5900,addr=127.0.0.1,disable-ticketing \
-device virtio-serial-pci \
-device virtserialport,chardev=spicechannel0,name=com.redhat.spice.0 \
-chardev spicevmc,id=spicechannel0,name=vdagent

There is a lot to process here. All the options before -vga are my usual options: they enable networking via vde (see my post on this here) and virtio for everything.

The -smp 4 option tells kvm to use 4 cores of my computer.

I use -cpu host because I don't want any CPU feature to disappear in the emulation. It might also be faster.

-vga qxl is the recommended vga for use with the spice protocol.

The -spice option configures Qemu's integrated spice server. Here we specify the port and the address the server listens on. As this is a local setup, we don't need an authenticated connection, so I disabled ticketing (spice's password authentication).

If your VM is on a server, you might want to reconsider this choice, as anyone can hijack the connection if you are not careful. Hackers are hardcore people, they are waiting for you on the internet, wearing ski masks and typing in the most awkward positions. So encrypt your communications if you don't want to be a victim of this kind of people.

I will describe the use of spice with certificates in another article as you need to set up a CA to get it working.

With these options, you can already use the spice protocol to get the display of your virtual machine. In fact, you don't need to install spice-vdagent for that.

But you can get more: copy and paste between the host and the vm and automatic resizing of X.

The -device virtio-serial-pci option creates the bus into which all virtserialport devices are plugged, so you need it before you can declare a virtserialport device.

The -device virtserialport,chardev=spicechannel0,name=com.redhat.spice.0 option is the actual port plugged into the virtio serial bus. Its name enables the guest agent to find and connect to it.

The -chardev spicevmc,id=spicechannel0,name=vdagent option is the character device to which the spice client connects in order to communicate with the guest agent over the virtio serial bus.
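The three agent-related options only work together if the virtserialport and the chardev share the same id. A small Python sketch that builds the spice part of the command line makes that dependency explicit. The function name spice_args and its parameters are hypothetical; the option strings are the ones used in this post.

```python
def spice_args(port=5900, addr="127.0.0.1", chardev_id="spicechannel0"):
    """Build the QEMU options for a local, unauthenticated Spice display with vdagent."""
    return [
        "-vga", "qxl",
        "-spice", "port=%d,addr=%s,disable-ticketing" % (port, addr),
        "-device", "virtio-serial-pci",
        # The port and the chardev must reference the same id.
        "-device", "virtserialport,chardev=%s,name=com.redhat.spice.0" % chardev_id,
        "-chardev", "spicevmc,id=%s,name=vdagent" % chardev_id,
    ]

print(" ".join(spice_args()))
```

Passing the list to subprocess (after the usual drive and network options) also avoids shell line-continuation pitfalls.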

Now you can connect to your VM using:

remote-viewer spice://127.0.0.1:5900

Have fun :)

Links

http://www.linux-kvm.org/page/SPICE
https://wiki.archlinux.org/index.php/QEMU#Spice_support

Tuesday, 14 April 2015

[System] Patching an initramfs

Extract the cpio filesystem in a directory, create or modify files, recreate a new cpio archive.

# Use an absolute path so the archive is still found after pushd
CPIO_FILE="$PWD/rootfs.cpio"
CPIO_DIR="/tmp/yourCpioDir"
# Extract the cpio archive into CPIO_DIR
mkdir -p "$CPIO_DIR"
pushd "$CPIO_DIR"
fakeroot cpio -i < "$CPIO_FILE"
# Apply your changes
# Recreate the cpio archive
find . | cpio --create --format='newc' > "$CPIO_FILE"
popd
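To check what ended up in the archive without extracting it again, the newc format is simple enough to walk by hand. The following Python sketch assumes an uncompressed newc archive (110-byte ASCII header with the magic 070701 followed by 13 hexadecimal fields of 8 digits, names and data padded to 4 bytes). The function name list_newc_entries is made up for this example.

```python
def list_newc_entries(data):
    """Return the file names stored in an uncompressed 'newc' cpio archive."""
    def align4(n):
        return (n + 3) & ~3

    names = []
    offset = 0
    while offset + 110 <= len(data):
        header = data[offset:offset + 110]
        if header[:6] != b"070701":
            raise ValueError("not a newc header at offset %d" % offset)
        # 13 fields of 8 hex digits follow the 6-byte magic.
        fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        filesize, namesize = fields[6], fields[11]
        name_start = offset + 110
        # namesize counts the trailing NUL byte.
        name = data[name_start:name_start + namesize - 1].decode("utf-8", "replace")
        if name == "TRAILER!!!":
            break
        names.append(name)
        data_start = align4(name_start + namesize)
        offset = align4(data_start + filesize)
    return names

# Usage (not run here): list_newc_entries(open("rootfs.cpio", "rb").read())
```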

Monday, 13 April 2015

[System] Use buildroot to create an embedded distribution

Get the buildroot git repository:

git clone git://git.buildroot.net/buildroot

Blindly follow the sacred words of the README.

make list-defconfigs

Let's start with a defconfig of the vexpress arm machine

make qemu_arm_vexpress_defconfig

The versatile defconfig doesn't support device tree, and it's a pain in the ass to make it work. Moreover, the vexpress board is better supported in qemu.

Launch the build:

make

Build an awesome card castle.
Come back when it's finished.

Read the readme.txt at board/qemu/arm_vexpress
Do what you are told to do.

qemu-system-arm -M vexpress-a9 -kernel output/images/zImage -drive file=output/images/rootfs.ext2,if=sd -append "console=ttyAMA0,115200 root=/dev/mmcblk0" -serial stdio -net nic,model=lan9118 -net user

Silently thank the creators of buildroot, because this command line is difficult to find on the first try. Tweak it to make it work for you. You can add the -nographic option if you don't want the framebuffer to show.

That's it, you've created a complete and minimal embedded system. Now you will want to customize it: change the kernel options to use the device tree, for example, or add the openssh server package to your rootfs. Explore your options using the make menuconfig command.

[Trivia] Visualize your git history with gource

Find the source here:
https://github.com/acaudwell/Gource

The compilation is pretty straightforward:

./autogen.sh
./configure
# fix the dependencies which are missing, probably sdl2.
make -j `nproc`
make install

That's it.

Go into your git repository (or svn, bazaar, whatever)

gource

You can now see your project grow up, get his first girlfriend, move out of your computer, buy a new repository at github city, etc.


A snapshot of the buildroot project

Sunday, 12 April 2015

[qemu] Create a complete system image without booting in the emulator

Create an empty disk image:

qemu-img create -f qcow2 debian_bootstrap.qcow2 10G

Use the nbd kernel module (Network Block Device) to create and access partitions on the qcow2 disk image.

modinfo nbd

filename:       /lib/modules/3.18.4-1-ARCH/kernel/drivers/block/nbd.ko.gz
license:        GPL
description:    Network Block Device
depends:      
intree:         Y
vermagic:       3.18.4-1-ARCH SMP preempt mod_unload modversions
parm:           nbds_max:number of network block devices to initialize (default: 16) (int)
parm:           max_part:number of partitions per device (default: 0) (int)
parm:           debugflags:flags for controlling debug output (int)

sudo modprobe nbd max_part=16

Associate the qcow2 disk image to the /dev/nbd0 disk node.

sudo qemu-nbd -c /dev/nbd0 debian_bootstrap.qcow2

Create your partitions:

sudo parted /dev/nbd0

Or to get a swap + rootfs partition scheme, you can use:

sfdisk /dev/nbd0 -D -uM << EOF
,512,82
;
EOF

If you used parted with gpt partitions, you need to set up the first partition as a 2MB partition with the bios_grub flag set.

If no new device node such as nbd0p1 shows up, you can use the following command to force them to appear.

 partx -a /dev/nbd0

Format your partitions with the filesystems of your choice. (You can also do this in parted if you wish)

mkswap /dev/nbd0p1
mkfs.ext4 /dev/nbd0p2

Mount your root file system. You are almost ready to bootstrap a new debian in your qcow2 disk image.

mount /dev/nbd0p2 /mnt/

That's it, debootstrap this little rootfs! (Choose the mirror closest to you to speed up the download)

debootstrap --include=less,locales-all,vim,sudo,openssh-server stable /mnt http://ftp.us.debian.org/debian

chroot in your new debian rootfs:

mount --bind /dev/ /mnt/dev
LANG=C chroot /mnt/ /bin/bash
mount -t proc none /proc
mount -t sysfs none /sys

You might need to set the path correctly if your host distribution doesn't use the same binaries path

PATH=$PATH:/bin:/sbin:/usr/sbin:/usr/bin

Install the latest kernel and grub-pc (beware, you want to install grub2 and not grub legacy)

apt-get install linux-image-amd64 grub-pc

When installing grub, choose /dev/nbd0p2; you wouldn't want to install it on your own disk.

# To be sure grub is installed ^^
grub-install /dev/nbd0
update-grub

Set a passwd for root

passwd root

exit the chroot

umount /proc/ /sys/ /dev/
exit

# Some say you can fix the grub installation with those commands. Didn't work for me.
grub-install /dev/nbd0 --root-directory=/mnt --modules="biosdisk part_msdos" 
# Edit in place: redirecting sed's output onto its own input file would empty it
sed -i -e 's/nbd0p2/sda2/g' boot/grub/grub.cfg
# I think if you want to have a working grub, just write the configuration yourself, 
# I fix this by hand at the first boot.

If you have partitions other than the rootfs, you will need to edit /etc/fstab accordingly.
You are finished, clean up your mess.

umount /mnt
qemu-nbd -d /dev/nbd0

Start your VM. You might get the grub console instead of a beautiful grub menu, because the grub configuration is still broken! To launch linux you can do:

ls
#identify the rootfs partition
ls (hd0,gpt2)
ls (hd0,gpt2)/
#You should see your filesystem here, names can change
set root=(hd0,gpt2)
linux /boot/vmlinuz-... root=/dev/sda2 # very important, specify your rootfs partition here
initrd /boot/initrd.img-...
boot

Launch your newly created VM! Run update-grub inside it to regenerate the grub configuration; this will fix your grub for the next boots.

[qemu] Create a complete subnet of virtual machines

Let's create a virtual network with qemu/kvm and vde :)

First, create a virtual switch where you will plug your virtual machines.

vde_switch -s /tmp/switch1.ctl -m 666 --daemon --mgmt /tmp/switch1.mgmt

The -m option sets the permissions of the switch socket so that everyone can use the emulated switch. This is useful if you declare a switch with a link to a tap device.

Without the --daemon option, vde_switch doesn't go into the background and gives us a shell to configure its internals. This shell doesn't support history or readline conveniences, so a better option is to use --daemon together with --mgmt, which provides a socket where the virtual switch listens for configuration input.

 
vdeterm /tmp/switch1.mgmt

This command will give you a fully fledged shell. (You know you deserve it)

Finally the -s option is the most important as it is the interface you have to use to connect something to this virtual switch.

Congratulations, there is now an empty emulated switch running on your computer!

You can connect a vm to this switch with the following command:

qemu-system-x86_64 -drive file=your_image.qcow2,if=virtio --enable-kvm -net nic,model=virtio,macaddr=52:54:00:12:34:57 -net vde,sock=/tmp/switch1.ctl -cpu host -smp 4 -m 1G -vga qxl

The important thing is the -net vde option here.
Your vm should have registered to the vde_switch.

You can find out by using the command port/print in the vdeterm. You should get something like that:

The VMs you plug like this will be able to communicate with each other.
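Each VM on the switch needs its own MAC address (this blog uses 52:54:00:12:34:57 and 52:54:00:12:34:58). A minimal Python sketch that derives a distinct address per VM and builds the matching -net options could look like this. The helper names vm_mac and net_args are hypothetical, and only the last byte is varied, so it covers at most 256 VMs.

```python
def vm_mac(index, base="52:54:00:12:34:00"):
    """Return a distinct MAC for VM number `index` (0-255), varying only the last byte."""
    if not 0 <= index <= 0xFF:
        raise ValueError("index must fit in one byte")
    prefix = base.rsplit(":", 1)[0]
    return "%s:%02x" % (prefix, index)

def net_args(index, switch="/tmp/switch1.ctl"):
    """QEMU networking options that plug VM `index` into the vde switch."""
    return [
        "-net", "nic,model=virtio,macaddr=%s" % vm_mac(index),
        "-net", "vde,sock=%s" % switch,
    ]

print(net_args(0x57))
```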

To give the VMs internet access, you can set up a NAT. This can be done using slirp.

slirpvde -s /tmp/switch1.ctl --dhcp

That's it. It handles DNS, dhcp, you name it, it does it :)
However, with this setup, you won't be able to access the VMs from the host. If you try to add a tap interface to the switch, it will get a dhcp and dns setup along with a default route. This will conflict with the nat, and ip forwarding won't work because the routes are not correct.

You can prevent that by not using the dhcp server at all with the tap interface. You can also use the tap interface as the gateway for the VMs and enable ip forwarding and masquerade nat rules instead of using the slirp tool (I will describe this in another post).

Sources:

http://wiki.v2.cs.unibo.it/wiki/index.php/VDE_Basic_Networking
https://wiki.archlinux.org/index.php/QEMU#Networking_with_VDE2
http://wiki.qemu.org/Documentation/Networkin
https://xkcd.com/350/

Friday, 10 April 2015

[Python] colored ascii art with pyfiglet

Your scripts are boring? Keep them interesting with some big ascii art characters!

The following code is an example of what can be created with just a few lines of code.

The colorama module gives us the same behaviour as the posix auto coloring mode. If we are targeting a tty, the colors show up, but if stdout isn't directed to a tty, the color mode is switched off.
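The tty check itself can be illustrated with the standard library alone. This is only a sketch of the idea using raw ANSI escape codes, not colorama's implementation; the colorize helper is made up for the example.

```python
import io
import sys

BLUE = "\033[34m"
RESET = "\033[0m"

def colorize(text, stream=None):
    """Wrap text in ANSI blue only when `stream` is an interactive terminal."""
    stream = sys.stdout if stream is None else stream
    if stream.isatty():
        return BLUE + text + RESET
    return text

# A StringIO is never a tty, so the text comes back without escape codes.
print(repr(colorize("hello", io.StringIO())))
```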

import sys
from colorama import init
init(strip=not sys.stdout.isatty())
from termcolor import cprint
from pyfiglet import figlet_format

text="Awesome Character Ascii Art"
cprint(figlet_format(text, font="standard"), "blue")

The standard font is the default, but there are many fonts you can try, like the starwars font for example. Here is the result of the former piece of code:

The width of your terminal is not autodetected by pyfiglet, and it defaults to 80 columns.

If you want the newline support, you will need my version of pyfiglet (You can find it here) but it should be merged in the mainline pyfiglet in a few days/weeks. (EDIT: It's done, you can get it with pip install pyfiglet now)

If you want a width matching the terminal you are using, you can use https://gist.github.com/jtriley/1108174.

You can then feed figlet_format with the newly calculated width:

cprint(figlet_format(text, font="standard", width=terminal_width), "blue")
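If you only target Python 3.3 or later, the standard library can supply terminal_width instead of the gist. This is an alternative to what the post uses, shown here as a minimal sketch:

```python
import shutil

# Falls back to 80x24 when the size cannot be determined (for example when output is piped).
terminal_width = shutil.get_terminal_size(fallback=(80, 24)).columns
print(terminal_width)
```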

Pyfiglet is a port of the figlet utility in python (versions 2.6 to 3.4 are supported). In fact it should behave the same way.

Have fun with your scripts ;)

[Trivia] Mount a Windows 7 share repository with mount

Trying to mount a Windows 7 share from the command line without specifying version 2.0 will result in an I/O error.

sudo mount -t cifs //ip_address/share-path mount-point -o user=youruser,vers=2.0

Monday, 9 February 2015

[Trivia] Debugging Qemu

Intro

First time I tried debugging Qemu, I had some problems with signals, serials, etc.

Debug

configure

Configure qemu with debugging enabled:
./configure --enable-debug --enable-fdt --target-list=arm-softmmu

signals

Add the following line to your .gdbinit:
handle SIGUSR1 SIGUSR2 noprint nostop

Qemu uses those signals for timeouts and internal stuff, so we don't need to track them.

serials

Then I had problems with serial lines in gdb. Reading the serial port worked well, but I couldn't write to it...

My workaround is to call:

socat -d -d pty pty

You get two ptys.
Call screen on the second one to be able to write to the serial port.

References

[0] http://lnotestoself.blogspot.fr/2014/01/arm-debugging-with-qemu-and-gdb.html

Saturday, 31 January 2015

[write-up] While Not Challenge

By @tic_le_polard

Context

There was a little challenge proposed this week by a friend at Securimag.
The goal was to write an infinite loop without the while instruction in python.

You can see the original article here: https://securimag.org/wp/news/while-not-challenge/

I remember saying this challenge was pretty dumb and then spending the next three hours searching how to do the perfect infinite loop without a while, for, map, iter, infinite /dev/urandom and all that stuff. Here is my write up.

The final result is for python3. I had to choose because I was messing with the internals of python.

The challenge

No while, no for, no lambdas, no lists growing in memory, no infinite file like /dev/urandom, no infinite generator, what's left?

My intention was to find out how I could write bytecode and make python evaluate it in some way. I found the great dis module, which lets its user see the bytecode of a function.

import dis

def test_dis():
    a = 1 + 2
    b = 3 + 4
    return a + b


dis.dis(test_dis)

gives us

  4           0 LOAD_CONST               5 (3)
              3 STORE_FAST               0 (a)

  5           6 LOAD_CONST               6 (7)
              9 STORE_FAST               1 (b)

  6          12 LOAD_FAST                0 (a)
             15 LOAD_FAST                1 (b)
             18 BINARY_ADD
             19 RETURN_VALUE

Great, now we are talking!
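The same information is available programmatically through dis.get_instructions (Python 3.4 and later), which is handy when you want to inspect opcodes from a script rather than read the printed listing. A minimal sketch follows; the exact opcodes and offsets differ between Python versions, so no particular output is claimed here, and the function name sample_function is just a stand-in.

```python
import dis

def sample_function():
    a = 1 + 2
    b = 3 + 4
    return a + b

# Each Instruction carries the opcode name, its numeric value and its byte offset.
instructions = list(dis.get_instructions(sample_function))
for ins in instructions:
    print(ins.offset, ins.opname, ins.opcode, ins.argrepr)
```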

Where is this bytecode stored? Reading the dis documentation, I found out that python bytecode is held by code objects.

Let's find out where these code objects are hidden in the python function object.

>>> def func():
...   pass

>>> dir(func)
['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']

>>> dir(func.__code__)
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'co_argcount', 'co_cellvars', 'co_code', 'co_consts', 'co_filename', 'co_firstlineno', 'co_flags', 'co_freevars', 'co_kwonlyargcount', 'co_lnotab', 'co_name', 'co_names', 'co_nlocals', 'co_stacksize', 'co_varnames']

>>> list(func.__code__.co_code)
['d', '\x00', '\x00', 'S'] 

>>> func.__code__.co_code = '10'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: readonly attribute

Dammit!

There must be a workaround for this. We need to know how the function object really works and what we can override in it.

[0] tells us that __code__ and func_code both represent the function's code and can be overwritten. So we only need to create our own code object with the bytecode we want and we've won.

This is not complicated, because all the objects we need in order to create the new code object are already created by the function object. Moreover, someone already did this! See [1] for more details on the creation of the new object.
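To see that __code__ really is writable without crafting any bytecode yet, here is a small and harmless sketch. It relies on the replace method of code objects, which only exists in Python 3.8 and later and is not what the original post uses; it swaps a string constant rather than the instructions.

```python
def greet():
    return "hello"

old_code = greet.__code__
# Build a copy of the code object whose string constant is swapped out.
new_consts = tuple("patched" if c == "hello" else c for c in old_code.co_consts)
greet.__code__ = old_code.replace(co_consts=new_consts)

print(greet())
```

The function object is the same, only the code object behind it changed, which is exactly the hook the infinite loop needs.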

The next step is to find the jump immediate equivalent opcode in python in order to build our infinite loop.

import dis

def test_dis():
    while True:
        pass

dis.dis(test_dis)
print(list(test_dis.__code__.co_code))

We get the following result:

  4           0 SETUP_LOOP               3 (to 6)

  5     >>    3 JUMP_ABSOLUTE            3
        >>    6 LOAD_CONST               0 (None)
              9 RETURN_VALUE
[120, 3, 0, 113, 3, 0, 100, 0, 0, 83]

So the "JUMP_ABSOLUTE 3" instruction is encoded as [113, 3, 0]: the opcode followed by a two-byte argument. An infinite loop can therefore be a single instruction, [113, 0, 0], which jumps to itself.
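The byte list above can be decoded by hand. The sketch below assumes the bytecode format used by CPython before 3.6 (the one shown in this post): opcodes below 90 take no argument, the others are followed by a 16-bit little-endian argument. The encode and decode helpers are written for this example only.

```python
HAVE_ARGUMENT = 90  # assumption: value of dis.HAVE_ARGUMENT in CPython before 3.6

def encode(opcode, arg=None):
    """Encode one instruction: opcode, then a 16-bit little-endian argument if it takes one."""
    if opcode < HAVE_ARGUMENT:
        return [opcode]
    return [opcode, arg & 0xFF, (arg >> 8) & 0xFF]

def decode(code):
    """Split a list of bytecode bytes into (opcode, argument) pairs."""
    pairs, i = [], 0
    while i < len(code):
        opcode = code[i]
        if opcode < HAVE_ARGUMENT:
            pairs.append((opcode, None))
            i += 1
        else:
            pairs.append((opcode, code[i + 1] | (code[i + 2] << 8)))
            i += 3
    return pairs

print(decode([120, 3, 0, 113, 3, 0, 100, 0, 0, 83]))
print(encode(113, 0))
```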

The final code:

import dis

def func():
    this_code_is_never_executed
    return tralala


# we redefine the code object to make it mutable
# code object is an implementation detail and differs between python versions
fco = func.__code__
func_code = list(fco.co_code)
# jump absolute based infinite loop
func_code = [ 113, 0, 0 ]

# we define a new instance of code object
# similar to the previous one but
# with our modified bytecode
func.__code__ = type(fco)(
        fco.co_argcount,
        fco.co_kwonlyargcount,
        fco.co_nlocals,
        fco.co_stacksize,
        fco.co_flags,
        bytes(func_code),
        fco.co_consts,
        fco.co_names,
        fco.co_varnames,
        fco.co_filename,
        fco.co_name,
        fco.co_firstlineno,
        fco.co_lnotab,
        fco.co_freevars,
        fco.co_cellvars
)

# did it work ?
print(list(func.__code__.co_code))
dis.dis(func)

# infinite loop
func()

Challenge Complete :-)

I had fun looking a bit in python internals, I hope you did too ;p

References

[0] https://docs.python.org/3/reference/datamodel.html
[1] http://www.jonathon-vogel.com/posts/patching_function_bytecode_with_python/

Saturday, 10 January 2015

[System] Emulate a PCI device with Qemu

By @tic_le_polard

I wanted to learn more about Linux drivers, but in order to write your own driver, you need to own the device you want to drive and if it doesn't exist (or if it's not stable yet), you are basically screwed. But do not panic, Qemu is here for you. You just need to create the device.

I will cover the following things:

A PCI device supporting interrupts, MMIO and PIO regions, DMA
A Linux PCI driver supporting this device (kernel 3.2)

The code of the device can be found at https://github.com/grandemk/qemu_devices/blob/master/hello_tic.c

The code of the driver can be found at https://github.com/grandemk/qemu_devices/blob/master/driver_pci.c

The Qemu PCI Device

Here is the flow of a Qemu device's code. (See [1] for a good explanation)

The first thing Qemu does is register our device inside its core. This is done with the type_init macro, which runs before Qemu's main. What's important here is that the TypeInfo struct representing our device is passed along.

static const TypeInfo pci_hello_info = {
    .name           = TYPE_PCI_HELLO_DEV,
    .parent         = TYPE_PCI_DEVICE,
    .instance_size  = sizeof(PCIHelloDevState),
    .class_init     = pci_hellodev_class_init,
};

As our device is a PCI peripheral, we extend the PCIDevice struct. The name, parent and instance_size fields are there for inheritance purposes and are implementation details.

The exciting thing here is the class_init function. It is called when the device is registered. Its goal is to let you override the virtual methods of PCIDeviceClass with those of your device, and to set the PCI configuration bytes which represent your device.

The PCI configuration space is used by the BIOS, the bootloader or the kernel, depending on your configuration, to ease the probing of devices. It declares to the world how your device works and where to write in order to communicate with it.

static void pci_hellodev_class_init(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);
    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
    k->init = pci_hellodev_init;
    k->exit = pci_hellodev_uninit;
    /* this identify our device */
    k->vendor_id = 0x1337;
    k->device_id = 0x0001;
    k->class_id = PCI_CLASS_OTHERS;
    set_bit(DEVICE_CATEGORY_MISC, dc->categories);

    k->revision  = 0x00;
    dc->desc = "PCI Hello World";
    /* qemu user things */
    dc->props = hello_properties;
    dc->reset = qdev_pci_hellodev_reset;
}

The vendor_id and device_id fields identify the PCI device. The Linux PCI subsystem uses them to decide which driver will operate your device. PCIDeviceClass inherits from DeviceClass, which is the base class used to represent a device and the functions it provides to the Qemu core.

Init, exit and reset are the basic functions you will need to write for your device. Init is called when your device is plugged in and exit when it is unplugged.

Exit and reset aren't really important functions for us. Exit is basically used to free the memory of your device. Init is the function you want to look at.

static int pci_hellodev_init(PCIDevice *pci_dev)
{
    /* init the internal state of the device */
    PCIHelloDevState *d = PCI_HELLO_DEV(pci_dev);
    printf("d=%lu\n", (unsigned long) &d);
    d->dma_size = 0x1ffff * sizeof(char);
    d->dma_buf = malloc(d->dma_size);
    d->id = 0x1337;
    d->threw_irq = 0;
    uint8_t *pci_conf;

    /* create the memory region representing the MMIO and PIO
     * of the device
     */

    memory_region_init_io(&d->mmio, OBJECT(d), &hello_mmio_ops, d, "hello_mmio", HELLO_MMIO_SIZE);
    memory_region_init_io(&d->io, OBJECT(d), &hello_io_ops, d, "hello_io", HELLO_IO_SIZE);

    /*
     * See linux device driver (Edition 3) for the definition of a bar
     * in the PCI bus.
     */
    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_IO, &d->io);
    pci_register_bar(pci_dev, 1, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->mmio);

    pci_conf = pci_dev->config;

    /* also in ldd, a pci device has 4 pins for interrupts
     * here we use pin B.
     */
    pci_conf[PCI_INTERRUPT_PIN] = 0x02;

    /* this device support interrupt */
    // d->irq = pci_allocate_irq(pci_dev);

    printf("Hello World loaded\n");
    return 0;
}

You can see a cast to PCIHelloDevState at the beginning. This is our own extension of the PCIDevice class, where we store the state of our device. So the first lines just initialize the internal state of the dummy PCI device we are writing.

pci_conf[] represents the configuration space of your pci device. We just need to specify that our device supports interrupts on some pin (here pin B). The Linux kernel PCI bus driver will read this information for us.

You can check it with the command lspci -v. If you want more information, you can also look at the /sys filesystem. For more information see PCI Chapter in [0].
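The same fields can also be decoded by hand from the raw configuration bytes, which sysfs exposes inside the guest as the config file of each device directory under /sys/bus/pci/devices/. A minimal Python sketch, using standard PCI header offsets and a hand-built header that mimics the values set by the device code (vendor 0x1337, device 0x0001, interrupt pin 2); the function name parse_pci_header is made up for the example.

```python
import struct

def parse_pci_header(config):
    """Decode a few fields from the first 64 bytes of PCI configuration space."""
    vendor_id, device_id = struct.unpack_from("<HH", config, 0x00)
    return {
        "vendor_id": vendor_id,
        "device_id": device_id,
        "revision": config[0x08],
        "interrupt_line": config[0x3C],
        "interrupt_pin": config[0x3D],  # 1 = INTA, 2 = INTB, ...
    }

# Hand-built header standing in for the real config file of the device.
fake = bytearray(64)
struct.pack_into("<HH", fake, 0x00, 0x1337, 0x0001)
fake[0x3D] = 0x02
info = parse_pci_header(bytes(fake))
print(info)
```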

Qemu represents MMIO and PIO regions as MemoryRegion structs. When the driver accesses the memory region, the callbacks we registered with memory_region_init_io are called.

static const MemoryRegionOps hello_mmio_ops = {
    .read = hello_mmioread,
    .write = hello_mmiowrite,
    .endianness = DEVICE_NATIVE_ENDIAN,
    .valid = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
};

If the CPU tries to read the MemoryRegion, the read callback is called. There is one more step necessary for our MemoryRegion to be accessible from outside the device: we have to register it in a PCI BAR.

A PCI device implements up to six I/O address regions. Each region consists of either memory or I/O locations. A BAR (base address register) is the PCI term for these regions. We call pci_register_bar for this purpose.
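The distinction between the two kinds of region is encoded in the low bits of the BAR value itself: bit 0 set means I/O space (which is what PCI_BASE_ADDRESS_SPACE_IO stands for), bit 0 clear means memory space. A short Python sketch of that decoding, for 32-bit BARs only; the raw values in the example are invented for illustration.

```python
def decode_bar(value):
    """Classify a raw 32-bit BAR value and strip its flag bits."""
    if value & 0x1:
        # Bit 0 set: I/O space; bits 0-1 are flags.
        return ("io", value & ~0x3)
    # Bit 0 clear: memory space; bits 0-3 are flags (type and prefetchable).
    return ("memory", value & ~0xF)

print(decode_bar(0xC001))
print(decode_bar(0xFEBF0008))
```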

You can request an irq with pci_allocate_irq and get an irq object, or you can use the pci functions for irqs.

static void hello_iowrite(void *opaque, hwaddr addr, uint64_t value, unsigned size)
{
    int i;
    PCIHelloDevState *d = (PCIHelloDevState *) opaque;
    PCIDevice *pci_dev = (PCIDevice *) opaque;
    printf("Write Ordered, addr=%x, value=%lu, size=%d\n", (unsigned) addr, value, size);
    switch (addr) {
        case 0:
            if (value) {
                /* throw an interrupt */
                printf("irq assert\n");
                d->threw_irq = 1;
                pci_irq_assert(pci_dev);
            } else {
                /* ack interrupt */
                printf("irq deassert\n");
                pci_irq_deassert(pci_dev);
                d->threw_irq = 0;
            }
            break;
        case 4:
            /* throw a random DMA */
            for ( i = 0; i < d->dma_size; ++i)
                d->dma_buf[i] = rand();
            cpu_physical_memory_write(0xa0000, (void *) d->dma_buf, d->dma_size); 
            break;
        default:
            printf("Io not used\n");
    }
}

The iowrite and mmiowrite functions have the same prototype. opaque is our PCIHelloDevState structure, addr is the offset the driver accessed within the region, value is what was written and size is the width of the write: 1, 2 or 4 bytes. ioread and mmioread follow the same principles.

You can use pci_irq_assert and pci_irq_deassert to raise and clear an interrupt. It's very easy.

You can use cpu_physical_memory_write to directly access the physical memory. This simulates a DMA but bypasses the IOMMU system of Qemu. Its use is now deprecated, but if you want to simulate a system without an IOMMU, it is the fastest way.

I will cover the IOMMU DMA in an update of this article.

Linux PCI driver

The code of the driver can be found at https://github.com/grandemk/qemu_devices/blob/master/driver_pci.c

Most of the time, to understand a driver, you need to begin from the end of the file. So let's begin with the last lines.

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("hello world");
MODULE_AUTHOR("Kevin Grandemange");

These macros describe the module. The license is not a trivial thing, because some kernel APIs are only available to GPL-licensed drivers via EXPORT_SYMBOL_GPL. Moreover, non-GPL drivers are not likely to be accepted into mainline Linux.

module_pci_driver(hello_tic);
MODULE_DEVICE_TABLE(pci, hello_tic_ids);

There are two types of drivers in Linux: bus drivers and device drivers. What we commonly call drivers are the device drivers. A bus driver provides an API that device drivers use to communicate with the hardware over a specific bus. PCI is one such bus. So here, module_pci_driver registers our driver with the bus driver that handles all the low-level pci stuff. MODULE_DEVICE_TABLE exports the ID table, in which we list the devices we are able to drive, so that the module can be loaded automatically for them. When one of these devices is found, our driver is called to handle the communication.

/* vendor and device (+ subdevice and subvendor) 
 * identifies a device we support
 */
static struct pci_device_id hello_tic_ids[] = {
    { PCI_DEVICE(PCI_TIC_VENDOR, PCI_TIC_DEVICE) },
    { 0, }, /* sentinel */
};
/* id_table describe the device this driver support
 * probe is called when a device we support exist and
 * when we are chosen to drive it.
 * remove is called when the driver is unloaded or
 * when the device disappears
 */
static struct pci_driver hello_tic = {
    .name = "hello_tic",
    .id_table = hello_tic_ids,
    .probe = hello_tic_probe,
    .remove = hello_tic_remove,
};

A driver is a service-based program. The pci_driver struct we hand to the PCI bus driver is the set of functions we implement. Probe is called when the PCI bus driver finds our device (PCI is really easy to handle: you don't need to search for your device, because the BIOS or equivalent firmware has already set everything up).

This function is used to read the PCI configuration space of the device and to request all the resources the driver will need (interrupts, virtual addresses for the MMIO zone, etc.).

Remove is the opposite of probe and is called when the device disappears or the driver is unloaded. The id table is a list of structs representing the devices we support. Devices are identified by their vendor ID and device ID.

You can use the PCI_DEVICE macro to ease the structure construction. You can also specify the subvendor and subdevice fields for more precision, but we don't need that here.
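To make the role of the id table and its sentinel more concrete, here is a minimal userspace sketch of the kind of lookup a bus driver performs. It is not the kernel's code: struct model_pci_id, model_match_id and the two ID values are hypothetical names and numbers picked for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct pci_device_id. */
struct model_pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* Hypothetical IDs, chosen for illustration only. */
#define MODEL_TIC_VENDOR 0x1337
#define MODEL_TIC_DEVICE 0x0001

static const struct model_pci_id model_ids[] = {
    { MODEL_TIC_VENDOR, MODEL_TIC_DEVICE },
    { 0, 0 }, /* sentinel: an all-zero entry ends the table */
};

/*
 * Walk the table until the sentinel and return the matching entry,
 * or NULL when the driver does not support this device.
 */
static const struct model_pci_id *
model_match_id(const struct model_pci_id *table, uint16_t vendor, uint16_t device)
{
    assert(table != NULL);

    for (; table->vendor != 0 || table->device != 0; table++) {
        if (table->vendor == vendor && table->device == device)
            return table;
    }
    return NULL;
}
```

The sentinel is what lets the table be passed around as a bare pointer with no separate length, which is why forgetting it makes the lookup run off the end of the array.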

static int hello_tic_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
    struct tic_info *info;
    info = kzalloc(sizeof(struct tic_info), GFP_KERNEL);
    if (!info)
        return -ENOMEM;

    if (pci_enable_device(dev))
        goto out_free;

    pr_alert("enabled device\n");

    if (pci_request_regions(dev, "hello_tic"))
        goto out_disable;

    pr_alert("requested regions\n");

    /* BAR 0 has IO */
    info->port[0].name = "tic-io";
    info->port[0].start = pci_resource_start(dev, 0);
    info->port[0].size = pci_resource_len(dev, 0);

    /* BAR 1 has MMIO */
    info->mem[0].name = "tic-mmio";
    info->mem[0].start = pci_ioremap_bar(dev, 1);
    info->mem[0].size = pci_resource_len(dev, 1);

    if (!info->mem[0].start)
        goto out_unrequest;   

    pr_alert("remaped addr for kernel uses\n");
    /* get device irq number */
    if (pci_read_config_byte(dev, PCI_INTERRUPT_LINE, &info->irq))
        goto out_iounmap;
    /* request irq */
    if (devm_request_irq(&dev->dev, info->irq, hello_tic_handler, IRQF_SHARED, hello_tic.name, (void *) info))
        goto out_iounmap;

    /* get a mmio reg value and change it */
    pr_alert("device id=%x\n", ioread32(info->mem[0].start + 4));
    iowrite32(0x4567, info->mem[0].start + 4);
    pr_alert("modified device id=%x\n", ioread32(info->mem[0].start + 4));

    /* assert an irq */
    outb(1, info->port[0].start);
    /* try dma without iommu */
    outl(1, info->port[0].start + 4);

    pci_set_drvdata(dev, info);

    return 0;

out_iounmap:
    pr_alert("tic:probe_out:iounmap\n");
    iounmap(info->mem[0].start);
out_unrequest:
    pr_alert("tic:probe_out_unrequest\n");
    pci_release_regions(dev);
out_disable:
    pr_alert("tic:probe_out_disable\n");
    pci_disable_device(dev);
out_free:
    pr_alert("tic:probe_out_free\n");
    kfree(info);
    return -ENODEV;
}

The first thing to do in a PCI driver is to call two functions: pci_enable_device and pci_request_regions.

The first one asks the low-level code to enable the device's I/O and memory resources, and the second one reserves those I/O and memory regions for our driver. Both calls must succeed before you continue.
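The probe function above unwinds with goto labels when a step fails. As a rough userspace sketch of that pattern (not kernel code), the model below replaces each PCI call with a boolean flag. struct model_state, model_probe and the fail_at parameter are hypothetical and exist only to show that a failure at step N undoes steps N-1 down to 1, in reverse order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of which setup steps are currently active. */
struct model_state {
    bool enabled; /* stands in for pci_enable_device() */
    bool regions; /* stands in for pci_request_regions() */
    bool mapped;  /* stands in for pci_ioremap_bar() */
};

/*
 * fail_at selects the step that fails (1 = enable, 2 = regions, 3 = map);
 * 0 means every step succeeds. Returns 0 on success, -1 on failure.
 */
static int model_probe(struct model_state *s, int fail_at)
{
    assert(s != NULL);

    if (fail_at == 1)
        goto out;
    s->enabled = true;

    if (fail_at == 2)
        goto out_disable;
    s->regions = true;

    if (fail_at == 3)
        goto out_release;
    s->mapped = true;

    return 0;

    /* Labels fall through, so later failures undo more steps. */
out_release:
    s->regions = false;
out_disable:
    s->enabled = false;
out:
    return -1;
}
```

Because each label falls through to the next one, the cleanup code is written once and every failure path leaves nothing half-initialized.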

We then get the PIO base with pci_resource_start on BAR 0 (remember how we declared those in the QEMU device).

The MMIO part is trickier because the CPU only uses virtual addresses, so we need to map the physical I/O address into the kernel's virtual address space. This is done with the ioremap function. Here we use the pci_ioremap_bar helper, which does everything you need in one call.

To find out which interrupt our device has been assigned, we use pci_read_config_byte to read the register where this information is stored. After that, we request the interrupt, which registers our interrupt handler with the kernel's interrupt handling code.

After this call, whenever the interrupt fires, all the interrupt handlers registered on this interrupt line are called.

One important thing to notice is that the interrupt you are using can be shared. This means you need to verify by some means that your device really triggered the interrupt. If it didn't, you tell the kernel that the interrupt was not yours by returning IRQ_NONE.
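Here is a small userspace sketch of what sharing an interrupt line implies. It is only a model, not the kernel's dispatch code: every handler on the line is called, and each one checks its own device before claiming the interrupt. All model_* names are hypothetical, and the pending field plays the role of a device status register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the two irqreturn_t values used in this article. */
enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Hypothetical device: 'pending' stands in for an interrupt status register. */
struct model_dev {
    int pending;
    int serviced;
};

typedef enum model_irqreturn (*model_handler_t)(void *dev_id);

/* Claim the interrupt only if our own device raised it. */
static enum model_irqreturn model_handler(void *dev_id)
{
    struct model_dev *dev = dev_id;

    if (!dev->pending)
        return MODEL_IRQ_NONE; /* not ours, touch nothing */
    dev->pending = 0;          /* deassert */
    dev->serviced++;
    return MODEL_IRQ_HANDLED;
}

/* Call every handler sharing the line; return how many claimed the interrupt. */
static int model_dispatch(model_handler_t *handlers, void **dev_ids, size_t n)
{
    int claimed = 0;

    assert(handlers != NULL && dev_ids != NULL);
    for (size_t i = 0; i < n; i++) {
        if (handlers[i](dev_ids[i]) == MODEL_IRQ_HANDLED)
            claimed++;
    }
    return claimed;
}
```

The dev_id pointer is what lets the same handler function serve several devices: each registration carries its own context, just like the last argument of devm_request_irq in the probe function.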

If you need a locking mechanism, initialize it before calling this function, because from then on your interrupt handler can be called whenever the device feels like it.

The devm part of the request_irq name means that this function is device managed: the irq is automatically freed when the driver is detached from the device. Note that this release happens after remove returns, so anything the handler uses must stay valid until then.

You can access the remapped MMIO memory with the iowrite{8,16,32} and ioread{8,16,32} functions, and talk to the ports of the device with the out{b,w,l} and in{b,w,l} functions.

These sizes correspond to the MMIO operation sizes in the QEMU device. This means you can do some weird stuff in QEMU, where reading an 8-bit value does something radically different from reading the same address with 32-bit granularity.
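As an illustration of size-dependent behaviour, here is a userspace sketch of a device read callback that receives the access size, in the spirit of the read callback of a QEMU MemoryRegionOps. The register layout, struct model_regs and model_mmio_read are hypothetical and do not come from the device in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file of a toy device. */
struct model_regs {
    uint32_t id;     /* value exposed at offset 4 */
    uint32_t reads8; /* counts 8-bit reads, as an example of a side effect */
};

/*
 * Toy read callback: the same offset behaves differently depending on
 * the access size requested by the guest (in bytes).
 */
static uint64_t model_mmio_read(struct model_regs *r, uint64_t addr, unsigned size)
{
    assert(r != NULL);

    if (addr != 4)
        return 0;

    switch (size) {
    case 1:
        r->reads8++;         /* byte reads have a side effect */
        return r->id & 0xff; /* and only expose the low byte */
    case 4:
        return r->id;        /* 32-bit reads return the full register */
    default:
        return 0;            /* other sizes are ignored in this model */
    }
}
```

On the driver side, the access size is selected by the function you call: ioread8 would reach the size 1 branch and ioread32 the size 4 branch.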

pci_set_drvdata lets us store a pointer to our data inside the pci_dev struct for later use.

static void hello_tic_remove(struct pci_dev *dev)
{
    /* get back the driver data we stored in the pci_dev struct */
    struct tic_info *info = pci_get_drvdata(dev);

    /* free the irq first so the handler can no longer run and touch info */
    devm_free_irq(&dev->dev, info->irq, info);
    /* then undo the probe steps in reverse order */
    iounmap(info->mem[0].start);
    pci_release_regions(dev);
    pci_disable_device(dev);
    kfree(info);
}

The remove function undoes the initialization done in the probe function. Resources should be released in the reverse order of their acquisition.

static irqreturn_t hello_tic_handler(int irq, void *dev_info)
{
    struct tic_info *info = (struct tic_info *) dev_info;
    pr_alert("IRQ %d handled\n", irq);
    /* is it our device throwing an interrupt ? */
    if (inl(info->port[0].start)) {
        /* deassert it */
        outl(0, info->port[0].start );
        return IRQ_HANDLED;
    } else {
        /* not mine, we have done nothing */
        return IRQ_NONE;
    }
}

This function is called when the interrupt we registered for fires. It is called in interrupt context, which means we may not sleep inside this function. If you need to sleep, defer the work to a threaded interrupt handler or a workqueue; tasklets also run in atomic context and cannot sleep.

You can see that dev_info contains the structure we passed to devm_request_irq. Notice the IRQ_NONE return value, which tells the kernel that this interrupt did not come from our device.
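To show the idea behind threaded interrupts, here is a userspace sketch of the split between a hard handler and a thread function. In the kernel this split is set up with request_threaded_irq, and the hard handler returns IRQ_WAKE_THREAD to ask for the thread function to run. Everything named split_* below is a hypothetical model, and the two halves are simply called one after the other instead of being scheduled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of IRQ_NONE, IRQ_HANDLED and IRQ_WAKE_THREAD. */
enum split_ret { SPLIT_NONE, SPLIT_HANDLED, SPLIT_WAKE_THREAD };

struct split_dev {
    bool irq_raised;  /* the device asserted its line */
    bool work_queued; /* the thread half has been asked to run */
    int  work_done;   /* number of completed thread-half runs */
};

/* Hard half: models interrupt context, so it only acknowledges and defers. */
static enum split_ret split_hard_handler(struct split_dev *d)
{
    assert(d != NULL);

    if (!d->irq_raised)
        return SPLIT_NONE;
    d->irq_raised = false; /* acknowledge the device quickly */
    d->work_queued = true;
    return SPLIT_WAKE_THREAD;
}

/* Thread half: models process context, where sleeping would be allowed. */
static enum split_ret split_thread_fn(struct split_dev *d)
{
    assert(d != NULL);

    if (!d->work_queued)
        return SPLIT_NONE;
    d->work_queued = false;
    d->work_done++; /* the slow part of the processing goes here */
    return SPLIT_HANDLED;
}
```

The design choice is to keep the hard half as short as possible, since nothing else can run on that CPU while it executes, and to move anything slow or blocking into the thread half.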

Have fun ;)

Hope you found it useful. I plan to create devices and drivers for other buses as well (platform driver, ISA, USB, etc.).

Sources

[0] Linux Device Driver (3rd Edition)
[1] http://ilevex.eu/post/88944209761/how-to-create-a-custom-pci-device-in-qemu
[3] Ctags of qemu and git log ;)
[4] http://free-electrons.com/doc/training/linux-kernel/linux-kernel-slides.pdf