
Using epoll
Last edited on Oct 10, 2014

Introduction

I heard about epoll not so long ago and wanted to give it a try. Epoll is the new way to use non-blocking sockets (or any file descriptors). It replaces poll() and select().

epoll is easy enough to use. The good thing about it is that, unlike select(), you don't need to rebuild the fd_set each time you call epoll_wait. Here is a typical usage:

  • efd = epoll_create1()
  • epoll_ctl(efd, EPOLL_CTL_ADD, listeningSocket)
  • loop:
    • n = epoll_wait()
    • for i=0 to n:
      • if (events[i].data.fd == listeningSocket)
        • newsock = accept()
        • epoll_ctl(efd, EPOLL_CTL_ADD, newsock)
  • close(efd)
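
Here is a minimal sketch of that loop in C++. It only shows the accept path from the list above; "listeningSocket" is assumed to already be bound and listening, and error handling is kept to a bare minimum.

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void runEpollLoop(int listeningSocket)
{
    int efd = epoll_create1(0);

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listeningSocket;
    epoll_ctl(efd, EPOLL_CTL_ADD, listeningSocket, &ev);

    epoll_event events[64];
    bool running = true;                    // a real server would clear this on shutdown
    while (running)
    {
        int n = epoll_wait(efd, events, 64, -1);
        for (int i = 0; i < n; i++)
        {
            if (events[i].data.fd == listeningSocket)
            {
                int newsock = accept(listeningSocket, nullptr, nullptr);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = newsock;
                epoll_ctl(efd, EPOLL_CTL_ADD, newsock, &cev);
            }
            else
            {
                // read from / write to events[i].data.fd here
            }
        }
    }
    close(efd);
}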

Another nice thing about epoll is that you don't need to worry about removing a socket from the list. You can call epoll_ctl(efd, EPOLL_CTL_DEL, sock) if you want to, but when the socket is closed, it is removed automatically.

One thread

Using epoll, I can do all my sending and receiving in one thread. Some people may suggest sending data from other threads, but what happens if you get EAGAIN? Assume that thread A is the epoll thread. Thread B attempts to send 100 bytes but can only send 50. Then thread C attempts to write 100 bytes on the same socket, and by the time thread C runs, the socket is write-ready again. Thread C's bytes end up in the middle of thread B's message, so the remote end receives corrupted data. For that reason, threads B and C add their data to a queue, so that each message is guaranteed to be sent completely. The queue is emptied in thread A.

Since epoll might potentially handle thousands of connections, Thread A must do minimal work. It must only do this:

  • epoll_wait
  • if a socket is read-ready, read all from that socket and dispatch to consumers. Consumers should work fast; otherwise we should "post" the data to the consumer in another thread.
  • send all from the sendqueue

Edge-triggered VS Level-Triggered

epoll offers two modes of operation: edge-triggered and level-triggered. The default is level-triggered, so you might want to change this to edge-triggered in your application.

In level-triggered mode, epoll will keep returning as long as a socket is read- or write-ready. So if data is ready to be received on the socket and you only read part of it, the next epoll call will return because there is still data available. As soon as the internal receive buffer is empty, epoll won't return for that socket anymore. But since the socket will be write-ready almost all of the time, epoll_wait will keep returning anyway. This is not something you would typically want. My guess is that if you want to use level-triggered mode, you should not register for EPOLLOUT events unless you have something to send. While your application's TX buffer is empty, you don't register for EPOLLOUT. As soon as data is added to the TX buffer, you register for EPOLLOUT, epoll returns, and you write data out on the socket. If EAGAIN was returned, you will block on the next epoll_wait() and can send the rest of the buffer when the socket becomes write-ready again. Once the application's TX buffer is empty, you unregister for EPOLLOUT, as sketched below.
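
Here is a rough sketch of that EPOLLOUT toggling using epoll_ctl() with EPOLL_CTL_MOD. The names "efd" and "sock" are assumptions: the epoll instance and an already-registered connected socket. You would call enableWriteEvents() whenever data is queued for transmission and disableWriteEvents() once the TX buffer is empty again.

#include <sys/epoll.h>

void enableWriteEvents(int efd, int sock)
{
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLOUT;   // we now care about write-readiness
    ev.data.fd = sock;
    epoll_ctl(efd, EPOLL_CTL_MOD, sock, &ev);
}

void disableWriteEvents(int efd, int sock)
{
    epoll_event ev{};
    ev.events = EPOLLIN;              // back to read events only
    ev.data.fd = sock;
    epoll_ctl(efd, EPOLL_CTL_MOD, sock, &ev);
}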

With edge-triggered mode, things are different. You will only receive an event if the status has changed from not-ready to ready. So if you have incoming data and you only read part of it, the socket will still have data pending when you call epoll_wait, but epoll_wait will block because the state never changed from not-ready to ready. So if you are using that mode, you must make sure to read ALL the data on the socket until you get EAGAIN. If you want to send something, epoll_wait will only give you a write-ready event if you previously got EAGAIN while attempting to write out.

Let's say you have an internal transmit queue in which you add data when you want to send it out. The epoll thread would need to read from that queue, write the data on the socket and handle EAGAIN appropriately.

  • epoll_wait()
  • if a socket is read-ready, readAllFromThatSocket(socket);
  • sendAllData() // sends all that is contained in the internal queue.
  • goto step 1

But if epoll_wait() is blocked in edge-triggered mode and you add data to the queue (from another thread), sendAllData() will not be called until epoll_wait returns because data is ready to be received (it won't return because data is ready to be written, since you need to write first and get EAGAIN for that). To solve this problem, I created an eventfd (see sys/eventfd.h). I add the eventfd to the epoll_wait list, and whenever I add data to the application's TX queue, I write 1 to the eventfd so that epoll_wait breaks out. I could have used a timeout in epoll_wait, but that would add unnecessary latency to my reactor. This way of doing things is similar to what is called the "pipe to self trick". A sketch of it is shown below.
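
This is roughly what the eventfd wake-up looks like; "efd" is the epoll instance and the helper names are my own. Producer threads call wakeEpoll() after queuing data; the epoll thread sees the eventfd become readable, drains it, and then calls sendAllData().

#include <sys/eventfd.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <cstdint>

int createWakeupFd(int efd)
{
    int wakeFd = eventfd(0, EFD_NONBLOCK);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = wakeFd;
    epoll_ctl(efd, EPOLL_CTL_ADD, wakeFd, &ev);
    return wakeFd;
}

void wakeEpoll(int wakeFd)
{
    uint64_t one = 1;
    write(wakeFd, &one, sizeof(one));     // unblocks epoll_wait in thread A
}

void drainWakeupFd(int wakeFd)
{
    uint64_t count;
    read(wakeFd, &count, sizeof(count));  // resets the eventfd counter
}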

My framework

To play with epoll, I wrote a small reactor framework that exposes an interface that's easy to use. All the gory details are hidden inside the framework. To use it, you need to define two classes:

class OwnProtocolFramingStrategy: public IFramingStrategy
{
protected:
    virtual int processData(char* buffer, size_t size, FullMessageInfo& info)
    {
        // info.buffer will initially be zero. You need to allocate memory for that
        // buffer and copy the data from "buffer" to "info.buffer". If you determine
        // that the buffer does not contain a full message, return the number of bytes
        // you read from the buffer. In this case it should be "return size;", and the
        // next time some data is received, this function will be called with your
        // buffer again. You can update other fields in "info" to help you resume
        // next time.
        //
        // If you determine that a full message was received, set "info.complete = true"
        // and return the number of bytes that you read from the buffer. After returning
        // from this function, the buffer created in "info.buffer" will be considered as
        // containing a full message and will be passed to the TcpEngine. The next time
        // this function is called, info.buffer will be back to zero.
        //
        // If you determine that a full message was received but there is still data
        // left in the buffer, it probably means that you have two messages in the
        // buffer. As mentioned above, you return the number of bytes that you read
        // from the buffer. This function will then be called again with "buffer"
        // pointing to the byte after the last one you read, so you only ever need
        // to parse one message per call.
        //
        // If you find that the data is not well-formed, does not respect your protocol
        // or has any other error that prevents you from reliably building a full
        // message, return -1 and the client's connection will be aborted.
    }
};

An IFramingStrategy is a class that processes the TCP stream into full messages. As you know, when reading data from a socket, you may receive more than one message in a single read, or only half of a message. The IFramingStrategy is what parses the stream and builds messages. For example, when implementing an HTTP server, the strategy would handle chunked data, so the logic to calculate the message size would be done here.

class OwnProtocolServerEngine: public TcpEngine
{
public:
    OwnProtocolServerEngine():TcpEngine(new ClientFactory(this))
    {
    }

protected:
    virtual void onClientConnected(TcpClient* client)
    {
        // Maybe add the client to some list so we can use it later?
    }

    virtual void onClientDisconnected(TcpClient* client)
    {
        // If the client was added to a list, we should remove it here because
        // its instance will be deleted after returning from this method.
    }

    virtual void onClientMessage(TcpClient* client, char* buffer, size_t size)
    {
        // The buffer you get here is the one that was created in the strategy.
        // You own this buffer in this method, so you must free it from here
        // (or remember to free it at a later time).
        //
        // In this method, you are guaranteed to get one and only one full message,
        // assuming that the strategy was coded correctly.
        //
        // Note that everything happens in one thread. So if you saved a TcpClient
        // pointer in a list in the onClientConnected() method, it is perfectly
        // safe to use it here (e.g., for forwarding the message to someone else).
    }
};

Then, to launch a server:

OwnProtocolServerEngine engine;
engine.init(5555,"127.0.0.1",10);
if (!engine.start())
{
    printf("Could not start server\r\n");
    return -1;
}

getchar();
engine.stop();

The source code for the framework is here

Thread-safe queue

I said earlier that everything happens in one single thread, so everything *should* be safe. But what if I want to send data to a client from another thread? That should be possible. The transmit queue inside the TcpClient is thread-safe, so calling TcpClient::sendData() from another thread is safe. Accessing the client itself is another story: the client instance could get deleted at any time. There are ways around that, but it's beyond the scope of this framework.

Since the TX queue is a Multi-Producer, Single-Consumer FIFO, creating a thread-safe implementation is very easy. Of course, I didn't want to use any of the STL containers.

#pragma once

#include <cstdint>

// Multi-Producer, Single-Consumer, lock-free ring buffer
template <typename T>
struct MPSCRingBufferItem
{
    T item;
    bool valid;
};

template <typename T>
class MPSCRingBuffer
{
public:
    MPSCRingBuffer(uint32_t size)
    {
        this->head = 0;
        this->tail = 0;
        this->size = size;
        this->items = new MPSCRingBufferItem<T>[size];
        for (uint32_t i = 0; i < size; i++) this->items[i].valid = false;
    }

    ~MPSCRingBuffer()
    {
        delete[] this->items;
    }

    bool put(T item)
    {
        uint32_t oldTail = this->tail;
        uint32_t newTail = (oldTail+1)%this->size;

        // This is the critical section, this is where we reserve our slot
        while (!__sync_bool_compare_and_swap(&this->tail,oldTail,newTail))
        {
            oldTail = this->tail;
            newTail = (oldTail+1)%this->size;
            if (newTail == this->head) return false; // overflow
        }

        if (newTail == this->head) return false; // overflow
        this->items[oldTail].item = item;
        this->items[oldTail].valid = true;
        return true;
    }

    // head needs no protection. This class is for a single consumer only.
    bool get(T& item)
    {
        if (this->head == this->tail) return false;
        if (!this->items[this->head].valid) return false; // tail was increased but object not inserted yet.
        item = this->items[this->head].item;
        this->items[this->head].valid = false;
        this->head = (this->head+1)%this->size;
        return true;
    }

private:
    MPSCRingBufferItem<T>* items;
    uint32_t head;
    uint32_t size;
    uint32_t tail;
};

This is using __sync_bool_compare_and_swap, which I strongly suspect uses the x86 instruction CMPXCHG. I described that instruction here. I didn't fully test that code though; I adapted it from a version I had written in ASM, so there might be bugs in the C++ version.
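
For reference, a quick usage sketch of the ring buffer, assuming the queue holds pointers to messages allocated with new char[]. Producer threads may call put() from anywhere; only the epoll thread calls get().

MPSCRingBuffer<char*> txQueue(1024);

void producerThread(char* msg)
{
    if (!txQueue.put(msg))
    {
        // queue full: drop the message or retry later
    }
}

void epollThread()
{
    char* msg;
    while (txQueue.get(msg))
    {
        // write msg on the socket here, then release it
        delete[] msg;
    }
}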



Memory Paging
Last edited on Sep 26, 2014

Introduction

I would like to describe how memory paging works. So far, I have played with x86 paging, x86_64, and avr32 paging. All 3 architectures are different but they share some fundamental concepts that I will try to describe here. There are many ways to implement a paging mechanism, but I am only sticking to the basics here. In my opinion, this article is a good place to start if you want to understand the basics of paging before moving to the technical details and advanced algorithms.

Mathematics of pages

When an instruction attempts to access a memory location, let's say at address 0x00002345, the MMU will treat that as a logical address. It will separate the address into a page number and an offset. If using a paging system with a 4k page granularity, pages will be 4k each. By taking the address and dividing it by 4096, you get a quotient of 0x02 and a remainder of 0x345, so the MMU would determine that the address points to page #2 at offset 0x345 within the page. The fact that page sizes are always powers of 2 is very convenient because instead of doing a division by 4096, you can simply extract the right-most 12 bits as the offset and all upper bits as the page number. So page_number = address>>12 and offset = address&0xFFF.

Note that when using a logical address, you don't need to think about the page number and offset. Using address 0x00002345 will yield page 0x2, and offset 0x345, and if you go 4k above that, the address would be 0x00003345, so page 0x3 and offset 0x345. This is all done naturally.

Page tables

Page tables are tables of descriptors residing in memory. Each descriptor is, for example, 8 bytes long, or 64 bits. Descriptors are laid out in page order: descriptor 0 contains information about page 0 (0..4095), descriptor 1 about page 1 (4096..8191), and descriptor N about page N (N*4096..(N+1)*4096-1). A descriptor is a mapping to physical memory; it contains a physical address. So let's say you need to access logical address 0x0000000000003DEF. The MMU will determine that you want to access page 3, offset 0xDEF. Descriptor 3 could hold the value 0x0123456789ABC000. That means the MMU will map the logical address 0x0000000000003DEF to physical address 0x0123456789ABC000 + offset 0xDEF, so 0x0123456789ABCDEF. Notice that we have just transposed the offset onto the physical address? That is no coincidence. It is because the addresses in the descriptors will always be a multiple of 4096, since pages can only start on a 4k boundary (for 4k pages, that is). Therefore, the 12 lower bits of the descriptor will always be zero. For that reason, those bits are reused to contain other information about the page such as permissions etc., but that is outside the scope of this article. Basically, what this all means is that the mapping is: PhysicalAddress = (PageTable[LogicalAddress>>12] & ~0xFFF) | (LogicalAddress&0xFFF).
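
The formula above translates directly into code. Here is a small sketch, assuming the simplified single-level table used in this article: "pageTable" is an array of 64-bit descriptors indexed by page number, and the low 12 bits of a descriptor hold flags rather than address bits.

#include <cstdint>

uint64_t translate(const uint64_t* pageTable, uint64_t logicalAddress)
{
    uint64_t pageNumber   = logicalAddress >> 12;     // divide by 4096
    uint64_t offset       = logicalAddress & 0xFFF;   // remainder
    uint64_t physicalBase = pageTable[pageNumber] & ~0xFFFULL;
    return physicalBase | offset;
}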

The mapping is done transparently by the CPU!

So far, paging is cool, but here is why it is powerful: there can be many page tables in memory, and at any moment the MMU can be instructed to use another table to change the mapping. Usually, in an operating system, you want to use one page table per process. Let's say you write an operating system and, to make things simple, you want the heap space of each process to begin at 0xBEEF1000. That is easy enough with paging, because you just create a page table for each process and put an address in the descriptor for page number 0xBEEF1. Obviously, you don't want each process to overwrite the data of the other processes, so page 0xBEEF1 in each page table would point to a different physical location.

Note that if you only have 1 GB of RAM, the last accessible physical memory location would be 0x3FFFFFFF. But there is no limit to the logical addresses (other than the architecture size, 32-bit or 64-bit) as long as they map to a valid physical location (that is, under 0x40000000 if you have 1 GB).

Since each process can have a full page table to itself, it means that, as far as it knows, it has a full 4 GB of RAM (in the case of a 32-bit architecture) that it can access. So let's say that you have 4 processes and 4 GB of RAM.

  • Process 1 maps the first gig of logical addresses to the 1st gig of physical addresses.
  • Process 2 maps the first gig of logical addresses to the 2nd gig of physical addresses.
  • Process 3 maps the first gig of logical addresses to the 3rd gig of physical addresses.
  • Process 4 maps the first gig of logical addresses to the 4th gig of physical addresses.

If process 1 needs more memory, it sees that only the 1st gig of memory is used (because it only deals with logical addresses and not physical addresses). So it will naively think that there are 3 GB of RAM left. It will ask the kernel to map another gigabyte of RAM, but the kernel will say: Wait! There is no more physical memory to which I can map these logical addresses. Well, that's not entirely true. This is where swapping comes in.

Page Swapping and Page faults

When the kernel runs out of physical memory to map to logical memory for a process, it will re-use a physical location. There are many ways of determining which physical location to use, but for simplicity, let's assume that this is done randomly. As per our example above, all 4 processes use all the memory. Process 1 requests the kernel to map some physical memory to logical address 0x40000000. The kernel has no more physical memory to hand out, so it picks a random location and chooses physical address 0x80000000, for example. This address is already mapped to page 0 of process 3. So the kernel will modify page descriptor 0 in the page table of process 3 and mark it as invalid. Remember the low 12 bits of the descriptor that I said would be used for information about the page? Well, the "invalid" bit is one of those 12 bits. The kernel will then take the full 4k residing at that physical location and store it on the hard drive, in a swap partition (or file). It will then map the newly requested logical space for process 1 to that physical location. At this point, we have 2 processes with 2 different pages that point to the same physical location. Only one of those processes has the page marked as valid in its descriptor.

When execution resumes on process 3, it could attempt to access page 0 of its logical space. When it does, the MMU will see that the page is marked as invalid and will trigger a page fault exception. The kernel will then handle the page fault exception: it will swap the 4k in memory with the one that was previously stored on the hard drive, mark the faulting page as valid and mark the other one as invalid. And that is how swapping works.

Fragmentation

Another use of paging is to fight memory fragmentation. Let's say that you have 2 processes and only 16k of memory, so 4 pages of 4k. Process 1 is using two 4k buffers and has a mapping like this:

  • Page0 = 0x0000
  • Page1 = 0x2000
  • Page2 = unused
  • Page3 = unused

It means that process 1 is using 8k of RAM, so there is only 8k of RAM left. Process 2 comes in and wants to create a buffer of 8k. Obviously, it expects that buffer to be contiguous memory. But that would be impossible since there are only two 4k chunks left and they are not contiguous. Paging solves that. Process 2 will have a mapping like this:

  • Page0 = 0x1000
  • Page1 = 0x3000
  • Page2 = unused
  • Page3 = unused

When process 2 accesses logical addresses 0x0000..0x0FFF, they will be transparently redirected to 0x1000..0x1FFF, and when it continues to logical 0x1000..0x1FFF, those accesses will be redirected to 0x3000..0x3FFF. So the logical memory IS contiguous. This concept makes it very easy to avoid memory fragmentation as we allocate/deallocate memory dynamically.

Translation Lookaside Buffer (TLB)

When an access to memory is requested, the CPU needs to translate the virtual address into a physical address. As I mentioned above, this is done by walking through the page tables. This process adds a significant amount of time to the memory access because the CPU needs to make one or several memory accesses just to decode the address. So we are talking about more than 100% overhead here. To reduce the impact of this process, most CPUs (if not all) implement what is called the Translation Lookaside Buffer, or TLB. When a virtual address is translated, the result is stored in the TLB. So when the next memory access is requested, the CPU starts by looking in the TLB. If a match is found, then it doesn't need to look into the page tables. The TLB is inside the CPU, so accesses to it are very fast. But the TLB has a limited size. When a new virtual address is translated and the result cannot be stored in the TLB, the CPU will choose a "victim" entry (the word "victim" is actually used in some official documentation) and delete that entry to replace it with the new one. The algorithm for choosing the victim depends on the CPU implementation, but it would mostly be a "least recently used" type of algorithm. The TLB works kind of like a DNS cache, but without a TTL.

When a context switch occurs, the new running process might want to use a different page table for its own translations. That is actually one of the many benefits of using a virtual memory system: each process can have its own translation tables. The OS would then instruct the CPU to use another table by giving it the base address of the new table. But the TLB still holds some translations that might not match the new table. We call those entries "stale entries". So when updating the page table base address, we must also flush the TLB to remove all stale entries. But flushing the TLB can add a significant amount of processing. And what if the new process only needs one translation, and then we context-switch back to the old process? There could be enough room in the TLB to store all the needed translations for both processes, so flushing the TLB is very inefficient in that case. That is why most CPUs implement the concept of an Address Space ID (ASID). On a context switch, you would update a special register in the CPU to tell it the ID of the currently running process (that would be the ASID). When the CPU adds an entry in the TLB, it adds a "tag" to the entry: that tag is the ASID. When a TLB lookup is performed by the CPU, any entry tagged with an ASID that does not match the current ASID is disregarded. So it would be possible to have two identical translations but with different ASIDs. This eliminates the need to flush the TLB on every context switch, because stale entries will be tagged with a different ASID anyway. When the TLB becomes full, the victims will preferably be entries for ASIDs other than the current one.

Some translations might need to be performed very often. Let's take, for example, the page containing the code for interrupt handlers. That page needs to be accessed very fast, and we would want it to stay in the TLB all the time. Most CPUs offer the possibility to mark some TLB entries as "permanent" or "locked". These entries will never be chosen as victims if the TLB becomes full.

Addressing Space Limitation

Some people may wonder: if my CPU is a 32-bit CPU, why can't I use 4 GB of RAM? This is because a 32-bit CPU can only use a 32-bit integer as the address of a memory location, so it could theoretically address 4 GB of RAM. But you might have a video card in your computer that has 1 GB of RAM on it. That 1 GB must be accessible by the CPU. So if you have 4 GB of RAM and 1 GB on the video card, it means that your CPU would need to access a total of 5 GB of RAM, but that is impossible. Your video card's RAM will be assigned a physical range of addresses. The 32-bit addressing really just defines an addressing space. This is kind of like living on a very long street where a rule in your neighbourhood prevents you from using more than 2 digits on the address plate of your house. Even though about 1000 houses could be built on that street, after the 100th house the next ones couldn't be assigned a house number because the addressing space was all used up. In my example, I talk about a video card, but there are many other peripherals that use addresses from the addressing space.

Then you might wonder: how come I have a 32-bit CPU and I can access my full 4 GB on Linux or Windows 7? This is because newer (they are actually old at the moment) CPUs have a 36-bit addressing mode even though the CPU is 32-bit. But the OS must make use of that. Windows XP did not use the 36-bit addressing mode even though the CPU offered it. That is why it was impossible to use your whole RAM on Windows XP.



Implementing HTTP Digest Authentication
Last edited on Aug 15, 2014

Recently, I was trying to add HTTP digest authentication to my home automation device. The device exposes a REST interface through a proxy server. My web server is set up like this:

Now since my API is exposed to the world by proxying it like that, I wanted to add security by implementing HTTP digest authentication. Whether Digest authentication with MD5 is secure is a completely different story, but let's assume it is good enough for now. I have a restricted-access web page that I use to control my home automation device. This web page makes requests to DHAS using JavaScript. Since I've implemented digest authentication, I now need to put the credentials in the JavaScript so that the calls made with XMLHttpRequest can succeed. Even though that JavaScript code will only be served to me, while I am authenticated on the website, I felt uncomfortable leaving a hardcoded username and password in the JS source. So this is what I came up with:

Note that messages sent from JS to DHAS are being proxied by Apache. Therefore, DHAS receives a GET for /insteon/listmodules and not for /dhas/insteon/listmodules

  • use XMLHttpRequest to make a request to DHAS (through the proxy)
  • add a header "X-NeedAuthenticationHack" in the request
  • receive a 401
  • get the "X-WWW-Authenticate" header from the 401 response
  • Make a XMLHttpRequest to the server and send it the "X-WWW-Authenticate" data
  • Server-side PHP script with hardcoded username/password for DHAS solves the challenge and returns the response
  • use XMLHttpRequest to make a request to DHAS (through the proxy) and append the response in a "Authorization" header

So basically, I just intercepted the 401 and, instead of letting the browser prompt for a username and password, I created the response myself. And instead of doing it in the JS, I did it on the server, limiting the exposure of the username/password. You may notice my two special X headers. This is because if the server returns a 401 with WWW-Authenticate, the browser will prompt for your credentials, even if I have a handler defined to catch the 401. So when I send my initial request, I set the X-NeedAuthenticationHack header to tell the server: "Hey, don't send me a WWW-Authenticate, send an X-WWW-Authenticate instead so I can deal with it".

By the way, even if the information is easy to find, this is how the digest authentication is done:

  • Client makes request to http://webserver.com/url1/index.html
  • Server sends: WWW-Authenticate: realm="testrealm", nonce="testnonce"
  • ha1 = md5("username:testrealm:password")
  • ha2 = md5("GET:/url1/index.html")
  • ha3 = md5(ha1+":testnonce:"+ha2)
  • Client sends: Authorization: Digest username="username", realm="testrealm", nonce="testnonce", uri="/url1/index.html", response="<ha3>"
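
Here is a small sketch of those hashing steps in C++, using OpenSSL's MD5 (link with -lcrypto). The username, password, realm, nonce and URI are the placeholder values from the list above.

#include <openssl/md5.h>
#include <string>
#include <cstdio>

static std::string md5Hex(const std::string& data)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char*>(data.data()), data.size(), digest);

    char hex[2 * MD5_DIGEST_LENGTH + 1];
    for (int i = 0; i < MD5_DIGEST_LENGTH; i++) sprintf(hex + 2 * i, "%02x", digest[i]);
    return std::string(hex);
}

std::string computeDigestResponse()
{
    std::string ha1 = md5Hex("username:testrealm:password");
    std::string ha2 = md5Hex("GET:/url1/index.html");
    return md5Hex(ha1 + ":testnonce:" + ha2);   // goes in the "response" field
}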



Stack frame and the red zone
Last edited on Mar 18, 2014

After days, and days, and days of troubleshooting odd problems I had in my homebrew OS, I found out that they were caused by my C compiler. After spending all these years arguing with everyone that it is easier to make a hobby OS in pure assembly, I decided to make this OS with asm and C, and I just proved to myself that C is evil! Seriously, C is not evil, but it does hide a lot of things that make it hard to know what your OS does.

So after disassembling all my C code and inspecting the assembly code, I found this:

55                push rbp
4889E5            mov rbp,rsp
48897DE8          mov [rbp-0x18],rdi
488975E0          mov [rbp-0x20],rsi
488955D8          mov [rbp-0x28],rdx
C745FC00000000    mov dword [rbp-0x4],0x0
C9                leave
C3                ret

Notice how the function never decreases the stack pointer? Arguments are stored below the stack pointer. Can you imagine how insane that is? What would happen if an interrupt triggered while in that function? The correct code would be:

55                push rbp
4889E5            mov rbp,rsp
4883EC28     ---> sub rsp,byte +0x28 <---
48897DE8          mov [rbp-0x18],rdi
488975E0          mov [rbp-0x20],rsi
488955D8          mov [rbp-0x28],rdx
C745FC00000000    mov dword [rbp-0x4],0x0
C9                leave
C3                ret

It turns out that this behavior is normal according to the amd64 ABI. There is a thing called the "red zone". The red zone is a 128-byte area that is guaranteed to be untouched by interrupt handlers (I'm not sure how, though). To quote the ABI:

The 128-byte area beyond the location pointed to by %rsp is considered to be reserved and shall not be modified by signal or interrupt handlers. Therefore, functions may use this area for temporary data that is not needed across function calls. In particular, leaf functions may use this area for their entire stack frame, rather than adjusting the stack pointer in the prologue and epilogue. This area is known as the red zone.

So my solution was to just disable that damn red zone with the gcc flag "-mno-red-zone". I'm guessing that the compiler does that to improve performance because it assumes that your code will be running in ring 3, so when an interrupt occurs, the stack will change because the handler will run in ring 0. Sure, it improves performance because there is one less instruction in the code, but I think that's a huge assumption to make. It definitely isn't valid when you are writing kernel code anyway.



AVX/SSE and context switching
Last edited on Mar 18, 2014

This article describes the way I designed AVX/SSE support in my homebrew OS.

AVX registers

In long mode, there are 16 XMM registers. These registers are 128 bits wide. With AVX, these registers are extended to 256 bits and named YMM. The YMM registers are not new registers, they are only extensions: YMM0 is to XMM0 what AX is to AL, meaning that XMM0 represents the lower 128 bits of the YMM0 register.

The xcr0 register enables processor state saving for the XSAVE and XRSTOR instructions. The way to set bits in xcr0 is by using the XSETBV instruction. These bits represent feature sets.

  • 0b001: FPU feature set. Will save/restore content of FPU registers
  • 0b010: XMM feature set. Will save/restore all XMM registers (128bit)
  • 0b100: YMM feature set. Will save/restore upper half of YMM registers

Since YMM registers are 256-bit registers, and the XMM registers alias the lower 128 bits of the YMM registers, it is important to enable bits 1 and 2 in order to save the entire content of the YMM registers.

Enabling AVX support

  • Enable monitoring of media instructions so that #NM is generated when CR0.TS is set: CR0.MP (bit 1) = 1
  • Disable coprocessor emulation: CR0.EM (bit 2) = 0
  • Enable fxsave instruction: CR4.OSFXSR (bit 9) = 1
  • Enable #XF instead of #UD when a SIMD exception occurs: CR4.OSXMMEXCPT (bit 10) = 1
  • Enable XSETBV: CR4.OSXSAVE (bit 18)= 1
  • Enable FPU, SSE, and AVX processor states: XCR0 = 0b111
mov     %cr0,%rax
or      $0x2,%rax                    # set CR0.MP (bit 1)
and     $0xFFFFFFFFFFFFFFFB,%rax     # clear CR0.EM (bit 2)
mov     %rax,%cr0

mov     %cr4,%rax
or      $0x40600,%rax                # set OSFXSR, OSXMMEXCPT and OSXSAVE
mov     %rax,%cr4

mov     $0,%edx
mov     $0x7,%eax                    # FPU, XMM and YMM feature sets
mov     $0,%ecx
xsetbv

Context Switching

On a context switch, it is important to save the state of all 16 YMM registers if we want to avoid data corruption between threads. Saving/restoring 16 256-bit registers can add a lot of overhead to a context switch (we could even wonder if implementing a fast_memcpy() is worth it because of that overhead). Saving/restoring is done with the XSAVE and XRSTOR instructions. Each instruction takes a memory operand that specifies the save area where registers will be dumped/restored. These instructions also look at the content of EDX:EAX to know which processor states to save: EDX:EAX is bitwise ANDed with XCR0 to determine which processor states to save/restore. In my case, I want to use EDX:EAX = 0b110 to save the XMM and YMM state, but not the FPU state. Remember, if we only set 0b100, we will only get the upper half of each YMMx saved/restored; to get the lower half, we need to set bit 1 to enable XMM state saving.
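
For illustration, here is roughly what the save/restore could look like using the compiler's XSAVE intrinsics instead of hand-written assembly (compile with -mxsave). The 4096-byte area size is an assumption that comfortably covers the legacy region, the XSAVE header and the YMM state; a real implementation should query the required size with CPUID leaf 0x0D.

#include <immintrin.h>
#include <cstdint>

struct AvxSaveArea
{
    alignas(64) uint8_t data[4096];   // XSAVE requires 64-byte alignment
};

void saveAvxState(AvxSaveArea* area)
{
    _xsave(area->data, 0x6);          // bit 1 = XMM state, bit 2 = upper YMM halves
}

void restoreAvxState(AvxSaveArea* area)
{
    _xrstor(area->data, 0x6);
}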

Optimizing context switching - lazy switching

Since media instructions are not used extensively by all threads, it is possible that one thread does not use any media instructions during a time slice (or even during its whole lifetime). In such a case, saving/restoring the whole AVX state would add a lot of overhead to the context switch for absolutely nothing.

There is a workaround for this. In my OS, every time there is a task switch, I explicitly set the TS bit in register CR0. Every time a media instruction is executed while the CR0.TS bit is set, a #NM exception (Device Not Available) is raised. My OS then handles that exception to save/restore the AVX context. So if a task does not use media instructions during a time slice, no #NM is triggered and there is no AVX context switch. The logic is simple:

  • Assume that there is a global kernel variable called LastTaskThatRestoredAVX.
  • On task switch, set CR0.TS=1
  • media instruction is executed, so #NM is generated
  • on #NM:
    • clear CR0.TS
    • if LastTaskThatRestoredAVX==current task, return from exception (still the same context!)
    • XSAVE into LastTaskThatRestoredAVX's save area
    • XRSTOR from current task's save area
    • LastTaskThatRestoredAVX = current task
  • Next media instruction to be executed will not trigger #NM, because we cleared CR0.TS
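
A minimal sketch of that #NM handler logic, reusing the save/restore helpers from the sketch above. The Task structure and the way the handler learns about the current task are assumptions; clearTaskSwitchedFlag() simply executes the privileged clts instruction.

struct Task
{
    AvxSaveArea avxArea;
};

static Task* lastTaskThatRestoredAVX = nullptr;

static inline void clearTaskSwitchedFlag()
{
    asm volatile("clts");   // clear CR0.TS, kernel mode only
}

void onDeviceNotAvailable(Task* currentTask)
{
    clearTaskSwitchedFlag();

    // Same task as last time: the AVX registers already hold its context.
    if (lastTaskThatRestoredAVX == currentTask) return;

    if (lastTaskThatRestoredAVX != nullptr)
        saveAvxState(&lastTaskThatRestoredAVX->avxArea);
    restoreAvxState(&currentTask->avxArea);
    lastTaskThatRestoredAVX = currentTask;
}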

Save area

The memory layout of the saved registers will look like this (notice how the highest 128 bits of the YMM registers are saved separately):




