I thought I'd write a little more about my home-grown x86-64 OS. This is about my threading mechanism.
The Task list
The global task list contains a list of 16bytes entries that contains the adress of the PML4 base
for the task and the value of the Ring0 RSP. Since task switches always occur during Ring0, only
the RSP0 needs to be saved (RSP3 will have been pushed on stack during privilege change).
Upons switching task, the scheduler will restore cr3 and rsp from the list.
Task State Segment
In 64bit long mode, the TSS doesn't have as much significance than in protected mode when hardware
context switching was available. The TSS is only used to store the RSP0 value. When a privilege
change is performed, from Ring3 to Ring0, the CPU will load the TSS.RSP0 value in RSP and the old
RSP3 value will have been pushed on the Ring0 stack. So the IRET instruction will pop out the Ring3
RSP value (hence why it is not needed in the TSS). The only time that the CPU will look into the TSS
is during a privilege switch from Ring3 to Ring0.
Only one TSS is needed in the system since the value it contains is a virtual address. Every task
will use the same address for RSP3 and RSP0 but it will be mapped differently.
So the TSS is only used to tell the process which address to use for the stack when going from Ring3
to Ring0. Since that address is the same virtual address for every task, only 1 TSS needs to be created.
There is no point in showing the TSS here since only one value is used. It's basically a 0x67 bytes long table
with a 64bit entry at byte index 4 that represents the RSP0 value. the rest of the table can be filled with zeros
Kernel and user tasks
Kernel tasks have the following properties:
- Run in ring0.
- Their Code Segment Selector is the same as the kernel.
- Each kernel task has its own stack.
- Code is located in kernel memory which is the identity-mapped portion of virtual memory.
- The RSP value that is set in the Ring0 stack (for iret) upon task creation is the ring0 stack pointer.
User tasks have the following properties:
- Run in ring3
- Their Code Segment Selector is the same for all user thread.
- Their Code Segment Selector has DPL=3
- Each user task has 2 stacks: one for Ring0 and one for Ring3
- Code is relocated at load time and appears to be running at virtual address 0x08000000
- The RSP value that is set in the ring0 stack (for iret) upon task creation is the ring3 base stack pointer.
When an interrupt occurs, the Trap or Interrupt Gate contains a DPL field and target Selector field. The DPL
will always be set to 3 so that any privilege levels (Ring0-Ring3) can access the interrupt handler. The target
selector is for a Code Segment Descriptor (inthe GDT) with a DPL value of 0. This will effectively run the
handler in Ring0.
Interrupt or task switch during user task
If the interrupt occured during a user task, the task may have been in Ring3 or in Ring0 if in the middle of a
system call. If a privilege change occurs, the value of RSP will be set to the value in the TSS for RSP0. That
value is the same for all tasks but mapped differently to physical memory. The ring3 stack will be left untouched.
If there is no privilege change, then the same stack will be used. The scheduler algorithm doesn't need to change
since the Ring3 stack's RSP has been pushed sometime before when entering Ring0 (ie: when calling a system
Interrupt or task switch during kernel task
If the interrupt occured during a kernel task then nothing special happens. The same stack is being used and same
protection level is applied. Although it would be possible to use the new ITS mechanism to use another stack
during interrupt handling. This would be usefull, for example, when a stack overflow occurs in a kernel thread
and the #PF exception is raised. If there is not stack switch, then the return address can't be pushed on the stack
since the stack if faulty (hence the #PF). So using another stack for interrupts would be a wise choice. But
I'm not implementing this for now.
A task switch will always be initiated by the timer iterrupt. The scheduler runs in Ring0 so the
Ring0 stack is used. If the interrupted task was in Ring3, we don't need to worry about its stack
since RSP has been pushed on the Ring0 stack.
The current CPU context will be saved on the Ring0 stack. Therefore, the scheduler will always save the RSP
value in the tasklist's rsp0 entry since it occurs in ring0 code. Upon resuming the task, RSP will be reloaded from
the rsp0 entry, context will be restored and the iret instruction will pop the ring3 RSP to resume execution.
In a multi-cpu system, it would be possible to have less threads running than the number of CPUs. In that case, upon
task-switching, a CPU would just continue to run its current thread, giving 100% of its time to that thread. But for
other CPUs that don't have any task to execute, they should get parked. Once a CPU's thread is terminated, if
there is no other threads to run for him, then it will parked. Normally the CPU would load the context
of the next task but instead it will simply jump to a park function. Since the scheduler is invoked
through an interrupt, we eventually need to "iret". But in the case of parking, we will just "jmp" to a
function and spin indefinitely. The iret instruction would restore flags and pop out the stack a jump to a return
address but in this case we don't want to do any of that. Since there are no flags to restore, no return function
and no stack to clean. The park function does the following:
- load kernel PML4 base address in cr3 to restore identity paging
- load RSP with a stack unique for the current CPU
- acknowledge interrupt (APIC or PIC)
- restore interrupt-enable flag (cleared when entering handler)
- spin-wait with "hlt" instruction.
// Set back the kernel page tables,
// set back the cpu's stack. This could be
// optimized but kept the way it is for clarity purposes.
// We're parking the CPU... it's not like we need the performance here
// re-enable interrupts and set some registers for debug and wait
Since interrupts are re-enabled and APIC has been ack'd, we will continue to be interrupted by the timer
and we are ready to take on a new task at any time. It would be wise to warn the parked CPU about
newly created tasks (with IPIs) so we don't need to wait until the timer to kick in. But that's for
Creating a new task
When a new task is created, memory is allocated for its stack and code. An initial context is setup
in the newly created stack. All registers are initialized to 0. Since the task will be scheduled by an interrupt handler,
it needs to have more data initilased in the stack for proper "iret" execution. Of course, the RSP value saved
in the task list will be the bottom of that context in the task's stack.
The initial stack looks as follow:
||0 for kernel thread / Ring3 selector for user thread
||Ring0 stack top or Ring3 stack top
||Ring0 Code selector or Ring3 Code selector
||entry point for Kernel thread / 0x080000000 for User thread
||Parameter to pass to thread on load
My scheduling algorithm is nothing fancy. Just the good old round-robin way. But on multi-processor, it is probably the worse
thing you could do. Here is an example task list (Note that I should also use a linked list instead of a fixed size table)
As this table show, each task except Task5 are running on one of the 4 CPUS. At the next schedule(), the task list will look like this:
The problem here is that Task 1,2 and 3 are still running. So doing a context switch imposed a large overhead for no reason.
But the fact that they are still running is good, but they are not running on the same CPU anymore. So they will never
fully benefit from the CPU cache since memory will need to be fetched all over again because that CPU had no idea about
the code and data being used by the task. A better algorithm would detect that the task should continue to run on the same CPU.
I need to take the time to think about it.
Ring 3 tasks
To make a task run in ring3, it needs the following
- A "non-conforming" code segment descriptor with DPL=3.
- cs must be loaded with RPL=3
- A writable data segment with DPL=3. This is for the stack.
- ss must be loaded with RPL=3
- The task register must be loaded with a valid tss selector. The TSS should be initialzed with a valid RSP0 value
for when the code switches to Ring0 (interrupts). TSS Descriptor:
Please refer to Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 for more information
These values would be on the stack prior to invoking "iret" from the scheduler. The scheduler runs in
ring0, hence why the RPL value in selectors is important
The kernel's page tables is as follow:
- The first 128mb is identity mapped using 2mb pages (first 64 page directory entries).
- The rest of the available memory is identiy mapped with 4k pages.
- The whole memory is again identity mapped, using 2mb pages, at 0x4000000000.
The kernel's 4k pages is only used to track free pages to be distributed to requesting threads.
Each 4k page entry has its 3 AVL bits cleared. This is how the kernel keeps track of available memory.
A page with non-zero AVL is in use.
So when a process needs to create a new page, it does the following:
- look through the kernel 4k pages structure for a free page. Mark is as not-free
- look though the process 4k pages structure for a free page. Mark is as not-free
- take the physical address of the kernel page (easy since it is identity mapped)
and store it in the process page entry. The process's page now maps to the proper physical page.
The AVL bits are set as follow:
- 0b000: physical memory pointed to by this page is free to use
- 0b001: physical memory pointed to by this page is also mapped somewhere else and is a stack page for a thread
- 0b010: physical memory pointed to by this page is used by the kernel
- 0b011: physical memory pointed to by this page is used as heap for a thread or the kernel
Creating a process
When creating a new process, the process code will start at 0x8000000. So the kernel will reserve a physical
page (as per the process in step 1 above). The physical address of that page will be used in the process's page
entry for address 0x8000000. This should correspond to Page Directory entry #64. Since the first 64 entries
are 2mb pages and the rest are pointers to page tables, then this virtual address maps to a 4k page.
So all processes will see their entrypoint at 0x08000000 but they will obviously be mapped to different
physical pages. This allows to to compile position dependant code and run it in the OS.
Whole memory identity mapping
Each process have their own virtual address mapping.
If a process calls a system function that needs to manipulate physical
memory this causes a problem. For example, if a process wants to allocate memory, it needs to access it own page tables.
but the pages reside above kernel memory so it is not mapped into process virtual addressing space. The same can happen
when trying to create a process: The system call will reserve a virtual address in its page tables,
which is identity mapped, but then it will attempt to fill in some data in the physical page,
that physical page resides above the kernel memory, so it is not mapped in the process. for example:
- Kernel memory ends at 0x08000000. Kernel page tables are 0x00020000.
- The kernel page tables is just a list of identity mapped pages above kernel memory
with the AVL field indicating if the page is available.
- The first entry maps to physical location 0x08000000. So when searching for a free page, the kernel will
find, for example, an entry at 0x00020B00 with the AVL bits set the "free".
- This entry maps to 0x08000000+(4096*(B00/16)) = 0x080B0000.
- The process would like that physical address to be mapped to 0x08001000 and fill it with zeros.
Two problem arises:
- The process cannot modify its page table to write an entry for 0x08001000 since
its page tables are not mapped.
- It cannot fill the page with zeros prior to map it since the page is not mapped and it
cannot access 0x080B0000 directly because this is a physical address and it would be interpreted
as a virtual address by the CPU.
The solution is to keep an indentity mapping of the whole memory at some other address. Lucky for us, with 64bit
addressing we have a very large address range. So identiy mapping will start at 0x4000000000 (256th gig).
That address will map to physical address 0x00000000. So by getting the PLM4 address from cr3 and
adding 0x4000000000 (or simply setting bit38), the process can now access its page tables.
Dynamic stack growth
A user thread is initially created with a 16k stack. If the thread attempts an access below the bottom of its stack,
a Page Fault exception (#PF) will occur since the page will not be mapped (unless the access went as far down as the heap,
in which case corruption would happen). The #PF handler will verify it the fault meets all the following conditions:
- Fault was caused by a not-present page.
- Fault was caused by an instruction executing in Ring3.
- Fault was caused by a write operation.
- Pages above the faulting address is a stack pages
If any of those condition is not met, the thread dies. If all of the conditions are met, then the OS
allocates a page for that address. The goal here is to try to recognize that the fault was caused
by a user thread wanting to use more stack space than it currently can in order to grow its stack.
The first 3 conditions are easy to understand. But 4th condition is unfortunately not that good.
It would be nice if the Exception error code would tell us if this operation was using rbp or rsp,
or any implied usage of the ss segment segment selector.
for example, all these operations should be recognized as stack growth demand:
- push %rax
- mov %rax,(%rsp)
- mov %rax,-5000(%rbp)
- mov %rax,-5000(%rbp)
- And to some extent: lea -10000(%rsp),%rax; mov %rbx,(%rax)
But it's impossible to detect all those conditions. So unfortunately, the best we can do is to detect if
the faulting address is in a page right under the bottom of the stack. So we just find the page table entry
for the faulting address, check if it is free and check if the next one is a stack page. If all is good, then
we allocate the page and return from the exception to retry the offending instruction and it should succeed.
But this algorithm is not perfect. What if the thread's RSP is at the bottom of the stack and it tries to
write at -5000(%rsp). This would fall 2 pages down below the bottom of the stack so the check would fail.
Unfortunately, I can't think of any better way of doing this. I wish Intel could give us a hint
that the #PF was due to a stack operation. After all, they seem to be able to make that deduction when trigerring #TS.
So it means that the following code could potentially be dangerous:
// This would be dangerous
// This would be safe.
// the fact that we access a page that is no further than 0x1000 from
// the bottom of the stack will make the #PF correctly create the page
// Doing this is called Stack Probing
a[0x1001] = a[0x1001];
After trying to find an alternate solution, I found out that apparently Windows and linux are facing the same dillema.
Apparently, my algorithm is good. When compiling a C application
under those OS, the compiler will detect if the program tries to allocate more than one page on the stack and
will generate "stack probing" code. So Stack Probably is a well-known technique to work around that limitation.
But if you write a ASM application under linux or windows, then you need to take care of that.
Another possible way I will eventually explore is to set the stack memory at virtual location 0x1000000000000.
By looking at bit 47of the offending address, I would get a strong hint about this being a stack operation.
This means the OS couldn't support systems with more than 256 terabytes of RAM.... It also has other downsides.