Xv6 scheduler and context switch
Hacking the xv6 kernel scheduling algorithm
We will explore the xv6 kernel scheduling algorithm as well as the context switch mechanism.
The source code for xv6 kernel is here.
I will not cover everything in this post, if you want to learn more, checkout those amazing references and resources at the end. This post covers context switch in term of timer-interrupts only, other cases may be for future blog posts.
Background
Scheduling
In the previous blog about [[Adding a system call to XV6]], we know that user programs are executed in a restricted manner. We know how the kernel provides a mechanism called system call for user that wants to access privilege resources on the hardware. The abstraction of process also gives our programs an illusion that we alone are using the hardware computing resources. The truth is most of us who studies computer science knows that, the kernel runs our processes by run one process, stop it and start running another repeatedly. This is called time-sharing. Time-sharing allows limited hardware resources to be shared among many processes, each process is given a small slice of processing time to execute their task.
Imagine this scenario, if we have only 1 CPU(1 core) on our machine, if a user process is using the CPU which means that the kernel is not running. How can the OS takes back control of the CPU from the user program?
If you are using a programming language like Go that implements concurrency using user thread, how does the Go runtime take back control from other Go routines?
In the context of limited direct execution, time-sharing allows the kernel to control how long user programs can run. By periodically forces user programs to give back the CPU to the controller, user programs can not uses resource forever. By giving the control back to the kernel, the kernel's scheduler can decide which processes to run. The scheduling algorithm is the kernel policy to decide which processes to run.
Multiplexing processes onto hardware processor
There are 2 cases when the kernel will switch from one process to another:
- OS periodically forces a switch when a process is executing in user space to the kernel space. This is called a timer-interrupt. In the previous blog post about [[Adding a system call to XV6]], we knows that when a system call is invoked, the user program must issue an interrupt instruction to transition to the kernel, the timer-interrupt is handled by the same mechanism. We will explore this later on in the xv6 kernel code.
- When processes have to wait for device or pipe I/O to complete, when parent process waits for a child process to exit or the
sleep
system call.
Context Switch
When the OS wants to stop one process to run another, it must save the context of currently running process. The kernel must save the old thread's CPU registers and restore the saved register of the new thread. For example, the kernel will save %eip
(point to next instruction) so that when it switches back, it will know which instruction to continue execute.
Context switching from one user process to another user process involves:
- Transition from user space to kernel space by a system call (by calling the
int
instruction), by interrupt. - From the A kernel thread, it will context switch to the CPU's scheduler thread. From here, the scheduler will make decision about which process to (in this case it will be B)
- a context switch to the B's kernel thread
- return to process B's user space
Now, let's dive into the code for context switch.
The code
States of OS
Before I show you the code of those context switch step above, let's look at some important data structure in the kernel.
// FILE: proc.c
struct {
struct spinlock lock;
struct proc proc[NPROC];
} ptable;
ptable
is the process list shared by all processors. Because of this, if we want to change the state of process we will need acquire a lock. We can see that the limit of our processes is NPROC
which is 64. Here is the data structure of a process:
// FILE: proc.h
struct context {
uint edi;
uint esi;
uint ebx;
uint ebp;
uint eip;
};
enum procstate { UNUSED, EMBRYO, SLEEPING, RUNNABLE, RUNNING, ZOMBIE };
// Per-process state
struct proc {
uint sz; // Size of process memory (bytes)
pde_t* pgdir; // Page table
char *kstack; // Bottom of kernel stack for this process
enum procstate state; // Process state
int pid; // Process ID
struct proc *parent; // Parent process
struct trapframe *tf; // Trap frame for current syscall
struct context *context; // swtch() here to run process
void *chan; // If non-zero, sleeping on chan
int killed; // If non-zero, have been killed
struct file *ofile[NOFILE]; // Open files
struct inode *cwd; // Current directory
char name[16]; // Process name (debugging)
};
pgdir
is our page table which I will make another blog post.kstack
is the process's kernel stack. In order for kernel to execute when process transition from user space to kernel space is that each process shares a part of their address space with the kernel. What this means is that the kernelcode
anddata
section exists in every process and is pointed to the same physical location in the memory. With this setup, when the process transition from user space to kernel space, kernel code can be executed.state
presents the state of our process.RUNNABLE
is when our process can be run and it is waiting for the CPU to run.RUNNING
is when it is being executed.trapframe
as discussed in the previous blog post will be used when the hardware jump to the trap handler.context
is the state of the process that needs to be saved during a context switch. You can see the%eip
register in here, which proves that during a context switch, the OS will execute our code at a difference location.
The Timer-interrupt
Timer interrupt is a kind of trap that the hardware will trigger every X ms. The OS will tell the hardware to find the timer interrupt handler when it is booted. Let's look at the previous C trap handler
// FILE: trap.c
void trap(struct trapframe *tf)
{
// ...
switch (tf->trapno) {
case T_IRQ0 + IRQ_TIMER:
if (cpuid() == 0) {
acquire(&tickslock);
ticks++;
wakeup(&ticks);
release(&tickslock);
}
lapiceoi();
break;
// ...
}
// ...
}
At the first switch case, we can find the trap number for timer interrupts, it will update the ticks count and wakeup any process that is waiting for the next ticks . It will only update the ticks count for the first CPU because each CPU has its own timer and interrupt and these timers may not be synchronized. lapiceoi()
will acknowledge the local interrupt so that it can ready for next interrupts. The important code is further down in this function.
I may cover
sleep
andwakeup
in future blog post
// FILE: trap.c
void trap(struct trapframe *tf)
{
// ...
if (myproc() && myproc()->state == RUNNING &&
tf->trapno == T_IRQ0 + IRQ_TIMER) {
yield();
}
// ...
}
This code will check if our process is running and if the trap number is our timer interrupt. If it is, we will call yield
to give up the CPU. Let's look at the code in yield
:
// FILE: proc.c
// Give up the CPU for one scheduling round.
void
yield(void)
{
acquire(&ptable.lock); //DOC: yieldlock
myproc()->state = RUNNABLE;
sched();
release(&ptable.lock);
}
What yield
does is that it set the current process state to RUNNABLE
, because the process list is shared by multiple CPUs that is why we need to acquire the ptable.lock
before we can change the state. Next we will go and see what sched
does.
// FILE: proc.c
void
sched(void)
{
int intena;
struct proc *p = myproc();
if(!holding(&ptable.lock))
panic("sched ptable.lock");
if(mycpu()->ncli != 1)
panic("sched locks");
if(p->state == RUNNING)
panic("sched running");
if(readeflags()&FL_IF)
panic("sched interruptible");
intena = mycpu()->intena;
swtch(&p->context, mycpu()->scheduler);
mycpu()->intena = intena;
}
sched
will make sure that we can go into the scheduler safely.
Some conditions, we can already understand is that, we must hold the ptable.lock
, the process state must be RUNNING
. switch
is where the context switch happens. It will save the current process context and switch to the context of the CPU's scheduler.
# Context switch
#
# void swtch(struct context **old, struct context *new);
#
# Save the current registers on the stack, creating
# a struct context, and save its address in *old.
# Switch stacks to new and pop previously-saved registers.
.globl swtch
swtch:
movl 4(%esp), %eax
movl 8(%esp), %edx
# Save old callee-saved registers
pushl %ebp
pushl %ebx
pushl %esi
pushl %edi
# Switch stacks
movl %esp, (%eax)
movl %edx, %esp
# Load new callee-saved registers
popl %edi
popl %esi
popl %ebx
popl %ebp
ret
Since we will interact with registers, this code is written in assembly. Don's worry, I will try to explain as much as possible :D
.
void swtch(struct context **old, struct context *new);
this is how this assembly code looks like in C.globl swtch
makes theswtch
label globally visible so that we can call this code in C.- In x86 convention (32-bit architecture specifically), function call is pushed onto the stack from right to left.
new
will be pushed onto the stack first, thenold
and finally the return address of this function. -movl 4(%esp), %eax
moves the content at address4($esp
which is the first argumentold
into register$eax
-movl 8(%esp), %eax
moves the content at address4($esp
which is the second argumentold
into register$eax
- the next 4 lines pushes registers that we saw in the
struct context
from above. Fields in struct are pushed in reversed order. The reasons%eip
is not saved is that by the convention of x86, function call will push push%eip
onto the stack which is the return address. movl %esp, (%eax)
saves the current stack pointer into location pointed by%eax
which is oldmovl %edx, %esp
, the stack pointer is set to the value in%edx
which isnew
. This line will change the stack to the context ofnew
, means that we switch the running thread- now we at the new context, we restores the saved registers. Note that this is the next context's saved state.
ret
returns from theswtch function
// at the start of swtch
[Higher Addresses]
...
return address <-- Pushed by `call swtch` <- %esp
&p->context <-- `new` (rightmost argument)
mycpu()->scheduler <-- `old` (leftmost argument)
[Lower Addresses]
So inside the call to yield
, we will switch from the current user process into the CPU's scheduler.
The Scheduler
void
scheduler(void)
{
struct proc *p;
struct cpu *c = mycpu();
c->proc = 0;
for(;;){
// Enable interrupts on this processor.
sti();
// Loop over process table looking for process to run.
acquire(&ptable.lock);
for(p = ptable.proc; p < &ptable.proc[NPROC]; p++){
if(p->state != RUNNABLE)
continue;
// Switch to chosen process. It is the process's job
// to release ptable.lock and then reacquire it
// before jumping back to us.
c->proc = p;
switchuvm(p);
p->state = RUNNING;
swtch(&(c->scheduler), p->context);
switchkvm();
// Process is done running for now.
// It should have changed its p->state before coming back.
c->proc = 0;
}
release(&ptable.lock);
}
}
The scheduler is per-CPU scheduler, which means each CPU has it owns scheduler but they all share the same process list. The outer loop is an infinite loop so that it never returns. The inner loop is to find the next process to run. The xv6 scheduler is implemented using round-robin algorithm which is our inner loop. When we found a process that is RUNNABLE
, we will context switch to that process so that it can be run.
At this point, from someone who knows how to code perspective but does not know about the OS, they will think that after the swtch
, the loop will continue to run. But as we know, we are now running at difference position until we call yield
again which will continue to run the line after swtch
.
After the called to swtch
, we are back to the called to the function yield
in trap.c
. We were sent to this point because of the timer-interrupt (Do you remember this :D).
Here is where each code will be called to help you build your mental model.
swtch
called by scheduler inproc.c
- after executing for X ms, timer-interrupt will force the process into the
trap.c
- in
trap.c
, the process will callyield
inproc.c
to give up the CPU, which will callsched
sched
will callswtch
inswtch.S
to perform the context switch This process is perform repeatedly to switch from user process to the scheduler and back to user process.
Summary
OS uses timer interrupt to control how much a process can run.
At the end of the timer interrupt, the process will call
yield
to give up the CPU.During a context switch, the process will save the current context and switch to the context of the CPU's scheduler before the scheduler decides which process to run.