InnocentZero's Treasure Chest


19 Nov 2024

OS Notes

OS

The OS manages the runtime of the system and provides virtualization of the underlying hardware.

Also provides syscalls that are available to the userspace programs. Hence it is said to provide the standard library. It also makes sure that multitasking is made possible and addresses are virtualized.

The memory model is very simple: it is an array of bytes. The OS virtualizes memory, which means the address referred to by an application is not the same as the physical address where the data actually resides. Each process has its own address space.

The OS also provides concurrency and synchronization primitives.

Some History

Early operating systems were just libraries that didn't do much beyond providing an abstraction/wrapper over low-level I/O routines. Later, the idea of system calls with hardware privilege levels was introduced, adding a notion of protection. Multiprogramming became commonplace soon after, and multiple concurrent tasks began to be run.

Virtualization of the CPU

The OS basically runs multiple programs, each for a short amount of time, and then switches to the next one.

It accomplishes this using time-sharing. A context switch allows the OS to stop one program, save its execution state, and start another. We also have policy-level control over how the programs are swapped and which one runs next.

An Abstraction: A process

A running program is called a process. A process summarizes everything about the program at a given point of execution: its memory, its registers (all of them, in fact), and any open I/O resources. Earlier OSes loaded a process's data eagerly, but now it is loaded lazily, which involves paging and swapping.

A process has three main states (four, if you also count a terminated/zombie state):

  • Running: currently executing on the CPU.
  • Blocked: waiting for I/O on a file descriptor or some other event.
  • Ready: ready to execute but waiting for its turn.

Data structures involved

The process table, in the case of xv6, contains all of this:

// the registers xv6 will save and restore
// to stop and subsequently restart a process
struct context {
    int eip;
    int esp;
    int ebx;
    int ecx;
    int edx;
    int esi;
    int edi;
    int ebp;
};

// the different states a process can be in
enum proc_state { UNUSED, EMBRYO, SLEEPING,
                  RUNNABLE, RUNNING, ZOMBIE };

// the information xv6 tracks about each process
// including its register context and state
struct proc {
    char *mem;                  // Start of process memory
    uint sz;                    // Size of process memory
    char *kstack;               // Bottom of kernel stack for this process
    enum proc_state state;      // Process state
    int pid;                    // Process ID
    struct proc *parent;        // Parent process
    void *chan;                 // If !zero, sleeping on chan
    int killed;                 // If !zero, has been killed
    struct file *ofile[NOFILE]; // Open files
    struct inode *cwd;          // Current directory
    struct context context;     // Switch here to run process
    struct trapframe *tf;       // Trap frame for the current interrupt
};

Such an entry is also sometimes called a Process Control Block (PCB).

The fork() system call creates a new process in Linux. It returns twice! Once in the parent process (where it returns the pid of the child) and once in the child process (where it returns 0).

The wait() system call makes the parent wait until a child process has finished execution, which is useful for synchronization. If called from a process that has no unwaited-for children, it returns an error immediately. There's also waitpid(), which takes the pid of a specific child and waits for that one. With two or more children, wait() returns in an unspecified order. Both take a pointer to an int where the exit status of the child is stored; if it is NULL, the status is discarded.

The exec() system call is useful when you want to replace the current process with another program without creating a new process. It replaces the code and data segments of the current process, re-initializes the registers, and resets the stack, heap, and other state. Any extra arguments are passed as arguments to the new binary. A successful call to exec() never returns.
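
To make the fork()/exec()/wait() flow concrete, here is a minimal sketch (error handling kept to a bare minimum; the choice of running ls is purely illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();                  // returns twice: 0 in the child, child's pid in the parent
    if (pid < 0) {
        perror("fork");
        exit(1);
    } else if (pid == 0) {               // child: replace ourselves with another program
        char *args[] = {"ls", "-l", NULL};
        execvp(args[0], args);           // a successful exec never returns
        perror("execvp");                // reached only if exec failed
        exit(1);
    } else {                             // parent: wait for the child to finish
        int status;
        waitpid(pid, &status, 0);
        printf("child %d exited with status %d\n", pid, WEXITSTATUS(status));
    }
    return 0;
}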

Limited Direct Execution

The idea is to run the program directly on the CPU (no emulation or interpretation), which keeps things fast, while the OS retains control by limiting what the program can do. This is one place where the separation between user mode and kernel mode comes into the picture.

The kernel sets up a trap table at boot time, since boot runs in privileged mode, and the hardware remembers the address of this table. System calls are identified by numbers that specify the type of service requested.

Switching between processes

One approach is to assume that the running programs are cooperative. Any time there's a system call, the OS gets control back and can switch between processes. These systems also had an explicit yield() system call that let a program hand control back to the OS. Obviously, an ill-formed program that never yields can stall the whole system.

The other approach is the saner one, and that is to get the OS to take control from the application on its own. This is accomplished using timer interrupts (periodic interrupts as the name suggests). These run a preconfigured handler that is used for context switching.

xv6 is an example. On a timer interrupt, the hardware saves the user registers and other state of the running process onto that process's kernel stack; the kernel's swtch() routine then saves the kernel registers into the old process's context and restores those of the next one. The saving of user state is done by the hardware; the saving of kernel registers is done by the software.

Scheduling

OS has a bunch of scheduling policies that dictate how the processes should be scheduled. The processes running on the OS are called a workload. Each process is also called a job.

We also need a scheduling metric, in this case a turnaround time. More formally, \(T_{turnaround} = T_{completion} - T_{arrival}\)

Turnaround time is a performance metric; another metric, of a different kind, is fairness.

First In, First Out

The most basic algorithm is First In, First Out, or First Come, First Serve. It doesn't work well: a single long job that arrived first makes every job behind it wait (the convoy effect).

Shortest Job First

This is again a very dumb way to schedule things. It takes the shortest jobs and executes them first, but in the process it starves longer jobs. Also, if a long-running process arrived first, it will still complete before the shorter jobs that arrive later.

Shortest Time to Completion First

This is a preemptive fix to SJF: if a shorter job shows up, the scheduler context-switches to it. Once again, longer processes can starve.

Response time as a metric

We define response time as \(T_{response} = T_{firstrun} - T_{arrival}\). This indicates how long a process waits before it is scheduled for the first time.

Round Robin

This is more optimal as it gives a certain amount of time to each process without making it go to completion or wait for a long time.

For round robin, the length of the time slice is critical. It should be small for having better response times but not small enough that frequent context switching becomes an overhead.

Incorporating IO

If there are two resources to be utilized (say CPU and disk), then while A is waiting on its I/O, we schedule B into the gaps, so both resources stay busy.

res_sched.png

Multi Level Feedback Queue

It tries to optimize two things,

  • turnaround time
  • response time

It has multiple distinct queues each assigned a different priority level. At any point:

  • If Priority(A) > Priority(B), A runs and B doesn't
  • If Priority(A) = Priority(B), A and B run in round robin.

MLFQ varies the priority of the processes based on the observed behaviour of the process.

How priority adjustment is done:

  • When a job enters a system, it is placed at the highest priority (the topmost queue).
  • If a job uses up its allotted time, then its priority reduces.
  • If a job gives up the CPU before allotment is up, for example by using IO, then it stays at the same priority level.

However, this might lead to starvation of processes that are not interactive and have long CPU bursts. It can also be gamed: a process could hog the CPU by running for 99% of its allotment and then issuing a quick file I/O to stay at high priority. Finally, a long-running batch program might turn into an interactive one. To counter this, we introduce a new rule.

  • After some time period \(S\), move all jobs to the topmost queue.
  • Tuning the parameters

    Solaris allows a table of parameters to be set for the various properties of the MLFQ. BSD, on the other hand, computes priorities from a decaying average of each process's recent CPU usage, which gives the required periodic adjustment.

Proportional Share

This is also called fair-share scheduling. An early example is lottery scheduling.

The main idea is to use tickets. A process A that should get 75% of the CPU should hold 75% of the tickets. We then repeatedly draw a random winning ticket and run the process holding it; in the long run, A converges to 75% of the CPU time.

The implementation is very simple. Keep a list of processes along with their allotted ticket counts. Generate a random number between 0 and the total number of tickets, then iterate through the list and stop at the process where the running sum of tickets exceeds the number you generated.
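
A minimal sketch of that winner-selection loop (the job struct and its fields are made up for illustration):

#include <stdlib.h>

// Hypothetical job descriptor: each job just carries its ticket count.
struct job {
    int tickets;
    struct job *next;
};

// Pick the next job: draw a winning ticket in [0, total_tickets) and walk
// the list until the running sum of tickets passes it.
struct job *lottery_pick(struct job *head, int total_tickets) {
    int winner = rand() % total_tickets;
    int counter = 0;
    for (struct job *j = head; j != NULL; j = j->next) {
        counter += j->tickets;
        if (counter > winner)
            return j;
    }
    return NULL; // unreachable if total_tickets matches the list
}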

  • Stride Scheduling

    Over short time scales, randomness does not deliver the exact proportions. For this we have stride scheduling, a deterministic fair-share scheduler.

    Each job in the system has a stride, inversely proportional to the number of tickets it holds. We keep a pass value for each process; each time, we pick the job with the lowest pass value, run it, and add its stride to its pass.
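
A minimal sketch of the stride scheduler just described (the struct, field names, and the fixed-point constant are assumptions):

// stride = STRIDE_LARGE / tickets; always run the job with the lowest pass,
// then charge it by advancing its pass by its stride.
#define STRIDE_LARGE 10000  // arbitrary fixed-point constant

struct sjob {
    int stride;
    int pass;
    struct sjob *next;
};

struct sjob *stride_pick(struct sjob *head) {
    struct sjob *best = head;
    for (struct sjob *j = head; j != NULL; j = j->next)
        if (j->pass < best->pass)
            best = j;
    best->pass += best->stride;  // account for the slice it is about to get
    return best;
}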

Completely Fair Scheduler (Linux)

Each process has a counter called virtual runtime (vruntime). As a process runs, it accumulates vruntime. When a scheduling decision occurs, CFS picks the process with the lowest vruntime.

There are various parameters controlling how things are scheduled. The first is sched_latency, usually 48 ms; dividing it by the number of runnable processes gives each process's time slice. With too many processes running, the slices become tiny and there would be far too many context switches.

For this, there is another parameter called min_granularity, usually set to 6 ms: the time slice never goes below it. CFS is driven by a timer interrupt; even if a time slice is not a perfect multiple of the timer period, that's not a problem, since CFS tracks vruntime precisely and approximates ideal CPU sharing in the long run.

CFS also enables control over process priority by using a property of the process called niceness. Each niceness has a weight associated with it that allows it to compute the effective time slice that the process should use.

The time slice is the process's weight divided by the sum of the weights of all runnable processes, times sched_latency. Similarly, the accumulated vruntime is scaled inversely by the process's weight:

\[ time\_slice_k = \frac{weight_k}{\sum_{i=0}^{n - 1}weight_i}\cdot sched\_latency \]

\[ vruntime_i = vruntime_i + \frac{weight_0}{weight_i}\cdot runtime_i \]

The weights are constructed so that two processes are scheduled in exactly the same proportions whenever the difference between their nice values is the same, regardless of the absolute values.
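
A small sketch of the two formulas above. The weight function here is only an approximation: the kernel actually uses a fixed table in which nice 0 maps to 1024 and each nice step scales the weight by roughly 1.25.

#include <math.h>

// Approximate CFS weight for a given nice value.
static double weight_of(int nice) {
    return 1024.0 / pow(1.25, nice);
}

// time_slice_k = weight_k / sum(weights) * sched_latency
double time_slice(int nice_k, const int *nices, int n, double sched_latency) {
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += weight_of(nices[i]);
    return weight_of(nice_k) / sum * sched_latency;
}

// vruntime_i += (weight_0 / weight_i) * runtime_i, where weight_0 is the nice-0 weight
double vruntime_update(double vruntime, int nice_i, double runtime) {
    return vruntime + (1024.0 / weight_of(nice_i)) * runtime;
}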

CFS keeps runnable processes in a red-black tree so it can fetch the next one quickly. This allows it to find, insert, and delete a process in \(\mathcal{O}(\log n)\), noticeably better than a linear scan.

One problem I know of concerns sleeping and I/O-bound processes. To prevent a process that has slept for a long time from monopolizing the CPU when it wakes (it would otherwise have a very low vruntime), its vruntime is set to the minimum vruntime in the RB-tree. The downside is that processes that sleep frequently for short periods can end up starved of their fair share.

Address Spaces and memory management

The address space is an easy-to-use abstraction of physical memory. It contains the memory state of the running program; a typical program has a stack, a heap, and a code section. The abstraction also helps in virtualizing memory, i.e., processes might not actually be at the addresses they think they are, which provides protection. Each process therefore operates on virtual addresses, not physical ones.

Major goals of virtual memory:

  • efficiency: translation should be fast and efficient
  • transparency: the program shouldn't know or care it's running in virtual mode
  • protection: no read-write of pages between processes without explicit permission from the process.

Types of memory

  • We have stack memory, that, well, acts like a stack. Used for function calls and shit.
  • Heap memory, manipulated by malloc and free. They manipulate using brk, sbrk, and mmap.
  • Code section, keeping the instructions of the binary.
  • Global vars, constants, and other lookup shit.

Address Translation

We need hardware support for this. The MMU converts every virtual address to a physical address and that is done via page tables set up by the operating system.

We take a granular approach to discuss this:

  • Simple: First Model

    Assumptions:

    • Contiguous physical memory.
    • Address space less than the size of physical memory.
    • Address spaces are the same size.

    Here we can translate the process's addresses very easily, and also check bounds.

    \[ phy = virt + base \]

    \[ max\_addr = base + bound \]

    \[ min\_addr = base \]

    The MMU also checks the bounds of the process: it keeps the maximum offset the process is allowed to reach.

    The OS has to do a couple of things:

    • Maintain a free list of physical memory regions that are free for placing a new address space
    • Clean upon a process being killed/terminated
    • Save and restore the MMU data in the PCB along with the registers and other stuff
    • Provide exception handlers to the hardware
  • Segmentation: Second Model

    We generalize the base-bound idea to be per logical-segment instead of per process. Here a logical segment would just be like a section of the binary we know, stack, heap, code, bss etc.

    The translation scheme looks pretty much the same, but this time it's per segment instead of per process (the general bounds on the process still exist, of course). A small translation sketch follows this list.

    To know which address we are referring to, we can keep tag bits at the top of the virtual address that are read to determine the segment. This, however, limits the size of the addresses you can use. Another approach is to do it implicitly by noticing how the address was formed, for example, an address contained in the instruction pointer would mostly point to the code section.

    Apart from base and bound, now it also needs to know which way a segment grows, so that bounds are actually set properly. This is done with another bit in MMU that tells the hardware that growth is positive or negative.

    In addition, we can now share certain segments with other processes (code section of a library for example). This way, whatever is used by multiple processes only has one instance in memory. To support this, we need protection bits for the segment, so that segments that are shared are read-only. This way, we can share code segments and only have one physical instance in memory while also having multiple virtual instances, one per process that's running it.

    For some cases it might be necessary to extend the segment beyond what it already has. In particular, this is true for heap. To do that, you need to have a syscall that extends or shrinks it. Linux has brk and sbrk.

    Fragmentation is also a problem. If we do not have large contiguous chunks of memory free, we might not be able to find a chunk for a new segment. One solution is to compact memory: rearrange the existing chunks so that a large contiguous region becomes free. That means copying segments to new locations, freeing the old ones, and updating the MMU with the new bases. Needless to say, this is very expensive and consumes a lot of CPU. It also makes brk and sbrk harder to support, since chunks may need rearranging every time.

    There are multiple algorithms to deal with the problem and all of them work well to a certain degree but none of them can evade the problem entirely.
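
Here is a small sketch of the per-segment base-and-bounds translation described above, assuming a 14-bit virtual address whose top two bits select the segment (the struct and field names are made up):

#include <stdbool.h>
#include <stdint.h>

// Hypothetical per-segment MMU state.
struct segment {
    uint32_t base;    // physical base of the segment
    uint32_t bound;   // size of the segment
    bool grows_down;  // stack-like segments grow toward lower addresses
};

// Translate a 14-bit virtual address (2-bit segment selector + 12-bit offset).
// Returns false on a segmentation violation.
bool translate(const struct segment *segs, uint32_t vaddr, uint32_t *paddr) {
    uint32_t seg = (vaddr >> 12) & 0x3;  // top two bits pick the segment
    uint32_t offset = vaddr & 0xFFF;     // remaining bits are the offset
    const struct segment *s = &segs[seg];

    if (!s->grows_down) {
        if (offset >= s->bound) return false;   // out of bounds
        *paddr = s->base + offset;
    } else {
        uint32_t neg = 0x1000 - offset;         // distance below the segment top
        if (neg > s->bound) return false;       // out of bounds
        *paddr = s->base - neg;
    }
    return true;
}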

Free Space management

If it's fixed size chunks then managing free memory is really easy. It's hard when you have variable sized units (happens with heap A LOT). The free space gets fragmented into multiple small chunks.

We study malloc and free and how they operate. We also work on the premise of no compaction, and that is because once memory is handed out to a program it cannot be reclaimed before a corresponding free is called.

A free list is a set of elements that describe the free space still remaining in the heap. Assume we have a 30 byte heap with the first and last 10 bytes free. A request for anything greater than 10 bytes will fail and anything equal to 10 bytes can be satisfied by either chunk.

If something smaller than 10 bytes is requested, the allocator splits the memory chunk and gives out the required memory to the process. A corollary mechanism is called coalescing which combines contiguous free chunks into one chunk.

Most allocators store a header along with the returned block that contains the information about size and other metadata.

For the actual implementation, we create a list in the memory where we are allocating. We take an example of mmap that gives us a pointer to free space (raw free space).

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

This is the signature of mmap on Linux. If addr is NULL, the kernel chooses a page-aligned address. We assume the other parameters are set reasonably: PROT_READ and PROT_WRITE for protection, and an anonymous, private mapping.

We set the length to whatever we need, plus the size of the header field. The allocator keeps a head pointer somewhere that marks the first free chunk in the address space. That chunk's next pointer marks the next free chunk in memory, or NULL if it is the last one.

Let's say now that one of the chunks that was sandwiched in between two allocated chunks was freed. In such a case, you have to update the head to point to this now free chunk and its next pointer will be where the head originally was. This way, we have a fragmented heap.

After that, the order of freeing chunks matters. If you free them in a random order, the free list becomes a jumbled linked list whose entries point back and forth across the heap, with the head somewhere in the middle.

frag_heap.png

This is a problem since the heap is massively fragmented even though it shouldn't be. So we need to coalesce. Go through the list and merge neighbouring chunks, and the heap will be whole again.
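
A tiny sketch of what a free-list node and a coalescing pass might look like (names are made up; this assumes the list is kept sorted by address so merging is a single linear scan):

#include <stddef.h>
#include <stdint.h>

// The header lives at the start of each free chunk.
struct free_node {
    size_t size;             // size of this free chunk, including the header
    struct free_node *next;  // next free chunk, sorted by address
};

// Merge physically adjacent free chunks.
void coalesce(struct free_node *head) {
    struct free_node *cur = head;
    while (cur && cur->next) {
        uint8_t *end = (uint8_t *)cur + cur->size;
        if (end == (uint8_t *)cur->next) {   // the chunks touch: merge them
            cur->size += cur->next->size;
            cur->next = cur->next->next;
        } else {
            cur = cur->next;
        }
    }
}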

Strategies for managing free space

All these algorithms suffer from the problem of scaling since searching lists can be quite slow.

  • Best fit

    Search through the free list until you get the closest chunk in terms of size. Takes too long to search the list though.

  • Worst fit

    Search the list to find the largest chunk and use it, the idea being that the leftover piece stays big and usable. Usually a bad algorithm in practice.

  • First fit

    Find the first chunk that can satisfy it and return that. Usually leads to a lot of fragmentation in the earlier part of the list.

  • Next fit

    Like first fit, but keep a pointer to where you last searched. Reduces the fragmentation at the beginning of the list and keeps the allocations evenly distributed.

  • Segregated lists

    This already keeps a list of chunks you'd popularly need to service standard requests faster. If the specialized allocator is running low on space, it can request more from the general allocator, which provides it. If the specialized allocator has too many chunks and the general allocator needs to get a large chunk, it can also request some chunks from the specialized allocator.

  • Buddy system

    This is designed to make coalescing easy, since coalescing is a very important part of managing the heap. One example is the binary buddy allocator. The heap is thought of as one big space of size \(2^N\). When memory is requested, the allocator keeps halving the space until the chunk just fits the request. Since every block has a buddy, when a block is freed you check whether its buddy is also free; if it is, you coalesce them and keep going up until everything is merged or the adjacent block is in use.

    Finding the buddy is easy because its address differs from the block's in exactly one bit; which bit is determined by the level in the tree you're at, which in turn is determined by the block's size.
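
A one-liner illustrates the "flip one bit" trick, assuming block offsets are measured from the start of the \(2^N\)-sized heap and sizes are powers of two:

#include <stdint.h>

// The buddy of a block differs from it in exactly the bit given by its size.
static inline uintptr_t buddy_of(uintptr_t offset, uintptr_t size) {
    return offset ^ size;
}
// Example: the 4 KiB block at offset 0x3000 has its buddy at 0x2000; if both
// are free they merge into one 8 KiB block at offset 0x2000.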

Paging

We chop up memory into fixed-size pieces, each called a page. This allows for flexibility: we don't need to make assumptions about how the program will use its address space, and the book-keeping is simpler.

To keep track, we use a per-process data structure called the page table, maintained by the kernel. The generation and translation scheme is the same as discussed in cs2600. Page tables are not stored in the MMU like the base and bounds registers were; they live in memory managed by the kernel.

Page tables can be nested to any depth, a linear (single-level) table being the simplest.

TLBs and their role in OSes

We use a translation lookaside buffer (TLB) that caches recent page mappings so frequently used translations are fast. It is part of the chip's MMU and gives a tremendous speedup. On a lookup, we take the VPN part of the virtual address (drop the offset bits) and compare it against every entry in the TLB (conceptually an XOR, where a zero result means equal). If any entry matches, we have a hit and the translation is already present; otherwise we have a miss and must do an address translation.

On a TLB miss, RISC architectures typically raise a trap and let the OS handle it, while CISC architectures walk the page table entirely on their own using special registers. The trap generated by a TLB miss has to be handled differently from other traps: most traps resume at the instruction after the one that trapped, whereas a TLB-miss trap resumes at the very instruction that caused it. A TLB entry also carries some metadata bits for the page.
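
A sketch of what a fully associative TLB lookup looks like conceptually (entry layout, sizes, and 4 KiB pages are assumptions; real TLBs do the comparisons in parallel in hardware):

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint32_t vpn;    // virtual page number
    uint32_t pfn;    // physical frame number
    uint16_t asid;   // address-space ID, so entries survive context switches
    bool valid;
};

// Compare the VPN (address minus its offset bits) against every valid entry
// with a matching ASID. A hit returns the PFN; a miss means a page-table walk.
bool tlb_lookup(const struct tlb_entry tlb[TLB_ENTRIES],
                uint32_t vaddr, uint16_t asid, uint32_t *pfn) {
    uint32_t vpn = vaddr >> 12;          // assume 4 KiB pages
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;                 // hit
        }
    }
    return false;                        // miss
}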

  • Context switching

    When we change process, the TLB has to be changed. This is achieved by setting the valid bit of all entries to 0, so that the next cycle makes it a point to look into the page table. However, this can be very expensive in practice, and hence, it is problematic to do so. Hardware manufacturers support an additional ASID field that lets you store multiple translations along with which process they belong to.

    Replacement policy can be any of LRU or random, which are the two common ones used.

Smaller tables

Page tables can take up a lot of space in memory. A simple method to make smaller page tables is to have larger page sizes. However, that leads to internal fragmentation as pages will have unused regions of memory inside them.


Another approach is to combine paging and segments. We keep a base-and-bounds pair per segment, as the MMU did before, but now the base points to that segment's page table and the bound gives its size; unused parts of the address space then need no page-table entries at all.

An even simpler solution is multi-level page tables. Instead of one giant array, we have a tree of tables with multiple levels of lookup, as described in cs2600. That allows us to elide entire subtrees for unmapped regions and thus saves a lot of space. It also makes each piece of the page table the same size as a page, simplifying memory management.

Another extreme is the inverted page table. Here we keep a single table with one entry per physical frame, mapping it back to the virtual page (and process) using it, so the table's size is bounded by physical memory rather than by the address space. Since a linear lookup isn't efficient, systems using this often add a hash table to find the mapping.

Finally, we can also relax the condition that pages must necessarily be in the memory and swap inactive pages to the disk in a method called swapping.

Swapping Mechanisms and Policies

Swap space is disk space used for storing pages that don't currently fit in memory. To support it, we first need the present bit in page-table entries, which indicates whether the page is present in memory or not.

Now, when the present bit is set to 0, we know that the page isn't present in the main memory and it leads to the hardware generating a page fault. The page fault handler loads the page from the disk, and during this time the process goes into a blocked state because of the disk I/O happening.

Once the page is loaded from disk, we update the page table entries and continue execution of the process. If memory is full, the OS first picks a page it wants to evict.

The replacements themselves may happen whenever the OS wants. OSes keep a low watermark and a high watermark: when fewer than LW pages are free, a background process starts evicting pages until at least HW pages are free, and then goes back to sleep.

The way in which we swap out the pages is called page replacement policy. The most optimal policy is basically impossible to implement. It evicts the page that will be used furthest in time.

Another example of a simple policy for page replacement is FIFO. It simply means that the page that came in the earliest should be the one that goes out first. Another very simple policy is replacing a page at random. The performances of these two is not very good for very obvious reasons.

The best method would take the history of the system into account. This is the principle behind LRU, or Least Recently Used: track when each page was last used and evict the page that has gone unused the longest.

There are a couple of methods to implement LRU practically. One of them is to have hardware support for having a time field in each of the pages. Later, the OS can just scan all the pages for the one with the least timestamp and evict it. Unfortunately, this leads to a lot of time spent in scanning pages as the number of pages grows.

There is a simple approximation to this, the clock algorithm, which is the basic idea behind what the Linux kernel does. Production variants (such as CLOCK-Pro) add a lot of tweaks to make it more efficient, but the basic idea remains the same as what is described below.

We have a use bit in the page metadata that records whether the page was referenced recently. The pages form a circular list with a clock hand pointing at one entry. When a replacement has to occur, the OS checks whether the use bit of that entry is set. If it is set, the page was recently used and shouldn't be evicted; the bit is cleared to 0 and the clock hand moves forward until it finds a page whose bit is already 0.
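
A minimal sketch of that clock-hand sweep (the page struct and the hand bookkeeping are assumptions):

#include <stdbool.h>

struct page {
    bool use;   // set by hardware whenever the page is referenced
    // ... frame number, dirty bit, etc.
};

// Advance the hand until a page with use == 0 is found, clearing the use bit
// of every page passed so it gets a second chance before eviction.
int clock_evict(struct page *pages, int npages, int *hand) {
    for (;;) {
        struct page *p = &pages[*hand];
        int idx = *hand;
        *hand = (*hand + 1) % npages;
        if (!p->use)
            return idx;     // evict this page
        p->use = false;     // recently used: spare it this round
    }
}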

Apart from all of this, OSes also use demand paging: a page is loaded into memory only when it's needed. Writes are also often clustered, since batching them gives better performance with lower write overhead.

A complete Virtual Memory System

We study two examples: VAX and Linux.

  • VAX
    • Had a 32 bit virtual address space.
    • Pages of 512 bytes, thus the offset was 9 bits and 23 bits for VPN in a virtual address.
    • Upper 2 bits of the 23 bits were used to differentiate between the segments the page resided in.
    • The lower half of the total address space was known as "process space" and was unique to each process.
    • In the first half of this were the user program and heap, which grew downwards. The second half had the stack, which grew upwards.
    • The upper half of the address space is called system space and it is where the important OS code and data resides.
    • No page table was used for the portion between the stack and the heap. They had base and bounds registers.
    • The user page tables for process space resided in the kernel's portion of memory.

      vm_vax.png
    • The VAX paging system had no reference bit, so it was not possible to determine which pages were in use recently.
    • To deal with memory hogs, the developers came up with segmented FIFO.
      • Each process can keep a fixed number of pages in memory, also known as its Resident Set Size.
      • Then the first page in RSS would be evicted.
    • Then, to mitigate FIFO's poor performance, the developers added two global second-chance lists, holding clean and dirty pages respectively after they are evicted from a process's resident set.
      • If the page is still on one of these lists when the process faults on it, we can simply pick it off the list and hand it back. If some other process claimed the frame first, however, then we need to swap the page back in from disk.
    • VMS also had demand zeroing. Basically pages were zeroed out when they are needed, usually on a page fault for a heap page.
    • Apart from that, it had Copy on Write. This copies the page memory only when it is written to. Otherwise, it only keeps a read-only reference to the original location in memory.
  • Linux
    • User space from 0x0 to 0xC0000000.
    • Stack at the highest address.
    • The lowest region is the forbidden, invalid zero page (so null-pointer dereferences fault).
    • Following that is the user code, then user heap that grows to the higher address space.
    • Addresses higher than that are kernel memory, logical and virtual.
    • Logical is obtained from the kernel allocator kmalloc. These cannot be swapped to disk.
      • Logical has the property that the physical memory and the logical memory are in one to one correspondence.
      • Used for stuff like DMA etc. where contiguous mapping is preferred.
    • Virtual memory exists and is allocated using the vmalloc function call.
      • This does not have contiguous memory.
      • The reason this exists is because we need to have a large virtual memory space for the kernel.
      • Not an issue on 64 bit machines.
    • Sets up regular multi-level page tables for the structures. Four level.
    • It allows for multiple page sizes, so the kernel has to support multiple page sizes.
    • Linux also has a page cache, indexed by a hash table, that services frequently used pages. It tracks clean and dirty pages, and background threads periodically write dirty pages back to their backing store (I/O device/file/swap space).
    • For replacement, Linux uses a 2Q-style policy with two lists: active and inactive.
    • When the page in the first (inactive) list is referenced again, it is sent to the active list.
    • The active list is regularly pruned to be 2/3rds of the total cache size, helping with keeping the list fresh.
    • Replacement victims are taken from the inactive list.
    • Security is a focus, so ROP chains and code injections have mitigations ASLR and NX bits respectively.
    • KPTI (kernel page table isolation) removes almost all kernel mappings from the user page tables, leaving only the bare minimum needed to enter and exit the kernel, so user code cannot read them at all.

Concurrency

Threads are basically like separate processes, except that they share the same address space. This is helpful for performance. Similar to a PCB, we now have a Thread Control Block for each thread that keeps track of its state. There is also no overhead of switching page tables on a thread switch, since the address space stays the same.

One wrinkle is the presence of multiple stacks in the address space, which fragments it somewhat. However, this is fine, as stacks are usually small except in programs that make heavy use of recursion.

API:

  • pthread_create creates a thread. It takes a pointer to the thread struct, the thread's attributes, the function to execute, and a pointer to the argument for that function.

    int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*start_routine)(void *), void *arg);
    
  • pthread_join waits for the given thread to finish execution; the calling thread (typically the one that spawned it) then continues.

    The program needs to be coherent. Hence no matter the order of execution, the end view of memory should be the same across all the threads.

    int pthread_join(pthread_t thread, void **value_ptr);
    

    The second argument receives the return value of the thread function.

  • Another thing that is extremely important for parallelism is locks. These allow mutual exclusion access to the critical section. We have pthread_mutex_lock and pthread_mutex_unlock that lock and unlock a mutex respectively. However, it also needs to be initialized properly. This can be done in two ways:
    • Using PTHREAD_MUTEX_INITIALIZER (compile time).
    • Using pthread_mutex_init() function call.
    • Corresponding destruction needs to be done with pthread_mutex_destroy().
  • There are other functions that exist for interacting with locks. These are pthread_mutex_trylock that returns failure if the lock is already held (instead of waiting on it), and pthread_mutex_timedlock returns after the time specified or after the lock is held, whichever happens first.
  • There are other synchronization primitives, such as condition variables; a condition variable is used together with a mutex. The APIs for this are as follows:

    • int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
    • int pthread_cond_signal(pthread_cond_t *cond);

    For either of the above to be called, the associated lock must be held. Initializers for condition variables follow a similar pattern to mutexes.
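
A minimal example tying create/join and a mutex together: two threads bump a shared counter, and the mutex keeps the increments from racing.

#include <pthread.h>
#include <stdio.h>

static int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; // compile-time init

static void *worker(void *arg) {
    int iters = *(int *)arg;
    for (int i = 0; i < iters; i++) {
        pthread_mutex_lock(&lock);
        counter++;                       // critical section
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int iters = 1000000;
    pthread_create(&t1, NULL, worker, &iters);
    pthread_create(&t2, NULL, worker, &iters);
    pthread_join(t1, NULL);              // second argument could receive the return value
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);   // always 2000000 thanks to the mutex
    return 0;
}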

More on Locks

Locks need to have mutual exclusion and fairness. Apart from that, another thing to take care of is performance.

One of the simplest ways to implement locking was to disable interrupts. There was basically no complexity involved, and it worked well on a single-processor system.

A non-exhaustive list of problems:

  • privilege escalation
  • doesn't work with multiple processors
  • disabling interrupts for too long opens a separate can of worms.

Peterson's algorithm (a refinement of Dekker's earlier solution): only assumes atomic loads and stores

int flag[2];
int turn;

void init() {
    // indicate you intend to hold the lock w/ 'flag'
    flag[0] = flag[1] = 0;
    // whose turn is it? (thread 0 or 1)
    turn = 0;
}

void lock() {
    // 'self' is the thread ID of the caller
    flag[self] = 1;
    // make it the other thread's turn
    turn = 1 - self;
    while ((flag[1-self] == 1) && (turn == 1 - self))
        ; // spin-wait while it's not your turn
}

void unlock() {
    // simply undo your intent
    flag[self] = 0;
}

Another method is to use an atomic test-and-set instruction to set a flag and spin on it. This provides mutual exclusion but no guarantee of fairness, since who gets the lock next is up to the scheduler. On a single CPU it can also degrade performance badly: a spinning thread burns its whole time slice waiting for a lock that can only be released after a context switch. With multiple CPUs, spin locks perform reasonably well, as other threads are not starved and the scheduler doesn't have to kick in.

There are other instructions that can be used to make this work. One of them is the atomic compare-and-swap which checks the value first and then changes if it is equal to some specified value.

Another method to build a lock is to use a load-linked and store-conditional instruction pair. The first is just a regular load instruction, but the second one keeps a track of when the first one happened. The second one only succeeds if there was no intervening store to the address.

Suppose two threads both load-link the flag and see that the lock is free. When they each attempt the store-conditional, only one succeeds; the other fails because of the intervening store and goes back to retry.

Another method is to use the fetch-and-add instruction, which atomically increments a value and returns the old value. To build a lock over this, each thread grabs a ticket by fetch-and-adding a ticket counter. A global turn variable says which ticket currently holds the lock; it is incremented at every unlock, handing the lock to the thread holding the next ticket. This ensures fairness (FIFO order).
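
A minimal sketch of that ticket lock, using GCC's __sync_fetch_and_add builtin as the atomic fetch-and-add:

typedef struct {
    volatile int ticket;   // next ticket to hand out
    volatile int turn;     // ticket currently being served
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l) {
    int myturn = __sync_fetch_and_add(&l->ticket, 1); // grab a ticket atomically
    while (l->turn != myturn)
        ;                                             // spin until it's our turn
}

void ticket_unlock(ticket_lock_t *l) {
    __sync_fetch_and_add(&l->turn, 1);                // hand the lock to the next ticket
}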

Reducing Lock spinning overhead

A simple improvement is to yield: any time a thread finds the lock held, it gives the CPU back via a syscall that invokes the scheduler, instead of spinning.

This also has problems:

  1. Context switches, even though less costly than spinning away entire time slices, are still costly.
  2. Starvation still isn't dealt with: one thread may loop in yield forever while others repeatedly enter and exit the critical section.

typedef struct __lock_t {
    int flag;
    int guard;
    queue_t *q;
} lock_t;

void lock_init(lock_t *m) {
    m->flag = m->guard = 0;
    queue_init(m->q);
}

void lock(lock_t *m) {
    while (test_and_set(&m->guard, 1) == 1) ; // spin to acquire the guard
    if (m->flag == 0) { // lock not in use
        m->flag = 1;    // lock acquired
        m->guard = 0;
    } else {
        queue_add(m->q, gettid());
        m->guard = 0;
        park(); // put the thread to sleep
    }
}

void unlock(lock_t *m) {
    while (test_and_set(&m->guard, 1) == 1) ; // spin to acquire the guard
    if (queue_empty(m->q)) {
        m->flag = 0; // no one wants the lock anymore
    } else { // don't clear the flag: pass the lock directly to the next thread
        unpark(queue_remove(m->q));
    }
    m->guard = 0;
}

To improve upon this, it's better to put waiting threads to sleep on a queue instead of making them yield (this is what the code above does). We keep a queue of the threads waiting on the lock, and a guard (basically another spin lock) protecting the lock's flag and the queue while threads push onto and pop off it.

While locking, we acquire the guard, and then check if the main lock was acquired or not. If it wasn't, then we execute the thread and keep the lock acquired. If it was acquired, we will push the current thread to the queue and yield.

While unlocking, if there are no threads waiting on the queue, we unlock it, else we keep the lock active (so that no other thread steals it), and then let the next thread continue its invocation.

There is a race condition in this lock/unlock setup: we release the guard before the thread actually goes to sleep. If a context switch happens right there, another thread may unlock and call unpark() on a thread that hasn't parked yet; when the first thread then calls park(), it sleeps forever, since its wakeup has already been delivered. Solaris addresses this with a setpark() call that tells the kernel a park is imminent, so the subsequent park() returns immediately if an unpark() arrived in between.

Linux uses a futex to avoid further overhead when locking and unlocking:

void mutex_lock(int *mutex) {
    int v;
    if (atomic_bit_test_set(mutex, 31) == 0) // bit was clear. we got mutex
        return;
    atomic_increment(mutex); // bit was set, we increment the futex value and
                             // wait on it
    while (1) {
        if (atomic_bit_test_set(mutex, 31) == 0) { // if futex happens to be unlocked
                                                   // revert the increment operation
                                                   // and take ownership
            atomic_decrement(mutex);
            return;
        }
        // have to wait to make sure the futex value we're monitoring is negative
        // or locked
        v = *mutex;
        if (v >= 0) continue;
        futex_wait(mutex, v); // if in between accessing v and calling this it was changed
                              // we return immediately and continue the loop
    }
}

void mutex_unlock(int *mutex) {
    // Adding 0x80000000 results in 0 if and only if there are no other threads waiting
    // the futex is guaranteed to be locked rn
    if (atomic_add_zero(mutex, 0x80000000)) return;

    // There are other threads waiting, wake them up
    futex_wake(mutex);
}

Over here, futex_wait puts the thread to sleep assuming the first argument is equal to the second. If it is not, then it returns immediately.

Bit 31 of the integer is the lock bit; the lower 31 bits count the waiting threads. If the futex happens to be unlocked when we call lock, we acquire it immediately and return.

Else, we enter a spin lock region. Here, we check for the lock bit again. If it is not set, then the lock is free to be taken. We revert the previous increment and return.

This is optimized for the uncontended, single-thread case: there, the futex system calls never happen, so there is no extra overhead.

Condition Variables in mutex

These are useful when you want to wait for some condition to become true and be woken up automatically. Spinning on shared variables is very wasteful here.

We have pthread_cond_t, which lets you declare a condition variable with wait() and signal() operations. The former puts the current thread to sleep on the condition variable; the latter is called from some other thread to wake up one that is sleeping on it.

The POSIX API is like this:

  • pthread_cond_wait(pthread_cond_t *c, pthread_mutex_t *m) is the wait call for the thread
  • pthread_cond_signal(pthread_cond_t *c) signals the thread to wake up.

We use a mutex in the wait call as follows:

  • We assume the thread had locked the mutex before calling wait
  • the wait call atomically releases the mutex and puts the thread to sleep
  • when it wakes up, it locks the mutex again, and then continues its work.

We need the mutex to prevent race conditions that would cause missed wakeups and make threads sleep forever.

  • Producer Consumer problem

    This happens when you want to have a queue with multiple threads pushing into it and multiple threads popping out of it. Long story short, you need to use two condition variables, one for signalling the producers when a consumer has emptied the buffer and needs more, and the other one when a producer has filled the buffer and needs the consumers to take from it.

    Apart from that, you need to keep a state variable to check whether the buffer is full or empty, and manipulate it accordingly. This check needs to be done in a while loop, not an if, because a woken thread may find the condition no longer true (a signal can wake more than one thread, or the state may change before the woken thread runs); rechecking prevents a data race. A sketch follows below.
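
Here is a small bounded-buffer sketch along those lines, with one condition variable per direction and while-loop rechecks:

#include <pthread.h>

#define MAX 16

static int buffer[MAX];
static int fill = 0, use = 0, count = 0;

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER; // signalled by consumers
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER; // signalled by producers

void put(int value) {                        // called by producers
    pthread_mutex_lock(&m);
    while (count == MAX)                     // while, not if: recheck after waking
        pthread_cond_wait(&not_full, &m);
    buffer[fill] = value;
    fill = (fill + 1) % MAX;
    count++;
    pthread_cond_signal(&not_empty);         // wake a waiting consumer
    pthread_mutex_unlock(&m);
}

int get(void) {                              // called by consumers
    pthread_mutex_lock(&m);
    while (count == 0)
        pthread_cond_wait(&not_empty, &m);
    int value = buffer[use];
    use = (use + 1) % MAX;
    count--;
    pthread_cond_signal(&not_full);          // wake a waiting producer
    pthread_mutex_unlock(&m);
    return value;
}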

Semaphores

These are something in between locks and condition variables. These are integers rather than a binary state like mutex. These allow the threads using it to have more context than just using a mutex.

We have sem_wait() and sem_post(). sem_wait() decrements the value; if the result is negative, the calling thread blocks. sem_post() increments the value and, if there are waiting threads (the value was negative), wakes one of them, much like releasing a lock.

A binary semaphore is basically a lock and a condition variable in one: sem_wait acts as both a lock and a wait, and sem_post acts as both an unlock and a signal. This works out naturally when only two threads are using it.

Semaphores are also used as an ordering primitive. That is because as said before, sem_post can wake a thread up.

Finally, a semaphore is just a lock and a condition variable along with a count. When the count is non-positive, you wait on the condition, and decrement the value when you finally stop waiting. When you release, you increment and signal.
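
That "lock + condition variable + count" recipe can be written out directly. A sketch (this variant keeps the value non-negative instead of letting it go negative to count waiters):

#include <pthread.h>

typedef struct {
    int value;
    pthread_mutex_t lock;
    pthread_cond_t cond;
} sem_sketch_t;

void sem_sketch_init(sem_sketch_t *s, int value) {
    s->value = value;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
}

void sem_sketch_wait(sem_sketch_t *s) {
    pthread_mutex_lock(&s->lock);
    while (s->value <= 0)                 // wait until there is a unit to take
        pthread_cond_wait(&s->cond, &s->lock);
    s->value--;
    pthread_mutex_unlock(&s->lock);
}

void sem_sketch_post(sem_sketch_t *s) {
    pthread_mutex_lock(&s->lock);
    s->value++;
    pthread_cond_signal(&s->cond);        // wake one waiter, if any
    pthread_mutex_unlock(&s->lock);
}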

Concurrency Bugs

There are different kinds:

  • Atomicity Violation (Event needs to be atomic but isn't)
  • Order Violation (Order of events wasn't established properly)
  • Deadlocks

The first can be prevented using locks. The second can be solved using condition variables.

The third can be prevented in many ways. The most common is to have a total ordering on the locking mechanism. However, this is often restrictive and needs great design consideration.

Another method is to make all the locks happen at once. This can be done by wrapping all the locks in a single lock and locking that before anything else. However, this leads to too coarse-grained locking and we also need to know which locks to lock ahead of time.

Yet another method is to not use locks at all and instead use a lock-free compare and swap mechanism for basically everything you want to do. However, this can be done only for simple things in practice.
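For a flavour of the lock-free style, here is a compare-and-swap retry loop for a shared counter using C11 atomics (a toy example, not a general recipe):

#include <stdatomic.h>

// Read the old value, compute the new one, and publish it only if nobody
// changed the counter in between; otherwise retry with the fresh value.
void counter_increment(atomic_int *counter) {
    int old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
        // 'old' now holds the value we lost to; loop and try again
    }
}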

Another mechanism would be to just use the scheduler to avoid deadlocks, scheduling processes so that they never conflict. Basically a ridiculous proposal.

IO and Storage Devices

IO devices are connected via the PCI (peripheral component interconnect) bus, which carries high-speed components like GPUs, or via a peripheral bus (which has slower components like SATA, USB, etc.). These buses hang off the memory bus connecting the CPU and memory.

Device Drivers

We abstract out the details of the underlying device with the help of drivers, to keep the implementation of higher-level stacks like the filesystem generic. How that's done depends on the hardware interface/ABI provided by the device.

  • Storage Interface

    We assume there are \(N\) sectors on the disk, each of size 512 bytes. Only a single-sector write is guaranteed to be atomic; multi-sector writes may end up as torn writes.

  • Protocols for checking status
    • Polling: We continuously check the status register and wait for it to be freed.
    • Interrupts: Hardware raises an interrupt, which redirects execution to a predefined ISR (Interrupt Service Routine) that handles it. This is not always ideal, as context switching is involved, and a huge burst of interrupts (a server receiving tons of network packets) can overwhelm the system.
  • Protocols for transferring data
    • PIO: The CPU explicitly copies the data from one place to another using software instructions.
    • DMA: This is done by specialized devices called DMA engines. They do memory transfer in the background while the CPU continues to work, and raise an interrupt when they are done with the transfer.
  • Communicating with the device

    There are two methods to communicate with the device:

    • Privileged instructions: These are only allowed to be used in OS privileges since they involve the use of privileged device registers.
    • MMIO: These are registers that are mapped by the hardware to certain memory locations. Hence, accessing the memory region is the same as accessing the register.

Filesystem

The most basic block of a filesystem is the file itself. Each file has a low-level name, and in case of linux it is the inode number.

The second abstraction is the directory. It holds a list of (user-friendly name, machine-friendly name) pairs. A directory can also appear in another directory's list, which is what creates the directory tree. The root is the beginning of the tree and its name is /.

On the user side of things, we have APIs like open(), which opens a file for reading and writing and returns a file descriptor (a per-process number referring to that open file), and calls like read(), write(), etc.

The inode holds a count of the number of links referring to it; that's what makes hard links possible. It also stores which disk blocks hold the file's data. The ext-series filesystems classically have 12 direct block pointers plus one each of single-, double-, and triple-indirect pointers.

Direct pointers point straight at data blocks and give fast access. An indirect pointer points to a block that is itself an array of block pointers, allowing many more blocks to be referenced; this is what makes very large files possible.

Apart from this, there might also be an extent associated with a pointer, which tells how many contiguous blocks of data are present. It might also be that the inode just stores one base pointer and one extent.

ext4 famously departs from the block-pointer structure and uses extents: an inode holds up to four extents inline, and larger files use a tree of extents. (Directories are indexed with an HTree.)

All these filesystems also store access-control metadata (permission bits and, optionally, Access Control Lists) that decides whether a user may open the file. This again lives in the inode's metadata, possibly in blocks it points to.

  • Virtual File System

    Any filesystem on Linux has to implement a basic interface dictated by the VFS. The VFS then uses that interface to provide the common userspace APIs such as open(), read(), write(), etc.

    Yeah that's pretty much it. Everything else is just details in Linux.

  • Journalling

    Before doing anything to the file, whatever has to be done is first written to a journal. That ensures the filesystem can restore consistency after a failure by replaying the journal. Note that the logged operations have to be idempotent: repeating an operation must not be a problem, otherwise replay wouldn't work. The journal is of finite size and is used as a circular buffer.

    Apart from this, the filesystem also caches the writes into memory first and writes all of them together to reduce write overheads.

RAID devices

  • RAID-0: Striping across multiple disks. Gives you more read and write bandwidth and boosts disk performance.
  • RAID-1: Mirroring across disks. Provides redundancy, so a disk failure is recoverable.
  • RAID-4: Parity. Store the parity of all the blocks in a stripe on a dedicated parity disk and use it to reconstruct a failed disk when needed.
  • RAID-5: Same as RAID-4, except the parity blocks are rotated across the disks to remove the parity-disk write bottleneck.

Booting on BIOS

Master Boot Record and Device Partitioning

This is done while partitioning the disk. We create a master boot record which contains the boot information and boot code for the partitions. It keeps a set of active (bootable) partitions that have a bootloader and allow for booting the OS (which may reside on a different partition).

Furthermore, these bootloaders can chainload other bootloaders/partition boot code. The MBR takes up the first 512-byte sector of the disk. As a result, it only allows for 4 primary partitions and a total disk size of 2 TB. If you want more than 4 partitions, the last one has to be converted from a primary to an extended partition, which then lets you create logical partitions inside it.

Tags: programming C
