The K-Zone: File handling in the Linux kernel: application layer

A layered architecture

Like most complex software systems, the Linux kernel, the applications it supports, and the hardware it runs on can be viewed as a `layered' system. At the top of the stack of layers we have a high degree of abstraction, and minimal control over the detailed operation of the system. At the bottom layer of the stack we have the real hardware -- disk, IO, and DMA controllers, and so on. Each layer is an abstraction, or simplification, of the one below it. In an ideal world, each layer would invoke the services of the layer directly below it, and would itself be invoked by the layer directly above it. If this ideal structure is enforced, the whole system is loosely-coupled: dependencies between layers are minimized, and it is possible to modify one layer without having too severe an impact on other parts of the system.

The Linux kernel is not ideal in this sense. There are, perhaps, two main reasons for this. First, the kernel has grown organically, building on the contributions of a large number of people over many years. Whole chunks of the kernel have been pulled out and replaced, with the new pieces not fitting into exactly the same places. Secondly, it is important to consider the `horizontal' partitioning of the kernel at the same time as its vertical layering. By `horizontal partitioning' I mean the division into subsystems of code that is notionally at the same level of abstraction. Different subsystems often make use of each other but, unless they are at identical levels of abstraction, the interaction between subsystems inevitably leads to some blurring of the layer distinctions. In order to make these articles easier to understand, I have imposed my own layer structure, but my layers are conceptual, and are not generally so clearly defined in the real kernel.

In this first article I will describe each of the layers in outline, and then go on to a more detailed description of the highest layers (because these are relatively easy to describe). Each of the following articles will deal with one layer, working from top to bottom. So, in this article we will see application code, while in the final article we will see voltages changing on pins.

In outline, the layers which can be identified are the following.

The layered architecture is highly flexible. For example, at any particular layer there can be more than one subsystem, each of which operates in a somewhat different way. In the filesystem layer, for instance, we have handlers for the ext2 filesystem, ISO9660 filesystem, UDF, and so on. These handlers know nothing about device drivers, and can create and manage a filesystem on almost any block device. At the device driver layer we have support for SCSI, IDE, MFM, and other controllers. These drivers can be used with any filesystem.

What's more, sub-stacks of the layered architecture can be stacked on top of other sub-stacks. Consider, for example, the use of USB hard disks. USB is normally a serial interface, but to drive a hard disk from the USB bus we need a protocol for doing block operations through this serial link. We could implement a separate `USB filesystem' and plug it into the VFS layer, but in practice we don't have to. What we could do, for example, is to write a block device driver that responds to requests from any filesystem, and converts them into USB requests. This driver could then be stacked on top of the whole USB protocol stack, which is itself layered. Alternatively -- and what is, in fact, done in the stock kernel -- we could write a driver that converts SCSI disk controller requests into USB requests, and then insert that driver into the stack between the generic SCSI block device driver and the USB stack.
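The separation between filesystem handlers and block device drivers can be sketched in a few lines of C. This is only a user-space model, not kernel code: the names block_dev, ramdisk, ramdisk_read and fs_read_first_byte are all hypothetical, invented for this illustration, and the real kernel interfaces are considerably richer. The point is simply that the `filesystem' function sees nothing but a small table of operations, so any `driver' that fills in that table can sit underneath it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical interface: all that the upper layer knows about a block device. */
struct block_dev {
    size_t block_size;
    /* Copy block number n into out; return 0 on success, -1 on error. */
    int (*read_block)(struct block_dev *dev, size_t n, unsigned char *out);
    void *priv;                       /* driver-specific state */
};

/* A toy "driver": a RAM disk backed by a byte array. */
struct ramdisk {
    unsigned char *data;
    size_t nblocks;
};

static int ramdisk_read(struct block_dev *dev, size_t n, unsigned char *out)
{
    struct ramdisk *rd = dev->priv;
    if (n >= rd->nblocks)
        return -1;                    /* block number out of range */
    memcpy(out, rd->data + n * dev->block_size, dev->block_size);
    return 0;
}

/* A toy "filesystem" operation: it uses only struct block_dev and
   never needs to know which driver is underneath. */
static int fs_read_first_byte(struct block_dev *dev, size_t n, unsigned char *byte)
{
    unsigned char buf[512];
    if (dev->block_size > sizeof buf)
        return -1;                    /* block too large for this sketch */
    if (dev->read_block(dev, n, buf) != 0)
        return -1;
    *byte = buf[0];
    return 0;
}
```

Stacking works the same way in this model: a second driver could implement read_block by forwarding the request, suitably translated, to another block_dev below it.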

It should be obvious that the use of the layered architecture leads to a high degree of flexibility. It has the disadvantage, however, of making it very difficult to build up a mental picture of the entire system.

Because we can't describe every possible combination of the various layers available to the kernel, in these articles I have selected some specific examples for the purposes of illustration. In particular, the application is written in C, and linked against the GNU standard C library. The filesystem is ext2, and hosted on a hard disk attached to an IDE bus.

Application layer

In these articles, we will assume that the application is written in C, and uses the well-known C low-level file-handling functions. To illustrate the concepts to be discussed, we will use the following, very simple application code. It opens a file, reads a few kilobytes of data, and closes it. Here is the relevant code fragment:
char buff[5000];
int f = open ("/foo/bar", O_RDONLY);   /* needs <fcntl.h>; returns -1 on failure */
read (f, buff, sizeof(buff));          /* needs <unistd.h>; reads up to 5000 bytes */
close (f);
We will consider the flow of execution resulting from the open() and read() calls, all the way to the disk controller hardware. In all modern operating systems, application code is fundamentally different from kernel code. Application code is much more limited in what it can do, and subject to more rigorous controls. However, ultimately a thread of execution has to be able to pass from application code, through the kernel, and back to the application. The applications and the kernel are not separate processes, or even separate threads. This implies that we need a way to change the mode of execution from `application mode' to `kernel mode' and back again in a single thread. In all architectures there will be some instruction or set of instructions that have this effect. To avoid platform-dependencies, as well as to simplify coding, these instructions are typically encapsulated within standard libraries linked into the application.
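For readers who want to compile something, the fragment can be wrapped in a function with the error checks that the fragment omits for brevity. This is a minimal sketch; the helper name read_start is hypothetical and does not appear in the original code.

```c
#include <fcntl.h>      /* open(), O_RDONLY */
#include <sys/types.h>  /* ssize_t */
#include <unistd.h>     /* read(), close() */

/* Hypothetical helper: read up to cap bytes from the start of path.
   Returns the number of bytes read, or -1 if open() or read() fails. */
static ssize_t read_start(const char *path, char *buf, size_t cap)
{
    int f = open(path, O_RDONLY);
    if (f < 0)
        return -1;                   /* file could not be opened */
    ssize_t n = read(f, buf, cap);   /* may return fewer bytes than requested */
    close(f);
    return n;
}
```

A caller would declare char buff[5000] as before and call read_start("/foo/bar", buff, sizeof(buff)). Each of the open(), read() and close() calls inside the helper crosses from application mode into the kernel and back.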

The library layer

On Linux systems, the C open() and read() functions, as well as all the other standard C/C++ file handling functions, are usually implemented in the GNU standard C library (`glibc'). The executable code for this library is typically found in the shared object file /lib/libc-XXX.so, where `XXX' is a version number. When you compile a C program with gcc, it is automatically linked against glibc -- no developer intervention is required. The library functions in glibc are made available to the application by the magic of dynamic linking, so calls are not direct, but that need not concern us here. The way that glibc handles the open() operation, for example, will be architecture-dependent, but in most (all?) cases it will issue a system call, that is, a trap into the kernel. On x86 Linux, what the open() function in glibc does is to load the number for the open() system call (number `5' in this case) into the eax register, then execute the instruction
int 0x80
This software interrupt enters the Linux kernel, through the x86 interrupt vector table, at code defined in the (assembly language) file arch/i386/kernel/entry.S. After some jiggling around of the stack, the system call number is used as an offset into the system call table, which is also defined in entry.S under the name sys_call_table. Call number 5, the `open' call, is defined to point to the address of a function that will handle the open operation. Unless some other piece of kernel-level code has changed it, this function will be sys_open, which is defined in fs/open.c. Although the mechanism of trapping into the kernel varies between architectures, in most (all?) cases the glibc function open() ends up as a call to the kernel function sys_open(), unless the system call table has been changed.
      Incidentally, you won't necessarily find sys_call_table in the list of symbols exported by the kernel. After some initial uncertainty, it is now generally agreed that changing the behaviour of the kernel by modifying the contents of sys_call_table is a Bad Thing, and there are less intrusive ways to achieve the same effect.
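The idea of using the system call number as an offset into a table of handlers can be modelled in ordinary C. The sketch below is a user-space illustration only, not the kernel's code: the names handler_fn, model_open, model_exit, model_call_table and model_dispatch are hypothetical, each handler takes a single argument, and the handlers do no real work. Only the slot number 5 for `open' is taken from the text above; the use of slot 1 for a second handler is just an example.

```c
#include <stddef.h>

/* Hypothetical handler type: each model "system call" takes one argument. */
typedef long (*handler_fn)(long arg);

static long model_exit(long arg) { return arg; }        /* placeholder body */
static long model_open(long arg) { return arg + 100; }  /* placeholder body */

/* Table indexed by call number; slots not listed here are NULL. */
static handler_fn model_call_table[] = {
    [1] = model_exit,
    [5] = model_open,
};

#define MODEL_NR_CALLS (sizeof model_call_table / sizeof model_call_table[0])

/* Look up the handler for call number nr and invoke it.
   Returns -1 for an out-of-range or empty slot; the real kernel
   reports an unimplemented call with the ENOSYS error instead. */
static long model_dispatch(unsigned long nr, long arg)
{
    if (nr >= MODEL_NR_CALLS || model_call_table[nr] == NULL)
        return -1;
    return model_call_table[nr](arg);
}
```

In this model, replacing model_call_table[5] with a pointer to a different function would redirect every `open' call, which is the kind of modification the note above describes as a Bad Thing.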

All the foregoing architecture-dependent magic is conceptually not very significant. It is not far wrong to think of the application's open() call as being implemented directly by the sys_open() function in the kernel, with a corresponding change of execution mode from `application mode' to `kernel mode'.
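One way to see how thin the library layer is: glibc also provides a generic syscall() function, which takes a system call number and its arguments and performs the trap without going through the open() wrapper. The sketch below assumes a Linux system with glibc. It uses the openat call number (SYS_openat with AT_FDCWD) rather than the open call number discussed above, because not every architecture provides the older call; the helper name raw_open_readonly is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* O_RDONLY, AT_FDCWD */
#include <sys/syscall.h>  /* SYS_openat */
#include <unistd.h>       /* syscall(), close() */

/* Hypothetical helper: open path read-only by issuing the system call
   through glibc's generic syscall() function instead of open().
   Returns a file descriptor, or -1 on failure (errno is set by syscall()). */
static int raw_open_readonly(const char *path)
{
    return (int) syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
}
```

The descriptor returned can be passed to the ordinary read() and close() functions, since it is the kernel, not the library, that owns the open file.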
      You should also notice that no magic happens with threads when system calls are made. If applications make multiple, concurrent system calls, then multiple, concurrent threads of execution will enter the kernel. However, although there may be many distinct files on a particular physical disk, there will only be one disk controller per physical disk. As a result, at some point the kernel will have to implement some locking, to prevent multiple threads attempting to interact with the same hardware at the same time. Typically locking only occurs in the lowest levels of the kernel, so we don't need to worry too much about it just yet.

Next: the VFS layer
©1994-2006 Kevin Boone, all rights reserved