Like most complex software systems, the Linux kernel, the applications
it is supporting,
and the hardware it runs on can be viewed as a `layered' system.
At the top of
the stack of layers we have a high degree of abstraction, and minimal control
over the detailed operation of the system. At the bottom layer
of the stack we have the real hardware -- disk, IO, and DMA controllers,
and so on.
Each layer is an abstraction, or simplification, of the one below it.
In an ideal world, each layer would invoke the services of the layer
directly below it, and would itself be invoked by the layer directly
above it. If this ideal structure is enforced, the whole system is
loosely-coupled: dependencies between layers are minimized, and it
is possible to modify one layer without having too severe an impact
on other parts of the system.
The Linux kernel is not ideal in this sense. There are, perhaps, two main reasons for this. First, the kernel has grown organically, building on the contributions of a large number of people over many years. Whole chunks of the kernel have been pulled out and replaced, with the new pieces not fitting into exactly the same places. Second, it is important to consider the `horizontal' partitioning of the kernel at the same time as its vertical layering. By `horizontal partitioning' I mean the division into subsystems of code that is notionally at the same level of abstraction. Different subsystems often make use of each other but, unless they are at identical levels of abstraction, the interaction between subsystems inevitably leads to some blurring of the layer distinctions. To make these articles easier to understand, I have imposed my own layer structure, but my layers are conceptual, and are not generally so clearly defined in the real kernel.
In this first article I will describe each of the layers in outline, and then go on to a more detailed description of the highest layers (because these are relatively easy to describe). Each of the following articles will deal with one layer, working from top to bottom. So, in this article we will see application code, while in the final article we will see voltages changing on pins.
In outline, the layers which can be identified are the following.
fs/
in the kernel source.
fs/ directory, along with the
other VFS stuff, but some is in mm/ with the virtual memory
management infrastructure.
/proc
filesystem has no permanent storage at all. VFS does not care how a filesystem
is implemented, so long as it implements the correct API. Most disk file
systems are, however, implemented on top of a block device. A block device
models a data storage device as a set of contiguous data blocks of a
fixed size. The block device does not know or care what goes in the
data blocks -- that is the job of the filesystem handler.
In most cases, real file
systems do not make calls on block device drivers, even high-level calls.
They are free to do so, but it is often easier for the developer
to use the generic block device support provided by the
code in drivers/block. This generic code provides a
lot of the functionality that all block devices will need, in particular
request queue and buffer management.
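The block device model described above can be pictured with a small user-space sketch. This is not kernel code: the names toy_blockdev, toy_read_block and so on are hypothetical, and the block size and block count are assumptions chosen only for illustration. The point is that the device deals in whole, fixed-size blocks and attaches no meaning to their contents, while the code layered on top has to turn byte offsets into block numbers.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 512   /* assumed block size, for illustration only */
#define NUM_BLOCKS 8     /* assumed device capacity, in blocks */

/* Hypothetical in-memory stand-in for a block device: a contiguous
 * run of fixed-size blocks, with no knowledge of what they contain. */
struct toy_blockdev {
    unsigned char data[NUM_BLOCKS][BLOCK_SIZE];
};

/* Copy one whole block out of the device.
 * Returns 0 on success, -1 if the block number is out of range. */
int toy_read_block(const struct toy_blockdev *dev, size_t block, void *out)
{
    if (block >= NUM_BLOCKS)
        return -1;
    memcpy(out, dev->data[block], BLOCK_SIZE);
    return 0;
}

/* Copy one whole block into the device.
 * Returns 0 on success, -1 if the block number is out of range. */
int toy_write_block(struct toy_blockdev *dev, size_t block, const void *in)
{
    if (block >= NUM_BLOCKS)
        return -1;
    memcpy(dev->data[block], in, BLOCK_SIZE);
    return 0;
}

/* A filesystem built on top must map a byte offset to a block number
 * and an offset within that block. */
size_t toy_block_of(size_t byte_offset)
{
    return byte_offset / BLOCK_SIZE;
}

size_t toy_offset_in_block(size_t byte_offset)
{
    return byte_offset % BLOCK_SIZE;
}
```

With these assumed sizes, byte 1300 of the device lies in block 2, at offset 276 within that block. Deciding what those bytes mean is left entirely to the filesystem handler.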
It should be obvious that the use of the layered architecture leads to a high degree of flexibility. It has the disadvantage, however, of making it very difficult to build up a mental picture of the entire system.
Because we can't describe every possible combination of the various
layers available to the kernel, in these articles I have selected some
specific examples for the purposes of illustration. In particular, the
application is written in C, and linked against the GNU standard
C library. The filesystem is ext2, and hosted on a hard disk
attached to an IDE bus.
char buff[5000];
int f = open ("/foo/bar", O_RDONLY);
read (f, buff, sizeof(buff));
close (f);
We will consider the flow of execution resulting from the open()
and read() calls, all the way to the disk controller hardware.
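For readers who want to try this, the fragment above might be fleshed out along the following lines. This is only a sketch: the helper name read_start_of_file is hypothetical, and the error handling is the minimum needed to make the calls safe to experiment with. It makes exactly the same open(), read() and close() calls that are traced in these articles.

```c
#include <fcntl.h>      /* open(), O_RDONLY */
#include <sys/types.h>  /* ssize_t */
#include <unistd.h>     /* read(), close() */

/* Hypothetical helper: read up to len bytes from the start of the
 * file at path into buf. Returns the number of bytes read (which may
 * be fewer than len), or -1 if the file could not be opened or read. */
ssize_t read_start_of_file(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, len);

    close(fd);
    return n;
}
```

A caller would declare char buff[5000] as before and call read_start_of_file ("/foo/bar", buff, sizeof(buff)). Note that read() is allowed to return fewer bytes than were requested, so the return value should always be checked.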
In all modern operating systems, application code is fundamentally different
from kernel code. Application code is much more limited in what it can
do, and subject to more rigorous controls. However, ultimately
a thread of execution has to be able to pass from application code,
through the kernel, and back to the application. The applications and the
kernel are not separate processes, or even separate threads. This implies
that we need a way to change the mode of execution from `application
mode' to `kernel mode' and back again in a single thread.
In all architectures there will
be some instruction or set of instructions that have this effect.
To avoid platform-dependencies, as well as to simplify coding, these
instructions are typically encapsulated within standard libraries
linked into the application.
The open() and read() functions,
as well as all the other
standard C/C++ file handling functions, are usually implemented in the GNU
standard C library (`glibc'). The executable code for this library is
typically found in the shared object file
/lib/libc-XXX.so, where `XXX' is a version number. When
you compile a C program with gcc,
it is automatically linked against glibc
-- no developer intervention is required.
The library functions in glibc are made
available to the application by the magic of dynamic linking, so
calls are not direct, but that need not concern us here. The way
that glibc handles the open() operation, for example, will be
architecture-dependent, but in most (all?) cases it will issue
a system call, that is, a trap into the kernel.
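To see that the library function is only a thin wrapper around a trap into the kernel, it is possible to issue the system call by number from C, using the generic syscall() function that glibc provides on Linux. The sketch below rests on some assumptions: the helper name raw_open_readonly is hypothetical, and it uses the openat call rather than open, because not every architecture supported by Linux has a separate open system call. The trap instruction itself is still hidden inside syscall().

```c
#define _GNU_SOURCE       /* ask glibc to declare syscall() */
#include <fcntl.h>        /* O_RDONLY, AT_FDCWD */
#include <sys/syscall.h>  /* SYS_openat */
#include <unistd.h>       /* syscall() */

/* Hypothetical helper: open path read-only without going through the
 * glibc open() wrapper. Returns a file descriptor, or -1 on error
 * (with errno set), just as open() would. */
long raw_open_readonly(const char *path)
{
    /* AT_FDCWD makes openat resolve relative paths against the
     * current directory, which is what open() does. */
    return syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
}
```

The descriptor returned can be passed to read() and close() in the usual way, because as far as the kernel is concerned nothing different has happened.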
On x86 Linux, the open() function in glibc
loads the number of the open()
system call (number `5' in
this case) into the
eax register, then executes the instruction
int 0x80
This software interrupt enters the Linux kernel, through the x86 interrupt vector table, at code defined in the (assembly language) file
arch/i386/kernel/entry.S. After some jiggling
around of the stack, the system call number is used as an
offset into the system call table, which is
also defined in
entry.S under the name sys_call_table.
Call number 5, the `open' call,
is defined to point to the address of a function
that will handle the open operation. Unless some
other piece of kernel-level code has changed it, this function
will be sys_open, which is defined in
fs/open.c. Although the mechanism of trapping into the
kernel varies between architectures, in most (all?) cases the
glibc function open() ends up as a call
to the kernel function sys_open(), unless the system
call table has been changed.
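The dispatch step can be modelled in ordinary C as an array of function pointers indexed by call number. The following is a toy, user-space sketch and not the kernel's code: every name beginning with toy_ is hypothetical, the table size is an assumption, and the handlers merely return recognisable values. Only the call number 5 is taken from the text.

```c
#include <stddef.h>

#define TOY_NR_OPEN  5   /* the `open' call number quoted in the text */
#define TOY_NR_CALLS 8   /* assumed table size, for illustration only */

typedef long (*toy_syscall_fn)(long arg);

/* Hypothetical handlers standing in for kernel functions. */
static long toy_sys_open(long arg)
{
    return 100 + arg;
}

static long toy_sys_ni(long arg)   /* "not implemented" */
{
    (void)arg;
    return -1;
}

/* The table: the call number is the index, the entry is the handler.
 * Entries not listed here are NULL. */
static toy_syscall_fn toy_call_table[TOY_NR_CALLS] = {
    [TOY_NR_OPEN] = toy_sys_open,
};

/* Look up the handler for call number nr and invoke it. */
long toy_dispatch(long nr, long arg)
{
    if (nr < 0 || nr >= TOY_NR_CALLS || toy_call_table[nr] == NULL)
        return toy_sys_ni(arg);
    return toy_call_table[nr](arg);
}
```

In this model, toy_dispatch (TOY_NR_OPEN, 1) reaches toy_sys_open() only for as long as slot 5 of the table still points to it. Overwriting that slot redirects every later `open' call, which is the sense in which the behaviour depends on the table not having been changed.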
sys_call_table
in the list of symbols exported by the kernel. After some initial
uncertainty, it is now generally agreed that changing the behaviour
of the kernel by modifying the contents of sys_call_table
is a Bad Thing, and there are less intrusive ways to achieve the same effect.
All the foregoing architecture-dependent magic is conceptually
not very significant. It is not far wrong to think of the application's
open() call logically being implemented by the
sys_open()
function in the kernel, with a corresponding change of execution mode
from `application mode' to `kernel mode'.
You should also notice that no magic happens with threads when system
calls are made. If applications make multiple, concurrent system calls,
then multiple, concurrent threads of execution
will enter the kernel. However, although there may be many distinct files
on a particular physical disk, there will only be one disk
controller per physical disk. As a result, at some point the kernel
will have to implement some locking, to prevent multiple threads
attempting to interact with the same hardware at the same time.
Typically locking only
occurs in the lowest levels of the kernel, so we don't need to worry
too much about it just yet.
©1994-2006 Kevin Boone, all rights reserved