©1994-2007 Kevin Boone
File handling in the Linux kernel: VFS layer
Last modified: Fri Aug 3 08:31:28 2007
In the previous article in this series, I explained how application
code entered the kernel by means of a system call. In this article we
will examine the VFS (`virtual filesystem') layer, which provides the
most general implementation of these system calls.
The VFS layer
The sys_open function, and the corresponding sys_read,
sys_write, and so on, are entry points into the kernel's
VFS layer.
These functions are largely architecture-independent,
for reasons that will become clear later. The `virtual' part of
`virtual filesystem' reflects the fact that the operations carried out on this
layer are independent of lower-level implementation details. In particular,
VFS abstracts from the application programmer:
The VFS layer deals with the file open operation we are currently discussing, and other primitive operations on files such as reads and writes. It also deals with mounting filesystems and attaching block device drivers to mount points, as we shall see later.
In outline, sys_open() looks like this:
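int sys_open (const char *name, int flags, int mode)
{
    // Find a free slot in the process's file descriptor table
    int fd = get_unused_fd();
    // Do the real open
    struct file *f = filp_open(name, flags, mode);
    // Attach the file structure to the descriptor
    fd_install(fd, f);
    return fd;
}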
The function get_unused_fd() attempts to find an
empty slot in the current process's file descriptor table.
This operation can fail, of course, if the process has too many
file descriptors open. If it succeeds, the function
filp_open() is called to do the real open; if
filp_open() succeeds, the file structure
it returns is assigned to the allocated slot in the file descriptor
table by fd_install().
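The relationship between the descriptor number and the file structure can be illustrated with a small user-space model. This is only a sketch: the toy_ names, the fixed table size, and the NULL-means-free convention are assumptions made for the example, and the kernel's real bookkeeping is more involved.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FDS 8   /* tiny limit, chosen only for the demonstration */

struct toy_file { const char *name; };

struct toy_fd_table {
    struct toy_file *slot[MAX_FDS];   /* NULL means "free" */
};

/* Return the lowest free descriptor, or -1 if the table is full. */
static int toy_get_unused_fd(const struct toy_fd_table *t)
{
    for (int fd = 0; fd < MAX_FDS; fd++)
        if (t->slot[fd] == NULL)
            return fd;
    return -1;
}

/* Bind an open file to a descriptor that is known to be free. */
static void toy_fd_install(struct toy_fd_table *t, int fd, struct toy_file *f)
{
    assert(fd >= 0 && fd < MAX_FDS && t->slot[fd] == NULL);
    t->slot[fd] = f;
}

/* Find a slot and install the file; returns the descriptor or -1. */
static int toy_open(struct toy_fd_table *t, struct toy_file *f)
{
    int fd = toy_get_unused_fd(t);
    if (fd < 0)
        return -1;   /* too many open files */
    toy_fd_install(t, fd, f);
    return fd;
}
```

The descriptor handed back to the caller is nothing more than an index into the table; everything interesting lives in the structure the slot points to.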
In summary, all the real work of opening the file is delegated to filp_open(), which looks, in outline, like this:
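struct file *filp_open
  (const char *name, int flags, int mode)
{
    struct nameidata nd;
    // Resolve the pathname, populating the nameidata structure
    // (the real code passes a value derived from flags)
    open_namei (name, flags, mode, &nd);
    // Open the file that the pathname resolved to
    return dentry_open(nd.dentry, nd.mnt, flags);
}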
The filp_open function has, in essence, two steps.
First, it uses open_namei (in fs/namei.c) to
generate a nameidata structure. This structure provides,
among other things, a link to the file's inode (of which, more later).
The second step calls dentry_open(), passing the
salient information from the nameidata structure. It is
this latter step which will typically do the `open' operation,
by delegating it to a handler for the appropriate filesystem.
In order for either of these steps to make sense, we will have to make a brief digression into the world of inodes and dentries. In particular, we need to distinguish between the purposes of the inode, dentry, and file
structures.
The concept of an `inode' is an ancient one in the Unix world. An
inode is a block of data that stores system-level information about
a single file, such as its access modes, size, and references to its
location on the physical storage device. In Unix, different
filenames may point to the same file on disk (this is what
links are for), but each file has exactly one inode. A directory,
therefore, is nothing more than a mapping between filenames and
inodes. A directory is itself a file, and consequently has an inode
of its own. A number of processes may have the same file open,
subject to locking restrictions, but each process `sees' the same
inode, because each `sees' the same file. An inode is partly
exposed to the application program by a call such as stat().
So, let's return to the nameidata structure, which looks, in part, like this:
struct nameidata
{
    struct dentry *dentry;
    struct vfsmount *mnt;
    //...
};
The dentry contains a reference to the cached inode of the file,
while the vfsmount references the filesystem on which the
file is located.
As we shall see, the mount utility
has the effect of causing vfsmount elements to be registered
with the kernel.
So, the dentry_open() function looks, in outline, like this:
struct file *dentry_open
  (struct dentry *dentry, struct vfsmount *mnt, int flags)
{
    struct file *f = /* get an empty file structure */;
    f->f_flags = flags;
    // ...populate other file structure elements
    // Get the inode for the file
    struct inode *inode = dentry->d_inode;
    // Call open() through the inode
    f->f_op = fops_get(inode->i_fop);
    f->f_op->open(inode, f);
    return f;
}
This function finds the address of a function in the filesystem layer
that can handle the open operation and calls it, passing the inode
and the file structure. Where does this function address
come from? It will be found either in the inode itself, or in one of
the inode's parent inodes. In order to understand where these inodes
come from, we need to see how filesystem handlers are registered, and
individual filesystems mounted. We will return to the sys_open()
method later.
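The dispatch-through-a-pointer idea can be tried out in ordinary user-space C. The following is a minimal sketch rather than kernel code: the toy_ types, the two pretend handlers, and the opened_by field are all invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;
struct toy_file;

/* Table of operations supplied by a filesystem handler. */
struct toy_file_operations {
    int (*open)(struct toy_inode *, struct toy_file *);
};

struct toy_inode {
    const struct toy_file_operations *i_fop;
};

struct toy_file {
    const struct toy_file_operations *f_op;
    int opened_by;   /* records which handler ran (demo only) */
};

/* Two pretend filesystem handlers. */
static int ext2ish_open(struct toy_inode *i, struct toy_file *f)
{
    (void)i;
    f->opened_by = 2;
    return 0;
}

static int fatish_open(struct toy_inode *i, struct toy_file *f)
{
    (void)i;
    f->opened_by = 16;
    return 0;
}

static const struct toy_file_operations ext2ish_fops = { ext2ish_open };
static const struct toy_file_operations fatish_fops = { fatish_open };

/* Generic layer: it knows nothing about the handler behind the inode. */
static int toy_dentry_open(struct toy_inode *inode, struct toy_file *f)
{
    assert(inode != NULL && f != NULL);
    f->f_op = inode->i_fop;
    if (f->f_op == NULL || f->f_op->open == NULL)
        return 0;   /* nothing handler-specific to do */
    return f->f_op->open(inode, f);
}
```

Calling toy_dentry_open() on an inode whose i_fop points at ext2ish_fops runs ext2ish_open(); the generic function itself never names either handler.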
mount() support in the VFS layer
We have seen that application-level file operations like open()
and read() are made, via a thin layer in the C standard
library, on the kernel's VFS layer. The key feature of the VFS layer is
that it is filesystem-independent and device-independent. An open() call will work on any kind of filesystem, on any kind of physical hardware,
in more-or-less the same way. However, underneath the VFS layer will be a filesystem layer, which contains support for the various filesystems known to the
kernel. Underneath the filesystem layer will be the device layer, which interacts
with the hardware. The mounting process provides the bridge between the
VFS layer and the lower-level operations of the filesystem and device.
There is a potential ambiguity in terminology here. The word `filesystem' is sometimes used to mean a particular type of filesystem (e.g., ext3, ufs), and
sometimes to mean a particular mounted instance of that filesystem type
(e.g., the ext3 filesystem mounted on /usr).
In the following, I will use the term `filesystem type' to refer to
a particular type of filesystem, and `mounted filesystem' to refer to
a specific instance. But don't expect the general Linux documentation to be
this consistent.
You can get a list of supported filesystem types by doing
cat /proc/filesystems
A typical Linux system will support at least ext2
and iso9660 (CDROM) filesystem types, and some systems will
be configured to support many more.
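The same list can be read from a program. The sketch below assumes the usual layout of /proc/filesystems, in which each line holds an optional `nodev' marker, a tab, and then the type name; the function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one line; returns 1 on success, 0 if the line is malformed.
   *nodev is set to 1 when the type needs no block device. */
static int parse_fs_line(const char *line, char *name, size_t size, int *nodev)
{
    assert(line != NULL && name != NULL && nodev != NULL && size > 0);
    const char *tab = strchr(line, '\t');
    if (tab == NULL)
        return 0;
    *nodev = (tab - line == 5 && strncmp(line, "nodev", 5) == 0);
    size_t len = strcspn(tab + 1, "\n");
    if (len == 0 || len >= size)
        return 0;
    memcpy(name, tab + 1, len);
    name[len] = '\0';
    return 1;
}

/* Print every type listed in the file; returns the number of types
   printed, or -1 if the file cannot be opened. */
static int list_fs_types(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    char line[256], name[64];
    int nodev, count = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (parse_fs_line(line, name, sizeof name, &nodev)) {
            printf("%-12s %s\n", name, nodev ? "(no block device)" : "");
            count++;
        }
    }
    fclose(fp);
    return count;
}
```

Calling list_fs_types("/proc/filesystems") on a Linux system prints one line per registered type.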
Some of the filesystem types will be implemented in code directly
compiled into the core kernel. The proc filesystem type itself
is likely to be of this type. The proc type supports the
/proc directory which, as you probably know, allows applications
to find out about, and interact with, the kernel and device drivers.
The proc filesystem type is interesting in another respect
-- it demonstrates that in Linux a filesystem need not correspond to
any real, physical hardware. /proc exists entirely in memory,
and is generated dynamically.
Some filesystem types will more likely be supported through loadable modules, particularly if they are not used all that often. For example, if you mount DOS or Windows disks only occasionally, you may have configured the handlers for FAT and VFAT filesystems as loadable modules. Regardless of whether a filesystem handler is compiled into the core kernel, or implemented in a loadable module, the handler must register itself with the kernel. It will usually do this in its initialization code, by making a call on the kernel's register_filesystem(), which is
defined in fs/super.c. This function looks like
this:
int register_filesystem(struct file_system_type *fs)
{
    struct file_system_type ** p =
        find_filesystem(fs->name);
    if (*p)
        return -EBUSY;
    else
        { *p = fs; return 0; }
}
All this simple function does is check whether the specified
filesystem type is already registered and, if not, store
the supplied file_system_type structure in the
kernel's filesystem table.
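The pointer-to-pointer trick used by find_filesystem() is easier to follow in a runnable model. This sketch assumes the table is a singly linked list, which is what the return type suggests; the toy_ names and the next field are assumptions made for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;   /* head of the list */

/* Return the slot that points at the named type, or, if the name is
   unknown, the NULL slot at the end of the list. */
static struct toy_fs_type **toy_find_filesystem(const char *name)
{
    struct toy_fs_type **p;
    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append fs to the list unless a type of that name already exists. */
static int toy_register_filesystem(struct toy_fs_type *fs)
{
    assert(fs != NULL && fs->name != NULL);
    struct toy_fs_type **p = toy_find_filesystem(fs->name);
    if (*p != NULL)
        return -EBUSY;
    fs->next = NULL;
    *p = fs;   /* the slot is the list tail, so this appends */
    return 0;
}

/* Look up a registered type by name; NULL if it is not registered. */
static struct toy_fs_type *toy_get_fs_type(const char *name)
{
    return *toy_find_filesystem(name);
}
```

Because the search returns the address of the slot rather than the element, the same function serves both lookup and insertion: an unknown name yields exactly the place where the new entry should be stored.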
The filesystem handler's initialization
code initializes a struct file_system_type,
which looks like this:
struct file_system_type
{
    const char *name;
    struct super_block *(*read_super)
        (struct super_block *, void *, int);
    //...
};
The structure contains the name of the filesystem type
(e.g., ext3), and the address of a function called
read_super. This function is provided by the filesystem
handler, and will be called by the kernel when a filesystem of this
type is mounted. The read_super function will have the
task of initializing a struct super_block, the contents
of which will very likely be derived from the superblock on
the physical disk. The superblock (of which there may be more than
one on a real disk or partition) contains fundamental information
about the structure of the filesystem, the maximum supported
file size, and the filesystem type. The super_block
struct contains a memory image of this information, and also
pointers to the block device operations needed to operate
on this filesystem. We will discuss superblock operations
in more detail later.
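At the user level, to make a particular instance of a filesystem available,
we typically use the mount utility, which results in a call to sys_mount() in the kernel. In outline, sys_mount() looks like this: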
/*
sys_mount arguments:
dev_name - name of the block special file, e.g., /dev/hda1
dir_name - name of the mount point, e.g., /usr
fstype - name of the filesystem type, e.g., ext3
flags - mount flags, e.g., read-only
data - filesystem-specific data
*/
long sys_mount
  (char *dev_name, char *dir_name, char *fstype,
   int flags, void *data)
{
    // Get a dentry for the mount point directory
    struct nameidata nd_dir;
    path_lookup (dir_name, /*...*/, &nd_dir);
    // Get a dentry from the block special file that
    // represents the disk hardware (e.g., /dev/hda)
    struct nameidata nd_dev;
    path_lookup (dev_name, /*...*/, &nd_dev);
    // Get the block device structure which was allocated
    // when loading the dentry for the block special file.
    // This contains the major and minor device numbers
    struct block_device *bdev = nd_dev.dentry->d_inode->i_bdev;
    // Get these numbers into a packed kdev_t (see later)
    kdev_t dev = to_kdev_t(bdev->bd_dev);
    // Get the file_system_type struct for the given
    // filesystem type name
    struct file_system_type *type = get_fs_type(fstype);
    struct super_block *sb = /* allocate space */;
    // Store the block device information in the sb
    sb->s_dev = dev;
    // ... populate other generic sb fields
    // Ask the filesystem type handler to populate the
    // rest of the superblock structure
    type->read_super(sb, data, flags & MS_VERBOSE);
    // Now populate a vfsmount structure from the superblock
    struct vfsmount *mnt = /* allocate space */;
    mnt->mnt_sb = sb;
    //... Initialize other vfsmount elements from sb
    // Finally, attach the vfsmount structure to the
    // mount point's dentry (in `nd_dir')
    graft_tree (mnt, &nd_dir);
    return 0;
}
I should point out that the code above is a considerable simplification
of the real implementation because, apart from omitting all the
error handling code, it doesn't reflect the fact that filesystem
types are not really generic. For example, some filesystem types do
not have a block device associated with them (the proc
filesystem is of this type). Some filesystem types can serve multiple
mount points, while others can't. And so on. The code above only
shows the operation of a mount that associates a block special file
with a mount point, which is usually the most important case.
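The sequence of steps in a block-device mount (look up the type, let the handler fill in the superblock, then graft a vfsmount onto the mount point) can be modelled in a few lines of user-space C. Everything here is hypothetical: the toy_ structures, the single ext2ish type, the error codes, and the rule that a mount point may be used only once are simplifications chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_super_block {
    int s_dev;        /* packed device number */
    int s_root_ino;   /* filled in by the handler */
};

struct toy_fs_type {
    const char *name;
    int (*read_super)(struct toy_super_block *sb);   /* 0 on success */
};

struct toy_vfsmount { struct toy_super_block *mnt_sb; };

struct toy_dentry {
    const char *d_name;
    struct toy_vfsmount *d_mounted;   /* non-NULL if something is mounted here */
};

/* A pretend handler; a real one would read the on-disk superblock. */
static int ext2ish_read_super(struct toy_super_block *sb)
{
    sb->s_root_ino = 2;
    return 0;
}

static struct toy_fs_type toy_types[] = {
    { "ext2ish", ext2ish_read_super },
};

static struct toy_fs_type *toy_get_fs_type(const char *name)
{
    for (size_t i = 0; i < sizeof toy_types / sizeof toy_types[0]; i++)
        if (strcmp(toy_types[i].name, name) == 0)
            return &toy_types[i];
    return NULL;
}

/* The caller supplies storage for sb and mnt, so no allocation is needed. */
static int toy_mount(int dev, struct toy_dentry *dir, const char *fstype,
                     struct toy_super_block *sb, struct toy_vfsmount *mnt)
{
    assert(dir != NULL && fstype != NULL && sb != NULL && mnt != NULL);
    struct toy_fs_type *type = toy_get_fs_type(fstype);
    if (type == NULL)
        return -1;   /* unknown filesystem type */
    if (dir->d_mounted != NULL)
        return -2;   /* mount point already in use (toy rule) */
    sb->s_dev = dev;
    if (type->read_super(sb) != 0)
        return -3;   /* handler could not read the superblock */
    mnt->mnt_sb = sb;
    dir->d_mounted = mnt;   /* the "graft" step */
    return 0;
}
```

After a successful toy_mount(), anything that reaches the mount point's dentry can follow d_mounted to the superblock that the handler populated.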
The block special file is just a particular kind of file, as far as Linux is concerned, and therefore has a dentry of its own. In the dentry
will be an inode, and that inode will itself have been obtained by a
disk read at some point. So you might be wondering: if we need a block
special file, and that needs a disk read to fetch its inode,
how can the kernel do the very
first disk read at boot time? Is this not a chicken-and-egg situation?
In order to
resolve this infinite regress, the Linux kernel recognizes a particular
kind of disk filesystem called rootfs. This models the
root filesystem, and is initialized at boot time, not from the inode
of a block device (which cannot yet exist), but from physical device
properties passed in from the boot loader. During the boot sequence,
the root filesystem will normally be remounted as an ordinary filesystem,
which is why, if you run the command
% mount
you will see something like:
/dev/hda1 on / type ext2 (rw)
rather than `type rootfs'.
The place where the filesystem is to be mounted is itself a directory,
and therefore has its own dentry.
Opening a file... continued
You may remember that we broke off our discussion of opening a file, in the dentry_open() function, at the point where it
called the open function on the file's inode
structure. We should now be in a position to see where that inode
comes from.
When the application requests that a file be opened, the sys_open() code and the functions it calls will attempt
to find a cached inode for that file. They do this by looking in the
list of dentry structures in memory. If the inode is not in
the dentry cache, then the file open code walks
down the requested pathname, from '/' if necessary,
opening a dentry for each directory, until it reaches the
requested file.
You should keep in mind, without brooding on it too much, that these pathname descent operations are themselves filesystem operations, and themselves have to go through the filesystem handler(s), and perhaps the associated block device(s). This means that, at the very least, the top-level directory of any mounted filesystem must be locatable on disk without needing a disk read. If this were not the case, no other file below the top-level directory would be locatable. The location of the top-level directory is made possible by insisting that it always have the same inode number (2). So, to get the root directory ('/') into memory, the VFS code simply asks the filesystem handler for the root directory to read inode number 2. It can then descend the pathname of the file to be opened by opening and reading each directory component in the path, and getting the inode number that corresponds to the directory name at each level of the tree.
Of course, in doing this, the descent may cross filesystem boundaries. Suppose, for example, the application is opening the file `/home/fred/test'. The directory '/' may be on one physical disk, and the directory '/home' a different filesystem type on a different physical disk. The VFS implementation also has to contend with symbolic links, which may themselves cross filesystems, but we won't go into that technicality here.
The general principle is that, as each inode is loaded and cached, it inherits the elements of its ancestor in the directory tree, unless the directory is a mount point. If it is a mount point, the VFS code finds the vfsmount structure that was stored in the dentry for the mount point when the filesystem was mounted, gets the super_operations structure that was initialized by the filesystem handler, then calls the function to open the inode for the top-level directory.
And so the process continues,
until it arrives at the requested file. The cached inode structure
for this file will contain pointers to functions to carry out file
operations, inherited from the inode for the top-level directory,
which in turn obtained them from the super_block structure,
which itself got them from the filesystem handler.
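The inheritance rule described above, in which operations are passed from parent to child except at a mount point, can be modelled as a short pathname walk. This is a user-space sketch under stated assumptions: the toy_ structures, the fixed-size child array, and the mounted_root pointer are inventions for the example, and symbolic links are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 4

struct toy_fops { const char *fs_name; };

struct toy_dentry {
    const char *name;
    const struct toy_fops *fops;        /* NULL until the walk fills it in */
    struct toy_dentry *mounted_root;    /* root of a filesystem mounted here */
    struct toy_dentry *child[MAX_CHILDREN];
};

/* Find a direct child by name; comp need not be NUL-terminated. */
static struct toy_dentry *toy_lookup(struct toy_dentry *dir,
                                     const char *comp, size_t len)
{
    for (int i = 0; i < MAX_CHILDREN && dir->child[i] != NULL; i++)
        if (strlen(dir->child[i]->name) == len &&
            strncmp(dir->child[i]->name, comp, len) == 0)
            return dir->child[i];
    return NULL;
}

/* Walk an absolute path from root; returns the final dentry or NULL.
   The root of every filesystem is assumed to have its fops set. */
static struct toy_dentry *toy_path_walk(struct toy_dentry *root,
                                        const char *path)
{
    assert(root != NULL && path != NULL && root->fops != NULL);
    struct toy_dentry *cur = root;
    while (*path != '\0') {
        while (*path == '/')
            path++;                       /* skip separators */
        size_t len = strcspn(path, "/");
        if (len == 0)
            break;                        /* trailing slash or end of path */
        struct toy_dentry *next = toy_lookup(cur, path, len);
        if (next == NULL)
            return NULL;                  /* no such component */
        if (next->mounted_root != NULL)
            next = next->mounted_root;    /* cross into the mounted filesystem */
        else if (next->fops == NULL)
            next->fops = cur->fops;       /* inherit from the parent */
        cur = next;
        path += len;
    }
    return cur;
}
```

If a second filesystem is mounted on `home', walking `/home/fred/test' leaves the dentry for `test' holding the operations of that second filesystem, not those of the root filesystem.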
So, when the file open procedure in the VFS layer executes the following
code:
// Call open() through the inode
f->f_op = fops_get(inode->i_fop);
f->f_op->open(inode, f);
it is calling a function to open a file that was provided by the handler for the filesystem on which the file is located.