Guide to Linux Filesystem Mastery
by Sheryl Calish
What is a "filesystem," anyway? Sheryl Calish explains the concept as well as its practical application
Although the kernel is the heart of Linux, files
are the main vehicles through which users interact with the operating
system. This is especially true of Linux because, in the UNIX
tradition, it uses the file I/O mechanism to manage hardware devices as
well as data files.
Unfortunately, the terminology used to discuss Linux filesystem concepts is a bit confusing for newcomers. The terms filesystem and file system
are used interchangeably in the Linux documentation to refer to several
different but related concepts. They refer to the data structures as
well as the methods that manage the files within the partitions, in
addition to specific instances of a disk partition.
To further confuse the uninitiated, these terms
are also used to refer to the overall organization of files in a
system: the directory tree. Then again, they can refer to each of the
subdirectories within the directory tree, as in the /home filesystem.
Some hold that these directories and subdirectories cannot truly be
called a filesystem unless they each reside on their own disk
partition. Nevertheless, others do refer to them as filesystems,
contributing to the confusion.
Linux veterans understand, from context, the
sense in which these terms are used. Newcomers, however, have an
understandably harder time discerning the context.
The overriding objective of this article is to
provide enough background to help you discern the context of this
terminology for yourself. In the process of untangling the subtleties
of the filesystem terminology, however, you will also acquire the
knowledge to move beyond the theoretical to the practical application
of some very useful related tools.
The article focuses on the Linux disk partitions
and file management system features in version 2.4 of the Linux kernel.
It also reviews new features available in version 2.6 of the kernel.
Overview of Disk Partitions
The basic unit of file storage in both Linux and
UNIX is the disk partition, a logical division of one or more hard
disks treated as an independent disk by the operating system. Files and
file management systems "live" on disk partitions. These disk
partitions are handled as devices by Linux, which, in turn, uses the
file I/O mechanism via special files in the /dev directory.
There are two types of device files: block and character (raw).
One important difference between them is that block devices are
buffered whereas character devices, because they don't have a file
management system, are not. Before Oracle Cluster File System
(OCFS) became available, using raw devices was a common method of
increasing performance on Oracle datafile partitions. (In a follow-up
to this article we'll take a much closer look at raw devices.)
The partition table, stored at the very
beginning of a disk, provides a map of the partitions on that disk. You
can view a system's partition table by using the fdisk command.
# fdisk -l
Disk /dev/hda: 240 heads, 63 sectors, 1940 cylinders
Units = cylinders of 15120 * 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 1 286 2162128+ c Win95 FAT32 (LBA)
/dev/hda2 * 288 1940 12496680 5 Extended
/dev/hda5 288 289 15088+ 83 Linux
/dev/hda6 290 844 4195768+ 83 Linux
/dev/hda7 845 983 1050808+ 82 Linux swap
/dev/hda8 984 1816 6297448+ 83 Linux
/dev/hda9 1817 1940 937408+ 83 Linux
The nomenclature /dev/hda to /dev/hdd in the
partition table refers to IDE drives 1 through 4, with hda referring to
drive 1, hdb referring to drive 2, and so on. Partitions within a drive
are referred to by number, so that /dev/hda5 is partition number 5
on the first IDE drive. For SCSI drives, a similar naming
scheme is used: /dev/sda to /dev/sdd.
Partitions No. 1 through 4 are reserved for
primary partitions, and 5 and up are used for logical partitions. So,
for the partition table shown above, there is one drive, hda, with one
primary partition, hda1, and one extended partition, hda2, containing five
logical partitions, /dev/hda5 through /dev/hda9. (The shmfs entry that
appears in the df listing later in this article is not a disk partition;
it represents the shared memory filesystem mounted as a special
filesystem according to POSIX standards in Linux 2.4.)
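The naming scheme just described can be decoded mechanically. The following is a minimal sketch, not a standard utility: describe_partition is a hypothetical helper name, and it assumes only the hdX/sdX naming and the 1-4 versus 5-and-up numbering rule given above.

```shell
# describe_partition DEVICE
# Hypothetical helper: decode names such as /dev/hda5 or /dev/sdb1
# using the hdX/sdX naming scheme described in the text.
describe_partition() (
    name=${1#/dev/}                      # /dev/hda5 -> hda5
    case $name in
        hd[a-z][1-9]*) bus=IDE ;;
        sd[a-z][1-9]*) bus=SCSI ;;
        *) echo "unrecognized device name: $1" >&2; exit 1 ;;
    esac
    rest=${name#??}                      # hda5 -> a5
    number=${rest#?}                     # a5 -> 5
    letter=${rest%"$number"}             # a5 -> a
    # printf with a leading quote yields the character code; 'a' is 97.
    drive=$(( $(printf '%d' "'$letter") - 96 ))
    if [ "$number" -le 4 ]; then
        kind="primary or extended"       # numbers 1-4
    else
        kind="logical"                   # numbers 5 and up
    fi
    echo "$bus drive $drive, partition $number ($kind)"
)

describe_partition /dev/hda5   # IDE drive 1, partition 5 (logical)
describe_partition /dev/hda2   # IDE drive 1, partition 2 (primary or extended)
```

The helper only interprets the name; it cannot tell a primary partition from an extended one, which is why partitions 1 through 4 are reported as "primary or extended."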
You may have noticed the LBA in parentheses in the fdisk listing. LBA stands for logical block addressing, which converts the cylinder, head, and sector scheme of a hard disk into linear block numbers for processing.
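As an illustration of that conversion, here is a small sketch of the usual formula, using the geometry reported in the fdisk listing (240 heads, 63 sectors per track). The function name chs_to_lba is an assumption made for this example. It also assumes the conventional numbering in which cylinders and heads count from 0 and sectors from 1, whereas fdisk displays cylinders starting at 1.

```shell
# Geometry taken from the fdisk listing.
HEADS=240
SECTORS_PER_TRACK=63

# chs_to_lba CYLINDER HEAD SECTOR
# Cylinders and heads count from 0; sectors count from 1.
chs_to_lba() {
    echo $(( ($1 * HEADS + $2) * SECTORS_PER_TRACK + ($3 - 1) ))
}

chs_to_lba 0 0 1     # 0: the first sector of the disk
chs_to_lba 0 1 1     # 63: first sector of the second track
chs_to_lba 1 0 1     # 15120: first sector of the second cylinder
```

The last result matches the "Units = cylinders of 15120 * 512 bytes" line in the listing, since one cylinder holds 240 * 63 = 15120 sectors.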
In Linux, partitions are either primary, extended, or logical partitions. The term primary partition
is a holdover from the limitation of four partitions on old x86
systems. Unlike DOS and Windows, Linux can boot from a primary or a
logical partition. Primary partitions that serve as placeholders for
logical partitions are referred to as extended partitions. An extended
partition has its own partition table that points to one or more
logical partitions, which are simply subdivisions of that extended
partition. In the fdisk listing above, hda2 is an extended partition.
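The Blocks column of the fdisk listing can be cross-checked against the Start and End cylinders. The sketch below assumes the geometry from the listing (15120 sectors of 512 bytes per cylinder) and 1K blocks; partition_kblocks is a hypothetical helper. The optional third argument, a count of sectors not available to the partition, is an assumption: subtracting one 63-sector track reproduces the listed figures for hda1 and for the logical partitions, which is consistent with their data starting one track in.

```shell
SECTORS_PER_CYLINDER=15120   # 240 heads * 63 sectors, from the "Units" line

# partition_kblocks FIRST_CYL LAST_CYL [RESERVED_SECTORS]
# Print the partition size in 1K blocks, fdisk style.
partition_kblocks() (
    reserved=${3:-0}
    sectors=$(( ($2 - $1 + 1) * SECTORS_PER_CYLINDER - reserved ))
    blocks=$(( sectors / 2 ))            # two 512-byte sectors per 1K block
    if [ $(( sectors % 2 )) -eq 1 ]; then
        echo "${blocks}+"                # "+" marks a leftover half block
    else
        echo "$blocks"
    fi
)

partition_kblocks 288 1940       # hda2, the extended partition: 12496680
partition_kblocks 288 289 63     # hda5: 15088+
partition_kblocks 1 286 63       # hda1: 2162128+
```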
Overview of File Management Systems
In order for a partitioned disk to be usable, a
filesystem must be built on it. In this case, we are referring to the
filesystems also known as "partition types," "disk-based filesystems,"
and "file system types." In reality, these can be thought of as file
management systems, because that is just what they do: They keep the
files on your system in a consistent state by maintaining metadata on
them.
One hallmark of the Linux project is the effort
put into achieving compatibility with multiple styles and preferences
for each of the available utilities, and nowhere is that accommodation
more apparent than in the choice of available file management systems.
This choice is enabled by the Virtual File System (VFS) inside the
Linux kernel. VFS implements a basic set of data structures with which
other file management systems can work. These data structures are the superblock, inode, dentry (or directory file), and the data block.
Each partition has a superblock, which maintains
information on the filesystem within a partition including a set of
inodes uniquely numbered within each superblock, the number of free and
total inodes, the total number of data blocks, the number of free data
blocks, and the filesystem's state. A filesystem's state is either clean, when the filesystem is unchanged, or dirty,
when there have been changes to the filesystem that have not been
written to disk. One inode within a superblock is assigned to each
file.
Except for the filename, all the information about a file is contained in the inode, including the following:
- Address
- Type
- Size
- Owner
- References to the block(s) with the file's data
- Time-stamps for the last file modification and access.
You can view the inode numbers of files by issuing the following command:
$ ls -i
As already mentioned, inodes are numbered
uniquely only within a superblock, and there is only one superblock for
each partition, which is why a hard link cannot cross partitions.
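The relationship between filenames and inodes is easy to observe in a scratch directory. The sketch below is an illustration only: the file names and the inode_of helper are made up for this example, and it relies on nothing beyond ls -i, ln, and mktemp.

```shell
# inode_of PATH: print the inode number that ls -i reports for PATH
inode_of() {
    ls -di -- "$1" | awk '{print $1}'
}

workdir=$(mktemp -d)                            # scratch directory
echo "some data" > "$workdir/original"
ln    "$workdir/original" "$workdir/hardlink"   # second name for the same inode
ln -s original "$workdir/symlink"               # separate file that stores a path

ls -li "$workdir"                               # first column is the inode number
echo "original: $(inode_of "$workdir/original")"
echo "hardlink: $(inode_of "$workdir/hardlink")"
echo "symlink:  $(inode_of "$workdir/symlink")"
```

The original and the hard link report the same inode number, while the symbolic link has an inode of its own. Because a symbolic link stores a path rather than an inode number, it is not bound by the one-partition restriction. Remove the scratch directory with rm -r "$workdir" when finished.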
The filename is linked to an inode number with a
dentry object, which users see as a directory file. Data blocks hold
the actual file data.
Any file management system that implements the
basic set of functions defined by VFS will be supported by Linux. In
the case of a file management system such as vfat, the Linux project
provides its own device driver.
Different file management systems can exist on
different partitions on the same system, as you can see from the
following output.
$ df -T
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/hda6 reiserfs 4195632 2015020 2180612 49% /
/dev/hda5 ext2 14607 3778 10075 8% /boot
/dev/hda9 reiserfs 937372 202368 735004 22% /home
/dev/hda8 reiserfs 6297248 3882504 2414744 62% /opt
shmfs shm 256220 0 256220 0% /dev/shm
/dev/hda1 vfat 2159992 1854192 305800 86% /windows/C
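On a system with many mounts, it can help to tally how many filesystems of each type are in use. The following sketch assumes only the column layout shown in the df -T listing (type in the second column); count_fs_types is a hypothetical helper, and the input here is the article's own listing rather than live output.

```shell
# count_fs_types: read `df -T` output on stdin, print "type count" per line
count_fs_types() {
    awk 'NR > 1 { n[$2]++ } END { for (t in n) print t, n[t] }' | sort
}

count_fs_types <<'EOF'
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/hda6 reiserfs 4195632 2015020 2180612 49% /
/dev/hda5 ext2 14607 3778 10075 8% /boot
/dev/hda9 reiserfs 937372 202368 735004 22% /home
/dev/hda8 reiserfs 6297248 3882504 2414744 62% /opt
shmfs shm 256220 0 256220 0% /dev/shm
/dev/hda1 vfat 2159992 1854192 305800 86% /windows/C
EOF
```

For the listing above, this reports one ext2, three reiserfs, one shm, and one vfat filesystem. On a live system the same function can be fed from df -T directly, as in df -T | count_fs_types.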
Currently the most commonly used file management
systems encountered by Oracle users are ext2/ext3, ReiserFS (not
supported by Oracle), and OCFS. Below is a summary table of the major
features of non-Oracle partitions.
| Feature | ext2 | ext3 | ReiserFS3.6 (not supported by Oracle) |
| Maximum partition size | 4TB | 4TB | 16TB |
| Maximum file size | 2GB-2TB | 2GB-2TB | 8TB |
| Block size | 1KB-4KB | 1KB-4KB | 4KB only |
| Journaling capabilities | No | Yes | Yes |
| Reboot after a crash | Slow | Fast | Very Fast |
| State of data after crash | Good | Very Good | Fair |
| ACL support | Yes | Yes | No |
| Stability | Excellent | Good | Good |
Both ext2 and ReiserFS provide features such as
user-level security and more efficient use of disk space, so that
defragmentation tools, although they do exist for ext2 at least, are
rarely needed. Ext2 is the traditional, de facto standard Linux file
management system. It is the default for the Red Hat version of Linux,
although ReiserFS is the default on SUSE. The maximum file size for
ext2/ext3 is actually dependent on the choice of blocksize and hardware
architecture. One of ext2's many features is that it allows the block size
to be set per disk partition. ReiserFS, because it is
based on balanced-tree technology rather than being extent-based,
accommodates widely varying file sizes within a disk partition, so efficient space
usage, in addition to journaling, is inherent in its design.
Journaled file management systems, such as ext3
and ReiserFS, log changes to the filesystem's metadata: inodes, free
block allocation maps, inode maps, and so on. In this manner, in the
event of a system crash, the journal can be checked for the most
recently modified metadata, thus ensuring a rapid recovery of the
filesystem. This capability is especially important on large systems.
Without this feature, a filesystem such as ext2 would require the fsck
facility to run on reboot after a hardware failure. For large
filesystems, this process can take hours.
Of course, there is a price to be paid for
journaling in the form of a trade-off between processing time and
recovery time. In the case of ext3, there is a choice of journaling
modes that allows some discretion in this trade-off. The journal
mode, which logs the data blocks as well as the metadata, is the
most secure but slowest mode. The default mode, known as ordered,
records only the metadata in the journal but writes the data blocks to
disk before it writes the metadata, thus providing a middle ground
between fast recovery and fast performance. The fastest mode is the writeback
mode, which records only the metadata. In this mode, file data may be
lost but the integrity of the filesystem itself is maintained.
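The three modes correspond to the values journal, ordered, and writeback of the ext3 data= mount option. The sketch below only prints the mount command it would use, so it is safe to run without superuser privileges. ext3_mount_command is a hypothetical helper, and the device and mount point are borrowed from this article's examples rather than from any particular system.

```shell
# ext3_mount_command MODE DEVICE MOUNTPOINT
# Prints (does not run) a mount command for one of the three ext3 data modes.
ext3_mount_command() {
    case $1 in
        journal|ordered|writeback)
            echo "mount -t ext3 -o data=$1 $2 $3" ;;
        *)
            echo "unknown ext3 data mode: $1" >&2
            return 1 ;;
    esac
}

ext3_mount_command journal   /dev/hda9 /home   # data and metadata journaled: safest, slowest
ext3_mount_command ordered   /dev/hda9 /home   # the default middle ground
ext3_mount_command writeback /dev/hda9 /home   # metadata only: fastest
```

Rejecting unknown values up front is a deliberate choice in this sketch: a misspelled mode name is caught before any mount is attempted.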
As of this writing, Reiser4 had just been
released. Like ReiserFS3.6, Reiser4 journals only metadata. Unlike
ReiserFS3.6, it is based on a new dancing tree algorithm that seems to
be faster than balanced tree algorithms. It also scales to numerous
CPUs and includes built-in encryption and compression on disk writes.
OCFS is a specialized file management system for
Oracle Real Application Clusters (RAC), configuration files, and
database files. Other files, even the Oracle software files, will have
better performance on ext2/ext3 or ReiserFS.
Currently, the common wisdom regarding the
choice of a file management system seems to be that, except for a few
situations, performance is comparable between ext2, ext3, and ReiserFS.
Flame wars, however, have erupted among proponents of the various
systems. ReiserFS, because of its ability to handle variable file
sizes, seems to be better on systems with many small files. Of course,
if you are running or plan to run Oracle RAC on Linux, you probably
want to install OCFS or use Automatic Storage Management (ASM) for the
Oracle datafiles and configuration files.
Besides the most common ext2/ext3 and ReiserFS
filesystems, Linux also supports other native filesystems, including
IBM's jfs and SGI's xfs. Support for traditional UNIX filesystems
includes SYSV, BSD, Solaris, Next, and Veritas VxFS. Other filesystems
supported at various levels include
- Microsoft's fat, ntfs, vfat, fat32
- IBM's hpfs (OS/2)
- Apple's Macintosh hfs
- Amiga's affs
- Acorn Disk Filing System adfs
Please note that some filesystems are not supported by Oracle, so use them at your own risk.
The most significant new filesystem feature in
version 2.6 of the Linux kernel is support for access control lists
(ACLs). ACLs allow a list of one or more users, or groups of users, to
be granted permissions for individual files. Other new features include:
- Improved support for ISO 9660 filesystems used on CD-ROMs
- Default mount options that can be stored in the filesystem
- Indexed directories to speed up file searches
- Support for Windows' Logical Disk Manager (dynamic disks)
- The ability to mount ntfs as read/write, although writing is still experimental
- Improved support for fat12 (old DOS filesystems)
Tools for Working with Partitions and Filesystems
To add a new disk or resize an existing one, you need to use fdisk or cfdisk. Although cfdisk is ostensibly easier to use, fdisk is the tried-and-true favorite for disk partitioning. Here are a few guidelines for using the Linux version of fdisk to help you know what to expect.
First, invoke fdisk as the superuser with the device name:
# fdisk /dev/hda
The number of cylinders for this disk is set to 1940.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): m
You can obtain a display of your partition table by using the p, or print, command. A new partition is created with the n, or new, command, and the w, or write, command will write the new partition table to disk. Once you enter the new command, fdisk will need to know if you are creating a logical or primary partition:
Command (m for help): n
Command action
l logical (5 or over)
p primary partition (1-4)
l
No free sectors available
Command (m for help):
As the session above shows, fdisk reports "No free sectors available" when the disk has no unallocated space. If free space is available, fdisk instead asks for the partition number. If you enter "p," for primary partition, you will see the following prompt:
Partition number (1-4):
For a logical partition, you will have the choice of
Partition number (5 or over):
Then you can enter the beginning cylinder number for the new partition. fdisk will recommend a default number, like this:
First cylinder (1-1940, default 1):1
which you can choose to accept. Next, you need to enter the last cylinder or the size of the partition:
Last cylinder or +sizeM or +sizeK(1-1940), default 5721: +1G
At this point, fdisk
will assume that this is a regular Linux partition, identified by the
hexadecimal number 83 in the "ID" column in a partition table. The
partition type is changed with the t, or type, command in fdisk. The available partition types for fdisk are obtained with the l, or list, command. Here is a partial listing of the available types:
| ID | System |
| 82 | Linux swap |
| 83 | Linux |
| 85 | Linux extended |
| 8e | Linux LVM |
It is important to know that until you run the write command, anything you do in fdisk will be temporary—which is actually good if you need to bail out of fdisk for any reason.
Reorganizing Partitions and the File Management System
Because each partition contains its own
file management system, resizing a partition involves resizing the file
management system and the partition. The repartitioning tools available
are therefore dependent on the type of file management system used. For
ext2/ext3 systems, there is a choice of resize2fs used with fdisk, GNU
Parted, or Partition Magic. For ReiserFS, the choice is limited to resize_reiserfs used with cfdisk, because GNU Parted is still being refined for use with ReiserFS.
Both resize2fs and resize_reiserfs
resize the file management system and require the use of a separate
partition resizer, either fdisk or cfdisk. I have personally used GNU
Parted to repartition ext2 partitions. This is a reasonably simple
program to use. GNU Parted's support for ReiserFS is due to become more
robust in the future. Partition Magic is a commercial program for DOS
and Windows but can be used with Linux ext2/ext3 partitions if run from
the bootable floppy or CD-ROM that comes with it.
Although the actual commands depend on which
system you are changing to, the general procedure for changing a file
management system involves
- Backing up files on the partition
- Removing files from the partition
- If you are using fdisk, possibly dropping a partition to make room for two smaller ones
- Making the new filesystem with the appropriate commands. For example, to create an ext2 filesystem you would use
$ mke2fs /dev/hda5 15088
The block count can optionally be specified, as
it is above (15,088). The only exception to the above sequence of
events is in migration from an ext2 system to an ext3 system with a
command such as
$ tune2fs -j /dev/hda3
although a backup is still in order.
Mounting a Partition
A partition is not available in Linux until it
is mounted by a user with superuser privileges. For Linux partitions
listed in the /etc/fstab file, mounting happens automatically when the
system boots. For CD-ROM and floppy-disk drives, it is usually only a
matter of clicking on the appropriate icon.
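Each line of /etc/fstab lists a device, its mount point, the filesystem type, mount options, and two numeric fields used by dump and fsck. The sketch below shows how such a file can be queried. The fstab_lookup helper is hypothetical, and the sample entries are invented for illustration, modeled on the partitions used elsewhere in this article.

```shell
# fstab_lookup MOUNTPOINT: read fstab-format text on stdin and print
# "device type options" for the given mount point.
fstab_lookup() {
    awk -v mp="$1" '
        /^[ \t]*#/ || NF < 4 { next }      # skip comments and blank or short lines
        $2 == mp { print $1, $3, $4; found = 1 }
        END { exit !found }                # non-zero status if nothing matched
    '
}

fstab_lookup /boot <<'EOF'
# device    mount point  type      options   dump  pass
/dev/hda6   /            reiserfs  defaults  1     1
/dev/hda5   /boot        ext2      defaults  1     2
/dev/hda7   swap         swap      defaults  0     0
EOF
```

For the sample above, the lookup prints /dev/hda5 ext2 defaults. Against a real system the input would come from the file itself, as in fstab_lookup /home < /etc/fstab.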
The options available for use with the mount
command are dependent on the file management system. For example, you
can specify ext3 journaling options as follows:
# mount -t ext3 -o data=journal /dev/hda9 /home
To remove a floppy disk or a CD-ROM, you need to specifically unmount it before removing it, using the command
$ umount /media/floppy
Prior to Linux 2.4, a filesystem could be
mounted only once. Now, however, there is no limit on the number of
times a filesystem can be mounted.
Conclusion
The Linux filesystem is a multifaceted concept.
This discussion was meant to serve as a basis for further research into
its usefulness and desirability according to your own requirements.
In Part 2 of this article we'll examine cluster filesystems, including OCFS.
Sheryl Calish (scalish@earthlink.net)
is an Oracle developer, specializing in Linux, for Blue Heron
Consulting. She is also funding chair for the Central Florida Oracle
Users Group and marketing chair for the IOUG Linux SIG.