Guide to Linux Filesystem Mastery
by Sheryl Calish

What is a "filesystem," anyway? Sheryl Calish explains the concept as well as its practical application

Although the kernel is the heart of Linux, files are the main vehicles through which users interact with the operating system. This is especially true of Linux because, in the UNIX tradition, it uses the file I/O mechanism to manage hardware devices as well as data files.

Unfortunately, the terminology used to discuss Linux filesystem concepts is a bit confusing for newcomers. The terms filesystem and file system are used interchangeably in the Linux documentation to refer to several different but related concepts. They refer to the data structures as well as the methods that manage the files within the partitions, in addition to specific instances of a disk partition.

To further confuse the uninitiated, these terms are also used to refer to the overall organization of files in a system: the directory tree. Then again, they can refer to each of the subdirectories within the directory tree, as in the /home filesystem. Some hold that these directories and subdirectories cannot truly be called a filesystem unless they each reside on their own disk partition. Nevertheless, others do refer to them as filesystems, contributing to the confusion.

Linux veterans understand, from context, the sense in which these terms are used. Newcomers, however, have an understandably harder time discerning the context.

The overriding objective of this article is to provide enough background to help you discern the context of this terminology for yourself. In the process of untangling the subtleties of the filesystem terminology, however, you will also acquire the knowledge to move beyond the theoretical to the practical application of some very useful related tools.

The article focuses on the Linux disk partitions and file management system features in version 2.4 of the Linux kernel. It also reviews new features available in version 2.6 of the kernel.

Overview of Disk Partitions

The basic unit of file storage in both Linux and UNIX is the disk partition, a logical division of one or more hard disks treated as an independent disk by the operating system. Files and file management systems "live" on disk partitions. These disk partitions are handled as devices by Linux, which, in turn, uses the file I/O mechanism via special files in the /dev directory.

There are two types of device files: block and character (raw). One important difference between them is that block devices are buffered whereas character devices, because they don't have a file management system, are not. Before Oracle Cluster File System (OCFS) became available, using raw devices was a common method of increasing performance on Oracle datafile partitions. (In a follow-up to this article we'll take a much closer look at raw devices.)

The partition table, stored at the very beginning of a disk, provides a map of the partitions on that disk. You can view a system's partition table by using the fdisk command.

# fdisk -l

Disk /dev/hda: 240 heads, 63 sectors, 1940 cylinders
Units = cylinders of 15120 * 512 bytes

Device   Boot     Start       End      Blocks   Id     System
/dev/hda1           1          286    2162128+   c     Win95 FAT32 (LBA)
/dev/hda2   *     288         1940   12496680    5     Extended
/dev/hda5         288          289      15088+  83     Linux
/dev/hda6         290          844    4195768+  83     Linux
/dev/hda7         845          983    1050808+  82     Linux swap
/dev/hda8         984         1816    6297448+  83     Linux
/dev/hda9        1817         1940     937408+  83     Linux

The nomenclature /dev/hda to /dev/hdd in the partition table refers to IDE drives 1 through 4, with hda referring to drive 1, hdb referring to drive 2, and so on. Partitions within a drive are referred to by number, so that /dev/hda5 would be the fifth partition on the first IDE drive. For SCSI drives, a similar naming scheme is used: /dev/sda to /dev/sdd.

Partitions No. 1 through 4 are reserved for primary partitions, and 5 and up are used for logical partitions. So, for the partition table shown above, there is one drive, hda, with one primary partition, hda1, and one extended partition, hda2, containing five logical partitions, /dev/hda5 through /dev/hda9. (The shmfs entry that appears in the df listing later in this article is not a disk partition; it represents the shared memory filesystem, mounted as a special filesystem according to POSIX standards in Linux 2.4.)
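
The naming scheme described above can be decoded mechanically. The following is a minimal sketch, not a standard utility: the function name describe_device is made up for illustration, and it assumes only the hdX and sdX schemes mentioned in the text, with nothing but digits after the drive letter.

```sh
# Decode a device name such as /dev/hda5 into bus, drive number, and
# partition number. describe_device is a hypothetical helper.
describe_device() {
    name=${1#/dev/}                    # strip the /dev/ prefix if present
    case $name in
        hd[a-z]*) bus=IDE ;;
        sd[a-z]*) bus=SCSI ;;
        *) echo "unrecognized device name: $1" >&2; return 1 ;;
    esac
    rest=${name#??}                    # drop "hd" or "sd"
    partition=${rest#?}                # digits after the drive letter, if any
    letter=${rest%"$partition"}        # the drive letter itself
    alphabet=abcdefghijklmnopqrstuvwxyz
    before=${alphabet%%"$letter"*}     # letters that precede the drive letter
    drive=$(( ${#before} + 1 ))        # a -> 1, b -> 2, and so on

    if [ -z "$partition" ]; then
        echo "$bus drive $drive (whole disk)"
    elif [ "$partition" -le 4 ]; then
        echo "$bus drive $drive, partition $partition (primary or extended)"
    else
        echo "$bus drive $drive, partition $partition (logical)"
    fi
}

describe_device /dev/hda5    # prints: IDE drive 1, partition 5 (logical)
describe_device /dev/sdb1    # prints: SCSI drive 2, partition 1 (primary or extended)
```

The name alone cannot distinguish a primary partition from an extended one, which is why numbers 1 through 4 are reported as "primary or extended"; the Id column of the fdisk listing settles that question.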

You may have noticed the LBA in parentheses in the fdisk listing. LBA stands for logical block addressing, which converts the cylinder, head, and sector (CHS) schema of a hard disk into linear block numbers for processing.
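
To make the conversion concrete, here is a small sketch of the conventional CHS-to-LBA formula, using the geometry that fdisk reported for /dev/hda (240 heads, 63 sectors per track). It assumes the usual convention in which cylinders and heads count from 0 and sectors count from 1; the function name chs_to_lba is chosen for illustration.

```sh
heads=240               # from "240 heads" in the fdisk listing
sectors_per_track=63    # from "63 sectors" in the fdisk listing

# LBA = (cylinder * heads + head) * sectors_per_track + (sector - 1)
chs_to_lba() {    # usage: chs_to_lba CYLINDER HEAD SECTOR
    echo $(( ($1 * $heads + $2) * $sectors_per_track + ($3 - 1) ))
}

chs_to_lba 0 0 1    # first sector on the disk: prints 0
chs_to_lba 0 1 1    # first sector under the second head: prints 63
chs_to_lba 1 0 1    # first sector of the second cylinder: prints 15120
```

The last result, 15120, is the number of sectors in one cylinder, which agrees with the "Units = cylinders of 15120 * 512 bytes" line in the listing.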

In Linux, partitions are either primary, extended, or logical partitions. The term primary partition is a holdover from the limitation of four partitions on old x86 systems. Unlike DOS and Windows, Linux can boot from a primary or a logical partition. Primary partitions that serve as placeholders for logical partitions are referred to as extended partitions. An extended partition has its own partition table that points to one or more logical partitions, which are simply subdivisions of a primary partition. In the fdisk listing above, hda2 is an extended partition.
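
The relationship between the extended partition and its logical partitions can be checked with a little arithmetic on the fdisk listing. This sketch assumes that the Blocks column is in 1 KB units and that each Start and End value is an inclusive cylinder number; the helper name span_kb is illustrative.

```sh
# One cylinder = 240 heads * 63 sectors * 512 bytes = 7560 KB.
cylinder_kb=$(( 240 * 63 * 512 / 1024 ))

# Size in KB of an inclusive range of cylinders.
span_kb() {    # usage: span_kb START END
    echo $(( ($2 - $1 + 1) * $cylinder_kb ))
}

# hda2, the extended partition, covers cylinders 288 through 1940.
extended_kb=$(span_kb 288 1940)

# Cylinder ranges of the logical partitions hda5 through hda9.
logical_kb=0
for range in 288-289 290-844 845-983 984-1816 1817-1940; do
    start=${range%-*}
    end=${range#*-}
    logical_kb=$(( $logical_kb + $(span_kb "$start" "$end") ))
done

echo "extended: $extended_kb KB, logical spans combined: $logical_kb KB"
```

Both figures come to 12496680, the Blocks value listed for hda2, so the five logical partitions tile the extended partition exactly. The Blocks value of each individual logical partition in the listing is slightly smaller than its cylinder span; this sketch does not model that difference.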

Overview of File Management Systems

In order for a partitioned disk to be usable, a filesystem must be built on it. In this case, we are referring to the filesystems also known as "partition types," "disk-based filesystems," and "file system types." In reality, these can be thought of as file management systems, because that is just what they do: They keep the files on your system in a consistent state by maintaining metadata on them.

One hallmark of the Linux project is the effort put into achieving compatibility with multiple styles and preferences for each of the available utilities, and nowhere is that accommodation more apparent than in the choice of available file management systems. This choice is enabled by the Virtual File System (VFS) inside the Linux kernel. VFS implements a basic set of data structures with which other file management systems can work. These data structures are the superblock, inode, dentry (or directory file), and the data block.

Each partition has a superblock, which maintains information on the filesystem within a partition including a set of inodes uniquely numbered within each superblock, the number of free and total inodes, the total number of data blocks, the number of free data blocks, and the filesystem's state. A filesystem's state is either clean, when the filesystem is unchanged, or dirty, when there have been changes to the filesystem that have not been written to disk. One inode within a superblock is assigned to each file.
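
Counters of this kind can be viewed for any mounted filesystem with df. The following sketch wraps two df invocations in hypothetical helper functions (fs_blocks and fs_inodes are names chosen here). It assumes a df that accepts -P and -k, and treats -i as a common extension that may be unavailable or may report 0 on some filesystems.

```sh
# Total and available 1 KB blocks for the filesystem that holds a path.
fs_blocks() {
    df -P -k "$1" | awk 'NR == 2 { print "blocks_total=" $2, "blocks_available=" $4 }'
}

# Total and free inodes for the same filesystem. Errors are suppressed
# because -i is not available in every df implementation.
fs_inodes() {
    df -P -i "$1" 2>/dev/null | awk 'NR == 2 { print "inodes_total=" $2, "inodes_free=" $4 }'
}

fs_blocks /
fs_inodes /
```

The numbers printed depend entirely on the system where the commands are run, so no sample output is shown.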

Except for the filename, all the information about a file is contained in the inode, including the following:

  • Address
  • Type
  • Size
  • Owner
  • References to the block(s) with the file's data
  • Time-stamps for the last file modification and access.

You can view the inodes for the files if you issue the following command:

$ ls -i

As already mentioned, inodes are numbered uniquely only within a superblock, and there is only one superblock for each partition, which is why a hard link cannot cross partitions.
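
A quick experiment shows what a hard link is at the inode level. This sketch works in a scratch directory created by mktemp; the file names and the helper inode_of are arbitrary choices for illustration.

```sh
workdir=$(mktemp -d)
echo "sample data" > "$workdir/original"
ln "$workdir/original" "$workdir/second-name"   # hard link: a new name, not a new file

# The first column of ls -i output is the inode number.
inode_of() { ls -i "$1" | awk '{ print $1 }'; }

original_inode=$(inode_of "$workdir/original")
second_inode=$(inode_of "$workdir/second-name")
echo "original: $original_inode  second-name: $second_inode"

rm -r "$workdir"    # clean up the scratch directory
```

Both names report the same inode number because a hard link is simply a second directory entry pointing at one inode. An inode number has no meaning outside its own filesystem, so attempting the same ln between two different partitions is expected to fail with a cross-device link error.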

The filename is linked to an inode number with a dentry object, which users see as a directory file. Data blocks hold the actual file data.

Any file management system that implements the basic set of functions defined by VFS will be supported by Linux. In the case of a file management system such as vfat, the Linux project provides its own device driver.

Different file management systems can exist on different partitions on the same system, as you can see from the following output.

$ df -T
Filesystem    Type         1K Blocks    Used      Available   Use%   Mounted on
/dev/hda6     reiserfs     4195632      2015020   2180612     49%    /
/dev/hda5     ext2         14607        3778      10075        8%    /boot
/dev/hda9     reiserfs     937372       202368    735004      22%    /home
/dev/hda8     reiserfs     6297248      3882504   2414744     62%    /opt
shmfs         shm          256220       0         256220       0%    /dev/shm
/dev/hda1     vfat         2159992      1854192   305800      86%    /windows/C
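
To summarize a listing like this one by type, the kernel's own mount table can be tallied. The sketch below assumes /proc/mounts is available, with the filesystem type in the third field of each line; count_fs_types is a name invented for this example, and it accepts an alternative file in the same format.

```sh
# Count mounted filesystems by type. Field 3 of each /proc/mounts line
# is the filesystem type.
count_fs_types() {
    awk '{ count[$3]++ } END { for (t in count) print count[t], t }' "${1:-/proc/mounts}" |
        sort -k 2
}

# Only read the live mount table where it exists.
if [ -r /proc/mounts ]; then
    count_fs_types
fi
```

Run against the six mounts shown in the df listing, a tally of this kind would report three reiserfs filesystems and one each of ext2, shm, and vfat.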

Currently the most commonly used file management systems encountered by Oracle users are ext2/ext3, ReiserFS (not supported by Oracle), and OCFS. Below is a summary table of the major features of non-Oracle partitions.

Feature                     ext2        ext3        ReiserFS 3.6 (not supported by Oracle)
Maximum partition size      4TB         4TB         16TB
Maximum file size           2GB-2TB     2GB-2TB     8TB
Block size                  1KB-4KB     1KB-4KB     4KB only
Journaling capabilities     No          Yes         Yes
Reboot after a crash        Slow        Fast        Very Fast
State of data after crash   Good        Very Good   Fair
ACL support                 Yes         Yes         No
Stability                   Excellent   Good        Good

Both ext2 and ReiserFS provide features such as user-level security and more efficient use of disk space, so that defragmentation tools, although they do exist for ext2 at least, are rarely needed. Ext2 is the traditional, de facto standard Linux file management system. It is the default for the Red Hat version of Linux, although ReiserFS is the default on SUSE. The maximum file size for ext2/ext3 is actually dependent on the choice of block size and hardware architecture. One of ext2's many features is that it allows the block size to be set per disk partition. ReiserFS, because it is based on balanced tree technology rather than being extent-based, allows variable file sizes within a disk partition, so efficient space usage, along with journaling, is inherent in its design.
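
The effect of block size on space usage is easy to estimate. The following is a deliberately simplified model, not a description of how any particular filesystem allocates space: it assumes every file occupies a whole number of fixed-size blocks and ignores inodes and other metadata. The workload of 10,000 files of 500 bytes each is hypothetical.

```sh
# Blocks needed to hold one file when space is handed out in whole blocks.
blocks_for() {    # usage: blocks_for FILE_BYTES BLOCK_BYTES
    echo $(( ($1 + $2 - 1) / $2 ))    # integer ceiling of FILE_BYTES / BLOCK_BYTES
}

# Disk space in KB consumed by COUNT files of FILE_BYTES each.
space_kb() {      # usage: space_kb COUNT FILE_BYTES BLOCK_BYTES
    echo $(( $1 * $(blocks_for "$2" "$3") * $3 / 1024 ))
}

space_kb 10000 500 1024    # 1 KB blocks: prints 10000
space_kb 10000 500 4096    # 4 KB blocks: prints 40000
```

Under this model the same 5 MB or so of data occupies about 10 MB with 1 KB blocks and about 40 MB with 4 KB blocks, which is why the choice of block size matters most on partitions that hold many small files.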

Journaled file management systems, such as ext3 and ReiserFS, log changes to the filesystem's metadata: inodes, free block allocation maps, inode maps, and so on. In this manner, in the event of a system crash, the journal can be checked for the most recently modified metadata, thus ensuring a rapid recovery of the filesystem. This capability is especially important on large systems. Without this feature, a filesystem such as ext2 would require the fsck facility to run on reboot after a hardware failure. For large filesystems, this process can take hours.

Of course, there is a price to be paid for journaling in the form of a trade-off between processing time and recovery time. In the case of ext3, there is a choice of journaling modes that allow some discretion in trade-offs. The journal mode, which logs all filesystem data, including the data blocks and the metadata, is the most secure but slowest mode. The default mode, known as ordered, records only the metadata but writes the data blocks to disk before it writes the metadata, thus providing the middle ground between fast recovery and fast performance. The fastest mode is the writeback mode, which records only the metadata. In this mode, file data may be lost but the integrity of the filesystem itself is maintained.
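
The ext3 mode is chosen at mount time through the data= option, whose accepted values are journal, ordered, and writeback. The sketch below only builds and prints the corresponding mount command rather than running it, since mounting requires superuser privileges. The function name ext3_mount_command is invented for illustration, and the device and mount point are taken from the df listing earlier in this article.

```sh
# Print the mount command for a given ext3 journaling mode.
# Nothing is mounted; the command is only echoed for inspection.
ext3_mount_command() {    # usage: ext3_mount_command MODE DEVICE MOUNTPOINT
    case $1 in
        journal|ordered|writeback)
            echo "mount -t ext3 -o data=$1 $2 $3" ;;
        *)
            echo "unknown ext3 data mode: $1" >&2
            return 1 ;;
    esac
}

ext3_mount_command ordered /dev/hda9 /home
# prints: mount -t ext3 -o data=ordered /dev/hda9 /home
```

Checking the mode name before use is worthwhile because a misspelled value is rejected at mount time.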

As of this writing, Reiser4 had just been released. Like ReiserFS 3.6, Reiser4 journals only metadata. Unlike ReiserFS 3.6, it is based on a new dancing tree algorithm that seems to be faster than balanced tree algorithms. It also scales to numerous CPUs and includes built-in encryption and compression on disk writes.

OCFS is a specialized file management system for Oracle Real Application Clusters (RAC), configuration files, and database files. Other files, even the Oracle software files, will have better performance on ext2/ext3 or ReiserFS.

Currently, the common wisdom regarding the choice of a file management system seems to be that, except for a few situations, performance is comparable between ext2, ext3, and ReiserFS. Flame wars, however, have erupted among proponents of the various systems. ReiserFS, because of its ability to handle variable file sizes, seems to be better on systems with many small files. Of course, if you are running or plan to run Oracle RAC on Linux, you probably want to install OCFS or use Automatic Storage Management (ASM) for the Oracle datafiles and configuration files.

Besides the most common ext2/ext3 and ReiserFS filesystems, Linux also supports other native filesystems, including IBM's jfs and SGI's xfs. Support for traditional UNIX filesystems includes SYSV, BSD, Solaris, NeXT, and Veritas VxFS. Other filesystems supported at various levels include:

  • Microsoft's fat, ntfs, vfat, fat32
  • IBM's hpfs (OS/2)
  • Apple's Macintosh hfs
  • Amiga's affs
  • Acorn Disk Filing System adfs

Please note that some filesystems are not supported by Oracle, so use them at your own risk.

The most significant new feature in version 2.6 of the Linux kernel is support for access control lists (ACLs). ACLs allow a list of one or more users, or groups of users, to be granted permissions for individual files. Other new features include:

  • Improved support for ISO 9660 filesystems used on CD-ROMs
  • Default mount options that can be stored in the filesystem
  • Indexed directories to speed up file searches
  • Support for Windows' Logical Disk Manager (dynamic disks)
  • The ability to mount ntfs as read/write, although writing is still experimental
  • Improved support for fat12 (old DOS filesystems)

Tools for Working with Partitions and Filesystems

To partition a new disk or repartition an existing one, you need to use fdisk or cfdisk. Although cfdisk is ostensibly easier to use, fdisk is the tried-and-true favorite for disk partitioning. Here are a few guidelines for using the Linux version of fdisk to help you know what to expect.

First, invoke fdisk as the superuser with the device name:

# fdisk /dev/hda

The number of cylinders for this disk is set to 1940.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
 (e.g., DOS FDISK, OS/2 FDISK) 

Command (m for help): m

You can obtain a display of your partition table by using the p, or print, command. A new partition is created with the n, or new, command, and the w, or write, command will write the new partition table to disk. Once you enter the new command, fdisk will need to know if you are creating a logical or primary partition:

Command (m for help): n
Command action
   l   logical (5 or over)
   p   primary partition (1-4)
l
No free sectors available

Command (m for help):

As you can see, fdisk reports "No free sectors available" when there is no free space left on the disk. If free space is available, fdisk will instead ask for the partition number. If you enter "p," for primary partition, you will see the following prompt:

Partition number (1-4):

For a logical partition, you will have the choice of

Partition number (5 or over):

Then you can enter the beginning cylinder number for the new partition. fdisk will recommend a default number, like this:

First cylinder (1-1940, default 1): 1

which you can choose to accept. Next, you need to enter the last cylinder or the size of the partition:

Last cylinder or +sizeM or +sizeK (1-1940, default 5721): +1G

At this point, fdisk will assume that this is a regular Linux partition, identified by the hexadecimal number 83 in the "ID" column in a partition table. The partition type is changed with the t, or type, command in fdisk. The available partition types for fdisk are obtained with the l, or list, command. Here is a partial listing of the available types:

ID   System
82   Linux swap
83   Linux
85   Linux extended
8e   Linux LVM

It is important to know that until you run the write command, anything you do in fdisk will be temporary—which is actually good if you need to bail out of fdisk for any reason.

Reorganizing Partitions and the File Management System

Because each partition contains its own file management system, resizing a partition involves resizing both the file management system and the partition. The repartitioning tools available are therefore dependent on the type of file management system used. For ext2/ext3 systems, there is a choice of resize2fs used with fdisk, GNU Parted, or Partition Magic. For ReiserFS, the choice is limited to resize_reiserfs used with cfdisk, because GNU Parted is still being refined for use with ReiserFS.

Both resize2fs and resize_reiserfs resize only the file management system and require the use of a separate partition resizer, either fdisk or cfdisk. I have personally used GNU Parted to repartition ext2 partitions; it is a reasonably simple program to use. GNU Parted's support for ReiserFS is due to become more robust in the future. Partition Magic is a commercial program for DOS and Windows but can be used with Linux ext2/ext3 partitions if run from the bootable floppy or CD-ROM that comes with it.

Although the actual commands depend on which system you are changing to, the general procedure for changing a file management system involves

  • Backing up files on the partition
  • Removing files from the partition
  • If you are using fdisk, possibly dropping a partition to make room for two smaller ones
  • Making the new filesystem with the appropriate commands. For example, to create an ext2 filesystem you would use

# mke2fs /dev/hda5 15088

The block count can optionally be specified, as it is above (15,088). The only exception to the above sequence of events is in migration from an ext2 system to an ext3 system with a command such as

# tune2fs -j /dev/hda3

although a backup is still in order.
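
The general procedure can be rehearsed safely as a dry run before anything is changed. The sketch below shows one possible ordering of the steps, not a prescribed method: every command is printed instead of executed, the backup path is hypothetical, and the device and mount point are taken from the listings earlier in this article. Because making a new filesystem discards whatever the partition held, the step of removing files is folded into the mke2fs step here.

```sh
# Dry run: each step is printed, not executed. Change the body of run()
# to "$@" only after checking every command against your own system.
run() { echo "+ $*"; }

device=/dev/hda5              # partition used in the mke2fs example
mountpoint=/boot              # where the df listing shows it mounted
backup=/tmp/boot-backup.tar   # hypothetical backup file on another partition

run tar -cpf "$backup" -C "$mountpoint" .   # 1. back up the files
run umount "$mountpoint"                    # 2. take the partition offline
run mke2fs "$device"                        # 3. make the new filesystem
run mount "$device" "$mountpoint"           # 4. mount it again
run tar -xpf "$backup" -C "$mountpoint"     # 5. restore the files
```

Reviewing the printed commands first is a cheap safeguard, since step 3 cannot be undone once it is run for real.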

Mounting a Partition

A partition is not available in Linux until it is mounted by a user with superuser privileges. For Linux partitions listed in the /etc/fstab file, mounting happens automatically when the system boots. For CD-ROM and floppy-disk drives, it is usually only a matter of clicking on the appropriate icon.
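
To see which entries in /etc/fstab are candidates for automatic mounting, the file can be filtered on its options field. This sketch assumes the standard fstab layout (device, mount point, type, options) and treats any entry carrying the noauto option as one that is skipped at boot. The function name boot_mounts is made up for this example, and it accepts an alternative file in the same format.

```sh
# List "mountpoint type" for fstab entries that are mounted automatically.
boot_mounts() {
    awk '$1 !~ /^#/ && NF >= 4 && $3 != "swap" {    # skip comments, short lines, swap areas
        n = split($4, opt, ",")
        auto = 1
        for (i = 1; i <= n; i++)
            if (opt[i] == "noauto") auto = 0        # noauto entries wait for an explicit mount
        if (auto) print $2, $3
    }' "${1:-/etc/fstab}"
}

# Only read the live file where it exists.
if [ -r /etc/fstab ]; then
    boot_mounts
fi
```

Swap areas are left out because they are enabled with swapon rather than mounted. Removable media such as CD-ROM drives are commonly marked noauto, which is consistent with their being mounted on demand instead of at boot.
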

The options available for use with the mount command are dependent on the file management system. For example, you can specify ext3 journaling options as follows:

# mount -t ext3 -o data=journal /dev/hda9 /home

Before removing a floppy disk or a CD-ROM from its drive, you must explicitly unmount it, using the command

$ umount /media/floppy

Prior to Linux 2.4, a filesystem could be mounted only once. Now, however, there is no limit on the number of times a filesystem can be mounted.

Conclusion

The Linux filesystem is a multifaceted concept. This discussion was meant to serve as a basis for further research into its usefulness and desirability according to your own requirements.

In Part 2 of this article we'll examine cluster filesystems, including OCFS.


Sheryl Calish (scalish@earthlink.net) is an Oracle developer, specializing in Linux, for Blue Heron Consulting. She is also funding chair for the Central Florida Oracle Users Group and marketing chair for the IOUG Linux SIG.

