Katedra počítačů FEL ČVUT
Karlovo nám. 13
121 35 Praha 2
e-mail: zemanekp@fel.cvut.cz
Key words: operating system, performance measurement, performance tuning
Abstract: This article defines basic measures of operating system performance. We present useful tools for performance monitoring in the popular UNIX operating system, as well as methods for interpreting the results of measurement. Methods for increasing the performance of processes are discussed. The performance of processes can be improved by reducing the demand for process resources or by increasing the operating system resources. Finally, methods for tuning disks and file systems are presented. The performance of the ORACLE database system running on the LINUX operating system is given as a case study.
1. Introduction - General Principles of Performance Monitoring and Operating System Tuning
Present-day information technologies need to be managed to meet the requirements of high performance. Different users in different environments may have very different response time needs and expectations. In a broader sense, system performance refers to how well the computer resources accomplish the work they are designed to do. The performance of any computer system may be defined by two criteria:
Response time (User Perspective): Response time is the time between the instant the user asks the system for some information (e.g. the user presses the return key) and the receipt of the desired information from the computer.
Throughput (IT Management Perspective): Throughput is the number of transactions accomplished in a fixed time period.
Of the two measures, throughput is the better measure of how much work is actually getting done; however, response time is more visible and is, therefore, used more frequently.
Expectations and needs define good or acceptable performance. Setting expectations correctly is essential. Because needs vary from site to site, the ability to control the system's resource use and response time becomes the key factor in achieving customer satisfaction and the perception of good performance. Basically, the operating system should maintain a high throughput by efficient process scheduling and resource sharing.
Most interactive users prefer a small and stable response time which gives them a sense of performance predictability. This is something to keep in mind when monitoring the system for workload balancing and system tuning.
The performance of any IT system is affected by the following factors:
Managing system performance means:
A resource is a bottleneck if the size of a request exceeds the available resource. In other words, a bottleneck is a limitation of system performance due to the inadequacy of a hardware or software component, or of the system's organisation.
Fig.1 Two ways to solve a bottleneck
There are two ways to solve a bottleneck (see Fig.1) - increase the size of available resource or decrease the size of the request.
Which of the two methods to choose depends on each individual case. Usually, adding to the size of the resource involves system tuning (SW) or upgrading (HW). Reducing the size of the request involves system and workload balancing.
If we are tuning a real system, we have to follow several general guidelines:
Traditional computer bottlenecks usually are associated with CPU, memory and the disk. UNIX systems can have these problems, too, but the complex interaction among resources leads to further complexities. Because UNIX is often the operating system of choice in networked environments and for use in graphics workstations, there are two other bottlenecks that play an important role in performance analysis: networking and graphics.
Another type of bottleneck that is characteristic of UNIX systems relates to kernel resources. This is due to the kernel architecture of UNIX, which is quite different from that of the usual mainframe computer operating system.
2. Tools for Performance Analysis
In general, performance tools are of two types: instantaneous and logging. Instantaneous tools give you a snapshot of the system's performance at the moment you are running the tool. A logging tool samples performance data over time to a disk file. Data collected to the file may be used for further processing.
Instantaneous tools are most useful for diagnosing system problems which are constant and current. The on-the-spot performance view can assist in quickly isolating the problem which is degrading system performance at the moment you are using the tool.
However, some problems are intermittent or tied to periodic events, and you may not be able to view the system at the moment that the problem reveals itself. Also, some problems are the result of conditions which build up gradually over time. To detect problems of this sort, logging tools that collect data over time may be a more suitable choice.
$ sar -u 2 20
drs drs 4.2 7.7.8 486/EISA 01/09/98
14:13:32 %usr %sys %wio %idle
14:13:34 1 4 0 95
14:13:36 0 0 0 100
14:13:38 0 0 0 100
14:13:40 6 9 23 62
14:13:42 6 24 44 27
14:13:44 6 30 64 0
14:13:46 4 34 61 0
14:13:48 2 30 68 0
14:13:50 4 36 60 0
14:13:52 5 29 65 0
14:13:54 5 31 63 0
14:13:56 1 11 17 70
14:13:58 0 0 0 100
14:14:00 0 3 2 95
14:14:02 1 2 0 97
14:14:04 0 4 0 96
14:14:06 1 5 9 84
14:14:08 26 23 37 13
14:14:10 15 13 71 0
14:14:12 0 1 11 88
Average 4 15 30 51
$
Ex. 1 sar -u output
System Activity Reporter (sar) [2] is the oldest tool available for collecting performance data in the UNIX operating system. It samples the internal counters of the UNIX kernel that keep track of requests, completion times, I/O block counts and so on. It then calculates rates and ratios and reports on them. sar reports are useful for gaining an insight into what is happening in the system and where the bottlenecks are occurring.
sar will collect a very large amount of data, particularly when invoked for long periods of time. When sampling over many hours, it is recommended that the sampling interval be no less than 30 seconds. The system load produced by sar is fairly small, typically 1-2 percent.
In its simplest form, sar can be set up to gather data over a period of time by entering the following command, in which t is the number of seconds between samples and n is the number of samples:
sar t n
If you wish to collect data automatically, or over a long period of time, it may be useful to set up a crontab entry. Typically, you will use a shell script called sa1 that calls sadc. sadc, which is also called by sar, produces a binary file (/var/adm/sa/sadd, in which dd is the day of the month).
Once data has been collected, there are different ways to generate reports. If we want to generate reports automatically on a routine basis, we can use sa2 in a crontab entry. sa2 is a shell script that calls sar and generates reports from the binary files created earlier and stores the reports in /var/adm/sa/sardd, in which dd is the current day.
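As an illustration, a crontab for the user sys might contain entries like the ones below. The paths and intervals are only an assumption modelled on common System V installations; check the actual location of sa1 and sa2 on your system.
# sample every 20 minutes during working hours (sa1 calls sadc)
0,20,40 8-17 * * 1-5  /usr/lib/sa/sa1
# write a daily report from the binary file in /var/adm/sa (sa2 calls sar)
5 18 * * 1-5          /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A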
iostat is a BSD (Berkeley) based tool used primarily to report disk activity. Its output can be more useful than that of sar -d, although sar is a more complete tool.
vmstat is another BSD based tool. It reports statistics kept about virtual memory, processes, traps and CPU activity. It will also, optionally, report on disk activity, the number of forks and vforks since system start-up and the number of pages of virtual memory involved in each kind of fork, or summary structures of several kinds of paging-related events.
vmstat and iostat report statistics in a similar way. They take an interval of seconds as an argument, as well as a count of how many times to repeat. They take data from the internal counters of the kernel and print their report to the standard output.
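For example, to sample every 5 seconds, 10 times (the exact column set differs slightly between UNIX versions):
$ vmstat 5 10     # virtual memory, paging and CPU statistics
$ iostat 5 10     # per-disk transfer statistics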
For analysing the current state of the system, there are other specific tools like HP GlancePlus. This is an on-line diagnostic tool that monitors ongoing system activity. System resources are monitored and displayed at the global or individual process level via multiple screens.
Another HP-UX tool, MeasureWare Agent, is used to gather information on a more long-term basis. Once gathered, the data can be used to show how the system performed over a period of time, and its output can be used to analyse trends, with reports displayed in graphical form for ease of viewing. To display the log file's data, you may use the PerfView graphical interface.
3. Performance Tuning of Processes
3.1 CPU Architecture and CPU Tuning
On RISC processors, the CPU loads instructions from memory and runs them at a rate of one instruction per cycle. Therefore, the faster the clock speed, the faster each instruction is executed. However, the CPU can run one instruction per cycle only if the associated hardware can supply the instructions at a sufficient rate. Otherwise, the CPU waits. To minimise the time that the CPU spends waiting for instructions and data, we use a data and instruction cache.
The cache [1] is a very high speed memory that can be accessed in one CPU cycle with the contents being a subset of main memory. Typical access time of the cache is 10-20ns compared to 80-90ns of normal RAM. As the CPU requires instructions and data, they are loaded into the cache along with additional instructions and data that the kernel predicts the CPU may use next; therefore, the size of the cache has a very large bearing on how busy the CPU is kept. The larger the cache, the more likely that it will contain the instructions and data to be executed.
The typical ratio between the size of a cache and the size of RAM is 1:1000 (e.g. 1kB for 1MB RAM).
The addresses of pages of virtual memory (see paragraph 3.3) that are currently in physical memory are stored in the page table. Modern processors use a multilevel page table. For example, the SUN SPARC processor uses a three level page table [1].
The TLB (different from the data and instruction cache) is used to speed up the translation of virtual addresses into physical addresses. The TLB is a cache that stores recently accessed virtual addresses and their associated physical addresses, along with access rights and an access ID. The TLB is implemented as associative memory and is searched directly by content (we do not have to walk through it). The current state of hardware technology does not allow us to implement the whole page table as associative memory. The typical size of a TLB is on the order of 10 k page table entries; for comparison, the whole page table for 1 GB of RAM and 4 kB pages contains about 250 k page table entries.
There is no way to measure the hit rate or efficiency of the processor cache and the TLB. The only way to increase the speed of the processor is to increase the size of the processor cache. The applications can be written to take advantage of specific cache and TLB sizes, but this is difficult.
The complete instruction execution process - including fetching the instruction, executing the instruction, and storing the results - can be broken into a number of independent machine operations [1]. These operations can be performed in parallel for different instructions. For example, the fetching of an instruction can be performed in parallel with the execution of the previous instruction. This technique is referred to as instruction pipelining. The number of operations performed in parallel determines the depth of the pipeline.
The optimisation in the compiler will sometimes alter the way the code works. In some cases this will affect the actual result, that is, it will be wrong. Pragmas can be used to optimise certain portions of the code while leaving other parts untouched.
3.2 Processes and Scheduling
Processes are placed in RAM and their instructions are performed by the CPU. Process management influences the performance of system memory and processor.
Every process runs in two modes. While executing instructions of its own code, it runs in user mode. Some operations the process is not allowed to perform itself - I/O operations, interprocess communication, network communication, etc. Such operations are performed by the operating system kernel, which is resident in RAM. The user level is connected to the kernel level through the system call interface. At the kernel level, the operating system performs functions that direct the actions of the hardware. The kernel is the only part of the operating system that has direct access to the hardware.
Fig.2 Life cycles of a process
UNIX is a multi-user and multitasking system. Its performance varies, depending on the number of users, the type of user tasks, and the configuration of the system hardware and operating system. How long the execution of a process really took (i.e. measured on a wrist watch) and how long it spent in the user and system modes is measured by the time command.
The time command executes a command and, after its completion, prints the elapsed (real) time, the time spent in system mode and the time spent in user mode. timex is a variant of time that will also, if requested with the -p option, pick up information from the process accounting records (/var/adm/pacct [2]).
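A minimal example follows; the output values are illustrative and the exact format depends on the shell and system.
$ time cc -o prog prog.c
real    0m4.2s       # elapsed (wall-clock) time
user    0m3.1s       # CPU time spent in user mode
sys     0m0.7s       # CPU time spent in kernel mode
$ timex -p cc -o prog prog.c   # additionally reports data from the process accounting records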
The life cycle of any process [6] begins when the process is created from its parent by a fork system call [5]. The process (or part of it) is placed into RAM, where it begins execution of its code. If there is a lack of free RAM or the process has to wait for an I/O operation to finish, it may wait in memory or it may be placed on disk. Such processes are placed in swap space, a disk area that usually lies outside the file system.
Any process may be stopped by a signal, an asynchronous event generated by hardware (processor), software (kernel, application program) or user demand (keyboard).
Once the required data is available in memory, the process waits for the CPU scheduler to assign the process CPU time. CPU scheduling [1] forms the basis for the multi-user, multitasking operating system. By switching the CPU between processes that are waiting for other events, such as I/O, the operating system can function more productively. The CPU lets each process run for a pre-set amount of time called quantum or time slice until the process completes or is pre-empted to let another process run.
The default time slice is usually 100 ms. It may be changed by kernel parameter (e.g. timeslice in HP-UX). Every context switch has some overhead [1] (around 5 ms). Decreasing the timeslice parameter may lead to large system overhead. If we increase it too much, it may block the users and applications. We may see the rate of context switch in the output of vmstat and sar -w commands.
$ sar -w 5 10
drs drs 4.2 7.7.8 486/EISA 01/09/98
14:18:56 swpin/s pswin/s swpot/s pswot/s pswch/s
14:19:01 0.00 0.0 0.00 0.0 30
14:19:06 0.00 0.0 0.00 0.0 56
14:19:11 0.00 0.0 0.00 0.0 98
14:19:16 0.00 0.0 0.00 0.0 101
14:19:21 0.00 0.0 0.00 0.0 104
14:19:26 0.00 0.0 0.00 0.0 103
14:19:31 0.00 0.0 0.00 0.0 108
14:19:36 0.00 0.0 0.00 0.0 108
14:19:41 0.00 0.0 0.00 0.0 87
14:19:46 0.00 0.0 0.00 0.0 102
Average 0.00 0.0 0.00 0.0 90
$
Ex. 2 sar -w output
Each process has a priority associated with it at creation time. Priorities are dynamically adjusted during process execution (every 40ms). Usually there are three priority classes:
Real-time
System
User
For example, in the HP-UX operating system, the priorities range from 0 to 255. Priorities 1-127 are reserved for real-time processes. The highest priority real-time process pre-empts all other processes and runs until it sleeps, exits, or is pre-empted by a real-time process of higher or equal priority. Real-time processes of equal priority run in a round-robin fashion. A CPU-bound real-time process will halt all other interactive use of the system.
The priority of a process may be set by the superuser before the process starts, or it may be changed during process execution. SVR4 uses the priocntl command, while HP-UX uses the rtprio command. The POSIX-compliant command is rtsched.
Timeshare processes are grouped into system and user processes. Priorities 128-177 are reserved for system processes, priorities 178-251 are for user processes. Priorities 252-255 are used for memory management. Timeshare processes run until the end of timeslice.
Realtime processes and timeshare processes from the top of system class are usually not signalable, i.e. signals sent to them are stored until the process is completed [4].
If we do not set the priority of a process, it is defined by the kernel according to actual system load, the memory requirements of the process and the nice value. The user can make a slight modification of process priority by the nice value. The nice value is the only control the user has to give less priority to a time share process. The default nice value is 20. Therefore, to make a process run at a lower priority, it should be assigned a higher nice value, for example 39, using the nice command. The superuser can assign a lower nice value to a process, effectively running it at a higher priority.
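For example (the job name is only an illustration; the exact option syntax of nice differs between System V and BSD derived systems):
$ nice -19 long_batch_job &    # add 19 to the default nice value of 20, i.e. run at nice 39
# only the superuser may give a negative increment and thus raise the priority:
# nice --10 important_job &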
Time share processes lose priority as they are executed. When a process waits, its priority is increased. At every clock tick, the hardclock routine is run, which sets the CPU usage for the running process. Every 40ms, the running process priorities are adjusted.
The standard UNIX scheduler runs on a priority-based round robin scheme, which means basically that there is a logical queue at each priority level. Whoever runs the most processes gets the most CPU time. There are special schedulers (like HP-UX PRM - Process Resource Manager) that guarantee each user (not each process) his fair share of CPU.
3.3 Memory Management - Paging and Swapping
Modern operating systems allow developers to design programs with 32 bit virtual address space. Such programs may have 4 GB of virtual memory. If we imagine a real multitasking system with hundreds of running processes in real physical RAM (typical size from 32 MB to 1GB), we have to develop specific strategies to optimise the memory management.
The virtual memory of each process is divided into 4 segments - code (instructions, also called text), data, shared code and shared data [4]. Each segment may have up to 1 GB. The data segment is divided into the initialised data (data that have some value assigned), uninitialised data (variables without values), stack, heap and environment (see [1]). We may find out the size of the text, initialised data and uninitialised data segments using the size command. The memory layout of a process is shown in Fig.3.
(a) (b)
Fig. 3 Virtual memory layout of one process (a) and of physical RAM (b)
Processes are stored in executable files on a disk. Kernel parameters prevent users from running excessively large processes: the size of the text segment is limited by the kernel parameter maxtsiz, and the size of the data segment by maxdsiz. Shared memory segments are controlled by the shmmni and shmmax kernel parameters.
If the amount of free memory decreases below 2 MB, the memory management daemon starts the paging process. Pages (usually 4 kB) of data segments are written to swap space - a partition of a disk that lies outside the file system.
If paging is not able to free a sufficient amount of RAM, the memory management daemon starts swapping. All pages of some processes are saved to swap space. This mechanism is also called process deactivation.
Memory management is controlled by a daemon process called vhand or pagedaemon. Do not confuse the process swapper with the memory management daemon: swapper (the PID 0 process) is an obsolete BSD name for the process scheduler [7]. The scheduler is now usually called sched [6].
In the UNIX operating system, a two-handed clock algorithm is usually used to choose which pages to free. Each page has two bits - a reference bit and a modified bit. The first (reference) hand of the clock algorithm turns off the reference bit of each memory page it visits. If the reference bit is still zero when the second hand gets to the page (the page has not been referenced since the first hand passed by), the page is freed (i.e. it may be overwritten by a new page). If the page is clean (referenced but unmodified), it is also added to the list of free pages. If the page is dirty (referenced and modified), it is first written to the swap device and then added to the list of free pages.
There are several methods used by memory management daemon to find pages to be paged out from RAM to disk. An overview of page replacement algorithms may be found in [1].
Because the code of a process is usually not modified during execution, we do not have to save it. We only have to re-read it from the file (e.g. /usr/bin/ls) where the process is stored on the disk. Paging and swapping concerns only the data pages of a process.
The paging daemon starts paging if the free memory falls below a constant called LOSTFREE. The value of LOSTFREE is calculated during boot and is typically 2MB. If there is less free memory than DESFREE, the deactivation of processes is started. Any deactivation activity represents a condition in which normal paging is inadequate to handle the memory demands. Thus, any deactivation activity is a sign that the system is reaching the limits of its memory.
The swapping subsystem reserves swap space at process creation time, but does not allocate space from the disk until pages need to go out to the disk [6]. Reserving swap space at process creation protects the swapping daemon from running out of swap space. When the system cannot reserve enough swap space for a new process it will not allow the process to be started. Additionally, as running processes try to dynamically acquire more memory, more swap space is reserved. If there is insufficient swap space for additional reservation, the process will be killed.
With very large memory systems, it becomes less desirable to have enormous amounts of disk space reserved for swap. Using pseudo-swap, this requirement is relaxed. The use of pseudo-swap avoids the waste of resources that could occur if, for example, a 1GB (virtual memory) process were running and had reserved 1GB of disk space for swap, and then did not need to use the swap space.
Use of pseudo-swap allows you to configure less swap space on disk. Pseudo-swap space is set to a maximum of three quarters of system memory, because the system can begin paging only once three quarters of the available system memory has been used. The unused quarter of memory provides a buffer between the system and the swapper, to give the system more breathing room.
3.4 The Buffer Cache
The data read from and written to a disk are stored in buffer cache. This is a pool of RAM memory designed to minimise the amount of time that the system spends physically accessing the disk. The cache works in two ways. First, when a read is made from disk file, the data is also copied to the buffer cache. Second, when a write is made through the block device, the data is copied to the buffer cache. The contents of the buffer cache are written to the disk asynchronously by syncer daemon [2]. Every 30 seconds syncer calls the sync() system call [4]; sync() writes the contents of buffer cache back to the disk.
The buffer cache contains not only the data from files, but also metadata - superblocks of mounted file systems, inodes of open files, cylinder group information tables, etc. The metadata may be updated synchronously, just after each change of the metadata (slower but safer), or asynchronously, by syncer (quicker but less safe). The default is synchronous writes. Asynchronous updates of metadata may be allowed by setting the kernel parameter fs_async to 1.
The buffer cache may have a fixed size (static buffer cache) or it may be configured dynamically. The static approach is historically older; the default size of static buffer cache is usually 10% of physical RAM size. Static buffer cache is defined by kernel parameters nbuf and bufpages. nbuf defines the number of buffer headers, i.e. the number of files that may have a buffer cache entry. The size of buffer cache is defined by kernel parameter bufpages. The size is given in 4kB pages.
If bufpages and nbuf are both set to 0, then the dynamic buffer cache is activated. The range in size of dynamic buffer cache can be specified by dbc_min_pct and dbc_max_pct (dynamic buffer cache minimal and maximal percentage). A fixed size buffer cache may also be configured by setting dbc_min_pct equal to dbc_max_pct. Note that this approach will fix the buffer cache size based upon a percentage of physical memory rather than a fixed number of pages.
The buffer cache operates like paged memory. If there are not enough free pages, some pages are freed from buffer cache. The kernel uses the same algorithm (e.g. LRU) for finding the pages of buffer cache that will be freed.
The buffer cache is reasonably easy to adjust without doing the system harm. It can have a very large impact on system performance and should be one of the main areas to look at when tuning the system. The allocation of a fixed buffer cache obviously reduces the memory available for processes. Too large a value can result in excessive paging.
3.5 Correcting CPU and Memory Bottlenecks
Understanding CPU bottlenecks involves understanding who uses the CPU. Users can be divided into two categories - application processes and the system (kernel services).
The user's applications occupy the CPU doing computations. They also make requests of the operating system via system calls that result in the kernel using the CPU. The kernel, in addition to servicing the users, also carries out a number of management functions (such as process management and memory management).
For the whole system, sar -u will report on the user, system, wio (waiting for I/O) and idle states. In an ideal situation, the system should spend about 80% of its time in user mode, about 15% in kernel mode and 5% idle.
$ sar -q
drs drs 4.2 7.7.8 486/EISA 01/09/98
00:00:01 runq-sz %runocc swpq-sz %swpocc
01:00:00 1.5 2
02:00:00 1.3 3
03:00:01 1.1 8
03:55:29 unix restarts
04:00:00 2.2 17
05:00:00 1.0 0
06:00:00 1.5 0
07:00:00 1.2 2
08:00:01 1.2 1
08:20:01 1.4 14
08:40:01 1.7 3
09:00:01 1.3 3
09:20:00 1.7 4
09:40:01 1.6 6
10:00:01 1.7 8
10:20:01 1.3 2
Average 1.4 5
$
Ex. 3 sar -q output
If the CPU has too much work to do, it becomes a bottleneck. The symptoms of CPU bottleneck are:
Quite often, especially with the CPU, a bottleneck that appears to point to one conclusion may actually be hiding another. For example, if we have severe memory problems, the CPU will spend many cycles executing the memory management software. Increasing the amount of memory will relieve the problem, and in so doing frees the CPU.
$ sar -c 2 20
drs drs 4.2 7.7.8 486/EISA 01/09/98
14:39:55 scall/s sread/s swrit/s fork/s exec/s rchar/s wchar/s
14:39:57 51 10 6 0.00 0.00 1620 425
14:39:59 39 9 4 0.00 0.00 1617 398
14:40:01 152 14 10 1.00 1.00 2250 511
14:40:03 242 18 9 1.99 2.49 3582 901
14:40:05 373 26 11 3.43 2.94 12710 918
14:40:07 632 49 84 1.98 2.48 56546 2022
14:40:09 409 28 90 0.50 0.50 7523 2911
14:40:11 255 5 108 0.00 0.00 1614 3073
14:40:13 563 12 146 0.00 0.00 1643 4107
14:40:15 586 15 116 0.00 0.99 2550 3810
14:40:17 446 18 77 3.47 2.97 4811 2576
14:40:19 335 14 106 1.49 1.49 44991 3005
14:40:21 349 28 68 0.50 0.50 7785 1874
14:40:23 433 14 98 0.99 0.99 2942 2810
14:40:25 321 31 11 1.49 1.49 62665 774
14:40:27 54 10 5 0.00 0.00 1653 434
14:40:29 711 39 9 1.00 0.50 9989 679
14:40:31 32 5 1 0.00 0.00 1622 397
14:40:33 35 7 2 0.00 0.00 1623 398
14:40:35 326 55 9 6.93 8.42 13872 769
Average 317 20 49 1.24 1.34 12179 1641
$
Ex. 4 sar -c output
Two basic approaches may be taken to solve a CPU resource shortage problem: increase the amount of resource available, or decrease the demand for the resource.
The first approach involves adding another CPU to a multiprocessor system. With multiprocessor systems, users should be careful to ensure that the application is suited for the multiprocessor environment. Some applications show significant improvements; others show very little improvement.
Other solutions include increasing the cache size of the CPU or adding a floating point coprocessor. Remember that integer multiplications are carried out in the FPU.
The second approach for tuning the CPU is to reduce the demands made on it. This process can be difficult, as it involves having a good deal of knowledge about how the system works and what jobs it runs.
One problem that often arises is everyone getting in and starting up all their utilities at the same time, which puts an unusually high demand on the CPU. At this point, you can stop or suspend any jobs that can be deferred until a later time. The ultimate CPU load is not reduced, but is distributed more evenly.
Another option is to use the nice command. Typically, it has little effect on the system, but it is easy to do and worth trying if the jobs are not critical. If really important processes are suffering, using the rtprio command will certainly help.
Recent developments in software technology have brought products such as taskbroker and DCE to the marketplace. Taskbroker can be used to offload jobs to other systems in the network; it functions by calculating things, such as processor speed, current usage, and so on, to decide which system is most capable. DCE (distributed computing environment) allows an application developer to spread his or her application among many different systems and even different architectures. This requires modification of the existing source code, but the benefits can be tremendous.
We may also use the system call plock() to help important processes. If a process calls plock(), it is locked into memory, i.e. its pages are not paged out and the process is not swapped (deactivated). The amount of RAM that may be locked is limited by the kernel parameter unlockable_mem, which defines the size (in bytes) of RAM that has to remain available for pageable and swappable processes. Locking processes into RAM may help them, but the system will suffer from memory shortage - the paging and swapping activity will increase and less CPU time will be devoted to user processes. This may slow down even the locked processes.
Reducing the kernel limits for process memory segments may also help. We may reduce the maximum size of the code segment (kernel parameter maxtsiz), of the data segment (maxdsiz) or of the stack segment (maxssiz). We may also lower the limit on running processes per user (maxuprc) or the number of processes running in the system (nproc).
The only memory resident program is the operating system kernel. Making the kernel smaller can be achieved by eliminating unused drivers and subsystems. Reducing the size of kernel tables can help. Parameters such as nproc, ninode and maxusers can all enlarge the kernel unnecessarily if they are set too high.
4. Disk and File System Tuning
4.1 Universal File System
The Universal File System (ufs) is one of the most popular file systems used on UNIX platforms. It is used in the Solaris 1 and 2 operating systems (SUN), HP-UX versions 9 and 10 (HP), SINIX 5 (Siemens), ICL UNIX (ICL) and many others. ufs was developed at the University of California at Berkeley for version 4.3 BSD UNIX [4]. ufs is a redesign of the AT&T s5 file system, which was developed when disks were small; remember that early versions of UNIX used two 5 MB disks. As disk sizes grew by orders of magnitude, the initial design of the file system began to show its age with respect to performance.
Fig.4 The s5 file system layout.
The s5 [4] file system consists of a superblock (8kB), an inode table and a data area. Superblock contains strategic information about the whole file system, e.g. total number of free (unallocated) data blocks, the head of the list of free blocks, the number of free inodes, the head of free inode list.
Information nodes (inodes) in s5 are 64B long. They contain all information about the file except its name (stored in directory) and the data (stored in data blocks) addressed from inode.
The size of data blocks is fixed at 512B. We have 10 direct 3B addresses of data block in the inode. If this is not sufficient (the file is longer than 5120B), we use a single indirect address, i.e. an address of one block that contains addresses of data blocks. If this is not enough, we may use double or triple indirect address.
In ufs [7] the disk is further divided into cylinder groups. A cylinder is a collection of tracks formed as the head-disk assembly positions all of the heads on multiple platters at the same distance from the edge of disk surfaces. The first cylinder group is formed from cylinders 1 to 16, the second cylinder group from cylinders 17 to 32 etc.
The cylinder groups were introduced to increase the performance of the disk. Files are stored preferably in one cylinder group, in order to minimise the movement of disk heads. The file system parameter maxbpg specifies how much of a cylinder group any one file can occupy. By default this is 25%. For example, if a cylinder group is 4MB, a file can only occupy 1MB of the cylinder group before having to expand further into a new cylinder group. Decreasing this value will cause large files to be spread across the file system; increasing maxbpg causes files to be more localised. If your file system comprises mainly large files, maxbpg should be increased. We may change the value of maxbpg even on an existing file system by using the tunefs utility with the -e option.
Remember that the tunefs utility should be used only on an unmounted file system, because it operates directly on the physical superblock. If we use it on a mounted file system, the information changed in the physical superblock can be overwritten by the in-memory copy written back by the syncer daemon.
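A small sketch of the procedure described above (the device and mount point names are assumptions; remember to unmount the file system first):
$ umount /data                          # tunefs should be run on an unmounted file system
$ tunefs -e 4096 /dev/dsk/c0t5d0        # let one file occupy up to 4096 blocks per cylinder group
$ mount /dev/dsk/c0t5d0 /data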
Each cylinder group contains a copy of primary superblock. A copy of the superblock is located in each cylinder group so that any single track, cylinder, or platter can be lost without losing all copies of the superblock. If the primary superblock is not readable, we may use any of the redundant copies for file system reconstruction (using fsck -b command).
The cylinder group information table contains the dynamic parameters of the cylinder group. Its definition is stored in /usr/include/sys/fs.h header file in every UNIX that uses ufs.
ufs inodes are 128B long (see Fig.5). The number of inodes allocated per cylinder group is determined when the file system is created. It cannot be modified after the file system has been created. By default, one inode is allocated per 2 kB of disk space (HP-UX Version 9, Solaris 1) or per 6 kB of disk space (HP-UX Version 10, Solaris 2). If we allocate one inode per 2 kB, as with the older default, 6.25% of the total disk space is consumed by the inode table. For example, for a 2GB disk we create 1 million inodes of 128B each.
The ratio between the number of inodes and the file system size is defined by file system parameter nbpi (bytes per inode). It may be changed during file system creation by the option -b of mkfs utility.
The data is stored in data blocks and fragments. The default sizes are 8kB blocks and 1kB fragments. Fragments are used for the last portion of data. For example, an 11kB file is stored in one block and three fragments. Block sizes can be 4kB, 8kB, 16kB, 32kB, 64kB, while fragment sizes are 1kB, 2kB, 4kB and 8kB. Block and fragment size can be changed by mkfs utility using -b and -f options.
There is no possibility of defragmentation in ufs. The only way to use the disk space efficiently is to store the entire file system on a tape and then write it back to the disk.
4.2 Journaled File System
The journaled file system was developed by Veritas, Inc, in 1992. It is also referred to as the Veritas Extended File System (vxfs). It offers fast file system recovery and on-line features - on-line backup, resizing, and reorganisation (defragmentation).
Generally, we have two types of information on any disk. First, data blocks store the actual file contents. Second, we have additional structures that are used to speed up access to the data blocks, like the superblock, cylinder group information tables and inodes. These additional data structures are referred to as metadata.
When a change is made to a file within the file system, such as when a new file is being created, a file is being deleted, or a file is being updated, a number of updates must be made to the metadata. For example, if we delete a file (using rm command), we must add the removed data blocks to the list of free blocks, and we must add the unallocated inode to the list of free inodes. We also have to change the number of free blocks and inodes in the superblock, and we have to change the size of the directory where the file resided. All metadata updates are first made to the copy in the memory. Next, all the metadata updates are recorded in a single intent log transaction record (in the RAM). The record is then flushed out to a file on the disk called intent log.
There is only one record in the RAM. If we perform the next transaction with the file system metadata in RAM, a new record is created and written to the end of the intent log. The intent log contains recent file system data structure updates. After a failure the system checks the intent log and performs the required roll back or roll forward.
The intent log is a circular log located at the beginning of the file system on disk, following the superblock. A record of all metadata updates for a given transaction is written with one disk access. This ensures that all the metadata updates are atomic.
If the operating system were to crash, the file system can quickly be recovered by applying all changes recorded in the intent log. Since only entire transactions are logged, there is no risk of a file change being only partially recorded. Either a whole transaction is lost or the whole transaction is logged in the intent log.
The vxfs file system is divided into smaller units called allocation units. An allocation unit is similar to the ufs cylinder group. A vxfs block represents the smallest amount of disk space that will be allocated to a file. It is recommended that the default 1kB always be used.
When you create a file, vxfs will allocate an extent to the file. An extent is a group of contiguous blocks. If contiguous space is not available, vxfs will find another extent. The inode keeps track of the beginning of the extent and how many blocks are contained in it.
4.3 Disk Configurations
There are three ways to divide disks into parts. First, we may use the whole disk approach, i.e. we do not divide the disks into smaller units. A disk can contain one file system or raw partition. Optionally, a swap area may follow the file system. The root disk may contain a boot area in addition to the root and swap partitions.
Some of the limitations of this method include:
We can overcome the last drawback by dividing the disk into partitions of fixed sizes. Using this approach we may construct several file systems on one disk. Moreover, it is desirable to set fixed limits on partition sizes to guarantee that no file system will grow without limitations.
For more flexible configurations, the concept of virtual disks has been introduced. We will explain it with reference to the Logical Volume Manager (LVM) developed by HP. LVM allows a disk to be divided into multiple partitions, referred to as logical volumes. Logical volumes are groups (if possible contiguous sequences) of smaller units called extents. Extents are by default 4 MB in size. Do not confuse LVM extents with vxfs file system extents. Each logical volume has its own device files associated with it. You choose how many logical volumes to create and what size each one will be. LVM allows multiple disks (up to 255) to be combined to create large partitions. The size of one logical volume is limited to 128GB. Logical volumes can also be extended (or reduced) later if the need arises, even across multiple disks.
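A minimal HP-UX LVM sketch (the device, volume group and volume names are assumptions; on a real system the volume group directory and its group device file must also be created first):
$ pvcreate /dev/rdsk/c0t5d0               # prepare the physical disk for LVM
$ vgcreate /dev/vg01 /dev/dsk/c0t5d0      # create a volume group containing the disk
$ lvcreate -L 500 -n lvdata /dev/vg01     # create a 500 MB logical volume (125 extents of 4 MB)
$ lvextend -L 800 /dev/vg01/lvdata        # grow the logical volume to 800 MB later, if needed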
4.4 Correcting Disk and File System Bottlenecks
The symptoms of a disk bottleneck are
Disks that are 100% busy require investigation as the goal is to have balanced I/O across disks. Queue lengths greater than 4 (see sar -q output in example 3) may indicate that excessive time is being spent waiting for the disk, although the request may well have been serviced within the time period that the tool is reporting in.
Buffer cache hit ratios (see sar -b output in example 5) should be greater than 75% for writes and 90% for reads. However, the application may not suit the buffer cache, and a buffer cache that takes too much of the memory may itself cause disk problems through increased paging.
We can take a variety of approaches to deal with a disk bottleneck. We have already seen that one bottleneck can manifest itself as another bottleneck (for example, a high process deactivation rate can produce high disk I/O figures). First, we need to be sure that the disk is really a problem; then there are many things we can do to improve the situation.
Hardware upgrading (i.e. increasing of resources) includes
Adding spindles generally will help spread the I/O among many disks. The tendency today is to buy larger disks and put all of your data on them. This is certainly the cost-effective way, but will ultimately lead to lower performance. If at all possible, buy the smallest disks possible for the capacity you require.
$ sar -b 2 20
drs drs 4.2 7.7.8 486/EISA 01/09/98
14:24:18 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
14:24:21 0 0 100 0 0 100 0 0
14:24:23 0 0 100 0 0 100 0 0
14:24:25 0 0 100 0 0 100 0 0
14:24:27 0 0 100 0 0 100 0 0
14:24:29 0 0 100 0 0 100 0 0
14:24:31 28 121 77 0 1 100 0 0
14:24:33 43 176 75 0 3 100 0 0
14:24:35 44 238 82 10 10 0 0 0
14:24:37 36 266 86 0 0 100 0 0
14:24:39 25 191 87 14 14 0 0 0
14:24:41 33 275 88 11 11 0 0 0
14:24:43 38 206 82 0 0 0 0 0
14:24:45 35 185 81 2 2 0 0 0
14:24:47 19 138 86 28 28 0 0 0
14:24:49 40 180 78 0 0 100 0 0
14:24:51 43 192 78 0 0 100 0 0
14:24:53 26 98 73 22 22 0 0 0
14:24:55 36 144 75 3 3 0 0 0
14:24:57 29 207 86 17 17 0 0 0
14:24:59 30 107 72 0 0 100 0 0
Average 25 136 81 5 6 4 0 0
$
Ex. 5 sar -b output
Once you have bought small disks, try to spread the load further among different controllers. This can be done on big servers, where there are extra slots for controller cards.
Disk striping provides no data redundancy. Data are striped across a set of disks to provide load balancing. Striping is also known as RAID 0. Striping may be supported by hardware (i.e. transparent to software) or may be created using virtual disks.
If we use disk striping, it is possible to have several disk drives all working for us at once. We are able to gain some of the advantages of spreading the load over several drives within a single file system. The advantages are very much application dependent. In applications where files can be deliberately placed in different file systems or on different drives, this tends to be the policy.
Disk mirroring uses two sets of disks that maintain identical copies of the data. While writing to a mirrored volume, the system will normally send the request to all of the physical disks in parallel, causing a slight overhead because each of the requests has to be processed. However, most of the time of a disk transaction is spent in the drive actually positioning itself and then writing the data, and this occurs at the same time on both drives.
For reading from mirrored disks, the disk driver checks the request queues for each of the disk drives that can satisfy the request, then sends the request to the drive with the shortest queue. Mirrored disks can handle more than one request at a time, and hence improve performance. Most systems perform more reads than writes, thus a mirrored system is likely to outperform a nonmirrored system.
The previous topic described ways of tuning the disk subsystem by adding more resources. As we know, if we cannot add resources, the second method is to reduce the demand. This may be achieved by configuration solutions:
The solutions mentioned above do not actually reduce demands for disk I/O. What they attempt to do is to rearrange the layout of swap areas, file systems, and files to distribute the load more evenly. This has the effect of allowing more disk accesses to occur in parallel because the requests are to different spindles.
Configuring multiple swap areas on different disks will allow the swapping daemon to interleave the requests. This spreads the paging and swapping more evenly among available disks. Balancing the system I/O across multiple spindles has the same effect.
Dedicating a disk to an application is an attempt to reduce the head movement. However, like all strategies, the more users and the more applications you get, the less effective this strategy is likely to be.
Dedicating a section (partition) to a file access type is an attempt to optimise the use of the buffer cache and the file system block size. If all of the files on the section are accessed sequentially, you want the largest block size you can get (e.g. 64kB in the ufs file system).
If the accesses are random, you want the smallest block size (e.g. 2kB in the ufs file system) to eliminate the wasted information that is transferred when larger block sizes are used. The best performance can be achieved if the size of the filesystem blocks is the same as the size of blocks used by the database runtime module.
Raw I/O minimises the path length of the code that is executed by bypassing the file system entirely. Most common database applications will either allow or specify the use of this type of I/O. The drawback of using raw devices lies in the administration: it is difficult to search for data in a raw device or to archive selected data from a raw device. Moreover, the performance of databases running on raw devices does not show significant improvements compared to the performance of databases running above a file system.
Other solutions of disk bottlenecks include changing the kernel parameters to increase the size of kernel tables related to files (e.g. nfile, ninode, nflocks). We can also increase the size of buffer cache.
Increasing the kernel table sizes may help to decrease the disk activity. The inode table is a cache for recently used inodes, and holds inodes for currently accessed data, as well. Unreferenced inodes may be needed during path name resolution, and for opening previously opened files.
Changing the size of buffer cache should increase the buffer cache hit rate, which will result in fewer disk accesses for file system work. However, as we have already seen, more memory for the kernel means less memory for the user processes, and can lead to increased paging.
Another approach to solving disk bottlenecks involves file system solutions:
When data is written to any disk, the file system tries to place all of the information in the optimal position on the disk. The system has an allocation policy that attempts to spread the directories throughout the disk space, but store a file in the same locality as its parent directory. Therefore, frequently used deep hierarchical directory structure can result in large head movements.
If the disk is empty, the time taken to allocate a new space for a file will be small, as the requested block will almost always be available. The fuller the file system gets, the longer it takes to find a place to put the data. A disk that is over 90% full can take over 50% more time to write the same amount of information than it would if the disk were empty. We can control the space that has to remain free in the file system by the value minfree in mkfs or tunefs commands. minfree is usually set to 10%.
Keeping the directories flat, using relative path names and minimising symbolic links will have the effect of reducing the number of inodes that have to be looked up, and therefore the number of disk accesses required to resolve the path names.
The PATH variable can easily be tuned for increased performance. Every time you issue a command, you will search the directories in the PATH variable list. Frequently used directories should be near the beginning of the list. Try to avoid making the list too long, as incorrect commands will require a search through all of the directories in the list.
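For example, in the POSIX shell (the directory names are illustrative):
$ PATH=/usr/bin:/usr/local/bin:$HOME/bin:.    # most frequently used directories first
$ export PATH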
Finally, we can tune the file system itself. The file system layout can be optimised to the physical disk by using mkfs with the disk details specified. Block and fragment size, as well as bytes per inode, can only be set with the mkfs command. The tunefs command can be used to tune some of the file system parameters that reside inside the superblock, e.g. minfree. It is important to tune the file system when it is unmounted, because tunefs makes changes only to the disk superblock. If we tune the disk while it is mounted, then as soon as it is unmounted, the in-memory superblock (with the unmodified parameters) will overwrite the superblock on the disk.
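A hedged sketch of the whole sequence, using the option letters mentioned above (the exact mkfs syntax differs between UNIX versions, and the device name is an assumption):
$ mkfs -b 8192 -f 1024 /dev/rdsk/c0t6d0    # create the file system with 8kB blocks and 1kB fragments
$ tunefs -m 5 /dev/rdsk/c0t6d0             # later, lower minfree from the default 10% to 5% (file system unmounted)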
5. Case Study - Tuning the ORACLE Database System on LINUX Operating System
5.1 ORACLE Server
ORACLE Server is a database engine that consists of an ORACLE Database and an ORACLE Instance. ORACLE Database is used for dealing with SQL queries and for interpreting programs written in procedural language PL/SQL. PL/SQL is used for manipulation of data flow, handling error situations, etc.
Each ORACLE Database has a physical and a logical structure. The physical structure is defined by means of the underlying operating system (e.g. UNIX) and consists of data files, redo log files (transaction journal files) and checkpointing files. The logical structure of a database is defined by one or more tablespaces and database schema objects (tables, procedures, synonyms, indexes, DB links, clusters, etc.).
ORACLE Instance consists of RAM data structures (SGA, System Global Area) and processes (see Fig. 5). There are two types of processes - user processes (e.g. ORACLE Forms) and server processes.
The System Global Area (SGA) is a pool of shared memory that contains data and control information of one ORACLE Instance. The SGA is created at ORACLE Instance startup and is released during instance shutdown. The amount of shared memory used by the SGA should be as large as possible, in order to minimise disk accesses. The SGA also contains a buffer cache used to speed up disk accesses (the buffer cache holds the most recently used disk data blocks) and a buffer cache for the transaction journal (redo entries).
Fig. 5 ORACLE Instance
The Program Global Area (PGA) is used for storing data and control information of the ORACLE Server processes. There are several such processes (e.g. Database Writer, Log Writer, Checkpoint, System Monitor, Process Monitor, Archiver, Recover, Dispatcher, Lock; see Fig. 5).
Basic rules for the successful installation and administration of any ORACLE Database are defined in the Optimal Flexible Architecture (OFA) - a company standard intended to guarantee a reliable and well-performing database.
5.2 LINUX and ORACLE Parameters
LINUX is a free version of UNIX operating system [1]. LINUX can be obtained from Internet in the form of distributions [7].
Currently, there is no version of ORACLE for the LINUX operating system, but we may use the version of ORACLE intended to run on SCO UNIX. The advantage of using LINUX compared to other versions of UNIX for computers with Intel processors is that LINUX is distributed together with the kernel source code, so we may see what the kernel is actually doing.
Besides the standard performance monitoring tools (e.g. sar), there are LINUX-specific tools. These tools may be obtained from Internet resources [8]. There are tools for general system performance monitoring and for process-specific monitoring. General tools include xosview, used for CPU, swap, network, interrupt and serial port monitoring, and ProcMeter and xperfmon++ for long-term monitoring.
Process-specific tools are more important from the point of view of performance tuning. A popular LINUX-specific tool is yamm, a variation of the top utility, which gives more detailed information concerning a specific process. The strace utility traces all system calls of a process. It can be obtained from [9].
There are several parameters [2] of ORACLE database system that have significant impact on the performance of underlying UNIX operating system.
First, ORACLE supports asynchronous disk writes, i.e. file system metadata (superblock, inodes, cylinder group information tables) are written to the physical disk just after any change of the metadata. This feature is enabled by setting the parameter async_write=TRUE in the initsid.ora file. Asynchronous writes are performed using raw (character) devices; unfortunately, these are not implemented in LINUX.
Second, the performance of ORACLE depends on using logical (virtual) disk devices [1], like HP-UX logical volumes or SOLARIS metadisks. Unfortunately, LINUX does not support any logical disk devices. The only possibility is the so-called multiple device driver that creates an illusion of concatenated and/or striped (RAID-0) disks. Software support for RAID disk arrays is not fully included in LINUX; it may be found (as an additional utility) at [10].
The size of disk data blocks has a significant influence on the performance of disk operations [4]. LINUX offers 1k, 2k and 4k data blocks.
When reading data from disk, LINUX uses its own disk buffer cache. A second disk buffer cache is used in the SGA. Because LINUX does not support raw devices, we are not able to suppress the influence of the system buffer cache. Further, LINUX has a dynamic buffer cache, i.e. its size is adjusted according to the needs of processes and we cannot influence it directly. The only parameters that may be changed are the frequency of updating data blocks (default 30 seconds) and the frequency of metadata updates (default 5 seconds) - parameters of the /sbin/update process.
The swap space of a LINUX system running ORACLE should be at least twice the size of physical RAM. Swap space is created using the mkswap utility and is added/removed using the swapon/swapoff utilities.
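For example (the partition name is an assumption):
$ mkswap /dev/hda3     # initialise the partition as LINUX swap space
$ swapon /dev/hda3     # activate the swap space
$ swapoff /dev/hda3    # deactivate it again when it is no longer needed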
The size of database data block is defined as parameter db_block_size in the initsid.ora file. The size of database data block should be a multiple of file system data block, i.e. 1k, 2k or 8k.
The size of SGA is defined as
fixed_size + variable_size + database_buffer_size + transaction_buffer_size
The fixed_size is defined by the ORACLE version, the variable_size is defined by the shared_pool_size and db_block_buffers parameters, the database_buffer_size is the value of db_block_buffers multiplied by db_block_size, and the transaction_buffer_size is defined by the log_buffer parameter.
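A worked example with assumed values (the fixed and variable parts are round numbers chosen only for illustration; the real values are reported by the ORACLE server at startup):
# assumed: db_block_size=4096, db_block_buffers=1000, log_buffer=65536,
#          fixed_size=102400, variable_size=4194304
$ echo $((102400 + 4194304 + 1000*4096 + 65536))   # total SGA size in bytes (roughly 8 MB)
8458240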
Ideally, the whole SGA should be placed in one segment of shared memory. The value of SHMMAX (the maximal size of one shared memory segment) should therefore be set as high as possible. The theoretical maximum for SHMMAX is 2GB.
The actual shared memory usage may be seen from the output of ipcs utility [3].
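For example (column names differ between UNIX versions; on many systems the segment size appears in the SEGSZ or bytes column):
$ ipcs -m      # list active shared memory segments; the SGA appears as one or
               # more large segments owned by the oracle user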
5.3 Tuning LINUX under the ORACLE database server
The situation in database benchmarking is not as simple as in processor benchmarking. The number of high-quality database benchmarks is quite small, especially if we need benchmarks that simulate the concurrent work of several users.
TPC (Transaction Processing Performance Council) [11] is a nonprofit organization founded by leading database companies in order to develop performance benchmarks. The most popular benchmarks are:
TPC-C - simulation of transaction processing in material orders for manufacturing
TPC-D - simulation of a DSS (Decision Support System)
Wisconsin benchmark is a one process task that simulates typical SQL queries using selection, join, projection, updates and aggregate queries.
AS3AP (ANSI SQL Standard Scaleable and Portable) is similar to Wisconsin. It uses 39 queries and it allows multitasking load simulation.
The benchmarks mentioned above use synthetic data. If we need to test real data, we may use the QA Performer software [12]. QA Performer uses real data from our application. During the first phase of testing, QA Performer collects data from our application and creates a bundle of test procedures. These procedures may be further customised.
Probably the most suitable SQL benchmarks are available in the MySQL Benchmarks bundle [13]. MySQL Benchmarks is included in an implementation of an SQL server called MySQL. MySQL Benchmarks consists of 5 tests:
ATIS - creates 29 tables, populates them with data and applies SQL queries
connect - test of database connectivity
create - test of the speed of creation of database tables
insert - test of the speed of data insertion into tables, finding of duplicate entries, updating and deleting of entries
wisconsin - implementation of the Wisconsin benchmark mentioned above.
5.4 Experimental Results
We have compared ORACLE Server with two other free SQL servers - MySQL server and PostgreSQL Server (http://www.postgresql.org) using MySQL Benchmarks. The results are given in Fig. 6.
Fig. 6 Comparison of three SQL servers with respect to the type of operation
We have also used MySQL benchmarks for testing the influence of several LINUX kernel parameters on the performance of ORACLE Server.
First, we have tested the influence of the number of shared memory segments used for the SGA. We have compared an SGA placed in one shared memory segment to an SGA split into 15 shared memory segments. We have used two MySQL Benchmarks - Wisconsin and insert, both running 5 processes serially or in parallel. The results (Fig. 7) show that distributing the shared memory into more segments decreases performance by less than 1%.
Fig. 7 Comparison of one segment shared memory to 15 segments shared memory, Wisconsin benchmark, serial and parallel run
Fig. 8 Comparison of one segment shared memory to 15 segments shared memory,
insert benchmark, several types of DML operations
Second, we have also tested the influence of shared memory segments to insert benchmark (Fig. 8). Insert uses insert, update and delete database operations.
Finally, we have experimented with the file system block size and the ORACLE parameter db_block_size (Fig. 9). We have chosen file system block sizes of 1k, 2k and 4k and db_block_size values of 2k, 4k, 6k and 8k. The test was the MySQL insert test. From Fig. 9 we see that the best results are achieved when using larger database blocks. The impact of the file system block size is not significant. The resulting decrease of performance when choosing small blocks may be up to 100%.
Fig. 9 Filesystem and database block sizes - impact on insert test
6. Conclusion
We have described the factors that influence the performance of the UNIX operating system and methods for tuning the performance. We have discussed the influence of RAM-based and disk-based factors, and we have presented tools for performance monitoring and tuning. We have studied the tuning of the ORACLE database system running on the LINUX operating system.
We see that careful tuning of the operating system and the database system can give a significant improvement in the performance of any application running on top of them.
References
[1] TANENBAUM, A.S.: Modern Operating Systems. Upper Saddle River, New Jersey (USA), Prentice Hall 1992. 728p.
[2] O'LEALY, K., WOOD, M.: Advanced System Administration (UNIX System V Release 4). Englewood Cliffs, New Jersey (USA), Prentice Hall, 1992. 323p.
[3] SOLARIS 2.4: Security, Performance and Accounting Administration. Mountain View, California (USA), Sun Microsystems 1994. 228p.
[4] STEVENS, W.R.: Advanced Programming in the UNIX Environment. Reading, Massachusetts (USA), Addison-Wesley 1992. 744p.
[5] LEWINE, D.: POSIX Programmers Guide. Sebastopol, California (USA), O'Reilly & Associates 1991. 607p.
[6] GOODHEART, B., COX, J.: The Internals of UNIX System V Release 4. Englewood Cliffs, New Jersey (USA), Prentice Hall, 1994. 664p.
[7] www.redhat.com, www.debian.org, www.caldera.com
[8] ftp://ftp.fi.muni.cz/pub/linux/system/status, ftp://sunsite.unc.edu/pub/Linux/system/status
[9] ftp://ftp.fi.muni.cz/pub/linux/devel
[10] http://linas.org/linux/raid.html
[11] http://www.tpc.org
[12] http://www.segue.com
[13] http://www.tcx.se