
Implementing Highly Available Oracle Databases with Red Hat Cluster Suite

Sergey Nemirovsky

In the current business environment, it is essential that databases, as well as other components of the IT infrastructure, remain functional at all times. Database downtime is increasingly expensive to companies, and the industry is searching for ways to avoid planned and unplanned outages. Hosting organizations widely depend on commercial high-availability suites, such as Veritas Cluster Server or Sun Cluster, that provide robust, highly customizable, and flexible failover capabilities for a wide range of applications. However, these solutions also carry a significant price tag.

Many organizations are looking for a high-availability engine that is not as expensive, but they are not willing to sacrifice reliability and availability. Red Hat Global File System (GFS) and Cluster Suite (RHC) offer failover solutions at a fraction of the cost of commercial high-availability software suites, but, as provided, these tools do not offer the same level of flexibility, reliability, and availability as the commercial solutions.

In this article, I'll discuss Red Hat Cluster Suite's behavior and shortcomings and describe methods to improve the availability that Red Hat Cluster Suite can provide. I'll explain the capabilities and limitations of Red Hat Cluster Suite in reducing downtime and improving the availability of Oracle databases to levels comparable to those provided by commercial high-availability suites, but at a fraction of the cost. The set of scripts implemented and explained here extends Red Hat Cluster's capabilities and represents just one example of the added value that can be achieved by complementing Red Hat Cluster with application-specific availability service agents.

Red Hat Cluster Behavior and Shortcomings

The Red Hat Cluster Suite (RHC) promises to provide low-cost high availability for enterprise applications (Red Hat, 2004), and it delivers on its promise well: it is a solid foundation on which to implement failover capabilities for applications. Red Hat Cluster provides the ability to fail over a service -- a collection of file systems, virtual IPs, and application daemons -- from one cluster node (a server in the cluster) to another. Thus, it improves service availability in the event of a hardware or software failure on one of the cluster nodes. Red Hat has provided documentation on how to install and configure cluster hardware and software in the "Configuring and Managing a Cluster" manual, which can be found at:

http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/
General RHC configuration is not covered here, except for the points directly related to the configuration of Oracle Database and Listener.

Typical Oracle Database Service Configuration

Much like commercial availability engines, RHC combines several related resources into a service group. An Oracle database service group is generally composed of the Oracle software installed in the ORACLE_HOME directory, the database processes (daemons) running on an active cluster node, a collection of data files, a virtual service IP, and an Oracle listener bound to that service IP. A typical Oracle configuration can be described by the dependency tree shown in Figure 1.

Based on the internal requirements and practices of different organizations, there is some variation in the exact configuration. For example, some organizations may prefer to install the Oracle software on the local disks of each cluster node, or they may choose a single installation on a global file system shared by all cluster nodes. Also, the number of shared GFS (Global File System) volumes used by the database instance varies significantly based on database size, performance requirements, or the redundancy levels required by the applications.

The important point here is that all of the cluster nodes need exactly the same Oracle installations, and all of the file systems used by the highly available database instance need to be under RHC control.

The resources, shown as circles in Figure 1, must be started in "bottom up" order. Shared file systems must be mounted, and the software and data files made available, before the Oracle database can come up. The network interface must be configured and the service IP set before a listener can be started and bind to that IP. The line from the listener to the Oracle database instance is dotted because there is no physical dependency between the two. In other words, the listener can be started before the database is up, but there is a logical connection, because the listener cannot service any incoming connections before the database is fully up.

The behavior of several components of an Oracle service group (the components managed by the RHC HA suite are shown in gray) is handled by Cluster Suite directly, but the behavior of others, such as the Oracle database and listener, is left for systems administrators to implement.

The Oracle database configuration described in Red Hat's Cluster Manual is too simplistic and does not meet the availability requirements of many organizations. The goal is to improve the flexibility of the Oracle service and to implement a more robust and reliable failover behavior for the Oracle service group.

RHC provides methods to control file systems, service IPs, and several applications -- but the Oracle database is not one of the applications that RHC supports directly. For additional applications, RHC provides a generic "script" resource type, and it calls these resources with "start", "stop", or "status" arguments. In the case of an Oracle database, the "start" method should start the database, "stop" should bring the database down, and "status" should return the status of the database.

The RHC engine evaluates only the return values from the scripts to determine the success or failure of the operations -- 0 indicating success, and 1 failure. For example, when the RHC engine starts the database, it evaluates the return value from the script; if it receives a return value of 0, it considers the resource to be up and starts probing it by calling the same script with the "status" argument. But, if it receives 1 from the "start" method, it concludes that the resource (and, consequently, the resource group) cannot be started on the given cluster node and attempts to start it on one of the other cluster nodes.
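To illustrate this contract, a minimal skeleton of such a "script" resource might look like the following sketch. The start_db, stop_db, and check_db functions are hypothetical placeholders, not code from the listings described later:

  #!/bin/sh
  # Minimal skeleton of an RHC "script" resource; RHC calls it with one argument.
  # start_db, stop_db, and check_db are hypothetical placeholder functions.

  case "$1" in
      start)  start_db;  exit $? ;;   # 0 = started, non-zero = could not start
      stop)   stop_db;   exit $? ;;   # 0 = stopped cleanly
      status) check_db;  exit $? ;;   # 0 = resource is healthy
      *)      echo "Usage: $0 {start|stop|status}"; exit 1 ;;
  esac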

Red Hat Cluster Suite Shortcomings

Limited Recovery Policy Options

In the event of a failure, RHC supports several recovery policies:

  • Restart -- The failed service group is restarted on the same cluster node.
  • Relocate -- The failed group is moved to another cluster node in the same cluster.
  • Disable -- The group is disabled and no attempt is made to restart it on any of the cluster nodes.
RHC does not allow these recovery policies to be combined. For example, if the restart recovery policy is selected, the failed resource group will not be moved to another node as long as it can be successfully restarted on the same cluster node, no matter how many times it fails. There are only two conditions under which a service group will be moved to another available node in the cluster:

    1. The service group cannot be restarted on the active node of the cluster.

    2. The active node is no longer in the cluster (it crashed or was taken out of the cluster manually).

While this is usually the desired behavior, it does not provide the best failure resolution in all situations. Restarting on the same cluster node is most appropriate when failures are caused by software issues, such as high loads, because, in most cases, the underlying problem will simply follow the service group when the group is failed over. However, in the event of a hardware failure, it is desirable to fail over the service group to one of the other cluster nodes. Some types of physical memory problems result in application crashes, and since the Oracle database is normally the main consumer of memory on a database server, it is the process most likely to touch bad memory cells. Continuous crashing of the database is indicative of such a problem, but RHC does not provide a direct way to limit the number of restart attempts on a given cluster node.

Ping-Pong Conditions

As mentioned before, if a software issue causes a database crash, and the service group containing that database is failed over to another cluster node, the issue is likely to appear on the other cluster node as well. This results in another failover, and the situation is likely to repeat.

Continuous failover of the service group results in a ping-pong situation (i.e., the service group is continuously moved from one cluster node to another). This behavior makes the issue difficult to troubleshoot, because the problem keeps moving from host to host within the cluster. It is desirable to disable the failing resource group so that the database or systems administrator can investigate and fix the issue and the database can be brought back online. RHC does not provide a mechanism to limit the number of times a service group is moved from one cluster node to another.

Limited Control over Restart Behavior

In the case of a single resource failure, RHC's implementation requires all the resources in that resource group to be restarted. For example, if the database resource fails, the listener is also brought offline, all the file systems are unmounted, the IP is unplumbed, and all of the resources are restarted in the opposite order. This results in slow restart times. The time needed to restart a failed listener is negligible compared to the time needed to restart a database. If a listener fails, only the listener needs to be restarted -- there is no point in stopping the database, unplumbing IPs, or unmounting file systems -- but that is what RHC does.

Moreover, a failed Oracle listener does not in any way affect already-established connections to the database from applications or users. A listener listens on the service IP for incoming connections and passes those connections to the database, but it plays no role in further communications between the database and the user or application that initiated the connection. It is desirable for a high-availability engine to be able to determine when to restart everything and when the less time-consuming restart of an individual resource is more appropriate. Restarting only the failed resource shortens the restart time and minimizes the impact of restarts on the availability of the provided services.

Limited Logging

RHC by itself does not provide detailed logs of error conditions on the cluster. The RHC engine logs general information about failures via syslog, and that information is not always sufficient to fully investigate and resolve a problem. Continuous run logs for the "start", "stop", and "status" methods are needed to fully investigate and quickly fix the underlying problems.

Limited Probing Configurability

Resources are probed by the RHC availability engine at a predefined interval -- the probe interval is the same for all resources in the service group. While probing resources too often adds load to the servers, not probing them often enough prolongs the time between a resource going offline and the cluster engine detecting that condition.

It is desirable to perform lightweight probes frequently and load-intensive tests less often. For example, the normal probe on a database resource may check for the presence of the required database processes and establish a local connection to the database, while a "deep probe" may connect to the database, insert values into a test table, and evaluate the success of the insert operation. It is also desirable for a probe to be able to test the end-to-end availability of a service group, not just the availability of individual resources.

After the RHC engine starts a resource on one of the nodes and receives a successful return value from the resource's "start" script, it immediately proceeds to probe the resource, but it may take a resource several seconds or even minutes to fully start up. RHC does not allow for a delay between starting a resource and probing it for the first time.

Sub-Optimally Designed Shutdown Methods

Red Hat has provided a suggested configuration for the Oracle database, documented at:

http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/cluster-suite/ch-db-service.html#S1-SERVICE-ORACLE
To shut down a database, this implementation calls the sqlplus "shutdown abort" command, which takes the database down quickly but ungracefully. "Shutdown abort" may cause database corruption, and it may require database recovery the next time the database is brought online, thus prolonging the time it takes to restart or fail over the database.

It is essential that the "stop" method actually take the database down; if it fails, the database must not be started on any other node of the cluster, nor restarted on the same node, in order to avoid data corruption. But it is also desirable for the RHC engine to at least attempt to bring the database down gracefully and, only if the graceful shutdown fails, to proceed to more severe -- and destructive -- methods.

Inability to Freeze Cluster

During system maintenance, it is convenient to be able to disable a cluster's normal operations. For example, if the database is put into STANDBY mode while the cluster is operating normally, the cluster would restart the database and open it in "READ WRITE" mode. Stopping the cluster completely and then starting the database manually in STANDBY mode would require the cluster-controlled file systems to be mounted manually as well, which may be significant overhead, especially if the database uses numerous file systems. It is desirable to temporarily disable resource monitoring in the cluster to allow for database maintenance, and RHC lacks that ability.

Implemented Solution

Many of the shortcomings of the RHC high-availability engine described above have been addressed by commercial high-availability (HA) suites, such as VCS or Sun Cluster. To achieve the level of availability that commercial HA suites can provide, while maintaining the low cost of the availability cluster provided by RHC, the features described above must be incorporated into the resource scripts that implement the "start", "stop", and "status" methods for the Oracle database and listener. In the implementation of RHC Oracle failover database clusters described here, the database and listener are split into separate resources in the same resource group, so that their restarts can be handled independently, and separate scripts are developed for each.

The resource scripts for database and listener resources are shown in Listings 1 and 2, respectively. Listing 3 contains a file, sourced by both scripts, that defines global variables and contains shell functions used by both resource scripts. Listing 4 contains a sample resource group configuration file. The listing files can be found at: http://www.sysadminmag.com/code/.

The listener resource was made a child resource of the database, so that the database starts first and then the listener is brought online. Figure 2 demonstrates the configuration of resources within the Oracle database resource group.

The listener is attached to the database resource, and the database is attached to IP and GFS file systems. RHC starts the resources from top to bottom and shuts down resources from bottom up; attaching resources to one another ensures the correct startup and shutdown order.

As I mentioned before, a failure (a return status of 1 from the "status" method of a resource script) originally resulted in restarting all the resources in the group, which can take a long time, particularly for the database resource. That is why restart functionality was moved from the RHC engine into the "status" methods of the resource scripts. If a resource can be successfully restarted, the "status" method returns success, so no other resources are restarted in the event of an individual resource failure. The benefits achieved by implementing custom scripts for each of the resources are described below.

Improved Recovery Behavior

As described before, the "status" method for each of the resources -- the database and the listener -- calls the restart function if an outage is detected. If the restart function can successfully restart the resource, the "status" script exits with 0 (success); thus, the cluster engine does not restart or fail over the resource group. In this implementation, the recovery policy is defined as "restart", but failovers are forced with a system of flags that stops repeatedly failing resources from starting on the same node.

To limit the number of times each resource is restarted, the restart function also maintains a restart history (i.e., it stores the restart times for each resource in a flag file). The number of restarts is controlled by two configuration variables: RESTART_ON_FAILURE_INTERVAL and RESTART_ATTEMPTS. Each resource is allowed to be restarted RESTART_ATTEMPTS times within RESTART_ON_FAILURE_INTERVAL hours. For example, if RESTART_ON_FAILURE_INTERVAL is set to 2 and RESTART_ATTEMPTS is set to 3, the resource will be restarted by the "status" method up to three times within the two-hour interval, should it keep failing. If the resource fails more than three times, the "status" script exits with a status of 1 (failure) and lets the cluster handle the recovery process.
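A rough sketch of such a restart-limiting check is shown below. The flag file location, variable values, and function name are illustrative assumptions, not the exact code from the listings:

  RESTART_ATTEMPTS=3
  RESTART_ON_FAILURE_INTERVAL=2                                # hours
  RESTART_HISTORY=/var/run/cluster/${ORACLE_SID}.restarts      # hypothetical flag file

  restart_allowed() {
      now=$(date +%s)
      cutoff=$(( now - RESTART_ON_FAILURE_INTERVAL * 3600 ))

      # Drop restart timestamps that are older than the interval.
      [ -f "$RESTART_HISTORY" ] || : > "$RESTART_HISTORY"
      awk -v cutoff="$cutoff" '$1 >= cutoff' "$RESTART_HISTORY" > "$RESTART_HISTORY.tmp"
      mv "$RESTART_HISTORY.tmp" "$RESTART_HISTORY"

      # Refuse the restart if the limit has already been reached.
      if [ "$(awk 'END { print NR }' "$RESTART_HISTORY")" -ge "$RESTART_ATTEMPTS" ]; then
          return 1
      fi
      echo "$now" >> "$RESTART_HISTORY"
      return 0
  }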

The restart functions within each resource script also maintain a PING-PONG flag. If the limit of RESTART_ATTEMPTS restarts within RESTART_ON_FAILURE_INTERVAL hours has been reached, a PING-PONG flag is set (via a flag file), which prevents the "start" method from starting the database. If the PING-PONG flag is set on all cluster nodes, the resource group does not come online, allowing for human intervention.

The PING-PONG flag has a limited effective lifetime. If the timestamp in the PING-PONG flag file is more than RESTART_ON_FAILURE_INTERVAL hours old, the flag file is ignored and removed. This is done so that "start" is allowed on a node a significant time after a failure. For example, if a resource group has been failed over from node0 and a PING-PONG flag has been set there, then several days later, when the resource group tries to fail back to node0, the scripts will allow the group to come online. This way, if the underlying problem on node0 has been fixed, the PING-PONG flag will not interfere with normal cluster operation.
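The corresponding check in the "start" method might look roughly like the following sketch; the flag file path and function name are assumptions for illustration:

  PINGPONG_FLAG=/var/run/cluster/${ORACLE_SID}.pingpong    # hypothetical flag file

  pingpong_blocked() {
      [ -f "$PINGPONG_FLAG" ] || return 1                  # no flag: "start" may proceed
      flag_time=$(cat "$PINGPONG_FLAG")                    # epoch seconds written at failure time
      age=$(( $(date +%s) - flag_time ))
      if [ "$age" -gt $(( RESTART_ON_FAILURE_INTERVAL * 3600 )) ]; then
          rm -f "$PINGPONG_FLAG"                           # stale flag: ignore and remove it
          return 1
      fi
      return 0                                             # fresh flag: block "start" on this node
  }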

Improved Logging

Logging in the scripts is implemented via the log_msg and log_alert functions in the common functions file. Log_msg is generally used to maintain a run log for each script. The logs are maintained on a "per script, per instance" basis: if two resource groups for two database instances are defined on the same cluster, each has its own run log. Each log line is prepended with a time stamp and the PID of the script; that way, if probes overlap, it is easy to follow what happened within each probe.

More severe messages are logged via the log_alert function. Log_alert writes the message to the terminal of each logged-in user, to the corresponding run log, and to the system log. Each alert message contains a reference to the appropriate run log, so that systems and database administrators who are not closely familiar with the cluster script implementation can easily gather more information about the failure. Each line of the alert log contains the PID of the corresponding probe to allow for easy cross-referencing of the system log with the run logs.

Run logs can be quite verbose and generate fairly large log files. To limit their impact, the total size of each log file is limited by the MAX_LOG_SIZE configuration variable. When a log maintained by a resource script reaches the size limit, that log is rotated out and a new log is started. Because probes ("status" scripts) can overlap, the active log is copied to a log with an incremental number using the redirect operator. The Unix mv command cannot be used for this, because mv is a blocking operation and could cause a probe script to hang. Inactive logs, however, are rotated using mv, because mv is nearly instantaneous within the same file system (only names are updated; no data blocks are actually moved). An additional advantage of mv here is that it preserves the time stamps on the log files, so it is easy to tell when logs were rotated and which time period each log covers.
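A simplified sketch of these logging helpers follows; the file names, size limit, and syslog tag are assumed values, and the rotation is reduced to a single numbered log for brevity:

  LOG=/var/log/cluster/${ORACLE_SID}_db.log        # hypothetical log location
  MAX_LOG_SIZE=1048576                             # bytes (value is illustrative)

  log_msg() {
      # Rotate the active log with the redirect operator (non-blocking), not mv.
      if [ -f "$LOG" ] && [ "$(stat -c %s "$LOG")" -gt "$MAX_LOG_SIZE" ]; then
          cat "$LOG" > "$LOG.1" && : > "$LOG"
      fi
      echo "$(date '+%Y-%m-%d %H:%M:%S') [$$] $*" >> "$LOG"
  }

  log_alert() {
      log_msg "ALERT: $*"
      logger -t "oracle_ha[$$]" "$* (see $LOG)"    # system log entry pointing at the run log
      echo "oracle_ha[$$]: $* (see $LOG)" | wall   # terminals of logged-in users
  }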

Improved Startup and Shutdown Behaviors

As I mentioned before, in some situations the database may not be all the way up even after the startup script completes successfully. To avoid falsely reporting a failure while the database is still coming up, the STARTUP_TIME configuration variable is used. This variable is used differently by the database and listener resources.

The database resource script exits as soon as the startup method completes, but the "status" method checks whether the database was brought online less than STARTUP_TIME seconds ago and exits with success if STARTUP_TIME has not yet passed. This way, the RHC engine proceeds with starting the listener before the database is completely up and has been probed, thereby speeding up the startup process. The listener resource script simply sleeps for STARTUP_TIME seconds after the "start" command completes successfully, after which the RHC engine begins to probe the listener.
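A simplified sketch of that grace-period check in the database "status" method follows. The flag file, its path, and the probe_database function are assumed details, not the literal listing code:

  STARTUP_TIME=300                                        # seconds (value is illustrative)
  START_FLAG=/var/run/cluster/${ORACLE_SID}.start_time    # hypothetical flag file
  # The "start" method records the start time:  date +%s > "$START_FLAG"

  status_database() {
      if [ -f "$START_FLAG" ]; then
          started=$(cat "$START_FLAG")
          if [ $(( $(date +%s) - started )) -lt "$STARTUP_TIME" ]; then
              return 0              # still within the grace period: report success
          fi
      fi
      probe_database                # hypothetical function standing in for the real probe
  }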

The shutdown methods first initiate a graceful shutdown, using the sqlplus shutdown immediate command for the database and lsnrctl stop LSNR_NAME for the listener. If the graceful shutdown has not completed within SHUTDOWN_TIME seconds, each resource shuts itself down more aggressively: a kill -9 is sent to the database smon process first, then to the rest of the database processes, followed by the sqlplus shutdown abort command for the database and a kill -9 to the tnslsnr process.
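In outline, the database side of that escalation might look something like the sketch below. The background/sleep structure and the process-name patterns are illustrative simplifications, not the exact logic of the listing:

  stop_database() {
      # Attempt a graceful shutdown in the background ...
      echo "shutdown immediate" | su - oracle -c "sqlplus -s / as sysdba" &
      graceful_pid=$!

      # ... and give it SHUTDOWN_TIME seconds to finish.
      sleep "$SHUTDOWN_TIME"
      if kill -0 "$graceful_pid" 2>/dev/null; then
          log_alert "graceful shutdown timed out, escalating"
          pkill -9 -f "ora_smon_${ORACLE_SID}"            # smon first
          pkill -9 -f "ora_.*_${ORACLE_SID}"              # then the remaining background processes
          echo "shutdown abort" | su - oracle -c "sqlplus -s / as sysdba"
      fi
  }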

Immediate Remediation Procedure

In many situations, when the database has crashed, it cannot be restarted without database recovery. For example, if any database files were in archive (backup) state when the database crashed, a simple "startup" command will not start the database, and media recovery is needed. The "start" method of the implemented solution attempts to recover the database if a failure is detected. In many cases, this eliminates the need for human intervention and shortens the recovery time.
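One hedged way to sketch that remediation, assuming the database runs under the oracle account with ORACLE_SID set, and using a crude ORA- error check rather than the listing's actual failure analysis:

  start_database() {
      out=$(echo "startup" | su - oracle -c "sqlplus -s / as sysdba" 2>&1)
      echo "$out" | grep -q "ORA-" || return 0      # no ORA- errors: assume the database came up

      # Simplified: the real listing examines the failure before deciding to recover.
      log_alert "startup reported errors, attempting media recovery"
      printf 'startup mount\nrecover automatic database;\nalter database open;\n' \
          | su - oracle -c "sqlplus -s / as sysdba"
  }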

Improved Probing

Each resource is probed by its own method. The "status" script for the database checks for the presence of the database processes and examines the open_mode column of the v$database view to determine the state of the database. The listener's method checks for the tnslsnr process and evaluates the return value of the tnsping command and the output of the lsnrctl status command.
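In rough outline, with ORACLE_SID, TNS_ALIAS, and LISTENER_NAME standing in for site-specific values, the two probes might look like this sketch (the exact queries and string matching in the listings differ):

  probe_database() {
      # Check for the instance's pmon background process.
      pgrep -f "ora_pmon_${ORACLE_SID}" >/dev/null || return 1
      # Ask the instance for its open mode over a local connection.
      mode=$(echo "select open_mode from v\$database;" \
             | su - oracle -c "sqlplus -s / as sysdba")
      echo "$mode" | grep -q "READ WRITE"
  }

  probe_listener() {
      pgrep -f tnslsnr >/dev/null || return 1
      su - oracle -c "tnsping ${TNS_ALIAS} >/dev/null" || return 1
      su - oracle -c "lsnrctl status ${LISTENER_NAME}" \
          | grep -q "The command completed successfully"
  }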

It is largely assumed that monitoring the database and listener independently is sufficient to evaluate the overall database service state. However, the deep_probe method of the listener resource script, which runs every DEEP_PROBE_INTERVAL seconds, monitors end-to-end functionality. It establishes a connection to the database through the listener and evaluates the result. For security reasons, the deep_probe method does not have valid database login credentials, but it treats an "access denied" response from the database as success. After all, if it receives a response from the database, it is safe to assume that both the listener and the database are functioning correctly.
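A sketch of that end-to-end check follows. The listing matches an "access denied" string; this sketch approximates it by looking for ORA-01017 (invalid username/password), and TNS_ALIAS is an assumed alias that resolves through the service IP:

  deep_probe() {
      # -L limits sqlplus to a single logon attempt with deliberately bogus credentials.
      out=$(su - oracle -c "sqlplus -s -L no_such_user/wrong_pw@${TNS_ALIAS}" </dev/null 2>&1)
      echo "$out" | grep -q "ORA-01017" && return 0   # logon denied: the database answered end to end
      return 1                                        # no answer: listener or database is down
  }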

The deep_probe method is an exception in the way it uses the recovery policy. If deep_probe detects a failure condition, the method does not call the restart function, but rather exits with a failure return value. It does not set the restart or ping-pong flags, because it is unable to determine which component -- the database or the listener -- has faulted; rather, it evaluates the complete service. Deep_probe relies on the fact that, in the case of a single resource failure, the RHC HA engine will restart all of the resources in a service group. So, if deep_probe detects an outage, RHC will restart all of the resources in the resource group or fail over the entire group, depending on the recovery policy defined.

Ability to Freeze Individual Resource Groups

In this implementation of the RHC Oracle service agent, a resource group can be "frozen" (i.e., all cluster actions on the Oracle or listener resources can be disabled). To freeze a resource group, the database or listener resource script must be called with the "freeze" argument on the active cluster node. This creates a flag file that is checked by both the listener and database resource scripts, and no action is taken by those scripts while the flag file exists. If the cluster calls a resource script with the "stop" argument while the group is frozen, a failure return value is returned, which blocks any further cluster actions. To unfreeze the group, one of the resource scripts can be called with the "unfreeze" argument, or the flag file can be removed manually.
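A minimal sketch of the freeze mechanism is shown below. The flag file path is an assumption, and the return values for "start" and "status" while frozen are inferred from the description above (only the "stop" behavior is stated explicitly):

  FREEZE_FLAG=/var/run/cluster/${ORACLE_SID}.frozen      # hypothetical flag file

  case "$1" in
      freeze)   touch "$FREEZE_FLAG"; exit 0 ;;
      unfreeze) rm -f  "$FREEZE_FLAG"; exit 0 ;;
  esac

  if [ -f "$FREEZE_FLAG" ]; then
      case "$1" in
          status) exit 0 ;;   # report success so the cluster takes no action
          start)  exit 0 ;;   # take no action while frozen
          stop)   exit 1 ;;   # return failure, blocking further cluster actions
      esac
  fi
  # ... normal start/stop/status handling follows here ...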

If a file system or IP resource fails while the group is frozen, the listener and database resource scripts return a failure from the "stop" method, which puts the resource group into the "failed" state. To re-enable the group, it must then be unfrozen, then disabled and re-enabled.

Conclusions

Using a lower cost solution to achieve high availability for enterprise applications requires significant overhead in implementation and deployment time and resources. While the out-of-the-box solution that Red Hat Cluster Suite provides is not as flexible and reliable as those provided by commercial high-availability suites, most of RHC's shortcomings can be worked around, and availability levels very similar to those of its commercial counterparts can be achieved at significantly lower cost.

It would be a better solution to implement this functionality in the RHC HA engine itself, rather than handling it in external scripts. However, to satisfy the strict availability requirements of most organizations, these difficulties can be successfully overcome by moving the desired functionality into the resource-specific scripts. The feature requests have been forwarded to Red Hat, and it remains to be seen whether new capabilities and configuration options will be implemented in the next release of Red Hat Cluster.

Although the methods described here are specific to Oracle database and listener resources, the same methods can be applied to any other application's resource scripts. The Oracle database and listener scripts show how these methods can be applied to improve overall database service availability, and the same approach can be used with databases other than Oracle or with completely different types of applications.

References

Red Hat, 2006. Red Hat Cluster Suite: Configuring and Managing a Cluster. Retrieved January 2, 2006, from http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/

Sergey holds an M.S. in Applied Computer Systems Management from UMUC and a B.S. in Computer Science from UMBC. He has extensive experience in implementing, deploying, and maintaining high-availability solutions using a variety of commercial and open source availability suites.



