Implementing Highly Available Oracle Databases with Red Hat Cluster Suite
Sergey Nemirovsky
In the current business environment, it is essential that
databases, as well as other components of IT infrastructure, remain
functional at all times. Downtime of databases is increasingly
expensive to companies, and the industry is searching for ways to avoid
planned or unplanned downtime. Hosting organizations widely depend on
commercial high-availability suites, such as Veritas Cluster Server or
Sun Cluster, that provide robust, very customizable, and flexible
failover capabilities to a wide range of applications. However, these
solutions also carry a significant price tag.
Many organizations are looking for a high-availability engine
that is not as expensive, but they are not willing to sacrifice the
reliability and availability. Red Hat Global File System (GFS) and
Cluster Suite (RHC) offer failover solutions at a fraction of the cost
of commercial high-availability software suites, but, as provided,
these tools do not allow for the same level of flexibility,
reliability, and availability as the commercial solutions.
In this article, I'll discuss Red Hat Cluster Suite's behavior
and shortcomings and describe methods to improve the availability that
Red Hat Cluster Suite can provide. I'll explain the capabilities and
limitations of Red Hat Cluster Suite in reducing downtime and improving
availability of Oracle databases to the levels comparable to those
provided by commercial high-availability suites, but at a fraction of
the cost. The set of scripts implemented and explained here extends Red
Hat Cluster's capabilities and represents just one example of added
value that can be provided by skillfully leveraging Red Hat Cluster's
capabilities with application-specific availability service agents.
Red Hat Cluster Behavior and Shortcomings
The Red Hat Cluster Suite (RHC) promises to provide low-cost
high availability for enterprise applications (Red Hat, 2004), and it
delivers on its promise well: it is a solid foundation to implement
failover capabilities for applications. Red Hat Cluster provides the
ability to failover service -- or collection of file systems, virtual
IPs, and application daemons -- from one cluster node (server in a
cluster) to another. Thus, it improves service availability in the
event of hardware or software failure on one of the cluster nodes. Red
Hat has provided documentation on how to install and configure cluster
hardware and software in the "Configuring and Managing a Cluster"
manual, which can be found at:
http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/
General RHC configuration is not covered here, except for the
points directly related to the configuration of Oracle Database and
Listener.
Typical Oracle Database Service Configuration
Much like commercial availability engines, RHC combines several
related resources into a service group. An Oracle database service
group is generally composed of Oracle software installed in ORACLE_HOME
directory, database processes (daemons) that are running on an active
cluster node, a collection of data files, a virtual service IP, and an
Oracle listener that is bound to that service IP. A typical Oracle
configuration can be described by the dependency tree shown in
Figure 1
.
Based on internal requirements and practices of different
organizations, there is some variation with respect to exact
configuration. For example, some organizations may prefer to install
Oracle software on local disks of each cluster node, or they may choose
a single installation on a global file system shared by all cluster
nodes. Also, the number of shared GFS's (Global File Systems) used by
the database instance varies significantly based on the database size,
performance requirements, or redundancy levels required by the
applications.
The important point here is that all of the cluster nodes need
exactly the same Oracle installations, and all of the file systems used
by the highly available database instance need to be under RHC control.
The resources, shown as circles in
Figure 1
, must be started in "bottom up" order. Shared file systems must be
mounted, and software and data files made available before the Oracle
database can come up. The network interface must be configured and the
IP set before a listener can be started and be able to bind to that IP.
The line from the listener to the Oracle database instance is dotted
because there is no physical dependency between the two. In other
words, the listener can be started before the database is up, but there
is a logical connection, because the listener cannot service any
incoming connections before the database is fully up.
Behaviors of several components of an Oracle service group
(the components managed by RHC HA suite are shown in gray) are handled
by Cluster Suite directly, but the behavior of others, such as an
Oracle database and listener, are left to be implemented by systems
administrators.
The Oracle database configuration described in the Red Hat's
Cluster Manual is too simplistic and does not meet availability
requirements of many organizations. The goal is to improve the
flexibility of Oracle service and to implement a more robust and
reliable failover behavior for the Oracle service group.
RHC provides methods to control file systems, service IP, and
several applications -- but Oracle database is not one of the
applications that RHC supports directly. For additional applications,
RHC provides a generic "script" type resource, and RHC calls these
resources with "start", "stop", or "status" arguments. In case of
Oracle database, the "start" method should start the database, "stop"
should bring the database down, and "status" should return the status
of the database.
RHC engine only evaluates return values from the scripts to
determine success/failure of the operations -- with 0 indicating
success, and 1 failure. For example, when RHC engine starts the
database, it evaluates the return value from the script; if it receives
a return value of 0, it considers the resource to be up and starts
probing it by calling the same script with "status" argument. But, if
it receives 1 from "start" method, it determines that RHC cannot start
the resource (and, consequently, the resource group on a given cluster
node) and attempts to start it on one of the other cluster nodes.
Red Hat Cluster Suite Shortcomings
Limited Recovery Policy Options
In the event of a failure, RHC supports several recovery policies:
- Restart -- The failed service group is restarted on the same cluster node.
- Relocate -- The failed group is moved to another cluster node in the same cluster.
- Disable -- The group is disabled and no attempt is made to restart it on any of the cluster nodes.
The RHC does not allow for any combination of the restart
policies. For example, if the restart recovery policy is selected, the
failed resource group will not get moved to another node as long as it
can be successfully restarted on the same cluster node, no matter how
many times it fails. There are two conditions, under which a service
group would get moved to another available node in the cluster:
1. The service group cannot be restarted on the active node of the cluster.
2. The active node is no longer in the cluster (it crashed or was taken out of the cluster manually).
While it is a largely desired behavior, this behavior does not
provide for the best failure resolution in all situations. Restarting
on the same cluster node is most appropriate when failures are caused
by software issues, such as high loads, because, in most cases, the
underlying problem is most likely to follow the service group when the
group is failed over. However, in the event of hardware failures, it is
desired to failover the service group to one of the other cluster
nodes. Some types of physical memory problems may result in application
crash, and since Oracle database normally is the main user of the
system's memory on database server, it is more likely to access bad
memory cells. Continuous crashing of the database is indicative of such
a problem, but RHC does not provide a direct way to limit a number of
restart attempts on a given cluster node.
Ping-Pong Conditions
As mentioned before, if a software issue resulting in a
database crash should occur, and if the service group containing that
database is failed over to another cluster node, the issue is likely to
appear on the other cluster node as well. This would result in another
failover, and the situation is likely to continue to occur.
Continuous failover of the service group results in ping-pong
situation (i.e., the service group is continuously moved from one
cluster node to another). This behavior makes it difficult to
troubleshoot the issue, because the issue is also continuously moving
from different hosts in the cluster. It is desired to disable the
failing resource group to allow the database or systems administrator
to investigate and fix the issue, so that the database can be brought
back online. RHC does not provide for a mechanism to limit the number
of times service groups are moved from one cluster node to another.
Limited Control over Restart Behavior
In case of a single resource failure, the RHC's implementation
requires all the resources in that resource group to be restarted. For
example, if a database resource fails, the listener is also brought
offline, all the file systems are unmounted, IP is unplumbed, and all
of the resources are restarted in the opposite order. This results in
slow restart times. The time needed to restart a failed listener is
negligible when compared to the time needed to restart a database. If a
listener fails, only the listener needs to get restarted -- there is no
point in stopping the database, unplumbing IPs, or unmounting file
systems -- but that is what the RHC would do.
Moreover, a failed Oracle listener does not in any way affect
connections to the database by applications or users that have already
been established. A listener listens on the service IP for incoming
connections and passes those connections to the database but plays no
role in further communications that happen between the database and a
user or an application that has initiated the connection. It is desired
for a high-availability engine to be able to determine when to restart
everything, and when the less time-consuming restarts of individual
resources are more appropriate. If only a failed resource is restarted,
it may shorten the restart time and minimize the impact of restarts on
availability of provided services.
Limited Logging
RHC by itself does not provide detailing logs of error
conditions on the cluster. The RHC engine logs general information
about the failures via syslog, and that information is not always
sufficient to fully investigate and resolve the problem. Continued run
logs for "start", "stop", and "status" methods are needed to fully
investigate and quickly and easily fix underlying problems.
Limited Probing Configurability
The resources are probed by RHC availability engine in a
predefined interval -- the probe interval is the same for all resources
in the service group. While having resources probed too often adds
additional loads to the servers, not probing them often enough prolongs
the time between a resource going offline and a cluster engine
detecting such a condition.
It is desirable to perform less load-intense probing more
often, but to perform load-savvy tests less frequently. For example,
the normal probe on a database resource may check for the presence of
required database processes and establish a local connection to the
database, but the "deep probe" may connect to the database, insert
values into a test table, and evaluate success of the insert operation.
It is also desired for a probe to be able to test end-to-end
availability of a service group, not just availability of individual
resources.
After RHC engine starts a resource on one of the nodes and
receives a successful return value from the "start" resource script, it
immediately proceeds to probing the resource, but sometimes it may take
a resource several seconds or even minutes to fully start up. RHC does
not allow for a delay between starting the resource and probing it for
the first time.
Sub-Optimally Designed Shutdown Methods
RH has provided a suggested configuration for Oracle database, documented at:
http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/ \
cluster-suite/ch-db-service.html#S1-SERVICE-ORACLE
To shut down a database, this implementation calls sqlplus
"shutdown abort" command, which takes down the database quickly but
ungracefully. "Shutdown abort" may cause database corruption, and it
may require database recovery the next time the database is brought
online, thus prolonging the time it takes to restart or failover the
database.
It is essential that the "stop" method take the database down --
if it fails, the database cannot be started on any other node of the
cluster or be restarted on the same node, to avoid data corruption. But
it is also desirable for the RHC engine to at least attempt to bring
down the database gracefully, and, if a graceful shutdown fails, then
proceed to more severe -- and destructive -- methods.
Inability to Freeze Cluster
During system maintenance, it is convenient to be able to
disable a cluster's normal operations. For example, if the database is
taken to STANDBY mode under normal cluster operations, the cluster
would restart the database and open it in "READ WRITE" mode. Stopping
the cluster completely and then starting the database manually in
STANDBY mode would require for the cluster-controlled file systems to
also be mounted manually, which may be significant overhead, especially
if the database uses numerous file systems. It is desirable to
temporarily disable resource monitoring in the cluster to allow for the
database maintenance, and RHC cluster lacks that ability.
Implemented Solution
Many of the shortcomings of the RHC high-availability engine
described above have been addressed by commercial high-availability
(HA) suites, such as VCS or Sun Cluster. To achieve the same level of
availability that commercial HA suites can provide, while maintaining a
low cost of the availability cluster provided by RHC, the features
described above must be incorporated into resource scripts that
implement "start", "stop", and "status" methods for the Oracle database
and listener. In the implementation of RHC Oracle failover database
clusters described here, the database and listener are split off to
different resources in the same resource group, so that their restarts
can be handled independently, and separate scripts for database and
listener are developed.
The resource scripts for database and listener resources are
shown in Listings 1 and 2, respectively. Listing 3 contains a file,
sourced by both scripts, that defines global variables and contains
shell functions used by both resource scripts. Listing 4 contains a
sample resource group configuration file. The listing files can be
found at: http://www.sysadminmag.com/code/.
The listener resource was made to be a child resource of the
database, so that the database starts first, and then the listener is
brought online.
Figure 2
demonstrates the configuration of resources within Oracle database resource group.
The listener is attached to the database resource, and the
database is attached to IP and GFS file systems. RHC starts the
resources from top to bottom and shuts down resources from bottom up;
attaching resources to one another ensures the correct startup and
shutdown order.
As I mentioned before, a failure (return status of 1 from
"status" method of a resource script) originally resulted in restarting
all the resources in the group, which may take long time, specifically
for the database resource. That is why restart functionality was moved
from RHC engine to "status" methods of resource scripts. If a resource
can be successfully restarted, the "status" method will return success,
so no other resources are restarted in the event of individual resource
failure. The benefits achieved by implementing custom scripts for each
of the resources are described below.
Improved Recovery Behavior
As described before, the "status" method for each of the
resources -- a database and a listener -- calls the restart function if
an outage is detected. If the restart function can successfully restart
the resource, the "status" script exits with 0 (success); thus, the
cluster engine does not restart or failover the resource group. In this
implementation, the recovery option is defined to be "restart", but
failovers are forced with a system of flags that stop repeatedly
failing resources from starting on the same node.
To limit the number of times each resource is restarted, the
restart function also maintains a restart history (i.e., it stores
restart times for each resource in a flag file). The number of times
each resource is restarted is controlled by two configuration
variables: RESTART_ON_FAILURE_INTERVAL and RESTART_ATTEMPTS. Each
resource is allowed to be restarted RESTART_ATTEMPTS times within
RESTART_ON_FAILURE_INTERVAL hours. For example, if
RESTART_ON_FAILURE_INTERVAL is set to 2 and RESTART_ATTEMPTS is set to
3, the resource would be restarted by "status" method up to three times
within RESTART_ON_FAILURE interval, should it keep failing. If the
resource fails more than three times, the "status" script would exit
with a status of 1 (failure) and let the cluster handle the recovery
process.
Restart functions within each resource script also maintain
the PING-PONG flag. If RESTART_ATTEMPTS within
RESTART_ON_FAILURE_INTERVAL criteria have been met, a PING-PONG flag is
set (via flag file), which prevents the "start" method from starting
the database. If the PING-PONG flag is set on all cluster nodes, the
resource group does not come online to allow for human intervention.
PING-PONG flag has a limited effective time. If the date in
the PING-PONG flag file is more than RESTART_ON_FAILURE_INTERVAL hours
old, the PING-PONG flag file is ignored and removed. This is done so
that "start" is allowed on a node significant time after a failure. For
example, if a resource group has been failed over from node0, and
PING-PONG flag has been set on node0, several days later, when the
resource group tries to fail over to node0 again, the scripts would
allow for the resource group to come online. This way, if the
underlying problem on node0 has been fixed, the PING-PONG flag will not
stop the normal cluster operation.
Improved Logging
The logging of the scripts is implemented via log_msg and
log_alert functions within common functions file. Log_msg is generally
used to maintain a run log of each script. The logs are maintained on a
"per script, per instance" basis: if there are two resource groups for
two database instances defined on the same cluster, each will have its
own run log. Each line of the log is prepended with a time stamp and a
PID for the script; that way, if probes overlap, it is easy to follow
what has happened within each probe.
More severe messages are logged via the log_alert function.
Log_alert displays a message in a terminal of each of the logged-in
users, a corresponding run log, and the system log. Each of the alert
messages contains a reference to the appropriate run log, so that more
information about the failure could easily be gathered by the systems
and database administrators not closely familiar with cluster scripts
implementation. Each line of the alert log contains PID of the
corresponding probe to allow for easy cross-reference of the system log
with run logs.
Run logs can be quite wordy and generate fairly large log
files. To limit the impact of run logs, the total size of each log file
is limited by the MAX_LOG_SIZE configuration variable. If any log
maintained by a resource script reaches the size limit, that log is
rotated out and a new log is generated. It is possible for probes
("status" scripts) to overlap, which is why an active log is redirected
to a log with an incremental number by the "redirect" operator. The
Unix mv command could not have been used for this, because mv is a blocking operation and can potentially cause a probe script to hang. However, inactive logs are rotated using mv, because mv is nearly instantaneous when used within the same file system (only names are updated with mv within the same file system, no data blocks are actually being moved around). An additional advantage of the mv
command in this case is that it preserves time stamps on the log files,
so it is quick and easy to tell when logs were rotated and which time
period is covered by each log.
Improved Startup and Shutdown Behaviors
As I mentioned before, in some situations, the database may not
be all the way up even after the startup script successfully completes.
To avoid false positives from a database resource script that has not
yet completely started, the STARTUP_TIME configuration variable is
used. This variable is used differently by the database and listener
resources.
The database resource script exits as soon as the startup
method completes, but the "status" method checks whether the database
has been brought online less than STARTUP_TIME seconds ago and exits
with success if the STARTUP_TIME has not yet passed. This way, RHC
engine proceeds with starting the listener before the database is
completely up and has been probed, thereby speeding up the startup
process. The listener resource script simply sleeps for STARTUP_TIME
seconds after the "start" command exits successfully, then the RHC
resource begins to probe the listener.
Shutdown methods first initiate graceful shutdown with the sqlplus shutdown immediate command for the database and lsnrctl stop LSNR_NAME for the listener. If SHUTDOWN_TIME seconds pass, each resource will shut down itself more aggressively, beginning with a kill -9 to the database smon process first, then the rest of database processes, followed by the sqlplus shutdown abort command for the database and a kill -9 to the tnslsnr process.
Immediate Remediation Procedure
In many situations, when the database has crashed, it cannot be
restarted without database recovery. For example, if any database files
were in archive (backup) state when the database crashed, a simple
"startup" command would not start the database, and recovery of media
would be needed. The "start" method of the implemented solution would
attempt to recover the database if a failure was detected. In many
cases, this would eliminate the need for human intervention and shorten
the recovery time needed.
Improved Probing
Each resource is probed by its own method. The "status" script
for the database checks for the presence of database processes and
examines the open_mode column of v$database view to determine the state
of the database. Listener method checks for the tnslsnr process and
evaluates the return value of the tnsping command and the output of the lsnrctl status command.
It is largely assumed that monitoring the database and listener
independently is sufficient to evaluate the overall database service
state. However, the deep_probe method of the listener resource script,
which runs every DEEP_PROBE_INTERVAL seconds, monitors end-to-end
functionality. It establishes a connection to the database through the
listener and evaluates results. For security reasons, the deep_probe
method does not have valid database login credentials, but it evaluates
the "access denied" string from the database as success. After all, if
it retrieves the response from the database, it is safe to assume that
both listener and database are functioning correctly.
The deep_probe method is an exception in the way it uses the
recovery policy. If deep_probe detects a failure condition, the method
does not call the restart function, but rather exits with a failure
return value. It does not set the restart or ping-pong flags, because
it is unable to evaluate which component -- the database or the
listener -- has faulted; rather, it evaluates the complete service.
Deep_probe relies on the fact that, in the case of a single resource
failure, the RHC HA engine will restart all of the resources in a
service group. So, if deep_probe detects an outage, RHC will restart
all of the resources in the resource group or failover the entire
group, depending on the recovery policy defined.
Ability to Freeze Individual Resource Groups
In this implementation of RHC Oracle Service Agent, a resource
group can be "frozen" (i.e., all cluster actions on Oracle or listener
resources can be disabled). If the cluster calls a resource script with
the "stop" argument, a failure return value is returned when the
cluster is frozen, which disables all additional cluster actions. To
disable a resource group, the database or listener resource script must
be called with the "freeze" argument on the active cluster node. This
creates a flag file that is checked by both listener and database
resource scripts, and no action is taken by those scripts if the flag
file exists. To unfreeze the cluster, one of the resource scripts can
be called with "unfreeze" argument, or the flag file can be manually
removed.
If file systems or IP resources fail when the cluster is
frozen, the listener and database resource scripts will return a
failure condition from the "stop" script, which would result in the
resource group going into "failed" state. To re-enable that group, it
would then need to be unfrozen then disabled and re-enabled.
Conclusions
Using a lower cost solution to achieve high availability for
enterprise applications requires significant overheard of
implementation and deployment time and resources. While the
out-of-the-box solution that Red Hat Cluster Suite provides is not as
flexible and reliable as those provided by commercial high-availability
suites, most of the shortcomings of the RHC can be worked around and
very similar availability levels can be achieved with RHC, as with its
commercial counterparts, but at significantly lower cost.
It would have been a better solution to implement these
functionalities in RHC HA engine itself, as opposed to handling in
external scripts. However, to satisfy the strict availability
requirements of most organizations, it is possible to successfully
overcome these difficulties by moving the desired functionalities into
the resource-specific scripts. The additional features requests have
been forwarded to Red Hat, and it remains to be seen whether they will
implement new capabilities and configuration options in the next
release of Red Hat Cluster.
Although the methods described here are specific to Oracle
database and listener resources, these same methods could be applied to
any other application resource scripts. The Oracle database and
listener scripts are used to show how these methods could be applied to
improve the overall database service availability, and they could be
used on databases other than Oracle or completely different types of
applications.
References
Red Hat, 2006. Red Hat Cluster Suite: Configuring and Managing a Cluster. Retrieved on January 2, 2006. -- http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/
Sergey holds a M.S. in Applied Computer Systems Management from UMUC
and a B.S. in Computer Science from UMBC. He has extensive experience
in implementing, deploying, and maintaining high-availability solutions
using a variety of commercial and open source availability suites.
|