Saturday, January 9, 2010

Fault Monitoring of resource


I'm still in Senegal busy with the Sun Cluster for Oracle HA UAT.

(Pictures below are of the Novotel Dakar, a very nice, cozy hotel right beside the sea.)






Besides configuring the Oracle database for HA, we are also responsible for monitoring the customer's applications via Sun Cluster.

There are 2 ways to configure Fault Monitoring for Generic Data Service (GDS):
1. Port monitoring (default)
2. Probe command monitoring

Port monitoring is fairly straightforward. It assumes your application is listening on a particular port. If Sun Cluster detects that the port is down, it assumes the application has faulted and attempts to restart the resource automatically.

The application for this telco is pretty complicated. There are times when the port is still alive but the application has hung. This is exactly what happened here!

So port monitoring is not reliable in this case, at least not for this application.

We need to use probe command monitoring instead. The probe command requires us to write a shell script that returns an exit value such as 0 (successful), 100 (complete failure) or 201 (immediate failover).
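
To make the exit values concrete, here is a minimal sketch of what such a probe script could look like. It is not the script we used on site: HEALTH_CHECK and probe_app are hypothetical names, and the real check would be whatever command proves that the application itself is responding (a test transaction, for example) rather than just its port.

```shell
#!/bin/sh
# Sketch of a GDS probe script (illustrative only).
# HEALTH_CHECK is a hypothetical placeholder: point it at a command that
# exercises the application itself and exits 0 only when it is healthy.
HEALTH_CHECK=${HEALTH_CHECK:-true}

# Translate the health check result into the exit values described above:
#   0   = successful
#   100 = complete failure
# (201 would request an immediate failover; this sketch never returns it.)
probe_app() {
    if $HEALTH_CHECK >/dev/null 2>&1; then
        return 0
    fi
    return 100
}

# In the real probe script, finish with:
#   probe_app
#   exit $?
```

The check is kept in a function so that the mapping from "healthy or not" to an exit value lives in one place. Returning 201 is deliberately left out here, since when to force a failover depends on the application.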

Now, there is an issue: port monitoring is turned on by default. Adding a probe command does not turn it off, so both run side by side. As a result, even if the probe command returns 100, Sun Cluster still treats the resource as alive as long as the port is up.

This is no good. We need to disable port monitoring and rely totally on probe command monitoring.

How do we achieve that? Set the Network_aware extension property of the GDS resource to FALSE:

-x Network_aware=FALSE



