Friday, July 10, 2009

Total Fibre Cables Failure Test on Sun Cluster

We ran a UAT a few days ago with our national healthcare customer here in Singapore.

They run SAP with Oracle on Sun Cluster 3.2. The 2 nodes share a common storage, an SL500, connected via 2 pairs of fibre cables.

One of the tests was to ensure that if the 2 fibre cables connected to a node are accidentally unplugged, the resources running on that node fail over to the other active node (which still has an FC connection to the storage).

As this is a migration job for a hardware upgrade, they wanted to reuse the UAT test cases written by another vendor. (I really do not like this arrangement.)

Nevertheless, we went ahead, but got stuck on this Total Fibre Cables Failure Test.

Instruction:

• Unplug both fibre cables from node nodeA and run “vxdctl enable” to rescan the devices, plug both cables after test.

Expected Results

• There will be error message on node nodeA's console showing link failure of both FC HBA port. After some time, the resource group, oracledb-rg, will failover to node nodeB.


We kept testing, but the resource group simply refused to fail over. We did, however, see the link failure error message.

We debugged and we discussed. We wanted to know what the actual expected result was. The customer then told us that he had actually seen nodeA reboot when the FC cables were pulled from nodeA, and that this reboot was what caused the resource group to fail over to nodeB.

Now, this is simple!

root@nodeA # clnode show

=== Cluster Nodes ===
Node Name: nodeA
Node ID: 1
Enabled: yes
reboot_on_path_failure: disabled

Node Name: nodeB
Node ID: 2
Enabled: yes
reboot_on_path_failure: disabled


reboot_on_path_failure was set to "disabled", which means that no action is taken even when there is an FC path failure. (In other words, nodeA will not reboot, and if nodeA does not reboot, the resource group will not fail over.)

The fix is simple: enable the property on both nodes.

root@nodeA # clnode set -p reboot_on_path_failure=enabled nodeA nodeB
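To double-check the setting on every node, the "clnode show" output can be filtered with a few lines of shell. This is only a rough sketch: the function name nodes_without_reboot is made up for illustration, and the sample text is the pre-change output shown earlier, so it lists both nodes. On a live cluster you would pipe "clnode show" into the function, and after the fix it should print nothing.

```shell
#!/bin/sh
# Hypothetical helper (not part of Sun Cluster): reads `clnode show`-style
# text on stdin and prints every node whose reboot_on_path_failure
# property is anything other than "enabled".
nodes_without_reboot() {
    awk '
        /Node Name:/              { node = $NF }
        /reboot_on_path_failure:/ { if ($NF != "enabled") print node }
    '
}

# Sample input copied from the pre-change output in this post.
# On a real cluster, use:  clnode show | nodes_without_reboot
sample='Node Name: nodeA
Node ID: 1
Enabled: yes
reboot_on_path_failure: disabled

Node Name: nodeB
Node ID: 2
Enabled: yes
reboot_on_path_failure: disabled'

printf '%s\n' "$sample" | nodes_without_reboot
```

An empty result means every node listed has the property enabled.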


In addition, we need to make sure the local disk is set to unmonitored.


root@nodeA # cldev status

=== Cluster DID Devices ===

Device Instance        Node     Status
---------------        ----     ------
/dev/did/rdsk/d1       nodeA    Unmonitored
/dev/did/rdsk/d10      nodeA    Ok
                       nodeB    Ok



The local disk here is "/dev/did/rdsk/d1".
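The same kind of filter can flag local disks that are still monitored. This sketch rests on a heuristic that is my assumption, not a Sun Cluster rule: a DID device listed with only one node is treated as a local disk. The function name monitored_local_disks is also made up. With the output above it prints nothing, because d1 is already unmonitored and d10 has paths from both nodes.

```shell
#!/bin/sh
# Hypothetical helper: reads `cldev status`-style text on stdin and prints
# "device node" for every DID device that has a single path (assumed here
# to mean a local disk) and whose status is not "Unmonitored".
monitored_local_disks() {
    awk '
        /^\/dev\/did\// {
            # A new device starts; report the previous one if it qualifies.
            if (dev != "" && paths == 1 && status != "Unmonitored") print dev, node
            dev = $1; node = $2; status = $3; paths = 1
            next
        }
        # A "node status" continuation line is another path to the same device.
        dev != "" && NF == 2 { paths++ }
        END {
            if (dev != "" && paths == 1 && status != "Unmonitored") print dev, node
        }
    '
}

# Sample input copied from the output in this post.
# On a real cluster, use:  cldev status | monitored_local_disks
sample='=== Cluster DID Devices ===

Device Instance Node Status
--------------- ---- ------
/dev/did/rdsk/d1 nodeA Unmonitored
/dev/did/rdsk/d10 nodeA Ok
 nodeB Ok'

printf '%s\n' "$sample" | monitored_local_disks
```

Any device it does report could then be unmonitored by hand, for example with "cldevice unmonitor -n nodeA /dev/did/rdsk/d1". Check the disk really is local before doing so.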



If only we could write our own UAT document .... *sigh*
