Asset ID: 1-75-1373007.1
Update Date: 2012-03-26
Keywords:
Solution Type: Troubleshooting Sure Solution
1373007.1: VTL - Troubleshooting server failover issues in HA config
Related Items
- Sun StorageTek VTL Plus Storage Appliance
- Sun StorageTek VTL Storage Appliance
Related Categories
- PLA-Support>Sun Systems>TAPE>Virtual Tape>SN-TP: VTL
- .Old GCS Categories>Sun Microsystems>Storage - Tape>Tape Virtualization
In this Document
Purpose
Last Review Date
Instructions for the Reader
Troubleshooting Details
Collect Xrays from BOTH nodes
Verify failover was successful
Determine cause of failover
Correct any issues found in Step 3
Perform Failback
Perform RCA, if required
Implement any recommendations from RCA, if required
Applies to:
Sun StorageTek VTL Storage Appliance - Version: 4.0 Build 1221 to 4.0 Build 1221 [Release: 4.0 to 4.0]
Sun StorageTek VTL Plus Storage Appliance - Version: 1.0 - Build 1323 to 2.0 - Build 1656 [Release: 1.0 to 2.0]
Information in this document applies to any platform.
Purpose
To provide troubleshooting steps for a VTL HA configuration when one VTL node fails over to its failover partner.
Last Review Date
November 2, 2011
Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details
1. Collect Xrays from BOTH nodes
Refer to Doc ID 1199883.1 for instructions on collecting Xrays.
Note: The Xray from the failed server must be collected via the VTL server command line (it cannot be collected via the VTL Console GUI while the node is in failover status).
2. Verify failover was successful
Even though the VTL event log may report that the failover was successful, the best way to verify is to have the customer confirm that the Backup Application servers that were using the failed VTL node can still see their virtual drives and run backups through the failover partner (the surviving node). Alternatively, run "ifconfig -a" on the surviving node and verify that the failed node's virtual IP is listed (for example, look for "e1000g0:2: ..." on a VTL Plus 2.0 server); see the sketch below.
- If the failover was successful, go to Step 3.
- If the failover was not successful, reboot the failed VTL node to release its resources so that the failover partner can service all Backup Application servers.
Note: If the failover was successful, the case should not be treated as a Severity 1. This is the reason for having HA (failover) in place: to prevent a system-down situation.
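The following is a minimal sketch of that check from the shell of the surviving node. The interface name (e1000g0) and the virtual IP placeholder are examples only and will differ per site; substitute the failed node's actual service IP.

    # Run on the SURVIVING node.
    # After a successful failover, the failed node's virtual (service) IP
    # should appear as a logical interface alias such as e1000g0:2.
    ifconfig -a

    # Narrow the output to the logical aliases on the expected interface
    # (interface name is an example - adjust for the site):
    ifconfig -a | grep "e1000g0:"

    # Optionally confirm that the virtual IP answers on the network
    # (replace the placeholder with the failed node's virtual IP):
    ping <failed_node_virtual_IP>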
3. Determine cause of failover
- SAN/Network changes or problems?
- Ask the customer whether any network maintenance has been done recently.
- Review the Xray for messages such as "unable to communicate with failover partner", which indicate a network issue (more than likely a temporary one). Check whether the failed server reports "2(Ready)" for the failover status (sms -v); if it reports "2(Ready)", the issue was temporary. If this happens frequently, open a ticket with VTL support. See the sketch at the end of this step.
For an Oracle internal reference on adjusting timeout values, see Doc ID 1021690.1.
- Power issues?
A partial power outage at the site can cause the VTL to fail over. Check whether the failed server reports "2(Ready)" for the failover status (sms -v); if it reports "2(Ready)", the issue was temporary.
- Disk array issue?
Use the SANtricity Recovery Guru to check for errors.
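As a quick first pass on the failed node, the commands below sketch the two checks above. This assumes shell access to the failed node, that "sms -v" (referenced above) prints the failover status, and that the partner-communication messages also appear in the standard Solaris system log; adjust as needed for the site.

    # On the FAILED node: display the current failover status.
    # Look for "2(Ready)" in the output - it indicates the underlying
    # condition was temporary and the node is ready to resume its resources.
    sms -v

    # Scan the system log for partner-communication errors (the same
    # messages are captured in the Xray's copy of the logs):
    grep -i "failover partner" /var/adm/messages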
4. Correct any issues found in Step 3
Note: If no issue is identified, or if the issues found were only temporary, a reboot of the failed node is highly recommended.
5. Perform Failback
Refer to Doc ID 1013440.1 for detailed failback instructions.
Note: An RCA is sometimes requested before failing back. However, VTL performs many internal checks, and if it reports that it is ready to fail back ("2(Ready)"), in most cases the issue was only temporary and the failback can proceed without waiting for the RCA.
6. Perform RCA, if required
Step 3 may uncover only a surface cause rather than the root cause.
If necessary, open a case with VTL engineering to perform an RCA.
7. Implement any recommendations from RCA, if required
Attachments
This solution has no attachment