Asset ID: 1-75-1373007.1
Update Date: 2012-03-26
Keywords:
Solution Type: Troubleshooting Sure Solution
1373007.1: VTL - Troubleshooting server failover issues in HA config
Related Items
- Sun StorageTek VTL Plus Storage Appliance
- Sun StorageTek VTL Storage Appliance
Related Categories
- PLA-Support>Sun Systems>TAPE>Virtual Tape>SN-TP: VTL
- .Old GCS Categories>Sun Microsystems>Storage - Tape>Tape Virtualization
In this Document
Purpose
Last Review Date
Instructions for the Reader
Troubleshooting Details
Collect Xrays from BOTH nodes
Verify failover was successful
Determine cause of failover
Correct any issues found in Step 3
Perform Failback
Perform RCA, if required
Implement any recommendations from RCA, if required
Applies to:
Sun StorageTek VTL Storage Appliance - Version: 4.0 Build 1221 to 4.0 Build 1221 [Release: 4.0 to 4.0]
Sun StorageTek VTL Plus Storage Appliance - Version: 1.0 - Build 1323 to 2.0 - Build 1656 [Release: 1.0 to 2.0]
Information in this document applies to any platform.
Purpose
To provide troubleshooting steps for a VTL HA configuration when one VTL node fails over to its failover partner.
Last Review Date
November 2, 2011
Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details
1. Collect Xrays from BOTH nodes
Refer to Doc ID 1199883.1 for instructions on collecting Xrays.
Note: The Xray from the failed server must be collected via the VTL server command line (it cannot be collected via the VTL Console GUI while the node is in failover status).
2. Verify failover was successful
Even though the VTL event log may report that the failover was successful, the best way to verify is to have the customer confirm that the Backup Application servers that were using the failed VTL node can still see their virtual drives and run backups through the failover partner (the surviving node). Alternatively, run "ifconfig -a" on the surviving node and verify that the failed node's virtual IP is listed (for example, look for "e1000g0:2: ..." on a VTL Plus 2.0 server); see the sketch below.
- If the failover was successful, go to Step 3.
- If the failover was not successful, reboot the failed VTL node to release its resources so that the failover partner can service all Backup Application servers.
Note: If the failover was successful, the case should not be treated as a Severity 1. This is the reason for having HA (failover) in place: to prevent a system-down situation.
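The following is a minimal sketch of that check from the shell of the surviving node. The interface name (e1000g0) and the virtual IP placeholder are examples only and will differ per site; substitute the failed node's actual service IP.

    # Run on the SURVIVING node.
    # After a successful failover, the failed node's virtual (service) IP
    # should appear as a logical interface alias such as e1000g0:2.
    ifconfig -a

    # Narrow the output to the logical aliases on the expected interface
    # (interface name is an example - adjust for the site):
    ifconfig -a | grep "e1000g0:"

    # Optionally confirm that the virtual IP answers on the network
    # (replace the placeholder with the failed node's virtual IP):
    ping <failed_node_virtual_IP>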
3. Determine cause of failover
- SAN/Network changes or problems?
- Ask the customer whether any network maintenance has been done recently.
- Review the Xray for messages such as "unable to communicate with failover partner", which indicate a network issue (more than likely a temporary one). Check whether the failed server reports "2(Ready)" for the failover status (sms -v); if it reports "2(Ready)", the issue was temporary. If this happens frequently, open a ticket with VTL support. See the sketch at the end of this step.
For an Oracle internal reference on adjusting timeout values, see Doc ID 1021690.1.
- Power issues?
A partial power outage at the site can cause the VTL to fail over. Check whether the failed server reports "2(Ready)" for the failover status (sms -v); if it reports "2(Ready)", the issue was temporary.
- Disk array issue?
Use the SANtricity Recovery Guru to check for errors.
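As a quick first pass on the failed node, the commands below sketch the two checks above. This assumes shell access to the failed node, that "sms -v" (referenced above) prints the failover status, and that the partner-communication messages also appear in the standard Solaris system log; adjust as needed for the site.

    # On the FAILED node: display the current failover status.
    # Look for "2(Ready)" in the output - it indicates the underlying
    # condition was temporary and the node is ready to resume its resources.
    sms -v

    # Scan the system log for partner-communication errors (the same
    # messages are captured in the Xray's copy of the logs):
    grep -i "failover partner" /var/adm/messages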
4. Correct any issues found in Step 3
Note: If no issue is identified, or if the issues found were only temporary, a reboot of the failed node is highly recommended.
5. Perform Failback
Refer to Doc ID 1013440.1 for detailed failback instructions.
Note: An RCA is sometimes requested before failing back. However, VTL performs many internal checks, and if it reports that it is ready to fail back ("2(Ready)"), in most cases the issue was only temporary and the failback can proceed without waiting for the RCA.
6. Perform RCA, if required
Step 3 may uncover only a surface cause rather than the root cause.
If necessary, open a case with VTL engineering to perform an RCA.
7. Implement any recommendations from RCA, if required
Attachments
This solution has no attachment