Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1373007.1
Update Date:2012-03-26
Keywords:

Solution Type  Troubleshooting Sure

Solution  1373007.1 :   VTL - Troubleshooting server failover issues in HA config  


Related Items
  • Sun StorageTek VTL Plus Storage Appliance
  • Sun StorageTek VTL Storage Appliance
Related Categories
  • PLA-Support>Sun Systems>TAPE>Virtual Tape>SN-TP: VTL
  • .Old GCS Categories>Sun Microsystems>Storage - Tape>Tape Virtualization




In this Document
  Purpose
  Last Review Date
  Instructions for the Reader
  Troubleshooting Details
     Collect Xrays from BOTH nodes
     Verify failover was successful
     Determine cause of failover
     Correct any issues found in Step 3
     Perform Failback
     Perform RCA, if required
     Implement any recommendations from RCA, if required


Applies to:

Sun StorageTek VTL Storage Appliance - Version: 4.0 - Build 1221 to 4.0 - Build 1221   [Release: 4.0 to 4.0]
Sun StorageTek VTL Plus Storage Appliance - Version: 1.0 - Build 1323 to 2.0 - Build 1656   [Release: 1.0 to 2.0]
Information in this document applies to any platform.

Purpose

To provide troubleshooting steps for a VTL HA configuration when one VTL node fails over to its failover partner.

Last Review Date

November 2, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details


  1. Collect Xrays from BOTH nodes

    Refer to doc id 1199883.1 for instructions on collecting Xrays.

    Note: The Xray from the failed server must be collected from the VTL server command line; it cannot be collected via the VTL Console GUI while the server is in failover status.

  2. Verify failover was successful

    Even though the VTL event log may report that failover was successful, the best way to verify is to have the customer check whether the Backup App servers that were using the failed VTL node can still see their virtual drives and run backups to the failover partner (surviving node).  Alternatively, run "ifconfig -a" on the good server and verify that the failed server's virtual IP is listed (e.g., look for "e1000g0:2: ..." on a VTL+ 2.0 server).
    • If failover was successful, go to step 3.
    • If failover was not successful, reboot the failed VTL node to release its resources, so that the failover partner can service all Backup App servers.

    Note: If failover was successful, the case should not be considered a Sev1.  Preventing a system-down situation is the reason HA (failover) is in place.
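
The "ifconfig -a" check in step 2 can be scripted when the output has been captured to a file or pasted from a terminal. The following is a minimal sketch, not part of the VTL tooling. It assumes the usual Solaris layout, in which each interface starts with a "name: flags=..." line followed by an indented "inet <address>" line. The function names and all addresses in the sample are hypothetical and used for illustration only.

```python
import re

# Assumed Solaris "ifconfig -a" layout: "<name>: flags=..." then "  inet <addr> ...".
HEADER = re.compile(r"^(\S+?):\s+flags=")
INET = re.compile(r"^\s+inet\s+(\S+)")

# Hypothetical sample output; the addresses are invented for illustration.
SAMPLE_IFCONFIG = """\
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.1.10 netmask ffffff00 broadcast 192.168.1.255
e1000g0:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.1.11 netmask ffffff00 broadcast 192.168.1.255
"""


def interface_addresses(ifconfig_output):
    """Map each interface name (including aliases such as e1000g0:2) to its IPv4 address."""
    addresses = {}
    current = None
    for line in ifconfig_output.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            addresses.setdefault(current, None)
            continue
        inet = INET.match(line)
        if inet and current is not None and addresses[current] is None:
            addresses[current] = inet.group(1)
    return addresses


def has_address(ifconfig_output, ip):
    """Return True if the given IP (the failed server's virtual IP) is plumbed on this node."""
    return ip in interface_addresses(ifconfig_output).values()


if __name__ == "__main__":
    # Run against the sample; replace with real output captured on the surviving node.
    print(has_address(SAMPLE_IFCONFIG, "192.168.1.11"))
```

A result of True for the failed server's virtual IP only confirms that the address moved to the surviving node; the customer-side check that backups still run remains the stronger verification.
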
  3. Determine cause of failover

    • SAN/Network changes or problems?
      • Ask the customer whether any network maintenance has been done recently.
      • Review the Xray for messages like "unable to communicate with failover partner", which indicate a network issue (more than likely a temporary one).  Check whether the failed server reports "2(Ready)" for failover status (sms -v).  If it reports "2(Ready)", the issue was temporary.  If this happens frequently, open a ticket with VTL support.
        For the Oracle internal reference on adjusting timeout values, see doc id 1021690.1.
    • Power issues?
      If there was a partial power outage at the site, it could cause VTL to fail over.  Check whether the failed server reports "2(Ready)" for failover status (sms -v).  If it reports "2(Ready)", the issue was temporary.
    • Disk array issue?
      Use SANtricity Recovery Guru to check for errors.
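
The two text checks in step 3 (the "2(Ready)" failover status and the "unable to communicate with failover partner" message) can be sketched as simple string searches over captured text. This is an illustrative sketch only. The exact layout of "sms -v" output and of Xray log lines is not described in this document, so the code searches for the literal strings quoted above and nothing else. The function names, the sample status line, and the sample log lines are hypothetical.

```python
READY_MARKER = "2(Ready)"
PARTNER_MESSAGE = "unable to communicate with failover partner"


def is_ready(sms_output):
    """True if the captured 'sms -v' text contains the 2(Ready) failover status."""
    return READY_MARKER in sms_output


def partner_comm_errors(log_lines):
    """Return the log lines that mention loss of communication with the failover partner."""
    return [line for line in log_lines if PARTNER_MESSAGE in line.lower()]


def summarize(sms_output, log_lines):
    """Combine both checks into one small report for the case notes."""
    hits = partner_comm_errors(log_lines)
    return {
        "ready": is_ready(sms_output),        # 2(Ready) suggests a temporary issue
        "network_suspected": bool(hits),      # partner-communication messages were found
        "matching_lines": hits,
    }


if __name__ == "__main__":
    # Hypothetical inputs; real text comes from 'sms -v' and the collected Xray.
    sample_sms = "Failover status: 2(Ready)"
    sample_log = [
        "Nov  2 10:15:01 vtl1 Unable to communicate with failover partner",
        "Nov  2 10:16:30 vtl1 some unrelated message",
    ]
    print(summarize(sample_sms, sample_log))
```

A report with "ready" set and partner-communication lines present matches the temporary network issue described above; a status other than "2(Ready)" is only a prompt to keep investigating, not a diagnosis.
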
  4. Correct any issues found in Step 3

    Note: If no issue is determined, or even if the issues found were temporary, a reboot of the failed node is highly recommended.

  5. Perform Failback

    Refer to doc id 1013440.1 for detailed failback instructions.

    Note: An RCA is sometimes requested before failing back.  However, VTL performs many checks of its own, and if VTL reports that it is ready to fail back ("2(Ready)"), in most cases the issue was only temporary and failback can proceed without waiting for the RCA.
  6. Perform RCA, if required

    Step 3 may uncover only the surface cause and not the root cause.
    If necessary, open a case with VTL engineering to perform RCA.

  7. Implement any recommendations from RCA, if required



Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.