VTL - Failover and failback issue due to timeout

Asset ID:	1-72-1310404.1
Update Date:	2011-05-11
Keywords:

Solution Type: Problem Resolution Sure

Solution 1310404.1 : VTL - Failover and failback issue due to timeout

Applies to:

Sun StorageTek VTL Plus Storage Appliance - Version: 2.0 - Build 1656 and later [Release: 2.0 and later ]
Information in this document applies to any platform.

Symptoms

VTL node1 failed over to node2, but node1 did not become Ready for failback.

Node2 panicked, causing a complete outage.

Possible mutual failover situation

Changes

The issue occurred after the VTL Get Well Plan (GWP) was applied, but the GWP does not appear to have caused these events.

Cause

When both nodes boot at the same time, a VTL node can start taking over its partner node before it has finished loading its own resources.

Solution

This issue is resolved by introducing a delay in the startup sequence, which allows the self-monitoring module to finish loading resources before the node takes over its partner.

To add the delay, modify the "ipstorfm.sh" script in /usr/local/vt/bin by adding the two lines marked "added line" in the code segment below:

...
# check if ipstorfm is running already
# if it is, return with an error
APID=`$IS_BIN/pidof ipstorfm`
NUM_P=`echo $APID | awk 'BEGIN{} {print NF}'`
if [ $NUM_P -ne 0 ]
then
      RET=1
else
      logger -p daemon.notice Sleeping 500 seconds before starting FM.     # <<< added line >>>
      sleep 500                                                            # <<< added line >>>
      $IS_BIN/ipstorfm $2&
      sleep 1
      APID=`$IS_BIN/pidof ipstorfm`
      NUM_P=`echo $APID | awk 'BEGIN{} {print NF}'`
      if [ $NUM_P -eq 0 ]
      then
            RET=1
      else
            RET=0
      fi
fi
...
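
Before editing, it may help to confirm whether the delay is already present and to keep a copy of the original script. The following is a minimal sketch only: the function names has_fm_delay and backup_once and the ".orig" backup suffix are hypothetical and are not part of the VTL software. It assumes the delay is written exactly as "sleep 500", as in the code segment above.

```shell
# Hypothetical helpers for checking ipstorfm.sh before editing it.
# They are illustrative only and are not shipped with the VTL software.

# Succeeds (exit status 0) if the file named in $1 contains a "sleep 500" line.
# The logger message ("Sleeping 500 seconds ...") does not match this pattern.
has_fm_delay() {
    grep 'sleep 500' "$1" >/dev/null 2>&1
}

# Copies the file named in $1 to "$1.orig" unless that backup already exists,
# so that repeated runs never overwrite the first saved copy.
backup_once() {
    [ -f "$1.orig" ] || cp -p "$1" "$1.orig"
}

# Example use (commented out so that sourcing this file changes nothing):
#   SCRIPT=/usr/local/vt/bin/ipstorfm.sh
#   if has_fm_delay "$SCRIPT"; then
#       echo "delay already present"
#   else
#       backup_once "$SCRIPT" && echo "backup saved; add the two lines"
#   fi
```

The check looks only for the presence of the sleep line, not for its position in the script, so the placement of the two lines should still be confirmed by eye.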

NOTE: In a failover configuration, it is also recommended that the nodes be rebooted one at a time rather than simultaneously, which also avoids this situation.

Attachments

This solution has no attachment