Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type FAB (standard) Sure Solution 1000548.1 : If T3+ or SE6120 Master Controller Is Disabled, Multiple Battery Errors Might Occur Due to Alternate Master Having Incorrect Battery States
PreviouslyPublishedAs 200694

Product
Sun StorageTek T3+ Array
Sun StorageTek 6120 Array

Bug Id <SUNBUG: 6437076>, <SUNBUG: 6445913>, <SUNBUG: 6440106>

Impact

This issue generated customer cases and escalations because of inconsistent battery warranty dates, and in some cases batteries were replaced unnecessarily. The battery state may remain "charge" and the volume cache mode may be set to "writethrough", which can degrade host I/O performance.

Symptoms

Several battery-related issues can be observed under the same general situation: the master controller (typically u1ctr) failed or was disabled for unrelated reasons (a hardware fault, a firmware bug, or the user command "disable u1"), and the alternate master controller took over.

Case A. T3+ or 6120 with firmware 3.1.x:

While the array was working without issue, one or more batteries were replaced online. Since then, the array has never been reset, or specifically the alternate master controller has never been reset (i.e. disable u2ctr; enable u2ctr). The master controller later failed or was disabled, and the alternate master took over. The affected array may show some or all of the following symptoms:

Symptom (a) "refresh -s" shows Warranty Expiration and Last Health Check dates that differ from those shown before the master controller failed or was disabled.

Symptom (b) "refresh -s" and "fru stat" may show that one or more batteries have failed, and "syslog" may show the reason as "idle life exceeded":

HBTT[2]: E: BATTERY: u1b1 - battery idle life exceeded.
HBTT[2]: E: BATTERY: u1b2 - battery idle life exceeded.
HBTT[2]: E: BATTERY: u2b1 - battery idle life exceeded.
HBTT[2]: E: BATTERY: u2b2 - battery idle life exceeded.

Symptom (c) If, before the master controller failed or was disabled, the batteries had completed a drain/recharge cycle (either through a scheduled battery refresh or "refresh -c"), then after the alternate master takes over, "refresh -s" may show the battery state as "charge" while ".bat -s u?b?" shows "Charger is off".
The batteries' "charge" state does not return to "normal". If all batteries in a tray have the state "charge", the firmware also sets the volume cache mode to "writethrough", which may impact host I/O performance.

Case B. T3+ (not 6120) with firmware 3.2.3:

While the array was working without issues, one or more batteries were replaced online. Since then, the array has never been reset, or specifically the master controller (not the alternate master) has never been reset (i.e. disable u1ctr; enable u1ctr). The master controller later failed or was disabled, and the alternate master took over. The affected array may show the following symptom:

Symptom (d) "refresh -s" shows a Warranty Expiration time that differs from the one shown before the master controller failed or was disabled, and the difference equals the array's timezone offset in hours (as previously set by the command "set timezone"). This warranty expiration date/time remains the same across subsequent array or controller resets, indicating that the date/time shown before the master controller failure was actually the wrong one.

In this scenario, Symptoms (a) and (b) of Case A do not occur, but Symptom (c), where the battery state remains "charge", may still occur.

Note: StorEdge 6120 with firmware 3.2.3 exhibits none of the symptoms.

Use the command "ver" to display the firmware version. Note that only arrays with 2 controllers in partner pair configurations are affected. Use the "refresh -s" command to verify battery states and the Warranty Expiration date. Use the password-protected service command ".bat -s u?b?" on the alternate master's serial console to verify the battery FRU information in the alternate master's memory.

Root Cause

There were 3 separate bugs discovered: CR6437076, CR6440106, CR6445913.
When a battery is replaced, the firmware updates the FRU ID seeprom with the Warranty Start date and Warranty Expiration date. This is done by the active controller (master); the inactive controller (alternate master) is unaware of the change. In firmware 3.1.x, if the master controller failed for some unrelated reason and the alternate master took over, the alternate master did not re-read the FRU ID seeprom of the PCU and continued to use the outdated information it had held in memory since it booted. In some cases, the array had been running for more than 8 months since the battery replacement without a reset, so the alternate master may have retained a Last Recharge date (same date/time as Last Health Check) that was more than 8 months old. Upon master controller failure, the alternate master concluded that the battery had not been recharged for more than 8 months and "failed" the battery due to "idle life exceeded".

Resolution: With firmware 3.2.3, both T3+ and 6120 re-read the FRU ID seeprom when the alternate master takes over as master. User commands such as "refresh", which can only be issued from the active master, therefore show the correct information.
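The stale-date failure described above can be illustrated with a minimal Python sketch. It is not firmware code: the function name, the example dates, and the approximation of "8 months" as 240 days are all assumptions made for illustration only.

```python
from datetime import datetime, timedelta

# Assumption: the text says "more than 8 months"; 240 days is an
# illustrative approximation, not the firmware's actual threshold.
IDLE_LIFE_LIMIT = timedelta(days=8 * 30)


def idle_life_exceeded(last_recharge, now, limit=IDLE_LIFE_LIMIT):
    """Return True if the battery appears not to have been recharged within the limit."""
    return now - last_recharge > limit


# Hypothetical dates for illustration.
now = datetime(2006, 7, 1)
seeprom_last_recharge = datetime(2006, 6, 1)  # current value in the FRU ID seeprom
stale_in_memory = datetime(2005, 8, 1)        # value the alternate master cached at boot

# Firmware 3.1.x behaviour as described: the alternate master trusts its stale copy.
fails_with_stale_copy = idle_life_exceeded(stale_in_memory, now)

# Firmware 3.2.3 behaviour as described: the seeprom is re-read on takeover.
fails_after_reread = idle_life_exceeded(seeprom_last_recharge, now)

print(fails_with_stale_copy, fails_after_reread)
```

With these example dates the stale copy trips the check while the re-read value does not, which matches the difference between 3.1.x and 3.2.3 described in the Resolution.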
During a refresh cycle, the battery states on both controllers are updated as the cycle progresses. On a 6120 with 3.2.3, both the master and the alternate master show the "normal" state at the end of a refresh. However, on a T3+ with 3.1.x through 3.2.3, the alternate master's battery state is left at "charge" after the refresh while the master shows "normal". If a refresh cycle completed (through a scheduled refresh or the user command "refresh -c") and the master controller later failed and the alternate master took over, the battery state becomes "charge" because of the wrong state in the alternate master's memory. The state then persists, because no actual recharge is in progress and so no charge-completion event occurs to return it to "normal".

Resolution: in progress
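As a rough illustration of the state divergence described above, the following Python sketch models the two controllers' in-memory battery state. The class, the function names, and the `alt_left_in_charge` flag are hypothetical; this is a toy model of the described behaviour, not a representation of the firmware.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    name: str
    battery_state: str = "normal"  # this controller's in-memory view


def complete_refresh(master, alternate, alt_left_in_charge):
    """Model the end of a drain/recharge cycle as seen by each controller."""
    master.battery_state = "normal"
    # Described bug: the alternate master is left at "charge" after the refresh.
    alternate.battery_state = "charge" if alt_left_in_charge else "normal"


def reported_state_after_failover(alternate):
    """Once the master is disabled, the alternate's in-memory view is what gets reported."""
    return alternate.battery_state


def forces_writethrough(tray_battery_states):
    """Per the text: if all batteries in a tray are in "charge", cache mode becomes writethrough."""
    return bool(tray_battery_states) and all(s == "charge" for s in tray_battery_states)


u1, u2 = Controller("u1"), Controller("u2")
complete_refresh(u1, u2, alt_left_in_charge=True)   # affected behaviour
affected_state = reported_state_after_failover(u2)

u1, u2 = Controller("u1"), Controller("u2")
complete_refresh(u1, u2, alt_left_in_charge=False)  # unaffected behaviour
healthy_state = reported_state_after_failover(u2)

print(affected_state, forces_writethrough([affected_state, affected_state]))
print(healthy_state, forces_writethrough([healthy_state, healthy_state]))
```

In the affected case nothing in the model ever moves the alternate master's state back to "normal", mirroring the point that no charge-completion event occurs after failover.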
Inserting a new PCU/Battery FRU (with blank dates) triggers the firmware to update FRU information such as the Warranty Start/Expiration dates. The dates are stored in UTC in the FRU ID seeprom. On T3+ with 3.2.3, local time was apparently written to the FRU ID seeprom instead. As long as the master controller stays alive, "refresh -s" shows the expected Warranty Expiration (in local time). However, once the master controller fails, the array is reset, or the PCU is unplugged and re-inserted, the FRU ID is re-read and the "wrong" time that was written becomes apparent. Since a timezone offset can be at most about 12 to 13 hours, this has only a minor impact on the actual warranty expiration calculation. Also, since batteries remain usable under firmware 3.x after the warranty has expired (unlike firmware 2.x, where they were failed), there is no real risk to the operation of the array.

Resolution: in progress

Workaround
# On the master controller
t3b:/:<1>sun
Password:
sun: commands enabled
t3b:/:<2>.bat -s u1b1
t3b:/:<3>.bat -s u1b2
t3b:/:<4>.bat -s u2b1
t3b:/:<5>.bat -s u2b2

# On the alternate master controller
t3b::<1>sun
Password:
sun: commands enabled
t3b::<2>.bat -s u1b1
t3b::<3>.bat -s u1b2
t3b::<4>.bat -s u2b1
t3b::<5>.bat -s u2b2

Compare the output from both controllers. If any battery does not have the same Warranty Start or Expiration dates on both controllers, this array will hit the issues described here if the master controller is disabled at any time in the future. To preempt the issue, "reset" the array, or "disable/enable" the alternate master controller.

There is no practical workaround to prevent the batteries from showing a "charge" state on a T3+ when the master controller is disabled. If the array runs long enough, a scheduled refresh will have run and left the alternate master with the "charge" state. It is possible to manually disable/enable the alternate master controller after a scheduled refresh to correct the battery state, but this must be done approximately every 28 days (to the nearest weekday specified in "bat.conf") per the battery refresh schedule, and each time it causes LUN and path failovers that are visible to hosts.

REACTIVE MEASURES:

If the master controller was disabled and some batteries show the state "failed" or "charge", DO NOT replace any batteries. The batteries are merely showing the wrong state and FRU ID information, which can be rectified by resetting the active controller. Since this problem only occurs after the previous master controller failed, you will need to fix that first. If the controller was disabled because of a hardware failure, replace the controller.
If it was a controller crash due to some other firmware bug, re-enable it:

t3b:/:<1>enable u1

If downtime for the array is allowed, simply reset the entire array:

t3b:/:<2>reset -y

u1 would come up as the master and u2 as the alternate master, and both controllers, because of the reset, would have re-read all FRU IDs and have the correct information in memory.

If downtime is not allowed, reset just the controller:

t3b:/:<2>disable u2

u1 would then take over as master; since u1 was recently replaced or rebooted, it has the correct FRU ID information. The telnet session will drop. Log in to the array again and re-enable u2:

t3b:/:<1>enable u2

u2 should boot up and consequently re-read the FRU ID.

Resolution

On a 6120, update firmware to 3.2.3 or later.
On a T3+, update firmware to at least 3.2.3 to resolve CR6437076. Apply the workaround detailed in the Workaround section for the remaining problems.

Previously Published As 102525

Internal Comments

DO NOT attempt to use the ".bat -i u?b?" command to reset the battery dates. This has the undesirable effect of changing the actual warranty expiration (extending it for 2 years from today), to which the customer is not entitled. Worse yet, this command may cause the battery to be marked "failed" due to "shelf life expired", since some battery FRUs may have been manufactured more than 2 years ago.

If the battery states are "charge", the failed master controller cannot be re-enabled (as in a real hardware failure), AND a replacement is not immediately available, the remaining controller still needs to be "reset" as soon as possible, since the cache mode may be set to "writethrough". Since there is no path/controller redundancy, downtime for the array is needed in order to reset the only remaining controller.
Internal Contributor/submitter [email protected]
Internal Eng Business Unit Group KE Authors
Internal Eng Responsible Engineer [email protected]
Internal Services Knowledge Engineer [email protected]
Internal Kasp FAB Legacy ID 102525

Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date: 2006-07-28
Avoidance: Workaround
Responsible Manager: null
Original Admin Info: null

Product_uuid
2a714b10-0a18-11d6-86e2-d56b387d4fbf|Sun StorageTek T3+ Array
2cd2e7d2-2980-11d7-9c3f-c506fe37b7ef|Sun StorageTek 6120 Array

Attachments
This solution has no attachment