Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: FAB (standard) Sure Solution

1000096.1: Running POST on one domain may Dstop all other running domains on Sun Fire 15K systems with SMS 1.1
Previously Published As: 200128
Product: Sun Fire 15K Server

Running POST on one domain may Dstop all other running domains (see details below).

Impact

A software error on one domain (such as a heartbeat failure, panic timeout, or error-reset) can cause another domain to Dstop on Sun Fire 15K systems running SMS 1.1. This issue may manifest as POST running on one domain causing all other running domains to Dstop. While the occurrence is rare, the impact is platform-wide. Depending on domain configuration and applications, downtime can be several hours. This problem is intermittent and may be related to a domain sync operation on the centerplane (reset of unused ports).
"Running POST on one domain" means that the power-on self tests are executed on any domain in the system. POST runs when a domain is initially brought online, during a DR attach of a board (not currently supported), or as a recovery action performed by the SMS software to get a domain back up and running after a reboot, panic, or Dstop.

The AMX flow control error in the Dstop signature is the key message. The system will recover automatically via ASR (automatic system recovery). After recording the Dstop information, SMS restarts the domain(s).

Any SMS 1.1 installation without patch 112080-05 or later is susceptible to this problem. SMS 1.2 and higher are not affected by this issue.

The true cause of the problem is the AMX ASIC, which does not handle port resets correctly. The bug fix changes how POST performs the reset to ensure it is done safely.

A Dstop, or Domain Stop, occurs when the hardware detects an unrecoverable error. The ASICs in the system cease processing transactions as quickly as possible to prevent further corruption of data and to facilitate debugging. It also occurs during the centerplane reset of ports. The AMX has a problem with port resets that are not done under domain sync. Changing the reset so that it is done under domain sync causes the problem to go away.

Symptoms

A message in the platform message log (/var/opt/SUNWSMS/adm/platform/messages) would report:

    Jan 17 20:25:55 2002 swmtft901 hwad[22514]: [1156 1693005732870614 ERR InterruptHandler.cc 2127] Domain Stop interrupt detected, domain XXX

SMS then creates a Dstop dump file in /var/opt/SUNWSMS/adm/[XXX]/dump. The file name is dsmd.dstop.YYMMDD.hhmm.ss (dsmd.dstop.020117.2025.55 for this example). If this dump file is opened with "redx" and the "wfail" command is issued, the output below is reported. For example:

    sc% redx -cl
    redx> dumpf load dsmd.dstop.020117.2025.55
    redx> wfail
    ...output below...

The Dstop signature of this issue is as follows:

    SDI EX03/S0 Master_Stop_Status0[31:0] = 7004004F
        MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
    SDI EX03/S0 Dstop0[31:0] = 12018200
        Dstop0[16]: D DARB texp requests all Dstop (M)
        Dstop0[25]: D 1E AXQ requests all Dstop (M)
        Dstop0[28]: D Slot0 asserted Error, enabled to cause Dstop (M)
    AXQ EX03 ( 3) Error_Flag_02[31:0] = 04008400 Mask = 0000FFFF
        Err2[26]: D 1E AMX 0-3 hs flow control didn't arrive simultaneously
    FAIL EXB EX3: Dstop/Rstop detected by AXQ.
        Primary service FRU is EXB EX3.
    SDI EX04/S0 Master_Stop_Status0[31:0] = 0004000F
        MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
    SDI EX04/S0 Dstop0[31:0] = 02018200
        Dstop0[16]: D DARB texp requests all Dstop (M)
        Dstop0[25]: D 1E AXQ requests all Dstop (M)
    AXQ EX04 ( 4) Error_Flag_03[31:0] = 30009000 Mask = 21005EFF
        Err3[28]: D 1E AMX data ECC uncorrectable error
        Err3[29]: R AMX data ECC correctable error
    FAIL EXB EX4: Dstop/Rstop detected by AXQ.
        Primary service FRU is EXB EX4.

Resolution

An Authorized Enterprise Services Field Representative may avoid the above-mentioned issue by following the recommendations as shown below.

For a permanent fix, install SMS 1.1 Patch 112080-05 (or later), or upgrade to SMS 1.2. This patch is specifically for SMS 1.1, and is not tied to any one particular Solaris OS release.

References:

    BugId: 4505473 - AMX data ECC uncorrectable error.
    PatchId: 112080-05 - SMS 1.1: Patch IBIST for pause wafer change.
    ESC: 534366 - All domain down due to DSTOP.
    SunAlert: 42881

Previously Published As: 100299
Internal Eng Business Unit Group: SSG ES (Enterprise Systems)
Internal Kasp FAB Legacy ID: 100299, I0788-1 (FIN)
Internal SA-FAB Eng Submission: Running POST on one domain may Dstop all other running domains on Sun Fire 15K systems with SMS 1.1
Attachments: This solution has no attachment
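
The symptom strings and the patch level named in this article lend themselves to a quick scripted triage. The following Python sketch is illustrative only and is not part of SMS or redx: the function names are hypothetical, the marker strings are copied from the log message and the wfail output quoted in this article, and it assumes the platform log, the wfail output, and the list of installed patch IDs have already been saved as plain text by some other means.

```python
import re

# Marker strings copied from this article; the matching logic is an assumption.
DSTOP_LOG_MARKER = "Domain Stop interrupt detected"
AMX_FLOW_CONTROL_MARKER = "AMX 0-3 hs flow control didn't arrive simultaneously"

# Fix level stated in the Resolution section: SMS 1.1 patch 112080-05 or later.
FIX_PATCH_BASE = "112080"
FIX_PATCH_MIN_REV = 5


def dstop_log_lines(log_text):
    """Return the platform-log lines that report a Domain Stop interrupt."""
    return [line for line in log_text.splitlines() if DSTOP_LOG_MARKER in line]


def matches_amx_signature(wfail_text):
    """Return True if saved wfail output contains the AMX flow control error."""
    return AMX_FLOW_CONTROL_MARKER in wfail_text


def patch_is_sufficient(patch_ids):
    """Return True if any patch ID is 112080 at revision 05 or later."""
    for patch_id in patch_ids:
        match = re.fullmatch(r"(\d{6})-(\d{2})", patch_id.strip())
        if (
            match
            and match.group(1) == FIX_PATCH_BASE
            and int(match.group(2)) >= FIX_PATCH_MIN_REV
        ):
            return True
    return False
```

A Dstop line in the log combined with a matching wfail signature and an insufficient patch level would suggest this issue; any other combination calls for normal Dstop analysis. The sketch checks only the patch revision and does not distinguish SMS 1.1 from SMS 1.2, which the article states is not affected.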