Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure
Solution 1003319.1: Sun Fire[TM] 12K/15K/E20K/E25K: POST level increments with repeated error
Previously Published As: 204606
Applies to:
Sun Fire E20K Server
Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E25K Server
All Platforms

Goal
The POST level on a Sun Fire 12K/15K/E20K/E25K increments if an error repeats within a given timeframe. This is intended behavior: dsmd raises the POST level. However, manual intervention is still required to blacklist the component. If the component is not blacklisted, it continues to be included in the POST configuration. POST may eventually deconfigure the component, but this DOES NOT blacklist it. The user must add the component to the blacklist file manually.

Once the domain recovers and is booted, any subsequent error within a four-hour period is treated as a repeated error. After this four-hour period the domain is considered recovered and healthy.

NOTE: The level of the first POST run after the domain is considered "healthy" again depends on the operation that triggers it. If it is due to a reset, the POST run is a -Q (same as level 7). If it is due to an operator-initiated action, such as setkeyswitch, the level is defined by the contents of the .postrc file used. The default level is 16 if not specified in the .postrc file.
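The four-hour rule described above can be sketched as follows. This is only a minimal illustration in Python, not dsmd's actual logic: the function name `is_repeated_error`, the constant `REPEAT_WINDOW`, and the treatment of an error that lands exactly on the four-hour mark are all assumptions, and the timestamps in the usage lines are illustrative.

```python
from datetime import datetime, timedelta

# Window taken from the text above: four hours after the domain
# has recovered and booted.
REPEAT_WINDOW = timedelta(hours=4)


def is_repeated_error(booted_at: datetime, error_at: datetime) -> bool:
    """Return True if an error at error_at would count as a repeated error.

    Assumption: an error exactly at the four-hour mark is treated as
    outside the window; the article does not specify the boundary.
    """
    if error_at < booted_at:
        raise ValueError("error_at precedes booted_at")
    return error_at - booted_at < REPEAT_WINDOW


if __name__ == "__main__":
    booted = datetime(2002, 9, 29, 15, 50)  # illustrative boot time
    # Eight minutes after boot: inside the window, so a repeated error.
    print(is_repeated_error(booted, datetime(2002, 9, 29, 15, 58)))
    # More than four hours after boot: the domain counts as healthy again.
    print(is_repeated_error(booted, datetime(2002, 9, 29, 20, 0)))
```

The sketch only classifies an error as repeated or not. How far dsmd raises the POST level for a repeated error is not modeled here, because the article does not state the rule.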
Solution
The example shows excerpts from the POST logs and dstop dump(s) where a repeated error occurred within a given time period.

redxl> ld dsmd.dstop.020929.1549.03
Created Sun Sep 29 15:49:04 2002
By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15 executing as pid=21248
On ssc name = starcat_sc0
Domain = 2=C = starcat_domC Platform = platform_1
Boards in dump: master SC
  CPs/CSBs[1:0]: 3
  EXB[17:0]: 02010
  Slot0[17:0]: 02010
  Slot1[17:0]: 02010
-D option, -d "DSMD DomainStop Dump"
0 errors occurred while creating this dump.

redxl> wfail
SDI EX04/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX13/S0 Master_Stop_Status0[31:0] = 2004004F
  MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX13/S0 Dstop0[31:0] = 10019000
  Dstop0[16]: D DARB texp requests all Dstop (M)
  Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
EPLD SB13 Err1_Dom0: Mask= 00 Err= 80 1stErr= 80
  Err1[7]: 1E+ Error reported by BBC1
BBC SB13/BB1 Device_Err_Stat[31:0] = 80008100
  DevErr[ 8]: 1E Port 0 Safari device asserted error
Proc SB13/P2 (13.0.2) EmuShad[0:78] = 0020 00000000 00000000 (Note rev order)
  EmuSh[ 9]: THUE: Etag ECC UE due to other access (P$, W$, wrback...).
  AFSR [63:0] = 03900000.00000000 AFAR [42:4] = 1A3.9E821CE_
  AFSR2[63:0] = 01100000.00000000 AFAR2[42:4] = 1A3.9E821CE_
  AFSR[52]: 1E PRIV: Priviledged code access error(s) occurred.
  AFSR[55]: TUE: Uncorrectable Ecache tag ECC error.
  AFSR[56]: 1E TSCE: SW_handled Correctable Ecache tag ECC error.
  AFSR[57]: THCE: Hardware corrected Ecache tag ECC error.
FAIL Proc SB13/P2: Dstop detected by Proc SB13/P2.
  Primary service FRU is Slot SB13.
DARB C0: enabled ports (expanders) [17:0]: 07E3F
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 02000
DARB C1: enabled ports (expanders) [17:0]: 07E3F
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 02000

This error occurred four times before POST deconfigured the suspect component. The level 64 POST finally "deconfigured" (NOT blacklisted) it.
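The dump header and the DARB lines above print board masks in a FIELD[17:0] notation. Assuming the value is hexadecimal and that bit n corresponds to expander n (an assumption that is consistent with the EX04 and EX13 references in the wfail output, but is not stated in this article), such a mask can be decoded with a small helper. The function name `set_bits` is hypothetical and is not part of redx or SMS.

```python
from typing import List


def set_bits(mask_hex: str, width: int = 18) -> List[int]:
    """Return the bit positions that are set in a hex mask.

    width is the number of bits in the field, e.g. 18 for FIELD[17:0].
    Assumption: bit n maps to expander n.
    """
    value = int(mask_hex, 16)
    if value >> width:
        raise ValueError("mask has bits set beyond the field width")
    return [bit for bit in range(width) if value & (1 << bit)]


if __name__ == "__main__":
    print(set_bits("02010"))  # EXB / Slot0 / Slot1 mask from the dump header
    print(set_bits("02000"))  # DARB "other darb req Dstop+Rstop" mask
    print(set_bits("07E3F"))  # DARB enabled ports mask
```

Under that assumption, 02010 decodes to expanders 4 and 13, and the DARB request mask 02000 decodes to expander 13, which matches the SB13 board named in the FAIL line.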
Please note that, depending on the error, a lower POST level could have caught this, or a higher POST level may have been required to fail the component. Note also that CHS (starting from SMS 1.4.1) could have marked the CPU as faulted, which would keep the failed ASIC from being tested in subsequent POST runs. Finally, the excerpts below show the POST level changing across the POST runs.

NOTE: A "short" POST is the POST execution that dumps ASIC state for capture into the rstop or dstop dump file. Successful dump captures have an exit code of 85. Unsuccessful exit codes are 86 or 87, depending on the failure. Regardless of whether an rstop or a dstop occurs, a short POST log is generated from the capture of ASIC state. The only difference is that after a dstop the domain is rebooted, in which case a full (long) POST log should also appear. Therefore, in this scenario there are eight POST files, four of which were generated by capturing the ASIC state.

post020929.1549.04.log:# pid = 21248 level = 16 verbose_level = 20 --Short post
post020929.1550.13.log:# pid = 21409 level = 16 verbose_level = 20
post020929.1550.13.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 16
--Configured in 333 with 8 procs, 32.000 GBytes, 6 IO adapters.

post020929.1558.16.log:# pid = 22516 level = 16 verbose_level = 20 --Short post
post020929.1559.12.log:# pid = 22650 level = 16 verbose_level = 20
post020929.1559.12.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 16
--Configured in 333 with 8 procs, 32.000 GBytes, 6 IO adapters.

post020929.1607.21.log:# pid = 23770 level = 16 verbose_level = 20 --Short post
post020929.1608.22.log:# pid = 23913 level = 32 verbose_level = 20
post020929.1608.22.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 32
--Configured in 333 with 8 procs, 32.000 GBytes, 6 IO adapters.

post020929.1620.10.log:# pid = 25531 level = 16 verbose_level = 20 --Short post
post020929.1621.05.log:# pid = 25660 level = 64 verbose_level = 20
post020929.1621.05.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 64
--Configured in 333 with 7 procs, 28.000 GBytes, 6 IO adapters.

Again, please note that proc 418 (SB13/P2) was missing from showdevices and psrinfo output, but was NOT blacklisted.

Product
Sun Fire 15K Server
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 12K Server

Internal Section
Keywords: POST, level, alt_level, 12K, 15K, 20K, 25K
Previously Published As: 48395

Attachments
This solution has no attachment