Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure
Solution 1012043.1: Processor may be Incorrectly Offlined When it Encounters a UCC + ME bit set in AFSR
Previously Published As 216501

Description
A single UCC event with the Multiple Error (ME) bit set in the Asynchronous Fault Status Register (AFSR) causes the reporting processor to be offlined. However, if this is a single event, it can be treated as a single bit flip on one SRAM. No hardware replacement is recommended (ref. BugID# 4875077).

Susceptible versions of the Solaris[TM] Operating System: 8 and 9.

BugID# 4875077 has been fixed in the following Solaris OS patch releases, and all later revisions: Solaris[TM] 8 OS PatchID# 108528-29 + 117000-01.

Steps to Follow
Resolution
From the example logs below:
Per BugID# 4740769 entitled "US-III cpus should be offlined after multiple correctable E$ ECC events", for UCC+ME bits set in AFSR :
However, due to the nature of the trap handling involved, a single UCC event can be detected in the Solaris[TM] OS as a UCC+ME. BugID# 4875077 describes this behavior. What we see in the logs is a bug, and it should be treated as a case of a single bit flip on the SRAM chip. No hardware replacement is recommended.

Sep 21 04:26:18 loneqresdbp1 SUNW,UltraSPARC-III+: [ID 322949 kern.info] NOTICE: [AFT0] First Error UCC Event detected by CPU10 in User mode at TL=0, errID 0x00002ca7.3a89e5f0
^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 21 04:26:18 mydomain AFSR 0x00000400.000001ca AFAR 0x00000000.f1141a10
Sep 21 04:26:18 mydomain Fault_PC 0x141a08 Esynd 0x01ca /N0/SB2/P2/E1 J6300
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 455778 kern.info] [AFT0] errID 0x00002ca7.3a89e5f0 Data Bit 76 was in error and corrected
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 408847 kern.info] [AFT2] errID 0x00002ca7.3a89e5f0 PA=0x00000000.f1141a00
Sep 21 04:26:18 mydomain E$tag 0x00000003.c4001249 E$state_0 Shared
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x10bfffd5.030063b7 0x9de3bfa0.80a62000 ECC 0x04b
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x02800034.01000000 0xd0062024.80a22000 ECC 0x172
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x02800032.030063b7 0xd0062014.80a22000 ECC 0x0f5
Sep 21 04:26:18 loneqresdbp1 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x22800004.90102000 0xd0062014.90222070 ECC 0x183
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 335345 kern.info] [AFT2] I$ data not available
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 173316 kern.info] NOTICE: [AFT0] UCC Event detected by CPU10 in User mode at TL=0, errID x00002ca7.3a89e5f0
^^^^^^^^^^^^^^^^^^^^^^^^
Sep 21 04:26:30 mydomain AFSR 0x00200400.000001ca AFAR 0x00000000.f1141a10
Sep 21 04:26:30 mydomain Fault_PC 0x141a08 Esynd 0x01ca /N0/SB2/P2/E1 J6300
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 455778 kern.info] [AFT0] errID 0x00002ca7.3a89e5f0 Data Bit 76 was in error and corrected
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 408847 kern.info] [AFT2] errID 0x00002ca7.3a89e5f0 PA=0x00000000.f1141a00
Sep 21 04:26:30 mydomain E$tag 0x00000003.c4001249 E$state_0 Shared
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x10bfffd5.030063b7 0x9de3bfa0.80a62000 ECC 0x04b
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x02800034.01000000 0xd0062024.80a22000 ECC 0x172
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x02800032.030063b7 0xd0062014.80a22000 ECC 0x0f5
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x22800004.90102000 0xd0062014.90222070 ECC 0x183
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 335345 kern.info] [AFT2] I$ data not available
Sep 21 04:26:31 mydomain SUNW,UltraSPARC-III+: [ID 489146 kern.notice] NOTICE: [AFT1] CPU10 offlined due to UCC Event with ME set

Blurb on errID: errID is the value of the CPU's %stick register, which is read by the high-resolution timer function called hirestime in Solaris, so events are coincident with time. It attempts to uniquely identify errors by attaching them to this high-resolution timer value. Note, however, that due to the nature of traps and the order in which they are handled and reported, errID is not always useful for ordering errors chronologically.
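The two AFSR values in the example logs can be compared bit by bit. The following is a minimal Python sketch, not an official tool. The bit positions it uses are assumptions inferred from the logs themselves: the second event, which is the one reported "with ME set", differs from the first only by 0x00200000 in the upper word, both events carry 0x00000400 in the upper word, and the low nine bits equal the Esynd value printed in the log. Confirm the positions against the processor documentation before relying on them. The names parse_afsr and decode_afsr are hypothetical.

```python
# Bit masks below are ASSUMPTIONS inferred from the two log samples,
# not values stated in this document.
AFSR_ME = 1 << 53          # 0x00200000 in the upper 32-bit word
AFSR_UCC = 1 << 42         # 0x00000400 in the upper 32-bit word
AFSR_ESYND_MASK = 0x1FF    # low nine bits, matches "Esynd 0x01ca"


def parse_afsr(text):
    """Convert '0xHHHHHHHH.LLLLLLLL' (as printed in the log) to a 64-bit int."""
    high, low = text.split(".")
    return (int(high, 16) << 32) | int(low, 16)


def decode_afsr(text):
    """Return the assumed UCC and ME flags and the ECC syndrome field."""
    value = parse_afsr(text)
    return {
        "ucc": bool(value & AFSR_UCC),
        "me": bool(value & AFSR_ME),
        "esynd": value & AFSR_ESYND_MASK,
    }


if __name__ == "__main__":
    # AFSR values copied from the 04:26:18 and 04:26:30 log entries.
    for afsr in ("0x00000400.000001ca", "0x00200400.000001ca"):
        print(afsr, decode_afsr(afsr))
```

Under these assumptions both entries decode to the same syndrome, and only the second has the ME flag set. Together with the identical errID and AFAR in the logs, this is consistent with one event being reported twice.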
Product
Netra 1280 Server
Netra 1290 Server
Sun Fire V1280 Server
Sun Fire 6800 Server
Sun Fire 4810 Server
Sun Fire 4800 Server
Sun Fire 3800 Server
Sun Fire 15K Server
Sun Fire 12K Server
Sun Fire V480 Server
Sun Fire V440 Server
Sun Fire V880 Server
Netra 440 Server
Sun Fire E4900 Server
Sun Fire E6900 Server
Sun Fire E2900 Server

Internal Comments
BugID #'s:
Note that bug 4875077 details a potential workaround - disable offlining of

BugID# 4875077 has been fixed as of the following Solaris patch releases and all later revisions:
Solaris[TM] 8 PatchID# 108528-29 + 117000-01. 117000-01 was released on Mar. 26, 2004.
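Whether a Solaris 8 system already carries the fix can be checked by comparing installed patch revisions against the two patch IDs above. The sketch below is illustrative only and makes several assumptions: the "+" is read as meaning both patches are needed, the list of installed patch IDs is supplied by the caller (for example, collected by hand from `showrev -p` output), and a patch that has been obsoleted by a differently numbered patch is not recognized. The function names are hypothetical.

```python
# Patch IDs taken from this document. Treating both as required is an
# ASSUMPTION based on the "+" between them.
REQUIRED_PATCHES = ("108528-29", "117000-01")


def parse_patch(patch_id):
    """Split 'BBBBBB-RR' into (base, revision as int)."""
    base, rev = patch_id.strip().split("-")
    return base, int(rev)


def patch_satisfied(installed, required):
    """True if any installed patch has the same base and an equal or later revision."""
    req_base, req_rev = parse_patch(required)
    return any(
        base == req_base and rev >= req_rev
        for base, rev in map(parse_patch, installed)
    )


def fix_installed(installed):
    """True if every required patch is present at the listed revision or later."""
    return all(patch_satisfied(installed, req) for req in REQUIRED_PATCHES)


if __name__ == "__main__":
    # Hypothetical installed patch list, for illustration only.
    print(fix_installed(["108528-29", "117000-05"]))
    print(fix_installed(["108528-27", "117000-01"]))
```

Comparing the revision as an integer rather than as a string avoids mis-ordering revisions with different digit counts.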
UCC, ME, ECC, UCC+ME

Previously Published As 72159

Change History
Date: 2009-12-02
User Name: Josh Freeman
Action: Refreshed
Comment: Format changes to the document and that is it. ESG Content Team update...

Date: 2006-01-17
User Name: 18392
Action: Update Canceled
Comment: *** Restored Published Content *** SSH Audit

Attachments
This solution has no attachment