Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure

Solution 1007736.1: Diagnosing multiple DIMM CE errors occurring on multiple DIMMs
Previously Published As 210714

Symptoms

Multiple DIMM CE errors that occur on multiple DIMMs may not indicate a problem with a DIMM. These errors can happen on a variety of systems or domains. They are characterized by a single common Data Bit in error on multiple DIMMs. These errors are not one-time CE error events.

Output from cediag may show something similar to the following:

cediag: #### CE Summary since last detected reboot ###########################
cediag: #### last detected reboot at Mar 8 19:35:21 #########################
cediag: findings: 7 DIMM(s) having CEs with Esynd of 0x01b1 found
cediag: advice:HIGH: possible datapath fault - refer to Sun Support [A]s [S]oon [A]s [P]ossible
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#5

Note - For information on cediag, including downloading, licensing, and usage instructions, please see the cediag documentation.

Looking at the CE DIMM messages shows:

Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 212399 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050a1f.449595f8
Mar 25 05:27:58 host1 AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.1f50eb60
Mar 25 05:27:58 host1 Fault_PC 0x100460a8 Esynd 0x01b1 SB0/P1/B1/D0 J14301
Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 566620 kern.info] [AFT0] errID 0x00050a1f.449595f8 Corrected Memory Error on SB0/P1/B1/D0 J14301 is Intermittent
Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 701539 kern.info] [AFT0] errID 0x00050a1f.449595f8 Data Bit 70 was in error and corrected
Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 106369 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050a20.a9f4263a
Mar 25 05:28:04 host1 AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.f82e8930
Mar 25 05:28:04 host1 Fault_PC <unknown> Esynd 0x01b1 SB0/P0/B1/D0 J13301
Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 375245 kern.info] [AFT0] errID 0x00050a20.a9f4263a Corrected Memory Error on SB0/P0/B1/D0 J13301 is Intermittent
Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 434215 kern.info] [AFT0] errID 0x00050a20.a9f4263a Data Bit 70 was in error and corrected
Mar 25 05:50:37 host1 SUNW,UltraSPARC-III+: [ID 888311 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050b5b.bdee793d
Mar 25 05:50:37 host1 AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.1f50ea90
Mar 25 05:50:37 host1 Fault_PC 0x1014c858 Esynd 0x01b1 SB0/P2/B0/D0 J15300
Mar 25 05:50:37 host1 SUNW,UltraSPARC-III+: [ID 742879 kern.info] [AFT0] errID 0x00050b5b.bdee793d Corrected Memory Error on SB0/P2/B0/D0 J15300 is Intermittent
Mar 25 05:50:37 host1 SUNW,UltraSPARC-III+: [ID 925230 kern.info] [AFT0] errID 0x00050b5b.bdee793d Data Bit 70 was in error and corrected
Mar 25 05:50:43 host1 SUNW,UltraSPARC-III+: [ID 262634 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050b5d.2390d851
Mar 25 05:50:43 host1 AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.fa8b7ef0
Mar 25 05:50:43 host1 Fault_PC <unknown> Esynd 0x01b1 SB0/P3/B0/D0 J16300
Mar 25 05:50:43 host1 SUNW,UltraSPARC-III+: [ID 689839 kern.info] [AFT0] errID 0x00050b5d.2390d851 Corrected Memory Error on SB0/P3/B0/D0 J16300 is Intermittent
Mar 25 05:50:43 host1 SUNW,UltraSPARC-III+: [ID 885340 kern.info] [AFT0] errID 0x00050b5d.2390d851 Data Bit 70 was in error and corrected
Mar 25 05:50:49 host1 SUNW,UltraSPARC-III+: [ID 725878 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050b5e.89319d5e
Mar 25 05:50:49 host1 AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.1f50ea90
Mar 25 05:50:49 host1 Fault_PC <unknown> Esynd 0x01b1 SB0/P2/B0/D0 J15300
Mar 25 05:50:49 host1 SUNW,UltraSPARC-III+: [ID 664224 kern.info] [AFT0] errID 0x00050b5e.89319d5e Corrected Memory Error on SB0/P2/B0/D0 J15300 is Intermittent
Mar 25 05:50:49 host1 SUNW,UltraSPARC-III+: [ID 406614 kern.info] [AFT0] errID 0x00050b5e.89319d5e Data Bit 70 was in error and corrected
Mar 25 05:51:52 host1 SUNW,UltraSPARC-III+: [ID 349173 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050b6d.4f1ba200
Mar 25 05:51:52 host1 AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.1fd52110
Mar 25 05:51:52 host1 Fault_PC <unknown> Esynd 0x01b1 SB0/P0/B1/D0 J13301
Mar 25 05:51:52 host1 SUNW,UltraSPARC-III+: [ID 993022 kern.info] [AFT0] errID 0x00050b6d.4f1ba200 Corrected Memory Error on SB0/P0/B1/D0 J13301 is Intermittent
Mar 25 05:51:52 host1 SUNW,UltraSPARC-III+: [ID 897387 kern.info] [AFT0] errID 0x00050b6d.4f1ba200 Data Bit 70 was in error and corrected

These messages characterize the type of DIMM errors for which this Symptom Resolution is written: a single bit in error across several DIMMs. In this example, the Data Bit in error is always Data Bit 70. These errors may be called bad reader or bad writer errors.

Resolution

This resolution applies only when multiple DIMMs report the same bit in error and a single CPU detects and corrects every one of those errors. For these errors, the solution comes from the first line of each message:

Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 212399 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050a1f.449595f8

In this case the CPU detecting the error, CPU162, is the same in all of the messages. This indicates that the CPU is either reading or writing the data incorrectly. Either the CPU or the system board (SB) that the CPU is on needs to be replaced.
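
The check described above (one detecting CPU, one Data Bit, several DIMMs) can be sketched as a small log filter. This is an illustrative sketch only, not part of cediag: the function names, the `min_dimms` threshold, and the regular expressions are assumptions based on the message layout quoted in this document, and other Solaris releases or CPU types may format these lines differently.

```python
import re
from collections import defaultdict

# Assumed patterns, derived only from the [AFT0] messages quoted in this document.
ERRID = r"(0x[0-9a-f]+\.[0-9a-f]+)"
EVENT_RE = re.compile(r"Event detected by (CPU\d+) at TL=\d+, errID " + ERRID)
DIMM_RE = re.compile(r"errID " + ERRID + r" Corrected Memory Error on (\S+) (J\d+)")
BIT_RE = re.compile(r"errID " + ERRID + r" Data Bit (\d+) was in error")

# Two events copied from the messages above (AFSR and Fault_PC lines omitted).
SAMPLE = """\
Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 212399 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050a1f.449595f8
Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 566620 kern.info] [AFT0] errID 0x00050a1f.449595f8 Corrected Memory Error on SB0/P1/B1/D0 J14301 is Intermittent
Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 701539 kern.info] [AFT0] errID 0x00050a1f.449595f8 Data Bit 70 was in error and corrected
Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 106369 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050a20.a9f4263a
Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 375245 kern.info] [AFT0] errID 0x00050a20.a9f4263a Corrected Memory Error on SB0/P0/B1/D0 J13301 is Intermittent
Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 434215 kern.info] [AFT0] errID 0x00050a20.a9f4263a Data Bit 70 was in error and corrected
"""


def summarize_ce_events(lines):
    """Join the lines of each CE event by errID; return the complete events."""
    events = defaultdict(dict)
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            events[m.group(2)]["cpu"] = m.group(1)
            continue
        m = DIMM_RE.search(line)
        if m:
            events[m.group(1)]["dimm"] = f"{m.group(2)} {m.group(3)}"
            continue
        m = BIT_RE.search(line)
        if m:
            events[m.group(1)]["bit"] = int(m.group(2))
    # Keep only events for which CPU, DIMM, and Data Bit were all found.
    return [e for e in events.values()
            if all(key in e for key in ("cpu", "dimm", "bit"))]


def suspect_cpus(events, min_dimms=2):
    """Map (cpu, bit) to the sorted DIMMs on which that CPU saw that bit fail.

    min_dimms is a hypothetical threshold; the document does not give a number.
    """
    dimms = defaultdict(set)
    for event in events:
        dimms[(event["cpu"], event["bit"])].add(event["dimm"])
    return {key: sorted(found) for key, found in dimms.items()
            if len(found) >= min_dimms}


if __name__ == "__main__":
    for (cpu, bit), dimm_list in suspect_cpus(summarize_ce_events(SAMPLE.splitlines())).items():
        print(f"{cpu}: Data Bit {bit} in error on {len(dimm_list)} DIMMs: {', '.join(dimm_list)}")
```

In practice the lines would come from the system messages file rather than the embedded sample. A (CPU, Data Bit) pair reported against several DIMMs is only a pointer toward the CPU or its system board; the diagnosis itself should still follow the Resolution text and cediag.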

Product

Sun Fire E6900 Server
Sun Fire 6800 Server
Sun Enterprise 10000 Server
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments

Solutions for Diagnosing multiple DIMM CE errors

See also:
<Document: 1010642.1> Diagnosis of bad writers and datapath faults from Solaris messages
<Document: 1005028.1> Sun Fire [TM] 12K/15K/E20K/E25K: Distinguishing a CPU Which is a BAD Writer From One Which is a BAD Reader

CE, memory, ECC, bad reader, bad writer, 12K, 15K, E20K, E25K

Previously Published As 80950

Change History
Date: 2010-04-27
User Name: Cootware
Action: Added link to cediag doc. Verified valid information.
Comment: Document should be archived at Starfire EOSL

Attachments
This solution has no attachment