Asset ID: 1-72-1490988.1
Update Date: 2012-09-27
Keywords:
Solution Type: Problem Resolution Sure
Solution 1490988.1: T5220/System Reboot; Unrecoverable Hardware Error
Related Items
- Sun SPARC Enterprise T5220 Server
Related Categories
- PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T5xx0
Created from <SR 3-6130030521>
Applies to:
Sun SPARC Enterprise T5220 Server - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Symptoms
The customer reports that a T5220 node rebooted.
Cause
The system rebooted because of an unrecoverable hardware error panic:
panic[cpu0]/thread=2a100047ca0: Unrecoverable hardware error
000002a1000475f0 unix:process_nonresumable_error+234 (2a1000477e0, 0, 1, 40, 0, 0)
%l0-3: 0000000000000040 0000000003000000 0000000000000001 0000000000000000
%l4-7: 000000000180c5e0 0000000100000000 00000000ffffffff 0000000001828400
000002a100047730 unix:ktl0+64 (3003f4cab50, 510, a2, 3003f4cab50, 300185b33b8, 0)
%l0-3: 000000000180c000 0000000000000000 0000004400001604 0000000001027c94
%l4-7: 000000002d2a9b2a 000000002d2e030a 0000000000000001 000002a1000477e0
000002a100047880 genunix:callout_execute+13c (3000167a000, 8000000000000000, 9d568b7, bffffffe11060a23, 9d568b7, 3000167b5f0)
%l0-3: 00000300185b33b8 0000000000000000 000003000167a038 0000001f49542718
%l4-7: 0000000000000000 000003000167b038 0000000009d568b7 00000000000000b7
000002a100047930 unix:softint+108 (0, 3000167a000, 182bf80, 18589f0, 0, 10ce3d0)
%l0-3: 0000000000000001 00000600608445a0 0000000000000000 0000000000000003
%l4-7: 000003000356c040 0000000000000000 00000000018f15d0 0000000000000001
000002a1000479e0 unix:softlevel1+4 (0, 0, 180c000, 2a100047d78, 1, 10131f4)
%l0-3: 0000000000000000 0000000000010000 0000000000000000 0000000000000000
%l4-7: 0000000000000000 0000000000000001 0000000000000000 000000000180c000
syncing file systems... 88 38 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 done (not all i/o completed)
The system also shows a memory fault:
System Indicator Status:
--------------------------------------------------------------------------------
/SYS/LOCATE /SYS/SERVICE /SYS/ACT
OFF ON ON
caazims1cscf111> showfaults -v
Last POST Run: Tue Dec 6 00:57:16 2011
Post Status: Passed all devices
ID Time FRU Class Fault
1 Aug 22 18:25:41 /SYS/MB/CMP0/BR0/CH1/D0 Host detected fault MSGID: SUN4V-8000-DX UUID: 5cf193bb-dd2f-ee43-fcb2-df60457ae4bf
caazims1cscf111>
caazims1cscf111> logout
/SYS/MB/CMP0/BR0/CH1/D0 Hynix Semicond HYMP151F72CP4D3-Y5 501-7954 126810C7 0x64 (MAINTENANCE REQUIRED, SUSPECT, DE
##### fma/fmadm-faulty.out #####
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 22 18:35:33 5cf193bb-dd2f-ee43-fcb2-df60457ae4bf SUN4V-8000-DX Major
Fault class : fault.memory.dimm 95%
Affects : mem:///unum=MB/CMP0/BR0:CH1/D0/J1201
degraded but still in service
FRU : hc://:product-id=SUNW,Netra-T5220:chassis-id=1113FM902C:server-id=caazims1cscf111:serial=00AD011053126810C7:part=371-2145-02 Rev 50//motherboard=0/chip=0/branch=0/dram-channel=1/dimm=0 95%
Description : The number of errors associated with this memory module has
exceeded acceptable levels. Refer to
http://sun.com/msg/SUN4V-8000-DX for more information.
Solution
HH0: Diagnosing "unrecoverable hardware error" panics (Closed, Jan-2010)
Documents: 1277973.1, 1285535.1
CR: 6919905
HW PANIC: Unrecoverable hardware error panics can result from any of several underlying faults, including bus, CPU, and memory issues.
NOTE: Unrecoverable hardware errors cannot be prevented; the fix makes them diagnosable. CR 6919905 concerns the visibility of the panic trigger rather than the panic itself. The fix for CR 6919905 is in FW 7.2.9.b (HV 1.7.8), released June 1, 2010.
SW 7.3.0.c addresses additional Unrecoverable Hardware Error panics and other issues.
Confirmation from the kernel engineer who analyzed the core file:
less fmdump-eV.out | grep unum | sort | uniq -c
84 unum = MB/CMP0/BR0: CH1/D0/J1201
485 unum = MB/CMP0/BR0: CH1/D1/J1301
24 unum = MB/CMP0/BR3: CH1/D0/J2601
1405 unum = MB/CMP0/BR3: CH1/D1/J2701
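The same per-DIMM tally can be reproduced offline with a short script. The following is a minimal Python sketch, assuming the saved fmdump -eV output contains lines of the form "unum = <location>" as in the output above. The function name count_unum is illustrative and not part of any Oracle tool. Unlike the grep pipeline, which matches any line containing "unum", this sketch only counts lines that start with "unum =".

```python
import re
from collections import Counter

# Matches lines such as "unum = MB/CMP0/BR0: CH1/D0/J1201".
# The line format is an assumption based on the output shown above.
UNUM_RE = re.compile(r"^\s*unum\s*=\s*(.+?)\s*$")

def count_unum(lines):
    """Return a Counter mapping each unum value to the number of lines reporting it."""
    counts = Counter()
    for line in lines:
        match = UNUM_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file handle for the saved output (for example fmdump-eV.out) to count_unum and printing the result of most_common() lists the memory modules with the most error reports first.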
As per document 1353227.1:
"Make sure these actually occurred around the time of the panic to ensure they
correlate to the outage.
This would be enough to suspect faulty physical memory, replace all suspect modules
and monitor for any repeat events."
Replace the DIMM (MB/CMP0/BR0:CH1/D0/J1201) and upgrade to the latest available system firmware.
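The correlation check quoted from document 1353227.1 amounts to a time-window filter on the error events. The following is a minimal Python sketch, assuming the event timestamps have already been extracted from the fmdump -eV output into datetime objects. The function name events_near_panic and the default one-hour window are hypothetical choices for illustration, not values taken from this case or from Oracle documentation.

```python
from datetime import datetime, timedelta

def events_near_panic(event_times, panic_time, window=timedelta(hours=1)):
    """Return the event timestamps that lie within `window` of the panic time.

    Events both before and after the panic are kept; the window size is an
    assumption and should be chosen to suit the case being analyzed.
    """
    return [t for t in event_times if abs(t - panic_time) <= window]
```

If the returned list is empty, the memory errors did not occur close to the panic and may be unrelated to the outage.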
Attachments
This solution has no attachment