Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1490988.1
Update Date: 2012-09-27
Keywords:

Solution Type  Problem Resolution Sure

Solution  1490988.1 :   T5220/System Reboot; Unrecoverable Hardware Error  


Related Items
  • Sun SPARC Enterprise T5220 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T5xx0




In this Document
Symptoms
Cause
Solution


Created from <SR 3-6130030521>

Applies to:

Sun SPARC Enterprise T5220 Server - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

The customer reports that a T5220 node rebooted unexpectedly.

Cause

The system rebooted after a panic caused by an unrecoverable hardware error:
 

panic[cpu0]/thread=2a100047ca0: Unrecoverable hardware error
000002a1000475f0 unix:process_nonresumable_error+234 (2a1000477e0, 0, 1, 40, 0, 0)
  %l0-3: 0000000000000040 0000000003000000 0000000000000001 0000000000000000
  %l4-7: 000000000180c5e0 0000000100000000 00000000ffffffff 0000000001828400
000002a100047730 unix:ktl0+64 (3003f4cab50, 510, a2, 3003f4cab50, 300185b33b8, 0)
  %l0-3: 000000000180c000 0000000000000000 0000004400001604 0000000001027c94
  %l4-7: 000000002d2a9b2a 000000002d2e030a 0000000000000001 000002a1000477e0
000002a100047880 genunix:callout_execute+13c (3000167a000, 8000000000000000, 9d568b7, bffffffe11060a23, 9d568b7, 3000167b5f0)
  %l0-3: 00000300185b33b8 0000000000000000 000003000167a038 0000001f49542718
  %l4-7: 0000000000000000 000003000167b038 0000000009d568b7 00000000000000b7
000002a100047930 unix:softint+108 (0, 3000167a000, 182bf80, 18589f0, 0, 10ce3d0)
  %l0-3: 0000000000000001 00000600608445a0 0000000000000000 0000000000000003
  %l4-7: 000003000356c040 0000000000000000 00000000018f15d0 0000000000000001
000002a1000479e0 unix:softlevel1+4 (0, 0, 180c000, 2a100047d78, 1, 10131f4)
  %l0-3: 0000000000000000 0000000000010000 0000000000000000 0000000000000000
  %l4-7: 0000000000000000 0000000000000001 0000000000000000 000000000180c000
syncing file systems... 88 38 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 done (not all i/o completed)

 

 

The system also shows a memory fault:

 

System Indicator Status:
--------------------------------------------------------------------------------
/SYS/LOCATE                    /SYS/SERVICE                   /SYS/ACT
OFF                            ON                             ON

caazims1cscf111> showfaults -v
Last POST Run: Tue Dec  6 00:57:16 2011

Post Status: Passed all devices
 ID Time                           FRU               Class             Fault
  1 Aug 22 18:25:41                /SYS/MB/CMP0/BR0/CH1/D0                   Host detected fault MSGID: SUN4V-8000-DX  UUID: 5cf193bb-dd2f-ee43-fcb2-df60457ae4bf
caazims1cscf111>
caazims1cscf111> logout

/SYS/MB/CMP0/BR0/CH1/D0  Hynix Semicond  HYMP151F72CP4D3-Y5    501-7954  126810C7            0x64 (MAINTENANCE REQUIRED, SUSPECT, DE


##### fma/fmadm-faulty.out   #####
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Aug 22 18:35:33 5cf193bb-dd2f-ee43-fcb2-df60457ae4bf  SUN4V-8000-DX  Major
Fault class : fault.memory.dimm 95%
Affects     : mem:///unum=MB/CMP0/BR0:CH1/D0/J1201
                 degraded but still in service
FRU         : hc://:product-id=SUNW,Netra-T5220:chassis-id=1113FM902C:server-id=caazims1cscf111:serial=00AD011053126810C7:part=371-2145-02 Rev 50//motherboard=0/chip=0/branch=0/dram-channel=1/dimm=0 95%
Description : The number of errors associated with this memory module has
             exceeded acceptable levels.  Refer to
             http://sun.com/msg/SUN4V-8000-DX for more information.
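
The same DIMM appears above in two spellings: the FMA unum (MB/CMP0/BR0:CH1/D0/J1201) and the service-processor FRU path (/SYS/MB/CMP0/BR0/CH1/D0). The following is a minimal sketch, not part of the original case notes, of how one spelling could be mapped to the other when cross-checking output. The function names are hypothetical, the field meanings (chip, branch, dram-channel, dimm) are taken from the FRU line above, and treating the trailing J number as a socket label is an assumption.

```python
import re

# Pattern for unum strings as they appear in this document, for example
# "MB/CMP0/BR0:CH1/D0/J1201" or "MB/CMP0/BR0: CH1/D0/J1201" (fmdump spelling).
# Assumption: the trailing "Jnnnn" part is a socket label and is optional.
UNUM_RE = re.compile(
    r"^MB/CMP(?P<chip>\d+)/BR(?P<branch>\d+):\s*"
    r"CH(?P<channel>\d+)/D(?P<dimm>\d+)(?:/(?P<socket>J\d+))?$"
)


def parse_unum(unum):
    """Split a unum string into its numeric components (hypothetical helper)."""
    match = UNUM_RE.match(unum.strip())
    if match is None:
        raise ValueError("unrecognized unum format: %r" % unum)
    fields = {key: int(match.group(key)) for key in ("chip", "branch", "channel", "dimm")}
    fields["socket"] = match.group("socket")  # None when no J label is present
    return fields


def to_sys_path(unum):
    """Build the /SYS/... spelling used in the showfaults output above."""
    fields = parse_unum(unum)
    return "/SYS/MB/CMP{chip}/BR{branch}/CH{channel}/D{dimm}".format(**fields)


if __name__ == "__main__":
    print(to_sys_path("MB/CMP0/BR0:CH1/D0/J1201"))
```

Both spellings from this case resolve to the same path, so the showfaults entry and the fmadm entry can be confirmed to refer to one module before it is pulled.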

 

Solution

HH0: Diagnosing "unrecoverable hardware error" panics Closed 1277973.1
1285535.1
6919905 HW PANIC: "Unrecoverable hardware error" panics can result from any of several underlying faults, including bus, CPU, and memory issues.

NOTE: Unrecoverable hardware errors cannot be prevented; the fix instead makes them diagnosable. CR 6919905 concerns the visibility of the panic trigger rather than the panic itself. The fix for CR 6919905 is in FW 7.2.9.b (HV 1.7.8), released June 1, 2010.

SW 7.3.0.c addresses additional Unrecoverable Hardware Error panics and other issues.
Jan-2010

 

 

Confirmation from the kernel engineer who analyzed the core file:

less fmdump-eV.out | grep unum | sort | uniq -c
 84            unum = MB/CMP0/BR0: CH1/D0/J1201
485            unum = MB/CMP0/BR0: CH1/D1/J1301
 24            unum = MB/CMP0/BR3: CH1/D0/J2601
1405            unum = MB/CMP0/BR3: CH1/D1/J2701
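
The per-DIMM tally above can also be reproduced in a script, which is convenient when several explorer captures need to be compared. This is a minimal sketch under stated assumptions: the function names are hypothetical, the input is assumed to contain lines of the form "unum = ..." as shown above, and the sample lines are illustrative rather than taken from the customer's file.

```python
import re
from collections import Counter

# Matches lines such as "        unum = MB/CMP0/BR0: CH1/D0/J1201".
UNUM_LINE_RE = re.compile(r"\bunum\s*=\s*(.+?)\s*$")


def count_unums(lines):
    """Count how many times each unum appears (equivalent to grep | sort | uniq -c)."""
    counts = Counter()
    for line in lines:
        match = UNUM_LINE_RE.search(line)
        if match:
            # Normalize "BR0: CH1" to "BR0:CH1" so the key matches the fmadm spelling.
            counts[re.sub(r":\s+", ":", match.group(1))] += 1
    return counts


def rank_unums(lines):
    """Return (unum, count) pairs, most frequently reported module first."""
    return count_unums(lines).most_common()


if __name__ == "__main__":
    # Illustrative input only; a real run would read the lines of fmdump-eV.out.
    sample = [
        "        unum = MB/CMP0/BR0: CH1/D0/J1201",
        "        unum = MB/CMP0/BR3: CH1/D1/J2701",
        "        some other line that does not name a module",
        "        unum = MB/CMP0/BR3: CH1/D1/J2701",
    ]
    for unum, count in rank_unums(sample):
        print(count, unum)
```

To run it against a real capture, pass the open file object to rank_unums in place of the sample list. The counts only show which modules logged events; they still need to be checked against the time of the panic before a module is treated as suspect.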


Per document 1353227.1:



"Make sure these actually occurred around the time of the panic to ensure they
correlate to the outage.

This would be enough to suspect faulty physical memory, replace all suspect modules
and monitor for any repeat events."

 

Replace the DIMM at MB/CMP0/BR0:CH1/D0/J1201 and upgrade to the latest available system firmware.
 


Attachments
This solution has no attachments.
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.