Asset ID: 1-77-1489176.1
Update Date: 2012-09-19
Keywords:
Solution Type: Sun Alert Sure Solution
1489176.1: Sun T4-x Servers May Experience Memory and Power Faults Which can be Prevented by Upgrading to System Firmware 8.2.1.b
Related Items
- Netra SPARC T4-1 Server
- SPARC T4-2
- Sun Software - Generic
- SPARC T4-1
- Netra SPARC T4-2 Server
- SPARC T4-4
- Sun Hardware - Generic
Related Categories
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun Alert
- .Old GCS Categories>Sun Microsystems>Sun Alert>Release Phase>Resolved
Applies to:
SPARC T4-2
SPARC T4-4
Netra SPARC T4-1 Server
Netra SPARC T4-2 Server
SPARC T4-1
SPARC
_________________________________
SUNBUG:7062523
Date of Resolved Release: 06-Sep-2012
_________________________________
Description
Sun SPARC T4-1, T4-2, and T4-4 servers and Netra T4-1 and Netra T4-2 servers without system firmware 8.2.1.b may experience memory and power faults, which may prompt unnecessary hardware replacement. These faults can be prevented by upgrading to System Firmware 8.2.1.b.
Note: There are a number of CRs associated with this issue - please see "Symptoms" for complete details.
Occurrence
These issues can occur on the following platforms:
SPARC Platform
- Sun SPARC T4-1, T4-2, T4-4 and Netra T4-1 & Netra T4-2 servers without system firmware 8.2.1.b
Notes:
1. Memory fault issues may appear on T4-1, T4-2, T4-4 and Netra T4-1, Netra T4-2 servers. Power fault issues may appear on T4-2 and T4-4 servers.
2. No other systems are affected by this issue.
3. This issue does not exist for the x86 platform.
To determine the firmware version on one of these systems, use one of the following methods:
A) Log into the Service Processor and run:
-> show /HOST sysfw_version
/HOST
Properties:
sysfw_version = Sun System Firmware 8.2.0.f 2012/07/09 22:11
B) From Solaris:
# prtdiag -v | grep Firmware
Sun System Firmware 8.2.0.a 2012/05/11 07:34
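The version string returned by either method can also be checked by script. The following is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes that the trailing letter of the version orders alphabetically within a release.

```python
import re

# Minimum fixed release named in this alert.
FIXED = (8, 2, 1, "b")

# Matches e.g. "Sun System Firmware 8.2.0.f 2012/07/09 22:11".
# The dot before the letter is optional, so "8.2.1b" is accepted too.
_VERSION_RE = re.compile(r"Sun System Firmware (\d+)\.(\d+)\.(\d+)\.?([a-z]?)")


def parse_sysfw_version(text):
    """Return (major, minor, micro, letter) from ILOM or prtdiag output, or None."""
    match = _VERSION_RE.search(text)
    if match is None:
        return None
    major, minor, micro, letter = match.groups()
    return (int(major), int(minor), int(micro), letter)


def needs_upgrade(text):
    """Return True if the reported firmware is older than 8.2.1.b.

    Assumption: the letter suffix sorts alphabetically, and a version
    with no letter sorts before any lettered build of the same release.
    """
    version = parse_sysfw_version(text)
    if version is None:
        raise ValueError("no 'Sun System Firmware' version found")
    return version < FIXED
```

For the two sample outputs above (8.2.0.f and 8.2.0.a), needs_upgrade() returns True. For a string reporting 8.2.1.b it returns False.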
Symptoms
Symptoms for these issues will vary depending on the Bug/CR and system affected, as in the following examples:
A. Memory faults
FMA fault.component.disabled messages are reported, with DIMM(s) or an MCU disabled and the MB FRU faulted. The failure signature(s) seen in the hostconsole log are listed below for each CR.
CR 7062523:
0:0:0>Setup POST Mailbox ....Done
0:0:0>Decode of Disrupting Error Status Reg (DESR HW Corrected) bits 00000000.00040000
0:0:0>Decode of NCU Error Status Reg bits 00000000.10000000
0:0:0> 1 NESR_MCU0SRE: MCU0 issued a Software Recoverable Error Request
0:0:0>Decode of Mem Error Status Reg Branch 0 bits 02040000.00000000
0:0:0> 1 VEU 57 R/W1C Set to 1 on an UE, if VEF = 0 and no fatal error is detected in same cycle.
0:0:0> 1 DAU 50 R/W1C Set to 1 if the error was a DRAM access UE.
0:0:0> DRAM Error Address Reg for Branch 0 = 00000000.11581100
0:0:0> Physical Address is 00000000.00410000
CR 7177943:
2012-06-02 06:29:11.277 1:0:0>ERROR: TEST = Map to VA-ALL TSB
2012-06-02 06:29:11.389 1:0:0>H/W under test = /SYS/PM0/CMP1/BOB1/CH1/D1 (J7101)
2012-06-02 06:29:11.536 1:0:0>Repair Instructions: Replace items in order listed by 'H/W under test' above.
2012-06-02 06:29:11.725 1:0:0>MSG = END_ERROR
CR 7185320:
[CPU 1:0:0] ERROR: MCU0.BoB1.Ch1.D0: Failed to set clock delay
[CPU 1:0:0] ERROR: set_clk_delay failed for MCU0, BoB1, Ch1, DIMM0
[CPU 1:0:0] ERROR: command_clk_training failed for MCU0
[CPU 1:0:0] ERROR: Calibrate DRAM interface failed for MCU0
[CPU 1:0:0] ERROR: MCU0: DRAM init failed
[CPU 1:0:0] ERROR: /SYS/PM0/CMP1/BOB1/CH1/D0 failed to initialize
CR 7177528:
[CPU 1:0:0] ERROR: Lane failures during DQS cleanup for MCU0
[CPU 1:0:0] ERROR: train_ddr_channels failed for MCU0
[CPU 1:0:0] ERROR: Calibrate DRAM interface failed for MCU0
[CPU 1:0:0] ERROR: MCU0: DRAM init failed
CR 7177481:
2012-06-13 12:35:15.074 0:0:0>ERROR: TEST = Test Mailbox region
2012-06-13 12:35:15.260 0:0:0>H/W under test = /SYS/PM0/CMP0/BOB3/CH1/D0 (J4301)
2012-06-13 12:35:15.496 0:0:0>Repair Instructions: Replace items in order listed by 'H/W under test' above.
2012-06-13 12:35:15.803 0:0:0>MSG = CE in critical POST code space.
2012-06-13 12:35:16.002 0:0:0>END_ERROR
For any memory faults seen on systems with System Firmware 8.2.1.b or later, normal troubleshooting procedures should be followed.
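The memory fault signatures above can be matched against a saved hostconsole log. The following is a rough sketch only: the substring chosen for each CR is an assumption taken from the sample output in this document, the function name is hypothetical, and a match does not replace review of the full log.

```python
# One distinguishing substring per CR, copied from the sample output above.
# Assumption: these substrings are representative; real logs may differ.
SIGNATURES = [
    ("7062523", "issued a Software Recoverable Error Request"),
    ("7177943", "TEST = Map to VA-ALL TSB"),
    ("7185320", "Failed to set clock delay"),
    ("7177528", "Lane failures during DQS cleanup"),
    ("7177481", "CE in critical POST code space"),
]


def match_crs(log_text):
    """Return the CR numbers whose signature appears in log_text, in table order."""
    return [cr for cr, needle in SIGNATURES if needle in log_text]
```

Passing the CR 7185320 sample above returns ['7185320']. An empty list means none of the listed signatures was found.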
B. Power faults
Some Emerson A239 power supplies on T4-4 and T4-2 platforms may provide incorrect data on the I2C bus, which triggers power faults. This may lead to false fault indications for other components on that I2C bus segment. For example, on T4-4, RIO/TGB are on the same I2C bus segment as the PSU. The updated firmware filters out the incorrect data.
FMA 'fault.chassis.voltage.fail', 'fault.chassis.power.fail', and 'fault.chassis.env.power.loss' messages are reported, most commonly with the Power Supply Unit (PSU), MB (T4-2), or PM (T4-4) faulted, though other hardware components could also be faulted. The failure signature seen in the hostconsole log is cited below.
CR 7180196:
Sensor | minor: Voltage : /SYS/RIO/VDD_+1V8 : Lower Non-critical going high : reading 1.82 >= threshold 1.71 Volts
For power faults seen on T4-2 and T4-4 systems with Emerson PSUs, the upgrade to System Firmware 8.2.1.b should be tried first. If power faults are seen on systems with system firmware 8.2.1.b or later, then normal troubleshooting procedures should be followed.
Workaround
There are no workarounds for these issues.
Resolution
These issues are addressed in the following releases:
SPARC Platform
System Firmware 8.2.1.b or later, as delivered in the following patches:
- SPARC T4-1 Server with patch 148822-03 or later
- SPARC T4-2 Server with patch 148823-03 or later
- SPARC T4-4 Server with patch 148824-03 or later
- Netra T4-1 Server with patch 148826-03 or later
- Netra T4-2 Server with patch 148827-02 or later
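The patch list above can be expressed as a lookup table for scripted checks. This is an illustrative sketch: the table keys and the function name are hypothetical, and it assumes that patch revisions are numbered upward, as the "or later" wording suggests.

```python
# Minimum patch revision delivering System Firmware 8.2.1.b, per platform,
# copied from the list above. The dictionary keys are illustrative labels.
MIN_PATCH = {
    "SPARC T4-1": ("148822", 3),
    "SPARC T4-2": ("148823", 3),
    "SPARC T4-4": ("148824", 3),
    "Netra T4-1": ("148826", 3),
    "Netra T4-2": ("148827", 2),
}


def patch_is_sufficient(platform, patch_id):
    """Return True if patch_id (e.g. "148822-03") meets the minimum revision."""
    base, sep, revision = patch_id.partition("-")
    min_base, min_revision = MIN_PATCH[platform]
    if not sep or base != min_base:
        raise ValueError(f"{patch_id} is not the firmware patch for {platform}")
    return int(revision) >= min_revision
```

For example, patch_is_sufficient("SPARC T4-1", "148822-03") returns True, while a lower revision of the same patch returns False.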
Patches
<SUNPATCH:148822-03>
<SUNPATCH:148823-03>
<SUNPATCH:148824-03>
<SUNPATCH:148826-03>
<SUNPATCH:148827-02>
History
06-Sep-2012: Document released, issue Resolved
19-Sep-2012: Internal Maintenance update; no change in content
There are two additional error messages induced by the Emerson A239 PSU that are not currently addressed by firmware but will be resolved in future firmware version(s). These error messages are seen in the SC logs, and no FMA fault is triggered, so no customer or service action should be initiated:
1) Chassis | major: Hot removal of /SYS/SASBP/HDD#
and
2) Chassis Log critical (##) /SYS/PS#/SEEPROM.FRU_PROM (#x##) Read Data Compare FAILED
where each # represents a non-negative integer.
These messages are a result of minor corruption of the I2C bus by the Emerson A239 power supply and do not indicate a real system issue, so they should be ignored with no action taken.
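The two message templates above can be turned into patterns for filtering SC logs. This is a sketch under stated assumptions: the function name is hypothetical, and each # is read literally as one or more decimal digits, as described above. Actual log spacing or values may differ, so the patterns may need adjusting.

```python
import re

# Patterns built from the two templates above, with each "#" read as \d+.
# Assumption: the literal text and spacing match the real SC log lines.
BENIGN_PATTERNS = [
    re.compile(r"Chassis \| major: Hot removal of /SYS/SASBP/HDD\d+"),
    re.compile(
        r"Chassis Log critical \(\d+\) /SYS/PS\d+/SEEPROM\.FRU_PROM "
        r"\(\d+x\d+\) Read Data Compare FAILED"
    ),
]


def is_benign_a239_message(line):
    """Return True if the line matches one of the two ignorable messages."""
    return any(pattern.search(line) for pattern in BENIGN_PATTERNS)
```

Lines that do not match either pattern, such as the voltage sensor message cited under CR 7180196, return False and should be handled as described in "Symptoms".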
---------------------------------
Note that Emerson PSUs have also been reported as 'Astec', as in the following example:
fru_description = A239C_Power_Supply
fru_manufacturer = 10465 ASTEC INTERNATIONAL LTD SHEN ZHEN CITY CN
fru_version = 02
fru_part_number = 300-23xx
---------------------------------
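FRU properties in the "name = value" form shown above can be parsed to flag an A239 power supply. The following is a minimal sketch: the function names are hypothetical, and it assumes that the A239 model name appears in fru_description, as it does in the example.

```python
def parse_fru_properties(text):
    """Parse "name = value" lines into a dictionary; other lines are skipped."""
    properties = {}
    for line in text.splitlines():
        name, separator, value = line.partition("=")
        if separator:
            properties[name.strip()] = value.strip()
    return properties


def is_a239_psu(properties):
    """Return True if fru_description names an A239 power supply.

    Assumption: the model is identified by "A239" in fru_description,
    whether the manufacturer is reported as Emerson or Astec.
    """
    return "A239" in properties.get("fru_description", "")
```

With the example above, is_a239_psu() returns True even though fru_manufacturer reports ASTEC rather than Emerson.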
Please see the other Bugs associated with this issue:
7177943 - T4-4 POST errors showing ERROR: TEST = Map to VA-ALL TSB
7185320 - Set Clock Delay failures need to be properly logged
7180196 - Power Supply issues (T4-2 & T4-4 platforms)
7177528 - T4-4 POST DDR training issues
7177481 - POST memory test error seen: CE in critical POST code space
7062523 - POST memory test error seen
Questions regarding this document should be addressed to [email protected], with a copy to the responsible engineer listed below.
Internal Contributor/Submitter: [email protected]
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Engineer: [email protected]
Internal Eng Business Unit Group: Systems
Internal Escalation ID:
Internal Resolution Patches: 148822-03, 148823-03, 148824-03, 148826-03, 148827-02
References
SUNBUG:7062523
Attachments
This solution has no attachment