Date of Resolved Release: 19-Jun-2012
_________________________________
Description
After upgrading SPARC T4-4 systems from firmware 8.1.4 (and all sub-releases) to firmware 8.1.5 or later, unrecoverable DIMM errors may be reported, with fault management showing error code 'SUN4V-8000-E2'. These are not real faults and the DIMMs will function normally.
Note: Once the faults have been cleared using the procedure outlined in the "Workaround" section below, this issue will not recur.
Occurrence
This issue can occur on the following platform:
- SPARC T4-4 Systems with firmware 8.1.5 (or later) without patch 147790-01
Notes:
1. No other systems are affected by this issue.
2. This issue appears only after an upgrade of system firmware to 8.1.5 or later. To check the revision of system firmware, use the following ILOM command:
-> show /HOST sysfw_version
There will be a line similar to the following:
sysfw_version = Sun System Firmware 8.1.4 2012/04/10 18:52
If the string shows version 8.1.4 or earlier and you are updating to 8.1.5 or later, the system is at risk for this issue.
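The version check described above can be sketched in Python. This is only an illustration: the function names are hypothetical, and the sketch assumes the 'sysfw_version' line has the same layout as the sample shown. A sub-release suffix that is not numeric is ignored, so a sub-release of 8.1.4 is treated as 8.1.4.

```python
import re

def parse_sysfw_version(line):
    """Return the numeric firmware version as a tuple of ints, e.g. (8, 1, 4).

    Assumes the line contains 'Sun System Firmware <version>' as in the
    sample ILOM output; anything after the dotted numbers is ignored.
    """
    match = re.search(r"Sun System Firmware (\d+(?:\.\d+)*)", line)
    if match is None:
        raise ValueError("no firmware version found in: %r" % line)
    return tuple(int(part) for part in match.group(1).split("."))

def at_risk_on_upgrade(current_line, target=(8, 1, 5)):
    """True if current firmware is older than 8.1.5 and the target is 8.1.5 or later."""
    return parse_sysfw_version(current_line) < (8, 1, 5) <= tuple(target)
```

For the sample line shown above, 'parse_sysfw_version' yields (8, 1, 4), so an upgrade to 8.1.5 would be flagged as at risk. The sketch does not check whether patch 147790-01 is installed.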
Failure occurs after booting Solaris, with most failures showing up in less than 1 hour. If the console log or the ILOM log shows a line similar to the following, the fault has occurred:
2012-05-24/09:11:32 3bd80435-6180-47a9-869f-8812c936c0f3 SUN4V-8000-E2 Critical
fault.memory.bank /SYS/PM0/CMP0/BOB0/CH1/D0 (Part Number: 07014672)
The memory bank will always be as shown above, and a similar record showing '/SYS/PM0/CMP0/BOB1/CH1/D0' will also be displayed. In addition, the error code will always be SUN4V-8000-E2. No other DIMMs have been seen to fail. As an alternative, check the FMA fault log. In Solaris (logged in as 'root'), the following command shows the fault log:
/usr/sbin/fmadm faulty -a
A record similar to the above will be reported as well.
Symptoms
After the system firmware upgrade and booting to Solaris, the following will be displayed in either the console log or the FMA fault log:
2012-05-24/09:11:32 3bd80435-6180-47a9-869f-8812c936c0f3 SUN4V-8000-E2 Critical
fault.memory.bank /SYS/PM0/CMP0/BOB0/CH1/D0 (Part Number: 07014672)
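To tell this spurious fault apart from a genuine memory fault, the record can be matched against the known signature (error code SUN4V-8000-E2 on one of the two memory banks named in this document). The following Python sketch is illustrative only: the function name is hypothetical, and it assumes the two-line record layout shown above. Real log output may be laid out differently.

```python
import re

# The only banks and error code this document associates with the issue.
KNOWN_BANKS = {"/SYS/PM0/CMP0/BOB0/CH1/D0", "/SYS/PM0/CMP0/BOB1/CH1/D0"}
KNOWN_CODE = "SUN4V-8000-E2"

# Matches "<UUID> <code> <severity>" followed by "fault.memory.bank <bank>".
RECORD = re.compile(
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<code>[A-Z0-9]+-\d{4}-[A-Z0-9]{2})\s+\S+\s+"
    r"fault\.memory\.bank\s+(?P<bank>/SYS/\S+)"
)

def find_spurious_faults(log_text):
    """Return (uuid, bank) pairs for records that match the known signature."""
    return [
        (m.group("uuid"), m.group("bank"))
        for m in RECORD.finditer(log_text)
        if m.group("code") == KNOWN_CODE and m.group("bank") in KNOWN_BANKS
    ]
```

A record naming any other memory bank or error code is not returned and should be treated as a potentially real fault.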
Workaround
This issue is addressed on the following platform:
- SPARC T4-4 Systems with firmware 8.1.5 (or later) with patch 147790-01 or later
Note: The signature of this problem is consistent with Solaris CR 6983432. If running Solaris 10 update 10 or earlier, you should verify that patch 147790-01 has been installed. Do this by running the following command:
$ showrev -p | grep 147790
If this does not return an entry for 147790-01, then this patch should be installed before upgrading firmware.
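The patch check can also be scripted. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, and it assumes that each installed patch appears in the 'showrev -p' output on a line beginning with 'Patch: <id>-<revision>'. Verify that layout on the target system before relying on it.

```python
import re

def patch_installed(showrev_output, patch_id="147790", min_rev=1):
    """True if patch_id is listed at revision min_rev or later.

    showrev_output is the captured text of 'showrev -p'. The 'Patch: '
    line prefix is an assumption about that command's output format.
    """
    pattern = re.compile(
        r"^Patch:\s+%s-(\d+)\b" % re.escape(patch_id), re.MULTILINE
    )
    revisions = [int(rev) for rev in pattern.findall(showrev_output)]
    return any(rev >= min_rev for rev in revisions)
```

If the function returns False for the captured output, the patch should be installed before upgrading firmware, as described above.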
If after installing the above patch the error message is still present, then clear the fault on both ILOM and the host and flush the state of the FMA. This can be accomplished by doing the following:
1. On the service processor ILOM, first clear the fault log of the above records:
-> start /SP/faultmgmt/shell
faultmgmtsp> fmadm faulty
Find the records associated with the DIMM faults:
faultmgmtsp> fmadm repair <<UUID>>
where <<UUID>> is the event ID of the DIMM failure. It should only be necessary to clear one of the faults. Rerun the 'fmadm faulty' command (see fmadm(1M)) to verify that the records have been cleared.
2. Next (on Solaris) clear the fault and remove the record of the fault having occurred:
# /usr/sbin/fmadm faulty -a
# /usr/sbin/fmadm acquit <<UUID>>
# /usr/sbin/fmadm flush <<DIMM>>
where <<UUID>> is replaced by the event ID displayed by 'fmadm faulty' and <<DIMM>> is the memory bank reported as faulty. This should always be either '/SYS/PM0/CMP0/BOB0/CH1/D0' or '/SYS/PM0/CMP0/BOB1/CH1/D0'. It should be sufficient to flush only one record.
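The Solaris-side commands in step 2 can be assembled programmatically once the event ID is known. The Python sketch below only builds the argument lists and does not run anything. The function name is hypothetical, and the restriction to the two known memory banks is a safety assumption added here so that a real fault on another DIMM is not cleared by mistake.

```python
# The two memory banks this document associates with the spurious fault.
KNOWN_BANKS = (
    "/SYS/PM0/CMP0/BOB0/CH1/D0",
    "/SYS/PM0/CMP0/BOB1/CH1/D0",
)

def solaris_clear_commands(uuid, bank=KNOWN_BANKS[0]):
    """Return the 'fmadm acquit' and 'fmadm flush' commands as argument lists."""
    if bank not in KNOWN_BANKS:
        raise ValueError("refusing to clear a fault on unexpected bank: %s" % bank)
    return [
        ["/usr/sbin/fmadm", "acquit", uuid],
        ["/usr/sbin/fmadm", "flush", bank],
    ]
```

Each returned list could be passed to 'subprocess.run' in a root session on the affected host. Rerunning '/usr/sbin/fmadm faulty -a' afterwards confirms whether the record is gone.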
Note: This issue was traced to an error in earlier versions of system firmware, which was not reported by the firmware. The issue was fixed in revision 8.1.5; however, a residual error shows up on Solaris. Clearing the fault on both ILOM and the host and flushing the FMA state removes the error and resolves the issue.
Patches
<SUNPATCH:147790-01>
History
19-Jun-2012: Date of Resolved Release
18-Jul-2012: Updated "Likelihood" and "Workaround" sections for clarification
The fault has always been against the same 2 memory banks /SYS/PM0/CMP0/BOB0/CH1/D0
and /SYS/PM0/CMP0/BOB1/CH1/D0. The error is always the same: SUN4V-8000-E2.
This issue is most likely caused by 2 CRs. The first is 7027742. Without this fix,
VBSC would overwrite configuration and also modify status.
The second is 7129196. This CR changed the refresh rates for the DIMMs, so if there were
some marginal DIMMs in the system, this would help them function normally. The update to 8.1.5
removed the reason the failures were not being reported and also corrected the underlying failure.
Once the fault is cleared and the saved FMA information is flushed, the problem is no longer present.
Questions regarding this document should be addressed to
[email protected], with a copy to
the Responsible Engineer listed below.
Internal Contributor/Submitter: David Arneson
Internal Eng Responsible Engineer: David Arneson
Oracle Knowledge Analyst: [email protected]
Internal Eng Business Unit Group: Systems Group - SYS
Internal Escalation ID:
3-5664972363, 3-5677851131, 3-5689938551,
3-5707491921, 3-5719825321, 3-5720477081,
3-5721550709, 3-5721114531, 3-5731813441,
3-5745764921, 3-5747640143, 3-5818226651