Ecache (E$) events and what to do about them

Asset ID:	1-75-1009200.1
Update Date:	2012-07-16
Keywords:

Solution Type Troubleshooting Sure

Solution 1009200.1 : Ecache (E$) events and what to do about them

Applies to:

Sun Enterprise 6000 Server - Version Not Applicable and later
Sun Enterprise 6500 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Enterprise 5500 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Enterprise 3000 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Enterprise 3500 Server - Version Not Applicable to Not Applicable [Release N/A]
All Platforms

Purpose

This document explains how to identify an ecache (or e$) event on a customer's system. It also describes Oracle's Best Practice for ecache events, including when to replace the CPU and when not to.

Troubleshooting Steps

Background:

An ecache event (pronounced e-cash) is a hardware event that can occur on any UltraSPARC based system. Such an event occurs when a bit in a cpu's cache memory is mistakenly modified. An UltraSPARC I, II or IIi system will usually panic or reboot when an ecache event occurs. Later cpus, which currently include the UltraSPARC III and IV families, have error correction features which usually result in correction of the errors without any impact on system operation.

An ecache event results in an AFT, or asynchronous fault trap. In a system's messages log, an ecache event appears as a WARNING whose text includes EDP event, WP event, or CP event. It is followed by a score, typically (Score 95) or (Score 05). Often there will be multiple messages; in that case, the one with (Score 95) identifies the cpu which caused the event.

Here is an example of what you might see:

Jul 10 01:43:34 tronsd81 unix: WARNING: [AFT1] WP event on CPU11, errID 0x00092f64.77726c8d
Jul 10 01:43:34 tronsd81 unix: AFSR 0x00000000.00800100 AFAR 0x00000179.fe75f940
Jul 10 01:43:34 tronsd81 unix: AFSR.PSYND 0x0100(Score 95) AFSR.ETS 0x00 Fault_PC 0x100171b0
Jul 10 01:43:34 tronsd81 unix: UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Jul 10 01:43:58 tronsd81 unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x00092f6a.34dfd7e1
Jul 10 01:43:58 tronsd81 unix: AFSR 0x00000000.80200000 AFAR 0x00000002.95b74000
Jul 10 01:43:58 tronsd81 unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10021058
Jul 10 01:43:58 tronsd81 unix: UDBH 0x0203 UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00
Jul 10 01:43:58 tronsd81 unix: UDBH Syndrome 0x3 Memory Module Board 4 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801

In the example above, the event is a WP event, and it occurred on CPU11 with a Score 95.
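As a rough illustration of the identification step, the following Python sketch scans messages-log lines for AFT warnings that name an EDP, WP, or CP event and attaches the score reported for that same message. The regular expressions, the function name find_ecache_events, and the assumption of one syslog message per line are all illustrative and modelled only on the sample log above; they are not an official description of the log format.

```python
import re

# Assumed patterns, derived only from the sample log in this document.
AFT_WARNING = re.compile(r"WARNING: \[AFT\d+\] (.*)")
ECACHE_EVENT = re.compile(r"\b(EDP|WP|CP) event on CPU(\d+), errID (\S+)")
SCORE = re.compile(r"\(Score (\d+)\)")


def find_ecache_events(lines):
    """Return one dict per ecache event found in the given log lines."""
    events = []
    current = None  # the ecache event whose detail lines are being read
    for line in lines:
        warning = AFT_WARNING.search(line)
        if warning:
            match = ECACHE_EVENT.search(warning.group(1))
            if match:
                current = {
                    "type": match.group(1),
                    "cpu": int(match.group(2)),
                    "err_id": match.group(3),
                    "score": None,
                }
                events.append(current)
            else:
                # Some other AFT warning (e.g. a memory error): its score
                # must not be attributed to the previous ecache event.
                current = None
            continue
        score = SCORE.search(line)
        if score and current is not None and current["score"] is None:
            current["score"] = int(score.group(1))
    return events


# Lines copied from the example log above.
SAMPLE = [
    "Jul 10 01:43:34 tronsd81 unix: WARNING: [AFT1] WP event on CPU11, errID 0x00092f64.77726c8d",
    "Jul 10 01:43:34 tronsd81 unix: AFSR 0x00000000.00800100 AFAR 0x00000179.fe75f940",
    "Jul 10 01:43:34 tronsd81 unix: AFSR.PSYND 0x0100(Score 95) AFSR.ETS 0x00 Fault_PC 0x100171b0",
    "Jul 10 01:43:58 tronsd81 unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x00092f6a.34dfd7e1",
    "Jul 10 01:43:58 tronsd81 unix: AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10021058",
]

for event in find_ecache_events(SAMPLE):
    print(event)
```

On the sample lines, only the WP event on CPU11 is reported; the Uncorrectable Memory Error on CPU0 is skipped because its warning text does not name an EDP, WP, or CP event.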

Oracle's Best Practice for ecache events:

The first thing to do when an ecache event is identified is to determine whether it is the first ecache event to have taken place on that CPU.

Oracle's Best Practice for ecache events is to replace the CPU if this is the second ecache parity error event to take place on the same CPU in the last 6 months.

A single event is considered to be transient in nature. The customer should be instructed to record the fault and monitor for any repeat event on the same CPU over the next 6 months.

There is ONE exclusion to the Best Practice rule:

If the CPU module in question is a Mirrored SRAM (or Sombra module) it should be replaced the first time it encounters an ecache event.

If a customer states that they will not follow the Best Practice recommendations, do not argue with the customer. Instead, collaborate with the next level of technical support and have a senior engineer discuss Best Practices with the customer.

In cases where it is unknown whether the CPU has had a previous ecache error, where case/service request data is not known, and the customer is unsure of the status, the recommendation is to consider the fault to be transient and consider this event a first event.
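The Best Practice rule above can be summarised as a small decision function. This is a minimal sketch rather than an official tool: the function name should_replace_cpu, its parameters, and the approximation of "6 months" as 183 days are assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 183 days.
SIX_MONTHS = timedelta(days=183)


def should_replace_cpu(event_date, prior_event_dates, mirrored_sram=False):
    """Apply the Best Practice rule to a new ecache event on one CPU.

    prior_event_dates holds the dates of earlier ecache events recorded
    for the same CPU. An empty list covers both "no earlier event" and
    "history unknown", which the text treats as a first event.
    """
    # Exclusion: a Mirrored SRAM (Sombra) module is replaced on its first event.
    if mirrored_sram:
        return True
    window_start = event_date - SIX_MONTHS
    # Replace when an earlier event on this CPU falls within the window.
    return any(window_start <= d <= event_date for d in prior_event_dates)


# First (or unknown-history) event: record the fault and monitor.
print(should_replace_cpu(date(2012, 7, 10), []))
# Second event within 6 months on the same CPU: replace.
print(should_replace_cpu(date(2012, 7, 10), [date(2012, 3, 1)]))
```

Keeping the Mirrored SRAM exclusion as an explicit flag makes the single exception to the rule visible at the call site.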

Additional Resources:

Also, FIN I0616-1 provides excellent details on the various types of ecache events that can be encountered.

Still have questions?

Collaborate with the appropriate Technical Support Team, create an escalation, or log into their IM chat room (GL-ESG or GL-VSP) and ask for assistance.

For those who are curious about the relevance of the 'Score' that is emitted with the message: it is a relative indication of how likely it is that this P_SYND means the reporting module is bad. We call it a "score", though, not a relative likelihood.

Note: The score is a heuristic, not a rule, so it's not always 100% correct. It's based on the pattern of data observed at the time the CPU gets it, compared with the other error correction data it has at hand. Though quite a number of different scores are listed, in practice, we usually only see Score 95 (Very good chance of being 'this' module) and Score 05 (No real idea whose fault it was).

ecache, score, 95, AFT, panic, reboot, EDP, e$, bestpractices.central, best practice
Previously Published As
79609
