Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction
Solution 1002526.1: Sun SPARC Enterprise[TM] Mx000 Servers: FMA Specified FRU Replacements
Previously Published As: 203504
Applies to:
Sun SPARC Enterprise M3000 Server - Version: Not Applicable and later
Sun SPARC Enterprise M4000 Server - Version: Not Applicable and later
Sun SPARC Enterprise M5000 Server - Version: Not Applicable
Sun SPARC Enterprise M9000-32 Server - Version: Not Applicable and later
Sun SPARC Enterprise M9000-64 Server - Version: Not Applicable
All Platforms

Goal

Investigating a Sun SPARC[TM] Enterprise Mx000 FMA-specified FRU indictment.

This document details how to initiate a Service Action Plan to investigate whether a hardware component should be replaced as implicated by the Predictive Self-Healing Diagnosis Engine (FMA DE) on a Sun SPARC Enterprise Mx000 system.

NOTE: The implicated hardware component(s) are referred to as Field Replaceable Units (FRUs) throughout this document.

Solution

This document makes a few assumptions:
To discuss this information further with Oracle experts and industry peers, we encourage you to review, join, or start a discussion in the My Oracle Support Community - M Series Servers.

Every hardware component in a Sun SPARC Enterprise Mx000 (OPL) platform is a Field Replaceable Unit (FRU). This requires that an Oracle "badged" or certified partner engineer perform the physical replacement of the component. To begin the process of FRU replacement, there are specific steps that Oracle Support Services rely on the customer to perform, as detailed below.

1. Collect the FMA fault message. The output can be displayed using "fmdump -m" on the XSCF console. Example output is as follows:

   MSG-ID: SCF-8001-4X, TYPE: Fault, VER: 1, SEVERITY: Major
   EVENT-TIME: Tue Mar 20 21:23:54 UTC 2007
   PLATFORM: SPARC-Enterprise, CSN: 7860000772, HOSTNAME: genericff2
   SOURCE: sde, REV: 1.12
   EVENT-ID: d93a7654-f414-46aa-aaf5-21f70a5af931
   DESC: The number of uncorrectable and correctable errors on single DIMM exceeds an acceptable threshold. This fault is detected while running POST. Refer to http://www.sun.com/msg/SCF-8001-4X for more information.
   AUTO-RESPONSE: The memory associated with the memory bank containing the errors is deconfigured.
   IMPACT: POST is restarted after the memory associated with the memory bank has been deconfigured.
   REC-ACTION: Schedule a repair action to replace the affected Field Replaceable Unit (FRU), the identity of which can be determined using "fmdump -v -u EVENT_ID". Please consult the detail section of the knowledge article for additional information.

2. Collect the "fmdump -v -u EVENT_ID" output. Example (uses the same event ID as the example in Step 1):

   xscf> fmdump -v -u d93a7654-f414-46aa-aaf5-21f70a5af931
   TIME                 UUID                                 MSG-ID
   Mar 20 21:23:54.0192 d93a7654-f414-46aa-aaf5-21f70a5af931 SCF-8001-4X
     100% fault.chassis.SPARC-Enterprise.memory.bank.err
          Problem in: hc:///chassis=0/cmu=0/mem=0
          Affects:    hc:///chassis=0/cmu=0/mem=0
          FRU:        hc://:product-id=SPARC-Enterprise:chassis-id=7860000772:server-id=san-ff2-21-0:serial=04126711:part=72T128000HR3.7A:revision=252b/component=/MBU_B/MEMB#0/MEM#3A

3. Collect the fault information to prepare to log a Service Request.
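The fields a Service Request needs can be pulled out of the captured output mechanically. Below is a minimal sketch run against the sample strings from the examples above rather than a live XSCF console (fmdump itself is only available on the XSCF; the parsing here is an illustration, not an official tool):

```shell
# Sample strings taken from the fmdump examples above; on a live system
# they would be captured from the XSCF console, not hard-coded.
msg_line='MSG-ID: SCF-8001-4X, TYPE: Fault, VER: 1, SEVERITY: Major'
fru='hc://:product-id=SPARC-Enterprise:chassis-id=7860000772:server-id=san-ff2-21-0:serial=04126711:part=72T128000HR3.7A:revision=252b/component=/MBU_B/MEMB#0/MEM#3A'

# The MSG-ID identifies the fault class (and the sun.com/msg article).
msg_id=$(printf '%s\n' "$msg_line" | sed -n 's/^MSG-ID: \([^,]*\),.*/\1/p')

# The FRU string packs serial, part number, and component path into one
# hc:// URI; split out the pieces a Service Request needs.
serial=$(printf '%s\n' "$fru" | sed 's/.*:serial=\([^:]*\).*/\1/')
part=$(printf '%s\n' "$fru" | sed 's/.*:part=\([^:]*\).*/\1/')
component=$(printf '%s\n' "$fru" | sed 's/.*component=//')

echo "MSG-ID:    $msg_id"
echo "Serial:    $serial"
echo "Part:      $part"
echo "Component: $component"
```

For the sample event this yields the DIMM's serial (04126711), part number (72T128000HR3.7A), and component location (/MBU_B/MEMB#0/MEM#3A), which is the information requested when logging the Service Request in the next step.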
4. Contact Oracle Support Services or your local service representative and open a Service Request.

5. Review FRU Replacement Methods information to prepare your configuration for the FRU replacement.
6. An Oracle Support Services engineer may need additional data to be collected. If so, they will specify the data to collect. Please assist in capturing the requested data so Oracle can resolve your issue with as little delay as possible. The most likely data requested will be:
Reference <Document: 1008229.1> Running explorer on Sun SPARC Enterprise[TM] Mx000 (OPL) Servers if required for help with either Explorer or Snapshot data requests.

Internal Only - Oracle Support Services Steps

1. Verify the fault event requires FRU replacement. Confirm the fault event message, fmdump output, and all data are from the same date and implicate the same FRU component.

2. Verify that "fmdump -v -u Event-ID" contains the list of FRU indictments for this fault event. The list of FRUs is displayed in the order in which they are intended to be replaced (by percentage of likelihood).

3. Verify the FRU replacement method that can be used for the specific FRU requiring service and the configuration in question. The customer may have specified a desired method, so verify whether that method is possible.

4. Create the Service Action Plan and report the recommendations to the Customer/End User. Use the Action Plan Creator Tool to create the Service Action Plan.

5. Dispatch the replacement to the appropriate field resources and choose the appropriate Canned Action Plan in ATR. Reference the Service Manual for the platform type and FRU in question if needed.

6. Contact the Customer/End User and confirm the fix. Confirm that the FRU replacement resolved the issue and no errors have repeated for at least 24 hours.

7. If the exact same fault event repeats, go back to step 2 and replace the next likeliest FRU listed in the fmdump output. If the same error persists and all FRUs in the list have been replaced, or you are unsure of the next steps, collaborate with the next level of support for further investigation.
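The "next likeliest FRU" rule follows directly from the percentages in the "fmdump -v" output. A sketch with a hypothetical two-suspect event (the suspect lines and FRU paths below are invented for illustration; real output comes from "fmdump -v -u Event-ID" on the XSCF):

```shell
# Hypothetical two-suspect fault event; each line carries the Diagnosis
# Engine's likelihood percentage for that FRU.
cat > /tmp/suspects.txt <<'EOF'
40% fault.chassis.SPARC-Enterprise.cpu.err FRU: /MBU_B/CPUM#1
60% fault.chassis.SPARC-Enterprise.cpu.err FRU: /MBU_B/CPUM#0
EOF

# Sort by likelihood, highest first: this is the order in which the
# FRUs are to be replaced, one at a time.
order=$(sort -rn /tmp/suspects.txt | awk '{print NR ". " $NF}')
echo "$order"
```

With this invented data, the 60% suspect (/MBU_B/CPUM#0) is listed first and would be replaced first; only if the identical fault event recurred would the 40% suspect be replaced next.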
FMA Event Messaging identifies the primary FRU or FRUs involved in a particular error event. The FRUs are intended to be replaced in the order in which they are listed in the event messaging (one at a time), because the Diagnosis Engine specifies the FRUs from most likely to least likely to have caused the error. In the unlikely event that errors persist after all of the FRUs in the Event Message have been replaced, the following troubleshooting references should be used to determine the next best course of action for the particular type of fault encountered. These references describe additional, sometimes quite unlikely, suspects that may be involved in an error event, perhaps as a "pass-thru" device or similar. They should be used as a reference to help determine the next best course of action after all of the "named suspects" have been replaced - not to preclude replacement of the FMA-identified suspects shown in the event messages. The references are grouped by the component or ASIC in error; in other words, one for CPU events, one for IOC events, and so on. Collaborate with the next level of technical expertise when using the troubleshooting articles listed below.

CPU Faults: <Document: 1006992.1>
Clock Unit Faults: <Document: 1002730.1>
DIMM Faults: <Document: 1004117.1>
FLP Faults: <Document: 1008208.1>
FMSP Faults: <Document: 1012954.1>
IOC Faults: <Document: 1006871.1>
JTAG Faults: <Document: 1017763.1>
MAC Faults: <Document: 1012818.1>
MADM Faults: <Document: 1005335.1>
MBC Faults: <Document: 1002809.1>
Power Faults: <Document: 1006872.1>
RCI Faults: <Document: 1012820.1>
SC Faults: <Document: 1002629.1>
SW Faults: <Document: 1012821.1>
Thermal Faults: <Document: 1004122.1>
XB Faults: <Document: 1008211.1>
XSCFU Faults: <Document: 1012822.1>

Attachments
This solution has no attachment.