Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction
Solution 1002526.1: Sun SPARC Enterprise[TM] Mx000 Servers: FMA Specified FRU Replacements
Previously Published As: 203504
Applies to:
Sun SPARC Enterprise M3000 Server - Version: Not Applicable and later
Sun SPARC Enterprise M4000 Server - Version: Not Applicable and later
Sun SPARC Enterprise M5000 Server - Version: Not Applicable
Sun SPARC Enterprise M9000-32 Server - Version: Not Applicable and later
Sun SPARC Enterprise M9000-64 Server - Version: Not Applicable
All Platforms

Goal

Investigating a Sun SPARC[TM] Enterprise Mx000 FMA-specified FRU indictment.

This document details how to initiate a Service Action Plan to investigate whether a hardware component should be replaced as implicated by the Predictive Self-Healing Diagnosis Engine (FMA DE) on a Sun SPARC Enterprise Mx000 system.

NOTE: The implicated hardware component(s) are referred to as Field Replaceable Units (FRUs) throughout this document.

Solution

This document makes a few assumptions:
To discuss this information further with Oracle experts and industry peers, we encourage you to review, join, or start a discussion in the My Oracle Support Community - M Series Servers.

Every hardware component in a Sun SPARC Enterprise Mx000 (OPL) platform is a Field Replaceable Unit (FRU). This requires that an Oracle "badged" or certified partner engineer perform the physical replacement of the component. To begin the process of FRU replacement, there are specific steps that Oracle Support Services rely on the customer to perform, as detailed below.

1. Collect the FMA fault message. The output can be displayed using "fmdump -m" on the XSCF console. Example output is as follows:

   MSG-ID: SCF-8001-4X, TYPE: Fault, VER: 1, SEVERITY: Major
   EVENT-TIME: Tue Mar 20 21:23:54 UTC 2007
   PLATFORM: SPARC-Enterprise, CSN: 7860000772, HOSTNAME: genericff2
   SOURCE: sde, REV: 1.12
   EVENT-ID: d93a7654-f414-46aa-aaf5-21f70a5af931
   DESC: The number of uncorrectable and correctable errors on single DIMM exceeds an acceptable threshold. This fault is detected while running POST. Refer to http://www.sun.com/msg/SCF-8001-4X for more information.
   AUTO-RESPONSE: The memory associated with the memory bank containing the errors is deconfigured.
   IMPACT: POST is restarted after the memory associated with the memory bank has been deconfigured.
   REC-ACTION: Schedule a repair action to replace the affected Field Replaceable Unit (FRU), the identity of which can be determined using "fmdump -v -u EVENT_ID". Please consult the detail section of the knowledge article for additional information.

2. Collect the "fmdump -v -u EVENT_ID" output. Example (uses the same event ID as the example in Step 1):

   xscf> fmdump -v -u d93a7654-f414-46aa-aaf5-21f70a5af931
   TIME                 UUID                                 MSG-ID
   Mar 20 21:23:54.0192 d93a7654-f414-46aa-aaf5-21f70a5af931 SCF-8001-4X
     100% fault.chassis.SPARC-Enterprise.memory.bank.err
          Problem in: hc:///chassis=0/cmu=0/mem=0
          Affects:    hc:///chassis=0/cmu=0/mem=0
          FRU:        hc://:product-id=SPARC-Enterprise:chassis-id=7860000772:server-id=san-ff2-21-0:serial=04126711:part=72T128000HR3.7A:revision=252b/component=/MBU_B/MEMB#0/MEM#3A

3. Collect the fault information to prepare to log a Service Request.
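The fields a Service Request needs can be pulled out of the captured output mechanically. Below is a minimal sketch run against the sample strings from the examples above rather than a live XSCF console (fmdump itself is only available on the XSCF; the parsing here is an illustration, not an official tool):

```shell
# Sample strings taken from the fmdump examples above; on a live system
# they would be captured from the XSCF console, not hard-coded.
msg_line='MSG-ID: SCF-8001-4X, TYPE: Fault, VER: 1, SEVERITY: Major'
fru='hc://:product-id=SPARC-Enterprise:chassis-id=7860000772:server-id=san-ff2-21-0:serial=04126711:part=72T128000HR3.7A:revision=252b/component=/MBU_B/MEMB#0/MEM#3A'

# The MSG-ID identifies the fault class (and the sun.com/msg article).
msg_id=$(printf '%s\n' "$msg_line" | sed -n 's/^MSG-ID: \([^,]*\),.*/\1/p')

# The FRU string packs serial, part number, and component path into one
# hc:// URI; split out the pieces a Service Request needs.
serial=$(printf '%s\n' "$fru" | sed 's/.*:serial=\([^:]*\).*/\1/')
part=$(printf '%s\n' "$fru" | sed 's/.*:part=\([^:]*\).*/\1/')
component=$(printf '%s\n' "$fru" | sed 's/.*component=//')

echo "MSG-ID:    $msg_id"
echo "Serial:    $serial"
echo "Part:      $part"
echo "Component: $component"
```

For the sample event this yields the DIMM's serial (04126711), part number (72T128000HR3.7A), and component location (/MBU_B/MEMB#0/MEM#3A), which is the information requested when logging the Service Request in the next step.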
4. Contact Oracle Support Services or your local service representative and open a Service Request.

5. Review FRU Replacement Methods information to prepare your configuration for the FRU replacement.
6. An Oracle Support Services engineer may need additional data to be collected. If so, they will specify the data to collect. Please assist in capturing the requested data so Oracle can resolve your issue with as little delay as possible. The most likely data requested will be:
Reference <Document: 1008229.1> Running explorer on Sun SPARC Enterprise[TM] Mx000 (OPL) Servers if required for help with either Explorer or Snapshot data requests.

Internal Only - Oracle Support Services Steps

1. Verify the fault event requires FRU replacement. Confirm the fault event message, fmdump output, and all data are from the same date and implicate the same FRU component.

2. Verify that "fmdump -v -u Event-ID" contains the list of FRU indictments for this fault event. The list of FRUs is displayed in the order in which they are intended to be replaced (by percentage of likelihood).

3. Verify the FRU replacement method that can be used for the specific FRU requiring service and the configuration in question. The customer may have specified a desired method, so verify whether that method is possible.

4. Create the Service Action Plan and report the recommendations to the Customer/End User. Use the Action Plan Creator Tool to create the Service Action Plan.

5. Dispatch the replacement to the appropriate field resources and choose the appropriate Canned Action Plan in ATR. Reference the Service Manual for the platform type and FRU in question if needed.

6. Contact the Customer/End User and confirm the fix. Confirm that the FRU replacement resolved the issue and no errors have repeated for at least 24 hours.

7. If the exact same fault event repeats, go back to step 2 and replace the next likeliest FRU listed in the fmdump output. If the same error persists and all FRUs in the list have been replaced, or you are unsure of the next steps, collaborate with the next level of support for further investigation.
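The "next likeliest FRU" rule follows directly from the percentages in the "fmdump -v" output. A sketch with a hypothetical two-suspect event (the suspect lines and FRU paths below are invented for illustration; real output comes from "fmdump -v -u Event-ID" on the XSCF):

```shell
# Hypothetical two-suspect fault event; each line carries the Diagnosis
# Engine's likelihood percentage for that FRU.
cat > /tmp/suspects.txt <<'EOF'
40% fault.chassis.SPARC-Enterprise.cpu.err FRU: /MBU_B/CPUM#1
60% fault.chassis.SPARC-Enterprise.cpu.err FRU: /MBU_B/CPUM#0
EOF

# Sort by likelihood, highest first: this is the order in which the
# FRUs are to be replaced, one at a time.
order=$(sort -rn /tmp/suspects.txt | awk '{print NR ". " $NF}')
echo "$order"
```

With this invented data, the 60% suspect (/MBU_B/CPUM#0) is listed first and would be replaced first; only if the identical fault event recurred would the 40% suspect be replaced next.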
FMA Event Messaging identifies the primary FRU or FRUs involved in a particular error event. The FRUs are intended to be replaced in the order in which they are listed in the event messaging (one at a time), because the Diagnosis Engine specifies the FRUs from most likely to least likely to have caused the error. In the unlikely event that errors persist after all of the FRUs in the Event Message have been replaced, the following troubleshooting references should be used to determine the next best course of action for the particular type of fault encountered. These references describe additional, sometimes quite unlikely, suspects that may be involved in an error event, perhaps as a "pass-thru" device or similar. They should be used as a reference to help determine the next best course of action after all of the "named suspects" have been replaced - not to preclude replacement of the FMA-identified suspects shown in the event messages. The references are grouped by the component or ASIC in error; in other words, one for CPU events, one for IOC events, and so on. Collaborate with the next level of technical expertise when using the troubleshooting articles listed below.

CPU Faults: <Document: 1006992.1>
Clock Unit Faults: <Document: 1002730.1>
DIMM Faults: <Document: 1004117.1>
FLP Faults: <Document: 1008208.1>
FMSP Faults: <Document: 1012954.1>
IOC Faults: <Document: 1006871.1>
JTAG Faults: <Document: 1017763.1>
MAC Faults: <Document: 1012818.1>
MADM Faults: <Document: 1005335.1>
MBC Faults: <Document: 1002809.1>
Power Faults: <Document: 1006872.1>
RCI Faults: <Document: 1012820.1>
SC Faults: <Document: 1002629.1>
SW Faults: <Document: 1012821.1>
Thermal Faults: <Document: 1004122.1>
XB Faults: <Document: 1008211.1>
XSCFU Faults: <Document: 1012822.1>

Attachments
This solution has no attachment.