Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 1019667.1: Sun Fire[TM] Server System Board (SB) voltage errors
Previously Published As: 243326
Applies to:
Sun Fire 4800 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Fire E2900 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Fire 3800 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Fire 6800 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Fire E6900 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
All Platforms

Symptoms
This document describes how to identify and resolve Sun Fire[TM] Server System Board (SB) voltage errors. The servers included in this product family are as follows:
Error Messages: Look for these "Key Indicators" of a voltage issue in error messaging in the System Controller's (SC) log files (showlogs) or console:
Some examples of those "Key Indicators" from actual failure messaging found in showlogs files:
The examples show SB0, but the board in question could be any SB in the system, and the errors would generally be similar. The showenvironment command may also show an "ERROR LOW" status for the SB and a 3.3 VDC sensor value of 0.xx (in other words, less than the LoWarn value).
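The showenvironment check described above can be sketched in a few lines of Python. This is only an illustration: the article does not show the showenvironment column layout, so the sketch works on already-parsed records. The SensorReading type, its field names, the low_voltage_boards helper, and the sample numbers are all assumptions made for the example, not values from a real system.

```python
from dataclasses import dataclass


@dataclass
class SensorReading:
    """Hypothetical, already-parsed row of showenvironment data."""
    board: str      # e.g. "SB0"
    sensor: str     # e.g. "3.3 VDC"
    value: float    # measured voltage
    lo_warn: float  # LoWarn threshold for this sensor
    status: str     # status text, e.g. "OK" or "ERROR LOW"


def low_voltage_boards(readings):
    """Return boards flagged ERROR LOW or reading below their LoWarn value."""
    flagged = []
    for r in readings:
        is_low = r.status.strip().upper() == "ERROR LOW" or r.value < r.lo_warn
        if is_low and r.board not in flagged:
            flagged.append(r.board)
    return flagged


# Made-up numbers for illustration only; real thresholds come from showenvironment.
readings = [
    SensorReading("SB0", "3.3 VDC", 0.02, 3.0, "ERROR LOW"),
    SensorReading("SB2", "3.3 VDC", 3.3, 3.0, "OK"),
]
suspect = low_voltage_boards(readings)
```

With the sample data, only SB0 is flagged, which matches the pattern the article describes: an "ERROR LOW" status together with a 3.3 VDC reading near zero.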
Expected Behavior:
When a server encounters an SB voltage error while the domain is not yet booted or in operation, the domain the board belongs to will fail POST, fail a domain or board poweron, fail a Keyswitch operation, or fail to boot properly. If the domain is already in operation, it will crash when an SB encounters a voltage issue. If the domain crashes, showlogs data might indicate a variety of Parity Error events, such as an Address Parity Error, Parity Bidi Event, L2CheckError Event, or others. The most important point is that when a domain crash is accompanied by one of the key errors described in this article, the root cause is likely to be a voltage issue which caused the Parity Error event, not the other way around. See the Additional Information section of this article for an example.

Cause
If you encounter the errors described previously, you need to contact Oracle Support Services and open a service request.

Resolution:
An error of this nature is caused by a defective power supply located directly on the System Board (called the D150). This is a factory-repaired component. The resolution is to replace the System Board.
Solution
Customers should do the following:
A service request is required to schedule the SB replacement. Customers should contact Support Services and create a service request to resolve this problem.
NOTE: For Sun Fire v1280, E2900, Netra 1280, or 1290, also implement the Additional Cooling Action advice detailed in Sun Alert 1021703.1.

Internal Support Services instructions are listed in the Internal Section of this article.

Additional Information
Parity Errors Caused by Voltage Errors
As described above, SB voltage errors can cause Parity Error events if a domain is operational when an SB goes bad. The root cause is the SB voltage issue. This is important to understand because the voltage problem can cause an address or data parity error (or any number of other strange-looking error events) when a board suddenly disappears. A parity (or data, or l2check, or other) error does not cause a voltage problem; a voltage problem DOES cause all of the above. The following is an example of just one type of error you might see:

Wed Oct 01 21:55:56 sc lom: [ID 385625 local0.error] /RP0/ar0:>
SafariPortError0[0x200] : 0x00008005
AdrPErr [00:00] : 0x1 Address parity error
FE [15:15] : 0x1
QUnfErr [02:02] : 0x1 Queue underflow error
Wed Oct 01 21:55:57 sc lom: [ID 197878 local0.error]
Wed Oct 01 21:55:58 sc lom: [ID 841584 local0.error]
[AD] Event: N1290.ASIC.AR.ADR_PERR.104a3000
CSN: DomainID: A ADInfo: 1.SCAPP.20.6
Time: Wed Oct 01 21:57:50 PDT 2008
FRU-List-Count: 2; FRU-PN: 5411384; FRU-SN: 006232; FRU-LOC: /N0/RP0
FRU-PN: 5406679; FRU-SN: 005914; FRU-LOC: /N0/SB0
Recommended-Action: Service action required
Wed Oct 01 21:56:10 sc lom: [ID 390680 local0.notice] CPU Board V3 at /N0/SB0 Device poll caused: sun.serengeti.HpuFailedException: CpuVoltageA2D.getOutputVoltage: sun.serengeti.CommException: I2cComm.readCmd: Path broken between CBH and SDC: SB0.sbbc1.regs.c0 (102000c0)
Wed Oct 01 21:56:10 sc lom: [ID 336982 local0.notice] Device will not be polled
Wed Oct 01 21:56:10 sc lom: [ID 120592 local0.notice] /N0/SB0, sensor status, outside acceptable limits (7,1,0x207000d00070000)

Collaborate with the Oracle Support Engineer if you need further explanation of this concept. Do not replace every component implicated in the errors. Only the SB that suffered the voltage error needs to be replaced; reset the CHS status of the other components, which are perfectly sound. Their parity errors were a natural response to the SB that lost power "disappearing" from the configuration suddenly.
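The triage rule above (replace only the board with the voltage fault, reset the CHS status of the other implicated components) can be illustrated with a short Python sketch. It is a rough aid under stated assumptions, not a supported tool. The marker strings are copied from the example log in this article and are not a complete list of voltage indicators. The triage function, the VOLTAGE_MARKERS tuple, and the FRU_LOC pattern are hypothetical names introduced here, and the pattern assumes FRU locations look like /N0/SB0 as they do in the example.

```python
import re

# Substrings taken from the example log in this article; illustrative only,
# not an exhaustive list of voltage indicators.
VOLTAGE_MARKERS = (
    "CpuVoltageA2D.getOutputVoltage",
    "sensor status, outside acceptable limits",
)

# Assumed FRU location shape, e.g. /N0/SB0 or /N0/RP0, as seen in the example log.
FRU_LOC = re.compile(r"/N\d+/[A-Z]+\d+")


def triage(log_lines):
    """Split implicated FRUs into (replace, reset_chs) lists.

    replace:   FRUs named on a line that carries a voltage marker.
    reset_chs: every other FRU named anywhere in the log.
    """
    implicated = set()
    voltage_faulted = set()
    for line in log_lines:
        locs = FRU_LOC.findall(line)
        implicated.update(locs)
        if any(marker in line for marker in VOLTAGE_MARKERS):
            voltage_faulted.update(locs)
    return sorted(voltage_faulted), sorted(implicated - voltage_faulted)


# Lines excerpted from the example log in this article.
sample = [
    "FRU-List-Count: 2; FRU-PN: 5411384; FRU-SN: 006232; FRU-LOC: /N0/RP0",
    "FRU-PN: 5406679; FRU-SN: 005914; FRU-LOC: /N0/SB0",
    "CPU Board V3 at /N0/SB0 Device poll caused: sun.serengeti.HpuFailedException: "
    "CpuVoltageA2D.getOutputVoltage: sun.serengeti.CommException: I2cComm.readCmd: "
    "Path broken between CBH and SDC: SB0.sbbc1.regs.c0 (102000c0)",
    "/N0/SB0, sensor status, outside acceptable limits (7,1,0x207000d00070000)",
]
replace, reset_chs = triage(sample)
```

For the excerpt above, the sketch puts /N0/SB0 in the replace list and /N0/RP0 in the reset list, mirroring the guidance that the repeater board's parity error was a consequence of SB0 losing power. Any result from a script like this should still be confirmed against the full showlogs output before a board is dispatched.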
Internal Comments
INTERNAL SUPPORT ENGINEERS SHOULD DO THE FOLLOWING PRIOR TO FIELD DISPATCH:
1. Confirm the errors match what was described in the Symptoms section of this article.
2. Schedule to have the implicated System Board replaced.
3. For Sun Fire v1280, E2900, Netra 1280, or 1290 systems, make sure the customer implements the Cooling Action Advice documented in Field Action Bulletin 1021064.1 or Sun Alert 1021703.1 (contract reference) on all v1280, E2900, n1280, or n1290 systems in their environment.
4. Reset the CHS status of all other components back to 'ok'. Instructions are in 1004879.1.
5. Execute POST testing on the configuration (after SB replacement and CHS status resets). Reboot the Main SC to avoid any chance of encountering Sun CR 6777187 (symptoms similar to CR 6300392) observed from some ScApp 5.20.3 (or greater) installations.
6. Collaborate with the next level of technical support if errors persist or there are any questions with this recommendation.

Additional Information:
- For I/O Board (IB) voltage errors, see 1017844.1
- If encountering difficulty with the replacement action and specifically when trying to issue a "setkeyswitch on", see 1011267.1.

Attachments
This solution has no attachment