Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 1009533.1: Sun Fire[TM] 12K/15K/20K/25K: DR not possible when CSB is blacklisted and domain configured with both Centreplane halves active
Previously Published As 213152
Applies to:
Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Symptoms
It has been observed that when a CSB (Centreplane Support Board) loses one of its redundant power supplies, attempting to DR (Dynamic Reconfigure) a system board into a domain will fail. This is due to the design of the hpost process: to DR in a system board, hpost attempts to set the board to use the same Address, Response and Data bus configuration as the running domain. With the CSB blacklisted, that configuration cannot be achieved, so hpost fails the SB (System Board).

This is, in fact, not a bug. By design, hpost (Host Power On Self Test) does not change the current bus configuration as part of a DR operation. hpost can make bus configuration decisions when the domain is being started up from cold, just not during a DR. The method to change the bus configuration on a running domain is 'setbus'.

Solution
Resolution
This failure is characterized by a very early failure of POST. The complete POST log below shows that POST fails almost immediately, and the only real details we have to go on are the ESMD (Environmental Status Monitoring Daemon) blacklist message, listing "cplane 1" (aka CSB1) as disabled, and the message "No minimum system left after blacklist file". So, there is not a great deal to go on in the POST log, but there are other places we can look for the data.

First, the hpost log we have already discussed:

========================================================================
# SMI Sun Fire 12/15/20/25K POST log opened Tue Jun 13 00:39:13 2006
# hpost version 1.5 Generic 120648-04 Apr 24 2006 12:10:28
# libxcpost.so v. 1.5 Generic 120648-04 Apr 24 2006 11:48:42
# pid = 9538 level = 16 verbose_level = 20
# SC name: e25k1-sc1. ChHostID: XX00XX00XX00X
# Domain Id = A
# Parent PID = 6081: dxs
# Cmdline: /opt/SUNWSMS/SMS1.5/bin/hpost -dA -H16.0
Significant contents of .postrc (platform) /etc/opt/SUNWSMS/SMS1.5/config/platform/.postrc:
# ident "@(#)postrc 1.1 01/04/02 SMI"
Reading domain blacklist file /etc/opt/SUNWSMS/config/A/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading platform blacklist file /etc/opt/SUNWSMS/config/platform/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading system ASR blacklist file /etc/opt/SUNWSMS/config/asr/blacklist ...
cplane 1 # ESMD Power Failure 0610.1718.56
SEEPROM probe took 0 seconds.
Reading Component Health Status (CHS) information ...
No minimum system left after blacklist file! Bailing out!
Exitcode = 48: No system after domain, .postrc, blacklist, etc.
POST (level=16, verbose=20, -H16.0) execution time 1:09
# SMI Sun Fire 12/15/20/25K POST log closed Tue Jun 13 00:40:22 2006
========================================================================

Then we have the platform log, located in /var/opt/SUNWSMS/adm/platform/messages, or in /<explorer-dir>/sf15k/adm/platform/messages if you are checking an explorer. This log may contain messages that tell us more about previous failures. In the case that this document was written about, there was a prior CSB power supply failure.

========================================================================
Jun 13 00:29:40 2006 e25k1-sc1 esmd[5620]: [2000 2674760785927349 ERR SysControl.cc 1536] A failure has been detected on redundant PS at ps1_power_good_l; located on CSB at CS1. SCHEDULE REPLACEMENT of CSB at CS1 as soon as possible to restore redundancy.
========================================================================

This board has redundant power supplies, so the platform kept running after this failure. However, as the message notes, we should schedule a replacement as soon as possible.
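When checking a long platform log or an explorer, the search for this esmd message can be scripted. The following is a minimal sketch only, not an SMS tool: the function and pattern names are hypothetical, the pattern is taken from the single sample message quoted above (other SMS releases may word it differently), and it assumes each message sits on one line of the log.

```python
import re

# Pattern derived from the one esmd message quoted in this document.
# Assumption: the wording is the same in the log being searched.
REDUNDANT_PS_RE = re.compile(
    r"esmd\[\d+\]:.*?A failure has been detected on redundant PS at "
    r"(?P<signal>\S+); located on (?P<board>\w+) at (?P<slot>\w+)\."
)


def find_redundant_ps_failures(log_text):
    """Return a (signal, board, slot) tuple for each matching esmd message."""
    return [
        (m.group("signal"), m.group("board"), m.group("slot"))
        for m in REDUNDANT_PS_RE.finditer(log_text)
    ]


# The sample message from the platform log excerpt above, as one line.
SAMPLE = (
    "Jun 13 00:29:40 2006 e25k1-sc1 esmd[5620]: [2000 2674760785927349 ERR "
    "SysControl.cc 1536] A failure has been detected on redundant PS at "
    "ps1_power_good_l; located on CSB at CS1. SCHEDULE REPLACEMENT of CSB "
    "at CS1 as soon as possible to restore redundancy."
)

if __name__ == "__main__":
    for signal, board, slot in find_redundant_ps_failures(SAMPLE):
        print(f"{board} at {slot}: redundant PS failure ({signal})")
```

To use it against a real log, read the contents of the platform messages file into a string and pass that string to find_redundant_ps_failures in place of SAMPLE. A hit on a CSB is a hint to check the ASR blacklist before attempting a DR.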
The catch is that the power supply failure also causes an entry to be made in the ASR blacklist, which hpost must obey.

From the Solaris[TM] side, within the domain, all you see as a result of this type of failure is a somewhat generic error:

# cfgadm -c configure SB13
Jun 23 10:13:13 v4u-15ka-e-epar02 drmach: WARNING: SMS hpost reported error, see POST log for details
cfgadm: Hardware specific failure: test SB13: SMS hpost reported error, see POST log for details

This directs us to check the POST output.

So, we have set the scene, and now know that with the CSB partially failed and listed in the ASR blacklist, we cannot DR a system board into the domain. What is the solution?

The only supported and sensible answer to this question is to replace the CSB with the failed power supply!

An example process follows. (Note: These processes are covered in great detail in the 15K and 25K service manuals. This document supplies only the minimum detail.)

Let's assume that the failed CSB is CSB1, and the main SC is SC1. We assume this configuration because it is the hardest to work around. In essence, we need to get SC0 (the SC in the *good* CSB) to be main, stop using the failed CSB and then replace it.

- Failover from SC1 to SC0
  - setfailover on
    Wait for sync to complete.
  - setfailover force
    This fails the SCs over.
- Stop using CSB1
  - setbus -c cs0
    This disables the Address, Data and Response buses for all CSB1 supported paths. This means we are ready to replace the CSB.
- Halt SC1
- From SC0, power off SC1, SCPER1 and CSB1
  - poweroff sc1 scper1 csb1
- Remove SC1, SCPER1 and CSB1
- Install new CSB1
- Install SCPER1 and SC1 *in that order*
- SC1 automatically powers on and boots
- setfailover on
  Wait a few minutes for sync.
- Start using all buses again
  - setbus -c cs0,cs1
- Done!

Relief/Workaround
Using setbus, we can work around this issue.

Note: This assumes that CS1 is the failed CSB, and all work is done on the main SC.

- Disable all CSB1 supported buses
  - setbus -c cs0
- Perform the DR operation
- Replace the CSB at the first opportunity! Redundancy in your platform depends on having both CSBs working at 100% capacity. See the 'solution' above.

Additional Information
Apollo Escalation Id: 1-17735996
Radiance Case: 10868708
Radiance Task: 21830653

See also:
- Technical Instruction <Document 1003308.1> Sun Fire[TM] 12K/15K/E20K/E25K: esmd warning; A power failure has been detected on a redundant power supply at ...

Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server
Amazon 20/25 Phase 2 Hardware

Keywords
starcat, CSB, setbus, dynamic reconfiguration, DR, SMS, ASR, hpost

Previously Published As 86064

Attachments
This solution has no attachment