Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1006623.1: Domains on Sun Fire [TM] 12k/15k/E20k/E25k may dstop when SMS is simultaneously started on both SCs
Previously Published As: 209238

Applies to:
Sun Fire E25K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E20K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 12K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 15K Server - Version Not Applicable to Not Applicable [Release N/A]
All Platforms

Symptoms
Domains on Sun Fire [TM] 12k/15k/E20k/E25k may dstop due to clock issues when SMS is simultaneously started on both SCs.

Cause
The system controller (SC) in Sun Fire high-end systems is a CP1500- or CP2140-based printed circuit board (PCB) that provides critical services and resources required for the operation and control of the Sun Fire system. The SCs provide several services for the Sun Fire system via the System Management Services (SMS). Among these services, the SCs provide:
The SC that controls the platform is referred to as the main SC, while the other SC acts as a backup and is called the spare SC. When SMS is started simultaneously on both SCs, both start their daemons in "main" mode. During startup, the main SC reprograms the boards to select its clock as input. Because each SC believes that it is the main SC, the synchronization of the clocks is not guaranteed (the spare SC is supposed to "follow" the main SC's clock), and both SCs try to reprogram the boards. The SMS software checks that the SCs' clocks are synchronized before reprogramming the boards, but the programming process may take a couple of seconds and the clocks may lose synchronization during this small window. In this case, the domains using the reprogrammed boards may dstop due to clock issues.

After a short period, the SMS software detects that both SCs are trying to become the main SC, and SC0 resets SC1 to prevent what is called a split brain situation. However, this detection occurs too late to prevent the clocks from losing synchronization while the boards are being reprogrammed.

Solution
Always start SMS on the previous main SC first, wait for the SMS startup process to complete, and then start SMS on the spare SC. For example, from the README file of the SMS patches:

Special Install Instructions:
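The recommended start order (previous main SC first, wait for startup to complete, then the spare SC) can be sketched in Python as follows. This is an illustrative outline only; it is not text from the README and not part of SMS. The callables `start_sms` and `is_sms_up` are hypothetical hooks that the operator would have to supply (for example, wrappers around whatever remote command and readiness check the site uses), and the timeout and polling values are arbitrary assumptions.

```python
import time
from typing import Callable, List


def start_sms_in_order(
    start_sms: Callable[[str], None],   # hypothetical hook: starts SMS on the named SC
    is_sms_up: Callable[[str], bool],   # hypothetical hook: True once SMS startup is complete
    main_sc: str,
    spare_sc: str,
    timeout_s: float = 600.0,           # assumed upper bound, not a documented value
    poll_s: float = 5.0,                # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start SMS on the previous main SC, wait until it is up, then start the spare SC."""
    started: List[str] = []

    # Step 1: start SMS on the previous main SC only.
    start_sms(main_sc)
    started.append(main_sc)

    # Step 2: wait for the startup to complete before touching the spare SC.
    waited = 0.0
    while not is_sms_up(main_sc):
        if waited >= timeout_s:
            raise TimeoutError(f"SMS on {main_sc} did not finish starting")
        sleep(poll_s)
        waited += poll_s

    # Step 3: only now start SMS on the spare SC.
    start_sms(spare_sc)
    started.append(spare_sc)
    return started
```

The point of the sketch is the ordering: the spare SC is never started while the main SC is still inside its startup window, so the two SCs cannot both conclude that they are main.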
Internal Section
Each System Controller (SC) has several clock frequency generators. During normal operation, the SCs use the 75 MHz voltage controlled crystal oscillator (VCXO) to source clocking for the entire platform. The control voltage circuit of the VCXO uses either the output of a phase-frequency comparator or a fixed voltage reference to drive the VCXO. The reference is selected by a software write to a register (REF_SEL) in the Gchip. When REF_SEL=1 (this is called 'leader mode'), the SC uses the fixed voltage reference to drive the VCXO. When REF_SEL=0 (this is called 'follower mode'), the SC compares its clock against the clock coming from the other SC and uses the error signal resulting from this comparison to drive the VCXO. This causes the SC clock to follow the other SC clock. By setting the MAIN SC REF_SEL to 1 and the SPARE SC REF_SEL to 0, the two SCs will generate identical clock signals. This is called phase lock, and it is necessary to ensure a safe clock failover. When the SCs are both in 'leader mode' or both in 'follower mode', the phase lock is not guaranteed.
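The REF_SEL rule above reduces to a simple condition: phase lock can be expected only when exactly one SC is in 'leader mode' and the other is in 'follower mode'. A minimal Python sketch of that rule follows. The constant and function names are illustrative assumptions; the code models the logic only and does not access the Gchip register.

```python
LEADER = 1    # REF_SEL=1: VCXO driven by the fixed voltage reference
FOLLOWER = 0  # REF_SEL=0: VCXO driven by the phase-frequency comparator output


def phase_lock_expected(ref_sel_a: int, ref_sel_b: int) -> bool:
    """Return True only for one leader plus one follower.

    Leader/leader and follower/follower both return False, because
    phase lock is not guaranteed in either combination.
    """
    for value in (ref_sel_a, ref_sel_b):
        if value not in (LEADER, FOLLOWER):
            raise ValueError(f"REF_SEL must be 0 or 1, got {value!r}")
    return ref_sel_a != ref_sel_b
```

For example, `phase_lock_expected(LEADER, FOLLOWER)` is true for the normal MAIN/SPARE configuration, while `phase_lock_expected(FOLLOWER, FOLLOWER)` is false.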
2) Failover Management Daemon (fomd) Background
When fomd is started, its state is UNKNOWN and its initial task is to determine the role of the SC. To determine its own role, fomd must determine the state of the opposite SC. This is accomplished by checking whether the opposite SC is responding to a Remote Procedure Call (RPC) via the I2 network and/or is generating a hardware heartbeat. If the opposite SC is running as MAIN, the starting fomd becomes SPARE. If the opposite SC is not producing a heartbeat, the starting fomd assumes the main role: it changes its state to BECOMING_MAIN, starts generating a heartbeat, and instructs the Hardware Access Daemon (hwad) to change its state to MAIN.

When SMS is started simultaneously on both SCs, the check for a heartbeat fails on each SC and both SCs proceed to BECOMING_MAIN (and both SCs instruct hwad to become MAIN). To prevent a split brain (both SCs in the MAIN role), SC0 tries again to determine the role of SC1 after entering the BECOMING_MAIN state. If SC1 is not SPARE at this time, SC0 resets SC1 and instructs hwad to restart the MAIN initialization, to make sure it undoes anything the spare might have done before it was reset.
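The role decision described above can be modelled as a small state function. The following Python sketch is a simplified illustration, not the fomd implementation: the names `Role`, `decide_role` and `sc0_recheck` are hypothetical, and the case where the opposite SC produces a heartbeat but is not MAIN is not covered by the description above, so the sketch leaves the state as UNKNOWN there as an explicit assumption.

```python
from enum import Enum
from typing import List


class Role(Enum):
    UNKNOWN = "UNKNOWN"
    BECOMING_MAIN = "BECOMING_MAIN"
    MAIN = "MAIN"
    SPARE = "SPARE"


def decide_role(peer_role: Role, peer_heartbeat: bool) -> Role:
    """Initial role decision of a starting fomd, as described in the text."""
    if peer_role is Role.MAIN:
        return Role.SPARE            # opposite SC already runs as MAIN
    if not peer_heartbeat:
        return Role.BECOMING_MAIN    # no heartbeat seen: assume the main role
    return Role.UNKNOWN              # assumption: case not described in the text


def sc0_recheck(sc1_role: Role) -> List[str]:
    """Split-brain guard run by SC0 after it has entered BECOMING_MAIN."""
    if sc1_role is Role.SPARE:
        return []
    return ["reset SC1", "restart MAIN initialization in hwad"]


# Simultaneous start: neither SC sees a heartbeat from the other yet,
# so both decide BECOMING_MAIN and SC0 later has to reset SC1.
sc0_role = decide_role(Role.UNKNOWN, peer_heartbeat=False)
sc1_role = decide_role(Role.UNKNOWN, peer_heartbeat=False)
actions = sc0_recheck(sc1_role)
```

The sketch shows why the guard comes late: both SCs have already reached BECOMING_MAIN, and have already instructed hwad to become MAIN, by the time SC0 repeats the check.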
When hwad is becoming MAIN, it first configures the SC in 'follower mode' (setting REF_SEL to 0), then it reads the device presence registers to decide which boards are present and registers these boards. If the SC is able to register all boards and if the clocks are phase locked, the SC will program the Smart Phased Lock Loop (SPLL) on each board to select its clock as input.
4) Potential clock issue when SMS is started simultaneously on both SCs
When both SCs are started simultaneously, the check by fomd for a heartbeat fails and both SCs instruct hwad to become MAIN. Both SCs go into 'follower mode' and the phase lock is no longer guaranteed. Hwad checks whether the clocks are phase locked before reprogramming the SPLLs on the boards, but it may take a couple of seconds to reprogram the SPLLs. There is a small chance of losing the phase lock while the SPLLs are being reprogrammed. In this case, if the SPLL was previously programmed to use the clock from the other SC, the domain using the reprogrammed board may dstop due to a clock issue.
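The window described above is a check-then-act race: phase lock is verified once, and the SPLL writes that rely on it happen over the following seconds. The Python sketch below illustrates this with a deliberately simplified timing model. The function name, the board names, the assumption that boards are programmed one after another at a fixed interval, and the `lock_lost_at` parameter are all hypothetical and are not taken from SMS.

```python
from typing import Dict, List, Optional


def boards_at_risk(
    previous_source: Dict[str, str],   # board name -> SC whose clock its SPLL used before
    programming_sc: str,               # SC that is now reprogramming the SPLLs
    seconds_per_board: float,          # assumed fixed time per SPLL write
    lock_lost_at: Optional[float],     # seconds after the phase-lock check, or None
) -> List[str]:
    """Return boards switched to a different SC's clock after phase lock was lost.

    Time 0 is the moment hwad checks phase lock; the check is assumed to pass.
    """
    if lock_lost_at is None:
        return []                      # lock held for the whole window: no exposure
    at_risk: List[str] = []
    for index, (board, old_sc) in enumerate(previous_source.items()):
        write_done_at = (index + 1) * seconds_per_board
        switches_source = old_sc != programming_sc
        if switches_source and write_done_at > lock_lost_at:
            at_risk.append(board)
    return at_risk
```

With three hypothetical boards `{"SB0": "SC0", "SB1": "SC0", "SB2": "SC1"}` reprogrammed by SC1 at one second per board and phase lock lost after 1.5 seconds, only `SB1` is exposed: `SB0` was switched while the clocks were still locked, and `SB2` was already using SC1's clock.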
5) Example
SC0 is MAIN, SC1 is SPARE (hence all boards should use SC0's clock):

May 21 20:48:19 2005 cmc5asffsc1 fomd[725]: [8577 74116830583 NOTICE FailoverMgr.cc 3071] SC configured as Spare

Around Jul 9 09:30, failover is deactivated and SMS is stopped on both SCs:

Jul 9 09:31:00 2005 cmc5asffsc1 ssd[666]: [1302 4193069493821787 WARNING SSDApp.cc 209] SMS soft shutdown, signaling all components: signal=SIGTERM

At Jul 9 09:40, SMS is started at (nearly) the same time on both SCs. SC0 failed to register all objects due to I2C read timeouts; hence SC0 will not try to reprogram the SPLLs ... but SC1 had no issue and will reprogram the SPLLs to use SC1's clock.
Keywords: dstop, SMS, simultaneously, started, clock

Previously Published As 82352

Attachments
This solution has no attachment