Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type  Problem Resolution Sure

Solution 1012044.1 : Sun Fire[TM] 12K/15K: Recognizing SMS split brain
PreviouslyPublishedAs 216503

Symptoms

- Problem statement: Under rare circumstances, SMS has been found to suffer a split brain condition. When this occurs, both system controllers (SCs) assume the Main role, and multiple (if not all) domains crash.

- Symptoms: The split brain condition has only been observed when SMS is restarting on one of the SCs. Neither SC is able to detect the other, and both proceed to assume the Main role.

The simplest way to identify this condition is to run the SMS 'showfailover -r' command on both SCs. In a split brain condition, both SCs report the Main role, as shown below.

    xc46-sc1:sms-svc:17> showfailover -r
    MAIN

It is also possible to see this condition by searching the platform message logs (/var/opt/SUNWSMS/adm/platform/messages) for the message "SC configured as Main". In the example messages below, both SCs assumed the Main role very close in time. Neither states that it is "configured as Spare".

    SC0----> Nov 4 02:18:36 2002 ssscpsfp-sc1 fomd[464]: [8576 2729512887571282 NOTICE FailoverMgr.cc 2103] SC configured as Main
    SC1----> Nov 4 02:17:03 2002 ssscpsfp-sc2 fomd[476]: [8576 181482868172 NOTICE FailoverMgr.cc 2885] SC configured as Main

Other messages that may lead up to the assumption of the Main role include:

    Nov 4 02:09:13 2002 ssscpsfp-sc1 fomd[464]: [8599 2728949610724982 NOTICE FMHeartbeat.cc 232] Checking for SC heartbeat interrupts (can take up to 15 seconds) ...
    Nov 4 02:09:28 2002 ssscpsfp-sc1 fomd[464]: [8582 2728964770234452 NOTICE FMHealthMonitor.cc 184] Not detecting remote SC's heartbeat interrupts

Subsequent messages typically show numerous console bus failures, because two SCs are trying to access and control the same hardware. The messages might appear as follows:

    Nov 4 02:18:36 2002 ssscpsfp-sc1 hwad[435]: [1174 2729513002562554 ERR PciComm.cc 232] console bus device failed to respond correctly at address 213052da

Internal Resolution

There have been various split brain bugs in several versions of SMS.
See the PTS website for a list of the SF12K/15K/E20K/E25K suggested patches.
http://panacea.uk.oracle.com/twiki/bin/view/Products/Last_ProdPatchesFirmwareStarcat

Internal Comments

Previously Published As 50767

Product_uuid
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server

Attachments
This solution has no attachment
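
The platform log check described under Symptoms can be sketched in a few lines of Python. This is only an illustration, not an SMS tool: the function names last_role and is_split_brain are hypothetical, and the sketch assumes that a copy of each SC's /var/opt/SUNWSMS/adm/platform/messages file is available as a list of lines, and that the Spare message has the same shape as the "SC configured as Main" message quoted above.

```python
import re

# Message text is taken from the article; the function names are hypothetical.
# Assumption: fomd logs the Spare role in the same form as the Main role.
ROLE_RE = re.compile(r"fomd\[\d+\]:.*\bconfigured as (Main|Spare)\b")


def last_role(log_lines):
    """Return the most recent role ('Main' or 'Spare') logged by fomd, or None."""
    role = None
    for line in log_lines:
        match = ROLE_RE.search(line)
        if match:
            role = match.group(1)  # later lines overwrite earlier ones
    return role


def is_split_brain(sc0_lines, sc1_lines):
    """True when the latest role message in both SC logs is 'Main'."""
    return last_role(sc0_lines) == "Main" and last_role(sc1_lines) == "Main"


if __name__ == "__main__":
    # Sample lines copied from the Symptoms section.
    sc0 = [
        "Nov 4 02:18:36 2002 ssscpsfp-sc1 fomd[464]: [8576 2729512887571282 "
        "NOTICE FailoverMgr.cc 2103] SC configured as Main",
    ]
    sc1 = [
        "Nov 4 02:17:03 2002 ssscpsfp-sc2 fomd[476]: [8576 181482868172 "
        "NOTICE FailoverMgr.cc 2885] SC configured as Main",
    ]
    print("split brain suspected:", is_split_brain(sc0, sc1))
```

The sketch only compares the last role message in each log, so it does not check how close in time the two messages are. A positive result should still be confirmed by running 'showfailover -r' on both SCs.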