Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type Technical Instruction Sure Solution 1009222.1 : Sun Fire[TM] 15K/12K Servers: setkeyswitch ops report "[5358] Transmission or pcd(1M) handling of domain-down event failed: ecode=1711"
Previously Published As 212762
Applies to:
Sun Fire 15K Server
All Platforms

Goal
The SMS (System Management Services) command-line interface (CLI) command setkeyswitch(1M) changes the position of the virtual keyswitch to the specified value. setkeyswitch is responsible for powering boards on or off and for bringing up a domain.

This short article documents the troubleshooting process generally followed to isolate the root cause of failed setkeyswitch operations that report the error message "[5358] Transmission or pcd(1M) handling of domain-down event failed: ecode=1711".

Solution
The earliest symptoms of this error condition vary; for example, a Solaris[TM] reboot or init 6 operation initiated from the domain OS appears to hang. Under such circumstances, the SMS domain logs contain the following message:

Jan 26 11:28:54 2005 v4u-15ka-sc1 dsmd[2491]-R(): [2536 4842682682472752 NOTICE DomainsPatrol.cc 724] Reset domain R request received, restarting domain.

Further investigation of domain R's POST log directory (/var/opt/SUNWSMS/SMS1.4.1/adm/R/post) shows the following anomaly:

-rw-rw-rw- 1 sms-dsmd sms 0 Jan 26 11:29 post050126.1129.00.log

That is, the POST log captured from the above domain reboot event is empty (0 bytes), and the domain's status remains hung at "In Recovery":

R v4u-15ka-r - In Recovery

In addition, all attempts to recover the domain via the setkeyswitch CLI yield the ecode=1711 error message:

v4u-15ka-sc1:sms-svc:40> setkeyswitch -d R standby

The SMS platform logs for the same time period contain the following extract:

Jan 26 11:32:09 2005 v4u-15ka-sc1 ssd[724]: [1310 4842877760286195 NOTICE StartupManager.cc 3239] software component shutdown successful: name=dxs-R

As observed from the above SMS platform log extracts, the following conditions are recognized:

Given the above findings, the conclusions drawn from these observations can be reinforced by examining the current contents of the PCD database repository:

# ls -l /var/opt/SUNWSMS/SMS1.4.1/.pcd/

As observed from the above, the two temporary files that PCD set up to initiate checkpointing operations are empty. Given this data, and the fact that PCD flagged ENOSPC (errno 28) on its attempt to access parts of the PCD database, we can reasonably assume that the root cause is a lack of sufficient disk space for the two critical elements of SMS that facilitate the Solaris reboot event.
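
As a rough illustration of the two checks this article relies on (zero-length files in the PCD and POST log directories, and a nearly full file system underneath them), the following shell sketch may help. The function names, the SMS_FS_THRESHOLD variable, and the 95% default are assumptions made for this example and are not part of SMS. The sketch also assumes a df that accepts the POSIX -k and -P options; on older Solaris releases the XPG4 variant of df may be needed for -P.

```shell
#!/bin/sh
# Sketch only: helper names and the threshold are hypothetical, not SMS features.

# Print every zero-length regular file found under the given directory.
list_empty_files() {
    find "$1" -type f -size 0 -print
}

# Print the percentage used (digits only) of the file system holding the path.
# Relies on the POSIX "df -kP" layout, where field 5 of line 2 is the capacity.
fs_use_percent() {
    df -kP "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Check each directory given as an argument. Returns non-zero if any
# directory contains empty files or sits on a file system that is at or
# above the threshold.
check_sms_space() {
    threshold=${SMS_FS_THRESHOLD:-95}   # assumed warning level
    status=0
    for dir in "$@"; do
        empty=$(list_empty_files "$dir")
        if [ -n "$empty" ]; then
            echo "EMPTY files under $dir:"
            echo "$empty"
            status=1
        fi
        used=$(fs_use_percent "$dir")
        if [ -n "$used" ] && [ "$used" -ge "$threshold" ]; then
            echo "FULL: file system holding $dir is ${used}% used"
            status=1
        fi
    done
    return $status
}
```

On the system controller in this article, the function could be called with the two directories named above, for example: check_sms_space /var/opt/SUNWSMS/SMS1.4.1/.pcd /var/opt/SUNWSMS/SMS1.4.1/adm/R/post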

The error condition was finally isolated to the following disk-full condition on the root file system:

v4u-15ka-sc1:sms-svc:63> df -k

The final remedy is to free up sufficient disk space to accommodate normal SMS operations for managing and monitoring its resident domains.

Product
System Management Services 1.4.1 Software and above
Sun Fire 12K/15K/20K/25K

Internal section
Keywords: starcat, hpost, pcd 1711, sysboard_info.tmp domain_info.tmp, checkpoint, chkpt, hang, reboot, hpost ecode=44, DSMD_EVENT_DOMAIN_STOP
Previously Published As 80097

Attachments
This solution has no attachment