Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type Problem Resolution Sure
Solution 1002125.1 : Sun Fire[TM] 15K/12K/E20K/E25K Servers: RCM Daemon hanging, causing DR operations to hang
Previously Published As 203025
Applies to:
Sun Fire 15K Server
Sun Fire E25K Server
All Platforms

Symptoms
A remote Dynamic Reconfiguration (DR) operation (run from the System Controller) or a local DR operation (run from the domain) works fine until one DR operation stops responding. The $SMSLOGGER/domain_Id/messages file then reports messages like:

    DCA/DCS communication error

and/or

    dca[...]-S(): [... ERR DCSInterface.cc 378] message receive failed: DCSInterface :: receiveResponse errCode:502

In some cases, it may not be possible to kill the associated process (cfgadm, rcfgadm, deleteboard, showdevices).

Note: for background information on the DR mechanism, please refer to Document 1003582.1 (What Happens in a Sun Fire[TM] 15K/12K DR Slot0 Detach Operation).

Note: if none of the symptoms described below is true, and the signature of the rcm_daemon stack is not the same as described in this article, then it is more likely that you are facing a different issue. See the References section below for more troubleshooting steps.

Cause
This is related to the libthread issue described in Document 1000512.1 (Applications Linked to "libthread" may Hang or Terminate Abnormally During Initialization - Solaris Bug 4730459). Please note that this affects Solaris 8 (and below) only.

NOTE: as Solaris 8 is currently EOL (check the Oracle Lifetime Support Policy for detailed information), the first recommendation when facing this issue is to consider upgrading to a current Solaris release.

Issue details
In the case of a remote DR operation (run from the SC), trussing the commands confirms that the command is waiting for a Domain Configuration Agent (dca), which is in turn waiting for a Domain Configuration Server (dcs). For example, a showdevices command waits for an update from the dca process, via the door to dca and the pipe (scdrN) to dca.
A truss shows:

    18541/4: 0.3145 creat("/var/opt/SUNWSMS/SMS1.4.1/pipes/C/scdr0", 0666) = 8

Then, the dca process is waiting for a dcs process:

    29675/232: 13.3519 poll(0xFE3FBAF0, 1, 43200000) (sleeping...)

The nature of fd=12 can be determined by using the pfiles command:

    # pfiles 29675

fd=12 is the socket connection between DCA and DCS. Note that dcs always uses TCP port 665, as shown by the following:

    # grep sun-dr /etc/inetd.conf

In this situation, it is likely that many dcs processes are running, most of them stuck waiting for the rcm_daemon. This situation may be easily confirmed by:
For Example:

    # ptree 432

This confirms that dcs is waiting for the rcm_daemon via the door between the two processes. A pstack output from the rcm_daemon should report the following stack:

    # pgrep rcm_daemon

This thread is blocked in the kernel, waiting for a lock. Killing the rcm_daemon should help: the next DR operation should complete successfully, but the same symptom might come back.

As a workaround, please note that on systems running rcm_daemon patch 116991-03 or later, rcm_daemon is linked with an alternate libthread that is not affected by the issue.

Solution
The workaround (once SunOS 5.8 rcm_daemon patch 116991-03 has been installed, which now links rcm_daemon with the alternate libthread) can be applied with a script:

    #!/bin/sh
    LD_LIBRARY_PATH=/usr/lib/lwp
    export LD_LIBRARY_PATH
    LD_LIBRARY_PATH_64=/usr/lib/lwp/64
    export LD_LIBRARY_PATH_64
    /usr/lib/rcm/rcm_daemon

or via the command line:

    # pkill -9 rcm_daemon
    # LD_LIBRARY_PATH=/usr/lib/lwp LD_LIBRARY_PATH_64=/usr/lib/lwp/64 /usr/lib/rcm/rcm_daemon

On Solaris[TM] 8 Operating System, to verify that the rcm_daemon is using the alternate libthread, run a pldd(1) command against the process. It should report /usr/lib/lwp/libthread.so.1 among the loaded libraries.

For Example:

    # pgrep rcm
    7204
    # pldd 7204
    7204: /usr/lib/rcm/rcm_daemon
    /usr/lib/libgen.so.1
    /usr/lib/libelf.so.1
    /usr/lib/libdl.so.1
    /usr/lib/libcmd.so.1
    /usr/lib/libdoor.so.1
    /usr/lib/librcm.so.1
    /usr/lib/lwp/libthread.so.1
    /usr/lib/libnvpair.so.1
    /usr/lib/libdevinfo.so.1
    /usr/lib/libnsl.so.1
    /usr/lib/libsocket.so.1
    /usr/lib/libc.so.1
    /usr/lib/libmp.so.2
    /usr/platform/sun4u-us3/lib/libc_psr.so.1
    /usr/lib/rcm/modules/SUNW_cluster_rcm.so
    /usr/lib/rcm/modules/SUNW_dump_rcm.so
    /usr/lib/rcm/modules/SUNW_filesys_rcm.so
    /usr/lib/rcm/modules/SUNW_ip_rcm.so
    /usr/lib/rcm/modules/SUNW_network_rcm.so
    /usr/lib/rcm/modules/SUNW_swap_rcm.so

In summary, this situation is due to a problem in the Solaris 8 OS libthread (more details in Document 1000512.1: Applications Linking to libthread May Hang); the fix is Solaris 8 patch 108827-36 or later.

Product
Sun Fire E25K Server
Sun Fire 15K Server

Internal Section
References:
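Returning to the pldd(1) verification step in the Solution section: the check can be scripted rather than read by eye. The following is only a minimal sketch, assuming pldd output with one library path per line as in the example above. The helper names uses_alt_libthread and check_rcm_daemon are hypothetical and are not part of Solaris or SMS.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not Solaris or SMS commands.

# Reads pldd(1) output on stdin. Succeeds (exit 0) only if the alternate
# libthread under /usr/lib/lwp is listed; the default /usr/lib/libthread.so.1
# does not match because the pattern requires the /lwp/ path component.
uses_alt_libthread() {
    grep '/usr/lib/lwp/libthread\.so\.1' >/dev/null
}

# Checks every running rcm_daemon on the domain. Assumes pgrep(1) and
# pldd(1) are available, as they are in the examples in this article.
check_rcm_daemon() {
    for pid in `pgrep -x rcm_daemon`; do
        if pldd "$pid" | uses_alt_libthread; then
            echo "rcm_daemon $pid: alternate libthread loaded"
        else
            echo "rcm_daemon $pid: default libthread loaded"
        fi
    done
}
```

On a domain, sourcing the file and calling check_rcm_daemon prints one line per rcm_daemon process. Backticks and a redirect to /dev/null are used instead of $(...) and grep -q so that the sketch stays within what the Solaris 8 /bin/sh and /usr/bin/grep accept.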
Keywords: errCode:502, rcm, dcs, showdevices, rcm_daemon, DCA/DCS communication error
Previously Published As 80582
Attachments
This solution has no attachment