Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 1002125.1: Sun Fire[TM] 15K/12K/E20K/E25K Servers: RCM Daemon hanging, causing DR operations to hang
Previously Published As 203025

Symptoms

Remote Dynamic Reconfiguration (DR) operations (from the System Controller) and local DR operations (from the domain) work fine until one DR operation stops responding. The $SMSLOGGER/domain_Id/messages file then reports messages like:

    DCA/DCS communication error

and/or

    dca[...]-S(): [... ERR DCSInterface.cc 378] message receive failed: DCSInterface :: receiveResponse errCode:502

In some cases, it may not be possible to kill the associated process (cfgadm,
Note :
Technical Instruction <Document: 1003582.1> - What Happens in a Sun Fire[TM] 15K/12K DR Slot0 Detach Operation
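The two log signatures listed under Symptoms can be searched for mechanically. The following is a minimal sketch, not part of the original solution: the function name count_dr_comm_errors is hypothetical, and it assumes a readable copy of the domain messages file is passed as its only argument.

```shell
#!/bin/sh
# Hypothetical helper (not from the original document): count the lines of
# a messages file that carry either symptom signature quoted under Symptoms.
# $1: path to a messages file, e.g. a copy of $SMSLOGGER/domain_Id/messages
count_dr_comm_errors() {
    awk '/DCA\/DCS communication error/ || /errCode:502/ { n++ }
         END { print n + 0 }' "$1"
}

# Example call (the path is an assumption):
#   count_dr_comm_errors /tmp/messages.copy
```

A non-zero count only shows that the signature is present in the log; the truss, pfiles and pstack checks in the Resolution section are still needed to confirm that rcm_daemon is the process everything is waiting on.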
Note :
Resolution

In the case of a remote DR operation (from the SC), it is possible to confirm, by trussing the commands, that the command is waiting for a Domain Configuration Agent (dca), which is in turn waiting for a Domain Configuration Server (dcs).

For example, a showdevices command is waiting for an update from the dca process, via the

    18541/4: 0.3145 creat("/var/opt/SUNWSMS/SMS1.4.1/pipes/C/scdr0", 0666) = 8
    18541/4: 0.3149 pipe() = 8 [9]
    [...]
    18541/4: 1.5339 ioctl(8, I_RECVFD, 0xFE77BF24) (sleeping...)
    18541/4: fd=9 uid=11 gid=20
    18541/4: 0.3517 open("/var/opt/SUNWSMS/SMS1.4.1/doors/H/dca", O_RDONLY) = 7
    18541/4: door_call(7, 0x00048CD8) (sleeping...)
    18541/4: door_call(7, 0x00048CD8) (sleeping...)

Then, the dca process is waiting for a dcs process:

    29675/232: 13.3519 poll(0xFE3FBAF0, 1, 43200000) (sleeping...)
    29675/232: fd=12 ev=POLLIN rev=0

The nature of fd=12 can be determined by using the pfiles command:

    # pfiles 29675
    29675: dca -d C
    [...]
    12: S_IFSOCK mode:0666 dev:308,0 ino:30682 uid:0 gid:0 size:0
        O_RDWR
        sockname: AF_INET 10.2.1.1 port: 39601
        peername: AF_INET 10.2.1.4 port: 665

fd=12 is the socket connection between DCA and DCS. Port 665 is the sun-dr service, which inetd hands to dcs:

    # grep sun-dr /etc/inetd.conf
    sun-dr stream tcp wait root /usr/lib/dcs dcs
    sun-dr stream tcp6 wait root /usr/lib/dcs dcs
    # grep sun-dr /etc/services
    sun-dr 665/tcp # Remote Dynamic Reconfiguration

In this situation it is likely that many dcs processes are running. This situation may be easily confirmed by:

Trussing the dcs process(es) should confirm that they are all waiting for an update.

For example:

    # ptree 432
    155 /usr/sbin/inetd -s
      432 dcs
        7122 dcs

    7122/1: 10.2993 open("/var/run/rcm_daemon_door", O_RDONLY) = 8
    7122/1: 55.7444 door_call(8, 0xFFBEE978) (sleeping...)

    # pfiles 7122
    7122: dcs
    [...]
    8: S_IFDOOR mode:0400 dev:305,0 ino:40644 uid:0 gid:1 size:0
       O_RDONLY door to rcm_daemon[7053]

This confirms that dcs is waiting for the rcm_daemon via the door between the two.

Getting a pstack output from the rcm_daemon should report the following stack:

    # pgrep rcm_daemon
    7053
    # pstack 7053
    7053: /usr/lib/rcm/rcm_daemon
    [...]
    ----------------- lwp# 5 / thread# 4 --------------------
    ff09f3d8 lwp_mutex_lock (ff29cd10)
    ff287698 fork1 (ff29c000, a, 35cc8, ff29d670, 534d, 1) + 50
    0001ac1c run_script (0, 35cc0, 0, 0, 2, 35ca0) + 154
    0001b4c4 do_cmd (30910, fea0b62c, 30910, fea0b62c, 0, 35ca0) + 34
    0001bf2c script_register_interest (35cd8, ffffffff, 0, 35ca0, 354c0, 0) + 98
    000173fc rcmd_db_sync (308e8, 35c68, ffffffff, 19598, 19fbc, 0) + 7c
    000195c0 rcmd_thr_incr (30e06, 89200, 6, fea0b798, 35f80, 0) + c4
    00012bd8 event_service (fea0bc50, fea0bc54, 0, fea0bc88, 0, 0) + f4
    ff2b40dc door_service (31f28, ff2c6000, b0, 31f28, 0, 4) + 64
    ff09c9ec _door_return (0, 38, e0000, 1, 11, 72636d2e) + 68
    [...]

This thread is blocked in the kernel, waiting for a lock. Killing the rcm_daemon should help. The next DR operation should complete.

On systems running rcm_daemon patch 116991-03 or later, rcm_daemon is linked with the alternate libthread.
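The last two checks in the chain above (which process a door leads to, and whether rcm_daemon has a thread stuck in fork1 on a mutex) can be applied to saved pfiles and pstack output. This is a minimal sketch under stated assumptions: the function names door_target and stuck_in_fork1 are hypothetical, both read captured command output on standard input, and the patterns are derived only from the sample output shown in this document.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original document). Both read text on
# stdin, so they work on output captured earlier from pfiles and pstack.

# door_target: for each "door to name[pid]" entry in pfiles output,
# print "name pid" on its own line.
door_target() {
    sed -n 's/.*door to \([^[]*\)\[\([0-9][0-9]*\)].*/\1 \2/p'
}

# stuck_in_fork1: exit 0 if the pstack output contains an lwp_mutex_lock
# frame immediately followed by a fork1 frame, as in the stack shown above;
# exit 1 otherwise.
stuck_in_fork1() {
    awk '/lwp_mutex_lock/ { prev = 1; next }
         prev && /fork1/  { found = 1 }
                          { prev = 0 }
         END { exit !found }'
}

# Example calls on a live system (PIDs taken from the document's example):
#   pfiles 7122 | door_target
#   pstack 7053 | stuck_in_fork1 && echo "rcm_daemon thread blocked in fork1"
```

Matching on adjacent frames keeps the check narrow: a stack that merely contains both function names somewhere is not reported, only the lwp_mutex_lock-under-fork1 pattern from the example.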
Relief/Workaround

Starting rcm_daemon with the alternate libthread can be done with a script:

    #!/bin/sh
    LD_LIBRARY_PATH=/usr/lib/lwp
    export LD_LIBRARY_PATH
    LD_LIBRARY_PATH_64=/usr/lib/lwp/64
    export LD_LIBRARY_PATH_64
    /usr/lib/rcm/rcm_daemon

or via the command line:

    # pkill -9 rcm_daemon
    # LD_LIBRARY_PATH=/usr/lib/lwp LD_LIBRARY_PATH_64=/usr/lib/lwp/64 /usr/lib/rcm/rcm_daemon

On Solaris[TM] 8 Operating System, to verify that the rcm_daemon is using the

For example:

    # pgrep rcm
    7204
    # pldd 7204
    7204: /usr/lib/rcm/rcm_daemon
    /usr/lib/libgen.so.1
    /usr/lib/libelf.so.1
    /usr/lib/libdl.so.1
    /usr/lib/libcmd.so.1
    /usr/lib/libdoor.so.1
    /usr/lib/librcm.so.1
    /usr/lib/lwp/libthread.so.1
    /usr/lib/libnvpair.so.1
    /usr/lib/libdevinfo.so.1
    /usr/lib/libnsl.so.1
    /usr/lib/libsocket.so.1
    /usr/lib/libc.so.1
    /usr/lib/libmp.so.2
    /usr/platform/sun4u-us3/lib/libc_psr.so.1
    /usr/lib/rcm/modules/SUNW_cluster_rcm.so
    /usr/lib/rcm/modules/SUNW_dump_rcm.so
    /usr/lib/rcm/modules/SUNW_filesys_rcm.so
    /usr/lib/rcm/modules/SUNW_ip_rcm.so
    /usr/lib/rcm/modules/SUNW_network_rcm.so
    /usr/lib/rcm/modules/SUNW_swap_rcm.so

For more details about the Alternate Libthread, see:

Additional Information

More technical details in:

* CR4825286 - RCM Daemon hanging causing DR operations to hang

In summary, this situation is due to a problem in the Solaris 8 OS libthread. Pending a fix for the original root cause, SunOS 5.8 rcm_daemon patch 116991-03 has been released; it links rcm_daemon with the alternate libthread.

Product

Sun Fire E25K Server
Sun Fire 15K Server

Internal Comments

The following is strictly for the use of Sun employees:
Also, see: Internal BugID 6234740 - libthread`_co_timerset() may attempt to

errCode:502, rcm, dcs, showdevices, rcm_daemon, DCA/DCS communication error

Previously Published As 80582

Change History

Date: 2006-01-19
User Name: 7058
Action: Update Canceled
Comment: *** Restored Published Content *** SSH AUDIT
Version: 0
Date: 2006-01-19

Attachments

This solution has no attachment