Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1002125.1
Update Date:2012-07-30

Solution Type  Problem Resolution Sure

Solution  1002125.1 :   Sun Fire[TM] 15K/12K/E20K/E25K Servers: RCM Daemon hanging, causing DR operations to hang  


Related Items
  • Sun Fire E25K Server
  • Sun Fire 15K Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-Exxk
  • .Old GCS Categories>Sun Microsystems>Servers>High-End Servers

Previously Published As
203025


Applies to:

Sun Fire 15K Server
Sun Fire E25K Server
All Platforms

Symptoms

Remote Dynamic Reconfiguration (DR) operations (run from the System Controller) and local DR operations (run from the domain) work normally until one DR operation stops responding. Messages like the following are then reported in the $SMSLOGGER/domain_Id/messages file:
DCA/DCS communication error
and/or
dca[...]-S(): [... ERR DCSInterface.cc 378] message receive failed: DCSInterface :: receiveResponse errCode:502
In some cases, it may not be possible to kill the associated process (cfgadm, rcfgadm, deleteboard, showdevices).
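
As a quick first check, the two log signatures quoted above can be counted in a domain messages file. The following is a minimal sketch, not an SMS tool: the function name count_dr_hang_signatures is hypothetical, and it assumes a POSIX shell and awk (on Solaris 8, nawk or /usr/xpg4/bin/awk may be needed instead of /usr/bin/awk).

```shell
#!/bin/sh
# Sketch only: count_dr_hang_signatures is a hypothetical helper name.

# count_dr_hang_signatures FILE
# Prints the number of lines in FILE that contain either of the two
# signatures quoted in the Symptoms section (fixed-string match).
count_dr_hang_signatures() {
    awk 'index($0, "DCA/DCS communication error") || index($0, "errCode:502") { n++ }
         END { print n + 0 }' "$1"
}
```

It could be called as: count_dr_hang_signatures "$SMSLOGGER/domain_Id/messages", with domain_Id replaced by the domain in question. A non-zero count only shows that the messages were logged; the truss, pfiles and pstack checks described in this article are still needed to confirm the cause.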

Note: for background information on the DR mechanism, please refer to Document 1003582.1 (What Happens in a Sun Fire[TM] 15K/12K DR Slot0 Detach Operation).

Note: if none of the symptoms described here applies, and the rcm_daemon stack does not match the signature shown in this article, you are probably facing a different issue. See the References section below for more troubleshooting steps.

Cause

This is related to the libthread issue described in Document 1000512.1 (Applications Linked to "libthread" may Hang or Terminate Abnormally During Initialization - Solaris Bug 4730459). Note that this affects Solaris 8 (and earlier releases) only.

NOTE: Solaris 8 is EOL (see the Oracle Lifetime Support Policy for details), so the first recommendation when facing this issue is to consider upgrading to a current Solaris release.


Issue details

In the case of a remote DR operation (run from the SC), trussing the command confirms that it is waiting for a Domain Configuration Agent (dca), which in turn is waiting for a Domain Configuration Server (dcs).

For example, a showdevices command waits for an update from the dca process, via the door to dca and the pipe (scdrN) to dca.

A truss shows:
18541/4:    0.3145 creat("/var/opt/SUNWSMS/SMS1.4.1/pipes/C/scdr0", 0666) = 8
18541/4: 0.3149 pipe() = 8 [9]
[...]
18541/4: 1.5339 ioctl(8, I_RECVFD, 0xFE77BF24) (sleeping...)
18541/4: fd=9 uid=11 gid=20

18541/4: 0.3517 open("/var/opt/SUNWSMS/SMS1.4.1/doors/H/dca", O_RDONLY) = 7
18541/4: door_call(7, 0x00048CD8) (sleeping...)
18541/4: door_call(7, 0x00048CD8) (sleeping...)

Then, the dca process is waiting for a dcs process:
29675/232:    13.3519 poll(0xFE3FBAF0, 1, 43200000)   (sleeping...)
29675/232: fd=12 ev=POLLIN rev=0

The nature of fd=12 can be determined by using the pfiles command:
# pfiles 29675
29675: dca -d C
[...]
12: S_IFSOCK mode:0666 dev:308,0 ino:30682 uid:0 gid:0 size:0
O_RDWR
sockname: AF_INET 10.2.1.1 port: 39601
peername: AF_INET 10.2.1.4 port: 665

fd=12 is the socket connection between DCA and DCS.

Note that dcs always uses the TCP port 665, as shown by the following:
# grep sun-dr /etc/inetd.conf
sun-dr stream tcp wait root /usr/lib/dcs dcs
sun-dr stream tcp6 wait root /usr/lib/dcs dcs
# grep sun-dr /etc/services
sun-dr 665/tcp # Remote Dynamic Reconfiguration
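
The check above (find the socket that dca is polling, then confirm its peer is the dcs port) can be sketched as a small filter over pfiles output. This is an illustration under assumptions: the function name peer_ports_from_pfiles is hypothetical, the pfiles layout is assumed to match the sample shown in this article, and a POSIX awk is assumed (nawk on Solaris 8).

```shell
#!/bin/sh
# Sketch only: peer_ports_from_pfiles is a hypothetical helper name.

# peer_ports_from_pfiles
# Reads pfiles(1) output on stdin and prints "fd peer_port" for every
# descriptor that is followed by a "peername:" line.
peer_ports_from_pfiles() {
    awk '
        # A descriptor entry starts with "<fd>: S_IF...".
        $1 ~ /^[0-9]+:$/ && $2 ~ /^S_IF/ { fd = $1; sub(/:$/, "", fd) }
        # A peername line ends with "port: <number>".
        $1 == "peername:" && fd != "" { print fd, $NF }
    '
}
```

For the sample above, running pfiles 29675 | peer_ports_from_pfiles would print "12 665": descriptor 12 is connected to peer port 665, the sun-dr port used by dcs.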

In this situation, it is likely that many dcs processes are running.
Most of them are stuck, waiting for the rcm_daemon.

This situation may be easily confirmed by:
  • trussing the dcs process(es),
  • getting a pstack(1) output from the rcm_daemon process.
Trussing the dcs process(es) should confirm that they are all waiting for an update from the Reconfiguration Coordination Manager daemon (rcm_daemon).

For example:
# ptree 432
155 /usr/sbin/inetd -s
432 dcs
7122 dcs

# truss -p 7122
...
7122/1: 10.2993 open("/var/run/rcm_daemon_door", O_RDONLY) = 8
7122/1: 55.7444 door_call(8, 0xFFBEE978) (sleeping...)

# pfiles 7122
7122: dcs
[...]
8: S_IFDOOR mode:0400 dev:305,0 ino:40644 uid:0 gid:1 size:0
O_RDONLY door to rcm_daemon[7053]

This confirms that dcs is waiting for the rcm_daemon via the door between the two processes.
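
When many dcs processes are present, the door check can be repeated for each of them. The sketch below extracts the rcm_daemon PID from the "door to rcm_daemon[PID]" annotation in pfiles output. The function name rcm_door_pid is hypothetical, and the annotation format is assumed to be exactly as in the sample above.

```shell
#!/bin/sh
# Sketch only: rcm_door_pid is a hypothetical helper name.

# rcm_door_pid
# Reads pfiles(1) output on stdin and prints the PID found in any
# "door to rcm_daemon[PID]" annotation (one per line, duplicates removed).
# Prints nothing if the process holds no door to rcm_daemon.
rcm_door_pid() {
    sed -n 's/.*door to rcm_daemon\[\([0-9][0-9]*\)\].*/\1/p' | sort -u
}
```

A possible loop over all dcs processes, assuming pgrep is available on the domain: for pid in `pgrep -x dcs`; do pfiles $pid | rcm_door_pid; done. If every dcs reports the same rcm_daemon PID, that daemon is the one to inspect with pstack.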

A pstack of the rcm_daemon should show the following stack:
# pgrep rcm_daemon
7053
# pstack 7053

7053: /usr/lib/rcm/rcm_daemon
[...]
----------------- lwp# 5 / thread# 4 --------------------
ff09f3d8 lwp_mutex_lock (ff29cd10)
ff287698 fork1 (ff29c000, a, 35cc8, ff29d670, 534d, 1) + 50
0001ac1c run_script (0, 35cc0, 0, 0, 2, 35ca0) + 154
0001b4c4 do_cmd (30910, fea0b62c, 30910, fea0b62c, 0, 35ca0) + 34
0001bf2c script_register_interest (35cd8, ffffffff, 0, 35ca0, 354c0, 0) + 98
000173fc rcmd_db_sync (308e8, 35c68, ffffffff, 19598, 19fbc, 0) + 7c
000195c0 rcmd_thr_incr (30e06, 89200, 6, fea0b798, 35f80, 0) + c4
00012bd8 event_service (fea0bc50, fea0bc54, 0, fea0bc88, 0, 0) + f4
ff2b40dc door_service (31f28, ff2c6000, b0, 31f28, 0, 4) + 64
ff09c9ec _door_return (0, 38, e0000, 1, 11, 72636d2e) + 68
[...]

This thread is blocked in the kernel (in lwp_mutex_lock, called from fork1), waiting for a lock.
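
The stack signature can also be checked mechanically. The sketch below reports every thread in a pstack output whose stack contains lwp_mutex_lock, fork1 and run_script, which are the frames that characterize the stack shown above. The function name hung_fork1_threads is hypothetical, and the choice of these three frames as the "signature" is an assumption based on the sample stack, not an official definition. A POSIX awk is assumed (nawk on Solaris 8).

```shell
#!/bin/sh
# Sketch only: hung_fork1_threads is a hypothetical helper name, and the
# three-frame signature is an assumption derived from the sample stack.

# hung_fork1_threads
# Reads pstack(1) output on stdin and prints the "lwp# N / thread# M"
# header line of each thread whose stack contains all of:
#   lwp_mutex_lock, fork1, run_script
hung_fork1_threads() {
    awk '
        # A new thread section starts at each header line.
        /lwp#/ {
            if (hdr != "" && has_lock && has_fork && has_script) print hdr
            hdr = $0
            has_lock = has_fork = has_script = 0
            next
        }
        # Stack frames: "<address> <function> (<args>) + <offset>".
        $2 == "lwp_mutex_lock" { has_lock = 1 }
        $2 == "fork1"          { has_fork = 1 }
        $2 == "run_script"     { has_script = 1 }
        END {
            if (hdr != "" && has_lock && has_fork && has_script) print hdr
        }
    '
}
```

Example use: pstack 7053 | hung_fork1_threads. No output means no thread matches, in which case the note in the Symptoms section applies and a different issue is more likely.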

Killing the rcm_daemon should help. The next DR operation should complete successfully but the same symptom might come back.


As a workaround, note that on systems running rcm_daemon patch 116991-03 or later, rcm_daemon is linked with an alternate libthread that is not affected by the issue.

Solution

The workaround (available once SunOS 5.8 rcm_daemon patch 116991-03, which links rcm_daemon with the alternate libthread, has been installed) can be applied with a script:
#!/bin/sh
# Start rcm_daemon with the alternate libthread found in /usr/lib/lwp
# (search paths for both the 32-bit and the 64-bit libraries).
LD_LIBRARY_PATH=/usr/lib/lwp
export LD_LIBRARY_PATH
LD_LIBRARY_PATH_64=/usr/lib/lwp/64
export LD_LIBRARY_PATH_64

/usr/lib/rcm/rcm_daemon

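or from the command line (the variable assignments and the daemon path must be on the same line, so that they are passed in the environment of rcm_daemon):

# pkill -9 rcm_daemon
# LD_LIBRARY_PATH=/usr/lib/lwp LD_LIBRARY_PATH_64=/usr/lib/lwp/64 /usr/lib/rcm/rcm_daemon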

On the Solaris[TM] 8 Operating System, to verify that the rcm_daemon is using the alternate libthread, run pldd(1) against the process. It should report "/usr/lib/lwp/libthread.so.1" instead of "/usr/lib/libthread.so.1".

For example:

# pgrep rcm
7204
# pldd 7204
7204:   /usr/lib/rcm/rcm_daemon
/usr/lib/libgen.so.1
/usr/lib/libelf.so.1
/usr/lib/libdl.so.1
/usr/lib/libcmd.so.1
/usr/lib/libdoor.so.1
/usr/lib/librcm.so.1
/usr/lib/lwp/libthread.so.1
/usr/lib/libnvpair.so.1
/usr/lib/libdevinfo.so.1
/usr/lib/libnsl.so.1
/usr/lib/libsocket.so.1
/usr/lib/libc.so.1
/usr/lib/libmp.so.2
/usr/platform/sun4u-us3/lib/libc_psr.so.1
/usr/lib/rcm/modules/SUNW_cluster_rcm.so
/usr/lib/rcm/modules/SUNW_dump_rcm.so
/usr/lib/rcm/modules/SUNW_filesys_rcm.so
/usr/lib/rcm/modules/SUNW_ip_rcm.so
/usr/lib/rcm/modules/SUNW_network_rcm.so
/usr/lib/rcm/modules/SUNW_swap_rcm.so
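
The pldd verification can be reduced to a one-word answer with a small filter. This is a minimal sketch: the function name libthread_in_use and its three output words are hypothetical, and only the two library paths named above are recognized. A POSIX awk is assumed (nawk on Solaris 8).

```shell
#!/bin/sh
# Sketch only: libthread_in_use and its output words are hypothetical.

# libthread_in_use
# Reads pldd(1) output on stdin and prints:
#   alternate  if /usr/lib/lwp/libthread.so.1 is loaded
#   default    if only /usr/lib/libthread.so.1 is loaded
#   none       if neither path is listed
libthread_in_use() {
    awk '
        $1 == "/usr/lib/lwp/libthread.so.1" { alt = 1 }
        $1 == "/usr/lib/libthread.so.1"     { std = 1 }
        END {
            if (alt)      print "alternate"
            else if (std) print "default"
            else          print "none"
        }
    '
}
```

Example use: pldd 7204 | libthread_in_use. For the listing above this prints "alternate"; "default" would mean the daemon was started without the LD_LIBRARY_PATH settings of the workaround.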

In summary, this situation is due to a problem in the Solaris 8 OS libthread (more details in Document 1000512.1: Applications Linking to libthread May Hang); the fix is Solaris 8 patch 108827-36 or later.


Product
Sun Fire E25K Server
Sun Fire 15K Server

Internal Section

References:
  • More technical details in 4825286: RCM Daemon hanging causing DR operations to hang
  • Problem Resolution Document 1008803.1: Sun Fire[TM] 12K/15K: showdevices can hang if sd.conf is large or misleading
  • Problem Resolution Document 1008805.1: Sun Fire[TM] 12K/15K/E20K/E25K: Remote Dynamic Reconfiguration (DR) generates "DCA/DCS Communication Error" and showdevices is 'Unable to get device information from domain'
  • Technical Instruction Document 1003582.1: What Happens in a Sun Fire[TM] 15K/12K DR Slot0 Detach Operation
  • Technical Instruction Document 1004922.1: Sun Fire[TM] servers: Trouble-shooting RCM failures events in DR operations
  • Problem Resolution Document 1009124.1: Sun Fire[TM] 12K/15K: showdevices takes a long time to return
  • Problem Resolution Document 1012320.1: Sun Fire[TM] 12K/15K/20K/25K: Domain reports "sun-dr/tcp: bind: Address already in use"
Also, see: Internal BugID 6234740 - libthread`_co_timerset() may attempt to acquire _calloutlock twice

Keywords: errCode:502, rcm, dcs, showdevices, rcm_daemon, DCA/DCS communication error

Previously Published As 80582



  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.