Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution

1311407.1: Sun Enterprise[TM] 10000: Dynamic Reconfiguration (DR) Troubleshooting
Applies to:
Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.

Purpose
This document provides troubleshooting steps for common DR issues.

Last Review Date
April 6, 2011

Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details

DR Attach: Unable to acquire ssp lock for hardware register update

Symptom: A DR attach operation fails in init_attach.

Resolution: Part of the attach operation is to update the Domain Mask in the CIC ASICs for the receiving domain with the new board. A lock is needed to perform this operation. Other items that can be holding the lock:
DR Drain: Debugging

1. Check/set the value of dr_mem_debug using adb. Debugging output is sent to the console.

2. Capture the console output from the failing drain session. Pages that cannot be drained will be displayed on the console.

3. In adb, dump the page structure for a failing page.

If the value of p_selock is 1, this may indicate swap is involved. DR must first bring the page back into memory before it can be relocated. However, if the system is experiencing memory pressure, there may not be enough memory available to complete the page-in - other memory must be freed before the drain can continue. One possibility is to execute lockfs -f on the file system to flush the UFS cache to disk and free some pages. Adding more swap space is another possibility. Killing non-essential processes can also be done. If the value of p_selock is an address, go to Step 5.

4. If a value is in the p_vnode field, dump the vnode structure.

The vop field indicates which virtual operation is in progress.

5. If p_selock is an address, it must first be adjusted to get the thread address by subtracting 8 from the high-order digit (that is, subtracting 0x80000000). For example, if the field is c01a5e80, in adb:

c01a5e80-80000000=X
                401a5e80        <--- thread address

6. Using this address, dump the thread structure.

(output omitted)
0x401a5f24:     lwp             procp           next            prev
                0               1045d418        401a9e80        401a1e80

7. Using the address in the procp field, the process holding the lock can be determined (a condensed example session is sketched after these steps).

(output omitted)
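The shell sketch below condenses steps 1 through 7 into echo-to-adb one-liners. It is an illustration only: the page address (10e49d40) is a placeholder for an address reported on the console by the failing drain, the p_selock and procp values are the ones from the worked example above, and the page, vnode, thread and proc adb macros are assumed to be installed under /usr/lib/adb on the domain.

# Step 1: enable drain debugging output (messages go to the console).
echo 'dr_mem_debug/W -1' | adb -kw /dev/ksyms /dev/mem

# Step 3: dump the page structure for a page the drain reported (placeholder
# address); inspect the p_selock and p_vnode fields in the output.
echo '10e49d40$<page' | adb -k /dev/ksyms /dev/mem

# Step 5: convert a p_selock value of c01a5e80 into a thread address.
echo 'c01a5e80-80000000=X' | adb -k /dev/ksyms /dev/mem

# Steps 6 and 7: dump the thread, then the process named in its procp field.
echo '401a5e80$<thread' | adb -k /dev/ksyms /dev/mem
echo '1045d418$<proc'   | adb -k /dev/ksyms /dev/mem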
DR Detach Hangs/Failures

Detaching board... ioctl failed

Symptom: A DR detach operation fails on I/O devices.

Resolution:

* If using AP 2.0, alternate paths must be failed over manually prior to a detach operation.
* If using Solaris 7, a detach will fail if the dump device is configured on the board to be detached. Use dumpadm -d to relocate the dump device.
* Layered opens may persist on the device. Veritas Volume Manager and other applications do this. It is possible to disable checking for layered opens. First make sure all I/O drivers (sd, isp, socal, sf, etc.) are up to date. Then, in the /etc/system file, add the following lines:

* Layered Open Checking
set sd:sd_check_layered_open=0
* Negative Layered Open Count
set sd:sd_neg_layered_open=0

This will not work if the sd patch is not up to date. It is also possible to set these values live using adb, as sketched below. A script is available here to do this. It is unsupported.
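As a sketch of the two workarounds above: dumpadm(1M) relocates the dump device away from the board being detached, and adb can apply the layered-open settings to the running kernel without a reboot. The dump device path is a placeholder, and writing sd_check_layered_open and sd_neg_layered_open by name assumes the patched sd driver exports them as writable kernel symbols; verify this on the target system first.

# Relocate the dump device (placeholder slice); pick one that is not behind
# the board being detached.
dumpadm -d /dev/dsk/c0t1d0s1

# Live equivalent of the /etc/system entries above; effective until the next
# reboot, and only meaningful with an up-to-date sd patch.
echo 'sd_check_layered_open/W 0' | adb -kw /dev/ksyms /dev/mem
echo 'sd_neg_layered_open/W 0'   | adb -kw /dev/ksyms /dev/mem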
Locked pages owing to heavy I/O or System Load

Symptom: The drain portion of a detach operation bails out with the above message. However, even with dr_mem_debug set to -1, no hold_pfns messages are reported to the console (see DR Drain: Debugging above).

Resolution: Since there are no "hold_pfns" messages displayed in the output, it can be safely assumed that the drain did not abort due to pages locked and held by the kernel or an application. There are a few known circumstances where this results. Any of these conditions will prevent drain completion. Swap space can be addressed more easily (see the sketch below); however, it may be necessary to wait until system load is reduced before the drain will succeed.
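If the shortfall is swap rather than load, swap can be added on the fly. The file name and size below are placeholders; any local file system with enough free space will do.

# Create and add a swap file (placeholder path and size), then confirm it.
mkfile 512m /export/swapfile
swap -a /export/swapfile
swap -l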
DR: General Issues

I'm seeing GROW wakeups and SHRNK wakeups messages, with references to DR. What are these?

Symptom: Messages similar to the following are seen:

Dec 29 09:57:37 oak unix: GROW wakeups:

Resolution: When the caged kernel is enabled (see releases for OS specifics), the cage resizes itself according to the demands of the system. If more kernel pages are needed, the cage grows. If more user pages are needed, the cage shrinks. The messages above are reporting resizing activity. A DR operation (attach/detach/drain) need not be active for the kernel cage to resize. These messages are only logged when the kernel flag dr_mem_debug is set to -1. This can be done either in /etc/system or via adb. In practice, the dr_mem_debug flag only needs to be set when debugging a drain problem. To turn off the flag and suppress these messages, perform the following steps as root:

# adb -kw
physmem 5b291
dr_mem_debug/W 0
dr_mem_debug:   0x0     =       0x0
$q

dr_mem_debug is only valid in Solaris 2.5.1 and Solaris 2.6.
The values of dr-max-mem from eeprom and the kernel do not agree - why?

Symptom: eeprom reports a different value for dr-max-mem than the running kernel.

Resolution: Such a situation is possible if the value of dr-max-mem has been changed using eeprom, but the domain has yet to be rebooted. For example, a system is booted with DR enabled. While the domain is up and running, eeprom dr-max-mem=0 is executed. This updates the eeprom files, but the change will not be reflected in the kernel until the next reboot.
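To confirm the situation described above, read the stored value with eeprom(1M) and remember that the running kernel keeps whatever value it read at boot. A sketch, reusing the dr-max-mem=0 example from the resolution:

# Display the value currently stored for the domain.
eeprom dr-max-mem

# Changing it (the example from above) only updates the stored copy; the
# running kernel retains the boot-time value until the domain is rebooted.
eeprom dr-max-mem=0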
What is required to enable/disable DR for each OS release?

Solaris 8
Solaris 7
Solaris 2.6
Solaris 2.5.1
Attachments
This solution has no attachment