Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution

1311407.1: Sun Enterprise[TM] 10000: Dynamic Reconfiguration (DR) Troubleshooting
Applies to:
Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.

Purpose
This document provides troubleshooting steps for common DR issues.

Last Review Date
April 6, 2011

Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details

DR Attach: Unable to acquire ssp lock for hardware register update

Symptom: A DR attach operation fails in init_attach.

Resolution: Part of the attach operation is to update the Domain Mask in the CIC ASICs for the receiving domain with the new board. A lock is needed to perform this operation. Other items that can be holding the lock:
DR Drain: Debugging

1. Check/set the value of dr_mem_debug using adb. Debugging output is sent to the console.

2. Capture the console output from the failing drain session. Pages that cannot be drained will be displayed on the console.

3. In adb, dump the page structure for a failing page.

If the value of p_selock is 1, this may indicate swap is involved. DR must first bring the page back into memory before it can be relocated. However, if the system is experiencing memory pressure, there may not be enough memory available to complete the page-in - other memory must be freed before the drain can continue. One possibility is to execute lockfs -f on the file system to flush the UFS cache to disk and free some pages. Adding more swap space is another possibility. Killing non-essential processes can also be done. If the value of p_selock is an address, go to Step 5.

4. If a value is in the p_vnode field, dump the vnode structure.

The vop field indicates which virtual operation is in progress.

5. If p_selock is an address, it must first be adjusted to get the thread address by subtracting 8 from the high-order digit (that is, subtracting 0x80000000). For example, if the field is c01a5e80, in adb:

c01a5e80-80000000=X
                401a5e80        <--- thread address

6. Using this address, dump the thread structure.

(output omitted)
0x401a5f24:     lwp             procp           next            prev
                0               1045d418        401a9e80        401a1e80

7. Using the address in the procp field, the process holding the lock can be determined (a condensed example session is sketched after these steps).

(output omitted)
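The shell sketch below condenses steps 1 through 7 into echo-to-adb one-liners. It is an illustration only: the page address (10e49d40) is a placeholder for an address reported on the console by the failing drain, the p_selock and procp values are the ones from the worked example above, and the page, vnode, thread and proc adb macros are assumed to be installed under /usr/lib/adb on the domain.

# Step 1: enable drain debugging output (messages go to the console).
echo 'dr_mem_debug/W -1' | adb -kw /dev/ksyms /dev/mem

# Step 3: dump the page structure for a page the drain reported (placeholder
# address); inspect the p_selock and p_vnode fields in the output.
echo '10e49d40$<page' | adb -k /dev/ksyms /dev/mem

# Step 5: convert a p_selock value of c01a5e80 into a thread address.
echo 'c01a5e80-80000000=X' | adb -k /dev/ksyms /dev/mem

# Steps 6 and 7: dump the thread, then the process named in its procp field.
echo '401a5e80$<thread' | adb -k /dev/ksyms /dev/mem
echo '1045d418$<proc'   | adb -k /dev/ksyms /dev/mem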
DR Detach Hangs/Failures

Detaching board... ioctl failed

Symptom: A DR detach operation fails on I/O devices.

Resolution:

* If using AP 2.0, alternate paths must be failed over manually prior to a detach operation.
* If using Solaris 7, a detach will fail if the dump device is configured on the board to be detached. Use dumpadm -d to relocate the dump device.
* Layered opens may persist on the device. Veritas Volume Manager and other applications do this. It is possible to disable checking for layered opens. First make sure all I/O drivers (sd, isp, socal, sf, etc.) are up to date. Then, in the /etc/system file, add the following lines:

* Layered Open Checking
set sd:sd_check_layered_open=0
* Negative Layered Open Count
set sd:sd_neg_layered_open=0

This will not work if the sd patch is not up to date. It is also possible to set these values live using adb, as sketched below. A script is available here to do this. It is unsupported.
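As a sketch of the two workarounds above: dumpadm(1M) relocates the dump device away from the board being detached, and adb can apply the layered-open settings to the running kernel without a reboot. The dump device path is a placeholder, and writing sd_check_layered_open and sd_neg_layered_open by name assumes the patched sd driver exports them as writable kernel symbols; verify this on the target system first.

# Relocate the dump device (placeholder slice); pick one that is not behind
# the board being detached.
dumpadm -d /dev/dsk/c0t1d0s1

# Live equivalent of the /etc/system entries above; effective until the next
# reboot, and only meaningful with an up-to-date sd patch.
echo 'sd_check_layered_open/W 0' | adb -kw /dev/ksyms /dev/mem
echo 'sd_neg_layered_open/W 0'   | adb -kw /dev/ksyms /dev/mem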
Locked pages owing to heavy I/O or System Load

Symptom: The drain portion of a detach operation bails out with the above message. However, even with dr_mem_debug set to -1, no hold_pfns messages are reported to the console (see DR Drain: Debugging above).

Resolution: Since there are no "hold_pfns" messages displayed in the output, it can be safely assumed that the drain did not abort due to pages locked and held by the kernel or an application. There are a few known circumstances where this results. Any of these conditions will prevent drain completion. Swap space can be addressed more easily (see the sketch below); however, it may be necessary to wait until system load is reduced before the drain will succeed.
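If the shortfall is swap rather than load, swap can be added on the fly. The file name and size below are placeholders; any local file system with enough free space will do.

# Create and add a swap file (placeholder path and size), then confirm it.
mkfile 512m /export/swapfile
swap -a /export/swapfile
swap -l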
DR: General Issues

I'm seeing GROW wakeups and SHRNK wakeups messages, with references to DR. What are these?

Symptom: Messages similar to the following are seen:

Dec 29 09:57:37 oak unix: GROW wakeups:

Resolution: When the caged kernel is enabled (see releases for OS specifics), the cage resizes itself according to the demands of the system. If more kernel pages are needed, the cage grows. If more user pages are needed, the cage shrinks. The messages above are reporting resizing activity. A DR operation (attach/detach/drain) need not be active for the kernel cage to resize. These messages are only logged when the kernel flag dr_mem_debug is set to -1. This can be done either in /etc/system or via adb. In practice, the dr_mem_debug flag only needs to be set when debugging a drain problem. To turn off the flag and suppress these messages, perform the following steps as root:

# adb -kw
physmem 5b291
dr_mem_debug/W 0
dr_mem_debug:   0x0     =       0x0
$q

dr_mem_debug is only valid in Solaris 2.5.1 and Solaris 2.6.
The values of dr-max-mem from eeprom and the kernel do not agree - why?

Symptom: eeprom reports a different value for dr-max-mem than the running kernel.

Resolution: Such a situation is possible if the value of dr-max-mem has been changed using eeprom, but the domain has yet to be rebooted. For example, a system is booted with DR enabled. While the domain is up and running, eeprom dr-max-mem=0 is executed. This updates the eeprom files, but the change will not be reflected in the kernel until the next reboot.
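To confirm the situation described above, read the stored value with eeprom(1M) and remember that the running kernel keeps whatever value it read at boot. A sketch, reusing the dr-max-mem=0 example from the resolution:

# Display the value currently stored for the domain.
eeprom dr-max-mem

# Changing it (the example from above) only updates the stored copy; the
# running kernel retains the boot-time value until the domain is rebooted.
eeprom dr-max-mem=0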
What is required to enable/disable DR for each OS release?

Solaris 8
Solaris 7
Solaris 2.6
Solaris 2.5.1
Attachments
This solution has no attachment