Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure
Solution 1003332.1: Sun Fire[TM] Midframe/Midrange Servers: CPU/Memory Board Dynamic Reconfiguration (DR) Considerations
Previously Published As 204624
Applies to:
Sun Fire V1280 Server
Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
Sun Fire 6800 Server
All Platforms

Goal
This document provides guidance for using dynamic reconfiguration (DR) on CPU/Memory boards in Sun Fire[TM] Midframe/Midrange Server systems. Using DR for configuration changes and for servicing CPU/Memory boards increases overall application uptime, because the Solaris[TM] Operating Environment (OE) and applications keep running while DR tasks are performed.

Solution
TERMS: When describing DR operations, this document uses terms as defined by the Solaris[TM] OE DR command cfgadm.
Minimum Sun Fire Mid-Range Server Platform Requirements:
All loaded third-party device drivers must comply with the device driver specifications (see the attached "Writing Device Drivers" article). Refer to the third-party reference documentation or check the following web page for Oracle/Sun certified third-party cards.

CONFIGURING A CPU/MEMORY BOARD INTO A RUNNING DOMAIN:
High level overview of Procedure:
STEP#1. Match the firmware level of the Domain and the CPU/Memory board.
STEP#2. Verify the DR status of the CPU/Memory board.
STEP#3. Configure the CPU/Memory board into the running Domain.

Detailed Procedure:
STEP#1. The firmware level of the CPU/Memory board and of the running Domain must match. Verify the firmware level with the showboards -p prom command on the SSC:
sunfire-sc0:SC> showboards -p prom
Component Compatible Version

STEP#2. Use the Solaris OE cfgadm command to get the status of the DR components (attachment points). An available CPU/Memory board shows a disconnected/unconfigured/unknown status, as SB2 does in the CLI example below.
# cfgadm
Ap_Id Type Receptacle Occupant Condition
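
The status check in STEP#2 can be scripted. What follows is a minimal sketch, assuming a POSIX shell and awk on the Domain. The function name available_boards is hypothetical and not part of Solaris. It reads cfgadm output on standard input and prints the Ap_Id of every system board whose status is disconnected/unconfigured/unknown.

```shell
# available_boards: hypothetical helper, not a Solaris command.
# Reads `cfgadm` output on stdin and prints the Ap_Id of each system
# board (SBx) that is disconnected/unconfigured/unknown, i.e. available.
available_boards() {
    awk '$1 ~ /SB[0-9]/ &&
         $3 == "disconnected" && $4 == "unconfigured" && $5 == "unknown" {
             print $1
         }'
}
```

Typical use would be "cfgadm | available_boards". An empty result means no board is currently available for a configure operation.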
STEP#3. Use the cfgadm command to configure the CPU/Memory board into the Domain. It is recommended to issue the command in a network session to the Domain and to use the Domain Console session (the connection to the Domain via the SSC) to monitor the POST output. The -o platform=diag= option specifies the diag level used to test the CPU/Memory board before it is configured into the Domain. The recommended level is default. When a CPU/Memory board is configured, the operation is logged in /var/adm/messages.
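
The configure step can be sketched as a small helper that only prints the command line for review. This is an assumption-laden sketch: the function name configure_cmd is hypothetical, and the option order follows general cfgadm syntax rather than a command reproduced from this document. N0.SB2 is the example board from STEP#2.

```shell
# configure_cmd: hypothetical helper that only PRINTS the command line.
# $1 = attachment point of the board, e.g. N0.SB2
# $2 = diag level for -o platform=diag= (falls back to "default")
configure_cmd() {
    echo "cfgadm -c configure -o platform=diag=${2:-default} $1"
}
```

Calling "configure_cmd N0.SB2" prints "cfgadm -c configure -o platform=diag=default N0.SB2". Printing instead of executing leaves the decision to run the command, from a network session, with the operator.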
UNCONFIGURING A CPU/MEMORY BOARD IN A RUNNING DOMAIN:
High level overview of Procedure:
STEP#1. Check whether the CPU/Memory board contains permanent memory.
STEP#2. Check the DR requirements for CPU/Memory boards.
STEP#3. Disconnect the CPU/Memory board from the running Domain.

Detailed Procedure:
STEP#1. The requirements for disconnecting a CPU/Memory board differ depending on whether the board contains permanent memory. Kernel memory and OBP are referred to as permanent memory. Use the cfgadm command to verify the location of permanent memory.
# cfgadm -av | grep permanent
N0.SB0::memory connected configured ok base address 0x0, 8388608 KBytes total,
With the addition of the split kernel cage, multiple boards in the configuration may contain kernel memory. More than one board in the domain may therefore require the Solaris[TM] OS to suspend so that its kernel memory can be relocated when the board is DR detached. Please refer to STEP#2.

The system automatically checks whether the requirements are fulfilled and aborts the DR operation if they are not. The requirements can be checked before performing the DR operation, or after an aborted DR operation for verification.

Requirements to perform a disconnect operation, whether or not the board contains permanent memory:
APPLICATION RESOURCE REQUIREMENTS: Some applications require a minimum number of CPUs, a minimum amount of memory, and so on. Verify this before disconnecting. It is recommended to disconnect a CPU/Memory board when the system load is low, which keeps the disconnect time short. If the system is running an Oracle database, ISM can cause significant delays in board removal.
MEMORY INTERLEAVING: To disconnect a CPU/Memory board, interleaving must be set to within-board. The interleaving setting is verified with the showdomain command.
sunfire-sc0:A> showdomain -p bootparams
diag-level = quick
To connect a CPU/Memory board, memory interleaving must be set to 'within-board' as well. Otherwise, the DR operation fails with an error message (firmware revision 5.12.5 and above).
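
The interleaving check can be sketched as a filter over saved showdomain output. Because showdomain runs on the SSC rather than in Solaris, this sketch assumes the output has been captured to a file first. The function name interleave_scope is hypothetical, and the parameter name interleave-scope is an assumption, since the excerpt above only shows the diag-level line.

```shell
# interleave_scope: hypothetical helper, run on any host with awk.
# Reads saved `showdomain -p bootparams` output on stdin and prints the
# value of the interleave parameter. The name "interleave-scope" is an
# assumption; adjust it to match the actual showdomain output.
interleave_scope() {
    awk '$1 == "interleave-scope" && $2 == "=" { print $3 }'
}
```

With the output saved in a file (bootparams.txt is a hypothetical name), "interleave_scope < bootparams.txt" should print within-board before any DR operation on a CPU/Memory board is attempted.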
BOUND PROCESSES: No processes may be bound to a CPU on the CPU/Memory board that is to be disconnected. To check for processes bound to a CPU, use the pbind command.
# pbind
process id 181: 0
In this example, process id 181 is bound to CPU 0. If CPU 0 is on the CPU/Memory board to be disconnected, the process must be unbound or bound to a different CPU. This can be done with pbind as well.
# pbind -q                      Displays the bindings of lwp's (lightweight processes).
# pbind -u pid ...              Unbinds the process so that it can run on an alternate processor.
# pbind -b processor_id pid ... Binds all lwp's of the process to the specified processor.
If the requirements are fulfilled, proceed to STEP#3.

If the CPU/Memory board contains permanent memory, additional requirements have to be met. The system automatically checks for these conditions and aborts the DR operation if they are not met.

BOARD: When a CPU/Memory board with permanent memory is disconnected, the DR process copies its memory to a different CPU/Memory board. This requires a second CPU/Memory board in the Domain with the same amount of memory or more. You can verify the amount of physical memory on a CPU/Memory board with either prtdiag in the Solaris[TM] OE or the showcomponent command.

REAL TIME (RT) PROCESSES: During the disconnect operation of a CPU/Memory board with permanent memory, the Solaris OE is quiesced. RT processes do not tolerate this. Either the RT processes are stopped before the DR operation, or the DR operation cannot be performed. To check whether RT processes are running, use the ps command.
# ps -efc | grep RT
UID PID PPID CLS PRI STIME TTY TIME CMD
root 5639 5230 TS 48 19:45:30 pts/06 0:00 grep RT
Real Time processes can be identified by the RT tag in the CLS column. In the above example, the midaemon with PID 367 is running in the RT class. You can either change the scheduling class from RT to TS or kill the offending process(es).
However, you MUST first verify with the application vendor and/or through testing that changing the scheduling class of a process, or killing it, will not adversely affect the application's operation during the DR operation. Ideally this is done during DR certification. If you kill a process, you need to restart it manually after the DR operation completes. Changing the scheduling class of a Real Time (RT) process, or killing it, is necessary during a copy/rename operation (i.e. moving permanent memory from one system board to another).

Perform the following steps to change the scheduling class of the RT process to Timeshare (TS), perform the DR operation, and then change the process's scheduling class back to RT.
(A). Identify the RT process(es). (NOTE: The PID may already have been identified in the DR operation.)
(B). For each RT process, change the scheduling class of that process from RT to TS.
(C). Perform the DR operation after performing the additional checks listed below.
(D). After completing DR (STEP#3 below), change the process scheduling class back to RT for each process that was changed in step (B) above.
(E). Verify that the processes are running in RT again.
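
Steps (A) through (E) can be sketched with two small helpers. This is a minimal sketch, assuming a POSIX shell and awk. The function names rt_pids and class_cmds are hypothetical, and the use of priocntl with -s -c class -i pid is an assumption about how the class change is made, not a command reproduced from this document. The helpers print commands instead of running them, and they do not record or restore the original real-time priority of a process.

```shell
# rt_pids: reads `ps -efc` output on stdin and prints the PID of every
# process whose CLS column (4th field) is RT.
rt_pids() {
    awk '$4 == "RT" { print $2 }'
}

# class_cmds: reads PIDs on stdin and PRINTS one priocntl command per PID
# that would move it to the class given in $1 (TS or RT). Nothing is run.
class_cmds() {
    while read pid; do
        echo "priocntl -s -c $1 -i pid $pid"
    done
}
```

One possible sequence: save the list with "ps -efc | rt_pids > /tmp/rt.pids" (the file name is hypothetical) for step (A), review the output of "class_cmds TS < /tmp/rt.pids" for step (B), review "class_cmds RT < /tmp/rt.pids" for step (D), and rerun "ps -efc | rt_pids" for step (E). Matching on the CLS column avoids the false hit on the grep process itself that "grep RT" produces.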
To check for ISM (Intimate Shared Memory), use the following command:
# ipcs -im
If ISM is present, check for the patches listed below; without them, DR can take 8 hours or more to complete. This is because pending read/write queues are serviced before the DR threads. The patches below fix this issue by moving the DR threads to the top of the queue:
Solaris 8 needs 108528-29, 117000-05, or 117350-05 or later.
Solaris 9 needs 112233-12 or 117171-08 or later.

CLUSTER 3.X: Sun[TM] Cluster 3.X does not tolerate the quiesce of the Solaris OE during the disconnect operation of a CPU/Memory board with permanent memory. If the Domain is an active Node within a Cluster, it has to be taken out of the Cluster before the DR operation; otherwise the DR operation cannot be performed.

SUN STOREDGE[TM] TRAFFIC MANAGER (STMS): If STMS (also known as MPXIO) is configured on the Domain, the disconnect operation of a CPU/Memory board with permanent memory may hang the Domain. With Solaris[TM] 8 KU-15 (and higher) and Solaris[TM] 9 KU-01 (and higher), the system automatically checks for this condition and aborts the DR operation. Use the modinfo command to check whether MPXIO is loaded.
# modinfo | grep mpxio
53 1025ec49 537c 1 mpxio (MDI Library v20040825-1.23)
# modinfo | grep scsi_vhci
Caution: If BOTH drivers are loaded, DO NOT use DR to disconnect CPU/Memory boards with permanent memory. For details on solutions and workarounds, see Bug ID 4618861.

STEP#3. If the keyswitch is set to the secure position, the board will be removed but not powered off. Check the keyswitch position from the domain shell.
If the keyswitch is set to secure, change the position to 'on'.
If the keyswitch is left in the secure position, the board will be removed from the domain but will have to be powered off manually, and you will observe the following error message:
# cfgadm -c disconnect N0.SB0
cfgadm: Hardware specific failure: poweroff N0.SB0: Internal error
If all the requirements are checked and satisfied, initiate the disconnect operation with cfgadm -c disconnect, as in the example above.
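
Before issuing the disconnect, the Solaris-side checks from STEP#1 and STEP#2 can be gathered into a few filters. This is a minimal sketch, assuming a POSIX shell with sed and awk. The function names permanent_boards, bound_pids and stms_active are hypothetical, and the output formats they parse are taken from the examples in this document.

```shell
# permanent_boards: reads `cfgadm -av` output on stdin and prints the
# board part of each Ap_Id whose line mentions permanent memory.
permanent_boards() {
    sed -n '/permanent/s/::.*//p'
}

# bound_pids: reads `pbind` output ("process id 181: 0") on stdin and
# prints the PIDs bound to the processor id given in $1.
bound_pids() {
    sed -n "s/^process id \([0-9][0-9]*\): $1\$/\1/p"
}

# stms_active: reads `modinfo` output on stdin; succeeds (exit 0) only
# if BOTH the mpxio and scsi_vhci modules appear in it.
stms_active() {
    awk '{ for (i = 1; i <= NF; i++) {
               if ($i == "mpxio") m = 1
               if ($i == "scsi_vhci") v = 1
           } }
         END { exit !(m && v) }'
}
```

For example, "cfgadm -av | permanent_boards" lists the boards that hold permanent memory, "pbind | bound_pids 0" lists processes bound to CPU 0, and "modinfo | stms_active" succeeds only in the case covered by the caution above. The system still performs its own checks and aborts the DR operation if a requirement is not met, so these filters only help to find the cause beforehand.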
When a CPU/Memory board is disconnected, the following messages are logged in /var/adm/messages:
Dec 9 09:12:43 domA genunix: /ssm@0,0/memory-controller@3,400000 (mc-us36) offline

Internal Comments
This is a living document. As features and requirements change, every attempt will be made to keep this document current. If an oversight or discrepancy is noted while using its content, contact the submitter.
Internally the following URLs are most useful; please review:
http://panacea.uk.oracle.com/twiki/bin/view/Products/ProdInfoSunFirex8x0
http://panacea.uk.oracle.com/twiki/bin/view/Projects/ProjectHomeDR

Previously Published As 49202

Attachments
This solution has no attachment