Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-71-1012392.1
Update Date:2012-07-31
Keywords:

Solution Type  Technical Instruction Sure

Solution  1012392.1 :   How to isolate a processor from a running system ?  


Related Items
  • Sun SPARC Enterprise M9000-64 Server
  • Sun Fire 4810 Server
  • Sun Fire E25K Server
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M8000 Server
  • Sun Fire 3800 Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
  • Sun Netra 1290 Server
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M5000 Server
  • Sun Fire 6800 Server
  • Sun Fire E6900 Server
  • Sun Fire 4800 Server
  • Sun Fire E20K Server
  • Sun Fire E2900 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-Exxk
  • .Old GCS Categories>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
217091


Applies to:

Sun Fire 12K Server - Version Not Applicable and later
Sun Fire 15K Server - Version Not Applicable and later
Sun Fire 3800 Server - Version Not Applicable and later
Sun Fire 4800 Server - Version Not Applicable and later
Sun Fire 4810 Server - Version Not Applicable and later
All Platforms

Goal

There are several ways to "remove" a processor from a running system, but these operations have different goals and different consequences.

Processor isolation is done by changing the operational status of a processor. To achieve that goal, a processor can be taken off-line or unconfigured.

This document presents the differences between psradm -f, psradm -i and cfgadm -c unconfigure for the UltraSparc II, UltraSparc (III, IV, IV+) and SPARC64 (VI, VII, VII+) processors, and so gives an overview of the different ways to isolate a cpu from a running system.

Using the appropriate status and the appropriate command is useful in many cases, such as troubleshooting and performance analysis. For example, a cpu can be offlined to see whether it plays any role in a transient hardware failure. Once a cpu is confirmed to have a hardware issue, it can be isolated using cfgadm. A cpu can also be dedicated to processing user level/system level threads only, and isolated from processing interrupts. Each command fits a different need.

Fix

How to isolate a processor from a running system ?

According to the manual pages :

The role of the psradm command is to change the operational status of a processor, for instance to the off-line or no-intr status.

The role of the cfgadm command is to dynamically reconfigure hardware resources, for instance to unconfigure a processor.

. An off-line processor does not process any LWPs. Usually, an off-line processor is not interruptible by I/O devices in the system. On some processors or under certain conditions, it may not be possible to disable interrupts for an off-line processor. Thus, the actual effect of being off-line may vary from machine to machine.

. A no-intr processor processes LWPs but is not interruptible by I/O devices.

. A component is unconfigured when it is not available for use by the Solaris Operating Environment.

The default status of a processor is on-line :
An on-line processor processes LWPs (lightweight processes) and may be interrupted by I/O devices in the system.
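
The definitions above can be summarised as a small lookup table. The following is only an illustrative sketch: the field names are hypothetical, the off-line row uses the usual case (device interrupts disabled, which the man page says may vary from machine to machine), and the unconfigured row assumes that such a processor is no longer listed by psrinfo, as the examples later in this document show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuState:
    runs_lwps: bool          # the scheduler may place LWPs on the processor
    device_interrupts: bool  # I/O devices may interrupt the processor
    known_to_solaris: bool   # the processor is still listed by psrinfo


# Values transcribed from the definitions above (hypothetical field names).
STATES = {
    "on-line":      CpuState(True,  True,  True),
    "no-intr":      CpuState(True,  False, True),
    "off-line":     CpuState(False, False, True),
    "unconfigured": CpuState(False, False, False),
}


def states_matching(**wanted):
    """Return the state names whose properties equal all the wanted values."""
    return [name for name, state in STATES.items()
            if all(getattr(state, key) == value for key, value in wanted.items())]


# Which status keeps a cpu scheduling LWPs while shielding it from devices?
print(states_matching(runs_lwps=True, device_interrupts=False))
```

Such a table makes it easy to pick a status from the properties one needs rather than from the command name.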

Let's see what this means on various CPUs.

UltraSparc II/III :

Offlining a processor :

Both USII and USIII processors can be taken off-line in the same way, and the consequences on the processor state are similar.

This is done with the psradm -f processor_id command.
Note that a processor may not be taken off-line if there are LWPs that are bound to the processor. At least one processor in the system must be able to process LWPs.

Example from a 4-processor SF15K domain :

Initial state :

# psrinfo
96      on-line   since 09/14/2004 22:18:00
97      on-line   since 09/14/2004 22:18:00
98      on-line   since 09/14/2004 22:18:00
99      on-line   since 09/14/2004 22:18:00
# psradm -f 97
# psrinfo
96      on-line   since 09/14/2004 22:18:00
97      off-line  since 09/24/2004 13:11:13
98      on-line   since 09/14/2004 22:18:00
99      on-line   since 09/14/2004 22:18:00
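
When many processors are involved, the psrinfo listing can be summarised with a few lines of script. This is a minimal sketch that assumes the "ID STATE since DATE TIME" line layout shown above; the sample text is copied from the transcript, and the function name is hypothetical.

```python
from collections import Counter

# Sample copied from the psrinfo output above.
SAMPLE = """\
96      on-line   since 09/14/2004 22:18:00
97      off-line  since 09/24/2004 13:11:13
98      on-line   since 09/14/2004 22:18:00
99      on-line   since 09/14/2004 22:18:00
"""


def parse_psrinfo(text):
    """Map processor id -> state for each 'ID STATE since DATE TIME' line."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[2] == "since":
            states[int(fields[0])] = fields[1]
    return states


states = parse_psrinfo(SAMPLE)
# Number of processors per state, then the ids that are not on-line.
print(Counter(states.values()))
print([cpu for cpu, state in states.items() if state != "on-line"])
```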

When a proc is off-line, it is excluded from scheduling.
The proc is reported as :

# echo "::cpuinfo -v" | mdb -k
ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
97 3002383e000  2f    0    0  -1   no    no t-1028515 2a100333d40 (idle)
                 |
      RUNNING <--+
        READY
     QUIESCED
       EXISTS
      OFFLINE

compared to an on-line proc :

ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
98 3002383a000  1b    0    0  -1   no    no t-37   2a10032bd40 (idle)
                 |
      RUNNING <--+
        READY
       EXISTS
       ENABLE

As reported in the previous output, the off-line processor is running the idle thread.

QUIESCED means that the processor stays in the idle loop: it spins in a tight loop, no longer processes any LWPs and does not handle device interrupts.

However, the proc remains in the cpu_ready_set, which means it still receives all xt_all() cross traps (incl. the E$ scrubber), xc_all() cross calls and softints.

Notes on the no-intr status :
As per the man page definition given above, a no-intr processor no longer handles device interrupts. It still handles cross calls/traps and softints though.
This operation is done with the 'psradm -i processor_id' command.

Example from a 4-processor SF15K domain :

# psradm -i 99
# psrinfo
96      on-line   since 09/14/2004 22:18:00
97      off-line  since 09/24/2004 13:11:13
98      on-line   since 09/29/2004 11:05:39
99      no-intr   since 09/29/2004 11:05:43
# echo "::cpuinfo -v" | mdb -k
ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
99 300238ce000   b    0    0   0   no    no t-0    30047c1c000 sleep
                 |
      RUNNING <--+
        READY
       EXISTS

compared to an on-line proc :

ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
98 3002383a000  1b    0    0  -1   no    no t-37   2a10032bd40 (idle)
                 |
      RUNNING <--+
        READY
       EXISTS
       ENABLE

A no-intr processor is still part of the scheduler; LWPs can be scheduled on the proc.
In the previous example, we can see that the processor is running a thread from the 'sleep' process.

In both cases, off-line and no-intr, only the status at the Solaris level has changed; these state changes are not reflected in OBP. The cpus also remain fully visible in Solaris, because off-line and no-intr are changes to the state of the cpu inside Solaris.

# prtconf -vp | grep "name:  'SUNW,UltraSPARC-III"
name:  'SUNW,UltraSPARC-III+'
name:  'SUNW,UltraSPARC-III+'
name:  'SUNW,UltraSPARC-III+'
name:  'SUNW,UltraSPARC-III+'

Note that offlining a processor may fail for several reasons. The psradm man page lists the error conditions and the reason for each of them.

At least one processor in the system must be able to process LWPs.

A processor may not be taken off-line if there are LWPs that are bound to the processor.

At least one processor must also be able to be interrupted.

Note that, although the memory controller resides on the processor in the USIII architecture, the associated memory is still accessible when a processor is off-line.

A processor can be unconfigured :

As per the definition, the unconfigure operation removes a resource from the system so that it can no longer be used by Solaris.
Note that unconfiguring a processor may not be successful on all the USII/III platforms.

Ex from a Sun Fire 4800 :

# cfgadm -c unconfigure N0.SB4::cpu3
cfgadm: Hardware specific failure: unconfigure N0.SB4::cpu3: Can't unconfig cpu if mem online: /ssm@0,0/memory-controller@13,400000

Ex from an Enterprise 10000 :

# cfgadm -c unconfigure SB9::cpu3
cfgadm: Hardware specific failure: unconfigure SB9::cpu3: Operation not supported

The unconfigure operation can be done via the cfgadm -c unconfigure Ap_Id command.

# cfgadm -c unconfigure SB3::cpu0
# cfgadm -alv -s "match=partial,select=type(cpu)"
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
SB3::cpu0                      connected    unconfigured ok         cpuid 96, speed 1200 MHz, ecache 8 MBytes
Sep 24 12:57 cpu          n        /devices/pseudo/dr@0:SB3::cpu0
SB3::cpu1                      connected    configured   ok         cpuid 97, speed 1200 MHz, ecache 8 MBytes
Sep 24 12:52 cpu          n        /devices/pseudo/dr@0:SB3::cpu1
SB3::cpu2                      connected    configured   ok         cpuid 98, speed 1200 MHz, ecache 8 MBytes
Sep 24 12:52 cpu          n        /devices/pseudo/dr@0:SB3::cpu2
SB3::cpu3                      connected    configured   ok         cpuid 99, speed 1200 MHz, ecache 8 MBytes
Sep 24 12:52 cpu          n        /devices/pseudo/dr@0:SB3::cpu3

When you unconfigure a cpu, the cpu is removed from the scope of the Solaris kernel and no longer takes part in scheduling or interrupt processing. The Solaris device tree no longer contains this cpu (resource).
Example from a 4-processor domain in which 1 processor is unconfigured :

# psrinfo
97      off-line  since 09/24/2004 13:11:13
98      on-line   since 09/29/2004 11:05:39
99      no-intr   since 09/29/2004 11:05:43
# echo "ncpus/D" | mdb -k
physmem 4ddf90
ncpus:
ncpus:          3

Though the proc is no longer available to Solaris, the unconfigured proc is still seen via OBP, because the OBP device tree does not reflect the change.

# prtconf -vp | grep "name:  'SUNW,UltraSPARC-III"
name:  'SUNW,UltraSPARC-III+'
name:  'SUNW,UltraSPARC-III+'
name:  'SUNW,UltraSPARC-III+'
name:  'SUNW,UltraSPARC-III+' 

Note that cfgadm uses cpu_offline() as part of the removal process.

Note that, although the memory controller resides on the processor in the USIII architecture, the associated memory is still accessible when a processor is unconfigured :

Original configuration :

# prtdiag -v
System Configuration:  Sun Microsystems  sun4u Sun Fire 15000
System clock frequency: 150 MHz
Memory size: 16384 Megabytes
========================= CPUs =========================
         CPU      Run    E$    CPU     CPU
Slot ID   ID       MHz    MB   Impl.    Mask
--------  -------  ----  ----  -------  ----
/SB11/P0  352      1200   8.0  US-III+  11.0
/SB11/P1  353      1200   8.0  US-III+  11.0
/SB11/P2  354      1200   8.0  US-III+  11.0
/SB11/P3  355      1200   8.0  US-III+  11.0
# cfgadm -alv
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
SB11                           connected    configured   ok         powered-on, assigned
Jun  5 11:44 CPU          n        /devices/pseudo/dr@0:SB11
SB11::cpu0                     connected    configured   ok         cpuid 352, speed 1200 MHz, ecache 8 MBytes
Jun  5 11:44 cpu          n        /devices/pseudo/dr@0:SB11::cpu0
SB11::cpu1                     connected    configured   ok         cpuid 353, speed 1200 MHz, ecache 8 MBytes
Jun  5 11:44 cpu          n        /devices/pseudo/dr@0:SB11::cpu1
SB11::cpu2                     connected    configured   ok         cpuid 354, speed 1200 MHz, ecache 8 MBytes
Jun  5 11:44 cpu          n        /devices/pseudo/dr@0:SB11::cpu2
SB11::cpu3                     connected    configured   ok         cpuid 355, speed 1200 MHz, ecache 8 MBytes
Jun  5 11:44 cpu          n        /devices/pseudo/dr@0:SB11::cpu3
SB11::memory                   connected    configured   ok         base address 0x1e000000000, 16777216 KBytes total, 1040312 KBytes permanent
Jun  5 11:51 memory       n        /devices/pseudo/dr@0:SB11::memory
c0                             connected    configured   unknown
[...]
# psrinfo
352     on-line   since 06/05/2007 11:44:41
353     on-line   since 06/05/2007 11:44:41
354     on-line   since 06/05/2007 11:44:41
355     on-line   since 06/05/2007 11:44:41
# cfgadm -c unconfigure SB11::cpu0
OS unconfigure dr@0:SB11::cpu0
# psrinfo
353     on-line   since 06/05/2007 11:44:41
354     on-line   since 06/05/2007 11:44:41
355     on-line   since 06/05/2007 11:44:41

No change in the memory configuration :

# cfgadm -alv
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
SB11                           connected    configured   ok         powered-on, assigned
Jun  5 11:58 CPU          n        /devices/pseudo/dr@0:SB11
SB11::cpu0                     connected    unconfigured ok         cpuid 352, speed 1200 MHz, ecache 8 MBytes
Jun  5 11:58 cpu          n        /devices/pseudo/dr@0:SB11::cpu0
SB11::cpu1                     connected    configured   ok         cpuid 353, speed 1200 MHz, ecache 8 MBytes
Jun  5 11:44 cpu          n        /devices/pseudo/dr@0:SB11::cpu1
SB11::cpu2                     connected    configured   ok         cpuid 354, speed 1200 MHz, ecache 8 MBytes
Jun  5 11:44 cpu          n        /devices/pseudo/dr@0:SB11::cpu2
SB11::cpu3                     connected    configured   ok         cpuid 355, speed 1200 MHz, ecache 8 MBytes
Jun  5 11:44 cpu          n        /devices/pseudo/dr@0:SB11::cpu3
SB11::memory                   connected    configured   ok         base address 0x1e000000000, 16777216 KBytes total, 1040312 KBytes permanent
Jun  5 11:51 memory       n        /devices/pseudo/dr@0:SB11::memory

The amount of memory available to the domain is still the same :

# prtconf -pv | grep Memory
Memory size: 16384 Megabytes
# prtdiag -v | more
System Configuration:  Sun Microsystems  sun4u Sun Fire 15000
System clock frequency: 150 MHz
Memory size: 16384 Megabytes
[...]

UltraSparc IV / UltraSparc IV+ :

Reminder :
The UltraSPARC IV processor is Sun's first CMP processor and consists of two UltraSPARC III+ cores on the same silicon die.
The on-chip memory controller supports up to 16 GB of DRAM per processor, shared between the two cores. Each core has exclusive access to its own 8 MB half of the Level 2 cache (on-chip tags, off-chip data).
OBP and Solaris treat each core as an individual CPU.

The above reasoning also applies to the UltraSPARC IV+ processor, which has a 2 MB Level 2 cache (on-chip tags and data) and a 32 MB Level 3 cache (on-chip tags, off-chip data) that is exclusive of the L2 cache.

From a 4-processor (8 cores) SF15K domain :

# prtdiag -v
System Configuration:  Sun Microsystems  sun4u Sun Fire 15000
System clock frequency: 150 MHz
Memory size: 16384 Megabytes
========================= CPUs =========================
         CPU      Run    E$    CPU     CPU
Slot ID   ID       MHz    MB   Impl.    Mask
--------  -------  ----  ----  -------  ----
/SB02/P0   64, 68  1050  16.0  US-IV    2.3
/SB02/P1   65, 69  1050  16.0  US-IV    2.3
/SB02/P2   66, 70  1050  16.0  US-IV    2.3
/SB02/P3   67, 71  1050  16.0  US-IV    2.3
# psrinfo
64      on-line   since 09/29/2004 22:29:06
65      on-line   since 09/29/2004 22:29:06
66      on-line   since 09/29/2004 22:29:07
67      on-line   since 09/29/2004 22:29:06
68      on-line   since 09/29/2004 22:29:06
69      on-line   since 09/29/2004 22:29:07
70      on-line   since 09/29/2004 22:29:07
71      on-line   since 09/29/2004 22:29:06

A core can be taken off-line :

# psradm -f 65
# psrinfo
64      on-line   since 09/29/2004 22:29:06
65      off-line  since 09/30/2004 10:45:15
66      on-line   since 09/29/2004 22:29:07
67      on-line   since 09/29/2004 22:29:07
68      on-line   since 09/29/2004 22:29:07
69      on-line   since 09/29/2004 22:29:07
70      on-line   since 09/29/2004 22:29:07
71      on-line   since 09/29/2004 22:29:07

As with USII and USIII processors, the on-line and off-line core states differ in that the off-line core is excluded from scheduling, runs the idle thread and can still be interrupted by cross traps and cross calls.

In the following example, processors 65 and 69 are the 2 cores of the same USIV cpu.

# echo "::cpuinfo -v" | mdb -k
ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
65 30019610000  2f    0    0  -1   no    no t-1402587 2a100013d40 (idle)
                 |
      RUNNING <--+
        READY
     QUIESCED
       EXISTS
      OFFLINE

ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
69 3001965e000  1b    0    0  -1   no    no t-37   2a100353d40 (idle)
                 |
      RUNNING <--+
        READY
       EXISTS
       ENABLE

A core cannot be unconfigured, but the cpu (2 cores) can be unconfigured :

Each CPU attachment point represents two CPUID numbers because, from a DR perspective, Solaris treats the two cores of a cpu as a single entity.
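
The pairing of CPUIDs per attachment point can be sketched from the numbers in this document alone. In the Sun Fire 15K outputs shown here (64 and 68 on /SB02/P0, 96 on SB3::cpu0, 352 on SB11::cpu0), the CPUID appears to follow 32 * board + 4 * core + processor. That formula is an inference from these examples, not a statement from the platform documentation, and the function names are hypothetical.

```python
def usiv_location(cpuid):
    """Split a CPUID into (board, processor, core).

    Assumes cpuid = 32 * board + 4 * core + processor, which is inferred
    only from the Sun Fire 15K outputs shown in this document.
    """
    board, rest = divmod(cpuid, 32)
    core, proc = divmod(rest, 4)
    return board, proc, core


def ap_id(cpuid):
    """Attachment point (as printed by cfgadm) that owns the given CPUID."""
    board, proc, _core = usiv_location(cpuid)
    return f"SB{board}::cpu{proc}"


def sibling(cpuid):
    """CPUID of the other core on the same chip (the two differ by 4)."""
    return cpuid ^ 4


print(ap_id(65), ap_id(69), sibling(65))
```

Under that assumption, offlining core 65 with psradm leaves its sibling 69 on-line, whereas unconfiguring the attachment point removes both.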

# cfgadm -alv -s "match=partial,select=type(cpu)"
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
SB2::cpu0                      connected    configured   ok         cpuid 64 and 68, speed 1050 MHz, ecache 16 MBytes
Sep 29 22:36 cpu          n        /devices/pseudo/dr@0:SB2::cpu0
SB2::cpu1                      connected    configured   ok         cpuid 65 and 69, speed 1050 MHz, ecache 16 MBytes
Sep 29 22:36 cpu          n        /devices/pseudo/dr@0:SB2::cpu1
SB2::cpu2                      connected    configured   ok         cpuid 66 and 70, speed 1050 MHz, ecache 16 MBytes
Sep 29 22:36 cpu          n        /devices/pseudo/dr@0:SB2::cpu2
SB2::cpu3                      connected    configured   ok         cpuid 67 and 71, speed 1050 MHz, ecache 16 MBytes
Sep 30 10:46 cpu          n        /devices/pseudo/dr@0:SB2::cpu3
# cfgadm -c unconfigure SB2::cpu3
# cfgadm -alv -s "match=partial,select=type(cpu)"
Ap_Id                          Receptacle   Occupant     Condition  Information
When         Type         Busy     Phys_Id
SB2::cpu0                      connected    configured   ok         cpuid 64 and 68, speed 1050 MHz, ecache 16 MBytes
Sep 29 22:36 cpu          n        /devices/pseudo/dr@0:SB2::cpu0
SB2::cpu1                      connected    configured   ok         cpuid 65 and 69, speed 1050 MHz, ecache 16 MBytes
Sep 29 22:36 cpu          n        /devices/pseudo/dr@0:SB2::cpu1
SB2::cpu2                      connected    configured   ok         cpuid 66 and 70, speed 1050 MHz, ecache 16 MBytes
Sep 29 22:36 cpu          n        /devices/pseudo/dr@0:SB2::cpu2
SB2::cpu3                      connected    unconfigured ok         cpuid 67 and 71, speed 1050 MHz, ecache 16 MBytes
Sep 30 14:27 cpu          n        /devices/pseudo/dr@0:SB2::cpu3

So, 2 cores are now missing from the original configuration :

# psrinfo
64      on-line   since 09/29/2004 22:29:06
65      off-line  since 09/30/2004 10:45:15
66      on-line   since 09/29/2004 22:29:07
68      no-intr   since 09/30/2004 10:45:27
69      on-line   since 09/29/2004 22:29:07
70      on-line   since 09/29/2004 22:29:07
# echo "ncpus/D" | mdb -k
ncpus:
ncpus:          6

and, as usual, all the cores are visible at the OBP level.

# prtconf -vp | grep "SUNW,UltraSPARC-IV"
compatible: 'SUNW,UltraSPARC-IV'
compatible: 'SUNW,UltraSPARC-IV'
compatible: 'SUNW,UltraSPARC-IV'
compatible: 'SUNW,UltraSPARC-IV'
compatible: 'SUNW,UltraSPARC-IV'
compatible: 'SUNW,UltraSPARC-IV'
compatible: 'SUNW,UltraSPARC-IV'
compatible: 'SUNW,UltraSPARC-IV' 

Obviously, when a processor (2 cores) is unconfigured, the associated memory is still accessible; the amount of memory available to the domain is still the same :

# prtdiag -v
System Configuration:  Sun Microsystems  sun4u Sun Fire 15000
System clock frequency: 150 MHz
Memory size: 16384 Megabytes
[...output omitted]

SPARC64 (VI, VII, VII+) :

Reminder :

A SPARC64 cpu offers two or four SPARC V9 cores and two vertical threads (two CMT strands) per core, with 5-12 MB of on-chip shared L2$ and no external cache.
SPARC64 chips are either mounted on the MBU for Mid-Range Servers (M4000 + M5000) or on a CMU for High-End Servers (M8000 + M9000).

The memory controller (MAC) is off-chip.
Most of the physical resources such as ALU, instruction pipeline and so on are shared between strands.
Each strand has its own software visible registers (PC, nextPC, data registers, etc).
OBP and Solaris treat each strand as an individual processor.

The following reasoning is applicable to the M4000, M5000, M8000 and M9000 domains.
The notes above about the processor status (running, ready, quiesced) also apply to SPARC64 processors.

The processor numbering is based on the Logical System Board mapping; therefore, the numbering is common to the Mid-Range Servers (M4000 + M5000) and High-End Servers (M8000 + M9000). See Document 1005329.1 for more details.

Since Solaris sees each strand as an individual processor, they are reported in the psrinfo output :

Example from a M9000-32 domain composed of one CMU : (4 * CPUM) * (2 * cores) * (2 * strands) => 16 processors

# echo "ncpus/D" | mdb -k
ncpus:
ncpus:          16
# prtdiag -v
System Configuration:  Sun Microsystems  sun4u Sun SPARC Enterprise M9000 Server
System clock frequency: 960 MHz
Memory size: 32768 Megabytes
==================================== CPUs ====================================
      CPU              CPU            Run       L2$       CPU      CPU
LSB    Chip              ID            MHz        MB       Impl.    Mask
---    ----      --------------------  ----      ---       -----    ----
00      0          0,   1,   2,   3   2280      5.0          6      146
00      1          8,   9,  10,  11   2280      5.0          6      146
00      2         16,  17,  18,  19   2280      5.0          6      146
00      3         24,  25,  26,  27   2280      5.0          6      146
# psrinfo
0       on-line   since 05/23/2007 16:07:08
1       on-line   since 05/23/2007 16:07:09
2       on-line   since 05/23/2007 16:07:09
3       on-line   since 05/23/2007 16:07:09
8       on-line   since 05/23/2007 16:07:09
9       on-line   since 05/23/2007 16:07:09
10      on-line   since 05/23/2007 16:07:09
11      on-line   since 05/23/2007 16:07:09
16      on-line   since 05/23/2007 16:07:09
17      on-line   since 05/23/2007 16:07:09
18      on-line   since 05/23/2007 16:07:09
19      on-line   since 05/23/2007 16:07:09
24      on-line   since 05/23/2007 16:07:09
25      on-line   since 05/23/2007 16:07:09
26      on-line   since 05/25/2007 06:57:15
27      on-line   since 05/23/2007 16:07:09
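
The processor IDs listed above are consistent with the formula id = 32 * LSB + 8 * chip + 2 * core + strand. The sketch below reproduces the prtdiag numbering for this dual-core SPARC64 VI example under that assumption; Document 1005329.1 remains the reference for the actual mapping, and the function name and defaults are hypothetical.

```python
def sparc64_cpuids(lsb, chips=4, cores=2, strands=2):
    """Processor IDs per chip for one LSB.

    Assumes id = 32 * lsb + 8 * chip + 2 * core + strand, which matches the
    prtdiag output above but is not taken from the platform documentation.
    """
    return {
        chip: [32 * lsb + 8 * chip + 2 * core + strand
               for core in range(cores)
               for strand in range(strands)]
        for chip in range(chips)
    }


ids = sparc64_cpuids(0)
for chip, cpu_ids in ids.items():
    print(chip, cpu_ids)
# (4 * CPUM) * (2 * cores) * (2 * strands) processors in total
print(sum(len(cpu_ids) for cpu_ids in ids.values()))
```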

Note : To determine the physical location of the component, a 'showboards -v' for the domain can be collected from the active XSCF.

XSCF> showboards -d 1
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
08-0 01(00)   Assigned    y    y    y    Passed  Normal

In this example, the processors listed in the prtdiag/psrinfo outputs belong to CMU#8, which is associated with LSB#0 of domain 1.

The information about the processors is also available from the main SP :

XSCF> showdevices -d 1
CPU:
----
DID XSB  id  state    speed  ecache
01  08-0 0   on-line   2280       5
01  08-0 1   on-line   2280       5
01  08-0 2   on-line   2280       5
01  08-0 3   on-line   2280       5
01  08-0 8   on-line   2280       5
01  08-0 9   on-line   2280       5
01  08-0 10  on-line   2280       5
01  08-0 11  on-line   2280       5
01  08-0 16  on-line   2280       5
01  08-0 17  on-line   2280       5
01  08-0 18  on-line   2280       5
01  08-0 19  on-line   2280       5
01  08-0 24  on-line   2280       5
01  08-0 25  on-line   2280       5
01  08-0 26  on-line   2280       5
01  08-0 27  on-line   2280       5
[...]

A processor can be taken off-line :

# psradm -f 2
# psrinfo
0       on-line   since 05/23/2007 16:07:08
1       on-line   since 05/23/2007 16:07:09
2       off-line  since 05/25/2007 07:09:05
3       on-line   since 05/23/2007 16:07:09
8       on-line   since 05/23/2007 16:07:09
[...]

As with UltraSparc processors, the on-line and off-line states differ in that the off-line processor is excluded from scheduling, runs the idle thread and can still be interrupted by cross traps and cross calls.

In the following example, processors 0, 1, 2 and 3 are the 4 strands (2 cores * 2 strands) of the same SPARC64 VI CPUM.

# echo "::cpuinfo -v" | mdb -k
ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
 0 0000180c000  1b    0    0  -1   no    no t-58   2a10001fcc0 (idle)
                 |
      RUNNING <--+
        READY
       EXISTS
       ENABLE

ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
 1 3000405a000  1b    0    0  -1   no    no t-5961 2a1004c9cc0 (idle)
                 |
      RUNNING <--+
        READY
       EXISTS
       ENABLE

ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
 2 3000405e000  2f    0    0  -1   no    no t-7468 2a100551cc0 (idle)
                 |
      RUNNING <--+
        READY
     QUIESCED
       EXISTS
      OFFLINE

As seen with other SPARC cpus, when a SPARC64 VI processor is off-line, it is still visible in OBP :

# prtconf -vp | grep "SPARC64"
compatible: 'FJSV,SPARC64-VI'
compatible: 'FJSV,SPARC64-VI'
compatible: 'FJSV,SPARC64-VI'
compatible: 'FJSV,SPARC64-VI'
compatible: 'FJSV,SPARC64-VI'
compatible: 'FJSV,SPARC64-VI'
compatible: 'FJSV,SPARC64-VI'
compatible: 'FJSV,SPARC64-VI'

And, of course, the memory is accessible if one or more processors are off-line :

# prtdiag -v
System Configuration:  Sun Microsystems  sun4u Sun SPARC Enterprise M9000 Server
System clock frequency: 960 MHz
Memory size: 32768 Megabytes

A SPARC64 VI processor cannot be unconfigured :

From a cfgadm perspective, we can see each CPUM reported as an entity :

# cfgadm -a -s "cols=ap_id:info,select=type(cpu)"
Ap_Id                          Information
SB0::cpu0                      cpuid 0, 1, 2, and 3, speed 2280 MHz, ecache 5 MBytes
SB0::cpu1                      cpuid 8, 9, 10, and 11, speed 2280 MHz, ecache 5 MBytes
SB0::cpu2                      cpuid 16, 17, 18, and 19, speed 2280 MHz, ecache 5 MBytes
SB0::cpu3                      cpuid 24, 25, 26, and 27, speed 2280 MHz, ecache 5 MBytes

Dynamic Reconfiguration on the Sun SPARC Enterprise Mx000 Servers follows an "SP initiated" model; therefore neither a CPUM, nor a core, nor a strand can be unconfigured from Solaris :

# cfgadm -c unconfigure SB0::cpu2
May 25 07:13:48 mammothcar-b drmach: WARNING: Operation not supported
cfgadm: Hardware specific failure: unconfigure SB0::cpu2: Operation not supported

As a summary :

. Since Solaris does not differentiate between cores, strands and cpus, each entity appears to Solaris as a cpu, so the application level command psradm does not behave differently on different CPUs. Whether it is US-II, US-III, US-IV[+] or SPARC64 VI, each entity seen as a cpu is handled in the same way.

. cfgadm, on the other hand, gets its information from the low-level device tree and is closer to the hardware, so it knows the difference between core, strand, chip and cpu. Its behavior therefore depends on the type of the cpu.

. off-line processors (Quiesced) are not completely idle: they run the idle thread and still receive cross calls and cross traps, and the E$ scrubber continues to cross-trap to an offlined proc. off-line means that a processor is not part of the scheduler and does not take device interrupts, but still takes software interrupts and remains part of the cpu_ready_set, taking part in demap cross calls.

. no-intr means that the cpu is part of the scheduler and does not take device interrupts. Again, software interrupts are an exception and the cpu still takes them.

. Unconfigured procs are completely removed from the Solaris scope but are still visible at the OBP level. The unconfigure operation is part of the DR process, in which a cpu is physically made to go back to a tight loop and is removed from the Solaris device tree. Solaris no longer has any knowledge of the existence of this cpu. This is totally different from the psradm state changes, which are requested at user level and controlled by the OS.

. Possible state combinations are :

After boot/initialization : RUNNING, READY, EXISTS, ENABLE
Interrupts disabled       : RUNNING, READY, EXISTS
Offline                   : RUNNING, READY, QUIESCED, EXISTS, OFFLINE

. In all the cases, UltraSparc II, UltraSparc III, UltraSparc IV and UltraSparc IV+, if the system reboots/crashes/dstops/hangs, and/or if a system recovery occurs, the system is brought back up in default mode : all processors configured and online.

. All the memory available before offlining/unconfiguring procs remains available after the operation.
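
The FLG column printed by ::cpuinfo in the examples above is the hexadecimal OR of the processor status flags. The sketch below decodes it, using the bit values quoted in the Technical Background section of this document; the function name is hypothetical.

```python
# Bit values copied from the CPU_* definitions in the Technical Background
# section of this document.
CPU_FLAGS = [
    (0x001, "RUNNING"),
    (0x002, "READY"),
    (0x004, "QUIESCED"),
    (0x008, "EXISTS"),
    (0x010, "ENABLE"),
    (0x020, "OFFLINE"),
    (0x040, "POWEROFF"),
    (0x080, "FROZEN"),
    (0x100, "SPARE"),
    (0x200, "FAULTED"),
]


def decode_flg(flg_hex):
    """Return the flag names set in a ::cpuinfo FLG value such as '2f'."""
    value = int(flg_hex, 16)
    return [name for bit, name in CPU_FLAGS if value & bit]


# FLG values taken from the ::cpuinfo outputs in this document.
for flg in ("1b", "b", "2f"):
    print(flg, decode_flg(flg))
```

The three values decode to the on-line, no-intr and off-line combinations listed in the summary.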

What's new with Solaris 10

FMA (Fault Management Architecture) introduces 2 new states:

FAULTED : processor is offline due to a fault
SPARE   : processor is offline but available for use

Moreover, psradm introduces 2 new options: -F and -s

-Fs to force processor into "spare" state
-Ff to force processor into "offline" state
-F to force processor into "faulted" state

The force option forces a processor to be offlined, set to faulted or set to spare even if there are processes bound to that processor. In this case, the binding is revoked for these processes.

In many respects the "spare" state is similar to the "offline" state. The difference is that a processor in this state cannot be changed to a different state unless the user has the appropriate privilege. The "spare" state adds a meaningful semantic to distinguish offline processors on the system for purposes of automated resource management.
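
The option combinations above can be captured in a small helper that builds the psradm argument list for a target state. This is an illustrative sketch only: the helper name and its parameters are hypothetical, it only assembles the arguments without running anything, and it relies on -n (bring on-line), an option from the psradm man page that is not otherwise discussed in this document.

```python
# Transition options: -f, -i and -s are described in this document;
# -n (on-line) comes from the psradm man page.
TRANSITION_FLAG = {
    "on-line": "-n",
    "off-line": "-f",
    "no-intr": "-i",
    "spare": "-s",
}


def psradm_argv(state, cpu_ids, force=False):
    """Build (but do not run) a psradm argument list for the given state."""
    argv = ["psradm"]
    if force:
        argv.append("-F")
    if state == "faulted":
        # Per the list above, -F with no other transition option means faulted.
        if not force:
            raise ValueError("the faulted state requires force=True")
    else:
        argv.append(TRANSITION_FLAG[state])
    argv.extend(str(cpu) for cpu in cpu_ids)
    return argv


print(psradm_argv("spare", [65], force=True))
print(psradm_argv("off-line", [65, 69]))
```

Passing -F and -s as separate arguments is equivalent to the combined -Fs form shown above.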



Internal Comments
Sun Internal information only.
Technical Background
Processor status flags are :
#define CPU_RUNNING 0x001 /* CPU running */
#define CPU_READY 0x002 /* CPU ready for cross-calls */
#define CPU_QUIESCED 0x004 /* CPU will stay in idle */
#define CPU_EXISTS 0x008 /* CPU is configured */
#define CPU_ENABLE 0x010 /* CPU enabled for interrupts */
#define CPU_OFFLINE 0x020 /* CPU offline via p_online */
#define CPU_POWEROFF 0x040 /* CPU is powered off */
#define CPU_FROZEN 0x080 /* CPU is frozen via CPR suspend */
#define CPU_SPARE 0x100 /* CPU offline available for use */
#define CPU_FAULTED 0x200 /* CPU offline diagnosed faulty */

From Solaris Internals - Architectures and techniques - Volume 1 :
CPU_RUNNING : The CPU is running, able to execute kernel threads, handle interrupts, etc.
CPU_READY : The CPU will take cross-calls and directed interrupts.
CPU_QUIESCED : The CPU is not running kernel threads or interrupt threads.
CPU_EXISTS : All installed CPUs when the system initializes (boots) will minimally be in the EXISTS state.
CPU_ENABLE : The CPU is enabled to take interrupts, but it is not part of the dispatcher's pool of online CPUs to schedule kernel threads. With this flag off, the CPU may still take directed interrupts and cross-calls, but not interrupts that can be directed to another CPU.
CPU_OFFLINE : The processor was taken offline by psradm(1M) or p_online(2) and is no longer scheduling kernel threads. The CPU will still take interrupts (this is the difference between the offline state and quiesced state: a CPU in the quiesced state will not take interrupts). A CPU with bound threads cannot be taken offline.
CPU_POWEROFF : (Solaris 2.6 and Solaris 7 only). The CPU has been powered off.

From cpu_offline() :

	/* don't turn off last online CPU in partition */
	if (ncpus_online <= 1 || curthread->t_bound_cpu == cp ||
	    pp->cp_ncpus <= 1 || cpu_intr_count(cp) < 2) {
		return (EBUSY);
	}

ncpus_online <= 1 : At least one processor in the system must be able to process LWPs.
curthread->t_bound_cpu == cp : A processor may not be taken off-line if there are LWPs that are bound to the processor.
pp->cp_ncpus <= 1 : The last processor of a partition cannot be taken off-line.
cpu_intr_count(cp) < 2 : At least one processor must also be able to be interrupted.
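
The guard quoted above can be modelled outside the kernel to see which combinations are refused. This is a simplified sketch: the parameter names are hypothetical stand-ins for the kernel variables, and only the EBUSY condition from the excerpt is reproduced.

```python
import errno


def cpu_offline_check(ncpus_online, caller_bound_here,
                      partition_ncpus, intr_capable_count):
    """Model of the cpu_offline() guard quoted above.

    ncpus_online       -> ncpus_online
    caller_bound_here  -> curthread->t_bound_cpu == cp
    partition_ncpus    -> pp->cp_ncpus
    intr_capable_count -> cpu_intr_count(cp)
    Returns errno.EBUSY when the request would be refused, else 0.
    """
    if (ncpus_online <= 1 or caller_bound_here
            or partition_ncpus <= 1 or intr_capable_count < 2):
        return errno.EBUSY
    return 0


# A 4-processor domain with every processor taking interrupts: allowed.
print(cpu_offline_check(4, False, 4, 4))
# Only one interrupt-capable processor is counted: refused.
print(cpu_offline_check(4, False, 4, 1) == errno.EBUSY)
```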

SunSolve - technical information is available in the comments section of :
5004304 - even after a CPU is offlined for ECC errors, it can still panic on ECC errors
4947174 - After offlining a CPU, a domain panic's at TL=0x2
psradm, cfgadm, off-line, no-intr, unconfigure, offline
Previously Published As 78333


  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.