Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1482273.1
Update Date: 2012-09-27

Solution Type  Problem Resolution Sure

Solution  1482273.1 :   SPARC T4 systems seeing "send_mondo" panics followed by a system hang  


Related Items
  • SPARC T4-1
  • SPARC T4-1B
  • SPARC T4-4
  • SPARC T4-2
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T4
  • .Old GCS Categories>Sun Microsystems>Servers>CMT Servers




In this Document
Symptoms
Cause
Solution


Oracle Confidential (PARTNER). Do not distribute to customers.
Reason: security CR

Applies to:

SPARC T4-1B - Version All Versions to All Versions [Release All Releases]
SPARC T4-1 - Version All Versions to All Versions [Release All Releases]
SPARC T4-2 - Version All Versions to All Versions [Release All Releases]
SPARC T4-4 - Version All Versions to All Versions [Release All Releases]
SPARC

Symptoms

A SPARC T4-x system may panic as shown below, with all eight CPUs (strands) of any single core failing to stop:

.
.
send mondo timeout [retries: 0x95173]  cpuids:  0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27
panic: failed to stop cpu32
panic: failed to stop cpu33
panic: failed to stop cpu34
panic: failed to stop cpu35
panic: failed to stop cpu36
panic: failed to stop cpu37
panic: failed to stop cpu38
panic: failed to stop cpu39

panic[cpu46]/thread=3005b6dfa60: send_mondo_set: timeout

000002a104dcccc0 unix:send_mondo_set+548 (10, ffff, 183c738, 1, 30009740600, 8)
  %l0-3: 0000000000000037 00008006fc8039ae 00000000010bcc00 00000000010bcc00
  %l4-7: 0000000001044c8c 00000000010bcc20 0000000000000000 0000000000000008
000002a104dccdb0 unix:xt_sync+1bc (2a104dcd1c0, 1044c00, 2a104dccec8, ffffbfffffffffff, fffffffffffffff8, 1913c08)
  %l0-3: 0000000000000000 00008006c0f0d998 000002a104dcce88 00000000018efbf0
  %l4-7: 000000000000003e 0000000000000000 4000000000000000 ffffbfffffffffff
.
.

The system then requires a power reset to come back up, which can be verified in the hostconsole.log from the ILOM snapshot:

^M100% done: 386424 pages dumped, dump succeeded^M^M
rebooting...^M^M
Resetting...^M^M
[CPU 0:0:0] NOTICE:  Checking Flash File System^M
[CPU 0:0:0] NOTICE:  Initializing TOD: 2012/05/10 02:41:13^M
[CPU 0:0:0] NOTICE:  Loaded ASR status DB data. Ver. 3.^M
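
As a quick check for the non-LDOM case, the "failed to stop" messages from the panic output can be grouped by core. The following Python sketch is illustrative only and is not part of any Oracle tool. The function names are hypothetical, and it assumes that physical CPU IDs are numbered eight per core (core = CPU ID / 8), which matches the PID and CID values shown in this document.

```python
import re
from collections import defaultdict

# Assumption: SPARC T4 exposes eight strands (CPUs) per core and numbers
# physical CPU IDs contiguously, so core index = cpu_id // 8.
STRANDS_PER_CORE = 8


def failed_cpus(console_text):
    """Return the CPU IDs named in 'failed to stop cpuN' messages."""
    return [int(n) for n in re.findall(r"failed to stop cpu(\d+)", console_text)]


def group_by_core(cpu_ids, strands_per_core=STRANDS_PER_CORE):
    """Group physical CPU IDs by core index."""
    cores = defaultdict(list)
    for cpu in cpu_ids:
        cores[cpu // strands_per_core].append(cpu)
    return dict(cores)


# Panic messages copied from the example above.
sample = """\
panic: failed to stop cpu32
panic: failed to stop cpu33
panic: failed to stop cpu34
panic: failed to stop cpu35
panic: failed to stop cpu36
panic: failed to stop cpu37
panic: failed to stop cpu38
panic: failed to stop cpu39
"""

cores = group_by_core(failed_cpus(sample))
print(cores)
```

A result with a single key means that every CPU that failed to stop belongs to one core. This check applies only when the reported IDs are physical CPU IDs; for domains, map the virtual IDs first as described in the LDOM/OVM notes.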


LDOM/OVM

1. In the event of a send_mondo_set timeout panic on a guest domain, the hostconsole.log collected by snapshot will not include the panic stack. However, it can still be used to verify whether a platform reset occurred prior to host recovery.

2. When diagnosing send_mondo_set timeout panics on systems configured with LDOMs/OVM, care must be taken to ensure that the CPUs reported as failed to stop do in fact map to the same single core, since domains report the virtual CPU ID (VID) rather than the physical CPU ID (PID). The core that each CPU belongs to is shown in the CID column.

This can be checked using the ldm_list_-l.out output collected by explorer. For example, given the following messages:

Aug  5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu72
Aug  5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu73
Aug  5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu74
Aug  5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu75
Aug  5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu76
Aug  5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu77
Aug  5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu78
Aug  5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu79

From the ldm output ({explorer}/sysconfig/ldm_list_-l.out) we can see that these map to physical CPUs 136 through 143. The CID column also confirms that they all belong to the same physical core (17).

VCPU
    VID    PID    CID    UTIL STRAND
    72     136    17     2.0%   100%
    73     137    17      12%   100%
    74     138    17     3.5%   100%
    75     139    17      12%   100%
    76     140    17     2.5%   100%
    77     141    17     1.2%   100%
    78     142    17     0.9%   100%
    79     143    17     0.9%   100%

3. Only those CPUs assigned to a given domain (Control or Guest) will be impacted by the panic. As a result, you may see fewer than eight CPUs reported as failed to stop if less than the entire core is assigned to that domain:

send mondo timeout [retries: 0x99cbb]  cpuids:
 0x38

panic: failed to stop cpu56
panic: failed to stop cpu57
panic: failed to stop cpu58

From ldm_list_-l.out:

VCPU
    VID    PID    CID    UTIL STRAND
    0      0      0       24%   100%
    56     56     7       12%   100%
    57     57     7       23%   100%
    58     58     7       22%   100%
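
The VID-to-core lookup described in the LDOM/OVM notes above can be sketched in Python as follows. This is a minimal illustration, not an Oracle-supplied tool. The function names are hypothetical, and it assumes that the text passed in contains only the VCPU table of the domain that panicked, since VIDs are reused across domains in a full ldm_list_-l.out file.

```python
def parse_vcpu_table(ldm_text):
    """Parse VCPU rows into {VID: (PID, CID)} from 'ldm list -l' style text.

    Assumption: ldm_text holds the VCPU section of a single domain.
    """
    rows = {}
    in_table = False
    for line in ldm_text.splitlines():
        fields = line.split()
        if fields[:3] == ["VID", "PID", "CID"]:
            in_table = True  # header row found; data rows follow
            continue
        if in_table:
            if len(fields) >= 3 and all(f.isdigit() for f in fields[:3]):
                vid, pid, cid = (int(f) for f in fields[:3])
                rows[vid] = (pid, cid)
            else:
                in_table = False  # first non-data line ends the table
    return rows


def cores_for(vids, vcpu_map):
    """Return the set of core IDs (CID) behind the given virtual CPU IDs."""
    return {vcpu_map[v][1] for v in vids}


# VCPU table copied from the first ldm_list_-l.out example in this document.
sample = """\
VCPU
    VID    PID    CID    UTIL STRAND
    72     136    17     2.0%   100%
    73     137    17      12%   100%
    74     138    17     3.5%   100%
    75     139    17      12%   100%
    76     140    17     2.5%   100%
    77     141    17     1.2%   100%
    78     142    17     0.9%   100%
    79     143    17     0.9%   100%
"""

vcpu_map = parse_vcpu_table(sample)
print(cores_for(range(72, 80), vcpu_map))
```

If the returned set holds exactly one core ID, the CPUs reported as failed to stop all map to a single core, whether the domain owns the whole core or only part of it.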

Cause

 N/A

Solution

This is NOT a hardware issue, so NO hardware of any kind should be replaced.

If you see this type of panic, please open a Service Request (SR) to the VSP SPARC T4 Hardware Group in My Oracle Support (MOS), and upload or attach the core file, explorer, and ILOM snapshot for analysis.


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.