Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 1482273.1: SPARC T4 systems seeing "send_mondo" panics followed by a system hang
Oracle Confidential (PARTNER). Do not distribute to customers.

Applies to:
SPARC T4-1B - Version All Versions to All Versions [Release All Releases]
SPARC T4-1 - Version All Versions to All Versions [Release All Releases]
SPARC T4-2 - Version All Versions to All Versions [Release All Releases]
SPARC T4-4 - Version All Versions to All Versions [Release All Releases]
SPARC

Symptoms
A SPARC T4-x system could see the following panic type, with the eight CPUs of any single core failing to stop:
.
.
send mondo timeout [retries: 0x95173] cpuids: 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27
panic: failed to stop cpu32
panic: failed to stop cpu33
panic: failed to stop cpu34
panic: failed to stop cpu35
panic: failed to stop cpu36
panic: failed to stop cpu37
panic: failed to stop cpu38
panic: failed to stop cpu39
panic[cpu46]/thread=3005b6dfa60: send_mondo_set: timeout
000002a104dcccc0 unix:send_mondo_set+548 (10, ffff, 183c738, 1, 30009740600, 8)
  %l0-3: 0000000000000037 00008006fc8039ae 00000000010bcc00 00000000010bcc00
  %l4-7: 0000000001044c8c 00000000010bcc20 0000000000000000 0000000000000008
000002a104dccdb0 unix:xt_sync+1bc (2a104dcd1c0, 1044c00, 2a104dccec8, ffffbfffffffffff, fffffffffffffff8, 1913c08)
  %l0-3: 0000000000000000 00008006c0f0d998 000002a104dcce88 00000000018efbf0
  %l4-7: 000000000000003e 0000000000000000 4000000000000000 ffffbfffffffffff
.
.
The system then requires a power reset to come back up (which can be verified in the hostconsole.log from the ILOM snapshot):
^M100% done: 386424 pages dumped, dump succeeded^M^M
rebooting...^M^M
Resetting...^M^M
[CPU 0:0:0] NOTICE: Checking Flash File System^M
[CPU 0:0:0] NOTICE: Initializing TOD: 2012/05/10 02:41:13^M
[CPU 0:0:0] NOTICE: Loaded ASR status DB data. Ver. 3.^M
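The cpuids on the "send mondo timeout" line in the Symptoms panic output are hexadecimal, while the "failed to stop" lines use decimal (0x20 is 32, 0x27 is 39). A minimal Python sketch of cross-checking the two lists and confirming they fall on one core is shown here. The helper names are hypothetical, and the core calculation assumes a non-virtualized system where CPU IDs are numbered contiguously with eight strands per core; on LDOM/OVM systems the ldm output must be used instead.

```python
import re

STRANDS_PER_CORE = 8  # SPARC T4 has eight hardware strands per core


def parse_mondo_cpuids(line):
    """Return decimal CPU IDs listed after 'cpuids:' on a send mondo timeout line."""
    _, _, tail = line.partition("cpuids:")
    return [int(tok, 16) for tok in tail.split()]


def parse_failed_to_stop(text):
    """Return decimal CPU IDs from 'panic: failed to stop cpuN' lines."""
    return [int(n) for n in re.findall(r"failed to stop cpu(\d+)", text)]


def cores_for(cpu_ids, strands_per_core=STRANDS_PER_CORE):
    """Assumption: contiguous numbering, so core index = cpu_id // strands_per_core."""
    return {cpu // strands_per_core for cpu in cpu_ids}


mondo_line = ("send mondo timeout [retries: 0x95173] cpuids: "
              "0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27")
panic_text = "\n".join(f"panic: failed to stop cpu{n}" for n in range(32, 40))

mondo_ids = parse_mondo_cpuids(mondo_line)
failed_ids = parse_failed_to_stop(panic_text)

print(mondo_ids == failed_ids)   # both lists name the same CPUs
print(cores_for(failed_ids))     # a single-element set means one core
```

A result set with more than one element would mean the failed CPUs span several cores, which does not match the signature described in this document.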
LDOM/OVM
1. In the event of a send_mondo_set timeout panic on a guest domain, the hostconsole.log collected by snapshot will not include
the panic stack; however, it can still be used to verify whether a platform reset occurred prior to host recovery.
2. When diagnosing send_mondo_set timeout panics on systems configured with LDOMs/OVM, care must be taken to ensure the
CPUs reported as failed to stop do in fact map to the same single core, since domains use a virtual CPU ID (VID) rather than
the physical CPU ID (PID); the CID column identifies the physical core.
This can be checked using the ldm_list_-l.out output collected by explorer:
Aug 5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu72
Aug 5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu73
Aug 5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu74
Aug 5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu75
Aug 5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu76
Aug 5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu77
Aug 5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu78
Aug 5 01:22:07 somehost unix: [ID 350512 kern.notice] panic: failed to stop cpu79
From the ldm output ({explorer}/sysconfig/ldm_list_-l.out) we can see that these map to physical CPUs 136 to 143 (PID column),
and the CID column confirms that all eight belong to physical core 17.
VCPU
VID PID CID UTIL STRAND
72 136 17 2.0% 100%
73 137 17 12% 100%
74 138 17 3.5% 100%
75 139 17 12% 100%
76 140 17 2.5% 100%
77 141 17 1.2% 100%
78 142 17 0.9% 100%
79 143 17 0.9% 100%
3. Only those CPUs assigned to a given domain (Control or Guest) will be impacted by the panic. As a result, you may see fewer than
eight CPUs reported as failed to stop if less than the entire core is assigned to that domain:
send mondo timeout [retries: 0x99cbb] cpuids:
0x38
panic: failed to stop cpu56
panic: failed to stop cpu57
panic: failed to stop cpu58
From ldm_list_-l.out:
VCPU
VID PID CID UTIL STRAND
0 0 0 24% 100%
56 56 7 12% 100%
57 57 7 23% 100%
58 58 7 22% 100%
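The VID-to-core check described in items 2 and 3 can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes the text passed in is the VCPU section for the single domain that panicked (a full ldm_list_-l.out contains one VCPU section per domain, and VIDs repeat across domains).

```python
def parse_vcpu_table(text):
    """Parse VID/PID/CID rows from one VCPU section; return {vid: (pid, cid)}."""
    mapping = {}
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:3] == ["VID", "PID", "CID"]:
            in_table = True          # header row found, data rows follow
            continue
        if in_table:
            if len(fields) >= 3 and all(f.isdigit() for f in fields[:3]):
                vid, pid, cid = map(int, fields[:3])
                mapping[vid] = (pid, cid)
            else:
                in_table = False     # first non-data row ends the table
    return mapping


def cores_of_failed(mapping, failed_vids):
    """Return the set of physical core IDs (CID) behind the failed virtual CPUs."""
    return {mapping[vid][1] for vid in failed_vids}


# Excerpt from item 2: a full core assigned to the domain.
full_core = """\
VCPU
VID PID CID UTIL STRAND
72 136 17 2.0% 100%
73 137 17 12% 100%
74 138 17 3.5% 100%
75 139 17 12% 100%
76 140 17 2.5% 100%
77 141 17 1.2% 100%
78 142 17 0.9% 100%
79 143 17 0.9% 100%
"""

# Excerpt from item 3: only part of core 7 assigned to the domain.
partial_core = """\
VCPU
VID PID CID UTIL STRAND
0 0 0 24% 100%
56 56 7 12% 100%
57 57 7 23% 100%
58 58 7 22% 100%
"""

full_map = parse_vcpu_table(full_core)
partial_map = parse_vcpu_table(partial_core)

print(cores_of_failed(full_map, range(72, 80)))      # one CID expected
print(cores_of_failed(partial_map, [56, 57, 58]))    # one CID expected
```

If either call returns more than one core ID, the failed CPUs do not share a single core and the panic may not match the pattern covered here.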
Cause
N/A

Solution
This is NOT a hardware issue and so NO hardware of any kind should be replaced. If you see this type of panic, please open a Service Request (SR) to the VSP SPARC T4 Hardware Group in My Oracle Support (MOS) and either upload or attach the corefile, explorer, and ILOM snapshot for analysis.

Attachments
This solution has no attachment