Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure

Solution 1019357.1: Sun Fire Server with Solaris 10 may Panic or Reset with lpost message, asynchronous event, fail to stop CPU or send_mondo timeout
Previously Published As: 238746
Bug Id: <SUNBUG: 6684726>, <SUNBUG: 6699498>
Date of Preliminary Release: 12-Jun-2008
Date of Resolved Release: 01-Aug-2008

Sun Fire Server with Solaris 10 may Panic or Reset with lpost message, asynchronous event, fail to stop CPU or send_mondo timeout (see below for full details).

1. Impact

Loss of application availability may occur due to a system panic or reset. This type of fault is typically diagnosed as a hardware failure and may lead to unnecessary hardware replacement.

2. Contributing Factors

This issue may occur on the following releases and platforms:

SPARC Platform
This issue is specific to the Midrange servers listed above and has only been seen with SPARC USIV+ 1.5 GHz and 1.8 GHz CPUs. It is also considered possible with USIV+ 1.95 GHz CPUs.

System Controller Application (ScApp) firmware 5.20.9 (as delivered in patch 114527-10) and earlier is affected.

The issue has been observed when running programs that access OBP from the OS. Examples of such programs are prtdiag, prtconf, cfgadm, picl, and other third-party system management software.

This issue is very timing dependent and is expected to be rare.

3. Symptoms

Console logs and core files are useful in identifying whether the system is experiencing this issue.

-------------------------------------------
Console logs showing send_mondo panic:

    domainA console login: {/N0/SB2/P0/C1} @(#) lpost 5.20.8 2007/11/20 10:33
    {/N0/SB2/P0/C1} Copyright 2007 Sun Microsystems, Inc. All rights reserved.
    {/N0/SB2/P0/C1} Use is subject to license terms.
    send mondo timeout [8307178 NACK 0 BUSY]
    IDSR 0x4000000000000000 cpuids: 0x208
    panic: failed to stop cpu520

    panic[cpu3]/thread=3005bc3e080: send_mondo_set: timeout

    000002a10062e9c0 SUNW,UltraSPARC-IV+:send_mondo_set+454 (2a10062eba0, ... 1, 2a10062eab0, 0)
      %l0-3: aaaaaaaaaaaaaaaa 000000000000002f 000000000000002f 0000000000000209
      %l4-7: 0000000001221400 00000007274a4ba9 4000000000000000 0000000000000040
    000002a10062eaf0 unix:xt_some+194 (2a10062ed78, 2a10062ebf0, fffff7, fffffffffffffff8, 2a10062eba8, 0)

-------------------------------------------
Console logs showing Asynchronous Event and failed to stop:

    domainA console login: {/N0/SB5/P2/C1} @(#) lpost 5.20.8 2007/11/20 10:33
    {/N0/SB5/P2/C1} Copyright 2007 Sun Microsystems, Inc. All rights reserved.
    {/N0/SB5/P2/C1} Use is subject to license terms.
    {/N0/SB5/P2/C1} @(#) lpost 5.20.8 2007/11/20 10:33
    {/N0/SB5/P2/C1} Copyright 2007 Sun Microsystems, Inc. All rights reserved.
    {/N0/SB5/P2/C1} Use is subject to license terms.
    {/N0/SB5/P2/C1} WARNING: Asynchronous Event.
    {/N0/SB5/P2/C1} Component under test: /N0/SB5/P2 CPU
    {/N0/SB5/P2/C1} AFSR1 EXT: 00000000.00000000 AFSR2 EXT:00000000.00000000
    {/N0/SB5/P2/C1} tl tt tstate tpc tnpc
    {/N0/SB5/P2/C1} 01 63 00000044.80000605 000007ff.f000c370 000007ff.f000c374
    Apr 23 17:31:15 e13-sc1 Domain-A.SC: Active - Panicking
    panic: failed to stop cpu534
    panic[cpu7]/thread=30029331920: bad kernel MMU miss at TL 2

4. Workaround

Workarounds for this issue are determined on a case-by-case basis and require consultation with Sun Services. An individual action plan will be developed for your environment.

5. Resolution

This issue is addressed on the following releases and platforms:

SPARC Platform
01-Aug-2008: Updated Contributing Factors and Resolution sections. Now Resolved.

Product:
    Sun Fire E2900 Server
    Sun Fire E4900/E6900 Server
    Sun Fire E4800/6800 Server
    Sun Fire V1280 Server
    Netra 1280 Server
    Sun Netra 1290 Server
    Solaris 10 Operating System

References: <SUNPATCH: 114527-11>, <SUNPATCH: 137111-04>

Internal Contributor/Submitter: [email protected]
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Engineer: [email protected]
Internal Eng Business Unit Group: SSG ES (Enterprise Systems)
Internal Escalation ID: 70258372, 1-23935265

Internal Sun Alert & FAB Admin Info:
    WF 10-Jun-2008: karen, submitted yesterday. sending for review to Ken first, then 24hr review.
    WF 12-Jun-2008: karen, releasing

Internal Resolution Patches: 114527-11, 137111-04

Internal Comments:

Please send technical questions to the following email: [email protected] and CC the following persons:
    Internal Contributor/Submitter
    Internal Eng Responsible Engineer
    Internal Services Knowledge Engineer

The fix requires a firmware upgrade to System Controller Application (ScApp) 5.20.10 in patch 114527-11 and the Solaris 10 kernel patch.

As a workaround, the command identified in the core file as the cause of the failure at a customer's site should be disabled to prevent repeat failures. This timing issue is load dependent, and a machine that has already seen the issue is more likely to see it again. Proactively disabling all commands listed in the Contributing Factors section is not recommended, since the application mix required to trigger the failure is rare.

For the latest information on this and all other Midrange Server Product issues see:
http://panacea/twiki/bin/view/Products/MidRangeServerProdIssues

Attachments: This solution has no attachment
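
The console signatures in the Symptoms section and the ScApp version boundary (5.20.9 and earlier affected, 5.20.10 fixed) can be checked mechanically during first-pass triage. The following is a minimal Python sketch, not a Sun-supplied tool: the function names, the pattern set, and the idea of feeding it a saved console log are all assumptions made for illustration. A match only suggests that a case is worth comparing against this alert; it does not confirm the defect, which still requires core file analysis and consultation with Sun Services.

```python
import re

# Message fragments copied from the console excerpts in section 3 (Symptoms).
# The grouping and the names of the keys are illustrative assumptions.
SIGNATURES = {
    "send_mondo timeout": re.compile(r"send[ _]mondo(?:_set)?:? timeout"),
    "failed to stop cpu": re.compile(r"failed to stop cpu\d+"),
    "asynchronous event": re.compile(r"WARNING: Asynchronous Event"),
    "lpost banner": re.compile(r"@\(#\) lpost \d+\.\d+\.\d+"),
}

# First ScApp release carrying the fix, per this alert (patch 114527-11).
FIXED_SCAPP = (5, 20, 10)


def scan_console_log(text):
    """Return {signature name: [matching lines]} for signatures found in text."""
    hits = {}
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(line.strip())
    return hits


def scapp_is_affected(version):
    """True if a dotted ScApp version string such as '5.20.9' predates 5.20.10."""
    parts = tuple(int(p) for p in version.split("."))
    return parts < FIXED_SCAPP


if __name__ == "__main__":
    # Lines taken from the first console excerpt in this alert.
    sample = (
        "domainA console login: {/N0/SB2/P0/C1} @(#) lpost 5.20.8 2007/11/20 10:33\n"
        "send mondo timeout [8307178 NACK 0 BUSY]\n"
        "panic: failed to stop cpu520\n"
        "panic[cpu3]/thread=3005bc3e080: send_mondo_set: timeout\n"
    )
    for name, lines in scan_console_log(sample).items():
        print(name, len(lines))
    print("5.20.9 affected:", scapp_is_affected("5.20.9"))
    print("5.20.10 affected:", scapp_is_affected("5.20.10"))
```

The version check compares integer tuples rather than strings because a plain string comparison would order "5.20.10" before "5.20.9" and misreport the fixed release as affected.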