Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1020467.1: How to Manage "Unable to send ECC event message to System Controller" Messages

Previously Published As: 259008
Applies to:
Sun Fire V1280 Server
Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
Sun Fire 6800 Server
All Platforms

Goal

Description

How to deal with "Unable to send ECC event message to System Controller" messages.

This document discusses how to proceed when messages such as the following appear continually in the /var/adm/messages file:

May 10 03:10:28 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:11:28 system last message repeated 2 times
May 10 03:11:53 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:11:58 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:13:58 system last message repeated 5 times
May 10 03:14:12 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:14:28 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:17:28 system last message repeated 6 times
May 10 03:17:29 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:17:58 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:18:28 system last message repeated 1 time
May 10 03:18:58 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:18:58 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:24:58 system last message repeated 13 times
May 10 03:25:25 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:25:28 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:28:58 system last message repeated 8 times
May 10 03:29:27 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:29:28 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:31:58 system last message repeated 5 times

Solution

Cause

These messages are caused by a flood of errors, for example a DIMM generating many hundreds or thousands of CEs (Correctable Errors). The flood of errors on the domain is more than the domain-to-SC data path can handle.

Note: The CE flood or storm is sometimes caused by FMA not retiring pages correctly. It is important to install the latest FMA patches, for example:

Patch 139572-02 SunOS[TM] 5.10: fmd patch (or later)
fixes Sun CR 6714311 Updated P2 fma/mem fmstat seems to hang after/during CE storm
This bug causes page retirement to malfunction.

Also:

Patch 120011-14 SunOS 5.10: kernel patch (or later)
Patch 125369-12 SunOS 5.10: Fault Manager patch (or later)

These two patches are quite important to have installed in order to avoid known issues that can lead to this condition.

Background

Data transactions go into the error buffer on the SC, and during an error storm that buffer fills up. By design, it holds only about 100 messages. Because Solaris can no longer write to the error buffer, it logs notices in /var/adm/messages indicating "Unable to send ECC event message to System Controller".

This issue is sometimes difficult to troubleshoot because the original error messages have to be examined to determine what event started the error storm. You should resolve the original error event (replace the DIMM), but only after updating the patches to ensure that page retirement is functioning properly (see the Note on patches above). It is not advisable to replace hardware (a memory DIMM) if the patches above are not installed. The patches should have prevented a storm in the first place by retiring faulty pages instead of allowing them to noisily fill up the error buffer with ECC errors.

If the patches ARE installed, search for the DIMM in error by examining the showerrorbuffer output. The DIMM implicated by the "incoming" error is the root cause suspect (see Document 1002710.1 for details on this diagnosis):

Date: Thu May 07 15:50:45 EDT 2009

NOTE: The service mode command clearerrorbuffer can be used to clear the error buffer and prevent the "Unable to send" event messages from showing up again in /var/adm/messages (unless the error storm persists). However, service mode requires that you contact Oracle Support Services to obtain a password, and this special mode is to be used only by Oracle badged employees. This is one reason that using clearerrorbuffer is not really a viable solution to this problem. The main reason is that this method of "resolving" the issue wipes all the errors from the error buffer and could prevent you from identifying the DIMM responsible for the noise. It is best to install the correct patches and/or replace the DIMM in the first place.

Internal Comments

If the error buffer needs to be cleared, proceed as follows:

1. Get into service mode.
   Document 1010655.1 provides insight into working in service mode.
   Note: This requires an Oracle badge or use of Shared Shell. Customers should not do this themselves.

2. Use the clearerrorbuffer command (example below):

   ssc0:SC[service]> clearerrorbuffer -h
   clearerrorbuffer -- clear the contents of the error buffer
   Usage: clearerrorbuffer
          clearerrorbuffer -h

NOTE: Be aware that the clearerrorbuffer command empties the showerrorbuffer command output. You will no longer be able to use that data to identify a faulty DIMM or the source of the errors.

- The ECC storm discussed in this document can lead to the behaviour described in Sun Alert 1019109.1, Systems With UltraSPARC IV+ Processors Running Solaris 9 or 10 May Experience "send mondo timeout" Panic.

This behaviour has been experienced on systems using both ScApp 5.19.x and 5.20.x.

Memory, CE, storm, ECC, showerrorbuffer, clearerrorbuffer, flood, page retirement, buffer

Attachments

This solution has no attachment
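
To gauge how severe a storm is before digging into the showerrorbuffer output, it can help to tally the two sgsbbc notices quoted in the Description. The following is a minimal Python sketch, not a supported tool. The function name tally_storm_messages and the SAMPLE data are illustrative, the message strings are copied from the log excerpt in this document, and the sketch assumes that a syslog "last message repeated N times" line refers to the message logged immediately before it.

```python
import re
from collections import Counter

# Message texts copied from the /var/adm/messages excerpt in this document.
TIMEOUT = "Timed out sending message to SC"
UNABLE = "Unable to send ECC event message to System Controller"

# Matches both "repeated 1 time" and "repeated N times".
REPEAT = re.compile(r"last message repeated (\d+) times?")


def tally_storm_messages(lines):
    """Count the two sgsbbc notices, expanding 'repeated N times' lines.

    Assumption: a repeat line refers to the message logged just before it.
    """
    counts = Counter()
    previous = None  # last tracked notice, used to expand repeat lines
    for line in lines:
        repeat = REPEAT.search(line)
        if repeat:
            if previous is not None:
                counts[previous] += int(repeat.group(1))
            continue
        if TIMEOUT in line:
            previous = TIMEOUT
        elif UNABLE in line:
            previous = UNABLE
        else:
            # An unrelated message: later repeat lines do not belong to us.
            previous = None
            continue
        counts[previous] += 1
    return counts


# First five lines of the excerpt in the Description, used as sample input.
SAMPLE = """\
May 10 03:10:28 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:11:28 system last message repeated 2 times
May 10 03:11:53 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:11:58 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:13:58 system last message repeated 5 times
""".splitlines()

if __name__ == "__main__":
    for message, count in tally_storm_messages(SAMPLE).items():
        print(f"{count:6d}  {message}")
```

On a live domain the same function can be given an open file object, for example open("/var/adm/messages"), instead of SAMPLE. A count that keeps rising after the patches listed under Cause are installed suggests the storm is still in progress.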