Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1010705.1
Update Date: 2011-01-12
Keywords:

Solution Type  Problem Resolution Sure

Solution  1010705.1 :   Reboot of one domain, 2 others dstop with “Timeout on head of CI queue”  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

Previously Published As
214778


Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
Sun SPARC Sun OS
All Platforms

Symptoms

A platform was upgraded from SMS 1.3 to SMS 1.4.1. After the upgrade, the domains had to be rebooted for the new firmware to take effect.

One domain was shut down with setkeyswitch, and within a short time two other domains crashed with a dstop. The dstops look like this:

redxl> dumpf load dsmd.dstop.040916.1721.32
Created Thu Sep 16 17:21:34 2004

By hpost v. 1.4.1 Generic 117371-04 Jul 21 2004 12:53:56 executing as pid=6465
On ssc name: ssc0001.
Domain = 11=L = su01234 Platform = ssc0100 ChHostID = 5014936601194
Boards in dump: master SC CPs/CSBs[1:0]: 3
EXB[17:0]: 20000
Slot0[17:0]: 20000
Slot1[17:0]: 20000
-D option, -d
"DSMD DomainStop Dump"
0 errors occurred while creating this dump.

redxl> wfail
All master SDIs in this dump indicating valid error info [20000]
indicate the first error was Dstop for all of EXB EX17.
SDI EX17/S0 Master_Stop_Status0[31:0] = D00400CF
MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX17/S0 Dstop0[31:0] = 3001A000
Dstop0[16]: D DARB texp requests all Dstop (M)
Dstop0[28]: D Slot0 asserted Error, enabled to cause Dstop (M)
Dstop0[29]: D 1E Slot1 asserted Error, enabled to cause Dstop (M)
EPLD IO17 Err1_Dom1: Mask= B0 Err= 40 1stErr= 40
Err1[6]: 1E+ Error reported by BBC0
BBC IO17/BB0 Device_Err_Stat[31:0] = 80008100
DevErr[ 8]: 1E Port 0 Safari device asserted error
PCI IOC IO17/P0 Safari_Err_Log[63:0] = 80000000 00000200
ErrLog[ 9]: ErrOut Timeout on head of CI queue
ErrLog[63]: Error Out asserted (S_ERROR_L pin)
FAIL Port IO17/P0: Dstop detected by BBC IO17/BB0.
Primary service FRU is Slot IO17.
DARB C0: enabled ports (expanders) [17:0]: 28F57
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 20000
DARB C1: enabled ports (expanders) [17:0]: 28F57
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 20000
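The register values in the dump are hexadecimal bit masks, and wfail annotates individual bits by position (for example ErrLog[9] and ErrLog[63]). As a reading aid only, the following minimal Python sketch lists the set bit positions of a few values copied from the dump above. The helper name set_bits is hypothetical and is not part of redx or SMS; bit 0 is assumed to be the least significant bit, which agrees with the annotations wfail prints.

```python
def set_bits(hex_value: str) -> list[int]:
    """Return the positions of the set bits in a hex register value.

    Bit 0 is the least significant bit. Spaces inside the value are
    ignored so a 64-bit register printed as two words can be pasted as-is.
    """
    value = int(hex_value.replace(" ", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Values copied from the dump; the dictionary keys are just labels.
registers = {
    "Slot0[17:0]": "20000",
    "Safari_Err_Log[63:0]": "80000000 00000200",
    "DARB enabled ports [17:0]": "28F57",
}

for name, raw in registers.items():
    print(f"{name} = {raw} -> bits {set_bits(raw)}")
```

Read this way, the board mask 20000 has only bit 17 set, matching the EX17/IO17 references in the dump, and the Safari_Err_Log value has bits 9 and 63 set, matching the two ErrLog lines that wfail reports.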

Cause

An out-of-date SMS backup file was restored during the SMS upgrade, which left the information in the .pcd files in conflict with the information in the platform's registers. As a result, part of the PCD (Platform Configuration Database), stored in the files under $SMSVAR/SMS/.pcd/, contains information that is inconsistent with the actual current domain layout. This causes problems during setkeyswitch operations.

Solution

Run setkeyswitch off and then setkeyswitch on for the domains; this rewrites the PCD information.
This should preferably be done after the applications on the domains have been shut down, because it is very difficult to anticipate what will happen in this scenario.
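The off/on sequence can be planned before anything is executed. The sketch below is a dry run under stated assumptions: it only builds and prints the setkeyswitch command lines and runs nothing. The domain identifiers A and B are hypothetical placeholders, the function name keyswitch_cycle_commands is made up for illustration, and cycling one domain at a time is an assumption, since the solution text does not prescribe an order.

```python
def keyswitch_cycle_commands(domains):
    """Build the 'off' then 'on' setkeyswitch commands for each domain.

    Nothing is executed here; the result is a list of argument lists
    that an operator can review before running them on the SC.
    """
    commands = []
    for domain in domains:
        commands.append(["setkeyswitch", "-d", domain, "off"])
        commands.append(["setkeyswitch", "-d", domain, "on"])
    return commands


# Hypothetical domain identifiers; replace with the affected domains.
for command in keyswitch_cycle_commands(["A", "B"]):
    print(" ".join(command))
```

Printing the plan first makes it easy to confirm that applications have been stopped on each listed domain before its keyswitch is turned off.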

Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments
Reboot of one domain, 2 others dstop with “Timeout on head of CI queue”

This is related to CASM bug 6592200,
submitted Aug 13 2007.
If this timeout is received, check the CASM allocations to make
sure they are correct.

Each slot that has a system board should also have a CASM assigned.
Please escalate any cases that fit this bug.

reboot, domain, dstop, keyswitch, pcd, corruption, inconsistency, smsbackup, smsrestore, setkeyswitch, crash, panic
Previously Published As 78300

  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.