SPARCstorage Array Error Messages explanation

Asset ID:	1-71-1008712.1
Update Date:	2012-07-31
Keywords:

Solution Type Technical Instruction Sure

Solution 1008712.1 : SPARCstorage Array Error Messages explanation

Related Items


Sun SPARCstorage Array


PreviouslyPublishedAs
211963

Description
This document explains SPARCstorage Array Error Messages.

Steps to Follow
Message:

Dec  9 03:31:40 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: FW -
disk drive (3, 3) failed
Self-explanatory: the firmware reports that disk drive (3, 3) has failed.
Replace the disk drive.
Messages:
Dec  9 03:31:42 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3
(ssd48):
Dec  9 03:31:42 mtipoc05 unix:  Transport error:  timeout
Dec  9 03:31:42 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3
(ssd48):
Dec  9 03:31:42 mtipoc05 unix:  SCSI transport failed: reason 'ti
Dec  9 03:31:42 mtipoc05 unix: meout': retrying command
Dec  9 03:31:42 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3
(ssd48):
Dec  9 03:31:42 mtipoc05 unix:  SCSI transport failed: reason 're
Dec  9 03:31:42 mtipoc05 unix: set': retrying command
Dec  9 03:31:42 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3
(ssd48):
Dec  9 03:31:42 mtipoc05 unix:  SCSI transport failed: reason 'in
Dec  9 03:31:42 mtipoc05 unix: complete': retrying command
The disk is not responding to a process initiated by the transport (in this
case the SSA/ISP firmware); that is, the disk is not responding to the ISP chip.
Because the commands are timing out, this is a sign that the disk drive is
failing.
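In the log above, each "SCSI transport failed" message is split in the
middle of the reason string ('ti' + 'meout'). The following is a minimal
Python sketch of how such fragments could be rejoined before searching for
the reason. The function name and the regular expressions are illustrative
assumptions, not part of any Sun tool, and the sketch assumes that the two
fragments of a message are adjacent in the log.

```python
import re

# Syslog prefix in front of every kernel message fragment,
# e.g. "Dec  9 03:31:42 mtipoc05 unix: " (pattern is an assumption).
PREFIX = re.compile(r"^[A-Z][a-z]{2}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+unix:\s?")
REASON = re.compile(r"SCSI transport failed: reason '([^']+)'")


def transport_failure_reasons(lines):
    """Return the reasons of all 'SCSI transport failed' messages, in order."""
    # Strip the prefix from each line and glue the payloads together, so a
    # reason split across two log lines becomes one string again.
    payload = "".join(PREFIX.sub("", line.rstrip("\n")) for line in lines)
    return REASON.findall(payload)
```

Applied to the log lines shown above, this yields the reasons timeout,
reset and incomplete. It does not handle fragments that are interleaved
with other messages.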
Message:
Dec  9 03:31:07 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3
(ssd48):
Dec  9 03:31:07 mtipoc05 unix:  Error for command 'write(10)' Err
Dec  9 03:31:07 mtipoc05 unix: or Level: Retryable
Dec  9 03:31:07 mtipoc05 unix:  Requested Block 1744512, Error Block: 1744512
Dec  9 03:31:07 mtipoc05 unix:  Sense Key: Hardware Error
Dec  9 03:31:07 mtipoc05 unix:  Vendor 'SEAGATE':
Dec  9 03:31:07 mtipoc05 unix:         ASC = 0x9 (track following error), ASCQ =
0x0, FRU = 0xfb
These are retryable I/O block errors. If the retryable errors persist, repair
the blocks with the format utility. If the repair succeeds, no further action
is needed. If it does not, add the blocks to the bad block list of the drive.
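One way to judge whether retryable errors are persistent is to count how
often the same error block recurs in the log. The sketch below does this in
Python. The function name and the threshold of three occurrences are
arbitrary assumptions, and for brevity the sketch does not separate the
counts per disk.

```python
import re
from collections import Counter

# Matches both "Error Block: 1744512" layouts that appear in this article.
ERROR_BLOCK = re.compile(r"Error Block:\s*(\d+)")


def persistent_error_blocks(lines, threshold=3):
    """Return block numbers reported in error at least `threshold` times."""
    counts = Counter(
        int(match.group(1))
        for line in lines
        for match in ERROR_BLOCK.finditer(line)
    )
    return sorted(block for block, seen in counts.items() if seen >= threshold)
```

Blocks returned by such a check would be the candidates for repair.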
Messages:
Dec  5 11:48:24 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: FW -
disk drive
Dec  5 11:49:53 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: FW -
disk drive (3, 3) is ok
The Fast-Write (FW) feature has detected a problem and is backing off to allow
error recovery to occur.
Later, it reports that a disk drive that was in error is now good.
The fact that the Fast-Write feature backs off and then comes back on again
means that, to the controller, disk drives appear to go bad and then come
good again within 20-30 seconds of each other.
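The bad-then-good pattern described above can be looked for mechanically.
The following Python sketch pairs each "failed" message with a later "is ok"
message for the same drive and reports the drive if both fall inside a time
window. It is only an illustration: the function name is hypothetical, the
30-second default is taken from the figure quoted above, and the sketch
assumes that each message occupies a single log line (the excerpts in this
article are wrapped) and that all timestamps belong to the same year.

```python
import re
from datetime import datetime, timedelta

EVENT = re.compile(r"FW - disk drive \((\d+), (\d+)\) (failed|is ok)")


def flapping_drives(lines, window=timedelta(seconds=30)):
    """Return drives that went from 'failed' to 'is ok' within `window`."""
    failed_at = {}
    flaps = []
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue
        # The first three fields are month, day and time; syslog has no year.
        stamp = datetime.strptime(" ".join(line.split()[:3]), "%b %d %H:%M:%S")
        drive = (int(match.group(1)), int(match.group(2)))
        if match.group(3) == "failed":
            failed_at[drive] = stamp
        elif drive in failed_at and stamp - failed_at.pop(drive) <= window:
            flaps.append(drive)
    return flaps
```

A drive that shows up repeatedly in the result is a candidate for closer
inspection.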
Message:
Mar 13 14:43:56 mtinsc05 unix: ID[SUNWssa.soc.link.1010] soc2: message:  SSA
NOTICE: ISP3 - not a response packet
The SSA firmware is complaining about an entire ISP packet being incorrect.
This might mean that one of the ISPs on board the SPARCstorage Array controller
is defective. Therefore, the controller should be replaced.
Messages:
Dec 16 20:04:42 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3
(ssd48):
Dec 16 20:04:42 mtipoc05 unix:  disk not responding to selection
Dec 16 20:04:42 mtipoc05 unix:
The disk did not respond to selection, so it might be a bad disk drive.
Messages:
Dec 16 20:24:32 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516 (SUNW,pln1):
Dec 16 20:24:32 mtipoc05 unix:  timing out packet(s);  flushing trans
Dec 16 20:24:33 mtipoc05 unix: port (Timeout recovery being invoked...)
Dec 16 20:24:33 mtipoc05 unix: ID[SUNWssa.soc.link.5010] soc1: port 0: Fibre
Channel is OFFLINE
Dec 16 20:24:33 mtipoc05 unix: ID[SUNWssa.soc.link.6010] soc1: port 0: Fibre
Channel is ONLINE
Dec 16 20:24:33 mtipoc05 unix: ID[SUNWssa.soc.login.6010] soc1: Fibre Channel
login succeeded
Dec 16 20:25:50 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516 (SUNW,pln1):
Dec 16 20:25:50 mtipoc05 unix:  timing out packet(s);  flushing trans
Dec 16 20:25:51 mtipoc05 unix: port (Timeout recovery being invoked...)
Dec 16 20:25:51 mtipoc05 unix: ID[SUNWssa.soc.link.5010] soc1: port 0: Fibre
Channel is OFFLINE
Dec 16 20:25:51 mtipoc05 unix: ID[SUNWssa.soc.link.6010] soc1: port 0: Fibre
Channel is ONLINE
Dec 16 20:25:51 mtipoc05 unix: ID[SUNWssa.soc.login.6010] soc1: Fibre Channel
login succeeded
Dec 16 20:27:09 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516 (SUNW,pln1):
Dec 16 20:27:09 mtipoc05 unix:  timing out packet(s);  flushing trans
Dec 16 20:27:09 mtipoc05 unix: port (Timeout recovery being invoked...)
Dec 16 20:27:09 mtipoc05 unix: ID[SUNWssa.soc.link.5010] soc1: port 0: Fibre
Channel is OFFLINE
Dec 16 20:27:09 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@1,0
(ssd35):
Dec 16 20:27:09 mtipoc05 unix:  Transport error:  Fibre Channel O
Dec 16 20:27:09 mtipoc05 unix: ffline
Dec 16 20:27:09 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@1,0
(ssd35):
Dec 16 20:27:09 mtipoc05 unix:  Transport error:  Fibre Channel O
Dec 16 20:27:09 mtipoc05 unix: ffline
Dec 16 20:27:09 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@1,0
(ssd35):
Dec 16 20:27:09 mtipoc05 unix:  SCSI transport failed: reason 'tr
Dec 16 20:27:09 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@1,0
(ssd35):
Dec 16 20:27:09 mtipoc05 unix:  Transport error:  Fibre Channel O
Dec 16 20:27:09 mtipoc05 unix: ffline
Dec 16 20:27:09 mtipoc05 unix: an_err': giving up: transport retries exhausted
Explanation:
This is a "Timeout Recovery" loop that is handled by the SSA driver.
The pluto (that is, the SSA) has trouble accepting commands, and does not
acknowledge that it has received or processed them.
The pln device driver, responsible for transporting commands, retries them.
Currently, the number of retries is set to two, so the total number of
attempts is three.  The pluto hardware responds to none of the attempts
because it is broken.  New commands issued to any disk on that pluto get
queued up in the pln driver.
The pln device driver then issues a quick command to verify that the pluto
controller itself responds;  it asks the pluto controller: "Are you there?"
At the same time, it issues a similar command to each disk.
If the controller or ANY disk in that pluto is not present (i.e. ANY
response does not come back), the pln driver gives up waiting for them.
At this time, pln fails all commands queued for any disk on that pluto back to
the ssd driver, and from this time until that pluto is reinstated, all new
commands to that pluto are rejected immediately with messages like:
Dec 16 20:27:59 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@2,0
(ssd40):
Dec 16 20:27:59 mtipoc05 unix:  transport rejected (-2)
Failed commands are returned to ssd, which is told not to retry them because
the pln retries described above were exhausted.
Dec 16 20:29:51 mtipoc05 unix: pln1: offline state recovery check in progress...
The pln device driver also asks the controller and all disks
"Are you there?"
at a configurable interval after the SSA has been made unavailable.  If
at some point the controller and ALL disks answer within a reasonable
amount of time, the pln driver puts the pluto back into service and that
SSA resumes normal operation.
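The recovery loop described in this explanation can be summarised as a small
state machine. The Python sketch below is a toy model written for this
article, not driver code: the class, its attributes and the log strings are
all hypothetical, and the case where every attempt times out but all probes
are answered is not covered by the text, so the model simply fails the
command and stays online.

```python
from dataclasses import dataclass, field

MAX_RETRIES = 2  # per the text: two retries, so three attempts in total


@dataclass
class PlnModel:
    """Toy model of the pln timeout recovery loop (illustration only)."""
    array_responds: bool = True  # does the array acknowledge commands?
    all_present: bool = True     # do the controller and ALL disks answer probes?
    online: bool = True
    log: list = field(default_factory=list)

    def submit(self, command):
        """Try to transport a command; return True if it was accepted."""
        if not self.online:
            # While the array is offline, new commands are rejected at once.
            self.log.append(f"{command}: transport rejected")
            return False
        for attempt in range(1 + MAX_RETRIES):
            if self.array_responds:
                return True
            self.log.append(f"{command}: timing out packet (attempt {attempt + 1})")
        # Every attempt timed out: ask controller and disks "Are you there?"
        if not self.all_present:
            self.online = False
            self.log.append("array offline; queued commands failed back to ssd")
        return False

    def recovery_check(self):
        """Periodic 'Are you there?' poll made while the array is offline."""
        if not self.online and self.all_present:
            self.online = True
            self.log.append("array back in service")
        return self.online
```

With array_responds and all_present both set to False, the first submit()
records three timeouts and takes the model offline, later submits are
rejected immediately, and recovery_check() brings it back only once
all_present is True again.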
Message:
Dec 17 00:17:25 mtipoc05 unix: WARNING:
/io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,2
(ssd47):
Dec 17 00:17:25 mtipoc05 unix:  device busy too long
These are "status-bit"-based messages, generated when the driver has been
unable to gather any ARQ (Auto Request Sense) information.  The driver then
relies on some status bits inside the SCSI packet to estimate what was
going on.  "device busy too long" is a generic device-busy condition that
is not associated with a reservation conflict or with the ISP queues
being full.
Messages:
Dec 17 03:36:19 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message:
Correctable Error in NVRAM loc 0x000173e8
Dec 17 03:36:19 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message:
Correctable Error in NVRAM loc 0x00014eb0
Dec 17 03:36:19 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message:
Correctable Error in NVRAM loc 0x0001c478
Dec 17 03:38:16 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message:
Correctable Error in NVRAM loc 0x000173e8
Recurring correctable NVRAM errors like these suggest that the controller is
becoming faulty.
Message:
Mar 13 14:43:49 mtinsc05 unix: WARNING:
/io-unit@f,e2200000/sbi@0,0/SUNW,soc@3,0/SUNW,pln@a0000000,78a916/ssd@2,1
(ssd225):
Mar 13 14:43:49 mtinsc05 unix:  Error for Command: read(10)    Error Level: Retryable
Mar 13 14:43:49 mtinsc05 unix:  Requested Block: 3430034        Error Block: 3430034
Mar 13 14:43:49 mtinsc05 unix:  Vendor: SEAGATE                 Serial Number: 00836544
Mar 13 14:43:49 mtinsc05 unix:  Sense Key: Aborted Command
Mar 13 14:43:49 mtinsc05 unix:  ASC: 0xb2 (<vendor unique code 0xb2>), ASCQ: 0x0, FRU: 0x0
This is a standard disk error message seen when a disk responds to
ARQ (Auto Request Sense), which is a process initiated by the transport
(the SSA/ISP firmware in this case).  That is, the ISP chip and the disk
have communicated about an error condition, and these are the results.
The ASC code "0xb2" means that the ISP chip decided to reset the disk
and it is informing us of this event.  The fact that the error is
Retryable effectively means that the ISP failed the I/O and reset the
disk for us in the hope that the next attempt will succeed, perhaps
because the disk-firmware reported some form of internal soft error.
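When many such messages have to be reviewed, the ASC, ASCQ and FRU values
can be pulled out of the log text programmatically. The Python sketch below
accepts both layouts that appear in this article ("ASC = 0x9" and
"ASC: 0xb2"). The function names are hypothetical, and the small lookup
table contains only the two ASC values discussed in this article; it is not
a complete list of codes.

```python
import re

# Field name, then ':' or '=', then a hexadecimal value.
FIELD = re.compile(r"\b(ASC|ASCQ|FRU)\s*[:=]\s*(0x[0-9a-fA-F]+)")

# Only the two ASC values explained in this article.
ASC_NOTES = {
    0x09: "track following error",
    0xB2: "ISP chip reset the disk",
}


def sense_fields(text):
    """Return the ASC, ASCQ and FRU values found in `text` as integers."""
    return {name: int(value, 16) for name, value in FIELD.findall(text)}


def describe(fields):
    """Return the article's note for the ASC value, if it has one."""
    return ASC_NOTES.get(fields.get("ASC"), "not covered in this article")
```

For the message above, sense_fields() gives ASC 0xb2 with ASCQ and FRU
both zero, and describe() maps that ASC to the ISP reset described in the
text.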

Product
SPARCstorage RSM
SPARCstorage Array Model 200
SPARCstorage Array Model 100

Internal Comments
From a series of post-mortems with AT&T WorldNet and escalation.

Troubleshoot, Troubleshooting
Previously Published As
17079

Change History
Date: 2003-05-20
User Name: Administrator
Action: Migration from KMSCreator
Comment: updated by : Matthew Shattuck
comment : Document Cleanup effort

date : Jun 13, 2002

updated by : Thom Chumley
comment : Not entered
date : May 13, 1998
Version: 0
