Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Technical Instruction Sure

Solution 1008712.1: SPARCstorage Array Error Messages explanation
Previously Published As: 211963

Description

This document explains SPARCstorage Array error messages.

Steps to Follow

Message:

    Dec 9 03:31:40 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: FW - disk drive (3, 3) failed

Self-explanatory. Replace the disk drive.

Messages:

    Dec 9 03:31:42 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3 (ssd48):
    Dec 9 03:31:42 mtipoc05 unix: Transport error: timeout
    Dec 9 03:31:42 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3 (ssd48):
    Dec 9 03:31:42 mtipoc05 unix: SCSI transport failed: reason 'ti
    Dec 9 03:31:42 mtipoc05 unix: meout': retrying command
    Dec 9 03:31:42 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3 (ssd48):
    Dec 9 03:31:42 mtipoc05 unix: SCSI transport failed: reason 're
    Dec 9 03:31:42 mtipoc05 unix: set': retrying command
    Dec 9 03:31:42 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3 (ssd48):
    Dec 9 03:31:42 mtipoc05 unix: SCSI transport failed: reason 'in
    Dec 9 03:31:42 mtipoc05 unix: complete': retrying command

The disk is not responding to a process initiated by the transport (the SSA/ISP firmware in this case); that is, the disk is not responding to the ISP chip. Because the commands are timing out, this is a sign that the disk drive is failing.

Message:

    Dec 9 03:31:07 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3 (ssd48):
    Dec 9 03:31:07 mtipoc05 unix: Error for command 'write(10)' Err
    Dec 9 03:31:07 mtipoc05 unix: or Level: Retryable
    Dec 9 03:31:07 mtipoc05 unix: Requested Block 1744512, Error Block: 1744512
    Dec 9 03:31:07 mtipoc05 unix: Sense Key: Hardware Error
    Dec 9 03:31:07 mtipoc05 unix: Vendor 'SEAGATE':
    Dec 9 03:31:07 mtipoc05 unix: ASC = 0x9 (track following error), ASCQ = 0x0, FRU = 0xfb

These are retryable I/O block errors. If the retryable errors persist, repair the affected blocks with the format utility.
If the repair succeeds, no further action is needed. If not, add the blocks to the bad block list of the drive.

Messages:

    Dec 5 11:48:24 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: FW - disk drive
    Dec 5 11:49:53 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: FW - disk drive (3, 3) is ok

The Fast-Write feature has detected a problem and is backing off to allow error recovery to occur. Later, it reports that a disk drive which was in error is now good. The fact that Fast-Write backs off and then comes back on again means that disk drives appear to the controller to be going bad and then coming good again within 20-30 seconds of each other.

Message:

    Mar 13 14:43:56 mtinsc05 unix: ID[SUNWssa.soc.link.1010] soc2: message: SSA NOTICE: ISP3 - not a response packet

The SSA firmware is complaining that an entire ISP packet is incorrect. This might mean that one of the ISPs on board the SPARCstorage Array controller is defective, in which case the controller should be replaced.

Messages:

    Dec 16 20:04:42 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,3 (ssd48):
    Dec 16 20:04:42 mtipoc05 unix: disk not responding to selection
    Dec 16 20:04:42 mtipoc05 unix:

It might be a bad disk drive.

Messages:

    Dec 16 20:24:32 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516 (SUNW,pln1):
    Dec 16 20:24:32 mtipoc05 unix: timing out packet(s); flushing trans
    Dec 16 20:24:33 mtipoc05 unix: port (Timeout recovery being invoked...)
    Dec 16 20:24:33 mtipoc05 unix: ID[SUNWssa.soc.link.5010] soc1: port 0: Fibre Channel is OFFLINE
    Dec 16 20:24:33 mtipoc05 unix: ID[SUNWssa.soc.link.6010] soc1: port 0: Fibre Channel is ONLINE
    Dec 16 20:24:33 mtipoc05 unix: ID[SUNWssa.soc.login.6010] soc1: Fibre Channel login succeeded
    Dec 16 20:25:50 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516 (SUNW,pln1):
    Dec 16 20:25:50 mtipoc05 unix: timing out packet(s); flushing trans
    Dec 16 20:25:51 mtipoc05 unix: port (Timeout recovery being invoked...)
    Dec 16 20:25:51 mtipoc05 unix: ID[SUNWssa.soc.link.5010] soc1: port 0: Fibre Channel is OFFLINE
    Dec 16 20:25:51 mtipoc05 unix: ID[SUNWssa.soc.link.6010] soc1: port 0: Fibre Channel is ONLINE
    Dec 16 20:25:51 mtipoc05 unix: ID[SUNWssa.soc.login.6010] soc1: Fibre Channel login succeeded
    Dec 16 20:27:09 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516 (SUNW,pln1):
    Dec 16 20:27:09 mtipoc05 unix: timing out packet(s); flushing trans
    Dec 16 20:27:09 mtipoc05 unix: port (Timeout recovery being invoked...)
    Dec 16 20:27:09 mtipoc05 unix: ID[SUNWssa.soc.link.5010] soc1: port 0: Fibre Channel is OFFLINE
    Dec 16 20:27:09 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@1,0 (ssd35):
    Dec 16 20:27:09 mtipoc05 unix: Transport error: Fibre Channel O
    Dec 16 20:27:09 mtipoc05 unix: ffline
    Dec 16 20:27:09 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@1,0 (ssd35):
    Dec 16 20:27:09 mtipoc05 unix: Transport error: Fibre Channel O
    Dec 16 20:27:09 mtipoc05 unix: ffline
    Dec 16 20:27:09 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@1,0 (ssd35):
    Dec 16 20:27:09 mtipoc05 unix: SCSI transport failed: reason 'tr
    Dec 16 20:27:09 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@1,0 (ssd35):
    Dec 16 20:27:09 mtipoc05 unix: Transport error: Fibre Channel O
    Dec 16 20:27:09 mtipoc05 unix: ffline
    Dec 16 20:27:09 mtipoc05 unix: an_err': giving up: transport retries exhausted

Explanation:

This is a "Timeout Recovery" loop, which is handled by the SSA driver. The pluto (the SPARCstorage Array) has trouble accepting commands and does not respond that it has received or processed them. The pln device driver, which is responsible for transporting commands, retries them. Currently, the number of retries is set to two, so the total number of attempts is three. The pluto hardware responds to none of the attempts because it is broken.

New commands issued to any disk on that pluto are queued in the pln driver. The pln device driver issues a quick command to verify that the pluto controller itself responds; in effect, it asks the pluto controller: "Are you there?" At the same time, it issues a similar command to each disk. If the controller or ANY disk in that pluto is not present (that is, if ANY response fails to come back), the pln driver gives up waiting for them.
At this point, pln fails all commands queued for any disk on that pluto back to the ssd driver. From then until that pluto is reinstated, all new commands to that pluto are rejected immediately with messages like:

    Dec 16 20:27:59 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@2,0 (ssd40):
    Dec 16 20:27:59 mtipoc05 unix: transport rejected (-2)

Failed commands are returned to ssd, which is told not to retry them because the pln retries described above were exhausted.

    Dec 16 20:29:51 mtipoc05 unix: pln1: offline state recovery check in progress...

After the SSA has been made unavailable, the pln device driver also asks the controller and all disks "Are you there?" at a regular (configurable) interval. If at some point the controller and ALL disks answer within a reasonable amount of time, the pln driver puts the pluto back into service and that SSA resumes normal operation.

Message:

    Dec 17 00:17:25 mtipoc05 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0/SUNW,pln@a0000000,78a516/ssd@3,2 (ssd47):
    Dec 17 00:17:25 mtipoc05 unix: device busy too long

These are "status-bit"-based messages, generated when the driver has been unable to gather any ARQ (Auto-ReQuest sense) information. The driver then relies on status bits inside the SCSI packet to make a best guess at what was going on. "device busy too long" is a generic device-busy condition that is not associated with a reservation conflict or with full ISP queues.
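The "Timeout Recovery" behaviour explained above (three attempts, an "Are you there?" probe, immediate rejection while offline, and periodic recovery checks) can be pictured as a small state machine. The following Python sketch is only a toy model of that description, not the pln driver code: the class name `PlnModel`, the callables `send` and `probe_all`, and the log strings are all hypothetical, and command queueing is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_RETRIES = 2  # per the text: two retries, so three attempts in total


@dataclass
class PlnModel:
    """Toy model of the recovery behaviour described above (not the real driver)."""
    send: Callable[[str], bool]    # hypothetical: True if the array answered the command
    probe_all: Callable[[], bool]  # hypothetical: True if controller and ALL disks answered
    online: bool = True
    log: List[str] = field(default_factory=list)

    def submit(self, cmd: str) -> bool:
        # While the array is offline, new commands are rejected immediately.
        if not self.online:
            self.log.append(f"{cmd}: transport rejected")
            return False
        # One initial attempt plus MAX_RETRIES retries.
        for attempt in range(1 + MAX_RETRIES):
            if self.send(cmd):
                return True
            self.log.append(f"{cmd}: timeout recovery (attempt {attempt + 1})")
        # Retries exhausted: ask the controller and every disk "Are you there?"
        if not self.probe_all():
            self.online = False
            self.log.append("array offline: failing queued commands")
        return False

    def recovery_check(self) -> bool:
        """Periodic check while offline; reinstate only if everything answers."""
        if not self.online and self.probe_all():
            self.online = True
            self.log.append("array back in service")
        return self.online
```

In this model a caller would invoke `recovery_check()` on a timer while the array is offline; the real interval is described above only as configurable, so no value is assumed here.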
Messages:

    Dec 17 03:36:19 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: Correctable Error in NVRAM loc 0x000173e8
    Dec 17 03:36:19 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: Correctable Error in NVRAM loc 0x00014eb0
    Dec 17 03:36:19 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: Correctable Error in NVRAM loc 0x0001c478
    Dec 17 03:38:16 mtipoc05 unix: ID[SUNWssa.soc.link.1010] soc1: message: Correctable Error in NVRAM loc 0x000173e8

It looks like the controller is becoming faulty.

Message:

    Mar 13 14:43:49 mtinsc05 unix: WARNING: /io-unit@f,e2200000/sbi@0,0/SUNW,soc@3,0/SUNW,pln@a0000000,78a916/ssd@2,1 (ssd225):
    Mar 13 14:43:49 mtinsc05 unix: Error for Command: read(10) Error Level: Retryable
    Mar 13 14:43:49 mtinsc05 unix: Requested Block: 3430034 Error Block: 3430034
    Mar 13 14:43:49 mtinsc05 unix: Vendor: SEAGATE Serial Number: 00836544
    Mar 13 14:43:49 mtinsc05 unix: Sense Key: Aborted Command
    Mar 13 14:43:49 mtinsc05 unix: ASC: 0xb2 (<vendor unique code 0xb2>), ASCQ: 0x0, FRU: 0x0

This is a standard disk error message, seen when a disk responds to ARQ (Auto-ReQuest sense), a process initiated by the transport (the SSA/ISP firmware in this case). In other words, the ISP chip and the disk have exchanged information about an error condition, and these are the results. The ASC code "0xb2" means that the ISP chip decided to reset the disk and is informing us of this event. The fact that the error is Retryable effectively means that the ISP failed the I/O and reset the disk in the hope that the next attempt will succeed, perhaps because the disk firmware reported some form of internal soft error.

Product

SPARCstorage RSM
SPARCstorage Array Model 200
SPARCstorage Array Model 100

Internal Comments

From a series of post-mortems with AT&T WorldNet and escalation.
Troubleshoot, Troubleshooting

Previously Published As: 17079

Change History

Date: 2003-05-20
User Name: Administrator
Action: Migration from KMSCreator
Comment:
updated by: Matthew Shattuck, comment: Document Cleanup effort, date: Jun 13, 2002
updated by: Thom Chumley, comment: Not entered, date: May 13, 1998
Version: 0

Attachments

This solution has no attachment