Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1363076.1 : Tape - Multiple Drives Showing Status of Not functional
In this Document

Created from <SR 3-4543964501>
Applies to:

Sun StorageTek 9940 Tape Drive - Version: Not Applicable
LTO Tape Drive - Version: Not Applicable and later [Release: N/A and later]
Sun StorageTek T10000 Tape Drive - Version: Not Applicable and later [Release: N/A and later]
Sun StorageTek 9840 Tape Drive - Version: Not Applicable and later [Release: N/A and later]
Information in this document applies to any platform.

Symptoms

This instance involved HP LTO3 drives in an L180 library. There were various issues; the main one perceived by the customer was tapes being left in drives, with FSC 3e24 recorded in the L180.

Here is the L180 FSC log. The figure before the date is the number of these events since the library was last reset; the date is that of the last event.

3e13 Warning       DRIVE_02_00 2 9/27/2011 18:39:20 DRV Drive not responding
3e0a Warning       DRIVE_02_00 1 9/27/2011 18:38:50 DRV TTI communication to drive failed
3e1c Warning       DRIVE_09_00 1 9/27/2011 18:35:39 DRV drive reset detected
3e13 Warning       DRIVE_01_00 3 9/27/2011 18:35:38 DRV Drive not responding
3e1c Warning       DRIVE_00_00 1 9/27/2011 18:35:38 DRV drive reset detected
309e Warning       NONE        3 9/27/2011 18:33:00 Cartridge access door is open. All robot activity in the library has been aborted.
3e24 Warning       DRIVE_09_00 3 9/27/2011 18:25:31 DRV Wait for Rewind Failed - Loaded State reported during wait
3e24 Warning       DRIVE_01_00 6 9/27/2011 18:25:05 DRV Wait for Rewind Failed - Loaded State reported during wait
3e24 Warning       DRIVE_00_00 4 9/27/2011 18:23:23 DRV Wait for Rewind Failed - Loaded State reported during wait
3e24 Warning       DRIVE_05_00 1 9/27/2011 18:08:12 DRV Wait for Rewind Failed - Loaded State reported during wait
3a32 Warning       NONE        7 9/27/2011 17:41:24 OPI unable to get the Clean Warn Count Info From IFM.
3e1c Warning       DRIVE_05_00 1 9/27/2011 17:06:59 DRV drive reset detected
3e1c Warning       DRIVE_07_00 1 9/27/2011 17:06:58 DRV drive reset detected
3e1c Warning       DRIVE_06_00 1 9/27/2011 17:06:58 DRV drive reset detected
3e13 Warning       DRIVE_08_00 1 9/27/2011 17:06:57 DRV Drive not responding
3e14 Informational DRIVE_09_00 1 9/27/2011 17:06:50 DRV Drive not connected
3e14 Informational DRIVE_03_00 1 9/27/2011 17:04:39 DRV Drive not connected
3e14 Informational DRIVE_04_00 1 9/27/2011 17:04:38 DRV Drive not connected
3e14 Informational DRIVE_00_00 1 9/27/2011 17:04:34 DRV Drive not connected
3e14 Informational DRIVE_01_00 1 9/27/2011 17:04:33 DRV Drive not connected

FSC 3e24 errors are seen, but as usual they are a symptom, not the problem. FSC 3e24 means the drive was found not to have unloaded in time for a dismount command. On the L180/L700, FSC 3e24 usually means a server process died, a connection was lost, there was a misconfiguration, or something else got in the way of the life cycle of the mount. It is not a hardware failure.

Also seen are FSC 3e13 (Drive not responding) - again a symptom - and FSC 3e0a (TTI communication to drive failed) - another symptom.

Luckily we can see the cause of the mayhem right here in the L180 FSC log: FSC 3e1c, drive reset detected. A reset causes the drives to stop communicating with the library for a while, so you get other errors such as:

3e13 Warning       DRIVE_08_00 1 9/27/2011 17:06:57 DRV Drive not responding
3e14 Informational DRIVE_09_00 1 9/27/2011 17:06:50 DRV Drive not connected

The 3e24 entries ("Wait for Rewind Failed - Loaded State reported during wait") come from events higher in the stack.
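To see the trigger/symptom pattern at a glance, a rough Python sketch along these lines can tally an FSC log by code. This is not an official tool: the file name is hypothetical and the parser assumes the plain-text column order shown above.

import re
from collections import Counter

TRIGGER = {"3e1c"}                         # external drive reset
SYMPTOMS = {"3e13", "3e14", "3e24", "3e0a"}

counts = Counter()
with open("l180_fsc_log.txt") as log:      # hypothetical export of the FSC log
    for line in log:
        fields = line.split()
        if len(fields) < 4 or not fields[3].isdigit():
            continue                       # skip anything that is not an event row
        fsc = fields[0].lower()
        counts[fsc] += int(fields[3])      # column 4: events since last library reset

for fsc, n in counts.most_common():
    role = "TRIGGER" if fsc in TRIGGER else "symptom" if fsc in SYMPTOMS else "other"
    print(f"{fsc}  {n:3d}  {role}")

On a log like the one above this makes it obvious that the 3e1c resets precede the bursts of 3e13/3e14/3e24 symptoms.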
Drive resets are catastrophic to whatever the drive is doing at the time, yet they are attempts at recovery, not failures. They are external - not from the drive or the library. The drive must obey and reset itself regardless of what it is doing when the reset arrives. Please do not replace library FRUs or drives: a reset, like a timeout, is not a failure; it is a recovery action.

The customer was able to provide an explorer from a media server. Again we see chaos caused by drives being reset. Drives are reporting via SCSI sense that they have been reset:

Sep 25 05:32:31 scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0
Sep 25 06:12:36 scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0
Sep 25 06:52:41 scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0

The FC driver can also see the drives being reset (this initiator did not issue the resets):

Sep 23 23:30:07 FCP: WWN 0x500104f00059f3c9 reset successfully
Sep 23 23:33:42 FCP: WWN 0x500104f00059f3c9 reset successfully
Sep 23 23:33:43 FCP: WWN 0x500104f00059f3c9 reset successfully
Sep 23 23:37:06 FCP: WWN 0x500104f00059f3c9 reset successfully
Sep 23 23:40:26 FCP: WWN 0x500104f00059f3c9 reset successfully

There are also SAN configuration and device allocation issues resulting in reservation conflicts:

Sep 25 16:33:17 avrd[15582]: [ID 867986 daemon.notice] Reservation Conflict status from HP.ULTRIUM4-SCSI.009 (device 15)
Sep 26 04:01:16 avrd[15582]: [ID 205616 daemon.notice] Reservation Conflict status from HP.ULTRIUM4-SCSI.003 (device 3)

Devices also disappear from the fabric:

Sep 27 16:52:34 fctl: [ID 517869 kern.warning] WARNING: fp(0)::N_x Port with D_ID=1a0800, PWWN=500104f00059f3b7 disappeared from fabric
Sep 27 16:52:35 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=190800, PWWN=500104f00059f3bd disappeared from fabric
Sep 27 16:52:35 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=190900, PWWN=500104f00059f3ba disappeared from fabric
Sep 27 16:52:36 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=190a00, PWWN=500104f00059f3c0 disappeared from fabric
Sep 27 16:52:36 fctl: [ID 517869 kern.warning] WARNING: fp(0)::N_x Port with D_ID=1c1100, PWWN=500104f00059f3c3 disappeared from fabric
Sep 27 16:52:53 fctl: [ID 517869 kern.warning] WARNING: fp(0)::N_x Port with D_ID=1a0800, PWWN=500104f00059f3b7 reappeared in fabric
Sep 27 16:52:56 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=190900, PWWN=500104f00059f3ba reappeared in fabric
Sep 27 16:53:00 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=190800, PWWN=500104f00059f3bd reappeared in fabric
Sep 27 16:53:04 fctl: [ID 517869 kern.warning] WARNING: fp(6)::N_x Port with D_ID=190a00, PWWN=500104f00059f3c0 reappeared in fabric
Sep 27 16:53:06 fctl: [ID 517869 kern.warning] WARNING: fp(0)::N_x Port with D_ID=1c1100, PWWN=500104f00059f3c3 reappeared in fabric

There are other unusual non-hardware errors occurring here too:

Sep 27 11:12:22 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,emlxs@1/fp@0,0/st@w500104f0009cabff,0 (st5):
Sep 27 11:12:22 scsi: [ID 107833 kern.notice] ASC: 0x2a (mode parameters changed), ASCQ: 0x1, FRU: 0x0
Sep 27 15:05:00 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,emlxs@1,1/fp@0,0/st@w500104f00059f3bd,0 (st14):
Sep 27 15:05:00 scsi: [ID 107833 kern.notice] ASC: 0x28 (medium may have changed), ASCQ: 0x0, FRU: 0x0

These are both SCSI key 02 (Unit Attention) events, not media or hardware problems.
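To quantify how often the drives are being reset, a rough Python pass over the messages file from the explorer can count both reset signatures quoted above. The file name and the exact message formats are assumptions based on these excerpts.

import re
from collections import Counter

SENSE_RE = re.compile(r"ASC: 0x29 \(bus device reset message occurred\)")
FCP_RE = re.compile(r"FCP: WWN (0x[0-9a-fA-F]+) reset successfully")

sense_resets = 0
fcp_resets = Counter()

with open("messages.txt") as f:          # hypothetical extract of /var/adm/messages
    for line in f:
        if SENSE_RE.search(line):
            sense_resets += 1            # a drive reporting via sense that it was reset
        m = FCP_RE.search(line)
        if m:
            fcp_resets[m.group(1)] += 1  # resets observed per target port WWN

print(f"SCSI sense reset notifications: {sense_resets}")
for wwn, n in fcp_resets.most_common():
    print(f"{wwn} reset {n} times")

Repeated hits against the same WWNs, as in the FCP lines above, point at an external initiator doing the resetting.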
Some of these also occur on disks. Are disks zoned in with the tapes?

Sep 27 01:26:34 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g4849544143484920373730313336323430303333 (ssd32):
Sep 27 01:26:34 scsi: [ID 107833 kern.notice] ASC: 0x2a (parameters changed), ASCQ: 0x0, FRU: 0x0

Neither the explorer nor the FSC log shows any tape drive or library hardware problem, so please stop replacing hardware in the L180.

Cause

These drives are being hit on the head with resets. Drives cannot hit themselves with an external reset, and libraries cannot reset drives like this. Something on the SAN is doing it.

The cause of resets, reservation conflicts, and devices disappearing from the fabric can be difficult to resolve when several media servers have access to the library. The reservation conflicts probably indicate that either NetBackup SSO is not set up properly or some other application or utility is accessing the tape drives. Device resets are usually a last-resort recovery action by an initiator trying to regain control of a process.

Solution

It turns out the customer had designed a backup solution of epic proportions. While customers do sometimes drive libraries with more than one media server, it is usually a small number. Vendors like HP anticipated this, so they designed into their drives the capability to track what seemed a huge number of initiators: 32. The industry did not expect more than a few media servers; there was no need, since one media server can serve many clients. That is still the case even with LTO5 drives - no one anticipated that anyone would design a solution with very many media servers. Many clients, yes, but not media servers.

This customer built a solution with 170 media servers, so it was doomed to failure. It is a completely unacceptable configuration. Media servers constantly disconnect from and reconnect to devices; each believes it is the only sentient being in the universe, so as a last resort a device reset may seem polite.
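As a sanity check on any similar configuration, a small sketch can compare the number of initiators zoned to each drive against the roughly 32 a drive can track, the figure discussed above. The zoning map and WWNs below are invented for illustration; real data would come from the switch zoning export.

MAX_TRACKED_INITIATORS = 32

zoning = {
    # drive target port WWN -> initiator (HBA) WWNs zoned to it (hypothetical)
    "500104f00059f3c9": {f"10000000c9{i:06x}" for i in range(170)},
}

for drive, initiators in zoning.items():
    if len(initiators) > MAX_TRACKED_INITIATORS:
        print(f"drive {drive}: {len(initiators)} initiators zoned, "
              f"more than the {MAX_TRACKED_INITIATORS} the drive can track "
              "- expect resets and reservation conflicts")

With 170 initiators against a limit of 32, the drive is over-subscribed more than five times over, which is exactly the kind of configuration that produces the resets seen here.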
Do not replace tape drives unless there is a genuine hardware problem.

Provided by Joe McNicholas.

Attachments

This solution has no attachment