Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1386502.1
Update Date:2012-05-11
Keywords:

Solution Type  Problem Resolution Sure

Solution  1386502.1 :   Replacement disk not recognized on X4540  


Related Items
  • Sun Fire X4500 Server
  • Sun Fire X4540 Server
Related Categories
  • PLA-Support>Sun Systems>x64>Server>SN-x64: AMD-STOR-SERVER




In this Document
Symptoms
Changes
Cause
Solution


Created from <SR 3-4853548441>

Applies to:

Sun Fire X4540 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire X4500 Server - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on x86 (32-bit)

Symptoms

The issue: in very rare circumstances, a replacement disk is not recognized by the system. When # cfgadm -al is issued, the customer sees an sd instance under the attachment point (Ap_Id) column instead of the actual target.

adcmohtsm01{root}% raidctl -l
Controller: 1
Disk: 0.0.0
'
'
Disk: 0.7.0
Controller: 2
Disk: 0.0.0
Disk: 0.1.0
Disk: 0.2.0
Disk: 0.3.0
Disk: 0.5.0
Disk: 0.6.0
Disk: 0.7.0

Discrepancy in cfgadm -alv

Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 disk connected configured unknown
c2::dsk/c2t1d0 disk connected configured unknown
c2::dsk/c2t2d0 disk connected configured unknown
c2::dsk/c2t3d0 disk connected configured unknown
c2::dsk/c2t5d0 disk connected configured unknown
c2::dsk/c2t6d0 disk connected configured unknown
c2::dsk/c2t7d0 disk connected configured unknown
c2::sd30 disk connected unconfigured unknown  <<<<<<<<<<<<<
(sd instance number recorded as the Attachment Point ID: a stale entry)
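
The stale entry can also be picked out of a longer listing mechanically. The following is a minimal sketch, not part of the original solution: find_stale_sd_entries is a hypothetical helper name, and it assumes a POSIX awk (on Solaris, nawk or /usr/xpg4/bin/awk may be needed). It reads cfgadm -al output on standard input and prints any Ap_Id of the form cN::sdNN.

```shell
# Hypothetical helper (not a Solaris command): print attachment points that
# show an sd instance (cN::sdNN) where a dsk/cNtNdN target is expected.
# Reads `cfgadm -al` style output on standard input.
find_stale_sd_entries() {
    # Field 1 is the Ap_Id column; match only "c<digits>::sd<digits>".
    awk '$1 ~ /^c[0-9]+::sd[0-9]+$/ { print $1 }'
}
```

Typical use would be # cfgadm -al | find_stale_sd_entries. Run against the listing above, it prints c2::sd30.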

This issue was specific to one customer and is considered an administration issue. It is documented here because it may be useful in similar cases.

Changes

The customer had replaced the disk.

Cause

Cause 1.
Quite possibly the disk replacement was not done according to the procedure
(refer to the links below for the procedure).
The customer should simply have unconfigured the disk and, once the ready-to-remove
light was lit, removed the disk and replaced it with the new disk.

Verify that the blue LED on the disk turns off after one minute. If the blue LED does not turn off after one minute, you can have the OS re-enumerate device nodes and links by typing:
# devfsadm -C

Cause 2.
Assumptions:
Buggy disk firmware, disk controller (LSI) firmware, or sd and cfgadm driver patches


Solution

Most customers will not agree to downtime, so the possible ways to recover from this situation without a reboot are:

Step a.
     - Pull out the disk
     - Run devfsadm -Cv                      # clears all stale entries
     - Run cfgadm -c unconfigure c2::sd30    # unconfigures the device

Step b.
     - Insert the disk
     - Run devfsadm -v                       # scans for new devices
     - Check the format output to see whether the disk is detected
     - Run cfgadm -c configure c2            # configures all the underlying devices on controller 2

If Steps a and b fail:

Step c.
Repeat the procedure using the steps below:
1. # cfgadm -c unconfigure c2::sd30     [unconfigure the device; if this fails, try step 2]
2. # cfgadm -x remove_device c2::sd30   [forceful removal]
3. # cfgadm -al                         [check that c2::sd30 is gone]
4. # cfgadm -x insert_device c2::dsk/c2t4d0
      or, if the above command fails, run
4a. # cfgadm -x insert_device c2
      then verify that the device c2::dsk/c2t4d0 has been added by running cfgadm -al
5. # cfgadm -c configure c2::dsk/c2t4d0
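
Because the Step c sequence is easy to mistype under pressure, it may help to print the exact commands for the affected attachment points before running anything. The following is a sketch only: print_step_c_commands is a hypothetical helper name, it assumes a POSIX shell (ksh or bash, not the legacy /bin/sh), and it only echoes the commands from Step c without executing cfgadm.

```shell
# Hypothetical dry-run helper: prints the Step c command sequence for review.
# It does not run cfgadm itself.
#   $1 = stale attachment point              (for example c2::sd30)
#   $2 = expected Ap_Id of the replaced disk (for example c2::dsk/c2t4d0)
print_step_c_commands() {
    stale_ap_id=$1
    new_ap_id=$2
    controller=${new_ap_id%%::*}   # text before the first "::", e.g. c2

    echo "cfgadm -c unconfigure $stale_ap_id"
    echo "cfgadm -x remove_device $stale_ap_id   # only if unconfigure fails"
    echo "cfgadm -al"
    echo "cfgadm -x insert_device $new_ap_id"
    echo "cfgadm -x insert_device $controller   # only if the previous insert fails"
    echo "cfgadm -c configure $new_ap_id"
}
```

For the case in this document the call would be print_step_c_commands c2::sd30 c2::dsk/c2t4d0. The printed lines can then be reviewed and entered one at a time, checking cfgadm -al between steps as described above.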

If the above steps fail, the customer is requested to reboot the server (cold reboot preferred).

Why:
Because devfsadm has not responded even after the unconfigure command was performed,
the corruption persists and the cx::sdxx entries are not removed.

This shows that something is wrong in cfgadm's state, so we request the customer to either
reboot the server or cold power-cycle it (preferred). This is the only resolution.

If the issue is not resolved following a reboot / cold power cycle, engage Oracle.

The TSC is requested to engage the drivers team, SN-DK Storage Drivers (MOS Group).

Why: because this is a driver problem.

Important Document Link
Replacing a drive that has not been explicitly failed by ZFS
Server Maintenance Documentation

Note: 1.
These types of issues are basically due to incorrect procedures followed during disk replacement or to a genuine driver bug (SN-DK Storage Drivers).

Customers should be advised to keep their system firmware, disk controller firmware, and disk firmware up to date.


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.