Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1388529.1
Update Date: 2012-03-23
Keywords:

Solution Type  Troubleshooting Sure

Solution  1388529.1 :   Sun Storage 7000 Unified Storage System: How to Troubleshoot ZFS Storage Pool Issues  


Related Items
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
  • .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage




In this Document
  Purpose
  Last Review Date
  Instructions for the Reader
  Troubleshooting Details
      Introduction
      Framing the Problem


Applies to:

Sun Storage 7110 Unified Storage System - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Sun Storage 7210 Unified Storage System - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Information in this document applies to any platform.
NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]
JBODs Model : [not dependent]
CLUSTER related : [not dependent]

Purpose

This document is provided to assist in troubleshooting ZFS issues in a Sun Storage 7000 Unified Storage System. Because the Sun Storage 7000 makes heavy use of ZFS and its features, it is often hard to decide whether an observed symptom is a pure ZFS issue, is related to a feature of the appliance, or is a performance topic.

The reverse also applies: a performance discussion can turn out to be a ZFS-related issue.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances.
The latest news about the Sun Storage 7000 Unified Storage System can be found in the following documents: Document 1432269.2 and Document 1416406.1.

Last Review Date

January 23, 2012

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Introduction

The most common problems related to ZFS pools fall into the following major scenarios:

  • The Sun Storage 7000 Unified Storage System is not booting due to problems with the system pool.
  • The data pool keeps the Sun Storage 7000 Unified Storage System so busy that it is not able to finish the boot sequence.
  • The Sun Storage 7000 Unified Storage System has booted, but there is trouble with the data pool.
  • All the services are fine and the data pool is healthy, but the system pool of the Sun Storage 7000 Unified Storage System has problems.

Framing the current situation is the first step toward solving the problem. The next step is most likely to open a Service Request for your Sun Storage 7000 Unified Storage System, as it might be necessary to engage an Oracle Support Engineer to do some work on the Emergency Shell remotely. To do this work, the Oracle Support Engineer might ask you to start an Oracle Shared Shell session; being familiar with it in advance saves a lot of time.
The following links will provide more information:



Framing the Problem

As mentioned before, the first step in resolving the situation is framing the problem. If you encounter one of the following situations, it could be related to a known ZFS pool issue. The following list provides initial questions and entry points to the different solution paths:

1.) The Sun Storage 7000 Unified Storage System is not able to boot

There are different reasons for a system not to boot, or to appear not to boot, and it is sometimes difficult to decide whether the system is booting, hung during boot, or really not booting. In such a situation it is very important to determine at what stage the boot process fails.
First things to verify:
  • Is the console redirection of the ILOM being used to monitor the boot process? Neither a monitor nor the Java console of the WebUI shows the complete boot.
  • Are the BIOS boot messages displayed at the console, and can the BIOS settings be opened for changing?
    If there are problems with the system before the BIOS is loaded, refer to <Document:1386810.1>.
  • Are the BIOS settings as they should be, especially for the boot device?
    Refer to <Document:1357354.1> for 7x10 systems and <Document:1357409.1> for 7x20 systems.
Once the system is able to access and load the BIOS, the next check is whether it can find the boot devices and read the boot loader, GRUB.
  • If the system panics after loading GRUB, while loading the Solaris kernel, you could try booting from the second disk in the system pool.
    SunOS Release 5.11 Version ak/[email protected],1-1.8 64-bit
    Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.

If the system is able to boot from the second disk in the system pool, this shows that the other disk has a problem with loading the kernel and needs to be synced by running a scrub on the system pool.
Unfortunately, neither the CLI nor the BUI currently allows a scrub to be run against the system pool; this requires running commands on the Emergency Shell. As it is recommended to involve an Oracle Service Engineer when executing commands on the Emergency Shell, it is advisable to open a Service Request at this point.
  • If the system fails to boot from the second system drive as well, the only way to continue is to open a Service Request.
At that stage it is recommended to open a Service Request with Oracle so that an Oracle Service Engineer can assist in resolving the situation. Please add all gathered information to the Service Request to give the ongoing work a good start.

2.) The Sun Storage 7000 Unified Storage System boots, but does not get past the banner

The system seems to boot normally: it gets past the GRUB menu and displays the Solaris banner and the banner of the Appliance Kit, but it is not able to finish booting and does not show a login prompt.

SunOS Release 5.11 Version ak/[email protected],1-1.8 64-bit
Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.
Configuring devices.
Configuring network devices ... done.

Sun Storage 7310 Version ak/SUNW,[email protected],1-1.8
Copyright 2012 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

Even though the system proceeds to this stage, it is not able to finish its boot process. The system is most likely stuck in one of the following situations:
  • The system has loaded the kernel and the Appliance Kit, as shown in the console output above, and seems stuck. This is a good indication that the Sun Storage 7000 Unified Storage System is waiting for something to complete on the data pool(s).
  • The system has loaded the kernel and the Appliance Kit and then suddenly panics with a ZFS panic, which indicates problems with the integrity of at least one share in a data pool.
In both situations, engage an Oracle System Support Engineer by opening a Service Request with Oracle.



To overcome this situation and to gather more data, it is recommended to boot the appliance without importing all data pools. A step-by-step procedure is available on the AmberRoadSupport Wiki, which is not accessible to all Oracle employees.
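
The console transcripts in this document suggest a simple way to decide how far a boot attempt got before it stopped. The following is a minimal sketch in Python, assuming the console output of the ILOM redirection has been captured as plain text. The names BootStage and classify_boot_stage are hypothetical, and the matching relies only on the banner lines shown in this document ("SunOS Release", "Sun Storage ... Version ak/", "console login:"); it is not an Oracle tool and other software releases may print different text.

```python
from enum import Enum


class BootStage(Enum):
    """Furthest boot stage visible in a console transcript (hypothetical names)."""
    NO_KERNEL = "kernel banner not seen"                  # scenario 1
    KERNEL_ONLY = "kernel banner seen, no appliance banner"
    BANNER_NO_LOGIN = "appliance banner seen, no login"   # scenario 2
    LOGIN_PROMPT = "login prompt reached"                 # scenarios 3 and 4


def classify_boot_stage(console_text: str) -> BootStage:
    """Classify a captured console transcript by the banners it contains."""
    lines = [line.strip() for line in console_text.splitlines()]

    # Solaris kernel banner, e.g. "SunOS Release 5.11 Version ak/..."
    saw_kernel = any(line.startswith("SunOS Release") for line in lines)
    # Appliance Kit banner, e.g. "Sun Storage 7310 Version ak/..."
    saw_appliance = any(
        line.startswith("Sun Storage") and " Version ak/" in line
        for line in lines
    )
    # Login prompt, e.g. "s7310 console login:"
    saw_login = any(line.endswith("console login:") for line in lines)

    if saw_login:
        return BootStage.LOGIN_PROMPT
    if saw_appliance:
        return BootStage.BANNER_NO_LOGIN
    if saw_kernel:
        return BootStage.KERNEL_ONLY
    return BootStage.NO_KERNEL
```

A transcript that ends after the Appliance Kit banner would be classified as BANNER_NO_LOGIN, which corresponds to the situation described in this section. The sketch does not distinguish a hang from a ZFS panic; that still requires reading the last console messages.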

3.) The Sun Storage 7000 Unified Storage System boots up all the way, but data pool has problems

The Sun Storage 7000 Unified Storage System has passed BIOS and GRUB, loaded the kernel and the Appliance Kit software, and displays a login prompt. Everything seems fine, but the system can still face problems or certain limitations related to one or more data pools.

SunOS Release 5.11 Version ak/[email protected],1-1.8 64-bit
Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.
Configuring devices.
Configuring network devices ... done.

Sun Storage 7310 Version ak/SUNW,[email protected],1-1.8
Copyright 2012 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.


s7310 console login:


  • The pool is filled with data and has reached the critical level of 82%. This level is known to impact overall system performance and is a known limitation. This behavior is not unique to ZFS, but the level differs depending on the filesystem used.
  • As a filesystem is used, data is deleted and new data is stored in the space that becomes available. Over time this creates fragmentation, and the process that looks for available space when allocating data blocks takes longer.
    https://blogs.oracle.com/bonwick/entry/space_maps
    https://blogs.oracle.com/bonwick/en_US/entry/zfs_block_allocation
    https://blogs.oracle.com/relling/entry/space_maps_from_space
  • Snapshots are very common with ZFS. They are used for different purposes, such as manual or automatic snapshots, NDMP backup, and remote replication. Creating a snapshot is easy and cheap for ZFS, but deleting a snapshot can have drawbacks, especially if the snapshot is big. Snapshots grow big because of changes on the parent filesystem and/or the included ZFS volumes, referred to as LUNs.
  • Some Appliance Kit software updates include new features that come along with improvements in ZFS. To benefit from those fixes and improvements, the version of the ZFS pool needs to be updated by applying the "Deferred Updates".
  • The data pool(s) are shown in a state other than Optimal, which means there is a problem with the pool. This could be a hard disk drive causing problems of different kinds, or a permanent error in the ZFS data structure. A healthy pool should look like this:

    s7310:> configuration storage ls
    Properties:
                              pool = mixed
                            status = online
                           profile = mirror
                       log_profile = log_mirror
                     cache_profile = cache_stripe
                             scrub = resilver completed after 0h0m with 0 errors
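
If the output of "configuration storage ls" has been saved as text, the properties in the listing above can be checked with a few lines of script. The following is a minimal sketch in Python, assuming the "key = value" layout shown above. The function names parse_properties and pool_is_online are hypothetical, and treating "online" as the only healthy status is an assumption based on that one listing; this document does not enumerate the other status values.

```python
from typing import Dict


def parse_properties(cli_output: str) -> Dict[str, str]:
    """Collect 'key = value' lines from saved CLI output into a dictionary."""
    properties: Dict[str, str] = {}
    for line in cli_output.splitlines():
        # Split at the first '=' only, so values may themselves contain '='.
        key, separator, value = line.partition("=")
        if separator and key.strip():
            properties[key.strip()] = value.strip()
    return properties


def pool_is_online(cli_output: str) -> bool:
    """Return True if the listing reports 'status = online' (assumed healthy)."""
    return parse_properties(cli_output).get("status") == "online"
```

Lines without an equals sign, such as the prompt line and the "Properties:" header, are skipped. Any result other than True is only a hint to look closer; it does not identify the cause.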
The list above shows several different situations a data pool can face during its lifetime. Some situations may cause a performance impact, and others are simply the result of faulty hardware. Depending on the nature of the problem, there are different ways to get support and information.
After consulting the system documentation available in the online documentation or on the Oracle technet servers, the next logical step is to search the 7000 Series ZFS Appliance Community.
If neither the documentation nor the community offers a suitable answer, opening a Service Request is the next logical step. When opening a Service Request, provide as much information as possible and upload a Support bundle from the system showing the problem. In a cluster configuration, a bundle from the cluster peer should also be uploaded for review.
Some useful links to the documentation and the 7000 Series ZFS Appliance Community:
https://www.oracle.com/technetwork/documentation/oracle-unified-ss-193371.html
https://communities.oracle.com/portal/server.pt/community/7000_series_zfs_appliance/456
https://wikis.oracle.com/display/FishWorks/Fishworks

4.) The Sun Storage 7000 Unified Storage System boots up all the way, but system pool has problems

The Sun Storage 7000 Unified Storage System has passed BIOS and GRUB, loaded the kernel and the Appliance Kit software, and displays a login prompt. Everything looks fine: the services are available and clients can access the data, but the system pool seems to have some problems.

SunOS Release 5.11 Version ak/[email protected],1-1.8 64-bit
Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.
Configuring devices.
Configuring network devices ... done.

Sun Storage 7310 Version ak/SUNW,[email protected],1-1.8
Copyright 2012 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.


s7310 console login:
 
Like the data pool(s), the system pool can suffer from issues similar to those described for a data pool:
  • pool usage has exceeded 82%
  • one of the mirrored hard disk drives has failed due to some error conditions
  • pool shows permanent errors
The system pool cannot be maintained through the BUI or CLI, so the only situations that can be solved without involving an Oracle Support Engineer are some limited cleanup operations, such as removing previous releases of the Appliance Software or deleting Analytics data and worksheets. A failed system hard disk drive can be replaced by an administrator, but even this service operation requires an open Service Request with Oracle so that the spare part can be ordered and shipped.
The remaining service actions might require engaging an Oracle Service Engineer and providing remote access to the system.
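
The 82% fill level mentioned for both data pools and the system pool can be expressed as a simple calculation. The following is a minimal sketch in Python, assuming the used and total capacity of a pool are already known in bytes. The names CRITICAL_FILL_PERCENT, fill_percent and reached_critical_level are hypothetical, and how the byte counts are obtained from the appliance is outside the scope of this sketch.

```python
# Critical fill level named in this document, in percent.
CRITICAL_FILL_PERCENT = 82.0


def fill_percent(used_bytes: int, total_bytes: int) -> float:
    """Return the pool fill level as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def reached_critical_level(
    used_bytes: int,
    total_bytes: int,
    threshold: float = CRITICAL_FILL_PERCENT,
) -> bool:
    """Return True once the fill level has reached the threshold."""
    return fill_percent(used_bytes, total_bytes) >= threshold
```

The comparison uses "reached" (greater than or equal) rather than "exceeded"; the document uses both words, so the choice here is a judgment call. For the system pool, a True result points to the limited cleanup operations described above or to a Service Request.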


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.