Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1401282.1
Update Date:2012-10-01

Solution Type  Troubleshooting Sure

Solution  1401282.1 :   Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI hang)  


Related Items
  • Sun Storage 7310 Unified Storage System
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7120
  • Sun ZFS Storage 7320
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
  • .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage


To assist a user in resolving management BUI/CLI connectivity/responsiveness issues.

In this Document
Purpose
Troubleshooting Steps
 Symptoms:
 Causes and Resolutions:
 Excessive kernel virtual memory (exceeding the 32-bit VM limit)
 Excessive amount of 'old' analytics
 Excessive amount of 'old' log files
 Excessive use of contracts
 Excessive amount of AKD process (memory) heap fragmentation
 Further Assistance Required:
 Other useful information:
References


Applies to:

Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7320 - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7310 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7110 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7210 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
7000 Appliance OS (Fishworks)

Purpose

The purpose of this document is to assist a user in resolving management BUI/CLI connectivity/responsiveness issues.  If ssh to the appliance drops the user into the emergency shell, the end user must open a support session to allow the Oracle System Support team remote access to the system to troubleshoot and fix this issue.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances.

 

Customers are not permitted to run commands at the emergency shell.

Troubleshooting Steps

Please validate that each troubleshooting step below is true for the affected environment.  The steps will provide instructions or a link to a document, for validating the step and taking corrective action as necessary.  The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution.  Please do not skip a step.

Symptoms:

The usual symptoms of an Unresponsive Administrative Interface issue are:

  • Cannot log in to the BUI/CLI (on one, or both, nodes of a cluster)
  • Slow startup of system/management interfaces

See the following Internal-Only Documents for collecting useful data.

  • <Document 1401288.1> - Unified Storage System: Data collection for akd hang issues
  • How to analyse akd core files (Wiki Doc TBD)

Causes and Resolutions:

The initial step is to check the basic configuration and operation of the appliance management connectivity. Please see:
<Document 1392845.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Loss of Network Connection to the Management Interface.
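Before working through the causes below, it can help to confirm whether the management ports still accept TCP connections at all. The following is a minimal sketch, not an Oracle-supplied tool: port 215 is the BUI port that appears in the wiki URL later in this document, port 22 is assumed to be the ssh port used for the CLI, and the function names and the example address are hypothetical.

```python
import socket

# Port 215 is the BUI port used elsewhere in this document; port 22 is the
# standard ssh port and is an assumption for the CLI.
MGMT_PORTS = {"BUI": 215, "CLI (ssh)": 22}


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_management(host: str, timeout: float = 5.0) -> dict:
    """Probe each management port and return {label: reachable}."""
    return {label: tcp_port_open(host, port, timeout)
            for label, port in MGMT_PORTS.items()}


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only placeholder address.
    print(probe_management("192.0.2.10", timeout=2.0))
```

A port that accepts connections but never completes a login is more consistent with the hang conditions described in this document than with a loss of network connectivity, although this check alone does not prove either.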

The following is a general, non-exhaustive list of possible causes of 'BUI/CLI hang' conditions:

  •   AKD exceeding the 32-bit kernel virtual memory limit (~3.5 GB+)
  •   Excessive amount of 'old' analytics
  •   Excessive amount of 'old' log files
  •   Excessive use of contracts
  •   Excessive amount of AKD process (memory) heap fragmentation

Causes still to be added to this non-exhaustive list of 'BUI/CLI hang' conditions:

  • SMF system hanging when svc.configd overflows its heap
  • 'Temporary' hang when ZFS dataset destroy is running
  • Cluster 'peer' locking issue
  • Replication 'peer' locking issue
  • Faulty (?) hardware issue [clustron card, cables]
  • All 'other' known issues


For each of these causes, the known issues are detailed below, along with the specific symptoms and the recommended actions for resolution.

PLEASE NOTE:  Because the Administrative Interfaces (BUI or CLI) cannot be accessed, the customer may be unable to view the standard 'error/fault' reporting mechanisms that would otherwise provide further diagnostic/context data to assist in isolating the cause of the issue:

  • FMA events  -> (BUI) Maintenance > Problems > Active Problems
  • Alert log  -> (BUI) Maintenance > Logs > Alerts
  • Fault log  -> (BUI) Maintenance > Logs > Faults
  • System log  -> (BUI) Maintenance > Logs > System
  • Audit log  -> (BUI) Maintenance > Logs > Audit
  • Phone Home log  -> (BUI) Maintenance > Logs > Phone Home

In the vast majority of cases where a BUI/CLI hang is observed, you will need to engage Oracle System Support by opening a Service Request to assist in determining the root cause of the problem.

Excessive kernel virtual memory (exceeding the 32-bit VM limit)

For any system running Appliance Release versions earlier than 2010.Q3.4 or 2011.1, running (many) aksh scripts can exhaust the Appliance management daemon kernel memory.

See <Document 1334777.1> - Sun Storage 7000 Unified Storage System: System hang - aksh scripts can exhaust memory

Attempting to log in to the CLI generates a 'fatal error: no memory' message.

See <Document 1325025.1> - Sun Storage 7000 Unified Storage System: aksh fatal error: no memory

For any system running Appliance Release versions earlier than 2011.1, creating and deleting a large number of VDI LUNs can cause a BUI/CLI hang condition.

See <Document 1408593.1> - Sun Storage 7000 Unified Storage System: Creation/deletion of large amount of VDI LUNs can cause BUI/CLI hang

A workflow can be used to monitor the memory used by akd.

See <Document 1391232.1> Sun Storage 7000 Unified Storage System: The workflow to check memory usage of the akd.
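Once akd memory figures have been collected (for example with the workflow referenced above), they can be compared against the approximate 32-bit limit quoted in this document. The sketch below only illustrates that arithmetic. The ~3.5 GB limit is the approximate figure given earlier, the 20% warning margin is an arbitrary assumption, and the function names are hypothetical.

```python
from typing import Sequence

# Approximate 32-bit limit quoted in this document (~3.5 GB); treat as a rough figure.
AKD_LIMIT_BYTES = int(3.5 * 1024**3)


def akd_headroom(used_bytes: int, limit_bytes: int = AKD_LIMIT_BYTES) -> float:
    """Return the fraction of the limit still unused (0.0 when at or over it)."""
    return max(0.0, 1.0 - used_bytes / limit_bytes)


def needs_attention(samples: Sequence[int], warn_fraction: float = 0.2) -> bool:
    """True if the most recent sample leaves less than warn_fraction headroom.

    warn_fraction is an arbitrary margin chosen for illustration.
    """
    return bool(samples) and akd_headroom(samples[-1]) < warn_fraction


if __name__ == "__main__":
    # Hypothetical samples in bytes, oldest first.
    history = [1 * 1024**3, 2 * 1024**3, 3 * 1024**3]
    print(f"headroom: {akd_headroom(history[-1]):.1%}",
          "attention needed" if needs_attention(history) else "ok")
```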

Excessive amount of 'old' analytics

Due to the detailed amount of information available when using analytics, and the 'always on' operation for the collection of the default set of analytics, collection of 'excessive' analytics data can eventually cause a 'hang' condition for the Appliance management interfaces (BUI/CLI).

See <Document 1401595.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' analytics collected

A 'hang' condition for the Appliance management interfaces (BUI/CLI) may also result from a known analytics compilation bug.

See <Document 1468128.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to analytics compilation (CCP) bug

Excessive amount of 'old' log files

For any system running Appliance Release versions earlier than 2010.Q1.0, system libraries used by akd can exceed a 256 file descriptor limit if many (old) log files are present. This can cause a 'hang' condition for the Appliance management interfaces (BUI/CLI).

See <Document 1408493.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' amount of 'old' log files
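The relationship between the number of log files and the 256 file descriptor limit can be sketched as simple arithmetic. This is an illustration only: it assumes one descriptor per log file, which is a simplification of whatever akd actually does, and the directory passed in is whatever location the support engineer is inspecting (no appliance path is implied). The function names are hypothetical.

```python
import os

FD_LIMIT = 256  # limit quoted in this document


def count_regular_files(directory: str) -> int:
    """Count regular files directly inside directory (non-recursive)."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_file(follow_symlinks=False))


def would_exceed_fd_limit(n_log_files: int, other_open_fds: int = 0,
                          limit: int = FD_LIMIT) -> bool:
    """True if opening every log file at once would need more than limit descriptors.

    Assumes one descriptor per log file, which is a simplification.
    """
    return n_log_files + other_open_fds > limit


if __name__ == "__main__":
    n = count_regular_files(".")  # replace "." with the directory being inspected
    print(n, "files;", "over limit" if would_exceed_fd_limit(n) else "within limit")
```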

Excessive use of contracts

Whenever a workflow terminates abnormally, it leaves an unused 'contract id'. Eventually, the contract limit may be exceeded and processes are then unable to start. Error messages may include "Resource temporarily unavailable".

See <Document 1410873.1> - Sun Storage 7000 Unified Storage System: SMF unable to spawn processes due to contract exhaustion

Excessive amount of AKD process (memory) heap fragmentation

For any system running Appliance Release versions earlier than 2011.1.5.0, the akd process controlling the management interface can run out of memory because of memory fragmentation caused by a large number of oversized allocations.

See <Document 1494369.1> - Sun Storage 7000 Unified Storage System: BUI unavailable and seeing errors like "failed to update kstat chain: Not enough space"
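To show why fragmentation can produce a "Not enough space" failure even when plenty of memory is nominally free, the following toy first-fit allocator may help. It is purely illustrative and is not a model of akd's actual allocator: the class name, heap size and block sizes are all invented for the example.

```python
class ToyHeap:
    """First-fit allocator over a fixed address range, tracking free extents."""

    def __init__(self, size: int):
        self.free = [(0, size)]  # sorted list of (start, length) free extents
        self.used = {}           # start offset -> length of each live block

    def alloc(self, length: int):
        """Return the start offset of a new block, or None if no extent fits."""
        for i, (start, free_len) in enumerate(self.free):
            if free_len >= length:
                if free_len == length:
                    del self.free[i]
                else:
                    self.free[i] = (start + length, free_len - length)
                self.used[start] = length
                return start
        return None

    def release(self, start: int) -> None:
        """Free a block and merge it with adjacent free extents."""
        length = self.used.pop(start)
        self.free.append((start, length))
        self.free.sort()
        merged = [self.free[0]]
        for s, l in self.free[1:]:
            prev_start, prev_len = merged[-1]
            if prev_start + prev_len == s:
                merged[-1] = (prev_start, prev_len + l)
            else:
                merged.append((s, l))
        self.free = merged

    def total_free(self) -> int:
        return sum(l for _, l in self.free)

    def largest_free(self) -> int:
        return max((l for _, l in self.free), default=0)


if __name__ == "__main__":
    heap = ToyHeap(100)
    blocks = [heap.alloc(10) for _ in range(10)]  # fill the heap completely
    for start in blocks[::2]:                     # free every other block
        heap.release(start)
    # Half the heap is free, but only in 10-unit holes, so a 30-unit
    # "oversized" request cannot be satisfied.
    print(heap.total_free(), heap.largest_free(), heap.alloc(30))
```

In this toy case the total free space is 50 units while the largest contiguous hole is 10 units, so the 30-unit request fails. The same general pattern, at a much larger scale, is what makes oversized allocations vulnerable to fragmentation.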


====================================================================
Additional topics for content creation ...

Excessive kernel virtual memory (exceeding the 32-bit VM limit)
7004697 ak_stream_buffer allocation is conducive to heap fragmentation
7064392 7410 cluster (strl1) for abut 6 min not serving NFS (dup of 7004697)
7123344 akd hang on ak_job_cancel (workflow related)

'Temporary' hang when ZFS dataset destroy running
6938339 akd spinning in tight loop destroying zombie snapshots with holds

Cluster 'peer' locking issue
6768696 Mr. Freeze and ak_cio_disable wreak havoc with rm lock
6840270 akd can fail to learn it has taken the rm lock
6916485 RM lock tied in knots while making XDR-RPC call
6956503 rm locking deadlock due to the race between two peer server threads
7047128 BUI/CLI are not accessable. I tried to restart akd and this did not help (dup of 6956503)
7058331 deadlock between ak_peer_server and its assassins during replication

Replication 'peer' locking issue
6919370 rm lock deadlock creating replication target
7066043 rm deadlock in nas_repl_createTarget

Faulty hardware issue [clustron card, cables]
7014716  akd went into maintenance because of 3520/3524 mixed SIM code on 7420 new install
7054700  another discovery loop in pacs, fixed in Q3.4.2

SMF framework issue
7076205  SMF services down [svcs: Could not bind to repository server: repository server unavailable]

All 'other' known issues
7091568  akd slow and storage add taking long time (dup of 6525233)
6525233  vdev fullness can degrade performance, should cause zpool to become degraded
6915532  snapshot related activity causes akd to hang
6975601 changing shadow migration threads or cancel migration can lead to deadlock
6924824  destroying a dedup-enabled dataset bricks system

=====================================================================

Further Assistance Required:

At this point, if you have validated that each troubleshooting step above is true for your environment and the problem still exists, further troubleshooting is required.
You will need to engage Oracle Support by opening a Service Request to assist you further.

Please include all the relevant details and information - including examples of any errors that you see - along with an accurate problem description in the SR notes.
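When gathering error examples for the SR notes, the error strings quoted in this document can be matched against collected console or log text to point at the relevant document. The sketch below is a simple substring match and only a hint: "Resource temporarily unavailable" in particular is a generic system message that can have unrelated causes. The function name is hypothetical; the strings and document numbers are those quoted in this document.

```python
# Error strings and document numbers as quoted in this document.
KNOWN_SIGNATURES = {
    "fatal error: no memory": "1325025.1",
    "Resource temporarily unavailable": "1410873.1",
    "failed to update kstat chain: Not enough space": "1494369.1",
}


def match_signatures(text: str) -> list:
    """Return sorted (signature, document) pairs for every known string found in text."""
    return sorted((sig, doc) for sig, doc in KNOWN_SIGNATURES.items() if sig in text)


if __name__ == "__main__":
    # Hypothetical captured output, for illustration only.
    captured = "aksh: fatal error: no memory\n"
    for signature, document in match_signatures(captured):
        print(f"'{signature}' -> see Document {document}")
```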

If possible, a current supportbundle (from both heads, if this is a cluster system) should also be obtained and uploaded to Oracle.

The following links will provide more information:

  • <Document 1019887.1> - Sun Storage 7000 Unified Storage System: How to collect supportfile bundle using the BUI or CLI
  • <Document 1345655.1> - Sun Storage 7000 Unified Storage System: How to provide the correct Serial Number when opening an Oracle Service Request on a ZFS Storage Appliance or S7000 series NAS

It may be necessary for the Oracle Support Engineer to remotely run some 'emergency shell' commands. To accomplish this, the Oracle Support Engineer may request that you initiate an Oracle Shared Shell session. It would be useful if you are already familiar with this remote access tool - please see:

https://www.oracle.com/us/support/systems/premier/shared-shell-sun-systems-163755.html

 

Other useful information:

1. The Online Appliance Wiki documentation can be found at:

https://<appliance-ip-address>:215/wiki/index.php

2. To upgrade to the latest Appliance Firmware Release:

Later Appliance Firmware releases contain many improvements. Please check the current Appliance Firmware revision and, if required, upgrade to the latest release:

See https://wikis.oracle.com/display/FishWorks/Software+Updates
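When checking the current firmware revision, the release thresholds quoted in this document can be compared mechanically. The sketch below assumes release names written in the form used in this document (for example 2010.Q3.4 or 2011.1.5.0); the version string reported by an appliance may use a different format and would need converting first. It also reads "earlier than 2010.Q3.4 or 2011.1" as meaning that 2010.Q3.4 and later are not affected, which is an interpretation. The function names and cause labels are hypothetical.

```python
import re

# Thresholds quoted in this document: releases earlier than these
# may be exposed to the named cause.
THRESHOLDS = {
    "old log files exceeding the 256 file descriptor limit": "2010.Q1.0",
    "aksh scripts exhausting akd memory": "2010.Q3.4",
    "VDI LUN creation/deletion hang": "2011.1",
    "akd heap fragmentation": "2011.1.5.0",
}


def parse_release(release: str) -> tuple:
    """Turn '2010.Q3.4' or '2011.1.5.0' into a tuple of ints for comparison."""
    numbers = []
    for part in release.strip().split("."):
        match = re.fullmatch(r"Q?(\d+)", part)
        if match is None:
            raise ValueError(f"unrecognised release component: {part!r}")
        numbers.append(int(match.group(1)))
    return tuple(numbers)


def possible_causes(release: str) -> list:
    """List the causes whose threshold release is later than the given release."""
    current = parse_release(release)
    return [cause for cause, fixed in THRESHOLDS.items()
            if current < parse_release(fixed)]


if __name__ == "__main__":
    for cause in possible_causes("2010.Q3.2"):
        print("check:", cause)
```

A release at or beyond every threshold returns an empty list, which only rules out the version-specific causes named here, not the others described in this document.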

3. If the BUI and CLI are completely hung and you are unable to access the console via the Service Processor, you can still gather some useful diagnostic information while resetting the system by issuing an NMI reset. This causes the system to gather a kernel crash dump. The procedure is documented in:

<Document 1173064.1> - Sun Storage 7000 Unified Storage System: How to generate NMI to collect a system core dump

 

Back to <Document 1416406.1> ZFS Storage Appliances Troubleshooting Resource Center.

References

<NOTE:1392845.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Loss of Network Connection to the Management Interface
<NOTE:1334777.1> - Sun Storage 7000 Unified Storage System: System hang - aksh scripts can exhaust memory
<NOTE:1325025.1> - Sun Storage 7000 Unified Storage System: aksh fatal error: no memory
<NOTE:1408593.1> - Sun Storage 7000 Unified Storage System: Creation/deletion of large amount of VDI LUNs can cause BUI/CLI hang
<NOTE:1401595.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' analytics collected
<NOTE:1468128.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to analytics compilation (CCP) bug
<NOTE:1410873.1> - Sun Storage 7000 Unified Storage System: SMF unable to spawn processes due to contract exhaustion
<NOTE:1391232.1> - Sun Storage 7000 Unified Storage System: The workflow to check memory usage of the akd.
@<NOTE:1401288.1> - Sun Storage 7000 Unified Storage System: Data collection for akd hang issues
<NOTE:1408493.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' amount of 'old' log files
<NOTE:1019887.1> - Sun Storage 7000 Unified Storage System: How to collect a supportbundle using the BUI or CLI
<NOTE:1345655.1> - Sun Storage 7000 Unified Storage System: How to provide the correct Serial Number when opening an Oracle Service Request on a ZFS Storage Appliance or S7000 series NAS
<NOTE:1173064.1> - Sun Storage 7000 Unified Storage System: How to generate NMI to collect a system core dump
<NOTE:1416406.1> - Sun ZFS Storage Appliances Troubleshooting Resource Center
<NOTE:1494369.1> - Sun Storage 7000 Unified Storage System: BUI unavailable and seeing errors like "failed to update kstat chain: Not enough space"

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.