Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1001848.1
Update Date:2009-09-27
Keywords:

Solution Type  Problem Resolution Sure

Solution  1001848.1 :   Sun Fire[TM] 12K/15K Server: Terminating a hpost reset/recovery loop test cycle on a domain  


Related Items
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

Previously Published As
202532


Symptoms

A failed hardware component can cause a Sun Fire[TM] 12K/15K Server (SF12K/SF15K) domain to loop in hpost as dsmd tries to identify the failed component via hpost testing.

When this happens, the hpost level is raised (more testing) after every unsuccessful boot attempt, so each successive hpost takes longer to complete.
For example, the level may progress through hpost -Q (i.e. 7), 16, 32, 64, 96, 127.

However, a domain can appear to be looping in hpost while the hpost level is NOT incrementing, e.g. hpost 16 is executed every time. This can indicate a problem during boot rather than a hardware failure; the most common cause is an incorrect boot-device or boot path.
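
The distinction between the two cases can be expressed as a simple check over the hpost levels seen on successive attempts. The following is only an illustrative sketch: classify_hpost_loop is a hypothetical helper, not an SMS command, and it assumes the levels have already been collected by hand (for example, from the post logs).

```shell
#!/bin/sh
# Hypothetical helper, not part of SMS.
# Arguments: hpost levels seen on successive boot attempts, oldest first.
# Prints "incrementing" if any level is higher than the one before it,
# "constant" if every level is the same, and "unknown" otherwise.
classify_hpost_loop() {
    if [ "$#" -lt 2 ]; then
        echo "unknown"        # need at least two attempts to compare
        return 0
    fi
    prev=$1
    shift
    rising=no
    changed=no
    for level in "$@"; do
        if [ "$level" -gt "$prev" ]; then
            rising=yes
        fi
        if [ "$level" -ne "$prev" ]; then
            changed=yes
        fi
        prev=$level
    done
    if [ "$rising" = yes ]; then
        echo "incrementing"   # levels are escalating: suspect failed hardware
    elif [ "$changed" = no ]; then
        echo "constant"       # same level every time: suspect a boot problem
    else
        echo "unknown"
    fi
}

classify_hpost_loop 7 16 32 64 96 127
classify_hpost_loop 16 16 16
```

With the two sample sequences taken from the text, the first call reports an incrementing loop and the second a constant one.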



Resolution

The following example shows the recommended course of action to remedy this issue.

First, set the domain keyswitch to off:

has-sc0:sms-svc:4> setkeyswitch -d A off

Wait for all boards to power down.

Then, from the SC as the sms-svc user, set the auto-boot? parameter to false with the setobpparams command. The following example changes the OBP parameters for domain A.

has-sc0:sms-svc:2> setobpparams -d A auto-boot?=false

Note: It is recommended that the setobpparams command be run even if showobpparams already shows that 'auto-boot?' is set to false.

Then run showobpparams to confirm that the change has been made.

has-sc0:sms-svc:3> showobpparams -d A
auto-boot?=false
diag-switch?=true
fcode-debug?=false
use-nvramrc?=true
security-mode=none
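
The check on the auto-boot? setting can also be done mechanically rather than by eye. The sketch below is an assumption-laden illustration: autoboot_is_false is a hypothetical helper, it assumes a POSIX grep (with -F and -x), and it assumes the parameter appears on its own line exactly as in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper, not part of SMS.
# Reads showobpparams-style output on stdin and succeeds only if it
# contains the exact line "auto-boot?=false".
autoboot_is_false() {
    # -F: fixed string (the "?" is taken literally), -x: match the whole line
    grep -Fx 'auto-boot?=false' >/dev/null
}

# Sample input copied from the showobpparams output shown in this document.
sample='auto-boot?=false
diag-switch?=true
fcode-debug?=false
use-nvramrc?=true
security-mode=none'

if printf '%s\n' "$sample" | autoboot_is_false; then
    echo "auto-boot? is false"
else
    echo "auto-boot? is NOT false"
fi
```

On the SC, the same function could presumably be fed from the live command (showobpparams -d A piped into autoboot_is_false) instead of the sample text.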

Now, keyswitch the domain back on.

has-sc0:sms-svc:5> setkeyswitch -d A on

After powering on, the domain may go through a quick (-Q) hpost, which may fail depending on the cause of the previous failure. After the next hpost, the domain will go to OBP. Standard troubleshooting practices can then be followed to determine the cause, for example checking the post logs for hardware failures.
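
Because the order of the steps matters, it can help to see the whole sequence for a given domain in one place. The following sketch only prints the commands and does not run them; print_recovery_steps is a hypothetical helper, and the commands it prints are the ones used in the procedure above.

```shell
#!/bin/sh
# Hypothetical helper, not part of SMS: prints the steps, executes nothing.
# Argument: the domain identifier, e.g. A.
print_recovery_steps() {
    domain=$1
    if [ -z "$domain" ]; then
        echo "usage: print_recovery_steps <domain>" >&2
        return 1
    fi
    echo "setkeyswitch -d $domain off"
    echo "# wait for all boards to power down"
    echo "setobpparams -d $domain auto-boot?=false"
    echo "showobpparams -d $domain"
    echo "setkeyswitch -d $domain on"
}

print_recovery_steps A
```

Printing rather than executing keeps the sketch safe to try anywhere; the commands themselves should still be entered by hand on the SC as the sms-svc user.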

Note: Please consult the man pages for more information on showobpparams, setobpparams, and setkeyswitch.



Product
Sun Fire 12K Server
Sun Fire 15K Server

Internal Comments

See http://has.central for more information on hpost levels and timing.


Keywords: auto-boot, hpost, loop test, setobpparams
Previously Published As
71598

Change History
Date: 2007-10-02
User Name: 97961
Action: Approved
Comment: - Converted to STM formatting for better readability
- Applied trademarking where it is missing
- Corrected use of trademarking
Version: 4
Date: 2007-10-02
User Name: 97961
Action: Accept
Comment:
Version: 0
Date: 2007-10-02
User Name: 101984
Action: Approved
Comment: Done review and added some changes, mainly to verbatim. Technical content is correct.

Thanks
Morgan
Version: 0
Date: 2007-10-01
User Name: 101984
Action: Accept
Comment:
Version: 0
Date: 2007-10-01
User Name: 125045
Action: Approved
Comment: Back to Tech Review!
Version: 0
Date: 2007-10-01
User Name: 125045
Action: Rejected
Comment: had to fix my own error - Left the has.central link in the public section. Doh.
Version: 0
Date: 2007-10-01
User Name: 125045
Action: Accept
Comment:
Version: 0
Date: 2007-10-01
User Name: 125045
Action: Approved
Comment: Updated ordering of procedure for greater edge case coverage, also fixed a few typos and moved from internal to contract.
Version: 0
Date: 2007-10-01
User Name: 125045
Action: Update Started
Comment: update for minor typos and ordering.

also changing ordering for keyswitch / setobpparams
Version: 0
Date: 2003-11-10
User Name: 43660
Action: Approved
Comment: Minor grammatical changes. Changed title to be consistent with other docs.
Version: 0
Date: 2003-11-10
User Name: 116819
Action: Approved
Comment: Changed problem description text to differentiate boot failures from post failures.
Version: 0
Date: 2003-11-09
User Name: 106757
Action: Approved
Comment: Need review
Version: 0
Date: 2003-10-19
User Name: 116819
Action: Rejected
Comment: Clarify when to keyswitch
Version: 0
Date: 2003-10-19
User Name: 106757
Action: Approved
Comment: Please review
Version: 0
Date: 2003-09-24
User Name: 103287
Action: Rejected
Comment: Several reasons to send this document back to draft stage. I have emailed the author with detailed comments and offered to work with him if he desires. Here are the reasons:

1) Title should be changed to reflect the real purpose of the article, which isn't that some domains go into a HPOST loop, but in fact how to stop the HPOST loop.
2) The problem description that this HPOST loop (incremental POST level) behavior is most commonly caused by a disk access problem is not correct. Disk access problems are at OBP, therefore would result in domain panic, or the domain to stay at OBP. Problems at HPOST cause the incremental post level loop cycle (as described in Infodoc 48395).
3) I think the focus of the article should be regarding interrupting the HPOST loop process as described by turning off the error reset recovery flag. But, it needs to be explained why we want to do this. Like, we already know why the domain can't bringup properly and want to interrupt the hposts so we can blacklist a part to get the domain back up quickly.

The main problem is that the document describes the disk access problem as causing the hpost loop, but in fact it should be something in HPOST which causes the issue. Second, the title needs to better reflect the true nature of the document. Ultimately this document should be a resource on how to disable the error/reset recovery bit, allowing the administrator to manually intervene. Reference to Infodoc 48395 should also be included.
Version: 0
Date: 2003-09-23
User Name: 106757
Action: Approved
Comment: Ready to review
Version: 0
Date: 2003-09-23
User Name: 106757
Action: Created
Comment:
Version: 0
Product_uuid
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.