SunFire[TM] 12K/15K/20K/25K: During POST Cycle, "lock_retries" Messages Appear

Asset ID:	1-71-1009071.1
Update Date:	2011-05-26
Keywords:

Solution Type Technical Instruction Sure

Solution 1009071.1 : SunFire[TM] 12K/15K/20K/25K: During POST Cycle, "lock_retries" Messages Appear

Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Goal

The HOST POST (hpost) application probes, tests, and configures the hardware of a Sun Fire[TM] 12K-25K domain, preparing it for use by the OpenBoot[TM] PROM and the Solaris[TM] Operating Environment (Solaris[TM] OS).

On the Sun Fire 12K/15K/20K/25K platform, the /opt/SUNWSMS/bin/hpost executable contains both the power-on self-test (POST) and the logic that sequences POST's operations.

Solution

On occasion, during a POST run on a specific domain or SB (selected through hpost's "-d <Domain_Id_or_Tag>" and/or "-H<exp>.<slot>" options), messages such as the following are captured in the resulting POST logs (located at /var/opt/SUNWSMS/SMS/adm/<domain-tag>/post):

stage cpu_lpost: Test all L1 CPU boards...
Performing ASIC config with bus config a/d/r = 333...
Slot0 in domain: 00001
Slot1 in domain: 00000
EXBs in use: 244E7
sgcpu.flash file: Version 5.14.2 Build 2.0 I/F 12 is newest supported
Fprom SB0/F0: NOTE: lpost_vercheck(): Using up-rev LPOST version 5.14.6
Build 1.0 I/F 12 (from: 5.14.2 Build 0 I/F 12)
Fprom SB0/F1: NOTE: lpost_vercheck(): Using up-rev LPOST version 5.14.6
Build 1.0 I/F 12 (from: 5.14.2 Build 0 I/F 12)
Proc SB0/P2: lock_retries = 1
Proc SB0/P3: lock_retries = 1
Proc SB0/P0: lock_retries = 1
Proc SB0/P1: lock_retries = 1
Proc SB0/P2: lock_retries = 2
Proc SB0/P3: lock_retries = 2
Proc SB0/P0: lock_retries = 2
Proc SB0/P1: lock_retries = 2
Proc SB0/P2: lock_retries = 3
Proc SB0/P3: lock_retries = 3
Proc SB0/P0: lock_retries = 3
Proc SB0/P1: lock_retries = 3
Proc SB0/P2: lock_retries = 4
Proc SB0/P3: lock_retries = 4
Proc SB0/P0: lock_retries = 4
Proc SB0/P1: lock_retries = 4
Proc SB0/P2: lock_retries = 5
Proc SB0/P3: lock_retries = 5
Proc SB0/P0: lock_retries = 5
Proc SB0/P1: lock_retries = 5
Proc SB0/P2: lock_retries = 1
Proc SB0/P3: lock_retries = 1
Proc SB0/P0: lock_retries = 1
stage nmb_cpu_lpost: Non-Mem Board Proc tests...
Performing ASIC config with bus config a/d/r = 333...
Slot0 in domain: 00001
Slot1 in domain: 00000
EXBs in use: 244E7
stage_cpu_lpost(): No NMB Boards in config. Skipping Stage nmb_cpu_lpost.

In general, the System Management Services (SMS) subsystem uses the hardware access daemon (hwad) to access specific hardware. It is normally expected to lock the JTAG/I2C/Bootbus master, board, or system with which it is currently communicating, so that multiple SMS services (for example, POST and the environmental status monitoring daemon (ESMD)) do not interfere with each other.
The locking is usually implemented through software mutexes provided by the SMS hwad libraries.
The "lock_retries" messages (listed previously) arise when the hpost application attempts to acquire a lock (controlled by a software mutex on the System Controller's SMS subsystem) on a specific hardware resource (for example, SB0).

All SMS utilities and services (including hpost) must adhere to software-controlled locks (maintained as a hierarchy of mutex operations) to prevent inter-process deadlocks.

In the previous example, hpost requested that its lock requests be serviced in NOWAIT mode; that is, if it cannot acquire the bootbus lock on a specific proc, hpost checks another proc instead of waiting. This not only helps to optimize the polling cycle but also avoids being serialized behind other processes that are holding the lock, or locks, at the same time.

In addition, hpost maintains a retry count on these NOWAIT requests and can back off to a TIMEOUT request for the lock if the same proc fails its NOWAIT requests five times. The default timeout of three minutes is a .postrc tunable.
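
The NOWAIT-then-TIMEOUT behavior described above can be illustrated with a minimal Python sketch. It is not hpost or hwad code: poll_procs, its parameters, and the use of threading.Lock as a stand-in for the hwad-managed bootbus lock are all assumptions made for illustration. Only the retry limit of five and the three-minute default come from the text.

```python
import threading

MAX_NOWAIT_RETRIES = 5       # from the text: five failed NOWAIT requests
DEFAULT_TIMEOUT_SECS = 180   # from the text: three-minute default timeout

def poll_procs(locks, work, timeout_secs=DEFAULT_TIMEOUT_SECS, log=print):
    """Service every proc once, preferring non-blocking lock requests.

    locks: dict of proc name -> threading.Lock (hypothetical stand-in
           for the bootbus lock of that proc).
    work:  callable run with the proc name while its lock is held.
    Returns the proc names in the order they were serviced.
    """
    retries = {proc: 0 for proc in locks}
    pending = list(locks)
    done = []
    while pending:
        proc = pending.pop(0)
        lock = locks[proc]
        if retries[proc] >= MAX_NOWAIT_RETRIES:
            # Back off: block on the lock, but only up to the timeout.
            if not lock.acquire(timeout=timeout_secs):
                raise TimeoutError(f"{proc}: no lock after {timeout_secs}s")
            retries[proc] = 0
        elif not lock.acquire(blocking=False):
            # NOWAIT request failed: count it and move on to another proc.
            retries[proc] += 1
            log(f"Proc {proc}: lock_retries = {retries[proc]}")
            pending.append(proc)
            continue
        try:
            work(proc)
        finally:
            lock.release()
        done.append(proc)
    return done
```

In this sketch a busy proc is pushed to the back of the queue, so the other procs are serviced without waiting, and a blocking request is issued only after the fifth failed NOWAIT attempt on the same proc.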

For example, the hwad_lock_timeout_secs value specifies the timeout for requests to lock resources through hwad, overriding the default value.
The default timeout is almost never exceeded in a typical environment, and further tuning of this value is strongly discouraged.

In conclusion, the "lock_retries" messages were reported when hpost attempted to acquire access to SB0 while some other process (for example, the routine board temperature/voltage reporting tasks of SMS's ESMD) was accessing that same resource.
The sample POST log excerpt shows that hpost backed off to a TIMEOUT regime and that the POST run was able to complete successfully.
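
When reviewing a long POST log, it can help to tally these messages per proc. The following Python sketch does that for lines in the format shown in the excerpt; the function name summarize_lock_retries and the regular expression are illustrative assumptions, not part of SMS.

```python
import re
from collections import defaultdict

# Matches lines such as "Proc SB0/P2: lock_retries = 1" from the excerpt.
LOCK_RETRY_RE = re.compile(r"^Proc\s+(\S+):\s+lock_retries\s*=\s*(\d+)$")

def summarize_lock_retries(lines):
    """Return {proc: (message_count, highest_retry_value)} for a POST log."""
    counts = defaultdict(int)
    highest = defaultdict(int)
    for line in lines:
        match = LOCK_RETRY_RE.match(line.strip())
        if match:
            proc, value = match.group(1), int(match.group(2))
            counts[proc] += 1
            highest[proc] = max(highest[proc], value)
    return {proc: (counts[proc], highest[proc]) for proc in counts}

sample = [
    "Proc SB0/P2: lock_retries = 1",
    "Proc SB0/P3: lock_retries = 1",
    "Proc SB0/P2: lock_retries = 2",
    "stage nmb_cpu_lpost: Non-Mem Board Proc tests...",
]
print(summarize_lock_retries(sample))
# prints {'SB0/P2': (2, 2), 'SB0/P3': (1, 1)}
```

The same function accepts an open file object, since it only iterates over lines. The summary is a reading aid only; whether the POST run itself completed still has to be checked in the log.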

RESOLUTION: In cases such as the one described above, the "lock_retries" messages can be safely disregarded.

Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Section

Keywords: hpost, post, lock_retries, starcat, hwad, mutex, libraries

Previously Published As 75786
