Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1458411.1
Update Date:2012-05-22
Keywords:

Solution Type  Problem Resolution Sure

Solution  1458411.1 :   Netra CT810: problems replacing a faulted SPARC CP2160 and restoring video and audio card functionality


Related Items
  • Sun Netra CP2160 Blade Server
  • Sun Netra CT810 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Usx/Blade/Netra>SN-SPARC: Netra Cxxxx




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-5352386063>

Applies to:

Sun Netra CT810 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Netra CP2160 Blade Server - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on SPARC (64-bit)

Symptoms

Customer environment configuration
- Netra CT810
- Satellite blades CP2160 (Sputnik+), p/n 375-3129 with different HW Revision / Dash level

ok banner
Netra CT 810 HD
SPARCengine CP2000 model 160 (UltraSPARC-IIi 648MHz), Keyboard Present
OpenBoot 4.0, 1024 MB memory installed, Serial #336594366.
Ethernet address 0:14:4f:10:5:be, Host ID: 941005be.

- 2 mezzanine cards installed into the satellite: Peritek Argus AD6-00729-A501 (graphics PMC card) and ACT Technico 8043-BFP (audio PMC card)
- Solaris 9 Update 8 (Solaris 9 9/05), NHAS 3.0, JLL 12.1

Original problem description
We had to replace a faulted satellite and restore the functionality of the video and audio cards.

Cause

Problems encountered during the activity

Initially we replaced the faulted hardware with the alternative satellite, p/n 375-3394, and re-mounted the original mezzanine cards on it.
The customer then downgraded the factory-installed firmware (1.0.21) on the new satellite to the older release (1.0.20), for two reasons: Sun had certified that firmware level against the customer's software layer in 2007, and the customer wanted all satellites in the CT chassis to run the same firmware release.
Finally, we correctly set the OBP variables to enable Netconsole (Doc ID 1004140.1; see also notes 1 and 2 at the end of this document), but the satellite stopped working: it no longer reached the OBP and displayed errors like these:

Could not support 0xa7fffff8, bar size in PCI memory mapping
Could not support 0xa8000000, bar size in PCI memory mapping
Could not support 0xa8000000, bar size in PCI memory mapping
Could not support 0xfd500000, bar size in PCI memory mapping

We tried all the possible actions to restore the functionality of the board, without success:

  • connecting a serial console directly to the front panel mini DIN connector and sending the <shift> ~# key sequence did not help (note that the escape sequence depends on the console input device and on the terminal emulator you are using with that device)

N.B.: examples of escape sequences are ~b, ~., ~g, ~t and ~n. We tried them not only to reset the OBP values, but also to test the board response and the MCNet-IPMI mode switching. All the escape sequences worked correctly on the other satellites, but not on this one.

  • pressing the abort button on the front panel of the board did not help (note that the system is very slow in this state because the abort button disables all CPU caches, so issuing 'icache on' afterwards is a good way to speed things up)
  • holding down <control> n on the serial port immediately after power on did not help (note that this sequence cannot be sent at any time: it must be sent very shortly after the blade is reset, because the firmware probes the input device soon after it starts to initialize and, if the n sequence is detected, resets the NVRAM parameters to their defaults)
  • setting switch sw4101 on the board to the opposite of its current position (in case there is an alternate image in the user flash to boot from) did not help
  • moving the satellite to a different slot did not help

Unfortunately, the board had been corrupted.

Solution

Correct procedure followed in this case, which avoids needless corruption and replacement

  • replaced the satellite with a new p/n 375-3394, because the first one had been corrupted
  • installed the mezzanine cards
  • set the OBP variables following the official documentation
  • installed the dropin (see Dropin section to understand the meaning and how to install it)

N.B.: we also tried to replace the satellite with a new p/n 375-3129, because we assumed some sort of "backward compatibility" at firmware level with the newer p/n 375-3394. This hypothesis was wrong: it delayed the resolution and ran into another issue, since the new 375-3129 we tested was DOA.

Conclusions and recommendations

  • there are no functional changes in the hardware between CP2160 p/ns 375-3129 and 375-3394
  • never downgrade a firmware release that comes from the factory. The factory-installed firmware is updated only when there is either a critical bug fix or a new hardware requirement

It is also not logical to think we would ship a blade fitted with a new flash device together with firmware that does not support that device.  No one should ever downgrade a factory-installed firmware version without confirmation that it is safe to do so.
In this case there is a new flash device that the old firmware driver does not support, so these new boards were essentially broken as soon as the firmware was downgraded.

  • the changes made in the latest firmware release are independent of the running OS, and nothing in the PCN for 375-3394 affects NHAS support either

The customer had to re-qualify their environment and applications at 1.0.21, because 1.0.20 is not compatible with the new flash PROM. It is the customer's responsibility to review the changes described in the PCN and/or release notes (see attachments) and then certify those changes themselves.

  • test proactively: before installing the dropin, set auto-boot? to false, reset, and enter the dropin commands manually.  Then boot.  If all is OK, install the dropin.
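
The "never downgrade factory firmware" recommendation above can be turned into a simple pre-flight check. The following is only a minimal sketch in Python, not a Sun or Oracle tool: the function names are hypothetical, and it assumes firmware releases are plain dotted numbers such as the 1.0.20 and 1.0.21 releases mentioned in this document.

```python
def parse_release(text):
    """Convert a dotted release string such as '1.0.21' into a tuple of ints.

    Assumption: releases consist only of dot-separated decimal numbers.
    """
    return tuple(int(part) for part in text.strip().split("."))


def is_downgrade(factory_release, target_release):
    """Return True if target_release is older than the factory release."""
    return parse_release(target_release) < parse_release(factory_release)


if __name__ == "__main__":
    factory = "1.0.21"   # release installed at the factory on p/n 375-3394
    target = "1.0.20"    # release the customer wanted to align to
    if is_downgrade(factory, target):
        print("Refusing: %s is older than factory release %s; "
              "get confirmation that the downgrade is safe first."
              % (target, factory))
    else:
        print("Target release is not older than the factory release.")
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "1.0.9" as newer than "1.0.21".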


Dropin

In 2006 Sun engineering wrote a script (the dropin) to work around a bug that caused intermittent cPCI bus errors (Simba panic/error). The dropin hard-codes PCI configuration addresses and disables all the devices behind the second PLX bridge on the Peritek PMC, which was identified as the cause of the random panics.

Manual installation instructions

Set the auto-boot? environment variable in OBP to false and reset the CP2160.

From the ok prompt enter the following OBP commands.  You may want to cut and paste them, as the Forth programming language is very sensitive to syntax:

ok " /pci@1f,0/pci@1,1/pci@4/pci@4" $delete-device drop
ok select pcib
ok 0 22004 config-w!
ok ff 2201c config-b!
ok 0 2201d config-b!
ok ffff 22020 config-w!
ok 0 22022 config-w!
ok 0 2203e config-w!
ok end-select-dev
ok boot

The system should boot without resetting.  Note that the above changes are lost if you subsequently reset the system, so keep this in mind when gathering data during this test.
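
To make the manual commands above easier to read, the following Python sketch decodes the hard-coded configuration addresses. It is an interpretation, not engineering documentation: it assumes the addresses follow the usual Open Firmware PCI binding layout (bus in bits 16-23, device in bits 11-15, function in bits 8-10, register in bits 0-7) and that the target is a standard PCI-to-PCI bridge (type 1 header), which is where the register names come from. The function names are hypothetical.

```python
# Register names from the standard PCI-to-PCI bridge (type 1) header.
# Assumption: the PLX bridge on the Peritek PMC uses this standard layout.
BRIDGE_REGISTERS = {
    0x04: "Command",
    0x1C: "I/O Base",
    0x1D: "I/O Limit",
    0x20: "Memory Base",
    0x22: "Memory Limit",
    0x3E: "Bridge Control",
}

# (value, config address, access width) taken from the manual commands.
DROPIN_WRITES = [
    (0x0000, 0x22004, "w"),
    (0x00FF, 0x2201C, "b"),
    (0x0000, 0x2201D, "b"),
    (0xFFFF, 0x22020, "w"),
    (0x0000, 0x22022, "w"),
    (0x0000, 0x2203E, "w"),
]


def decode_config_address(addr):
    """Split a config address into bus, device, function and register."""
    return {
        "bus": (addr >> 16) & 0xFF,
        "device": (addr >> 11) & 0x1F,
        "function": (addr >> 8) & 0x07,
        "register": addr & 0xFF,
    }


def describe_write(value, addr, width):
    """Return a one-line human-readable description of a config write."""
    fields = decode_config_address(addr)
    name = BRIDGE_REGISTERS.get(fields["register"], "unknown register")
    return "bus %d dev %d fn %d: %s (0x%02x) <- 0x%x (%s)" % (
        fields["bus"], fields["device"], fields["function"],
        name, fields["register"], value,
        "16-bit" if width == "w" else "8-bit")


if __name__ == "__main__":
    for value, addr, width in DROPIN_WRITES:
        print(describe_write(value, addr, width))
```

Under these assumptions every write targets bus 2, device 4, function 0. Clearing the Command register and programming each base above its limit closes the I/O and memory windows of that bridge, which matches the description of the dropin disabling everything behind it.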

Permanent dropin script installation instructions

ok add-dropin net:,/oci-peritek.di
Loading file: net:,\oci-peritek.di
Requesting Internet Address for 0:14:4f:16:df:fc
Requesting Internet Address for 0:14:4f:16:df:fc
Requesting Internet Address for 0:14:4f:16:df:fc
26a Bytes
Erasing FLASH...Done.
Programming FLASH... Done.
Verifying FLASH PROM Done. 

Notes

1> A graphics device has never been officially supported as one of the mux devices, although this may have worked in the past. In other words, you can run a graphics device, but there is no support for making that graphics device part of the console mux.

Note that this does not mean it does not or will not work; it just means that the QA group does not test that functionality.

2> Customer OBP settings example

ok printenv

Variable Name         Value                          Default Value
 
dhcp-clientid
multiplexer-output-devices  ttya ssp-serial          ttya ttye
multiplexer-input-devices  ttya ssp-serial           ttya ttye
front-phy?            false                          true
shutdown-temperature  80                             80
critical-temperature  75                             75
warning-temperature   70                             70
env-monitor           disabled                       disabled
ntp-server-addr       255.255.255.255                255.255.255.255
ntp-enable?           false                          false
auto-config-save?     true                           true
diag-passes           1                              1
diag-continue?        0                              0
diag-targets          0                              0
diag-verbosity        0                              0
post-on-sir?          false                          false
keyboard-click?       false                          false
keymap
scsi-initiator-id     7                              7
#power-cycles         25                             No default
system-board-serial#  005848                         No default
system-board-date     11/22/05                       No default
ttyb-rts-dtr-off      false                          false
ttyb-ignore-cd        true                           true
ttya-rts-dtr-off      false                          false
ttya-ignore-cd        true                           true
ttyb-mode             9600,8,n,1,-                   9600,8,n,1,-
ttya-mode             9600,8,n,1,-                   9600,8,n,1,-
cpci-probe-list       0,1,2,3,4,5,6,7,8,9,a,b, ...   0,1,2,3,4,5,6,7,8,9,a,b, ...
pcia-probe-list       1                              1
pcib-probe-list       1,2,3,4                        1,2,3,4
probe-delay           30                             30
mfg-mode              off                            off
diag-level            max                            max
watchdog-timeout      65535                          65535
watchdog-enable?      false                          false
fcode-debug?          false                          false
output-device         output-mux                     ttya
input-device          input-mux                      ttya
load-base             16384                          16384
auto-boot-retry?      true                           false
boot-command          boot                           boot
auto-boot?            false                          true
watchdog-reboot?      false                          false
network-boot-arguments
diag-file
diag-device           net                            net
boot-file
boot-device           net:dhcp,,,,,5 net2:dhcp ...   disk net
net-timeout           0                              0
ansi-terminal?        true                           true
screen-#columns       80                             80
screen-#rows          34                             34
local-mac-address?    true                           false
silent-mode?          false                          false
use-nvramrc?          false                          false
nvramrc
security-mode         none                           No default
security-password                                    No default
security-#badlogins   0                              No default
oem-logo                                             No default
oem-logo?             false                          false
oem-banner                                           No default
oem-banner?           false                          false
hardware-revision                                    No default
last-hardware-update                                 No default
diag-switch?          false                          false
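
When comparing a replacement satellite against a known-good one, it can help to list only the variables that differ from their defaults. The following Python sketch does this for a captured printenv listing. It is an illustration only: the function name is hypothetical, the sample is a short excerpt of the listing above, and it assumes that columns are separated by at least two spaces and that a row with only two columns has an empty value.

```python
import re

# Short excerpt of the printenv listing above, used as sample input.
SAMPLE = """\
multiplexer-output-devices  ttya ssp-serial          ttya ttye
front-phy?            false                          true
diag-level            max                            max
#power-cycles         25                             No default
security-password                                    No default
output-device         output-mux                     ttya
auto-boot?            false                          true
nvramrc
"""


def changed_settings(listing):
    """Return (name, value, default) for variables that differ from default.

    Assumptions: columns are separated by two or more spaces, two-column
    rows have an empty value, and 'No default' rows are not comparable.
    """
    changed = []
    for line in listing.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 3:
            name, value, default = fields
        elif len(fields) == 2:
            name, value, default = fields[0], "", fields[1]
        else:
            continue  # blank line, or a variable with no value and no default
        if default != "No default" and value != default:
            changed.append((name, value, default))
    return changed


if __name__ == "__main__":
    for name, value, default in changed_settings(SAMPLE):
        print("%-28s %-20s (default: %s)" % (name, value, default))
```

Long values that printenv truncates with "..." are compared as displayed, so such rows should still be checked by hand.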

References

<NOTE:1004140.1> - Netra[TM] CT410/CT810 Server: How to redirect console of CPU board to Alarm Card
HTTPS://SUPPORT.US.ORACLE.COM/HANDBOOK_INTERNAL/DEVICES/SYSTEM_BOARD/SYSBD_CP2160.HTML
@<NOTE:1348151.1> - Netra CT 410/810 Product Page (Makaha)
HTTPS://STBEEHIVE.ORACLE.COM/CONTENT/DAV/ST/VSP_DOCS_TOIS_PDFS/DOCUMENTS/VSP_TOIS/CT410/MA2160.SXIL

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.