Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type FAB (standard) Sure Solution 1022287.1 : Update to Service Processor firmware to resolve hangs and related symptoms.
PreviouslyPublishedAs 279330
Bug Id <SUNBUG: 6921482>
Product
  Sun Storage 7110 Unified Storage System
  Sun Storage 7210 Unified Storage System
  Sun Storage 7310 Unified Storage System
  Sun Storage 7410 Unified Storage System
Date of Workaround Release 15-Apr-2010

Older versions of SP f/w can leak memory (see details below).

Impact

Older versions of the Service Processor firmware can leak memory, eventually resulting in a variety of issues as listed in the Symptoms section.

Contributing Factors

The platforms listed above are impacted by this issue when their Service Processor firmware is older than the levels described in this document.

When present, the issues surface somewhere between 30 and 60 days of uptime. There is some variation in the time between failures, their severity, and even whether or not they occur at a particular site. The reasons for these variations are not known at this time.

Symptoms

  Cannot connect to Service Processor via serial or network.
  Service Processor absent from hardware details page in BUI.
  Alert: Service Processor has stopped responding to requests.
  Directories, such as /SYS, missing from SP interface.
  Fans in server node running continuously at full speed.
  Slow throughput to system disks (due to fan vibration).
  Time out during software upgrade (due to system disks/fan vibration).

Root Cause

Root Cause is attributed to a number of CRs for memory leaks on the Service Processor. Defects (memory leaks) in the Service Processor firmware lead to an out of memory condition, and an inability to respond to requests. The condition deteriorates until the Service Processor is reset.

Corrective Action

Workaround:

The appliance software, as of version 2009.Q3, has a mechanism to reset the Service Processor every 60 days, or sooner if it becomes unresponsive. This is sufficient to prevent the issues on the majority of systems. For systems that experience the problems described above, use the following procedure:

First, ensure the Service Processor is responding.
This is best done by resetting the Service Processor. Use one of the following two methods:

  Enter "maintenance hardware select chassis-000 select sp reset" at the appliance kit shell.

  Download http://tsc-storage.us/products/AmberRoad/download/spreset.akwf, then install and run it from the Maintenance/Workflows screen in the BUI. Consult the appliance help under Maintenance/Workflows for assistance with this process. After executing the workflow, and ensuring that it ran successfully, delete it from the customer system.

This process takes some time, on the order of five minutes. The main external indication that the reset has completed is that the fans spin down to a normal speed. You can also monitor progress for any of these operations via a serial connection to the SP.

Next, verify that the Service Processor has been reset, via the Alert Log. You should see that the Service Processor either stopped, then resumed responding to requests, or simply resumed, in the case of a Service Processor that was previously unresponsive.

Download the correct BIOS and Service Processor firmware for the system being serviced, as follows:

  For Sun Storage 7110, 7310, 7410: http://tsc-storage.us/products/AmberRoad/download/0ABMN064-r45008.pkg
  For Sun Storage 7210: http://tsc-storage.us/products/AmberRoad/download/0ABNF032-r45117.pkg

Connect to the Service Processor via ssh using root credentials. Use this interface to shut down the head you are working on with "stop /SYS".

Connect to the Service Processor IP address via browser and provide the root login credentials. Follow these steps to upgrade the Service Processor and BIOS:

1. Click on Maintenance tab
2. Firmware Upgrade will be the default and correct subtab
3. Click on "Enter Upgrade Mode"
4. Confirm this action with the pop up
5. Click on "Browse" and select the appropriate image from your local filesystem
6. Click on "Upload"
7. Wait for upload to complete and the verification to succeed
8. You will now see a Summary Table of the SP firmware and BIOS versions (Existing vs New). Confirm that "Preserve existing configuration" is checked for the SP Firmware
9. Click on "Start Upgrade"
10. Confirm this action with the pop up
11. Now wait for the upgrade to proceed. If the head was up at this point, it will be cleanly shut down. Warning! Do not interrupt the update. Leave the browser undisturbed until the update is complete.
12. When finished, you will see "Upgrade Complete" and the SP will reboot.

The SP firmware and BIOS will now have been updated to the correct 7000 version.

Now you must configure some specific BIOS settings. Boot the head and enter setup with:

  -> start /SYS
  Are you sure you want to start /SYS (y/n)? y
  Starting /SYS
  -> start /SP/console
  Are you sure you want to start /SP/console (y/n)? y
  Serial console started. To stop, type ESC (

Once you see the initial BIOS banner, hit CONTROL-E a few times; this will trigger the BIOS Setup menu after the initialization. You can drop back to the SP with ESC-(

NOTE: That is Escape followed by an open parenthesis, which is usually on shift 9.

NOTE: If the initialization hangs on a 7310/7410, and it is part of a cluster with the other head up and in service, disconnect the SAS cables to the J4400 JBODs, drop back to the SP and reset with:

  Serial console stopped.
  -> reset /SYS
  Are you sure you want to reset /SYS (y/n)? y
  Performing hard reset on /SYS
  -> start /SP/console
  Are you sure you want to start /SP/console (y/n)? y
  Serial console started. To stop, type ESC (

If you use this workaround, be very certain to reconnect the SAS cables immediately after correcting the BIOS settings.

Once into the BIOS Setup screen, start by loading factory defaults. To do this, use the right arrow key to move over to the "Exit" menu. Down arrow to "Load Optimal Defaults" and press return, then press return again to confirm the popup asking "Load Optimal Defaults".
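The PCIPnP portion of the per-appliance instructions that follow differs only in the number of slots and in which slots stay enabled. As a cross-check aid, the Python sketch below restates those target states as data and prints the lines expected on the PCIPnP screen for one model. The names `PCIPNP_TARGETS` and `expected_settings` are hypothetical and are not part of any Sun tool; the slot values are transcribed from this FAB and should be verified against the section for the appliance being serviced.

```python
# Per-model PCIPnP targets transcribed from this FAB. The dictionary layout
# and every name below are illustrative assumptions, not a Sun utility.
PCIPNP_TARGETS = {
    "7110": {"slots": 6, "oprom_enabled": {0}, "io_enabled": {0, 5}},
    "7210": {"slots": 3, "oprom_enabled": set(), "io_enabled": {1}},
    "7310": {"slots": 3, "oprom_enabled": set(), "io_enabled": {0}},
    "7410": {"slots": 6, "oprom_enabled": set(), "io_enabled": {4, 5}},
}


def expected_settings(model):
    """Return the PCIPnP lines expected after the BIOS edits for a model."""
    target = PCIPNP_TARGETS[model]
    lines = []
    groups = (
        ("Scanning OPROM", target["oprom_enabled"]),
        ("IO Allocation", target["io_enabled"]),
    )
    for label, enabled_slots in groups:
        for slot in range(target["slots"]):
            state = "Enabled" if slot in enabled_slots else "Disabled"
            lines.append(f"{label} on PCI-E Slot{slot} {state}")
    return lines


if __name__ == "__main__":
    # Print a checklist to compare against the BIOS Setup screen.
    for line in expected_settings("7410"):
        print(line)
```

The sketch covers only the Option-ROM and I/O allocation settings. The "Hard Disk Drives" boot list on the 7310 and 7410 still has to be checked by hand.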
Now follow the specific instructions for the appropriate appliance:

For Sun Storage 7110:

  Disable PCIPnP Option-ROM scanning for slots 1-5
  Disable I/O allocation

Use the right arrow key to page over to "PCIPnP" menu. Use the down arrow to highlight:

  Scanning OPROM on PCI-E Slot1 Enabled

Press return and select "Disabled". This will now appear as:

  Scanning OPROM on PCI-E Slot1 Disabled

Repeat this for slots 2-5 (the last slot is off the bottom of the screen). You should now have:

  Scanning OPROM on PCI-E Slot0 Enabled
  Scanning OPROM on PCI-E Slot1 Disabled
  Scanning OPROM on PCI-E Slot2 Disabled
  Scanning OPROM on PCI-E Slot3 Disabled
  Scanning OPROM on PCI-E Slot4 Disabled
  Scanning OPROM on PCI-E Slot5 Disabled

Just below these OPROM settings are a group of settings which allow IO allocation to be disabled per-slot. Disable PCI-E slots 1-4. Only slots 0 and 5 should be enabled. It should look like:

  IO Allocation on PCI-E Slot0 Enabled
  IO Allocation on PCI-E Slot1 Disabled
  IO Allocation on PCI-E Slot2 Disabled
  IO Allocation on PCI-E Slot3 Disabled
  IO Allocation on PCI-E Slot4 Disabled
  IO Allocation on PCI-E Slot5 Enabled

On boot, you will see the following warning message from the BIOS:

  Warning: IO resource not allocated

This is an expected message and does not indicate a failure.

Exiting BIOS Setup

Use right arrow to page over to "Exit". Press return for the default "Save Changes and Exit", and return again to confirm the action with the pop up.

For Sun Storage 7210:

  Disable PCIPnP Option-ROM scanning for all slots
  Disable I/O allocation

Use the right arrow key to page over to "PCIPnP" menu. Use the down arrow to highlight:

  Scanning OPROM on PCI-E Slot0 Enabled

Press return and select "Disabled". This will now appear as:

  Scanning OPROM on PCI-E Slot0 Disabled

Repeat this for slots 1 and 2.
You should now have:

  Scanning OPROM on PCI-E Slot0 Disabled
  Scanning OPROM on PCI-E Slot1 Disabled
  Scanning OPROM on PCI-E Slot2 Disabled

Just below these OPROM settings are a group of settings which allow IO allocation to be disabled per-slot. Disable PCI-E slots 0 and 2. Only slot 1 should be enabled. It should look like:

  IO Allocation on PCI-E Slot0 Disabled
  IO Allocation on PCI-E Slot1 Enabled
  IO Allocation on PCI-E Slot2 Disabled

On boot, you will see the following warning message from the BIOS:

  Warning: IO resource not allocated

This is an expected message and does not indicate a failure.

Exiting BIOS Setup

Use right arrow to page over to "Exit". Press return for the default "Save Changes and Exit", and return again to confirm the action with the pop up.

For Sun Storage 7310:

  Disable PCIPnP Option-ROM scanning for all slots
  Disable I/O allocation
  Configure boot drives

Use the right arrow key to page over to "PCIPnP" menu. Use the down arrow to highlight:

  Scanning OPROM on PCI-E Slot0 Enabled

Press return and select "Disabled", followed by return. This will now appear as:

  Scanning OPROM on PCI-E Slot0 Disabled

Repeat this for slots 1-2. You should now have:

  Scanning OPROM on PCI-E Slot0 Disabled
  Scanning OPROM on PCI-E Slot1 Disabled
  Scanning OPROM on PCI-E Slot2 Disabled

Just below these OPROM settings are a group of settings which allow IO allocation to be disabled per-slot. Disable PCI-E slots 1 and 2. Only slot 0 should be enabled. It should look like:

  IO Allocation on PCI-E Slot0 Enabled
  IO Allocation on PCI-E Slot1 Disabled
  IO Allocation on PCI-E Slot2 Disabled

Next, arrow over to the Boot menu. Select the last item: "Hard Disk Drives" and press return.
The list should include only 2 drives (the 2 internal SATA drives) with labels like:

  SATA:11M-<drive model>
  SATA:12M-<drive model>

If this list includes anything else (such as readzilla cache devices with a 'STEC MACH8' string, or JBOD attached drives) you will need to remove them from the list by selecting the boot position and setting it to 'Disabled' for each of the non-boot drives. If the list is full (with 16 drives) you will not be able to edit the list. However, the change to the OPROM settings above will cause the JBOD drives to disappear from the list on the next boot. You will need to exit and save changes and immediately re-enter the BIOS menu on the next boot (CTRL-E).

Exiting BIOS Setup

Once you've removed any readzilla cache or JBOD drive entries from the "Hard Disk Drives" list, perform the following:

  Press ESC to exit the "Hard Disk Drives" menu, then arrow right to the "Exit" menu.
  Press return for the default "Save Changes and Exit", and return again to confirm the action with the pop up.

On boot, you will see the following warning message from the BIOS:

  Warning: IO resource not allocated

This is an expected message and does not indicate a failure.

For Sun Storage 7410:

  Disable PCIPnP Option-ROM scanning for all slots
  Disable I/O allocation
  Configure boot drives

Use the right arrow key to page over to "PCIPnP" menu. Use the down arrow to highlight:

  Scanning OPROM on PCI-E Slot0 Enabled

Press return and select "Disabled", followed by return. This will now appear as:

  Scanning OPROM on PCI-E Slot0 Disabled

Repeat this for slots 1-5 (the last slot is off the bottom of the screen).
You should now have:

  Scanning OPROM on PCI-E Slot0 Disabled
  Scanning OPROM on PCI-E Slot1 Disabled
  Scanning OPROM on PCI-E Slot2 Disabled
  Scanning OPROM on PCI-E Slot3 Disabled
  Scanning OPROM on PCI-E Slot4 Disabled
  Scanning OPROM on PCI-E Slot5 Disabled

Just below these OPROM settings (they are actually off the bottom of the screen and you will need to scroll down) are a group of settings which allow IO allocation to be disabled per slot. Disable PCI-E slots 0-3, checking that slots 4 and 5 are Enabled. It should look like:

  IO Allocation on PCI-E Slot0 Disabled
  IO Allocation on PCI-E Slot1 Disabled
  IO Allocation on PCI-E Slot2 Disabled
  IO Allocation on PCI-E Slot3 Disabled
  IO Allocation on PCI-E Slot4 Enabled
  IO Allocation on PCI-E Slot5 Enabled

Next, arrow over to the Boot menu. Select the last item: "Hard Disk Drives" and press return. The list should include only 2 drives (the 2 internal SATA drives) with labels like:

  SATA:11M-<drive model>
  SATA:12M-<drive model>

If this list includes anything else (such as readzilla cache devices with a 'STEC MACH8' string, or JBOD attached drives) you will need to remove them from the list by selecting the boot position and setting it to 'Disabled' for each of the non-boot drives. If the list is full (with 16 drives) you will not be able to edit the list. However, the change to the OPROM settings above will cause the JBOD drives to disappear from the list on the next boot. You will need to exit and save changes and immediately re-enter the BIOS menu on the next boot (CTRL-E).

Exiting BIOS Setup

Once you've removed any readzilla cache or JBOD drive entries from the "Hard Disk Drives" list, perform the following:

  Press ESC to exit the "Hard Disk Drives" menu, then arrow right to the "Exit" menu.
  Press return for the default "Save Changes and Exit", and return again to confirm the action with the pop up.
On boot, you will see the following warning message from the BIOS:

  Warning: IO resource not allocated

This is an expected message and does not indicate a failure.

Resync SP Password

Finally, resync the SP password to match the root password of the NAS head. Have the customer complete this final step. Remember, you exit back to the SP using ESC-(

  -> cd /SP/users/root
  /SP/users/root
  -> set password
  Enter new password: *********
  Enter new password again: *********

NOTE: The SP has a minimum password length of 8, which is not enforced by the appliance system software. For example, if the customer has a very short password such as "abc", the SP will reject it. To resolve this, the customer will need to set a new password from the appliance, which in turn will update the SP password directly.

Resolution:

In a future release, in-band Service Processor updates will be supported. At that point in time, the reset procedure will be removed from the appliance software, and avoiding these issues will be as simple as keeping the system software up to date.

Identification of Affected Parts (how to):

Connect via ssh to the Service Processor and supply root credentials. The SP version will be displayed as part of the logon banner. The current version for the 7110, 7310 and 7410 is 2.0.2.16. Version 2.0.2.15 is current for the 7210. Any prior version is susceptible to these issues. Note that checking the SP version via other means, such as the administrative BUI, can be unreliable. Due to a bug in some releases, version 2.0.2.16 may also be displayed as 2.0.2.22.

Comments

Version 2.0.2.16 is the latest supported version of the Service Processor firmware for the 7110, 7310 and 7410. Version 2.0.2.15 is the latest supported version for the 7210. Newer versions should not be used unless specifically tested and released for the appliance.
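The version rules above can be written down as a small decision table. The following Python sketch is illustrative only: the function name `classify_sp_version`, the result strings, and the choice to apply the 2.0.2.22 display quirk only to models whose current version is 2.0.2.16 are assumptions made for this example, and the input is expected to be the version string read from the SP logon banner.

```python
# Current SP firmware levels as stated in this FAB.
CURRENT_SP_VERSION = {
    "7110": (2, 0, 2, 16),
    "7310": (2, 0, 2, 16),
    "7410": (2, 0, 2, 16),
    "7210": (2, 0, 2, 15),
}

# Display quirk noted in this FAB: 2.0.2.16 may be shown as 2.0.2.22.
DISPLAY_ALIAS = {(2, 0, 2, 22): (2, 0, 2, 16)}


def parse_version(text):
    """Turn a dotted version string such as '2.0.2.16' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def classify_sp_version(model, reported):
    """Classify a reported SP version as 'affected', 'current' or 'escalate'."""
    current = CURRENT_SP_VERSION[model]
    version = parse_version(reported)
    # Assumption: the alias only applies where 2.0.2.16 is the current level.
    if DISPLAY_ALIAS.get(version) == current:
        version = current
    if version < current:
        return "affected"   # older firmware, susceptible to the memory leaks
    if version == current:
        return "current"    # at the supported level
    return "escalate"       # newer than the supported level for this model


if __name__ == "__main__":
    print(classify_sp_version("7410", "2.0.2.22"))
```

Tuple comparison keeps the numeric ordering correct, so that, for example, 2.0.2.9 sorts below 2.0.2.16, which a plain string comparison would get wrong.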
If a newer version is found, with the exception noted above in the "Identification of Affected Parts" section, you should escalate the case to TSC Backline, and additionally report the version found and serial number of the system to those listed as Contributor and Responsible Manager in the "Contacts" section below of this FAB.

This procedure assumes that the Service Processor has been configured with an IP address. If this has not been done, refer to the appliance documentation under "Installation".

There is no minimum system software requirement to run this procedure; however, the customer should always follow the standard guideline of running no more than one major version behind the current release.

This can be done in a "rolling" fashion on cluster systems: simply perform the procedure on one node at a time, and the clustering software will move resources to the partner node.

For information about FAB documents, their release processes, implementation strategies and billing information, go to the following URL:
For Sun Authorized Service Providers go to:
In addition to the above you may email:

Internal Contributor/submitter [email protected]
Internal Eng Responsible Engineer [email protected]
Responsible Manager: [email protected]
Internal Services Knowledge Engineer [email protected]
Internal Eng Business Unit Group NWS (Network Storage)

Internal Sun Alert & FAB Admin Info
  06-Apr-2010: Completed draft and sent to Extended Review.
  08-Apr-2010: On-hold awtg submitter corrections per feedback from Ext Rvw.
  15-Apr-2010: Submitter provided corrections - sending to Publish.

Attachments
This solution has no attachment