Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Solution Type: FAB (standard) Sure Solution

1022238.1 : Sun Storage 7410 recovery procedure for mismatched network device names.

PreviouslyPublishedAs: 278131
Bug Id: <SUNBUG: 6811589>
Product: Sun Storage 7410 Unified Storage System
Date of Resolved Release: 04-Mar-2010

7410 recovery procedure for mismatched network device names (see details below).

Impact

If network interfaces in the two head nodes of a Sun Storage 7410 cluster have mismatched device names (between the heads), cascading problems can occur. Relocating the cards to identical slot locations after their initial discovery may not resolve the device name mismatch.

If this condition is left unresolved, it will lead to cascading failure modes within the cluster software. These failure modes can include loss of network configuration ability (CR 6811589) and peer node reboots upon failback (due to the inability to import/export resources related to the mismatched network device names).

Contributing Factors

The network device name mismatch condition will occur if the following incorrect sequence is followed:

1. Power down both Head_A and Head_B.
2. Install a NIC in slot(i) on Head_A, where slot(i) is a valid slot for the NIC. Example: 10Gb NIC installed into PCIe slot 3.
3. Boot Head_A (the newly installed devices get assigned names, e.g., the 10Gb NIC is assigned device names nxge0 and nxge1).
4. Power down Head_A (for example, because the administrator decides to use a different slot for the card).
5. Move the NIC from slot(i) to slot(j), where j is a different (but also valid) slot for the NIC. The names assigned to the NIC devices on Head_A will now change, e.g., they may now be nxge4 and nxge5.
6. Install the same NIC model into slot(j) of Head_B.
7. Boot both heads.
8. At this point the device names of the network devices do not match between the two heads. Example: The newly discovered 10Gb NIC in Head_B is assigned device names nxge0 and nxge1, while the relocated card in Head_A retains the names nxge4 and nxge5.

The correct procedure for installing NICs into 7410 cluster nodes is this:

1. Power down both Head_A and Head_B.
2. Install the NIC in a valid slot (must be the same slot on each head).
3. Boot both Head_A and Head_B (order isn't important).
4. Confirm that the new network devices show up on each head and that the device names match, e.g., nxge0 and nxge1 appear on both heads after installing a 10Gb NIC in PCIe slot 2.

Symptoms

Cluster nodes have different device names for the same corresponding NIC ports across heads. Example: There is a single 2-port 10Gb NIC installed in PCIe slot 2 on each head. In one head, the devices are named nxge0 and nxge1, while on the other head they have the names nxge4 and nxge5. This happens with the above incorrect sequence of installing cards, booting, and relocating cards.

If the failure mode is allowed to persist, cascading symptoms in the cluster software will present themselves. These include BUI exceptions in the network configuration screen, which block the ability to make network changes (CR 6811589), and automatic reboot of either the OWNER or the STRIPPED head during failback (due to the cluster's inability to import/export resources).

Root Cause

The cluster software requires that hardware be identical between both head nodes. This requirement extends to the device names, which are assigned at discovery time. An incorrect installation sequence leads to mismatched device names across heads, which the cluster software cannot (currently) tolerate.

Corrective Action

Workaround:

Remove the NICs that carry the mismatched device names. The cluster can operate in this state (without the offending cards) until a suitable window for the factory reset recovery sequence can be scheduled.

Resolution:

A factory reset plus manual modification of the /etc/devices/path_to_inst file on each head is required to recover from this failure mode. The cluster will lose all configuration (as with any factory reset), but the storage pools (projects, shares, snapshots, etc.) can be preserved.

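Because the factory reset discards all configuration, it may be worth confirming the mismatch described under Symptoms before scheduling it. The following is a minimal sketch of such a check, assuming the network device names seen on each head have already been saved to two plain text files, one name per line. The function name compare_devnames, the file names, and the way the names are collected are illustrative assumptions and are not part of the appliance software.

```shell
#!/bin/sh
# Hypothetical helper: compare the network device names recorded for
# Head_A and Head_B. Each input file holds one device name per line.
# Prints MATCH (exit 0) or MISMATCH plus the differing names (exit 1).
compare_devnames() {
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }

    # comm needs sorted input; -u also drops accidental duplicates.
    sort -u "$1" > "$tmp_a"
    sort -u "$2" > "$tmp_b"

    only_a=$(comm -23 "$tmp_a" "$tmp_b")   # names present only on Head_A
    only_b=$(comm -13 "$tmp_a" "$tmp_b")   # names present only on Head_B
    rm -f "$tmp_a" "$tmp_b"

    if [ -z "$only_a" ] && [ -z "$only_b" ]; then
        echo "MATCH"
        return 0
    fi

    echo "MISMATCH"
    if [ -n "$only_a" ]; then
        echo "only on Head_A: $(printf '%s\n' "$only_a" | paste -s -d ' ' -)"
    fi
    if [ -n "$only_b" ]; then
        echo "only on Head_B: $(printf '%s\n' "$only_b" | paste -s -d ' ' -)"
    fi
    return 1
}
```

With the example from the Symptoms section (nxge4 and nxge5 recorded for one head, nxge0 and nxge1 for the other), calling compare_devnames on the two files reports MISMATCH and lists the names unique to each head.
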
The recovery sequence is as follows:

Note: You must have a serial console connection to each head to proceed.

1. Unconfigure storage for all pools (pools can be imported after the factory reset).
2. Issue the 'factoryreset' command *simultaneously* from the maintenance system context in the CLI on both heads.
3. Allow both heads to reboot and come up to the "Press any key to begin configuration" stage.
4. Press a key on each console to start the aktty setup phase (this differs from the normal cluster install procedure but is required for the manual modification of the path_to_inst file in the next steps).
5. Press Esc+9 in both console windows, and enter 'bash' at the aktty# prompt to start a shell.
6. On each head, remove any lines from the file /etc/devices/path_to_inst that contain the device type of the mismatched devices. For example, if the mismatched devices are nxge, delete any lines with the string 'nxge' in them; if the mismatched devices are e1000g, delete any lines with the string 'e1000g' in them. Then confirm that there are no more entries matching the offending device type (the final grep should print nothing). For example:

   bash# cd /etc/devices
   bash# cp path_to_inst path_to_inst.old
   bash# grep -v -w e1000g path_to_inst.old > path_to_inst
   bash# grep -w e1000g /etc/devices/path_to_inst
   bash#

7. Reboot each head using 'reboot'. For example:

   bash# reboot

8. The heads should come back to the "Press any key to begin configuration" step. Begin normal cluster installation at this point, i.e., begin aktty configuration on one head while leaving the other at the "Press any key..." step. If pools were previously configured with projects and shares, they can be imported during the configure storage step.

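The backup, filter, and verify commands of step 6 can also be wrapped in a small function so that the same edit is applied identically on both heads. This is only a sketch under stated assumptions: the function name strip_driver is hypothetical, and it relies on nothing beyond the cp and grep options already used in the procedure. Try it on a scratch copy of the file before using it on /etc/devices/path_to_inst.

```shell
#!/bin/sh
# Hypothetical helper mirroring step 6: back up FILE to FILE.old, drop every
# line containing DRIVER as a whole word, then verify that none remain.
# Usage: strip_driver FILE DRIVER
strip_driver() {
    file=$1
    driver=$2

    # Keep the original so the edit can be undone.
    cp "$file" "$file.old" || return 2

    # grep -v exits 1 when no lines survive the filter; only >1 is an error.
    grep -v -w "$driver" "$file.old" > "$file" || [ $? -eq 1 ] || return 2

    # Verification: this grep must find nothing.
    if grep -w "$driver" "$file" > /dev/null; then
        echo "entries for $driver remain in $file" >&2
        return 1
    fi

    removed=$(( $(wc -l < "$file.old") - $(wc -l < "$file") ))
    echo "removed $removed line(s) for $driver"
}
```

For the e1000g example above, the call would be strip_driver /etc/devices/path_to_inst e1000g, run from the bash shell on each head before the reboot in step 7.
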
References:

BugID: 6811589
Escalation ID: 70997414

Internal Eng Business Unit Group: NWS (Storage)

Attachments: This solution has no attachment