Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-71-1470457.1
Update Date:2012-08-29

Solution Type  Technical Instruction Sure

Solution  1470457.1 :   Pillar Axiom: After Microsoft Patch Update - Microsoft Cluster Quorum Is Accessible By All Cluster Nodes.  


Related Items
  • Pillar Axiom 600 Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>Pillar Axiom>SN-DK: Ax600




In this Document
Goal
Fix


Created from <SR 3-5846022106>

Applies to:

Pillar Axiom 600 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Goal


After a Microsoft update was applied to Microsoft Windows 2003, one node of the Microsoft cluster corrupted the SCSI reservations on the quorum device, which allowed all nodes to access it and caused the cluster to fail. The issue occurs after the updated node is restarted: the node tries to join the existing cluster, but the join fails because of an RPC error when it tries to communicate with the running node.


NOTE: If this issue is not fixed, the MaxRep Engine loops continually while trying to access the active cluster node for I/O.


Messages on cluster (from Windows server logs, MPS Report): 

50010 00000c2c.00001024::2012/06/18-09:49:12.929 WARN [JOIN] JoinVersion data for sponsor YKR-MXSVR03 is invalid, status 1722.
50011 00000c2c.00001028::2012/06/18-09:49:12.929 INFO [JOIN] Sponsor YKR-MXSVR01 is not available (JoinVersion), status=1722.
50012 00000c2c.00001028::2012/06/18-09:49:12.929 WARN [JOIN] JoinVersion data for sponsor YKR-MXSVR01 is invalid, status 1722.
50013 00000c2c.0000102c::2012/06/18-09:49:14.059 INFO [JOIN] Sponsor 172.16.100.56 is not available (JoinVersion), status=1722.
50014 00000c2c.0000102c::2012/06/18-09:49:14.059 WARN [JOIN] JoinVersion data for sponsor 172.16.100.56 is invalid, status 1722.
50015 00000c2c.00000cb4::2012/06/18-09:49:14.059 INFO [JOIN] Got out of the join wait, CsJoinThreadCount = 1.
50016 00000c2c.00000cb4::2012/06/18-09:49:14.059 ERR  [JOIN] Unable to connect to any sponsor node.
50017 00000c2c.00000cb4::2012/06/18-09:49:14.059 WARN [INIT] Failed to join cluster, status 53
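The status codes in an excerpt like the one above can be summarized with a short script. The following is a minimal Python sketch, not part of the original solution: the function and variable names are hypothetical, the regular expressions are written only against the message formats shown in this excerpt, and the two code descriptions are the standard Windows system error codes for 53 and 1722.

```python
import re

# Standard Windows system error codes that appear in the excerpt.
WIN32_STATUS = {
    53: "ERROR_BAD_NETPATH: the network path was not found",
    1722: "RPC_S_SERVER_UNAVAILABLE: the RPC server is unavailable",
}

# Matches both "Sponsor X is not available (...), status=N" and
# "JoinVersion data for sponsor X is invalid, status N".
SPONSOR_RE = re.compile(
    r"\[JOIN\] .*?[Ss]ponsor (\S+) is (?:not available|invalid).*?status[= ](\d+)"
)
# Matches the final "[INIT] Failed to join cluster, status N" line.
JOIN_FAIL_RE = re.compile(r"\[INIT\] Failed to join cluster, status (\d+)")


def summarize_join_failures(lines):
    """Return ({sponsor: status}, join_status) for the given cluster log lines."""
    sponsors = {}
    join_status = None
    for line in lines:
        match = SPONSOR_RE.search(line)
        if match:
            sponsors[match.group(1)] = int(match.group(2))
            continue
        match = JOIN_FAIL_RE.search(line)
        if match:
            join_status = int(match.group(1))
    return sponsors, join_status


def describe(code):
    """Translate a status code, falling back to the raw number if unknown."""
    return WIN32_STATUS.get(code, "status %d (not in this table)" % code)


# Lines copied from the excerpt in this document.
SAMPLE = [
    "50010 00000c2c.00001024::2012/06/18-09:49:12.929 WARN [JOIN] JoinVersion data for sponsor YKR-MXSVR03 is invalid, status 1722.",
    "50011 00000c2c.00001028::2012/06/18-09:49:12.929 INFO [JOIN] Sponsor YKR-MXSVR01 is not available (JoinVersion), status=1722.",
    "50013 00000c2c.0000102c::2012/06/18-09:49:14.059 INFO [JOIN] Sponsor 172.16.100.56 is not available (JoinVersion), status=1722.",
    "50016 00000c2c.00000cb4::2012/06/18-09:49:14.059 ERR  [JOIN] Unable to connect to any sponsor node.",
    "50017 00000c2c.00000cb4::2012/06/18-09:49:14.059 WARN [INIT] Failed to join cluster, status 53",
]

sponsors, join_status = summarize_join_failures(SAMPLE)
for sponsor, code in sorted(sponsors.items()):
    print(sponsor, "->", describe(code))
if join_status is not None:
    print("join failed ->", describe(join_status))
```

Every sponsor failing with the same RPC status, followed by a network-path failure on the join itself, points to a connectivity problem rather than to a fault on any single sponsor node.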

The firewall prevented the restarted node from communicating over NetBIOS/SMB (TCP ports 139 and 445, UDP ports 137 and 138) and Remote Procedure Call (TCP port 135).
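A quick way to confirm whether the TCP ports named above are reachable is a simple connection probe. The sketch below is an illustration only, assuming Python 3 is available on a machine in the same network segment as the cluster nodes; the function names are hypothetical. It covers only the TCP ports, because UDP 137/138 cannot be verified reliably with a plain connection attempt.

```python
import socket

# TCP ports named in this document: RPC (135), NetBIOS session (139), SMB (445).
CLUSTER_TCP_PORTS = (135, 139, 445)


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe(host, ports=CLUSTER_TCP_PORTS, timeout=2.0):
    """Return {port: reachable} for each TCP port on the given host."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


# Example (run from the restarted node against a running sponsor node):
#   for port, ok in sorted(probe("YKR-MXSVR01").items()):
#       print(port, "open" if ok else "blocked or closed")
```

A port reported as blocked or closed does not by itself prove that the firewall is the cause, but if all three ports fail from the restarted node while the same probe succeeds from another host, the firewall configuration is the first thing to review.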

Fix

According to the Microsoft Escalation Engineer, the SCSI reservation on the active cluster node can be preserved by performing maintenance in the Windows Cluster environment in the order outlined below:

  • Completely disable the Firewall service and stop it
  • Disable the Cluster service and stop it
  • Upgrade APM for Windows to the latest version
  • Apply firmware and driver updates using the server vendor's maintenance CD (Smart Update Manager for HP, Dell OpenManage Server Update Utility, etc.)
  • Run Windows Update and apply the latest MPIO patches
  • Restart the cluster node
  • Enable and start the Cluster service
  • Make sure the firewall is configured correctly (see the MSDN articles about Firewall and Cluster) before enabling and starting the Firewall service
  • Fail over to the updated node
  • Verify access to the Cluster resources
  • Upgrade the rest of the nodes using the same method
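
The order of the steps above can be captured as a dry-run plan that prints each step with the service commands involved. This is a sketch under stated assumptions, not a tested procedure: the service short names `SharedAccess` (Windows Firewall/ICS on Windows Server 2003) and `ClusSvc` (Cluster service) are assumptions to verify on the target system, the `auto` start type is assumed and should be replaced with whatever was configured before maintenance, and all names in the code are hypothetical. Nothing is executed; steps without commands are manual.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    description: str
    commands: List[str]  # an empty list marks a manual step


FIREWALL_SVC = "SharedAccess"  # assumption: Windows Firewall/ICS service name
CLUSTER_SVC = "ClusSvc"        # assumption: Cluster service short name


def disable_and_stop(service):
    # "sc config" requires the space after "start=".
    return ["sc config %s start= disabled" % service, "net stop %s" % service]


def enable_and_start(service):
    # "auto" is an assumption; restore the start type that was set before.
    return ["sc config %s start= auto" % service, "net start %s" % service]


def maintenance_plan():
    """Return the maintenance steps for one node, in the documented order."""
    return [
        Step("Disable and stop the Firewall service", disable_and_stop(FIREWALL_SVC)),
        Step("Disable and stop the Cluster service", disable_and_stop(CLUSTER_SVC)),
        Step("Upgrade APM for Windows to the latest version", []),
        Step("Apply firmware and driver updates from the vendor maintenance CD", []),
        Step("Run Windows Update and apply the latest MPIO patches", []),
        Step("Restart the cluster node", []),
        Step("Enable and start the Cluster service", enable_and_start(CLUSTER_SVC)),
        Step("Verify the firewall configuration, then enable and start the Firewall service",
             enable_and_start(FIREWALL_SVC)),
        Step("Fail over to the updated node", []),
        Step("Verify access to the Cluster resources", []),
        Step("Upgrade the rest of the nodes using the same method", []),
    ]


def print_plan(plan):
    """Print a numbered checklist; commands are shown, never run."""
    for number, step in enumerate(plan, start=1):
        print("%2d. %s" % (number, step.description))
        for command in step.commands or ["(manual step)"]:
            print("      " + command)


print_plan(maintenance_plan())
```

Keeping the plan as data rather than as a script that runs commands makes the key ordering constraint easy to review: the Cluster service is started before the Firewall service is re-enabled, and the failover happens only after both.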

If issues persist, obtain MPS Reports and use Network Monitor or Wireshark to capture network traffic on the running nodes (both the team and heartbeat interfaces) while bringing the updated node online. Run a capture on the updated node as well.

The customer should send the logs to Microsoft and copy Oracle Axiom Support for further diagnostics.


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.