Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Troubleshooting Sure

Solution 1388821.1 : Exadata system is very slow - connect errors : ossnet: connection failed to server : a Case Study
Exadata system is very slow - connect errors : ossnet: connection failed to server : a Case Study

Applies to:
Oracle Exadata Hardware - Version 11.2.0.1 to 11.2.1.2.1 [Release 11.2]
Exadata Database Machine V2 - Version All Versions and later
Exadata Database Machine X2-2 Full Rack - Version All Versions and later
Exadata Database Machine X2-2 Half Rack - Version All Versions and later
Exadata Database Machine X2-2 Hardware - Version All Versions and later
Information in this document applies to any platform.
Exadata ossnet poor performance over network

Purpose
Note: The network or the InfiniBand fabric does not have to be the underlying source of the "ossnet: connection failed to server" messages; several different issues can produce this message in an Exadata configuration.
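As a first triage step (not part of the original note), the sketch below tallies the "ossnet: connection failed to server" messages per cell IP address from a database alert log, reporting how many failures each cell shows and over what time window. It assumes the message and timestamp formats shown in the alert log excerpt under Troubleshooting Steps below; the script name and the log path passed on the command line are hypothetical.

    # Sketch: tally ossnet connection failures per cell IP from an alert log.
    import re
    import sys
    from collections import defaultdict

    TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}")
    MSG_RE = re.compile(r"ossnet: connection failed to server (\d{1,3}(?:\.\d{1,3}){3})")

    def tally(path):
        stats = defaultdict(lambda: {"count": 0, "first": None, "last": None})
        current_ts = "unknown"
        with open(path, errors="replace") as f:
            for line in f:
                ts = TS_RE.match(line)
                if ts:
                    # Remember the most recent alert-log timestamp seen.
                    current_ts = ts.group(0)
                m = MSG_RE.search(line)
                if m:
                    s = stats[m.group(1)]
                    s["count"] += 1
                    s["first"] = s["first"] or current_ts
                    s["last"] = current_ts
        return stats

    if __name__ == "__main__":
        # e.g. python ossnet_tally.py alert_<SID>.log   (name and path hypothetical)
        for server, s in sorted(tally(sys.argv[1]).items()):
            print(f"{server}: {s['count']} failures between {s['first']} and {s['last']}")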
After ruling out other potential sources of this error and reviewing the OSWatcher logs, this problem pointed to IO saturation as a possible underlying source of the ossnet messages.
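A hedged sketch of the kind of OSWatcher review described above: it scans an OSWatcher iostat archive from a cell (plain "iostat -x" samples, optionally gzip-compressed) and prints the samples where a physical disk's %util crosses a threshold. The archive path in the usage comment, the device-name prefix, and the 90% threshold are assumptions, not values taken from this note.

    # Sketch: flag high %util samples in an OSWatcher iostat archive.
    import glob
    import gzip
    import sys

    DEVICES = ("sd",)          # physical disk names on the cell (sda, sdb, ...)
    UTIL_THRESHOLD = 90.0      # flag samples at or above this %util

    def open_maybe_gz(path):
        return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

    def scan(pattern):
        for path in sorted(glob.glob(pattern)):
            with open_maybe_gz(path) as f:
                for line in f:
                    fields = line.split()
                    if fields and fields[0].startswith(DEVICES):
                        try:
                            util = float(fields[-1])   # %util is the last iostat -x column
                        except ValueError:
                            continue
                        if util >= UTIL_THRESHOLD:
                            print(f"{path}: {fields[0]} %util={util}")

    if __name__ == "__main__":
        # e.g. python osw_util.py "/opt/oracle.oswatcher/osw/archive/oswiostat/*"  (path assumed)
        scan(sys.argv[1])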
Troubleshooting Steps
SR 3-4874627731 ...
Tue Dec 06 17:49:06 2011 OSPID: 26595: connect: ossnet: connection failed to server 192.168.10.1, result=5 (login: sosstcpreadtry failed) (difftime=1953)
Tue Dec 06 17:49:06 2011 OSPID: 26590: connect: ossnet: connection failed to server 192.168.10.1, result=5 (login: sosstcpreadtry failed) (difftime=1953)
Tue Dec 06 17:49:06 2011 OSPID: 26588: connect: ossnet: connection failed to server 192.168.10.1, result=5 (login: sosstcpreadtry failed) (difftime=1953)
Tue Dec 06 18:00:07 2011 OSPID: 19278: connect: ossnet: connection failed to server 192.168.10.1, result=5 (login: sosstcpreadtry failed) (difftime=1955)
...

OSW iostat for Cells
------------------------
The OSWatcher archives for at least 4 of the 6 cells were reviewed. The iostat data shows a similar picture on each cell, including the devices sda and sdb. The columns reviewed were:

Device rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util

Points of interest
- The disk utilization of some disks is at ~100% (the figures quoted here are from the sample showing the highest utilization).
- The average request size is around 1 MB (avgrq-sz * 512 bytes).
- The disks are doing around 120 IOPS. Evaluating these numbers depends on the disk type, High Capacity (HC, 2 TB) or High Performance (HP, 650 GB):
  for HC, max ~148 IOPS per disk
  for HP, max ~297 IOPS per disk
- In terms of throughput, the disks are delivering around 107 MB/s (rsec/s * 512 bytes). The expected numbers for these two disk types are:
  for HC, ~85 MB/s per disk
  for HP, ~152 MB/s per disk
  (A worked example of this sector-to-byte arithmetic appears at the end of this document.)

Further investigation narrowed the source of the problem to a few operations running at the time of the problem. AWR and ASH reports helped identify two notable sources of high IO, both using high degrees of parallelism at the time of the problem(s):

- Informatica load with 16-32 DOP
- RMAN allocating 32 channels at the same time

IO saturation was suspected as the cause of the OSSNET error messages. Further research determined the IO saturation was due to the Informatica load, which constantly referenced a large key lookup table.

PROBLEM SOURCE:
The problem source was excessive IO caused by poor / inefficient Informatica queries against a heavily used lookup table containing several million rows. The heavy use of this lookup table led to the excessive IO.

RESOLUTION:
The poor performance and OSSNET messages were resolved by adding indexes to the lookup table, making the Informatica query less IO intensive and more efficient.
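For reference, a small worked example of the sector-to-byte arithmetic used in the Points of interest above: iostat reports avgrq-sz and rsec/s in 512-byte sectors, so both are multiplied by 512 to get bytes. The sample input values are illustrative, chosen only to reproduce the rough figures quoted above (~1 MB requests, ~120 IOPS, ~107 MB/s); the per-disk limits are the HC/HP numbers quoted in this note.

    # Worked example: convert iostat figures to MB and compare to per-disk limits.
    SECTOR_BYTES = 512
    PER_DISK_LIMITS = {"HC": {"iops": 148, "mb_s": 85},    # High Capacity, as quoted above
                       "HP": {"iops": 297, "mb_s": 152}}   # High Performance, as quoted above

    def summarize(r_per_s, w_per_s, rsec_per_s, avgrq_sz, disk_type="HC"):
        iops = r_per_s + w_per_s
        avg_request_mb = avgrq_sz * SECTOR_BYTES / 1024 / 1024    # avgrq-sz is in sectors
        read_mb_s = rsec_per_s * SECTOR_BYTES / 1024 / 1024       # rsec/s is in sectors/s
        lim = PER_DISK_LIMITS[disk_type]
        print(f"avg request size : {avg_request_mb:.2f} MB")
        print(f"IOPS             : {iops:.0f}  (per-disk {disk_type} limit ~{lim['iops']})")
        print(f"read throughput  : {read_mb_s:.0f} MB/s  (per-disk {disk_type} limit ~{lim['mb_s']} MB/s)")

    # Illustrative values matching the rough figures discussed above
    summarize(r_per_s=110, w_per_s=10, rsec_per_s=219136, avgrq_sz=2048, disk_type="HC")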