Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1388821.1
Update Date: 2012-09-28
Keywords:

Solution Type: Troubleshooting Sure

Solution  1388821.1 :   Exadata system is very slow - connect errors : ossnet: connection failed to server : a Case Study  


Related Items
  • Exadata Database Machine X2-2 Qtr Rack
  • Oracle Exadata Hardware
  • Exadata Database Machine X2-2 Full Rack
  • Exadata Database Machine X2-8
  • Exadata Database Machine X2-2 Half Rack
  • Exadata Database Machine X2-2 Hardware
  • Exadata Database Machine V2

Related Categories
  • PLA-Support>Database Technology>Engineered Systems>Oracle Exadata>DB: Exadata_EST
  • .Old GCS Categories>ST>Server>Engineered Systems>Exadata>Performance, Tuning and Hanging issues


Exadata system is very slow - connect errors : ossnet: connection failed to server : a Case Study

Applies to:

Oracle Exadata Hardware - Version 11.2.0.1 to 11.2.1.2.1 [Release 11.2]
Exadata Database Machine V2 - Version All Versions and later
Exadata Database Machine X2-2 Full Rack - Version All Versions and later
Exadata Database Machine X2-2 Half Rack - Version All Versions and later
Exadata Database Machine X2-2 Hardware - Version All Versions and later
Information in this document applies to any platform.
Exadata
ossnet
poor performance over network


Purpose

Note: The network or InfiniBand fabric does not have to be the underlying source of the "ossnet: connection failed to server" messages;
the errors can be a symptom of a more basic problem that impacts the network.
Perhaps the most important lesson of this note is that there are several potential sources for the "ossnet: connection failed to server" message,
and the solution can range from bug patches to simple tuning or configuration changes.

Here are a few known bugs associated with the "ossnet: connection failed to server" message in an Exadata configuration:

  1. Unpublished bug 9338087
    - The fix was included in Exadata patch 11.2.1.3.1.
    Symptom: Cellsrv would not accept new connections.
    Cause: cellsrv was unable to keep up with the rate of incoming connection requests.
    Fix: Improved efficiency in handling incoming connection requests.
  2. Unpublished Bug 9176360 - REMOTESENDPORTS IN THE IMPLICIT FENCING FOR NO DISKMON
    - The fix was included in Exadata patch 11.2.1.3.

    Cellsrv leaks an internal memory resource when a client tries to access Cellsrv before a disk monitor (DSKM) instance has been initialized.
    
    REDISCOVERY INFORMATION:
    
    * On the storage cell, CELLSRV will fail with a 7445 error.
    * Also, the alert.log will be full of the following message:
    
    "...Information: implicit fencing: AntMaster reid is not presented, diskmon has not yet registered with Cellsrv
    Information: Cellsrv dropping OpenDisk request for implicit fencing, 
    host nhedwhhpxdb03pd.nhg.local[pid:23780], disk MDATA_CD_1_nhedwhhpxss07pd, 
    reid cid=49fd62900e784f4dbf82d69b0f2247d7,icin=148859737,nmn=3,lnid=148859737,
    gid=11,gin=1,gmn=1,umemid=1,opid=61,opsn=177,lvl=process..."
    
    * Storage cell may run out of memory or swap.
    
  3. Unpublished Bug 8867420/ Unpublished Bug 8801965/ Unpublished Bug 8536204 - DISKMON CRASH AND RESTART TO CELL RESULTS IN NODE-LEVEL IMPLICIT FENCE
    - Fixed in 11.2


    This note describes another source of ossnet error messages and provides an analysis that may assist in researching this type of error message in the future.

    This case study is based on a real problem that Jaime diagnosed, leading the user to discover the source of the issue.
After ruling out other potential sources of this error, a review of the OSWatcher logs pointed to IO saturation as a possible underlying source of the ossnet messages.

Troubleshooting Steps

SR: 3-4874627731

From the ALERT.LOG

...
Tue Dec 06 17:49:06 2011
OSPID: 26595: connect: ossnet: connection failed to server 192.168.10.1, result=5 (login: sosstcpreadtry failed) (difftime=1953)
Tue Dec 06 17:49:06 2011
OSPID: 26590: connect: ossnet: connection failed to server 192.168.10.1, result=5 (login: sosstcpreadtry failed) (difftime=1953)
Tue Dec 06 17:49:06 2011
OSPID: 26588: connect: ossnet: connection failed to server 192.168.10.1, result=5 (login: sosstcpreadtry failed) (difftime=1953)
Tue Dec 06 18:00:07 2011
OSPID: 19278: connect: ossnet: connection failed to server 192.168.10.1, result=5 (login: sosstcpreadtry failed) (difftime=1955)
...
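
 Before digging into the cells, it can help to quantify how widespread these errors are. The short sketch below tallies the ossnet connection failures in a database alert.log by target cell IP; the log path and the exact message layout are assumptions based on the excerpt above, so adjust the regular expression if your messages differ.

#!/usr/bin/env python
# Hypothetical helper: count "ossnet: connection failed to server" entries in a
# database alert.log, grouped by the target cell IP.  The message format is
# assumed to match the excerpt above.
import re
import sys
from collections import Counter

# e.g. "OSPID: 26595: connect: ossnet: connection failed to server 192.168.10.1, result=5 ..."
OSSNET_RE = re.compile(r"ossnet: connection failed to server (\d+\.\d+\.\d+\.\d+)")

def count_failures(alert_log_path):
    per_cell = Counter()
    with open(alert_log_path) as log:
        for line in log:
            match = OSSNET_RE.search(line)
            if match:
                per_cell[match.group(1)] += 1
    return per_cell

if __name__ == "__main__":
    # Usage: python ossnet_count.py <path to alert.log>
    for cell_ip, count in count_failures(sys.argv[1]).most_common():
        print("%-16s %6d failures" % (cell_ip, count))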
 
 OSW iostat for Cells
 ------------------------
 
 Reviewed the OSWatcher data for at least 4 of the 6 cells. The iostat output shows a similar picture on each cell, including the devices sda and sdb:

Device rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
-----------------------------------------------------------------------------------------
sdc   86.40  0.00 107.60 84.40 207232.00 1511.40 1087.21  29.05  223.01 4.89  93.96
sdd   83.80  0.00 124.40  2.40 177922.80   67.20 1403.71  28.69  226.32 6.71  85.10
sde  102.40  0.00 103.80  0.20 210054.40    3.20 2019.78  44.16  416.77 9.62 100.04*
sdf   89.60  1.80 108.80  1.00 220131.20  350.40 2008.03 274.38 2854.84 9.11 100.04*
sdg  125.20 63.20 106.00  2.60 216614.40 1426.40 2007.74 210.20 1570.18 9.21 100.04*
sdh   52.00  0.00  56.20  0.40 114559.80    6.40 2024.14   7.70  139.08 9.04  51.14
sdi   85.80  0.60 103.40  2.80 206310.40   38.80 1943.02 132.58 1347.19 9.42 100.04*
sdj   84.20  0.00 100.80  0.20 203852.80    3.20 2018.38  30.71  429.58 9.17  92.66
sdk   89.40  0.00  99.60  1.00 189184.00  172.80 1882.27  16.05  163.55 8.83  88.84
sdl   93.60  0.00  85.60  4.40 174208.00   35.80 1936.04  18.14  185.35 8.56  77.04
sdm    0.00  0.00   0.00  0.00      0.00    0.00    0.00   0.00   0.00  0.00   0.00
...
...
...

 Points of interest:
  - The disk utilization of some disks is at ~100% -- the above chart is a sample showing the highest utilization.
  - The average request size is around 1 MB ((avgrq-sz) * 512 bytes, e.g. 2007.74 * 512 ≈ 1 MB).
  - The disks are handling around 120 IOPS.
 
Evaluating these numbers depends on the disk type: High Capacity (HC, 2 TB) or High Performance (HP, 650 GB).
 
    for HC: max ~148 IOPS per disk
    for HP: max ~297 IOPS per disk
 
 * In terms of throughput, the disks are delivering around 107 MB/s ((rsec/s) * 512 bytes; the arithmetic is sketched below).
 
 The expected throughput for these two disk types is:
 
    for HC: ~85 MB/sec
    for HP: ~152 MB/sec
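
 As a quick illustration of the arithmetic above, the following minimal sketch converts the iostat fields into per-device throughput and IOPS and flags anything running at or beyond the per-disk limits quoted in this note. The sample values are taken from the sdg line of the iostat output above.

# Minimal sketch of the throughput / IOPS arithmetic used in this note.
# The HC/HP limits are the per-disk figures quoted above; the sample row
# is taken from the sdg line of the iostat output.

SECTOR_BYTES = 512

LIMITS = {
    # disk type: (max IOPS, max MB/s) as quoted in this note
    "HC": (148, 85),
    "HP": (297, 152),
}

def evaluate(device, rsec, wsec, reads, writes, util, disk_type="HC"):
    max_iops, max_mbs = LIMITS[disk_type]
    # throughput: sectors/s * 512 bytes, converted to MB/s
    mb_per_sec = (rsec + wsec) * SECTOR_BYTES / (1024.0 * 1024.0)
    # IOPS: reads/s + writes/s
    iops = reads + writes
    print("%s: %.1f MB/s (limit ~%d), %.1f IOPS (limit ~%d), %.1f%% util"
          % (device, mb_per_sec, max_mbs, iops, max_iops, util))
    if mb_per_sec >= max_mbs or iops >= max_iops or util >= 95:
        print("  -> %s is at or beyond its expected %s-disk capacity" % (device, disk_type))

# Sample row from the output above: device, rsec/s, wsec/s, r/s, w/s, %util
evaluate("sdg", 216614.40, 1426.40, 106.00, 2.60, 100.04)

# Average request size: avgrq-sz (in sectors) * 512 bytes, e.g. 2007.74 * 512 ~= 1 MB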
  
Further investigation narrowed the source of the problem down to a few operations running at the time of the problem.
AWR and ASH reports helped identify two notable sources of high IO, both of which used high degrees of parallelism at the time of the problem(s) (one way to query ASH for this is sketched after the list):
 
   - Informatica load with 16-32 DOP
   - RMAN allocating 32 channels at the same time
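
 The AWR/ASH finding above can be cross-checked with a simple ASH query over the problem window. The sketch below is only illustrative: the connection details, the driver (python-oracledb), and the time window are assumptions (the window is taken from the alert.log timestamps above), and the same query can be run directly in SQL*Plus.

# Hypothetical sketch: find the SQL that accumulated the most "User I/O" samples
# in ASH during the problem window.  Connection details and the time window are
# assumptions; the query itself can also be run directly in SQL*Plus.
import datetime
import oracledb  # pip install oracledb

ASH_TOP_IO_SQL = """
    SELECT sql_id, COUNT(*) AS io_samples
      FROM v$active_session_history
     WHERE wait_class = 'User I/O'
       AND sample_time BETWEEN :start_time AND :end_time
     GROUP BY sql_id
     ORDER BY io_samples DESC
"""

def top_io_sql(conn, start_time, end_time, top_n=10):
    with conn.cursor() as cur:
        cur.execute(ASH_TOP_IO_SQL, start_time=start_time, end_time=end_time)
        return cur.fetchmany(top_n)

if __name__ == "__main__":
    # Thick mode may be required when connecting to older (11.2) databases.
    oracledb.init_oracle_client()
    conn = oracledb.connect(user="system", password="change_me", dsn="dbnode1/orcl")
    # Window around the ossnet errors shown in the alert.log excerpt above.
    start = datetime.datetime(2011, 12, 6, 17, 30)
    end = datetime.datetime(2011, 12, 6, 18, 30)
    for sql_id, io_samples in top_io_sql(conn, start, end):
        print(sql_id, io_samples)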
 
 IO saturation was suspected as the cause of the OSSNET error messages.
 Further research determined that the IO saturation was due to the Informatica load, which constantly referenced a large key lookup table.
 
 PROBLEM SOURCE: Excessive IO caused by poor / inefficient Informatica queries against a heavily used lookup table containing several million rows.
 - The heavy use of the lookup table led to the excessive IO.
 
 RESOLUTION: The poor performance and OSSNET messages were resolved by adding indexes to the lookup table, making the Informatica query less IO intensive and more efficient.

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.