Asset ID: 1451919.1
Update Date: 2012-05-03
Solution Type: Problem Resolution Sure Solution
Doc ID: 1451919.1
Sun Storage 7000 Unified Storage System: Replication source node gives error "ak_notification_wait failed: remote side doesn't recognize notification cookie"
Related Items |
- Sun Storage 7310 Unified Storage System
- Sun Storage 7410 Unified Storage System
- Sun ZFS Storage 7120
- Sun ZFS Storage 7420
- Sun Storage 7110 Unified Storage System
- Sun ZFS Storage 7320
- Sun Storage 7210 Unified Storage System
Related Categories |
- PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
Created from <SR 3-5332907001>
Applies to:
Sun Storage 7210 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7310 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7410 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7320 - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.
When in a cluster configuration, replication can be sensitive to changes in resource assignments.
Symptoms
Replication is configured from a source appliance to a target appliance.
The source reports the error "ak_notification_wait failed: remote side doesn't recognize notification cookie".
On the target there is no error and replication seems to work fine.
Changes
The issue started after unconfiguring and reconfiguring the cluster.
Cause
In this customer case, there had been a cluster "unconfigure". As a result, resources owned by "head 1" were transferred to "head 2" when "head 1" was unconfigured out of the cluster (as part of a motherboard replacement).
Replication uses the notification subsystem to:
- know when a replication update is finished -- the target calls back to the source to say "replication is done", because replication is actually driven by the target.
- recover if the target reboots -- in that case the callback never happens, so the source must be able to check with the target.
This means that when a new replication update starts, the source contacts the target and passes a "notification cookie" across.
When the target has finished, it calls back to the source and uses the cookie to identify which replication has finished.
If the target never calls back, the source eventually times out and calls the target to ask "do you still have the cookie?"
When calling back to the IP address of the replication interface on the source, the target performs an additional validation: it checks that the source is who it expects. However, because ownership of the interface has changed, this check fails and the target throws the cookie away.
At this point replication has finished successfully, but the target has not been able to notify the source. The source then goes through the timeout cycle and calls the target, only to find the cookie is no longer there.
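The cookie lifecycle described above can be illustrated with a small simulation (all class and variable names are hypothetical; this is not the appliance's actual code):

```python
class Target:
    """Toy model of the replication target's notification-cookie handling."""

    def __init__(self, expected_source_ip):
        # The head the target believes owns the replication interface.
        self.expected_source_ip = expected_source_ip
        self.cookies = set()

    def start_update(self, cookie):
        # The source hands the target a notification cookie with each update.
        self.cookies.add(cookie)

    def notify_source(self, cookie, source_ip):
        # On completion the target calls back to the source; it first checks
        # that the replication interface still belongs to the expected head.
        if source_ip != self.expected_source_ip:
            self.cookies.discard(cookie)  # validation fails: cookie thrown away
            return False
        self.cookies.discard(cookie)
        return True

    def has_cookie(self, cookie):
        # The source's timeout path: "do you still have the cookie?"
        return cookie in self.cookies


target = Target(expected_source_ip="head1")
target.start_update(cookie="repl-42")

# The cluster unconfigure moved the interface to the other head, so the
# callback validation fails and the cookie is discarded:
ok = target.notify_source("repl-42", source_ip="head2")
print(ok)                            # False: callback validation failed
print(target.has_cookie("repl-42"))  # False: the source's later query finds
                                     # "remote side doesn't recognize notification cookie"
```

The replication data itself transfers successfully; only the completion notification is lost, which matches the symptom of an error on the source while the target looks healthy.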
Solution
On the replication source side (cluster configuration), transfer the ownership back:
- In the BUI, go to the cluster screen: Configuration -> CLUSTER.
- Select the resources whose ownership needs to change. For example (adjust to your own configuration):
net/aggr1
net/aggr2
zfs/pool-0
- Once you have reassigned these resources to the other cluster node, hit APPLY. A dialog box asks whether you want to fail back; hit "APPLY" again. This transfers the ownership of these resources to the other cluster node. From then on, replication notification should work correctly.
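The same operation can also be driven from the appliance CLI. The transcript below is a hedged sketch only (prompts abbreviated; verify the exact commands against your software release before use):

```
zfssa:> configuration cluster
zfssa:configuration cluster> show      (lists cluster state, resources and their current owners)
zfssa:configuration cluster> failback  (returns resources to their assigned owner on the peer)
```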
References
@ <BUG:7121594> - UNABLE TO COLLECT EXTRA FILES
Attachments
This solution has no attachment