How to: Recover from DRBD split brain manually
After split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in WFConnection (if the peer tore down the connection before the other node had a chance to detect split brain).
At this point, unless you configured DRBD to automatically recover from split brain, you must manually intervene by selecting one node whose modifications will be discarded (this node is referred to as the split brain victim).
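To confirm the state before intervening, check the connection state on each node (the resource name Ora_Exp is taken from the examples below):

drbdadm cstate Ora_Exp     # prints the connection state, e.g. StandAlone or Connecting
drbdadm status Ora_Exp     # full status of the resource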
This intervention is made with the following commands:
Below is the implementation of the above technote.

Regular flow

On the secondary site:
drbdadm secondary Ora_Exp
drbdadm disconnect Ora_Exp
drbdadm -- --discard-my-data connect Ora_Exp
On the primary site:
drbdadm connect Ora_Exp
drbdadm status
If the above does not work

On the secondary site:
root@server-1b:~>% drbdadm secondary Ora_Exp
root@server-1b:~>% drbdadm disconnect Ora_Exp
root@server-1b:~>% drbdadm -- --discard-my-data connect Ora_Exp
root@server-1b:~>% drbdadm invalidate Ora_Exp
On the primary site:
root@server-1a:~>% drbdadm status
root@server-1a:~>% drbdadm connect Ora_Exp
See progress and log:
/sys/kernel/debug/drbd/resources/<resource_name>/connections/<server_name>/0/proc_drbd/
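To follow the resync continuously, the status command can simply be polled; the 5-second interval below is an arbitrary choice:

watch -n5 'drbdadm status'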
See progress
root@server-1a:~>% drbdadm status
Ora_Exp role:Primary
  disk:UpToDate
  server-1b role:Secondary
    peer-disk:UpToDate

Ora_Online role:Primary
  disk:UpToDate
  server-1b role:Secondary
    peer-disk:UpToDate

db1 role:Primary
  disk:UpToDate
  server-1b role:Secondary congested:yes ap-in-flight:96 rs-in-flight:14336
    replication:SyncSource peer-disk:Inconsistent done:81.03

db2 role:Primary
  disk:UpToDate
  server-1b role:Secondary congested:yes ap-in-flight:32 rs-in-flight:14336
    replication:SyncSource peer-disk:Inconsistent done:86.60

In this example, Ora_Exp and Ora_Online were already synced.
db1 and db2 are in the process of syncing.
The numbers 81.03 and 86.60 are the percentage of the disk that has been synced.
Once the percentage reaches 100%, the two sites are in sync.
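If only the completion percentage is of interest, it can be filtered out of the status output; note that the done: field is shown only while a resync is in progress:

drbdadm status db1 | grep -o 'done:[0-9.]*'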
On site A:

Ora_Exp role:Primary
  disk:UpToDate
  server-1b role:Secondary
    peer-disk:UpToDate

On site B:

Ora_Exp role:Secondary
  disk:UpToDate
  server-1a role:Primary
    peer-disk:UpToDate
Example:

Ora_Exp
Commands on the secondary site:
drbdadm secondary Ora_Exp
drbdadm disconnect Ora_Exp
drbdadm -- --discard-my-data connect Ora_Exp
Commands on the primary site:
drbdadm connect Ora_Exp
drbdadm status

Ora_Online
Commands on the secondary site:
drbdadm secondary Ora_Online
drbdadm disconnect Ora_Online
drbdadm -- --discard-my-data connect Ora_Online
Commands on the primary site:
drbdadm connect Ora_Online
drbdadm status

db1
Commands on the secondary site:
drbdadm secondary db1
drbdadm disconnect db1
drbdadm -- --discard-my-data connect db1
Commands on the primary site:
drbdadm connect db1
drbdadm status

db2
Commands on the secondary site:
drbdadm secondary db2
drbdadm disconnect db2
drbdadm -- --discard-my-data connect db2
Commands on the primary site:
drbdadm connect db2
drbdadm status
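The same per-resource sequence can also be scripted over all four resources. This is only a sketch; in practice, verify each resource's state before and after:

On the secondary site:
for res in Ora_Exp Ora_Online db1 db2; do
    drbdadm secondary "$res"
    drbdadm disconnect "$res"
    drbdadm -- --discard-my-data connect "$res"
done

On the primary site:
for res in Ora_Exp Ora_Online db1 db2; do
    drbdadm connect "$res"
done
drbdadm status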
How to: Make a node Primary
On the site that is to become Primary:
root@server-1a:~>% drbdadm status
ogg role:Secondary
  disk:UpToDate
  server-1b connection:Connecting
root@server-1a:~>% drbdadm primary ogg
root@server-1a:~>% drbdadm disconnect ogg
root@server-1a:~>% drbdadm connect ogg
root@server-1a:~>% drbdadm status
ogg role:Primary
  disk:UpToDate
  server-1b connection:Connecting
drbdadm primary
Promotes the resource's device into the primary role.
You need to do this before any access to the device, such as creating or mounting a file system.

drbdadm secondary
Brings the device back into the secondary role.
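A typical promote-and-mount sequence looks like the sketch below; the device path /dev/drbd0 and the mount point are assumptions, so check the resource definition for the actual device:

drbdadm primary ogg
mount /dev/drbd0 /mnt/ogg      # device and mount point are examples
# ... use the file system ...
umount /mnt/ogg
drbdadm secondary ogg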
Reference
https://manpages.ubuntu.com/manpages/xenial/en/man8/drbdadm.8.html
=================
Correct status
=================
root@SRV901G:~>% drbdadm status
Ora_Exp role:Primary
  disk:UpToDate
  SRV902G role:Secondary
    peer-disk:UpToDate

Ora_Online role:Primary
  disk:UpToDate
  SRV902G role:Secondary
    peer-disk:UpToDate

db1 role:Primary
  disk:UpToDate
  SRV902G role:Secondary
    peer-disk:UpToDate

db2 role:Primary
  disk:UpToDate
  SRV902G role:Secondary
    peer-disk:UpToDate

root@SRV902G:~>% drbdadm status
Ora_Exp role:Secondary
  disk:UpToDate
  SRV901G role:Primary
    peer-disk:UpToDate

Ora_Online role:Secondary
  disk:UpToDate
  SRV901G role:Primary
    peer-disk:UpToDate

db1 role:Secondary
  disk:UpToDate
  SRV901G role:Primary
    peer-disk:UpToDate

db2 role:Secondary
  disk:UpToDate
  SRV901G role:Primary
    peer-disk:UpToDate
=================
Not Correct Status
=================
root@SRVDBD901G:~>% drbdadm status
Ora_Exp role:Primary
  disk:UpToDate
  SRVDBD902G connection:StandAlone

Ora_Online role:Primary
  disk:UpToDate
  SRVDBD902G connection:StandAlone

db1 role:Primary
  disk:UpToDate
  SRVDBD902G connection:StandAlone

db2 role:Primary
  disk:UpToDate
  SRVDBD902G connection:StandAlone

root@SRVDBD902G:~>% drbdadm status
Ora_Exp role:Secondary
  disk:UpToDate
  SRVDBD901G connection:StandAlone

Ora_Online role:Secondary
  disk:UpToDate
  SRVDBD901G connection:StandAlone

db1 role:Secondary
  disk:UpToDate
  SRVDBD901G connection:StandAlone

db2 role:Secondary
  disk:UpToDate
  SRVDBD901G connection:StandAlone
If the connection between the sites cannot be restored, the DRBD metadata needs to be rebuilt.
1. Check /var/log/messages.
Possible error:
Jun 5 14:42:05 my_host kernel: drbd Ora_Exp my_host: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jun 5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4 my_host: drbd_sync_handshake:
Jun 5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4 my_host: self A1AFF6AACC929146:4740BC40E4819C9C:54BF4A72763A34A6:0CA1C9B60B25B87E bits:8184550 flags:122
Jun 5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4 my_host: peer D3401D4CCA4EA0C8:54BF4A72763A34A6:4740BC40E4819C9C:F0CF106B5ED25C7C bits:676 flags:120
Jun 5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4 my_host: uuid_compare()=unrelated-data by rule 100
Jun 5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4: Unrelated data, aborting!
The cause of the issue is that during the split-brain situation the data diverged to the point where the peers no longer recognize each other's generation identifiers.
The solution is to recreate the metadata and resynchronize the data.
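The generation identifiers can be inspected on each node to confirm the mismatch (Ora_Exp assumed, as in the log above; on DRBD 9 the command may require a peer specification, see the drbdadm man page):

drbdadm show-gi Ora_Exp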
Run the following commands:
On Secondary node:
drbdadm down Ora_Exp
drbdadm wipe-md Ora_Exp
drbdadm create-md Ora_Exp
drbdadm up Ora_Exp
drbdadm disconnect Ora_Exp
drbdadm -- --discard-my-data connect Ora_Exp
Right after the above, execute on the Primary node:
drbdadm connect Ora_Exp
drbdadm status
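Recreating the metadata clears the sync bitmap, so DRBD performs a full resync of the device from the primary. Progress can be watched until peer-disk reports UpToDate:

watch -n10 'drbdadm status Ora_Exp'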