
Sunday, August 7, 2022

How to: Resolve DRBD split-brain recovery manually


After split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in WFConnection (if the peer tore down the connection before the other node had a chance to detect split brain).
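A quick way to confirm this on each node is to query the connection state directly; a sketch, using the Ora_Exp resource from the examples below:

  # show the connection state of the resource on this node
  drbdadm cstate Ora_Exp
  # expected: StandAlone on at least one node, StandAlone or Connecting/WFConnection on the peer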

At this point, unless you configured DRBD to automatically recover from split brain, you must manually intervene by selecting one node whose modifications will be discarded (this node is referred to as the split brain victim). 
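For reference, automatic split-brain recovery policies live in the resource's net section; a minimal sketch (the policy values below are illustrative, not taken from this setup):

  # inside the resource definition, e.g. resource Ora_Exp { ... }
  net {
    after-sb-0pri discard-zero-changes;   # neither node was Primary when split brain happened
    after-sb-1pri discard-secondary;      # one node was Primary: drop the Secondary's changes
    after-sb-2pri disconnect;             # both were Primary: do not auto-resolve
  }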
The manual intervention is performed with the following commands:

Below is the implementation of the above technote.
Regular flow
On the secondary site:
drbdadm secondary Ora_Exp
drbdadm disconnect Ora_Exp
drbdadm -- --discard-my-data connect Ora_Exp

On the primary site:
drbdadm connect Ora_Exp
drbdadm status
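Because the same sequence has to be repeated for every split-brained resource (see the full example further below), the secondary-side part can be scripted; a minimal sketch, assuming the resource names used in this setup:

  #!/bin/bash
  # run on the secondary (victim) site only;
  # the matching "drbdadm connect" must still be run on the primary for each resource
  for res in Ora_Exp Ora_Online db1 db2; do
      drbdadm secondary "$res"
      drbdadm disconnect "$res"
      drbdadm -- --discard-my-data connect "$res"
  done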



If the above does not work, additionally run drbdadm invalidate on the secondary site, which forces a full resynchronization from the peer:
On the secondary site:
root@server-1b:~>% drbdadm secondary Ora_Exp
root@server-1b:~>% drbdadm disconnect Ora_Exp
root@server-1b:~>% drbdadm -- --discard-my-data connect Ora_Exp
root@server-1b:~>% drbdadm invalidate Ora_Exp


On the primary site:
root@server-1a:~>% drbdadm status

root@server-1a:~>% drbdadm connect Ora_Exp



To see progress and the log, check:
/sys/kernel/debug/drbd/resources/<resource_name>/connections/<server_name>/0/proc_drbd
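For example, the resync can be followed there with watch (substitute the actual resource and peer names; this assumes debugfs is mounted):

  watch cat /sys/kernel/debug/drbd/resources/Ora_Exp/connections/server-1b/0/proc_drbd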

See the progress:
root@server-1a:~>% drbdadm status
Ora_Exp role:Primary
  disk:UpToDate
  server-1b role:Secondary
    peer-disk:UpToDate

Ora_Online role:Primary
  disk:UpToDate
  server-1b role:Secondary
    peer-disk:UpToDate

db1 role:Primary
  disk:UpToDate
  server-1b role:Secondary congested:yes ap-in-flight:96 rs-in-flight:14336
    replication:SyncSource peer-disk:Inconsistent done:81.03

db2 role:Primary
  disk:UpToDate
  server-1b role:Secondary congested:yes ap-in-flight:32 rs-in-flight:14336
    replication:SyncSource peer-disk:Inconsistent done:86.60


In this example, Ora_Exp and Ora_Online are already synced.
db1 and db2 are in the process of syncing.
The numbers 81.03 and 86.60 are the percentage of the disk that has been synced.
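To pull only the resources that are still resyncing out of the status output, something like this can be used:

  # resources without a "done:" line are fully synced
  drbdadm status | grep done:
  # or keep refreshing until nothing is left to sync
  watch -n 5 'drbdadm status | grep done:'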

Once the percentage reaches 100%, the two sites are in sync.
On site A:
Ora_Exp role:Primary
  disk:UpToDate
  server-1b role:Secondary
    peer-disk:UpToDate

On site B:
Ora_Exp role:Secondary
  disk:UpToDate
  server-1a role:Primary
    peer-disk:UpToDate



Example:

Ora_Exp
  On the secondary site:
    drbdadm secondary Ora_Exp
    drbdadm disconnect Ora_Exp
    drbdadm -- --discard-my-data connect Ora_Exp
  On the primary site:
    drbdadm connect Ora_Exp
    drbdadm status

Ora_Online
  On the secondary site:
    drbdadm secondary Ora_Online
    drbdadm disconnect Ora_Online
    drbdadm -- --discard-my-data connect Ora_Online
  On the primary site:
    drbdadm connect Ora_Online
    drbdadm status

db1
  On the secondary site:
    drbdadm secondary db1
    drbdadm disconnect db1
    drbdadm -- --discard-my-data connect db1
  On the primary site:
    drbdadm connect db1
    drbdadm status

db2
  On the secondary site:
    drbdadm secondary db2
    drbdadm disconnect db2
    drbdadm -- --discard-my-data connect db2
  On the primary site:
    drbdadm connect db2
    drbdadm status

Finally, on both sites:
  drbdadm status





How to: Make a node Primary
On the to-be-Primary site:
root@server-1a:~>% drbdadm status
ogg role:Secondary
  disk:UpToDate
  server-1b connection:Connecting
root@server-1a:~>% drbdadm primary ogg
root@server-1a:~>% drbdadm disconnect ogg
root@server-1a:~>% drbdadm connect ogg
root@server-1a:~>% drbdadm status
ogg role:Primary
  disk:UpToDate
  server-1b connection:Connecting


drbdadm primary
Promotes the resource's device into the primary role. You need to do this before any access to the device, such as creating or mounting a file system.

drbdadm secondary
Brings the device back into the secondary role.
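For example, promoting a resource and then mounting its backing device might look like this (the device node and mount point are assumptions for this sketch; check the actual minor in the resource configuration):

  drbdadm primary ogg
  mount /dev/drbd4 /mnt/ogg
  # when finished, unmount before demoting back to Secondary
  umount /mnt/ogg
  drbdadm secondary ogg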

Reference
https://manpages.ubuntu.com/manpages/xenial/en/man8/drbdadm.8.html


=================
Correct status
=================
root@SRV901G:~>% drbdadm  status
Ora_Exp role:Primary
  disk:UpToDate
  SRV902G role:Secondary
    peer-disk:UpToDate

Ora_Online role:Primary
  disk:UpToDate
  SRV902G role:Secondary
    peer-disk:UpToDate

db1 role:Primary
  disk:UpToDate
  SRV902G role:Secondary
    peer-disk:UpToDate

db2 role:Primary
  disk:UpToDate
  SRV902G role:Secondary
    peer-disk:UpToDate

root@SRV902G:~>% drbdadm  status
Ora_Exp role:Secondary
  disk:UpToDate
  SRV901G role:Primary
    peer-disk:UpToDate

Ora_Online role:Secondary
  disk:UpToDate
  SRV901G role:Primary
    peer-disk:UpToDate

db1 role:Secondary
  disk:UpToDate
  SRV901G role:Primary
    peer-disk:UpToDate

db2 role:Secondary
  disk:UpToDate
  SRV901G role:Primary
    peer-disk:UpToDate

=================
Not Correct Status
=================
root@SRVDBD901G:~>% drbdadm status
Ora_Exp role:Primary
  disk:UpToDate
  SRVDBD902G connection:StandAlone

Ora_Online role:Primary
  disk:UpToDate
  SRVDBD902G connection:StandAlone

db1 role:Primary
  disk:UpToDate
  SRVDBD902G connection:StandAlone

db2 role:Primary
  disk:UpToDate
  SRVDBD902G connection:StandAlone


root@SRVDBD902G:~>%  drbdadm status
Ora_Exp role:Secondary
  disk:UpToDate
  SRVDBD901G connection:StandAlone

Ora_Online role:Secondary
  disk:UpToDate
  SRVDBD901G connection:StandAlone

db1 role:Secondary
  disk:UpToDate
  SRVDBD901G connection:StandAlone

db2 role:Secondary
  disk:UpToDate
  SRVDBD901G connection:StandAlone




If the connection between the sites cannot be restored, the DRBD metadata needs to be rebuilt.
First, check /var/log/messages.
Possible error:
Jun  5 14:42:05 my_host kernel: drbd Ora_Exp my_host: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jun  5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4 my_host: drbd_sync_handshake:
Jun  5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4 my_host: self A1AFF6AACC929146:4740BC40E4819C9C:54BF4A72763A34A6:0CA1C9B60B25B87E bits:8184550 flags:122
Jun  5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4 my_host: peer D3401D4CCA4EA0C8:54BF4A72763A34A6:4740BC40E4819C9C:F0CF106B5ED25C7C bits:676 flags:120
Jun  5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4 my_host: uuid_compare()=unrelated-data by rule 100
Jun  5 14:42:05 my_host kernel: drbd Ora_Exp/0 drbd4: Unrelated data, aborting!

The reason for the issue is that during the split-brain situation the data diverged to the point where the peers no longer recognize each other's generation identifiers.
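To see the generation identifiers that no longer match, they can be printed on both nodes and compared; a sketch (the exact syntax may differ slightly between DRBD versions):

  # print the data generation identifiers (GIs) of the resource on this node
  drbdadm show-gi Ora_Exp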

The solution was to recreate the metadata and resync the data.

Run the following commands:

On the Secondary node:
drbdadm down Ora_Exp
drbdadm wipe-md Ora_Exp
drbdadm create-md Ora_Exp
drbdadm up Ora_Exp
drbdadm disconnect Ora_Exp
drbdadm -- --discard-my-data connect Ora_Exp

Right after the above, execute on the Primary node:
drbdadm connect Ora_Exp

Then check the status on both nodes:
drbdadm status
