@liangxin1300 commented Oct 22, 2025

Problem

#1744 leverages maintenance mode when the cluster needs to be restarted, but there are still some problems when resources are running:

The configuration is changed before the hint is shown, which might leave the cluster in an inconsistent state

# crm sbd configure watchdog-timeout=45
INFO: No 'msgwait-timeout=' specified in the command, use 2*watchdog timeout: 90
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda5
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 131
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: "stonith-timeout" in crm_config is set to 119, it was 71
INFO: Sync directory /etc/systemd/system/sbd.service.d to sle16-2
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

# crm sbd purge 
INFO: Stop sbd resource 'stonith-sbd'(stonith:fence_sbd)
INFO: Remove sbd resource 'stonith-sbd'
INFO: Disable sbd.service on node sle16-1
INFO: Disable sbd.service on node sle16-2
INFO: Move /etc/sysconfig/sbd to /etc/sysconfig/sbd.bak on all nodes
INFO: Delete cluster property "stonith-timeout" in crm_config
INFO: Delete cluster property "priority-fencing-delay" in crm_config
WARNING: "stonith-enabled" in crm_config is set to false, it was true
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

Pacemaker exits fatally when adding diskless SBD to a running cluster with active resources

 # crm cluster init sbd -S -y
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: Loading "knet-default" profile from /etc/crm/profiles.yml
INFO: Configuring diskless SBD
WARNING: Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for 2-node cluster, should be combined with QDevice.
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 15
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Enable sbd.service on node sle16-1
INFO: Enable sbd.service on node sle16-2
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
WARNING: "stonith-watchdog-timeout" in crm_config is set to 30, it was 0

Broadcast message from systemd-journald@sle16-1 (Thu 2025-10-23 10:54:11 CEST):

pacemaker-controld[5674]:  emerg: Shutting down: stonith-watchdog-timeout configured (30) but SBD not active


Message from syslogd@sle16-1 at Oct 23 10:54:11 ...
 pacemaker-controld[5674]:  emerg: Shutting down: stonith-watchdog-timeout configured (30) but SBD not active
ERROR: cluster.init: Failed to run 'crm configure property stonith-watchdog-timeout=30': ERROR: Failed to run 'crm_mon -1rR': crm_mon: Connection to cluster failed: Connection refused

Solution

  • Drop the function restart_cluster_if_possible
  • Introduce a new function utils.able_to_restart_cluster to check whether the cluster can be restarted, and call it before changing any configuration
  • Leverage maintenance mode in the sbd device remove and sbd purge commands as well (see the sketch below)
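
The following is a minimal sketch of that flow, assuming hypothetical helper names; only utils.able_to_restart_cluster comes from this PR, and its signature and body here are guesses:

import subprocess

def any_resource_running() -> bool:
    """Rough illustration: ask crm_mon whether any resource is started.
    (crmsh inspects the CIB directly; this is only for the sketch.)"""
    out = subprocess.run(["crm_mon", "-1"], capture_output=True, text=True).stdout
    return "Started" in out

def able_to_restart_cluster(force: bool = False) -> bool:
    """The cluster may be restarted when no resources are running, or when
    the user explicitly accepted maintenance mode via -F/--force."""
    return force or not any_resource_running()

def change_sbd_configuration(apply_changes, restart_cluster, force: bool = False) -> None:
    """Run the check *before* touching anything, so a refusal leaves
    /etc/sysconfig/sbd and crm_config untouched."""
    if not able_to_restart_cluster(force):
        print("WARNING: Please stop all running resources and try again")
        print("WARNING: Or run this command with -F/--force option to leverage maintenance mode")
        return
    apply_changes()      # e.g. update and sync /etc/sysconfig/sbd, set crm_config properties
    restart_cluster()    # restart cluster services, under maintenance mode when forced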

Add sbd via the sbd stage while resources are running

 # crm cluster init sbd -S -y
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: Loading "knet-default" profile from /etc/crm/profiles.yml
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
INFO: Done (log saved to /var/log/crmsh/crmsh.log on sle16-1)

# Leverage maintenance mode
# crm -F cluster init sbd -S -y
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: Loading "knet-default" profile from /etc/crm/profiles.yml
INFO: Set cluster to maintenance mode
WARNING: "maintenance-mode" in crm_config is set to true, it was false
INFO: Configuring diskless SBD
WARNING: Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for 2-node cluster, should be combined with QDevice.
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 15
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Enable sbd.service on node sle16-1
INFO: Enable sbd.service on node sle16-2
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
...........
INFO: END Waiting for cluster
WARNING: "stonith-watchdog-timeout" in crm_config is set to 30, it was 0
WARNING: "stonith-enabled" in crm_config is set to true, it was false
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 41
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: "stonith-timeout" in crm_config is set to 71, it was 60s
INFO: Set cluster from maintenance mode to normal
INFO: Delete cluster property "maintenance-mode" in crm_config
INFO: Done (log saved to /var/log/crmsh/crmsh.log on sle16-1)
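
For illustration only, the maintenance-mode flow shown above can be sketched as a context manager; this is an assumption about the shape of the logic, not the actual crmsh implementation, and the helpers in the usage comment are hypothetical:

import contextlib
import subprocess

def set_crm_property(name: str, value: str | None) -> None:
    """Set or delete a cluster property in crm_config via crm_attribute."""
    cmd = ["crm_attribute", "--type", "crm_config", "--name", name]
    cmd += ["--delete"] if value is None else ["--update", value]
    subprocess.run(cmd, check=True)

@contextlib.contextmanager
def maintenance_mode():
    set_crm_property("maintenance-mode", "true")
    try:
        yield
    finally:
        # Back to normal: delete the property rather than setting it to false,
        # matching the "Delete cluster property" line in the output above.
        set_crm_property("maintenance-mode", None)

# Usage: wrap the configuration change and the cluster restart, so running
# resources are left alone while the cluster services are bounced.
# with maintenance_mode():
#     update_sbd_sysconfig()        # hypothetical helper
#     restart_cluster_and_wait()    # hypothetical helper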

Purge sbd while resources are running

# crm sbd purge 
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

Add device

 # crm sbd device add /dev/sda6
INFO: Configured sbd devices: /dev/sda5
INFO: Append devices: /dev/sda6
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

Remove device

# crm sbd device remove /dev/sda6
INFO: Configured sbd devices: /dev/sda5;/dev/sda6
INFO: Remove devices: /dev/sda6
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

Configure sbd while DLM is running

# crm sbd configure watchdog-timeout=40
INFO: No 'msgwait-timeout=' specified in the command, use 2*watchdog timeout: 80
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

# Leverage maintenance mode
# crm -F sbd configure watchdog-timeout=40
INFO: No 'msgwait-timeout=' specified in the command, use 2*watchdog timeout: 80
INFO: Set cluster to maintenance mode
WARNING: "maintenance-mode" in crm_config is set to true, it was false
WARNING: Please stop DLM related resources (gfs2-clone) and try again
INFO: Set cluster from maintenance mode to normal
INFO: Delete cluster property "maintenance-mode" in crm_config
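
As a rough illustration of the DLM guard above (the agent list and the input format are assumptions, not the real crmsh API):

# Even with --force, changing watchdog/msgwait timeouts is refused while
# DLM-related resources (such as the gfs2-clone above) are running, because
# they cannot safely survive an sbd/cluster restart.
DLM_RELATED_AGENTS = ("ocf:pacemaker:controld", "ocf:heartbeat:Filesystem")

def check_no_dlm_running(running_resources: dict[str, str]) -> None:
    """running_resources maps resource id -> agent,
    e.g. {"gfs2-clone": "ocf:heartbeat:Filesystem"} (sketch input only)."""
    dlm_like = [rid for rid, agent in running_resources.items()
                if agent in DLM_RELATED_AGENTS]
    if dlm_like:
        raise RuntimeError(
            f"Please stop DLM related resources ({', '.join(dlm_like)}) and try again")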

@liangxin1300 force-pushed the 20251022_improve_leverage_maintenance_mode branch 3 times, most recently from 6fe9092 to e17caa0 on October 23, 2025 06:21
@liangxin1300 changed the title from "Dev: sbd: Improve leverage maintenance mode" to "Dev: sbd: Improve the process of leveraging maintenance mode" on Oct 23, 2025
- Drop the function `restart_cluster_if_possible`
- Introduced a new function `utils.able_to_restart_cluster` to check if
  the cluster can be restarted. Call it before changing any configurations.
- Add leverage maintenance mode in `sbd device remove` and `sbd purge` commands
@liangxin1300 force-pushed the 20251022_improve_leverage_maintenance_mode branch from e17caa0 to ca17414 on October 23, 2025 13:30

codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 41.02564% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.73%. Comparing base (2b481a7) to head (ca17414).

Files with missing lines   Patch %   Lines
crmsh/utils.py             18.75%    13 Missing ⚠️
crmsh/ui_sbd.py            62.50%    6 Missing ⚠️
crmsh/sbd.py               50.00%    2 Missing ⚠️
crmsh/xmlutil.py           33.33%    2 Missing ⚠️
Additional details and impacted files
Flag          Coverage Δ
integration   55.17% <5.12%> (-0.03%) ⬇️
unit          52.91% <41.02%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines   Coverage Δ
crmsh/sbd.py               86.20% <50.00%> (-0.31%) ⬇️
crmsh/xmlutil.py           70.24% <33.33%> (-0.12%) ⬇️
crmsh/ui_sbd.py            84.88% <62.50%> (+0.12%) ⬆️
crmsh/utils.py             67.17% <18.75%> (-0.39%) ⬇️
