@liangxin1300 commented Oct 22, 2025

Problem

#1744 leverages maintenance mode when the cluster needs to be restarted, but there are still some problems when resources are running:

The configuration is changed before the hint is shown, which might leave the cluster in an inconsistent state

# crm sbd configure watchdog-timeout=45
INFO: No 'msgwait-timeout=' specified in the command, use 2*watchdog timeout: 90
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda5
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 131
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: "stonith-timeout" in crm_config is set to 119, it was 71
INFO: Sync directory /etc/systemd/system/sbd.service.d to sle16-2
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

# crm sbd purge 
INFO: Stop sbd resource 'stonith-sbd'(stonith:fence_sbd)
INFO: Remove sbd resource 'stonith-sbd'
INFO: Disable sbd.service on node sle16-1
INFO: Disable sbd.service on node sle16-2
INFO: Move /etc/sysconfig/sbd to /etc/sysconfig/sbd.bak on all nodes
INFO: Delete cluster property "stonith-timeout" in crm_config
INFO: Delete cluster property "priority-fencing-delay" in crm_config
WARNING: "stonith-enabled" in crm_config is set to false, it was true
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

Pacemaker exits fatally when adding diskless SBD to a running cluster with active resources

 # crm cluster init sbd -S -y
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: Loading "knet-default" profile from /etc/crm/profiles.yml
INFO: Configuring diskless SBD
WARNING: Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for 2-node cluster, should be combined with QDevice.
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 15
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Enable sbd.service on node sle16-1
INFO: Enable sbd.service on node sle16-2
WARNING: Resource is running, need to restart cluster service manually on each node
WARNING: Or, run with `crm -F` or `--force` option, the `sbd` subcommand will leverage maintenance mode for any changes that require restarting sbd.service
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
WARNING: "stonith-watchdog-timeout" in crm_config is set to 30, it was 0

Broadcast message from systemd-journald@sle16-1 (Thu 2025-10-23 10:54:11 CEST):

pacemaker-controld[5674]:  emerg: Shutting down: stonith-watchdog-timeout configured (30) but SBD not active


Message from syslogd@sle16-1 at Oct 23 10:54:11 ...
 pacemaker-controld[5674]:  emerg: Shutting down: stonith-watchdog-timeout configured (30) but SBD not active
ERROR: cluster.init: Failed to run 'crm configure property stonith-watchdog-timeout=30': ERROR: Failed to run 'crm_mon -1rR': crm_mon: Connection to cluster failed: Connection refused

Solution

  • Drop the function restart_cluster_if_possible
  • Introduce a new function utils.able_to_restart_cluster to check whether the cluster can be restarted, and call it before changing any configuration
  • Leverage maintenance mode in the sbd device remove and sbd purge commands as well (see the sketch below)
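
The following is a minimal sketch of that flow, assuming hypothetical helper names; only utils.able_to_restart_cluster comes from this PR, and its signature and body here are guesses:

import subprocess

def any_resource_running() -> bool:
    """Rough illustration: ask crm_mon whether any resource is started.
    (crmsh inspects the CIB directly; this is only for the sketch.)"""
    out = subprocess.run(["crm_mon", "-1"], capture_output=True, text=True).stdout
    return "Started" in out

def able_to_restart_cluster(force: bool = False) -> bool:
    """The cluster may be restarted when no resources are running, or when
    the user explicitly accepted maintenance mode via -F/--force."""
    return force or not any_resource_running()

def change_sbd_configuration(apply_changes, restart_cluster, force: bool = False) -> None:
    """Run the check *before* touching anything, so a refusal leaves
    /etc/sysconfig/sbd and crm_config untouched."""
    if not able_to_restart_cluster(force):
        print("WARNING: Please stop all running resources and try again")
        print("WARNING: Or run this command with -F/--force option to leverage maintenance mode")
        return
    apply_changes()      # e.g. update and sync /etc/sysconfig/sbd, set crm_config properties
    restart_cluster()    # restart cluster services, under maintenance mode when forced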

Add sbd via the sbd stage while resources are running

 # crm cluster init sbd -S -y
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: Loading "knet-default" profile from /etc/crm/profiles.yml
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting
INFO: Done (log saved to /var/log/crmsh/crmsh.log on sle16-1)

# Leverage maintenance mode
# crm -F cluster init sbd -S -y
INFO: Loading "default" profile from /etc/crm/profiles.yml
INFO: Loading "knet-default" profile from /etc/crm/profiles.yml
INFO: Set cluster to maintenance mode
WARNING: "maintenance-mode" in crm_config is set to true, it was false
INFO: Configuring diskless SBD
WARNING: Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for 2-node cluster, should be combined with QDevice.
INFO: Update SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd: 15
INFO: Update SBD_WATCHDOG_DEV in /etc/sysconfig/sbd: /dev/watchdog0
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
INFO: Enable sbd.service on node sle16-1
INFO: Enable sbd.service on node sle16-2
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
...........
INFO: END Waiting for cluster
WARNING: "stonith-watchdog-timeout" in crm_config is set to 30, it was 0
WARNING: "stonith-enabled" in crm_config is set to true, it was false
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 41
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Already synced /etc/sysconfig/sbd to all nodes
WARNING: "stonith-timeout" in crm_config is set to 71, it was 60s
INFO: Set cluster from maintenance mode to normal
INFO: Delete cluster property "maintenance-mode" in crm_config
INFO: Done (log saved to /var/log/crmsh/crmsh.log on sle16-1)
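
For illustration only, the maintenance-mode flow shown above can be sketched as a context manager; this is an assumption about the shape of the logic, not the actual crmsh implementation, and the helpers in the usage comment are hypothetical:

import contextlib
import subprocess

def set_crm_property(name: str, value: str | None) -> None:
    """Set or delete a cluster property in crm_config via crm_attribute."""
    cmd = ["crm_attribute", "--type", "crm_config", "--name", name]
    cmd += ["--delete"] if value is None else ["--update", value]
    subprocess.run(cmd, check=True)

@contextlib.contextmanager
def maintenance_mode():
    set_crm_property("maintenance-mode", "true")
    try:
        yield
    finally:
        # Back to normal: delete the property rather than setting it to false,
        # matching the "Delete cluster property" line in the output above.
        set_crm_property("maintenance-mode", None)

# Usage: wrap the configuration change and the cluster restart, so running
# resources are left alone while the cluster services are bounced.
# with maintenance_mode():
#     update_sbd_sysconfig()        # hypothetical helper
#     restart_cluster_and_wait()    # hypothetical helper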

Purge sbd while resources are running

# crm sbd purge 
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

Add device

 # crm sbd device add /dev/sda6
INFO: Configured sbd devices: /dev/sda5
INFO: Append devices: /dev/sda6
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

Remove device

# crm sbd device remove /dev/sda6
INFO: Configured sbd devices: /dev/sda5;/dev/sda6
INFO: Remove devices: /dev/sda6
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

Configure sbd while DLM is running

# crm sbd configure watchdog-timeout=40
INFO: No 'msgwait-timeout=' specified in the command, use 2*watchdog timeout: 80
WARNING: Please stop all running resources and try again
WARNING: Or run this command with -F/--force option to leverage maintenance mode
WARNING: Understand risks that running RA has no cluster protection while the cluster is in maintenance mode and restarting

# Leverage maintenance mode
# crm -F sbd configure watchdog-timeout=40
INFO: No 'msgwait-timeout=' specified in the command, use 2*watchdog timeout: 80
INFO: Set cluster to maintenance mode
WARNING: "maintenance-mode" in crm_config is set to true, it was false
WARNING: Please stop DLM related resources (gfs2-clone) and try again
INFO: Set cluster from maintenance mode to normal
INFO: Delete cluster property "maintenance-mode" in crm_config
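
As a rough illustration of the DLM guard above (the agent list and the input format are assumptions, not the real crmsh API):

# Even with --force, changing watchdog/msgwait timeouts is refused while
# DLM-related resources (such as the gfs2-clone above) are running, because
# they cannot safely survive an sbd/cluster restart.
DLM_RELATED_AGENTS = ("ocf:pacemaker:controld", "ocf:heartbeat:Filesystem")

def check_no_dlm_running(running_resources: dict[str, str]) -> None:
    """running_resources maps resource id -> agent,
    e.g. {"gfs2-clone": "ocf:heartbeat:Filesystem"} (sketch input only)."""
    dlm_like = [rid for rid, agent in running_resources.items()
                if agent in DLM_RELATED_AGENTS]
    if dlm_like:
        raise RuntimeError(
            f"Please stop DLM related resources ({', '.join(dlm_like)}) and try again")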

@liangxin1300 force-pushed the 20251022_improve_leverage_maintenance_mode branch 3 times, most recently from 6fe9092 to e17caa0 on October 23, 2025 06:21
@liangxin1300 changed the title from "Dev: sbd: Improve leverage maintenance mode" to "Dev: sbd: Improve the process of leveraging maintenance mode" on Oct 23, 2025
- Drop the function `restart_cluster_if_possible`
- Introduced a new function `utils.able_to_restart_cluster` to check if
  the cluster can be restarted. Call it before changing any configurations.
- Add leverage maintenance mode in `sbd device remove` and `sbd purge` commands
@liangxin1300 force-pushed the 20251022_improve_leverage_maintenance_mode branch from e17caa0 to ca17414 on October 23, 2025 13:30

codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 41.02564% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.73%. Comparing base (2b481a7) to head (ca17414).

Files with missing lines   Patch %   Lines
crmsh/utils.py             18.75%    13 Missing ⚠️
crmsh/ui_sbd.py            62.50%    6 Missing ⚠️
crmsh/sbd.py               50.00%    2 Missing ⚠️
crmsh/xmlutil.py           33.33%    2 Missing ⚠️
Additional details and impacted files
Flag          Coverage Δ
integration   55.17% <5.12%> (-0.03%) ⬇️
unit          52.91% <41.02%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines   Coverage Δ
crmsh/sbd.py               86.20% <50.00%> (-0.31%) ⬇️
crmsh/xmlutil.py           70.24% <33.33%> (-0.12%) ⬇️
crmsh/ui_sbd.py            84.88% <62.50%> (+0.12%) ⬆️
crmsh/utils.py             67.17% <18.75%> (-0.39%) ⬇️
