-
Notifications
You must be signed in to change notification settings - Fork 191
Module graceful shutdown support #667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Module graceful shutdown support #667
Conversation
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
sonic-chassisd/scripts/chassisd
Outdated
module.set_module_state_transition(v2, module_name, "startup") | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is the module state transition from here cleared?
sonic-chassisd/scripts/chassisd
Outdated
wants_up = (admin_state != 'down') | ||
not_online = (str(operational_state).lower() | ||
!= str(ModuleBase.MODULE_STATUS_ONLINE).lower()) | ||
if wants_up and not_online and v2: | ||
try: | ||
module.set_module_state_transition(v2, module_name, "startup") | ||
self.log_info(f"Marked startup transition for {module_name} at boot") | ||
except Exception as e: | ||
self.log_error(f"Failed to set startup transition for {module_name}: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this section, this function was only present to handle the case were the CONFIG_DB entry is not present, in which case dark mode is implied and we power off the DPUs
sonic-chassisd/scripts/chassisd
Outdated
ModuleTransitionFlagHelper().set_transition_flag(module_name) | ||
try_get(self.module_updater.chassis.get_module(module_index).set_admin_state, admin_state, default=False) | ||
# Only run pre-shutdown on DOWN path | ||
try_get(module.module_pre_shutdown, default=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module_pre_shutdown is supposed to be called if we need to power off the DPU, or if we need to stay in dark mode, as the exisitng code was present in the same format, please align
sonic-chassisd/scripts/chassisd
Outdated
# Clear transition flag in STATE_DB via ModuleBase centralized API | ||
try: | ||
module_obj = self.chassis.get_module(module_index) | ||
module_obj.clear_module_state_transition(self._state_v2, key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy Clearing on operational state change is a problem, this can be intentional or unintentional, moreover, there is a transition from online to offline to online when we do a reboot, this will clear the transition during that state change, which is not the expected behavior
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
dfad0fc
to
5b2c0f6
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
7836e1f
to
3169b9d
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
eedae8c
to
ddf58cc
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Provide support for SmartSwitch DPU module graceful shutdown.
Description
Single source of truth for transitions
All components now use
sonic_platform_base.module_base.ModuleBase
helpers:set_module_state_transition(db, name, transition_type)
clear_module_state_transition(db, name)
get_module_state_transition(db, name) -> dict
is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
Eliminates duplicated logic and race-prone direct Redis writes.
Correct table everywhere
CHASSIS_MODULE_TABLE
(replacesCHASSIS_MODULE_INFO_TABLE
).Ownership & lifecycle
The initiator of an operation (
startup
/shutdown
/reboot
) sets:state_transition_in_progress=True
transition_type=<op>
transition_start_time=<utc-iso8601>
The platform (
set_admin_state()
) is responsible for clearing:state_transition_in_progress=False
transition_end_time=<epoch>
(or similar end stamp).CLI pre-clears only when a prior transition is timed out.
Timeouts & policy
Platform JSON path only:
/usr/share/sonic/device/{plat}/platform.json
; else constants.Typical production values used:
startup: 180s
,shutdown: 180s
(≈graceful_wait 60s + power 120s
),reboot: 120s
.Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and implemented inside platform
set_admin_state()
—not in ModuleBase.Boot behavior
chassisd
on start:set_initial_dpu_admin_state()
which marks transitions via ModuleBase before calling platformset_admin_state()
.gNOI shutdown daemon
Listens on
CHASSIS_MODULE_TABLE
and triggers only when:state_transition_in_progress=True
andtransition_type=shutdown
.Never clears the flag (ownership stays with the platform).
Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
CLI (
config chassis modules …
)is_module_state_transition_timed_out()
→ auto-clear then proceed.startup
/shutdown
; platform clears on completion.Redis robustness
hset(mapping=...)
usage.Race reduction & consistency
transition_start_time
; clears may add an end stamp.Change scope
Please refer to the HLD and related PRs:
HLD: # 1991 sonic-net/SONiC#1991
sonic-host-services: sonic-net/sonic-host-services#255
sonic-platform-common: sonic-net/sonic-platform-common#567
Module graceful shutdown support #4031 sonic-net/sonic-utilities#4031
How Has This Been Tested?
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU