@rameshraghupathy rameshraghupathy commented Aug 25, 2025

Provide support for SmartSwitch DPU module graceful shutdown.

Description

  • Single source of truth for transitions

    • All components now use sonic_platform_base.module_base.ModuleBase helpers:

      • set_module_state_transition(db, name, transition_type)
      • clear_module_state_transition(db, name)
      • get_module_state_transition(db, name) -> dict
      • is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
    • Eliminates duplicated logic and race-prone direct Redis writes.
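The four helpers listed above can be pictured with a minimal in-memory sketch. This is not the sonic_platform_base implementation: the dict-backed db, the "CHASSIS_MODULE_TABLE|<name>" key layout, and the handling of a missing start stamp are assumptions made only so the example runs on its own. The field names come from this description.

```python
from datetime import datetime, timedelta, timezone

TABLE = "CHASSIS_MODULE_TABLE"


def _key(name):
    # Assumption: STATE_DB-style "TABLE|key" layout; here db is a plain dict.
    return f"{TABLE}|{name}"


def set_module_state_transition(db, name, transition_type):
    """Mark a transition as in progress and stamp its start time (UTC, ISO 8601)."""
    entry = db.setdefault(_key(name), {})
    entry["state_transition_in_progress"] = "True"
    entry["transition_type"] = transition_type
    entry["transition_start_time"] = datetime.now(timezone.utc).isoformat()


def clear_module_state_transition(db, name):
    """Drop the in-progress flag; the other fields are left in place."""
    entry = db.setdefault(_key(name), {})
    entry["state_transition_in_progress"] = "False"


def get_module_state_transition(db, name):
    """Return a copy of the transition fields, or an empty dict if none exist."""
    return dict(db.get(_key(name), {}))


def is_module_state_transition_timed_out(db, name, timeout_secs):
    """True only when a transition is flagged and older than timeout_secs."""
    entry = get_module_state_transition(db, name)
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        # Assumption: a flag without a start stamp is treated as stale.
        return True
    age = datetime.now(timezone.utc) - datetime.fromisoformat(start)
    return age > timedelta(seconds=timeout_secs)


# Usage: the initiator marks the transition, the platform later clears it.
db = {}
set_module_state_transition(db, "DPU0", "shutdown")
clear_module_state_transition(db, "DPU0")
```

Keeping every write behind these four functions is what lets the other components avoid touching the table fields directly.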

  • Correct table everywhere

    • Standardized on CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).
    • HLD mismatch addressed in code (HLD fix tracked separately).
  • Ownership & lifecycle

    • The initiator of an operation (startup/shutdown/reboot) sets:

      • state_transition_in_progress=True
      • transition_type=<op>
      • transition_start_time=<utc-iso8601>
    • The platform (set_admin_state()) is responsible for clearing:

      • state_transition_in_progress=False
      • optionally transition_end_time=<epoch> (or similar end stamp).
    • CLI pre-clears only when a prior transition is timed out.

  • Timeouts & policy

    • Timeouts are read only from the platform JSON at /usr/share/sonic/device/{plat}/platform.json; if it provides none, built-in constants are used.

    • Typical production values used:

      • startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
    • Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy, implemented inside the platform's set_admin_state() and not in ModuleBase.
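As a rough illustration of the "platform.json, else constants" rule, the sketch below loads per-operation timeouts and falls back to the typical values quoted above. The JSON key name dpu_transition_timeouts and the function name are hypothetical; this description does not say which keys the platform file uses.

```python
import json

# Typical production values from the description (seconds).
# shutdown = graceful_wait 60s + power 120s = 180s.
DEFAULT_TIMEOUTS = {"startup": 180, "shutdown": 180, "reboot": 120}
PLATFORM_JSON = "/usr/share/sonic/device/{plat}/platform.json"
TIMEOUTS_KEY = "dpu_transition_timeouts"  # hypothetical key name


def load_transition_timeouts(plat, path_template=PLATFORM_JSON):
    """Return per-operation timeouts, preferring platform.json over constants."""
    timeouts = dict(DEFAULT_TIMEOUTS)
    try:
        with open(path_template.format(plat=plat)) as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing or malformed file: keep the built-in constants.
        return timeouts

    overrides = data.get(TIMEOUTS_KEY, {}) if isinstance(data, dict) else {}
    if not isinstance(overrides, dict):
        overrides = {}
    for op in DEFAULT_TIMEOUTS:
        value = overrides.get(op)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value > 0:
            timeouts[op] = value
    return timeouts
```

Only positive numeric overrides are accepted here, so a partially filled or malformed file still yields a complete set of timeouts.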

  • Boot behavior

    • chassisd on start:

      1. Clears stale flags once (centralized sweep).
      2. Runs set_initial_dpu_admin_state() which marks transitions via ModuleBase before calling platform set_admin_state().
      3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
  • gNOI shutdown daemon

    • Listens on CHASSIS_MODULE_TABLE and triggers only when:

      • state_transition_in_progress=True and transition_type=shutdown.
    • Never clears the flag (ownership stays with the platform).

    • Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
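The trigger condition for the gNOI shutdown daemon can be written as a small read-only predicate. This is a sketch, assuming the table entry arrives as a dict of string fields; the function name is hypothetical and the case-insensitive comparison is a defensive assumption.

```python
def should_trigger_gnoi_shutdown(entry):
    """True only for an in-progress transition whose type is 'shutdown'.

    The entry is only read, never modified: clearing the flag stays
    with the platform.
    """
    in_progress = str(entry.get("state_transition_in_progress", "")).lower() == "true"
    is_shutdown = entry.get("transition_type") == "shutdown"
    return in_progress and is_shutdown


# Usage with entries shaped like the fields listed in this description.
shutdown_entry = {"state_transition_in_progress": "True", "transition_type": "shutdown"}
startup_entry = {"state_transition_in_progress": "True", "transition_type": "startup"}
assert should_trigger_gnoi_shutdown(shutdown_entry)
assert not should_trigger_gnoi_shutdown(startup_entry)
```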

  • CLI (config chassis modules …)

    • Uses ModuleBase APIs for all set/get/timeout checks.
    • If a previous transition is stuck (is_module_state_transition_timed_out() returns True), the CLI auto-clears it and then proceeds.
    • Sets transition at the start of startup/shutdown; platform clears on completion.
    • Fabric card flow retained; edits are surgical.
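The CLI flow above (pre-clear only on timeout, then mark the new transition) might look like the sketch below. It uses a plain dict and epoch seconds instead of the real ModuleBase helpers so that it runs on its own. The function name is hypothetical, and refusing to proceed while a live transition exists is an assumption, since this description only covers the timed-out case.

```python
import time


def begin_transition(store, name, transition_type, timeout_secs, now=None):
    """Mark a new transition, pre-clearing a previous one only if it timed out.

    Returns True if the transition was marked, False if an earlier
    transition is still within its timeout (assumed behavior).
    """
    now = time.time() if now is None else now
    entry = store.get(name, {})

    if entry.get("state_transition_in_progress") == "True":
        age = now - entry.get("transition_start_time", 0.0)
        if age <= timeout_secs:
            # A live transition owns the module; leave it untouched.
            return False
        # Stale transition: clear it before starting the new one.
        entry["state_transition_in_progress"] = "False"

    store[name] = {
        "state_transition_in_progress": "True",
        "transition_type": transition_type,
        "transition_start_time": now,
    }
    return True


# Usage: the second request is refused while the first is live,
# then accepted once the 180s timeout has elapsed.
store = {}
begin_transition(store, "DPU0", "shutdown", 180, now=1000.0)
begin_transition(store, "DPU0", "startup", 180, now=1100.0)   # refused
begin_transition(store, "DPU0", "startup", 180, now=1200.0)   # stale, replaced
```

The CLI never clears a transition on completion in this sketch; that step stays with the platform's set_admin_state().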
  • Redis robustness

    • Helpers handle both stacks (swsssdk/swsscommon); no hset(mapping=...) usage.
    • Consistent HGETALL/HSET paths; resilient to connector differences.
  • Race reduction & consistency

    • Centralized writes prevent multi-writer races.
    • All transition writes include transition_start_time; clears may add an end stamp.
    • Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
  • Change scope

    • Minimal, targeted diffs.
    • No background tasks added, no broad refactors beyond transition handling.
    • Behavior changes are limited to making transition semantics correct and uniform across repos.

Please refer to the HLD and related PRs:
HLD: sonic-net/SONiC#1991
sonic-host-services: sonic-net/sonic-host-services#255
sonic-platform-common: sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031 (Module graceful shutdown support)

How Has This Been Tested?
Run the "config chassis modules shutdown DPUx" command.
Verify that the DPU module shuts down gracefully by checking the logs in /var/log/syslog on both the NPU and the DPU.

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

Comment on lines 1435 to 1436
module.set_module_state_transition(v2, module_name, "startup")
except Exception as e:
Contributor

When is the module state transition from here cleared?

Comment on lines 1509 to 1517
wants_up = (admin_state != 'down')
not_online = (str(operational_state).lower()
              != str(ModuleBase.MODULE_STATUS_ONLINE).lower())
if wants_up and not_online and v2:
    try:
        module.set_module_state_transition(v2, module_name, "startup")
        self.log_info(f"Marked startup transition for {module_name} at boot")
    except Exception as e:
        self.log_error(f"Failed to set startup transition for {module_name}: {e}")
Contributor

Do we need this section? This function was only present to handle the case where the CONFIG_DB entry is not present, in which case dark mode is implied and we power off the DPUs.

ModuleTransitionFlagHelper().set_transition_flag(module_name)
try_get(self.module_updater.chassis.get_module(module_index).set_admin_state, admin_state, default=False)
# Only run pre-shutdown on DOWN path
try_get(module.module_pre_shutdown, default=False)
Contributor

module_pre_shutdown is supposed to be called if we need to power off the DPU or if we need to stay in dark mode. The existing code was already structured that way, so please align with it.

# Clear transition flag in STATE_DB via ModuleBase centralized API
try:
    module_obj = self.chassis.get_module(module_index)
    module_obj.clear_module_state_transition(self._state_v2, key)
Contributor

@rameshraghupathy Clearing on operational state change is a problem, because the change can be intentional or unintentional. Moreover, a reboot goes from online to offline and back to online, so the transition would be cleared during that state change, which is not the expected behavior.
