Skip to content

Conversation

rameshraghupathy
Copy link
Contributor

@rameshraghupathy rameshraghupathy commented May 14, 2025

Provide support for SmartSwitch DPU module graceful shutdown.

Description:

  • Single source of truth for transitions

    • All components now use sonic_platform_base.module_base.ModuleBase helpers:

      • set_module_state_transition(db, name, transition_type)
      • clear_module_state_transition(db, name)
      • get_module_state_transition(db, name) -> dict
      • is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
    • Eliminates duplicated logic and race-prone direct Redis writes.

  • Correct table everywhere

    • Standardized on CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).
    • HLD mismatch addressed in code (HLD fix tracked separately).
  • Ownership & lifecycle

    • The initiator of an operation (startup/shutdown/reboot) sets:

      • state_transition_in_progress=True
      • transition_type=<op>
      • transition_start_time=<utc-iso8601>
    • The platform (set_admin_state()) is responsible for clearing:

      • state_transition_in_progress=False
      • optionally transition_end_time=<epoch> (or similar end stamp).
    • CLI pre-clears only when a prior transition is timed out.

  • Timeouts & policy

    • Platform JSON path only: /usr/share/sonic/device/{plat}/platform.json; else constants.

    • Typical production values used:

      • startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
    • Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and implemented inside platform set_admin_state()—not in ModuleBase.

  • Boot behavior

    • chassisd on start:

      1. Clears stale flags once (centralized sweep).
      2. Runs set_initial_dpu_admin_state() which marks transitions via ModuleBase before calling platform set_admin_state().
      3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
  • gNOI shutdown daemon

    • Listens on CHASSIS_MODULE_TABLE and triggers only when:

      • state_transition_in_progress=True and transition_type=shutdown.
    • Never clears the flag (ownership stays with the platform).

    • Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).

  • CLI (config chassis modules …)

    • Uses ModuleBase APIs for all set/get/timeout checks.
    • If a previous transition is stuck, is_module_state_transition_timed_out() → auto-clear then proceed.
    • Sets transition at the start of startup/shutdown; platform clears on completion.
    • Fabric card flow retained; edits are surgical.
  • Redis robustness

    • Helpers handle both stacks (swsssdk/swsscommon); no hset(mapping=...) usage.
    • Consistent HGETALL/HSET paths; resilient to connector differences.
  • Race reduction & consistency

    • Centralized writes prevent multi-writer races.
    • All transition writes include transition_start_time; clears may add an end stamp.
    • Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
  • Change scope

    • Minimal, targeted diffs.
    • No background tasks added, no broad refactors beyond transition handling.
    • Behavior changes are limited to making transition semantics correct and uniform across repos.

HLD: # 1991 sonic-net/SONiC#1991
sonic-platform-common: #567 sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667

How to verify it
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@hdwhdw
Copy link
Contributor

hdwhdw commented May 19, 2025

Do you mind pasting the steps and output for testing (commands) in the PR description

try:
msg = json.loads(line)
dpu_ip = msg["dpu_ip"]
port = msg.get("port", 50052)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

port information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam No longer exists

msg = json.loads(line)
dpu_ip = msg["dpu_ip"]
port = msg.get("port", 50052)
method = msg.get("method", 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HALT method is 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam No longer this is valid and we use HALT method 3

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

if [[ "$subtype" == "SmartSwitch" && "$is_dpu" != "True" ]]; then
exit 0
else
echo "gnoi-reboot-daemon is intended for SmartSwitch platforms only."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be debug level log, we don't want to see this on other platforms unnecessarily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam Not valid anymore

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rameshraghupathy
Copy link
Contributor Author

graceful-shutdow-log.pdf

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam requested a review from Copilot September 29, 2025 19:51
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.

Comments suppressed due to low confidence (1)

tests/gnoi_shutdown_daemon_test.py:1

  • Replace bare assert statement with self.assertEqual(mock_exec.call_count, 0) for better test failure messages and consistency with unittest framework.
import unittest

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

# gNOI helpers
# ############

def execute_gnoi_command(command_args, timeout_sec=REBOOT_RPC_TIMEOUT_SEC):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like execute any command. Maybe we should bake in docker exec gnoi_client.

@vvolam vvolam requested review from Copilot and hdwhdw October 8, 2025 21:49
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +570 to +579
def test_shutdown_skips_when_port_closed(self):
with patch("gnoi_shutdown_daemon.SonicV2Connector") as mock_sonic, \
patch("gnoi_shutdown_daemon.ModuleBase", new=_MBStub2), \
patch("gnoi_shutdown_daemon.execute_gnoi_command") as mock_exec, \
patch("gnoi_shutdown_daemon.is_tcp_open", return_value=False), \
patch("gnoi_shutdown_daemon._cfg_get_entry",
side_effect=lambda table, key:
{"ips@": "10.0.0.1"} if table == "DHCP_SERVER_IPV4_PORT" else {"gnmi_port": "8080"}), \
patch("gnoi_shutdown_daemon.time.sleep", return_value=None), \
patch("gnoi_shutdown_daemon.logger") as mock_logger:
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test method test_shutdown_skips_when_port_closed is defined outside of the TestGnoiShutdownDaemon class (line 570). It should be indented to be a method of the class.

Copilot uses AI. Check for mistakes.

Comment on lines +615 to +627
import gnoi_shutdown_daemon as d
db = MagicMock()
db.pubsub.return_value = _mk_pubsub_once2()
mock_sonic.return_value = db

try:
d.main()
except Exception:
pass

mock_exec.assert_not_called()
self.assertTrue(any("ip not found" in str(c.args[0]).lower()
for c in (mock_logger.log_error.call_args_list or [])))
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test method test_shutdown_missing_ip_logs_error_and_skips is defined outside of the TestGnoiShutdownDaemon class (line 607). It should be indented to be a method of the class.

Suggested change
import gnoi_shutdown_daemon as d
db = MagicMock()
db.pubsub.return_value = _mk_pubsub_once2()
mock_sonic.return_value = db
try:
d.main()
except Exception:
pass
mock_exec.assert_not_called()
self.assertTrue(any("ip not found" in str(c.args[0]).lower()
for c in (mock_logger.log_error.call_args_list or [])))
import gnoi_shutdown_daemon as d
db = MagicMock()
db.pubsub.return_value = _mk_pubsub_once2()
mock_sonic.return_value = db
try:
d.main()
except Exception:
pass
mock_exec.assert_not_called()
self.assertTrue(any("ip not found" in str(c.args[0]).lower()
for c in (mock_logger.log_error.call_args_list or [])))

Copilot uses AI. Check for mistakes.

Comment on lines +630 to +639
def test_shutdown_reboot_nonzero_does_not_poll_status(self):
with patch("gnoi_shutdown_daemon.SonicV2Connector") as mock_sonic, \
patch("gnoi_shutdown_daemon.ModuleBase", new=_MBStub2), \
patch("gnoi_shutdown_daemon.execute_gnoi_command") as mock_exec, \
patch("gnoi_shutdown_daemon.is_tcp_open", return_value=True), \
patch("gnoi_shutdown_daemon._cfg_get_entry",
side_effect=lambda table, key:
{"ips@": "10.0.0.1"} if table == "DHCP_SERVER_IPV4_PORT" else {"gnmi_port": "8080"}), \
patch("gnoi_shutdown_daemon.time.sleep", return_value=None), \
patch("gnoi_shutdown_daemon.logger") as mock_logger:
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test method test_shutdown_reboot_nonzero_does_not_poll_status is defined outside of the TestGnoiShutdownDaemon class (line 630). It should be indented to be a method of the class.

Copilot uses AI. Check for mistakes.

Comment on lines +658 to +681
def test_reboot_transition_type_success(self):
"""Test that reboot transition type is handled correctly and clears transition on success"""

class _MBStubReboot:
def __init__(self, *a, **k):
pass

@staticmethod
def get_module_state_transition(*_a, **_k):
return {"state_transition_in_progress": "True", "transition_type": "reboot"}

@staticmethod
def clear_module_state_transition(db, name):
return True

with patch("gnoi_shutdown_daemon.SonicV2Connector") as mock_sonic, \
patch("gnoi_shutdown_daemon.ModuleBase", new=_MBStubReboot), \
patch("gnoi_shutdown_daemon.execute_gnoi_command") as mock_exec, \
patch("gnoi_shutdown_daemon.is_tcp_open", return_value=True), \
patch("gnoi_shutdown_daemon._cfg_get_entry",
side_effect=lambda table, key:
{"ips@": "10.0.0.1"} if table == "DHCP_SERVER_IPV4_PORT" else {"gnmi_port": "8080"}), \
patch("gnoi_shutdown_daemon.time.sleep", return_value=None), \
patch("gnoi_shutdown_daemon.logger") as mock_logger:
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test method test_reboot_transition_type_success is defined outside of the TestGnoiShutdownDaemon class (line 658). It should be indented to be a method of the class.

Copilot uses AI. Check for mistakes.

Comment on lines +711 to +734
def test_reboot_transition_clear_failure(self):
"""Test that reboot transition logs warning when clear fails"""

class _MBStubRebootFail:
def __init__(self, *a, **k):
pass

@staticmethod
def get_module_state_transition(*_a, **_k):
return {"state_transition_in_progress": "True", "transition_type": "reboot"}

@staticmethod
def clear_module_state_transition(db, name):
return False # Simulate failure

with patch("gnoi_shutdown_daemon.SonicV2Connector") as mock_sonic, \
patch("gnoi_shutdown_daemon.ModuleBase", new=_MBStubRebootFail), \
patch("gnoi_shutdown_daemon.execute_gnoi_command") as mock_exec, \
patch("gnoi_shutdown_daemon.is_tcp_open", return_value=True), \
patch("gnoi_shutdown_daemon._cfg_get_entry",
side_effect=lambda table, key:
{"ips@": "10.0.0.1"} if table == "DHCP_SERVER_IPV4_PORT" else {"gnmi_port": "8080"}), \
patch("gnoi_shutdown_daemon.time.sleep", return_value=None), \
patch("gnoi_shutdown_daemon.logger") as mock_logger:
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test method test_reboot_transition_clear_failure is defined outside of the TestGnoiShutdownDaemon class (line 711). It should be indented to be a method of the class.

Copilot uses AI. Check for mistakes.

Comment on lines +772 to +794
import gnoi_shutdown_daemon as d
db = MagicMock()
pubsub = MagicMock()
pubsub.get_message.side_effect = [
{"type": "pmessage", "channel": "__keyspace@6__:CHASSIS_MODULE_TABLE|DPU0", "data": "set"},
Exception("stop"),
]
db.pubsub.return_value = pubsub
mock_sonic.return_value = db

mock_exec.side_effect = [
(0, "OK", ""), # Reboot command
(0, "not complete", ""), # RebootStatus - never returns complete
]

try:
d.main()
except Exception:
pass

# Check for timeout warning
all_logs = " | ".join(str(c) for c in mock_logger.method_calls)
self.assertIn("Status polling of halting the services on DPU timed out for DPU0", all_logs)
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test method test_status_polling_timeout_warning is defined outside of the TestGnoiShutdownDaemon class (line 759). It should be indented to be a method of the class.

Suggested change
import gnoi_shutdown_daemon as d
db = MagicMock()
pubsub = MagicMock()
pubsub.get_message.side_effect = [
{"type": "pmessage", "channel": "__keyspace@6__:CHASSIS_MODULE_TABLE|DPU0", "data": "set"},
Exception("stop"),
]
db.pubsub.return_value = pubsub
mock_sonic.return_value = db
mock_exec.side_effect = [
(0, "OK", ""), # Reboot command
(0, "not complete", ""), # RebootStatus - never returns complete
]
try:
d.main()
except Exception:
pass
# Check for timeout warning
all_logs = " | ".join(str(c) for c in mock_logger.method_calls)
self.assertIn("Status polling of halting the services on DPU timed out for DPU0", all_logs)
import gnoi_shutdown_daemon as d
db = MagicMock()
pubsub = MagicMock()
pubsub.get_message.side_effect = [
{"type": "pmessage", "channel": "__keyspace@6__:CHASSIS_MODULE_TABLE|DPU0", "data": "set"},
Exception("stop"),
]
db.pubsub.return_value = pubsub
mock_sonic.return_value = db
mock_exec.side_effect = [
(0, "OK", ""), # Reboot command
(0, "not complete", ""), # RebootStatus - never returns complete
]
try:
d.main()
except Exception:
pass
# Check for timeout warning
all_logs = " | ".join(str(c) for c in mock_logger.method_calls)
self.assertIn("Status polling of halting the services on DPU timed out for DPU0", all_logs)

Copilot uses AI. Check for mistakes.

Copy link
Contributor

@hdwhdw hdwhdw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some nits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be bash? Can it be python? Inline script is difficult to maintain.

entry = _cfg_get_entry("DHCP_SERVER_IPV4_PORT", f"bridge-midplane|{dpu_name.lower()}")
return entry.get("ips@")

def get_gnmi_port(dpu_name: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest get_dpu_gnmi_port so not to be confused with local NPU gnmi port.

enforcer = TimeoutEnforcer(db, module_base, interval_sec=5)
enforcer.start()

while True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like this loop is large and can create issues for debugging and maintaining. Maybe some logic can be extracted out. For example, the gnoi reboot->poll reboot status can be extracted into a class and condensed into a single function. But it is up to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants