-
Notifications
You must be signed in to change notification settings - Fork 132
Module graceful shutdown support #255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Module graceful shutdown support #255
Conversation
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Do you mind pasting the steps and output for testing (commands) in the PR description |
scripts/gnoi-reboot-daemon
Outdated
try: | ||
msg = json.loads(line) | ||
dpu_ip = msg["dpu_ip"] | ||
port = msg.get("port", 50052) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
port information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam No longer exists
scripts/gnoi-reboot-daemon
Outdated
msg = json.loads(line) | ||
dpu_ip = msg["dpu_ip"] | ||
port = msg.get("port", 50052) | ||
method = msg.get("method", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HALT method is 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam No longer this is valid and we use HALT method 3
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
scripts/check_platform.sh
Outdated
if [[ "$subtype" == "SmartSwitch" && "$is_dpu" != "True" ]]; then | ||
exit 0 | ||
else | ||
echo "gnoi-reboot-daemon is intended for SmartSwitch platforms only." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be debug level log, we don't want to see this on other platforms unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam Not valid anymore
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
6a74faf
to
cac4b67
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.
Comments suppressed due to low confidence (1)
tests/gnoi_shutdown_daemon_test.py:1
- Replace bare
assert
statement withself.assertEqual(mock_exec.call_count, 0)
for better test failure messages and consistency with unittest framework.
import unittest
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
# gNOI helpers | ||
# ############ | ||
|
||
def execute_gnoi_command(command_args, timeout_sec=REBOOT_RPC_TIMEOUT_SEC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like execute any command. Maybe we should bake in docker exec gnoi_client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
def test_shutdown_skips_when_port_closed(self): | ||
with patch("gnoi_shutdown_daemon.SonicV2Connector") as mock_sonic, \ | ||
patch("gnoi_shutdown_daemon.ModuleBase", new=_MBStub2), \ | ||
patch("gnoi_shutdown_daemon.execute_gnoi_command") as mock_exec, \ | ||
patch("gnoi_shutdown_daemon.is_tcp_open", return_value=False), \ | ||
patch("gnoi_shutdown_daemon._cfg_get_entry", | ||
side_effect=lambda table, key: | ||
{"ips@": "10.0.0.1"} if table == "DHCP_SERVER_IPV4_PORT" else {"gnmi_port": "8080"}), \ | ||
patch("gnoi_shutdown_daemon.time.sleep", return_value=None), \ | ||
patch("gnoi_shutdown_daemon.logger") as mock_logger: |
Copilot
AI
Oct 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test method test_shutdown_skips_when_port_closed
is defined outside of the TestGnoiShutdownDaemon
class (line 570). It should be indented to be a method of the class.
Copilot uses AI. Check for mistakes.
import gnoi_shutdown_daemon as d | ||
db = MagicMock() | ||
db.pubsub.return_value = _mk_pubsub_once2() | ||
mock_sonic.return_value = db | ||
|
||
try: | ||
d.main() | ||
except Exception: | ||
pass | ||
|
||
mock_exec.assert_not_called() | ||
self.assertTrue(any("ip not found" in str(c.args[0]).lower() | ||
for c in (mock_logger.log_error.call_args_list or []))) |
Copilot
AI
Oct 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test method test_shutdown_missing_ip_logs_error_and_skips
is defined outside of the TestGnoiShutdownDaemon
class (line 607). It should be indented to be a method of the class.
import gnoi_shutdown_daemon as d | |
db = MagicMock() | |
db.pubsub.return_value = _mk_pubsub_once2() | |
mock_sonic.return_value = db | |
try: | |
d.main() | |
except Exception: | |
pass | |
mock_exec.assert_not_called() | |
self.assertTrue(any("ip not found" in str(c.args[0]).lower() | |
for c in (mock_logger.log_error.call_args_list or []))) | |
import gnoi_shutdown_daemon as d | |
db = MagicMock() | |
db.pubsub.return_value = _mk_pubsub_once2() | |
mock_sonic.return_value = db | |
try: | |
d.main() | |
except Exception: | |
pass | |
mock_exec.assert_not_called() | |
self.assertTrue(any("ip not found" in str(c.args[0]).lower() | |
for c in (mock_logger.log_error.call_args_list or []))) |
Copilot uses AI. Check for mistakes.
def test_shutdown_reboot_nonzero_does_not_poll_status(self): | ||
with patch("gnoi_shutdown_daemon.SonicV2Connector") as mock_sonic, \ | ||
patch("gnoi_shutdown_daemon.ModuleBase", new=_MBStub2), \ | ||
patch("gnoi_shutdown_daemon.execute_gnoi_command") as mock_exec, \ | ||
patch("gnoi_shutdown_daemon.is_tcp_open", return_value=True), \ | ||
patch("gnoi_shutdown_daemon._cfg_get_entry", | ||
side_effect=lambda table, key: | ||
{"ips@": "10.0.0.1"} if table == "DHCP_SERVER_IPV4_PORT" else {"gnmi_port": "8080"}), \ | ||
patch("gnoi_shutdown_daemon.time.sleep", return_value=None), \ | ||
patch("gnoi_shutdown_daemon.logger") as mock_logger: |
Copilot
AI
Oct 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test method test_shutdown_reboot_nonzero_does_not_poll_status
is defined outside of the TestGnoiShutdownDaemon
class (line 630). It should be indented to be a method of the class.
Copilot uses AI. Check for mistakes.
def test_reboot_transition_type_success(self): | ||
"""Test that reboot transition type is handled correctly and clears transition on success""" | ||
|
||
class _MBStubReboot: | ||
def __init__(self, *a, **k): | ||
pass | ||
|
||
@staticmethod | ||
def get_module_state_transition(*_a, **_k): | ||
return {"state_transition_in_progress": "True", "transition_type": "reboot"} | ||
|
||
@staticmethod | ||
def clear_module_state_transition(db, name): | ||
return True | ||
|
||
with patch("gnoi_shutdown_daemon.SonicV2Connector") as mock_sonic, \ | ||
patch("gnoi_shutdown_daemon.ModuleBase", new=_MBStubReboot), \ | ||
patch("gnoi_shutdown_daemon.execute_gnoi_command") as mock_exec, \ | ||
patch("gnoi_shutdown_daemon.is_tcp_open", return_value=True), \ | ||
patch("gnoi_shutdown_daemon._cfg_get_entry", | ||
side_effect=lambda table, key: | ||
{"ips@": "10.0.0.1"} if table == "DHCP_SERVER_IPV4_PORT" else {"gnmi_port": "8080"}), \ | ||
patch("gnoi_shutdown_daemon.time.sleep", return_value=None), \ | ||
patch("gnoi_shutdown_daemon.logger") as mock_logger: |
Copilot
AI
Oct 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test method test_reboot_transition_type_success
is defined outside of the TestGnoiShutdownDaemon
class (line 658). It should be indented to be a method of the class.
Copilot uses AI. Check for mistakes.
def test_reboot_transition_clear_failure(self): | ||
"""Test that reboot transition logs warning when clear fails""" | ||
|
||
class _MBStubRebootFail: | ||
def __init__(self, *a, **k): | ||
pass | ||
|
||
@staticmethod | ||
def get_module_state_transition(*_a, **_k): | ||
return {"state_transition_in_progress": "True", "transition_type": "reboot"} | ||
|
||
@staticmethod | ||
def clear_module_state_transition(db, name): | ||
return False # Simulate failure | ||
|
||
with patch("gnoi_shutdown_daemon.SonicV2Connector") as mock_sonic, \ | ||
patch("gnoi_shutdown_daemon.ModuleBase", new=_MBStubRebootFail), \ | ||
patch("gnoi_shutdown_daemon.execute_gnoi_command") as mock_exec, \ | ||
patch("gnoi_shutdown_daemon.is_tcp_open", return_value=True), \ | ||
patch("gnoi_shutdown_daemon._cfg_get_entry", | ||
side_effect=lambda table, key: | ||
{"ips@": "10.0.0.1"} if table == "DHCP_SERVER_IPV4_PORT" else {"gnmi_port": "8080"}), \ | ||
patch("gnoi_shutdown_daemon.time.sleep", return_value=None), \ | ||
patch("gnoi_shutdown_daemon.logger") as mock_logger: |
Copilot
AI
Oct 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test method test_reboot_transition_clear_failure
is defined outside of the TestGnoiShutdownDaemon
class (line 711). It should be indented to be a method of the class.
Copilot uses AI. Check for mistakes.
import gnoi_shutdown_daemon as d | ||
db = MagicMock() | ||
pubsub = MagicMock() | ||
pubsub.get_message.side_effect = [ | ||
{"type": "pmessage", "channel": "__keyspace@6__:CHASSIS_MODULE_TABLE|DPU0", "data": "set"}, | ||
Exception("stop"), | ||
] | ||
db.pubsub.return_value = pubsub | ||
mock_sonic.return_value = db | ||
|
||
mock_exec.side_effect = [ | ||
(0, "OK", ""), # Reboot command | ||
(0, "not complete", ""), # RebootStatus - never returns complete | ||
] | ||
|
||
try: | ||
d.main() | ||
except Exception: | ||
pass | ||
|
||
# Check for timeout warning | ||
all_logs = " | ".join(str(c) for c in mock_logger.method_calls) | ||
self.assertIn("Status polling of halting the services on DPU timed out for DPU0", all_logs) |
Copilot
AI
Oct 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test method test_status_polling_timeout_warning
is defined outside of the TestGnoiShutdownDaemon
class (line 759). It should be indented to be a method of the class.
import gnoi_shutdown_daemon as d | |
db = MagicMock() | |
pubsub = MagicMock() | |
pubsub.get_message.side_effect = [ | |
{"type": "pmessage", "channel": "__keyspace@6__:CHASSIS_MODULE_TABLE|DPU0", "data": "set"}, | |
Exception("stop"), | |
] | |
db.pubsub.return_value = pubsub | |
mock_sonic.return_value = db | |
mock_exec.side_effect = [ | |
(0, "OK", ""), # Reboot command | |
(0, "not complete", ""), # RebootStatus - never returns complete | |
] | |
try: | |
d.main() | |
except Exception: | |
pass | |
# Check for timeout warning | |
all_logs = " | ".join(str(c) for c in mock_logger.method_calls) | |
self.assertIn("Status polling of halting the services on DPU timed out for DPU0", all_logs) | |
import gnoi_shutdown_daemon as d | |
db = MagicMock() | |
pubsub = MagicMock() | |
pubsub.get_message.side_effect = [ | |
{"type": "pmessage", "channel": "__keyspace@6__:CHASSIS_MODULE_TABLE|DPU0", "data": "set"}, | |
Exception("stop"), | |
] | |
db.pubsub.return_value = pubsub | |
mock_sonic.return_value = db | |
mock_exec.side_effect = [ | |
(0, "OK", ""), # Reboot command | |
(0, "not complete", ""), # RebootStatus - never returns complete | |
] | |
try: | |
d.main() | |
except Exception: | |
pass | |
# Check for timeout warning | |
all_logs = " | ".join(str(c) for c in mock_logger.method_calls) | |
self.assertIn("Status polling of halting the services on DPU timed out for DPU0", all_logs) |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just some nits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be bash? Can it be python? Inline script is difficult to maintain.
entry = _cfg_get_entry("DHCP_SERVER_IPV4_PORT", f"bridge-midplane|{dpu_name.lower()}") | ||
return entry.get("ips@") | ||
|
||
def get_gnmi_port(dpu_name: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest get_dpu_gnmi_port
so not to be confused with local NPU gnmi port.
enforcer = TimeoutEnforcer(db, module_base, interval_sec=5) | ||
enforcer.start() | ||
|
||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like this loop is large and can create issues for debugging and maintaining. Maybe some logic can be extracted out. For example, the gnoi reboot->poll reboot status can be extracted into a class and condensed into a single function. But it is up to you.
Provide support for SmartSwitch DPU module graceful shutdown.
Description:
Single source of truth for transitions
All components now use
sonic_platform_base.module_base.ModuleBase
helpers:set_module_state_transition(db, name, transition_type)
clear_module_state_transition(db, name)
get_module_state_transition(db, name) -> dict
is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
Eliminates duplicated logic and race-prone direct Redis writes.
Correct table everywhere
CHASSIS_MODULE_TABLE
(replacesCHASSIS_MODULE_INFO_TABLE
).Ownership & lifecycle
The initiator of an operation (
startup
/shutdown
/reboot
) sets:state_transition_in_progress=True
transition_type=<op>
transition_start_time=<utc-iso8601>
The platform (
set_admin_state()
) is responsible for clearing:state_transition_in_progress=False
transition_end_time=<epoch>
(or similar end stamp).CLI pre-clears only when a prior transition is timed out.
Timeouts & policy
Platform JSON path only:
/usr/share/sonic/device/{plat}/platform.json
; else constants.Typical production values used:
startup: 180s
,shutdown: 180s
(≈graceful_wait 60s + power 120s
),reboot: 120s
.Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and implemented inside platform
set_admin_state()
—not in ModuleBase.Boot behavior
chassisd
on start:set_initial_dpu_admin_state()
which marks transitions via ModuleBase before calling platformset_admin_state()
.gNOI shutdown daemon
Listens on
CHASSIS_MODULE_TABLE
and triggers only when:state_transition_in_progress=True
andtransition_type=shutdown
.Never clears the flag (ownership stays with the platform).
Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
CLI (
config chassis modules …
)is_module_state_transition_timed_out()
→ auto-clear then proceed.startup
/shutdown
; platform clears on completion.Redis robustness
hset(mapping=...)
usage.Race reduction & consistency
transition_start_time
; clears may add an end stamp.Change scope
HLD: # 1991 sonic-net/SONiC#1991
sonic-platform-common: #567 sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
How to verify it
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU