-
Notifications
You must be signed in to change notification settings - Fork 192
[xcvrd] Add new thread to poll temperatures more aggressively #685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
@prgeor @judyjoseph FYI this is still being tested but feel free to comment early |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this change @Staphylo! LGTM
def get_transceiver_dom_temperature(self, physical_port): | ||
try: | ||
return { | ||
'temperature': self.sfp_obj_dict[physical_port].get_temperature(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_temperature
reports 0 when it's not supported by the xcvr.
We should report N/A
in such cases after initially checking whether this is supported by the xcvr during init.
We need to also re-verify this when an insertion/removal event is detected.
self.on_remove_logical_port(port_change_event) | ||
self.port_mapping.handle_port_change_event(port_change_event) | ||
|
||
def on_remove_logical_port(self, port_change_event): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to keep this API here as it is already defined in the base class DomInfoUpdateBase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want this logic to only run once. I could however move the code related to the dom_temperature_tbl
in the DomThermalInfoUpdateTask
and not have it in the main Dom loop.
The main advantage was that at this point the DomThermalInfoUpdateTask
didn't have to be aware of OIR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Staphylo can you please test physical OIR? Removal of the optics should clear the temp field and insertion should populate the table and values should get updated periodically?
@Staphylo "With this loop running every 5s this means this represent a ~2.5% of busyness time which is acceptable." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a new dedicated thread DomThermalInfoUpdateTask
to poll temperature information from transceivers more frequently (every 5 seconds) compared to the existing DOM monitoring thread (60 seconds). This enables more responsive cooling algorithms for high-power optics that heat up quickly.
- Adds new
TRANSCEIVER_DOM_TEMPERATURE
table for storing temperature-only data - Refactors DOM management by creating a common base class
DomInfoUpdateBase
- Implements optimized temperature-only polling function for better performance
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
xcvr_table_helper.py | Adds support for new TRANSCEIVER_DOM_TEMPERATURE table |
xcvrd.py | Integrates new thermal monitoring thread and updates cleanup operations |
utils.py | Adds temperature-specific data retrieval function |
db_utils.py | Implements database posting function for temperature data |
dom_mgr.py | Refactors DOM management with base class and adds thermal monitoring thread |
test_xcvrd.py | Updates tests to cover new thread and temperature table functionality |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
sonic-xcvrd/xcvrd/dom/dom_mgr.py
Outdated
continue | ||
|
||
if not sfp_status_helper.detect_port_in_error_status(logical_port_name, self.xcvr_table_helper.get_status_sw_tbl(asic_index)): | ||
if not xcvrd._wrapper_get_presence(physical_port): |
Copilot
AI
Oct 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the module-level function xcvrd._wrapper_get_presence
creates a circular dependency and tight coupling. Consider injecting this dependency or using the common module's equivalent function.
if not xcvrd._wrapper_get_presence(physical_port): | |
if not common.get_presence(physical_port): |
Copilot uses AI. Check for mistakes.
875f690
to
b85433b
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Staphylo Can you please help with the below?
- Ensure that all the tables in
update_port_db_diagnostics_on_link_change
are updated with link change event - Please capture an output of the
TRANSCEIVER_DOM_TEMPERATURE
table - Please fix the test failure
Which service / component is going to consume the temperature data? Why not change thermalctld who is already providing the temperature data to state DB? |
The idea here was to have a separate thread in xcvrd to get the optics temperature alone, to drive the cooling algorithm faster. The creation of this thread can be controlled via a knob -- need to be turned on only on those platforms which need this. Btw @keboliu in thermalctld are you referring to this loop where SFP thermals are retrieved:
|
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Hi @judyjoseph, I am OK about the knob. However,
|
We actually used to report xcvr temperature via the dedicated From my understanding there is a desire for On the other hand, |
@eddyk-nvidia @keboliu could you please kindly share your comment? |
For high power modules it is important to query the temperature often as they can heat up quickly and the cooling algorithm needs to react quickly to avoid damaging the transceiver modules. Introducing a new DomThermalUpdateInfoTask thread which is solely polling the temperature information out of the xcvrs and publishing these in a new TRANSCEIVER_DOM_TEMPERATURE table in STATE_DB. This is mostly additional code being added. To maximize code reuse, the existing DomUpdateInfoTask has been refactored so that it now share a base with the newly introduced DomThermalInfoUpdateTask. Benchmarking shows that polling the temperature out of a xcvrand publishing it inr the databasse takes around 0.0016 for 800G optics. Considering a worst case of 0.002 and considering a device with 64 ports, this amounts to around 0.128s for every loop. With this loop running every 5s this means this represent a ~2.5% of busyness time.
Make the new DomThermalInfoUpdateTask optional. A program argument is now necessary for xcvrd which can be set via the `pmon_daemon_control.json` The new `--dom_temperature_poll_interval` option takes an update interval in seconds. Address misc comments
b85433b
to
3a0f79d
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Description
This change introduces a new
DomThermalUpdateInfoTask
thread. Its sole purpose is to poll the temperature information out of the transceivers and publish them in a new table calledTRANSCEIVER_DOM_TEMPERATURE
inSTATE_DB
.This change mostly introduces new code.
It however refactors slightly the
dom_mgr.py
to maximize code reuse between the existingDomUpdateInfoTask
and the newly introducedDomThermalInfoUpdateTask
by inheriting from a common base class.The new thread has a polling frequency of 5s to ensure that the data used.
Moving the entire
post_port_dom_sensor_info_to_db
to this new thread is not ideal because it takes much longer to run that simply gathering the temperature. For this reason a dedicatedpost_port_dom_temperature_info_to_db
was introduced which runs in a fraction of the time.Motivation and Context
Cooling becomes a challenge for more powerful optics.
These modules can heat up pretty quickly and the cooling algorithm needs to adjust quickly to these variations.
This can however only happen if the algorithm has recent data to use.
As of today the existing
DomUpdateInfoTask
responsible for updating all DOM, VDM, ... data into STATE_DB, but it is not going this frequently enough.On a system fully loaded with 64 ports, the data would remain stale in the database for 2+ minutes.
This is not frequent enough to ensure safe, stable and efficient cooling.
How Has This Been Tested?
Unit tests where extended to test the features of this new table and thread.
Testing on a 32x800G and 64x800G system populated with OSFP optics.
Additional Information (Optional)
Benchmarking shows that polling the temperature out of a xcvr and publishing it in the databasse takes around 0.0016 for a 800G optics.
Considering a worst case of 0.002 and considering a device with 64 ports, this amounts to around 0.128s for every loop.
With this loop running every 5s this means this represent a ~2.5% of busyness time which is acceptable.
It should be noted that the
DomUpdateInfoTask
is currently not efficient at all due to the rather hiddenselect
calls each with a 1s timeout, which forhandle_port_update_event
happens for every port.