brandonchuang

Description

Update the fan service configuration for minipack3n to support incremental PID logic. This change improves fan speed control based on real-time thermal data, as verified by the thermal team.

Motivation

The existing configuration uses a fixed 60% PWM setting, regardless of temperature changes in optics, CPU, inlet, or ASIC components. This update introduces dynamic fan speed control based on temperature inputs to enhance thermal responsiveness and system reliability.

Changes

  1. Updated platform_manager.json: changed SMB CPLD address from 0x3e to 0x33 to enable SP4 power control.
  2. Applied OPTIC_AGGREGATION_TYPE_INCREMENTAL_PID for optics temperature management.
  3. Applied SENSOR_PWM_CALC_TYPE_INCREMENTAL_PID for CPU_UNCORE_TEMP.
  4. Applied SENSOR_PWM_CALC_TYPE_FOUR_LINEAR_TABLE for SCM_INLET_U36_TEMP.
  5. Applied SENSOR_PWM_CALC_TYPE_INCREMENTAL_PID for asic_temp.
  6. Added shutdownCondition with associated shutdownCmd for SP4.
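
For readers unfamiliar with the term, the following is a minimal sketch of a textbook incremental (velocity-form) PID step, in which each cycle adds a PWM delta to the previous output rather than recomputing the output from scratch. It is not taken from the fan_service source; the class name, gains, setpoint, and PWM limits are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class IncrementalPid:
    """Textbook velocity-form PID (illustrative, not the fan_service code)."""

    kp: float
    ki: float
    kd: float
    setpoint: float         # target temperature in deg C (hypothetical)
    pwm: float = 60.0       # starting PWM in percent (hypothetical)
    pwm_min: float = 30.0   # hypothetical lower clamp
    pwm_max: float = 100.0  # hypothetical upper clamp
    prev_err: float = 0.0   # error from the previous cycle, e[k-1]
    prev_err2: float = 0.0  # error from two cycles ago, e[k-2]

    def update(self, temp: float) -> float:
        err = temp - self.setpoint
        # delta = Kp*(e[k]-e[k-1]) + Ki*e[k] + Kd*(e[k]-2e[k-1]+e[k-2])
        delta = (
            self.kp * (err - self.prev_err)
            + self.ki * err
            + self.kd * (err - 2 * self.prev_err + self.prev_err2)
        )
        # Add the delta to the previous output and clamp to the PWM range.
        self.pwm = min(self.pwm_max, max(self.pwm_min, self.pwm + delta))
        self.prev_err2, self.prev_err = self.prev_err, err
        return self.pwm


# Example: feed a short, made-up temperature trace through the controller.
pid = IncrementalPid(kp=2.0, ki=0.5, kd=0.0, setpoint=80.0)
for temp in (78.0, 82.0, 86.0, 84.0):
    print(temp, pid.update(temp))
```

Because the controller only stores the last output and the last two errors, PWM moves gradually with temperature instead of staying at a fixed value, which is the behavior change this PR is after.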

Test Plan

  1. Build and deploy the latest versions of fboss components including fan_service, sensor_service,
    and platform_manager to ensure the updated configuration is in effect.
  2. Run platform_manager and confirm that the SMB CPLD address has been updated from 0x3e to 0x33 for SP4 power control.
  3. Start sensor_service, qsfp_service, and fan_service to ensure proper initialization and
    inter-service communication with the new configuration.
  4. Confirm with the thermal team that the new incremental PID logic adjusts fan speed dynamically based on
    temperature changes (optics, CPU, inlet, ASIC).
  5. Sequentially unplug fans (fan-1 to fan-8) and verify that fan_service detects each failure and increases PWM to compensate.
  6. Trigger the shutdown condition and verify that the shutdownCmd for SP4 is executed correctly.

Test Log

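[mp3n_platform_manager_smbcpld_change_to_0x33.txt](https://github.com/user-attachments/files/22288809/mp3n_platform_manager_smbcpld_change_to_0x33.txt)
[mp3n_sensor_service.txt](https://github.com/user-attachments/files/22288811/mp3n_sensor_service.txt)
[mp3n_test_fan_service_fan1_to_fan8_fail_then_recover.txt](https://github.com/user-attachments/files/22288812/mp3n_test_fan_service_fan1_to_fan8_fail_then_recover.txt)
[mp3n_test_fan_service_sp4_shutdown.txt](https://github.com/user-attachments/files/22288813/mp3n_test_fan_service_sp4_shutdown.txt)
[mp3n_thermal_team_fan_service_35C.log](https://github.com/user-attachments/files/22288814/mp3n_thermal_team_fan_service_35C.log)
[mp3n_thermal_team_fan_service_35C.xlsx](https://github.com/user-attachments/files/22288815/mp3n_thermal_team_fan_service_35C.xlsx)
[mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed.log](https://github.com/user-attachments/files/22288816/mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed.log)
[mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed.xlsx](https://github.com/user-attachments/files/22288819/mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed.xlsx)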

mp3n_thermal_team_fan_service_35C:
image

mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed:
image

@meta-cla meta-cla bot added the CLA Signed label Sep 12, 2025

somasun commented Sep 12, 2025

The asic_temp reading is not yet stable. How reliant are you on it?


brandonchuang commented Sep 15, 2025

@somasun

Based on testing by our thermal team, the PWM values are currently determined almost entirely by asic_temp. However, the values calculated using OPTIC_TYPE_800_GENERIC are nearly identical, typically within 1–2%.

Therefore, if asic_temp is found to be unstable or inaccurate, and the control logic falls back to OPTIC_TYPE_800_GENERIC, the resulting fan behavior would remain very close to the intended outcome.


somasun commented Sep 16, 2025

Thanks @brandonchuang. Can you please include the test results of fan_service behavior in the following two cases:

  • when asic_temp is unavailable.
  • when the asic_temp is found to be unreliable/inaccurate.

brandonchuang commented

@somasun
I have reviewed the previous test results and did not observe any cases where asic_temp was unavailable or unreadable.
Additionally, there were no signs of unstable or inaccurate temperature readings, such as sudden spikes or drops.

Would you like me to simulate these scenarios, or is there a specific way to reproduce them that you'd recommend?


somasun commented Sep 26, 2025

@brandonchuang

We are finding asic_temp readings to be unreliable. About half of the units are failing to read the ASIC temperature. Please test with asic_temp unavailable or unreadable. On some units, mget_temp fails to read the ASIC temperature every time; on others, the failures may be intermittent. Please test all of these cases.

somasun left a comment

Back to author, to address testing comment.


brandonchuang commented Oct 1, 2025

@somasun

mget_temp failing to read asic temperature all the time in some units:

Please refer to the log file mp3n_test_fan_service_sp4_always_fail_to_read.txt.

In this scenario, asic_temp fails on every read. With pwmBoostOnNumDeadSensor set to 0, fan_service falls back to the other available sensors as the basis for its thermal policy decisions.
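
As a rough illustration of this fallback, the sketch below assumes that the zone PWM is the maximum of the per-sensor PWM requests and that a pwmBoostOnNumDeadSensor value of 0 disables the dead-sensor boost. Both points are assumptions made for the example, not statements about the fan_service implementation; the function name, the boost value, and the PWM numbers are hypothetical.

```python
from typing import Dict, Optional


def zone_pwm(
    sensor_pwm: Dict[str, Optional[float]],
    pwm_boost_on_num_dead_sensor: int,
    pwm_boost_value: float,
) -> float:
    """Pick a zone PWM from per-sensor requests; None marks a failed read."""
    alive = [pwm for pwm in sensor_pwm.values() if pwm is not None]
    dead = len(sensor_pwm) - len(alive)
    # Assumption: with no readable sensor at all, use the boost value.
    pwm = max(alive) if alive else pwm_boost_value
    # Assumption: a threshold of 0 means the dead-sensor boost is disabled.
    if pwm_boost_on_num_dead_sensor > 0 and dead >= pwm_boost_on_num_dead_sensor:
        pwm = max(pwm, pwm_boost_value)
    return pwm


# Made-up PWM requests; asic_temp is unreadable in this example.
requests = {
    "asic_temp": None,
    "optics": 62.0,
    "CPU_UNCORE_TEMP": 45.0,
    "SCM_INLET_U36_TEMP": 40.0,
}
print(zone_pwm(requests, pwm_boost_on_num_dead_sensor=0, pwm_boost_value=100.0))
print(zone_pwm(requests, pwm_boost_on_num_dead_sensor=1, pwm_boost_value=100.0))
```

Under these assumptions, a threshold of 0 lets the optics request drive the fans when asic_temp is missing, while a threshold of 1 would push the fans to the boost value as soon as a single sensor fails.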

Intermittent failures:

Please refer to mp3n_test_sensor_service_sp4_temp_ok_to_fail.txt and mp3n_test_fan_service_sp4_read_failure_and_recovery.txt.

In these cases, asic_temp is read successfully at first and then starts failing. fan_service continues to use the last successfully read temperature value, which it obtains from sensor_service via getSensorValueThroughThrift().

You can observe this behavior in the logs, where the reported temperature stays at 56°C for an extended period (keyword: asic_temp: Sensor read value is 56). Once mget_temp can read the temperature again, fan_service receives the updated value.

Note that during the failure period, fan_service has no indication that the readings are stale or that reads have failed. This is the current behavior of sensor_service.
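
The last-known-value behavior described above can be modeled with the short sketch below. It is an illustration only and does not mirror the sensor_service code; the CachedSensor class and its read callback are hypothetical. The age() helper is also hypothetical and is included only to show what a staleness signal could look like, since no such signal reaches fan_service today.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedSensor:
    """Keeps returning the last good reading while reads fail (illustrative)."""

    read: Callable[[], Optional[float]]  # returns None when the read fails
    value: Optional[float] = None        # last successfully read value
    last_ok: Optional[float] = None      # timestamp of the last good read

    def poll(self, now: Optional[float] = None) -> Optional[float]:
        now = time.monotonic() if now is None else now
        reading = self.read()
        if reading is not None:
            self.value, self.last_ok = reading, now
        # On failure the cached value is returned unchanged.
        return self.value

    def age(self, now: Optional[float] = None) -> Optional[float]:
        """Seconds since the last good read (hypothetical staleness signal)."""
        now = time.monotonic() if now is None else now
        return None if self.last_ok is None else now - self.last_ok


# Made-up trace: one good read, two failures, then recovery.
readings = iter([56.0, None, None, 58.0])
sensor = CachedSensor(read=lambda: next(readings))
for t in range(4):
    print(sensor.poll(now=float(t)), sensor.age(now=float(t)))
```

A consumer that only sees the value returned by poll() cannot tell the second and third readings apart from fresh ones, which matches the flat 56°C stretch in the logs.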

brandonchuang commented

@somasun

Could you confirm whether you ran mst start before running mget_temp?
It is required to initialize the drivers that mget_temp depends on.
