fan_config: minipack3n: update fan service config with incremental PID logic #543
base: main
…D logic

Description
Update the fan service configuration for minipack3n to support incremental PID logic. This change improves fan speed control based on real-time thermal data, as verified by the thermal team.

Motivation
The existing configuration uses a fixed 60% PWM setting, regardless of temperature changes in optics, CPU, inlet, or ASIC components. This update introduces dynamic fan speed control based on temperature inputs to enhance thermal responsiveness and system reliability.

Changes
1. Updated `platform_manager.json`: changed SMB CPLD address from 0x3e to 0x33 to enable SP4 power control.
2. Applied `OPTIC_AGGREGATION_TYPE_INCREMENTAL_PID` for optics temperature management.
3. Applied `SENSOR_PWM_CALC_TYPE_INCREMENTAL_PID` for `CPU_UNCORE_TEMP`.
4. Applied `SENSOR_PWM_CALC_TYPE_FOUR_LINEAR_TABLE` for `SCM_INLET_U36_TEMP`.
5. Applied `SENSOR_PWM_CALC_TYPE_INCREMENTAL_PID` for `asic_temp`.
6. Added `shutdownCondition` with associated `shutdownCmd` for SP4.

Test Plan
1) Build and deploy the latest versions of fboss components including fan_service, sensor_service, and platform_manager to ensure the updated configuration is in effect.
2) Run platform_manager and confirm that the SMB CPLD address has been updated from 0x3e to 0x33 for SP4 power control.
3) Start sensor_service, qsfp_service, and fan_service to ensure proper initialization and inter-service communication with the new configuration.
4) Confirm with the thermal team that the new incremental PID logic adjusts fan speed dynamically based on temperature changes (optics, CPU, inlet, ASIC).
5) Sequentially unplug fans (fan-1 to fan-8) and verify that fan_service detects each failure and increases PWM to compensate.
6) Trigger the shutdown condition and verify that the shutdownCmd for SP4 is executed correctly.

Test Log
- [mp3n_platform_manager_smbcpld_change_to_0x33.txt](https://github.com/user-attachments/files/22288809/mp3n_platform_manager_smbcpld_change_to_0x33.txt)
- [mp3n_sensor_service.txt](https://github.com/user-attachments/files/22288811/mp3n_sensor_service.txt)
- [mp3n_test_fan_service_fan1_to_fan8_fail_then_recover.txt](https://github.com/user-attachments/files/22288812/mp3n_test_fan_service_fan1_to_fan8_fail_then_recover.txt)
- [mp3n_test_fan_service_sp4_shutdown.txt](https://github.com/user-attachments/files/22288813/mp3n_test_fan_service_sp4_shutdown.txt)
- [mp3n_thermal_team_fan_service_35C.log](https://github.com/user-attachments/files/22288814/mp3n_thermal_team_fan_service_35C.log)
- [mp3n_thermal_team_fan_service_35C.xlsx](https://github.com/user-attachments/files/22288815/mp3n_thermal_team_fan_service_35C.xlsx)
- [mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed.log](https://github.com/user-attachments/files/22288816/mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed.log)
- [mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed.xlsx](https://github.com/user-attachments/files/22288819/mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed.xlsx)
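For readers unfamiliar with the term, the "incremental PID" named in the Changes list can be illustrated with the textbook velocity-form update, where each step adjusts the previous PWM by a delta instead of recomputing an absolute value. This is a minimal sketch only: the class name, gains, setpoint, and clamp limits are hypothetical, and fan_service's actual calculation may differ.

```python
from dataclasses import dataclass


@dataclass
class IncrementalPid:
    """Velocity-form PID sketch; all names and defaults here are hypothetical."""

    kp: float
    ki: float
    kd: float
    setpoint: float          # target temperature in deg C (illustrative)
    pwm: float = 60.0        # starting PWM percent
    pwm_min: float = 25.0    # illustrative clamp limits
    pwm_max: float = 100.0
    prev_err: float = 0.0    # e[k-1]
    prev_err2: float = 0.0   # e[k-2]

    def step(self, temperature: float) -> float:
        # Positive error means the sensor is hotter than the target.
        err = temperature - self.setpoint
        delta = (
            self.kp * (err - self.prev_err)
            + self.ki * err
            + self.kd * (err - 2.0 * self.prev_err + self.prev_err2)
        )
        # Apply the increment to the previous output and clamp it.
        self.pwm = min(self.pwm_max, max(self.pwm_min, self.pwm + delta))
        self.prev_err2, self.prev_err = self.prev_err, err
        return self.pwm


pid = IncrementalPid(kp=2.0, ki=0.5, kd=0.0, setpoint=80.0)
for temp in (80.0, 84.0, 84.0):
    print(temp, pid.step(temp))
```

Because the output is built from increments, a reading that holds steady above the setpoint keeps nudging the PWM upward through the integral term, while a reading at the setpoint leaves the PWM where it is.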
The asic_temp readings are not yet stable. How reliant are you on them?
Based on testing from our thermal team, the PWM values are currently almost entirely determined by asic_temp. However, the values calculated using OPTIC_TYPE_800_GENERIC are nearly identical, typically within 1–2%. Therefore, if asic_temp is found to be unstable or inaccurate and the control logic falls back to OPTIC_TYPE_800_GENERIC, the resulting fan behavior would remain very close to the intended outcome.
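The fallback described in this reply can be pictured with a small sketch. It assumes, and this thread does not confirm it, that the final PWM is the highest value requested by any input that produced a result. The function name and the PWM numbers are hypothetical, chosen only to mirror the "within 1–2%" observation.

```python
from typing import Mapping, Optional


def aggregate_pwm(per_input_pwm: Mapping[str, Optional[float]], floor: float = 0.0) -> float:
    """Return the highest PWM requested by any input that produced a value.

    Assumption: max-of-inputs aggregation. None marks an unavailable input.
    """
    available = [v for v in per_input_pwm.values() if v is not None]
    return max(available, default=floor)


# Hypothetical per-input PWM requests, in percent.
requests = {
    "asic_temp": 62.0,
    "OPTIC_TYPE_800_GENERIC": 61.0,
    "CPU_UNCORE_TEMP": 40.0,
    "SCM_INLET_U36_TEMP": 45.0,
}
print(aggregate_pwm(requests))                         # asic_temp dominates
print(aggregate_pwm({**requests, "asic_temp": None}))  # optics value takes over
```

Under that assumption, losing asic_temp moves the result by only the gap between the two highest requests, which is the point the reply makes.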
Thanks @brandonchuang. Can you please include the test results of fan_service behavior in the following two cases:
@somasun Would you like me to simulate these scenarios, or is there a specific way to reproduce them that you'd recommend?
We are finding asic_temp readings to be unreliable. About half of the units are failing to read the ASIC temperature. Please test with asic_temp being unavailable or unreadable. You can assume that mget_temp fails to read the ASIC temperature all the time in some units, while in others the failures could be intermittent. Please test all these cases.
Back to author, to address testing comment.
mget_temp failing to read asic temperature all the time in some units: please refer to the log file mp3n_test_fan_service_sp4_always_fail_to_read.txt. In this scenario, when asic_temp consistently fails to read, and pwmBoostOnNumDeadSensor is set to 0, […]

Intermittent failures scenario: please refer to: […] In these cases, when asic_temp was successfully read initially but later fails, fan_service will continue to use the last successfully read temperature value from sensor_service, via getSensorValueThroughThrift(). You can observe this behavior in the logs, where the reported temperature remains at 56°C for an extended period (keyword: […]).

Note that during the failure period, fan_service has no awareness that temperature readings are stale or have failed. This is the current behavior of sensor_service.
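The intermittent-failure behavior described above, where a consumer keeps seeing the last good temperature, can be reproduced with a toy cache. This is a sketch, not sensor_service code: `LastGoodValueCache` and `scripted_reader` are hypothetical helpers, and `OSError` stands in for whatever error a failed read raises.

```python
from typing import Callable, Dict, Iterable, Optional, Union


class LastGoodValueCache:
    """Keeps the most recent successful reading per sensor (hypothetical helper)."""

    def __init__(self) -> None:
        self._values: Dict[str, float] = {}

    def poll(self, name: str, read: Callable[[], float]) -> None:
        try:
            self._values[name] = read()
        except OSError:
            # Failed read: the previous value is left in place, so a
            # consumer cannot tell it apart from a fresh reading.
            pass

    def get(self, name: str) -> Optional[float]:
        return self._values.get(name)


def scripted_reader(results: Iterable[Union[float, Exception]]) -> Callable[[], float]:
    """Build a fake sensor read that replays values or raises scripted errors."""
    it = iter(results)

    def read() -> float:
        item = next(it)
        if isinstance(item, Exception):
            raise item
        return item

    return read


cache = LastGoodValueCache()
read_asic = scripted_reader([56.0, OSError("read failed"), OSError("read failed")])
seen = []
for _ in range(3):
    cache.poll("asic_temp", read_asic)
    seen.append(cache.get("asic_temp"))
print(seen)  # [56.0, 56.0, 56.0]
```

The consumer sees a flat 56.0 across one good read and two failures, which matches the constant temperature reported in the logs.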
May I confirm if you executed […]
[Image: mp3n_thermal_team_fan_service_35C]
[Image: mp3n_thermal_team_fan_service_35C_fan3_one_rotor_failed]