@sparrc sparrc commented Sep 12, 2025

This commit addresses multiple issues in the DockerStatsEngine:

Race Condition Fixes:

  • Add context cancellation checks in taskContainerMetricsUnsafe to prevent metrics collection on containers that are in the middle of being cleaned up.
  • Implement removeContainerFromAllTasksUnsafe method to handle orphaned container cleanup. This prevents container leaks when the Docker Task Engine has already cleaned up the container but the stats engine still has it tracked.
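The two fixes above can be illustrated with a minimal, self-contained sketch. It is not the agent's actual code: the `statsEngine` and `statsContainer` types, the `addContainer` and `collectableContainersUnsafe` helpers, and the signature and body given here for `removeContainerFromAllTasksUnsafe` are all assumptions made for illustration. Only the idea comes from the description: skip containers whose context is already cancelled, and purge an orphaned container ID from every task that still tracks it.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
)

// statsContainer is a hypothetical stand-in for a per-container stats tracker.
type statsContainer struct {
	ctx    context.Context
	cancel context.CancelFunc
}

// statsEngine is a simplified, assumed model: task ARN -> container ID -> tracker.
type statsEngine struct {
	lock  sync.RWMutex
	tasks map[string]map[string]*statsContainer
}

func newStatsEngine() *statsEngine {
	return &statsEngine{tasks: make(map[string]map[string]*statsContainer)}
}

// addContainer starts tracking a container under a task (hypothetical helper).
func (e *statsEngine) addContainer(taskARN, containerID string) *statsContainer {
	e.lock.Lock()
	defer e.lock.Unlock()
	ctx, cancel := context.WithCancel(context.Background())
	c := &statsContainer{ctx: ctx, cancel: cancel}
	if e.tasks[taskARN] == nil {
		e.tasks[taskARN] = make(map[string]*statsContainer)
	}
	e.tasks[taskARN][containerID] = c
	return c
}

// collectableContainersUnsafe returns the sorted IDs of a task's containers
// whose context has not been cancelled. The caller must hold e.lock.
func (e *statsEngine) collectableContainersUnsafe(taskARN string) []string {
	var ids []string
	for id, c := range e.tasks[taskARN] {
		select {
		case <-c.ctx.Done():
			// The container is being cleaned up: skip metrics collection.
			continue
		default:
		}
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// removeContainerFromAllTasksUnsafe stops and deletes containerID from every
// task that tracks it, dropping tasks left empty. It reports whether anything
// was removed. The caller must hold e.lock for writing.
func (e *statsEngine) removeContainerFromAllTasksUnsafe(containerID string) bool {
	removed := false
	for taskARN, containers := range e.tasks {
		if c, ok := containers[containerID]; ok {
			c.cancel()
			delete(containers, containerID)
			removed = true
		}
		if len(containers) == 0 {
			delete(e.tasks, taskARN)
		}
	}
	return removed
}

func main() {
	e := newStatsEngine()
	e.addContainer("task-1", "c1")
	stale := e.addContainer("task-1", "c2")
	stale.cancel() // simulate a container in the middle of cleanup

	e.lock.Lock()
	defer e.lock.Unlock()
	fmt.Println(e.collectableContainersUnsafe("task-1"))
	fmt.Println(e.removeContainerFromAllTasksUnsafe("c2"))
}
```

In this sketch the `Unsafe` suffix follows the naming in the description and is taken to mean that the caller already holds the engine lock; that reading is an assumption about the convention, not something the description states.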

This fixes the condition where the ECS agent gets stuck continuously logging messages like the following for a particular container:

ecs_agent_logs/ecs-agent.log:1496:level=error time=2025-09-12T11:18:54Z msg="Error collecting cloudwatch metrics for container" container="111222333444555" error="need at least 1 non-NaN data points in queue to calculate CW stats set"
ecs_agent_logs/ecs-agent.log:1500:level=error time=2025-09-12T11:19:14Z msg="Error collecting cloudwatch metrics for container" container="111222333444555" error="need at least 1 non-NaN data points in queue to calculate CW stats set"
ecs_agent_logs/ecs-agent.log:1507:level=error time=2025-09-12T11:19:34Z msg="Error collecting cloudwatch metrics for container" container="111222333444555" error="need at least 1 non-NaN data points in queue to calculate CW stats set"

Unrelated, but this also renames engine_linux.go to engine_unix.go so that the package compiles more easily on macOS.
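For context on why the rename matters: the Go toolchain treats a `_linux.go` filename suffix as an implicit Linux-only build constraint, whereas `_unix.go` carries no implicit constraint, so such a file builds on every platform unless it declares one explicitly. The sketch below shows one way a file could opt in to Unix-like targets (Linux and macOS among them) with the `unix` build tag, available since Go 1.19. Whether engine_unix.go in this change carries such a line is not stated here, and `platformNote` is a hypothetical function used only for illustration.

```go
//go:build unix

package main

import (
	"fmt"
	"runtime"
)

// platformNote reports the OS this file was compiled for. With the constraint
// above, the file is only built when GOOS is a Unix-like system.
func platformNote() string {
	return "built for a Unix-like target: " + runtime.GOOS
}

func main() {
	fmt.Println(platformNote())
}
```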

Testing

functional testing

New tests cover the changes: yes

Description for the changelog

Bugfix: fix "Error collecting cloudwatch metrics for container" errors

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sparrc sparrc requested a review from a team as a code owner September 12, 2025 19:53