[Bug]: Node memory should not include terminated containers #1091

@bboreham

What happened?

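By default, Prometheus keeps returning the last value of a cAdvisor series for 5 minutes after the series disappears. cAdvisor exports explicit timestamps, so no staleness marker is written when a container goes away.
This leads to artefacts when a pod restarts, because the series for the terminated container and the series for its replacement overlap during that window. To illustrate: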

Image

So this query, for example, sums both the stale and the current series and double-counts:

'sum(node_namespace_pod_container:container_memory_working_set_bytes{%(clusterLabel)s="$cluster", node=~"$node", container!=""}) by (pod)' % $._config,

This can be fixed by enabling the track_timestamps_staleness scrape option, added in Prometheus v2.48, but the queries could also be amended. The example above could change to:

sum(max by (cluster, namespace, pod, container)(node_namespace_pod_container:container_memory_working_set_bytes{%(clusterLabel)s="$cluster", node=~"$node", container!=""})) by (pod)
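To make the difference between the two queries concrete, here is a rough Python sketch of the aggregation at a single evaluation instant. It is not how Prometheus evaluates PromQL; the `aggregate` helper, the `id` label values, and all sample values are hypothetical and chosen only for illustration. It assumes the terminated and the replacement container share the same cluster, namespace, pod, and container labels and differ only in `id`.

```python
from collections import defaultdict

# Hypothetical samples visible shortly after the "app" container restarted.
# The stale series (id="old") is still returned alongside the new one.
samples = [
    ({"cluster": "c1", "namespace": "ns", "pod": "web-0", "container": "app", "id": "old"}, 400),
    ({"cluster": "c1", "namespace": "ns", "pod": "web-0", "container": "app", "id": "new"}, 100),
    ({"cluster": "c1", "namespace": "ns", "pod": "web-0", "container": "sidecar", "id": "s1"}, 50),
]


def aggregate(samples, by, op):
    """Group samples by the given label names and reduce each group with op."""
    groups = defaultdict(list)
    for labels, value in samples:
        key = tuple(labels[name] for name in by)
        groups[key].append(value)
    return [(dict(zip(by, key)), op(values)) for key, values in groups.items()]


# Original query shape: sum(...) by (pod)
naive = aggregate(samples, ["pod"], sum)

# Amended query shape: sum(max by (cluster, namespace, pod, container)(...)) by (pod)
deduped = aggregate(samples, ["cluster", "namespace", "pod", "container"], max)
fixed = aggregate(deduped, ["pod"], sum)

print(naive)  # [({'pod': 'web-0'}, 550)]
print(fixed)  # [({'pod': 'web-0'}, 450)]
```

Note that `max` keeps whichever of the overlapping values is larger, which may be the stale one, so the amended query avoids the double-count but can still show the terminated container's value until it ages out.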

What parts of the codebase are affected?

Dashboards

Metadata

Labels

bug, keepalive
