Skip to content

Conversation

@sharnoff
Copy link
Member

There's two metrics added here:

  1. Number of reconcile workers
    This mirrors the existing controller_runtime_max_concurrent_reconciles metric we have from neonvm-controller, which allows us to determine the fraction of total worker-time that we're using (a measure of saturation).

  2. Total duration with items in the queue
    This is roughly analogous to the Linux kernel's CPU PSI metric — other metrics of saturation (using total time spent reconciling or total time in the queue) are useful, but they tend to be easy to misinterpret when the amount of saturation is very skewed across the duration between metric samples. So the idea here is to help get a more accurate picture.

Resolves neondatabase/cloud#27613.

@sharnoff sharnoff force-pushed the sharnoff/plugin-reconcile-metrics branch from e4d8f3f to 02c4f80 Compare April 24, 2025 13:23
@sharnoff sharnoff changed the base branch from main to sharnoff/plugin-init-metrics April 24, 2025 13:24
Copy link
Contributor

@olegbbtr olegbbtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some naming suggestions, otherwise LGTM.

@sharnoff sharnoff assigned sharnoff and unassigned olegbbtr May 13, 2025
@sharnoff sharnoff force-pushed the sharnoff/plugin-init-metrics branch from e8be52f to c561790 Compare May 13, 2025 18:44
Base automatically changed from sharnoff/plugin-init-metrics to main May 13, 2025 19:08
sharnoff added 2 commits May 13, 2025 20:57
Implemented with a new WaitingTimer abstraction, which we use to track
the aggregate amount of time where the queue is non-empty.

The change to .golangci.yml is in order to exempt prometheus.Opts from
exhaustruct requirements.
@sharnoff sharnoff force-pushed the sharnoff/plugin-reconcile-metrics branch from 02c4f80 to d55af20 Compare May 13, 2025 19:58
@sharnoff
Copy link
Member Author

Realized this doesn't handle objects being retried with a delay correctly (if still waiting, they'd be incorrectly counted towards the queue being non-empty when technically they're not ready to be used yet).

Need to fix that before merging.

Copy link
Contributor

@olegbbtr olegbbtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment from Em

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants