@stktung stktung commented Oct 6, 2025

No description provided.

@stktung stktung requested a review from alexeyzimarev October 6, 2025 06:29
@stktung stktung requested review from a team as code owners October 6, 2025 06:29
@cloudflare-workers-and-pages

Deploying eventstore with Cloudflare Pages

Latest commit: a61c852
Status: ✅  Deploy successful!
Preview URL: https://0692d6fa.eventstore.pages.dev
Branch Preview URL: https://monitoring-best-practices.eventstore.pages.dev

@alexeyzimarev
Member

All screenshots would look better if:

  • The menu on the left is left out or collapsed. It always points to Dashboards.
  • Labels are put at the bottom, leaving more width for the graph itself
  • If the graph shows nothing (like parked messages), maybe do not include a screenshot? Or simulate parked messages and show what the graph looks like.
  • Can we also add light theme images?


## Background

When monitoring the health of a KurrentDB cluster, one should investigate and alert on multiple factors. Here we discuss them in detail

Nit: do we not need to end lines with full-stops in this system?

I would also reverse the order to "alert and investigate"


### Garbage Collection Pauses

Garbage collection monitoring is largely concerned with gen2 memory, where longer-lived objects are allocated. The length of **application pauses for compacting garbage collection** of this generation should be monitored using the Kurrent Grafana Dashboard. Steadily increasing durations may eventually cause a leader election as the database will be unresponsive to heartbeats during compacting garbage collections. Monitor this metric to be below the configured Heartbeat Timeout value (default is 10 seconds, so for most customers, 8 seconds should be appropriate)

I would suggest adding a link to this, since I don't believe the notion of "gen2 memory" is anything but a .NET implementation detail: https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/fundamentals.

I might also add some commentary on leader election: I imagine it causes a short pause in the ability to perform steady writes?


### CPU Utilization

To avoid thrashing, monitor **sustained CPU utilization remains below 80%**. This can be done at the operating system level, or on the Kurrent Grafana Dashboard

monitor -> ensure. Or reword to "monitor sustained CPU utilization and ensure it remains below 80%".
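
To illustrate what "sustained" could mean in the CPU guidance above, here is a minimal Python sketch. It assumes CPU utilization is sampled at a fixed interval and treats a run of consecutive samples above 80% as sustained. The function name `cpu_sustained_above` and the `min_consecutive` parameter are hypothetical and not part of KurrentDB or its Grafana dashboard.

```python
from typing import Sequence


def cpu_sustained_above(
    samples: Sequence[float],
    threshold: float = 80.0,   # percent, from the guidance in the doc
    min_consecutive: int = 5,  # assumption: how many samples count as "sustained"
) -> bool:
    """Return True if `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for sample in samples:
        # Extend the current run of high samples, or reset it on a low one.
        run = run + 1 if sample > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

A short spike, such as `[85, 85, 10, 85, 85]`, does not trip the check, while five consecutive samples above 80% do.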


### Disk Utilization

Kurrent recommends that organizations configure separate disk locations for logs, data, and indexes to avoid one impacting the other. Monitoring of these spaces should be at the operating system level. Ensure that **log and data disk utilizations are under 90%**. **Index disk utilization should be under 40%**, as additional disk space is required when performing index merges

I love the breakout of these recommendations. Do you think they make sense as a bulleted list?
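
As a rough illustration of the disk thresholds quoted above, the following Python sketch checks utilization at the operating system level using the standard library. The `LIMITS` mapping and the function names are hypothetical. The 90% and 40% values come from the doc text, and the mount paths you pass in depend on your own deployment.

```python
import shutil

# Maximum utilization (percent) per disk role, taken from the guidance above.
LIMITS = {"logs": 90.0, "data": 90.0, "index": 40.0}


def percent_used(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def over_limit(role: str, used_percent: float) -> bool:
    """Return True if utilization is not under the limit for this disk role."""
    return used_percent >= LIMITS[role]
```

For example, `over_limit("index", percent_used("/path/to/index"))` would flag an index disk at 45%, whereas the same utilization on a data disk would pass.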


If your **Projection Progress is decreasing, contact Kurrent Support** for analysis and recommendations to mitigate

NOTE: On large databases, this metric could show as 100% but still in fact be far behind due to the number of significant digits when dividing large numbers

Meta comment: to me that suggests we should actually be showing the number of events that we are behind
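
The precision issue in the NOTE can be shown with a small Python sketch. It assumes progress is computed as processed events divided by total events and displayed with one decimal place. Both the formula and the display precision are assumptions made for illustration, not the dashboard's confirmed behavior.

```python
def progress_percent(processed: int, total: int, decimals: int = 1) -> float:
    """Progress as a percentage, rounded to the displayed number of decimals."""
    return round(100.0 * processed / total, decimals)


def events_behind(processed: int, total: int) -> int:
    """Absolute lag, which stays meaningful no matter how large the database is."""
    return total - processed


total = 10_000_000_000       # hypothetical large database
processed = 9_999_000_000    # one million events behind

print(progress_percent(processed, total))  # 100.0
print(events_behind(processed, total))     # 1000000
```

With ten billion events, a projection that is one million events behind is at 99.99%, which rounds to 100.0% at one decimal place. The absolute count still shows the lag.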

@mackrorysd mackrorysd left a comment

I have some comments, but generally this seems like a good addition, so go ahead and push when you think it's ready. We can always refine later too.

@stktung stktung merged commit 6a4e62f into master Oct 8, 2025
3 checks passed
@stktung stktung deleted the monitoring-best-practices branch October 8, 2025 04:14

@github-actions github-actions bot left a comment

@stktung 👉 Created pull request targeting release/v24.10: #5297


@github-actions github-actions bot left a comment

@stktung 👉 Created pull request targeting release/v25.0: #5298

@stktung
Contributor Author

stktung commented Oct 8, 2025

Will publish first and create another PR next week to address these issues.
