Added monitoring best practice docs #5295
Deploying eventstore with
| Latest commit: | a61c852 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://0692d6fa.eventstore.pages.dev |
| Branch Preview URL: | https://monitoring-best-practices.eventstore.pages.dev |
All screenshots would look better if
## Background

When monitoring the health of a KurrentDB cluster, one should investigate and alert on multiple factors. Here we discuss them in detail
Nit: do we not need to end lines with full-stops in this system?
I would also reverse the order to "alert and investigate"
### Garbage Collection Pauses

Garbage collection monitoring is largely concerned with gen2 memory, where longer-lived objects are allocated. The length of **application pauses for compacting garbage collection** of this generation should be monitored using the Kurrent Grafana Dashboard. Steadily increasing durations may eventually cause a leader election as the database will be unresponsive to heartbeats during compacting garbage collections. Monitor this metric to be below the configured Heartbeat Timeout value (default is 10 seconds, so for most customers, 8 seconds should be appropriate)
I would suggest adding a link to this, since I don't believe the notion of "gen2 memory" is anything but a .NET implementation detail: https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/fundamentals.
I might also add some commentary on leader election: I imagine it causes a short pause in the ability to perform steady writes?
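To make the 8-second suggestion concrete, here is a minimal sketch of how the alert threshold could be derived from the heartbeat timeout. The function names and the 0.8 safety margin are illustrative assumptions chosen to reproduce the figures quoted above; they are not part of KurrentDB or the Grafana dashboard.

```python
from typing import List, Sequence


def gc_pause_alert_threshold(heartbeat_timeout_s: float = 10.0,
                             margin: float = 0.8) -> float:
    """Return the pause length (seconds) at which to alert.

    The 10 s default and the resulting 8 s threshold come from the proposed
    docs; the 0.8 margin itself is an assumption made for this sketch.
    """
    return heartbeat_timeout_s * margin


def pauses_at_or_over(pauses_s: Sequence[float], threshold_s: float) -> List[float]:
    """Return the observed pause durations that reach the alert threshold."""
    return [p for p in pauses_s if p >= threshold_s]


if __name__ == "__main__":
    threshold = gc_pause_alert_threshold()
    # Hypothetical pause samples in seconds, for illustration only.
    print(threshold, pauses_at_or_over([0.4, 2.1, 8.5], threshold))
```

A cluster with a non-default heartbeat timeout would pass its own value instead of relying on the default.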
### CPU Utilization

To avoid thrashing, monitor **sustained CPU utilization remains below 80%**. This can be done at the operating system level, or on the Kurrent Grafana Dashboard
monitor -> ensure. Or reword to "monitor sustained CPU utilization and ensure it remains below 80%"
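One way to read "sustained" is that a single spike should not alert, but a run of consecutive high samples should. The sketch below assumes that interpretation; the function name and the window length are hypothetical, and only the 80% limit comes from the proposed text.

```python
from typing import Iterable


def sustained_above(samples_pct: Iterable[float],
                    limit_pct: float = 80.0,
                    window: int = 5) -> bool:
    """Return True if `window` consecutive samples exceed `limit_pct`.

    `window` is an assumed definition of "sustained"; pick it to match the
    sampling interval of whatever collects the CPU metric.
    """
    run = 0
    for sample in samples_pct:
        # Extend the current run on a high sample, otherwise reset it.
        run = run + 1 if sample > limit_pct else 0
        if run >= window:
            return True
    return False
```

With one sample per minute, `window=5` would mean five minutes above the limit before alerting.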
### Disk Utilization

Kurrent recommends that organizations configure separate disk locations for logs, data, and indexes to avoid one impacting the other. Monitoring of these spaces should be at the operating system level. Ensure that **log and data disk utilizations are under 90%**. **Index disk utilization should be under 40%**, as additional disk space is required when performing index merges
I love the breakout of these recommendations. Do you think they make sense as a bulleted list?
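As a rough illustration of an operating-system-level check, the following sketch compares disk utilization against the limits quoted above using Python's standard `shutil.disk_usage`. The `LIMITS_PCT` mapping and the function names are assumptions made for this example, and the paths to check depend entirely on how the node is configured.

```python
import shutil

# Limits quoted in the proposed docs: logs and data under 90%, indexes under 40%.
LIMITS_PCT = {"logs": 90.0, "data": 90.0, "index": 40.0}


def utilization_pct(used_bytes: int, total_bytes: int) -> float:
    """Return used space as a percentage of total space."""
    return 100.0 * used_bytes / total_bytes


def path_utilization_pct(path: str) -> float:
    """Return the utilization of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return utilization_pct(usage.used, usage.total)


def over_limit(kind: str, pct: float) -> bool:
    """Return True if `pct` is not under the limit for this kind of disk."""
    return pct >= LIMITS_PCT[kind]
```

For example, `over_limit("index", path_utilization_pct(index_dir))` would flag an index volume at 45% even though the same figure is fine for a data volume, where `index_dir` is whatever index location the node actually uses.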
If your **Projection Progress is decreasing, contact Kurrent Support** for analysis and recommendations to mitigate

NOTE: On large databases, this metric could show as 100% but still in fact be far behind due to the number of significant digits when dividing large numbers
Meta comment: to me that suggests we should actually be showing the number of events behind that we are
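The rounding effect in the NOTE, and why an absolute "events behind" figure is more informative, can be shown with a small worked example. The event counts and the one-decimal display format are made up for illustration; how the dashboard actually computes and formats the metric is not stated in this thread.

```python
def progress_display(processed: int, total: int, decimals: int = 1) -> str:
    """Format progress as a percentage with a fixed number of decimals."""
    return f"{100.0 * processed / total:.{decimals}f}"


def events_behind(processed: int, total: int) -> int:
    """Return the absolute number of events still to be processed."""
    return total - processed


if __name__ == "__main__":
    # Hypothetical figures: a two-billion-event log with a projection
    # that is ten thousand events behind.
    total, processed = 2_000_000_000, 1_999_990_000
    print(progress_display(processed, total))  # shows 100.0
    print(events_behind(processed, total))     # shows 10000
```

The true ratio here is 99.9995%, which rounds up to 100.0 at one decimal place, while the absolute lag remains visible.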
I have some comments, but generally this seems like a good addition, so go ahead and push when you think it's ready. We can always refine later too.
Will publish first and create another PR next week to address these issues.
No description provided.