PromQL Alerts: Google GKE #1182
Conversation
Similar to the above file. No screenshots
No screenshot, but very similar to the file directly above
All files that don't have a good comment were very difficult to produce good data for, so they deserve extra scrutiny. That said, the conversions for them were relatively straightforward, so I'm confident in them.
},
"query": "fetch k8s_container\n| \n{\n metric 'kubernetes.io/container/cpu/core_usage_time'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | align rate(5m)\n | every 1m\n | group_by [],\n [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]\n;\n metric 'kubernetes.io/container/cpu/limit_cores'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | group_by 5m, [value_limit_cores_mean: mean(value.limit_cores)]\n | every 1m\n | group_by [],\n [value_limit_cores_mean_aggregate: aggregate(value_limit_cores_mean)]\n}\n| join\n| div\n| map add[p: 'CPU Limit Utilization']\n| scale('%')\n| condition val() > 90'%'"
"evaluationInterval": "30s",
"query": "(\n sum(rate(\n {\"kubernetes.io/container/cpu/core_usage_time\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n /\n sum(avg_over_time(\n {\"kubernetes.io/container/cpu/limit_cores\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n)\n* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Limit Utilization\", \"\", \"\")\n> 0.9"
need to prepend top_level_controller_* with metadata_system_
Does this actually work / do anything?
* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Limit Utilization\", \"\", \"\")
It's basically another way of doing label_replace. I'll swap this to just doing label_replace.
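To illustrate, here's a minimal sketch of that swap, wrapping the ratio directly in label_replace instead of multiplying by a labeled vector(1). The selectors are abbreviated; the real query keeps all of the workload filters shown above.

```promql
# Sketch only: label_replace() with an empty source label and empty regex
# matches every series, so it simply attaches the static "p" label.
label_replace(
    sum(rate({"kubernetes.io/container/cpu/core_usage_time",
              monitored_resource="k8s_container"}[5m]))
  /
    sum(avg_over_time({"kubernetes.io/container/cpu/limit_cores",
                       monitored_resource="k8s_container"}[5m])),
  "p", "CPU Limit Utilization", "", ""
) > 0.9
```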
},
"query": "fetch k8s_container\n| \n{\n metric 'kubernetes.io/container/cpu/core_usage_time'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | align rate(5m)\n | every 1m\n | group_by [],\n [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]\n;\n metric 'kubernetes.io/container/cpu/request_cores'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | group_by 5m, [value_request_cores_mean: mean(value.request_cores)]\n | every 1m\n | group_by [],\n [value_request_cores_mean_aggregate: aggregate(value_request_cores_mean)]\n}\n| join\n| div\n| map add[p: 'CPU Request Utilization']\n| scale('%')\n| condition val() > 90'%'"
"evaluationInterval": "30s",
"query": "(\n sum(rate(\n {\"kubernetes.io/container/cpu/core_usage_time\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n /\n sum(avg_over_time(\n {\"kubernetes.io/container/cpu/request_cores\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n)\n* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Request Utilization\", \"\", \"\")\n> 0.9"
ditto - need to add metadata_system_ to the top level controller labels
},
"query": "fetch k8s_container\n| \n{\n metric 'kubernetes.io/container/cpu/core_usage_time'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | align rate(5m)\n | every 1m\n | group_by [],\n [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]\n;\n metric 'kubernetes.io/container/cpu/limit_cores'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | group_by 5m, [value_limit_cores_mean: mean(value.limit_cores)]\n | every 1m\n | group_by [],\n [value_limit_cores_mean_aggregate: aggregate(value_limit_cores_mean)]\n}\n| join\n| div\n| map add[p: 'CPU Limit Utilization']\n| scale('%')\n| condition val() > 90'%'"
"evaluationInterval": "30s",
"query": "(\n sum(rate(\n {\"kubernetes.io/container/cpu/core_usage_time\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n /\n sum(avg_over_time(\n {\"kubernetes.io/container/cpu/limit_cores\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n)\n* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Limit Utilization\", \"\", \"\")\n> 0.9"
@yqlu do you know why these alerts are written as something you have to install once per top_level_controller_name by specifying the labels (top_level_controller_name, top_level_controller_type, project_id, cluster_name, location, namespace_name), instead of what I would expect, which is a single alert that does a sum by (top_level_controller_name) with optional regex include/exclude filters if you don't want to alert on everything?
Are they used somewhere in the UI where having a separate alert per label combination is useful? If not, we should take the opportunity to rewrite these to be more sane and aggregate-y
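For reference, one possible shape for that aggregated version. This is a sketch only: it assumes the same GMP-style quoted metric names used elsewhere in this PR, and that the workload labels carry the metadata_system_ prefix mentioned above.

```promql
# Hypothetical single alert covering all workloads, grouped by workload identity,
# with a regex include filter that can be tightened if you don't want everything.
(
    sum by (cluster_name, namespace_name, metadata_system_top_level_controller_name) (
      rate({"kubernetes.io/container/cpu/core_usage_time",
            monitored_resource="k8s_container",
            metadata_system_top_level_controller_name=~".+"}[5m]))
  /
    sum by (cluster_name, namespace_name, metadata_system_top_level_controller_name) (
      avg_over_time({"kubernetes.io/container/cpu/limit_cores",
                     monitored_resource="k8s_container",
                     metadata_system_top_level_controller_name=~".+"}[5m]))
) > 0.9
```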
It looks like the original author might not have known about doc variables in alerts: https://cloud.google.com/monitoring/alerts/doc-variables#doc-vars
Or maybe they didn't exist then
Hence one alert per combination of label values, which could be used to populate the troubleshooting link
But there is a better way to do this now
All these alerts are written this way so I'm going to pause review until we figure out if we want to do a more holistic rewrite, which I think we should.
OK - so we figured out where these are used in the Console. They're within GKE > Workloads, then click a single Workload, then Observability, then select Application, enable Istio or something, then click Recommended Alerts. When you do so, the side panel opens up and you can see the alert ${VARIABLES} subbed out with variables inferred from the workload you're on.
So with that context, I'mma say that we should NOT bother making these alerts universal/aggregated (as people are on a specific workload's page when they install it), but we should change them to use doc variables. Basically it's just the following:
Change the string ${CLUSTER_LOCATION}/${CLUSTER_NAME}/${NAMESPACE}/${WORKLOAD_NAME}/observability?project=${PROJECT_ID} within the content field to ${metric_or_resource.label.location}/${metric_or_resource.label.cluster_name}/${metric_or_resource.label.namespace_name}/${metric.label.metadata_system_top_level_controller_name}/observability?project=${metric_or_resource.label.project_id}.
We also should remove the reference to MQL within content, replacing [MQL documentation](https://cloud.google.com/monitoring/mql/alerts) with [PromQL documentation](https://cloud.google.com/monitoring/promql/create-promql-alerts-console).
Let's also change all = to =~, so if someone wants to make it universal, it's really easy to just add a | and have multiple entries (sketched below).
Let's also keep all userLabels that currently exist, because might as well.
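To show what the =~ change buys us, a rough sketch against one of the selectors in this PR (abbreviated, and assuming the metadata_system_ prefix fix from earlier):

```promql
# With regex matchers the installed alert behaves the same for a plain single value...
{"kubernetes.io/container/cpu/core_usage_time",
 monitored_resource="k8s_container",
 cluster_name=~"${CLUSTER_NAME}",
 namespace_name=~"${NAMESPACE}",
 metadata_system_top_level_controller_name=~"${WORKLOAD_NAME}"}
# ...and widening it later is just adding alternatives, e.g.
# metadata_system_top_level_controller_name=~"frontend|backend|worker"
```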
{
"displayName": "High CPU Limit Utilization (GKE Workload: ${WORKLOAD_NAME})",
"documentation": {
"content": "[View Workload Details](https://console.cloud.google.com/kubernetes/deployment/${CLUSTER_LOCATION}/${CLUSTER_NAME}/${NAMESPACE}/${WORKLOAD_NAME}/observability?project=${PROJECT_ID}).\n- Workloads with high CPU limit utilization likely have containers that are CPU throttled. To avoid application slowdown and unresponsiveness, you can either keep CPU usage below the CPU limit or increase the limit [View Documentation](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits)\n- If alerts tend to be false positive or noisy, consider visiting the alert policy page and changing the threshold, the rolling (alignment) window, and the retest (duration) window. [Alerting policies documentation](https://cloud.google.com/monitoring/alerts/concepts-indepth), [MQL documentation](https://cloud.google.com/monitoring/mql/alerts)",
I think these manually-injected variables can and should be replaced by cloud alerting variables instead: https://cloud.google.com/monitoring/alerts/doc-variables#doc-vars
That way you can install one alert that dynamically populates the labels based on the values seen in the offending time series
@lyanco This should be good for another pass - let me know if I've gone overboard with the doc variables.
@@ -1,13 +1,13 @@
{
"displayName": "GKE Container - High CPU Limit Utilization (${CLUSTER_NAME} cluster)",
"displayName": "GKE Container - High CPU Limit Utilization (${metric_or_resource.label.cluster_name} cluster)",
I actually think we need to keep ${CLUSTER_NAME} in displayName, IIRC doc variables only work within the "documentation" field
OK - sorry for the confusion here, you might want to just revert this PR and make a smaller edit. The doc variables only work for fields within the "documentation" block. So we still need to keep the ${CLUSTER_NAME} etc variables in the queries and the displayName fields.
${CLUSTER_NAME} gets filled in by our UI with the name of the cluster, which is needed to actually run the alert query. ${metric_or_resource.label.cluster_name} only gets filled in if the alert fires, and only for documentation. So we need to keep ${CLUSTER_NAME} etc. everywhere that's not the documentation field.
"comparison": "COMPARISON_GT", | ||
"duration": "0s", | ||
"filter": "resource.type = \"k8s_container\" AND metric.type = \"kubernetes.io/container/cpu/limit_utilization\" AND resource.labels.cluster_name=\"${CLUSTER_NAME}\" AND resource.labels.location=\"${CLUSTER_LOCATION}\"", | ||
"filter": "resource.type = \"k8s_container\" AND metric.type = \"kubernetes.io/container/cpu/limit_utilization\" AND resource.labels.cluster_name=\"${metric_or_resource.label.cluster_name}\" AND resource.labels.location=\"${metric_or_resource.label.location}\"", |
Ditto - doc variables should only be used within "documentation" fields.
This reverts commit b66ebd2.
Good to know, that's what I was worried about. Should've been less presumptuous. Changes have been reverted, and then the still-relevant changes have been made.
This PR updates Google GKE alerts to use PromQL instead of the deprecated MQL.
One thing to note is that a lot of these alerts couldn't be created as-is in Google Cloud, because they had user labels relying on template variables (e.g. "workload_name": "${WORKLOAD_NAME}"). I deleted all user labels with those template variables. Other notes and screenshots are in comments on the specific files.