Conversation

mrsillydog (Contributor):
This PR updates Google GKE alerts to use PromQL instead of the deprecated MQL.

One thing to note: a lot of these alerts couldn't be created as-is in Google Cloud because their user labels relied on template variables (e.g. "workload_name": "${WORKLOAD_NAME}"). I deleted all user labels containing those template variables. Other notes/screenshots will be commented on specific files.

mrsillydog (Contributor Author):

Screenshot: [image]

mrsillydog (Contributor Author):

Similar to the file above. No screenshot.

mrsillydog (Contributor Author):

[image]

mrsillydog (Contributor Author):

[image]

mrsillydog (Contributor Author):

No screenshot, but very similar to the file directly above.

mrsillydog (Contributor Author):

[image]

mrsillydog (Contributor Author):

We need to do ignoring(memory_type) group_left() here because the limit metric does not have the memory_type label.
[image]
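A minimal sketch of that matching pattern, assuming the usage/limit pair in this file is kubernetes.io/container/memory/used_bytes and kubernetes.io/container/memory/limit_bytes (the usage metric carries memory_type; the limit metric does not):

sum by (container_name, memory_type) (
    {"kubernetes.io/container/memory/used_bytes", monitored_resource="k8s_container"})
/ ignoring(memory_type) group_left()
sum by (container_name) (
    {"kubernetes.io/container/memory/limit_bytes", monitored_resource="k8s_container"})

Without ignoring(memory_type) the two sides would never match, since their label sets differ; group_left() lets the multiple memory_type series on the left each divide by the same limit series on the right.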

mrsillydog (Contributor Author) left a comment:

All files that don't have a screenshot comment were very difficult to produce good data for, so they deserve extra scrutiny. That said, the conversions for them were relatively straightforward, so I'm confident in them.

},
"query": "fetch k8s_container\n| \n{\n metric 'kubernetes.io/container/cpu/core_usage_time'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | align rate(5m)\n | every 1m\n | group_by [],\n [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]\n;\n metric 'kubernetes.io/container/cpu/limit_cores'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | group_by 5m, [value_limit_cores_mean: mean(value.limit_cores)]\n | every 1m\n | group_by [],\n [value_limit_cores_mean_aggregate: aggregate(value_limit_cores_mean)]\n}\n| join\n| div\n| map add[p: 'CPU Limit Utilization']\n| scale('%')\n| condition val() > 90'%'"
"evaluationInterval": "30s",
"query": "(\n sum(rate(\n {\"kubernetes.io/container/cpu/core_usage_time\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n /\n sum(avg_over_time(\n {\"kubernetes.io/container/cpu/limit_cores\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n)\n* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Limit Utilization\", \"\", \"\")\n> 0.9"
Collaborator:

Need to prepend the top_level_controller_* labels with metadata_system_.
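For example, the first selector in the converted query would become (a sketch; only the two controller labels change):

{"kubernetes.io/container/cpu/core_usage_time",
 monitored_resource="k8s_container",
 metadata_system_top_level_controller_name="${WORKLOAD_NAME}",
 metadata_system_top_level_controller_type="${TOP_LEVEL_CONTROLLER_TYPE}",
 project_id="${PROJECT_ID}",
 cluster_name="${CLUSTER_NAME}",
 location="${CLUSTER_LOCATION}",
 namespace_name="${NAMESPACE}"}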

Collaborator:

Does this actually work / do anything?

* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Limit Utilization\", \"\", \"\")

mrsillydog (Contributor Author):

It's basically another way of doing label_replace. I'll swap this out for a plain label_replace.
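Something like the following, applied to the whole ratio (a sketch; <ratio_expr> stands for the existing usage/limit division):

label_replace(<ratio_expr>, "p", "CPU Limit Utilization", "", "") > 0.9

With an empty source label and empty regex, label_replace unconditionally attaches p="CPU Limit Utilization" to every series, which is exactly what the vector(1) trick was doing.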

},
"query": "fetch k8s_container\n| \n{\n metric 'kubernetes.io/container/cpu/core_usage_time'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | align rate(5m)\n | every 1m\n | group_by [],\n [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]\n;\n metric 'kubernetes.io/container/cpu/request_cores'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | group_by 5m, [value_request_cores_mean: mean(value.request_cores)]\n | every 1m\n | group_by [],\n [value_request_cores_mean_aggregate: aggregate(value_request_cores_mean)]\n}\n| join\n| div\n| map add[p: 'CPU Request Utilization']\n| scale('%')\n| condition val() > 90'%'"
"evaluationInterval": "30s",
"query": "(\n sum(rate(\n {\"kubernetes.io/container/cpu/core_usage_time\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n /\n sum(avg_over_time(\n {\"kubernetes.io/container/cpu/request_cores\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n)\n* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Request Utilization\", \"\", \"\")\n> 0.9"
Collaborator:

Ditto: need to add metadata_system_ to the top-level controller labels.

},
"query": "fetch k8s_container\n| \n{\n metric 'kubernetes.io/container/cpu/core_usage_time'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | align rate(5m)\n | every 1m\n | group_by [],\n [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]\n;\n metric 'kubernetes.io/container/cpu/limit_cores'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | group_by 5m, [value_limit_cores_mean: mean(value.limit_cores)]\n | every 1m\n | group_by [],\n [value_limit_cores_mean_aggregate: aggregate(value_limit_cores_mean)]\n}\n| join\n| div\n| map add[p: 'CPU Limit Utilization']\n| scale('%')\n| condition val() > 90'%'"
"evaluationInterval": "30s",
"query": "(\n sum(rate(\n {\"kubernetes.io/container/cpu/core_usage_time\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n /\n sum(avg_over_time(\n {\"kubernetes.io/container/cpu/limit_cores\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n)\n* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Limit Utilization\", \"\", \"\")\n> 0.9"
Collaborator:

@yqlu do you know why these alerts are written as something you have to install once per top_level_controller_name by specifying the labels (top_level_controller_name, top_level_controller_type, project_id, cluster_name, location, namespace_name), instead of what I would expect, which is one alert that does a sum by (top_level_controller_name) with optional regex include/exclude filters if you don't want to alert on everything?

Are they used somewhere in the UI where having a separate alert per label combination is useful? If not, we should take the opportunity to rewrite these to be more sane and aggregate-y.
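For the CPU-limit alert, the aggregated form would look something like this (a sketch; label names assume the metadata_system_ prefix noted above, and the =~".+" matchers are illustrative include filters):

sum by (metadata_system_top_level_controller_name) (rate(
    {"kubernetes.io/container/cpu/core_usage_time", monitored_resource="k8s_container",
     metadata_system_top_level_controller_name=~".+"}[5m]))
/
sum by (metadata_system_top_level_controller_name) (avg_over_time(
    {"kubernetes.io/container/cpu/limit_cores", monitored_resource="k8s_container",
     metadata_system_top_level_controller_name=~".+"}[5m]))
> 0.9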

Collaborator:

It looks like the original author might not have known about doc variables in alerts: https://cloud.google.com/monitoring/alerts/doc-variables#doc-vars

Or maybe they didn't exist then.

Hence one alert per combination of label values, which could be used to populate the troubleshooting link.

But there is a better way to do this now.

Collaborator:

All these alerts are written this way, so I'm going to pause review until we figure out whether we want to do a more holistic rewrite, which I think we should.

Collaborator:

OK, so we figured out where these are used in the Console: GKE > Workloads, then click a single Workload, then Observability, then select Application, enable Istio or something, then click Recommended Alerts. When you do, the side panel opens up and you can see the alert's ${VARIABLES} subbed out with values inferred from the workload you're on:
[image]

So with that context, I'mma say that we should NOT bother making these alerts universal/aggregated (as people are on a specific workload's page when they install them), but we should change them to use doc variables. Basically it's just the following:

Change the string ${CLUSTER_LOCATION}/${CLUSTER_NAME}/${NAMESPACE}/${WORKLOAD_NAME}/observability?project=${PROJECT_ID} within the content field to ${metric_or_resource.label.location}/${metric_or_resource.label.cluster_name}/${metric_or_resource.label.namespace_name}/${metric.label.metadata_system_top_level_controller_name}/observability?project=${metric_or_resource.label.project_id}

We should also remove the reference to MQL within content, replacing [MQL documentation](https://cloud.google.com/monitoring/mql/alerts) with [PromQL documentation](https://cloud.google.com/monitoring/promql/create-promql-alerts-console).

Let's also change all = to =~, so if someone wants to make it universal, it's really easy to just add a | and have multiple entries.
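For example (a sketch of one matcher; the cluster names are hypothetical):

cluster_name=~"${CLUSTER_NAME}"       # as installed: matches the single substituted value
cluster_name=~"cluster-a|cluster-b"   # universal: add a | to match multiple clusters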

Let's also keep all userLabels that currently exist, because we might as well.

{
"displayName": "High CPU Limit Utilization (GKE Workload: ${WORKLOAD_NAME})",
"documentation": {
"content": "[View Workload Details](https://console.cloud.google.com/kubernetes/deployment/${CLUSTER_LOCATION}/${CLUSTER_NAME}/${NAMESPACE}/${WORKLOAD_NAME}/observability?project=${PROJECT_ID}).\n- Workloads with high CPU limit utilization likely have containers that are CPU throttled. To avoid application slowdown and unresponsiveness, you can either keep CPU usage below the CPU limit or increase the limit [View Documentation](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits)\n- If alerts tend to be false positive or noisy, consider visiting the alert policy page and changing the threshold, the rolling (alignment) window, and the retest (duration) window. [Alerting policies documentation](https://cloud.google.com/monitoring/alerts/concepts-indepth), [MQL documentation](https://cloud.google.com/monitoring/mql/alerts)",
Collaborator:

I think these manually injected variables can and should be replaced by cloud alerting variables instead: https://cloud.google.com/monitoring/alerts/doc-variables#doc-vars

That way you can install one alert that dynamically populates the labels based on the values seen in the offending time series.

@mrsillydog mrsillydog requested a review from lyanco October 1, 2025 15:50
mrsillydog (Contributor Author):

@lyanco This should be good for another pass - let me know if I've gone overboard with the doc variables.

@@ -1,13 +1,13 @@
 {
-"displayName": "GKE Container - High CPU Limit Utilization (${CLUSTER_NAME} cluster)",
+"displayName": "GKE Container - High CPU Limit Utilization (${metric_or_resource.label.cluster_name} cluster)",
Collaborator:

I actually think we need to keep ${CLUSTER_NAME} in displayName; IIRC doc variables only work within the "documentation" field.

Collaborator:

OK, sorry for the confusion here; you might want to just revert this PR and make a smaller edit. The doc variables only work for fields within the "documentation" block, so we still need to keep the ${CLUSTER_NAME} etc. variables in the queries and the displayName fields.

${CLUSTER_NAME} gets filled in by our UI with the name of the cluster, which is needed to actually run the alert query. ${metric_or_resource.label.cluster_name} only gets filled in if the alert fires, and only within documentation. So we need to keep ${CLUSTER_NAME} etc. everywhere that's not the documentation field.
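In other words (a sketch using the fields from the diff above; the annotations are mine):

"displayName": "... (${CLUSTER_NAME} cluster)"                                       # install-time variable, substituted by the UI
"documentation": { "content": "... ${metric_or_resource.label.cluster_name} ..." }   # doc variable, filled in only when the alert fires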

"comparison": "COMPARISON_GT",
"duration": "0s",
"filter": "resource.type = \"k8s_container\" AND metric.type = \"kubernetes.io/container/cpu/limit_utilization\" AND resource.labels.cluster_name=\"${CLUSTER_NAME}\" AND resource.labels.location=\"${CLUSTER_LOCATION}\"",
"filter": "resource.type = \"k8s_container\" AND metric.type = \"kubernetes.io/container/cpu/limit_utilization\" AND resource.labels.cluster_name=\"${metric_or_resource.label.cluster_name}\" AND resource.labels.location=\"${metric_or_resource.label.location}\"",
Collaborator:

Ditto - doc variables should only be used within "documentation" fields.

mrsillydog (Contributor Author) commented Oct 1, 2025:

> The doc variables only work for fields within the "documentation" block. So we still need to keep the ${CLUSTER_NAME} etc variables in the queries and the displayName fields.

Good to know; that's what I was worried about. Should've been less presumptuous. The changes have been reverted, and then the still-relevant changes have been made.

@mrsillydog mrsillydog requested a review from lyanco October 1, 2025 20:03