Conversation

mrsillydog (Contributor):
This PR updates Google GKE alerts to use PromQL instead of the deprecated MQL.

One thing to note: a lot of these alerts couldn't be created as-is in Google Cloud because their user labels relied on template variables (e.g. "workload_name": "${WORKLOAD_NAME}"). I deleted all user labels containing those template variables. Other notes/screenshots will be commented on specific files.

mrsillydog (Contributor Author):

Screenshot: [image]

mrsillydog (Contributor Author):

Similar to the file above. No screenshot.

mrsillydog (Contributor Author):

[image]

mrsillydog (Contributor Author):

[image]

mrsillydog (Contributor Author):

No screenshot, but very similar to the file directly above.

mrsillydog (Contributor Author):

[image]

mrsillydog (Contributor Author):

We need to do ignoring(memory_type) group_left() here because the limit metric does not have the memory_type label.
[image]
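A minimal sketch of that matching pattern, assuming the usage/limit pair in this file is kubernetes.io/container/memory/used_bytes and kubernetes.io/container/memory/limit_bytes (the usage metric carries memory_type; the limit metric does not):

sum by (container_name, memory_type) (
    {"kubernetes.io/container/memory/used_bytes", monitored_resource="k8s_container"})
/ ignoring(memory_type) group_left()
sum by (container_name) (
    {"kubernetes.io/container/memory/limit_bytes", monitored_resource="k8s_container"})

Without ignoring(memory_type) the two sides would never match, since their label sets differ; group_left() lets the multiple memory_type series on the left each divide by the same limit series on the right.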

mrsillydog (Contributor Author) left a comment:

All files that don't have a screenshot comment were very difficult to produce good data for, so they deserve extra scrutiny. That said, the conversions for them were relatively straightforward, so I'm confident in them.

},
"query": "fetch k8s_container\n| \n{\n metric 'kubernetes.io/container/cpu/core_usage_time'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | align rate(5m)\n | every 1m\n | group_by [],\n [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]\n;\n metric 'kubernetes.io/container/cpu/limit_cores'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | group_by 5m, [value_limit_cores_mean: mean(value.limit_cores)]\n | every 1m\n | group_by [],\n [value_limit_cores_mean_aggregate: aggregate(value_limit_cores_mean)]\n}\n| join\n| div\n| map add[p: 'CPU Limit Utilization']\n| scale('%')\n| condition val() > 90'%'"
"evaluationInterval": "30s",
"query": "(\n sum(rate(\n {\"kubernetes.io/container/cpu/core_usage_time\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n /\n sum(avg_over_time(\n {\"kubernetes.io/container/cpu/limit_cores\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n)\n* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Limit Utilization\", \"\", \"\")\n> 0.9"
Collaborator:

Need to prepend the top_level_controller_* labels with metadata_system_.
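For example, the first selector in the converted query would become (a sketch; only the two controller labels change):

{"kubernetes.io/container/cpu/core_usage_time",
 monitored_resource="k8s_container",
 metadata_system_top_level_controller_name="${WORKLOAD_NAME}",
 metadata_system_top_level_controller_type="${TOP_LEVEL_CONTROLLER_TYPE}",
 project_id="${PROJECT_ID}",
 cluster_name="${CLUSTER_NAME}",
 location="${CLUSTER_LOCATION}",
 namespace_name="${NAMESPACE}"}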

Collaborator:

Does this actually work / do anything?

* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Limit Utilization\", \"\", \"\")

mrsillydog (Contributor Author):

It's basically another way of doing label_replace. I'll swap this out for a plain label_replace.
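Something like the following, applied to the whole ratio (a sketch; <ratio_expr> stands for the existing usage/limit division):

label_replace(<ratio_expr>, "p", "CPU Limit Utilization", "", "") > 0.9

With an empty source label and empty regex, label_replace unconditionally attaches p="CPU Limit Utilization" to every series, which is exactly what the vector(1) trick was doing.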

},
"query": "fetch k8s_container\n| \n{\n metric 'kubernetes.io/container/cpu/core_usage_time'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | align rate(5m)\n | every 1m\n | group_by [],\n [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]\n;\n metric 'kubernetes.io/container/cpu/request_cores'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | group_by 5m, [value_request_cores_mean: mean(value.request_cores)]\n | every 1m\n | group_by [],\n [value_request_cores_mean_aggregate: aggregate(value_request_cores_mean)]\n}\n| join\n| div\n| map add[p: 'CPU Request Utilization']\n| scale('%')\n| condition val() > 90'%'"
"evaluationInterval": "30s",
"query": "(\n sum(rate(\n {\"kubernetes.io/container/cpu/core_usage_time\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n /\n sum(avg_over_time(\n {\"kubernetes.io/container/cpu/request_cores\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n)\n* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Request Utilization\", \"\", \"\")\n> 0.9"
Collaborator:

Ditto: need to add metadata_system_ to the top-level controller labels.

},
"query": "fetch k8s_container\n| \n{\n metric 'kubernetes.io/container/cpu/core_usage_time'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | align rate(5m)\n | every 1m\n | group_by [],\n [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]\n;\n metric 'kubernetes.io/container/cpu/limit_cores'\n | filter\n resource.project_id == '${PROJECT_ID}'\n &&\n (metadata.system_labels.top_level_controller_name == '${WORKLOAD_NAME}'\n && metadata.system_labels.top_level_controller_type == '${TOP_LEVEL_CONTROLLER_TYPE}')\n &&\n (resource.cluster_name == '${CLUSTER_NAME}'\n && resource.location == '${CLUSTER_LOCATION}'\n && resource.namespace_name == '${NAMESPACE}')\n | group_by 5m, [value_limit_cores_mean: mean(value.limit_cores)]\n | every 1m\n | group_by [],\n [value_limit_cores_mean_aggregate: aggregate(value_limit_cores_mean)]\n}\n| join\n| div\n| map add[p: 'CPU Limit Utilization']\n| scale('%')\n| condition val() > 90'%'"
"evaluationInterval": "30s",
"query": "(\n sum(rate(\n {\"kubernetes.io/container/cpu/core_usage_time\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n /\n sum(avg_over_time(\n {\"kubernetes.io/container/cpu/limit_cores\",\n monitored_resource=\"k8s_container\",\n top_level_controller_name=\"${WORKLOAD_NAME}\",\n top_level_controller_type=\"${TOP_LEVEL_CONTROLLER_TYPE}\",\n project_id=\"${PROJECT_ID}\",\n cluster_name=\"${CLUSTER_NAME}\",\n location=\"${CLUSTER_LOCATION}\",\n namespace_name=\"${NAMESPACE}\"}[5m])\n )\n)\n* on() group_left(p) label_replace(vector(1), \"p\", \"CPU Limit Utilization\", \"\", \"\")\n> 0.9"
Collaborator:

@yqlu do you know why these alerts are written as something you have to install once per top_level_controller_name by specifying the labels (top_level_controller_name, top_level_controller_type, project_id, cluster_name, location, namespace_name), instead of what I would expect, which is one alert that does a sum by (top_level_controller_name) with optional regex include/exclude filters if you don't want to alert on everything?

Are they used somewhere in the UI where having a separate alert per label combination is useful? If not, we should take the opportunity to rewrite these to be more sane and aggregate-y.
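For the CPU-limit alert, the aggregated form would look something like this (a sketch; label names assume the metadata_system_ prefix noted above, and the =~".+" matchers are illustrative include filters):

sum by (metadata_system_top_level_controller_name) (rate(
    {"kubernetes.io/container/cpu/core_usage_time", monitored_resource="k8s_container",
     metadata_system_top_level_controller_name=~".+"}[5m]))
/
sum by (metadata_system_top_level_controller_name) (avg_over_time(
    {"kubernetes.io/container/cpu/limit_cores", monitored_resource="k8s_container",
     metadata_system_top_level_controller_name=~".+"}[5m]))
> 0.9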

Collaborator:

It looks like the original author might not have known about doc variables in alerts: https://cloud.google.com/monitoring/alerts/doc-variables#doc-vars

Or maybe they didn't exist then.

Hence one alert per combination of label values, which could be used to populate the troubleshooting link.

But there is a better way to do this now.

Collaborator:

All these alerts are written this way, so I'm going to pause review until we figure out whether we want to do a more holistic rewrite, which I think we should.

Collaborator:

OK, so we figured out where these are used in the Console: GKE > Workloads, then click a single Workload, then Observability, then select Application, enable Istio or something, then click Recommended Alerts. When you do, the side panel opens up and you can see the alert's ${VARIABLES} subbed out with values inferred from the workload you're on:
[image]

So with that context, I'mma say that we should NOT bother making these alerts universal/aggregated (as people are on a specific workload's page when they install them), but we should change them to use doc variables. Basically it's just the following:

Change the string ${CLUSTER_LOCATION}/${CLUSTER_NAME}/${NAMESPACE}/${WORKLOAD_NAME}/observability?project=${PROJECT_ID} within the content field to ${metric_or_resource.label.location}/${metric_or_resource.label.cluster_name}/${metric_or_resource.label.namespace_name}/${metric.label.metadata_system_top_level_controller_name}/observability?project=${metric_or_resource.label.project_id}

We should also remove the reference to MQL within content, replacing [MQL documentation](https://cloud.google.com/monitoring/mql/alerts) with [PromQL documentation](https://cloud.google.com/monitoring/promql/create-promql-alerts-console).

Let's also change all = to =~, so if someone wants to make it universal, it's really easy to just add a | and have multiple entries.
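For example (a sketch of one matcher; the cluster names are hypothetical):

cluster_name=~"${CLUSTER_NAME}"       # as installed: matches the single substituted value
cluster_name=~"cluster-a|cluster-b"   # universal: add a | to match multiple clusters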

Let's also keep all userLabels that currently exist, because we might as well.

{
"displayName": "High CPU Limit Utilization (GKE Workload: ${WORKLOAD_NAME})",
"documentation": {
"content": "[View Workload Details](https://console.cloud.google.com/kubernetes/deployment/${CLUSTER_LOCATION}/${CLUSTER_NAME}/${NAMESPACE}/${WORKLOAD_NAME}/observability?project=${PROJECT_ID}).\n- Workloads with high CPU limit utilization likely have containers that are CPU throttled. To avoid application slowdown and unresponsiveness, you can either keep CPU usage below the CPU limit or increase the limit [View Documentation](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits)\n- If alerts tend to be false positive or noisy, consider visiting the alert policy page and changing the threshold, the rolling (alignment) window, and the retest (duration) window. [Alerting policies documentation](https://cloud.google.com/monitoring/alerts/concepts-indepth), [MQL documentation](https://cloud.google.com/monitoring/mql/alerts)",
Collaborator:

I think these manually injected variables can and should be replaced by cloud alerting variables instead: https://cloud.google.com/monitoring/alerts/doc-variables#doc-vars

That way you can install one alert that dynamically populates the labels based on the values seen in the offending time series.

@mrsillydog mrsillydog requested a review from lyanco October 1, 2025 15:50
mrsillydog (Contributor Author):

@lyanco This should be good for another pass - let me know if I've gone overboard with the doc variables.

@@ -1,13 +1,13 @@
 {
-"displayName": "GKE Container - High CPU Limit Utilization (${CLUSTER_NAME} cluster)",
+"displayName": "GKE Container - High CPU Limit Utilization (${metric_or_resource.label.cluster_name} cluster)",
Collaborator:

I actually think we need to keep ${CLUSTER_NAME} in displayName; IIRC doc variables only work within the "documentation" field.

Collaborator:

OK, sorry for the confusion here; you might want to just revert this PR and make a smaller edit. The doc variables only work for fields within the "documentation" block, so we still need to keep the ${CLUSTER_NAME} etc. variables in the queries and the displayName fields.

${CLUSTER_NAME} gets filled in by our UI with the name of the cluster, which is needed to actually run the alert query. ${metric_or_resource.label.cluster_name} only gets filled in if the alert fires, and only within documentation. So we need to keep ${CLUSTER_NAME} etc. everywhere that's not the documentation field.
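In other words (a sketch using the fields from the diff above; the annotations are mine):

"displayName": "... (${CLUSTER_NAME} cluster)"                                       # install-time variable, substituted by the UI
"documentation": { "content": "... ${metric_or_resource.label.cluster_name} ..." }   # doc variable, filled in only when the alert fires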

"comparison": "COMPARISON_GT",
"duration": "0s",
"filter": "resource.type = \"k8s_container\" AND metric.type = \"kubernetes.io/container/cpu/limit_utilization\" AND resource.labels.cluster_name=\"${CLUSTER_NAME}\" AND resource.labels.location=\"${CLUSTER_LOCATION}\"",
"filter": "resource.type = \"k8s_container\" AND metric.type = \"kubernetes.io/container/cpu/limit_utilization\" AND resource.labels.cluster_name=\"${metric_or_resource.label.cluster_name}\" AND resource.labels.location=\"${metric_or_resource.label.location}\"",
Collaborator:

Ditto - doc variables should only be used within "documentation" fields.

mrsillydog (Contributor Author) commented Oct 1, 2025:

> The doc variables only work for fields within the "documentation" block. So we still need to keep the ${CLUSTER_NAME} etc variables in the queries and the displayName fields.

Good to know; that's what I was worried about. Should've been less presumptuous. The changes have been reverted, and then the still-relevant changes have been made.

@mrsillydog mrsillydog requested a review from lyanco October 1, 2025 20:03