@fmg-john fmg-john commented Sep 3, 2025

Context

I have several runner scale sets running in my Kubernetes cluster, where each scale set allocates different resources to the workflow pods it manages. Workflow jobs can then effectively select a runner size (e.g. small, medium, large), and the cluster scales out nodes to accommodate the demand from the runner workloads.

I want to ensure that workflow jobs select appropriate runner sizes, to prevent resource wastage through unused compute and excess workload-driven dynamic scaling. To achieve this, I want to gather metrics on runner resource utilization and be able to tie those metrics back to the repo/workflow/job that initiated the run from a reporting/visualization tool (e.g. Grafana). Currently, describing the pods gives no indication of which repo, workflow, or trigger is related to that pod.

Additions

  • Change the debug log that outputs the job container image to an info log. The official GitHub-hosted runners output the image used for the job without debug logging enabled, so promoting this message to info brings the output closer to the official runners' behavior.
  • Add labels with the prefix arc-context- to job and step pods, to provide additional information about the workflow context.

eg:

Labels:
  arc-context-event-name=pull_request
  arc-context-job=build-services
  arc-context-repository=example-repo
  arc-context-repository-owner=example-org
  arc-context-run-attempt=1
  arc-context-run-id=16253714786
  arc-context-run-number=23204
  arc-context-sha=ae9b1b887bf31a940a5c21d59b789fed9d659f15
  arc-context-workflow=BuildApplication
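Kubernetes label values are restricted (at most 63 characters; alphanumerics plus `-`, `_`, `.`; must start and end with an alphanumeric), so context values such as workflow names need sanitizing before they can be applied as labels. A minimal sketch of that mapping (illustrative only; the hook's actual implementation may differ, and the function names here are hypothetical):

```python
import re

# Kubernetes label values must be <= 63 chars and may only contain
# alphanumerics, '-', '_', and '.', starting/ending with an alphanumeric.
def to_label_value(value: str, max_len: int = 63) -> str:
    # Replace disallowed characters, truncate, then trim non-alphanumeric ends.
    cleaned = re.sub(r"[^A-Za-z0-9\-_.]", "-", value)[:max_len]
    return cleaned.strip("-_.")

# Build the arc-context-* label map from workflow context values.
def build_context_labels(context: dict) -> dict:
    return {f"arc-context-{k}": to_label_value(str(v)) for k, v in context.items()}

labels = build_context_labels({
    "workflow": "BuildApplication",
    "repository": "example-repo",
    "run-id": "16253714786",
})
```

With labels applied, pods can then be selected directly, e.g. `kubectl get pods -l arc-context-repository=example-repo`.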

Benefits

  • Gain additional information when debugging: when describing failing pods, it is now trivial to determine the repo/workflow that triggered the run.
  • Get a better view of the metrics and build better dashboards, e.g. view CPU/memory usage grouped by repo, workflow, and job, and compare usage to allocation.
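The grouping the second bullet describes can be sketched as follows. This assumes per-pod usage samples are already collected (e.g. from metrics-server or Prometheus); the pod data and field names below are fabricated for illustration:

```python
from collections import defaultdict

# Group per-pod CPU usage by one of the arc-context-* labels,
# e.g. to compare total usage per repository.
def usage_by_label(pods, label):
    totals = defaultdict(float)
    for pod in pods:
        key = pod["labels"].get(label, "unknown")
        totals[key] += pod["cpu_cores"]
    return dict(totals)

# Fabricated sample data in the shape a metrics scraper might produce.
pods = [
    {"labels": {"arc-context-repository": "example-repo"}, "cpu_cores": 0.5},
    {"labels": {"arc-context-repository": "example-repo"}, "cpu_cores": 1.2},
    {"labels": {"arc-context-repository": "other-repo"}, "cpu_cores": 0.3},
]
totals = usage_by_label(pods, "arc-context-repository")
```

In practice a tool like Grafana would do this aggregation from the label metadata directly, but the principle is the same: the labels make the grouping key available.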

I have been running a custom build of the runner container hooks internally for several months that includes these changes; they have been instrumental in optimizing the resource usage and cost of our GitHub runners.
