Runners fail with ERR_TLS_CERT_ALTNAME_INVALID when preparing job on an IPv6 EKS cluster. #245

@carl-reverb

Description

Controller Version

0.12.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy ARC and an RSS to an IPv6-enabled cluster, with the runner in 'kubernetes' mode.
2. Create a simple dispatch workflow that `runs-on: myrunner`
3. Dispatch.
4. Runner crashes during 'Initialize Containers'.

Describe the bug

When trying to migrate our runners to an IPv6 EKS cluster, we find that the runners consistently crash in the 'Initialize Containers' step.

Error [ERR_TLS_CERT_ALTNAME_INVALID]: Hostname/IP does not match certificate's altnames: Host: fd26. is not in the cert's altnames: DNS:c699ee59bb9e133834eae210f228abc6.yl4.eks-cluster.us-east-1.api.aws, DNS:ip-172-16-172-216.ec2.internal, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, IP Address:FD26:11D8:2382:0:0:0:0:1, IP Address:2600:1F18:427C:8111:0:0:0:F22, IP Address:172.16.172.216

This message is suspicious, especially the fragment "Host: fd26. is not in the cert's altnames". fd26 is the first group of my cluster's "Service IPv6 Range", fd26:11d8:2382::/108. In fact, KUBERNETES_SERVICE_HOST=fd26:11d8:2382::1, and the DNS result for kubernetes.default is:

$ getent hosts kubernetes.default
fd26:11d8:2382::1 kubernetes.default.svc.cluster.local

A host of fd26 is the outcome I would expect from improper handling of host addresses: code that splits an assumed address:port string on ":" and then takes the first part as the address.
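To illustrate the theory, here is a minimal TypeScript sketch of that failure mode. Both `naiveHost` and `parseHostPort` are hypothetical helpers written for this issue; they are not taken from the runner, the container hooks, or any Kubernetes client, and I have not confirmed that the real code path looks like this.

```typescript
// Hypothetical: the kind of parsing that would turn an IPv6 address into "fd26".
// It assumes the input contains at most one ":" separating host and port.
function naiveHost(hostPort: string): string {
  return hostPort.split(":")[0];
}

// Hypothetical: a parser that tolerates IPv6 literals.
// Accepts "[v6]:port", "[v6]", a bare IPv6 literal, "host:port", or "host".
function parseHostPort(hostPort: string): { host: string; port?: string } {
  // Bracketed IPv6 literal, optionally followed by ":port".
  const bracketed = /^\[([^\]]+)\](?::(\d+))?$/.exec(hostPort);
  if (bracketed) {
    return { host: bracketed[1], port: bracketed[2] };
  }
  // More than one colon without brackets: treat the whole string as an
  // IPv6 literal, since a port cannot be separated out unambiguously.
  const colonCount = hostPort.split(":").length - 1;
  if (colonCount > 1) {
    return { host: hostPort };
  }
  // Hostname or IPv4 address with an optional port.
  const [host, port] = hostPort.split(":");
  return { host, port };
}

console.log(naiveHost("fd26:11d8:2382::1")); // prints "fd26"
console.log(parseHostPort("[fd26:11d8:2382::1]:443").host); // prints "fd26:11d8:2382::1"
```

The first call reproduces the truncated host from the error message when given the value of KUBERNETES_SERVICE_HOST from my cluster.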

Describe the expected behavior

I expect that the runner should work on IPv6 without modification or override.

When migrating workloads to IPv6, I often encounter improper address handling across languages, including Ruby, JavaScript, Python, and Go. Implementers often assume they can build a URI by concatenating component strings, which falls apart at the edge cases. All of these migrations were solved either by correcting the implementation to use proper URI-handling standard library functions, or by formatting the address in bracketed notation, [feeb:beef::1]:8080, which works around some issues in third-party libraries and tools.
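As a sketch of what such a fix tends to look like in TypeScript, the snippet below brackets an IPv6 literal before handing it to the standard `URL` class. `bracketIfIPv6` and `buildServerUrl` are hypothetical names for illustration, and treating any host that contains ":" as an IPv6 literal is a simplifying assumption, not a full address validator.

```typescript
// Hypothetical helper: wrap a bare IPv6 literal in brackets so it can be
// combined with a port. Hostnames and IPv4 addresses pass through unchanged.
function bracketIfIPv6(host: string): string {
  const looksLikeIPv6 = host.includes(":") && !host.startsWith("[");
  return looksLikeIPv6 ? `[${host}]` : host;
}

// Hypothetical helper: build the URL with the standard URL class rather
// than passing a hand-concatenated string around.
function buildServerUrl(host: string, port: string): URL {
  return new URL(`https://${bracketIfIPv6(host)}:${port}`);
}

// Example with the address format from above.
const url = buildServerUrl("feeb:beef::1", "8080");
console.log(url.hostname); // prints "[feeb:beef::1]"
console.log(url.port); // prints "8080"
```

Keeping the value as a `URL` object means later code reads `hostname` and `port` from it instead of splitting a string on ":" again.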

I attempted to find the root cause of the issue in this repository, in @actions/runner-container-hooks, and in @actions/runner, but was unable to pinpoint it.

This effectively blocks me (and apparently everyone) from deploying ARC to IPv6-only clusters.

Controller Logs

https://gist.github.com/carl-reverb/5666aa77be92f57e16320d89c0f2c2db

Runner Pod Logs

https://gist.github.com/carl-reverb/40ac08236942c9bb2aa8330e78fb5c7f

Labels

bug
