@u-kai
Contributor

@u-kai u-kai commented Sep 23, 2025

What does it do?

Adds apex record delegation checks to prevent creating apex records when the TXT registry configuration would otherwise generate ownership TXT records outside of the managed zone.

Specifically, when --txt-prefix includes a record type template with a trailing dot, ExternalDNS could attempt to create TXT records that do not belong to the managed zone. This change introduces a guard (apexChecker) to block such cases.

Motivation

When using certain --txt-prefix configurations with apex records (records at the zone root), ExternalDNS may try to create ownership TXT records outside of the intended zone.
For example, with --txt-prefix=txt- and an apex A record for example.com, the ownership TXT record would be created as txt-example.com, which is outside the example.com zone and therefore unmanaged.

This PR ensures apex records are only created when the resulting TXT record can be safely managed within the zone.
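For illustration, the problem can be reproduced in a few lines of Go. This is a minimal sketch, not code from this PR: it assumes the prefix is simply prepended to the record name, as in the example above, and `ownershipName` and `inZone` are hypothetical helpers rather than ExternalDNS functions.

```go
package main

import (
	"fmt"
	"strings"
)

// ownershipName is a hypothetical helper: it prepends the TXT prefix to the
// record name, mirroring the txt-example.com example above.
func ownershipName(prefix, dnsName string) string {
	return prefix + dnsName
}

// inZone reports whether name is the zone apex itself or a subdomain of it.
func inZone(name, zone string) bool {
	name = strings.TrimSuffix(strings.ToLower(name), ".")
	zone = strings.TrimSuffix(strings.ToLower(zone), ".")
	return name == zone || strings.HasSuffix(name, "."+zone)
}

func main() {
	zone := "example.com"
	// "txt-" glues the prefix onto the apex label; "txt." adds a new label.
	for _, prefix := range []string{"txt-", "txt."} {
		txt := ownershipName(prefix, zone)
		fmt.Printf("prefix=%q txt=%s inZone=%v\n", prefix, txt, inZone(txt, zone))
	}
}
```

With `txt-` the resulting name is a sibling of example.com rather than a child of it, so it falls outside the zone; a prefix ending in a dot keeps the TXT record inside the zone.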

Related: documentation clarification proposed in #5863 .

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

@k8s-ci-robot k8s-ci-robot added the internal Issues or PRs related to internal code label Sep 23, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. plan Issues or PRs related to external-dns plan registry Issues or PRs related to a registry needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 23, 2025
@k8s-ci-robot
Contributor

Hi @u-kai. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 23, 2025
@u-kai
Contributor Author

u-kai commented Sep 23, 2025

After opening this PR, I realized I still need to think through a few more things.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 24, 2025
@u-kai
Contributor Author

u-kai commented Sep 24, 2025

Earlier I mentioned that I still needed to think through a few more things.
I've now completed that work, and this PR is ready for review. 🙇

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2025
@ivankatliarchuk
Member

To be honest, I'm not sure what this change is intended to improve. Do you have any Kubernetes manifests I could use to compare the behavior before and with the fix?

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2025
@u-kai
Contributor Author

u-kai commented Sep 30, 2025

@ivankatliarchuk
Thanks for asking! This change adds a warning and a safety check around apex records when txt-prefix would produce a non-creatable TXT record.

Why:
Until now, if txt-prefix was incompatible with an apex record, ExternalDNS would still create the target record (e.g., A) but silently skip the ownership TXT record because it falls outside the managed zone. At least on the AWS provider, this only surfaced in the debug-level log below. That leaves you with an apex record without ownership, and it’s hard to notice.

time="2025-09-19T20:12:18Z" level=debug msg="Skipping record cname-example.me because no hosted zone matching record DNS Name was detected"

What this PR changes:

  1. If the apex ownership TXT cannot be created due to an invalid txt-prefix, ExternalDNS will not create the target apex record either.
  2. Emits a warn-level log pointing out that txt-prefix is invalid for apex usage.

Backward-compatibility: existing apex records that already lack a TXT remain untouched.
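The two changes above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation in this PR: `Endpoint`, `txtInZone`, and `filterCreates` are hypothetical names, the zone is passed in explicitly, and the warning text is made up for the example.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// Endpoint is a simplified stand-in for an ExternalDNS endpoint (assumption).
type Endpoint struct {
	DNSName    string
	RecordType string
}

// txtInZone reports whether prefix+name still falls inside zone.
func txtInZone(prefix, name, zone string) bool {
	txt := prefix + name
	return txt == zone || strings.HasSuffix(txt, "."+zone)
}

// filterCreates drops every endpoint whose ownership TXT record would land
// outside the zone, and logs a warning for each endpoint it blocks.
func filterCreates(creates []Endpoint, prefix, zone string) []Endpoint {
	var allowed []Endpoint
	for _, ep := range creates {
		if !txtInZone(prefix, ep.DNSName, zone) {
			// Illustrative message only; the real log text may differ.
			log.Printf("warning: not creating %s %s: txt-prefix %q cannot produce an in-zone TXT record",
				ep.RecordType, ep.DNSName, prefix)
			continue
		}
		allowed = append(allowed, ep)
	}
	return allowed
}

func main() {
	creates := []Endpoint{
		{DNSName: "example.com", RecordType: "A"},     // apex: hogeexample.com is out of zone
		{DNSName: "www.example.com", RecordType: "A"}, // hogewww.example.com stays in zone
	}
	for _, ep := range filterCreates(creates, "hoge", "example.com") {
		fmt.Println("create:", ep.RecordType, ep.DNSName)
	}
}
```

Only new creations pass through the filter in this sketch, which is one way to leave existing apex records that already lack a TXT untouched.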

Manifests to try (before/with the fix)

ExternalDNS Deployment (set an intentionally incompatible txt-prefix to simulate the problem):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: dns
spec:
  replicas: 1
  selector:
    matchLabels: { app: external-dns }
  template:
    metadata:
      labels: { app: external-dns }
    spec:
      serviceAccountName: external-dns
      containers:
        - name: external-dns
          image: ghcr.io/kubernetes-sigs/external-dns:latest
          args:
            - --source=service
            - --provider=aws
            - --registry=txt
            - --txt-prefix=hoge

Apex record via a Service:

apiVersion: v1
kind: Service
metadata:
  name: website
  namespace: default
  annotations:
    external-dns.alpha.kubernetes.io/hostname: example.com
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: website

Expected behavior:

Before the fix: A record for example.com is created; TXT ownership for apex is skipped with only a debug log (no clear signal).

With the fix: A record creation is blocked when the corresponding apex TXT would be invalid, and a warn-level log explains that txt-prefix is not suitable for apex.

Related issue:
#5850

@ivankatliarchuk
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 1, 2025
@coveralls

coveralls commented Oct 1, 2025

Pull Request Test Coverage Report for Build 18129040433

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.3%) to 78.925%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| openshift_route.go | 1 | 79.49% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 18123647456: | 0.3% |
| Covered Lines: | 16055 |
| Relevant Lines: | 20342 |

💛 - Coveralls

@mloiseleur
Collaborator

@u-kai Thanks for this PR 👍
It looks good to me.
I have one question, after reading this.

When txt-registry is disabled, does this PR also block A record creation?

@u-kai
Contributor Author

u-kai commented Oct 9, 2025

@mloiseleur

txt-registry is disabled

Is this referring to cases where another registry (such as dynamodb, aws-sd, or noop) is used instead?
If so, then no — it does not block A record creation (and I don’t think it should).
This feature works only within the TXTRegistry.

@mloiseleur
Collaborator

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mloiseleur

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2025
@ivankatliarchuk
Member

I only performed high-level smoke testing, and the PR behaves as expected. I’m not entirely confident about the apex record check—it doesn’t seem very reliable, and I’m not sure what a truly reliable check for apex records would look like.

The change itself makes sense.

@u-kai
Contributor Author

u-kai commented Oct 17, 2025

@ivankatliarchuk
Thanks for the review!

Could you share what would make you feel more confident about it?
For example, are there specific scenarios or edge cases you think we should validate more carefully, or is it more about adding extra safeguards in the logic?

@ivankatliarchuk
Member

There is a long-standing architectural gap: accurate differentiation of apex roots, independent of provider or zone nesting.

I see a risk in the proposed APEX check. The rootApexDetector implementation operates entirely heuristically, relying on NS record metadata observed from endpoint listings, not on authoritative DNS resolution.

We can’t determine the apex definitively without querying SOA; at most we could build a reasonable heuristic using net.LookupNS(). With simple parsing of records, I'm unsure where it will and won't work, and what the edge cases are.

@u-kai
Contributor Author

u-kai commented Oct 20, 2025

@ivankatliarchuk
According to RFC 2181, every DNS zone origin must include both NS and SOA records, and the NS records enumerate its authoritative servers.
Therefore, the shortest domain name among NS records within a zone identifies the zone apex.

Our rootApexDetector only retains the shortest (top-level) domain even when sub-zones are delegated,
so delegated subdomains do not affect the apex detection.
Moreover, ExternalDNS only considers zones managed by the configured provider,
meaning it never creates records outside those zones.
As a result, the detected root apex always corresponds to the provider-managed zone boundary,
making this NS-based apex-detection logic both RFC-compliant and operationally safe across environments.

Please let me know if anything looks incorrect or if you see a case I might have missed. 🙇

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 20, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ivankatliarchuk
Member

The PR reduces the risk of creating invalid apex TXT records, but it can’t guarantee correctness across all DNS topologies and providers. The Go standard library does not provide a way to identify the apex either.

I tried to summarize it

| Scenario | What Happens | Reliability |
|---|---|---|
| ExternalDNS manages that private zone (has API access) | It can see apex boundaries from the provider API | Reliable |
| The private resolver (10.10.10.5) has additional internal zones not visible to the provider | ExternalDNS won’t detect them | Unreliable |
| Split-horizon DNS (same domain, different private/public views) | ExternalDNS only sees one side | Partial reliability |
| Purely internal DNS (CoreDNS, Infoblox, etc.) without zone list API | Apex detection becomes best-effort only (name pattern heuristic) | Weak reliability |

The key point: the code only sees what your provider’s managed zones expose via their API. I'm not sure whether all providers expose this information in the same way, so this needs validating for every single one of them.

Unfortunately, the only flow I'm aware of is one where we could validate, for example, public suffixes for public domains using https://pkg.go.dev/golang.org/x/net/publicsuffix.

Without querying authoritative servers for SOA (which isn’t done here), this method relies solely on names observed from the registry, which is a local view of reality. Querying SOA, on the other hand, adds latency and brittleness, and might fail in locked-down VPC networks.

@ivankatliarchuk
Member

From my perspective apex detection requires policy, not just data.

There’s no single, universally correct way to define a “root apex domain”. As a result, there is no support for it in standard libraries.

Examples:

  • example.com might be the apex of one DNS zone, but if you delegate sub.example.com to another provider, the delegated zone has its own apex.
  • In split-horizon DNS, both may exist, with different authoritative sources.

This means determining the apex isn’t a pure DNS query problem, but a contextual policy problem - you need to know which DNS view, resolver, and authority you’re talking to.

@u-kai
Contributor Author

u-kai commented Oct 25, 2025

@ivankatliarchuk

Thanks a lot for the detailed write-up — I read the code and thought this through.

Before replying to each “unreliable” scenario you listed, I’d like to align on scope:

  • I believe ExternalDNS should only manage records within what Records() returns. If we create records outside that observable set, plan.Calculate will always show a diff and keep recreating them.
    Under this premise, the proposal is not trying to find the “true/global” apex; it only targets the topmost apex inside the managed tree.
    Example: given example.com, api.example.com, sub.sub.example.com, sub.sub.hoge.com, the detector treats example.com and sub.sub.hoge.com as apexes within the managed scope and ignores unmanaged parents like com or sub.hoge.com.
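The scope described above can be illustrated with a small sketch. It is not the rootApexDetector from this PR, only a simplified model of the "topmost name within the managed tree" idea; `topmostApexes` is a hypothetical function name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// topmostApexes keeps only the names that have no ancestor in the input.
func topmostApexes(names []string) []string {
	sorted := append([]string(nil), names...)
	// Shorter names first: a parent is always shorter than its children,
	// so every parent is processed before any of its descendants.
	sort.Slice(sorted, func(i, j int) bool {
		if len(sorted[i]) != len(sorted[j]) {
			return len(sorted[i]) < len(sorted[j])
		}
		return sorted[i] < sorted[j]
	})
	var apexes []string
	for _, n := range sorted {
		covered := false
		for _, a := range apexes {
			if n == a || strings.HasSuffix(n, "."+a) {
				covered = true
				break
			}
		}
		if !covered {
			apexes = append(apexes, n)
		}
	}
	return apexes
}

func main() {
	names := []string{
		"example.com",
		"api.example.com",
		"sub.sub.example.com",
		"sub.sub.hoge.com",
	}
	fmt.Println(topmostApexes(names)) // [example.com sub.sub.hoge.com]
}
```

Unmanaged parents such as com or sub.hoge.com never appear in the result because they are not in the observed set to begin with.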

With that scope in mind, here is how I see the scenarios (please correct me if I’m off):

“The private resolver (10.10.10.5) has additional internal zones not visible to the provider”

Out of scope. If the provider API does not expose those zones, ExternalDNS should not manage them, so the detector doesn’t consider them.

Split-horizon DNS (same domain, different private/public views)

From the provider implementations I checked, when a provider supports both public and private zones, an ExternalDNS instance chooses one (e.g., --aws-zone-type=public|private). So per instance, split-horizon does not surface as a problem for this logic.

Purely internal DNS (CoreDNS, Infoblox, etc.) without a zone list API

While looking into this case, I found something more fundamental related to this PR. I explain this below.


New finding: some providers do not return NS records via Records()

While investigating this, I found that CoreDNS and Pi-hole providers don’t return NS records at all through their Records() implementation.
As a result, the new detector simply becomes a no-op for these providers — it won’t block anything, but it also won’t change existing behavior.
That means the change is safe (no regression), though the benefit will only apply to providers that actually expose zone boundaries (e.g., Route53, Cloud DNS, Azure DNS, etc.).


Summary

In summary, this change should work well for most providers —
those like Route53, Cloud DNS, or Azure DNS will benefit from safer apex detection,
while providers such as CoreDNS, Pi-hole, or custom webhook implementations that don’t expose NS records will behave exactly the same as before.

That’s a trade-off between benefit and consistency.
This PR addresses a recurring issue around apex/TXT registry reliability,
but it introduces a mild inconsistency where some providers don’t apply the logic at all.
How do you think we should approach this trade-off?

As an alternative idea, we could deprecate configurations whose txt-prefix cannot safely support apex records,
by warning at startup (configuration level) rather than runtime.
However, that approach might mark many existing ExternalDNS setups as “discouraged,”
so it would need careful consideration before adopting.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. internal Issues or PRs related to internal code needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. plan Issues or PRs related to external-dns plan registry Issues or PRs related to a registry size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

5 participants