Ryan Malloy 89993ca207 L1/L2/L4: cleanup + README operational guide
L1 — Replace hand-rolled atoi/parseUint with strconv.ParseUint wrapped
in mustParseUint. Hamilton's reasoning: the comment "strconv adds
overhead we don't need" is the Lauren-Bug shape. We believed the input
was already validated, until a path we couldn't predict delivered input
that wasn't. Stdlib's edge-case coverage is the safer default; the
wrapper panics on malformed input so any future regression surfaces in
CI, not as a silent 0 serial.

L2 — applyUpdate no longer mutates the caller's RR header TTL. miekg/
dns parses the UPDATE message into RRs the caller still owns; silently
rewriting hdr.Ttl was a hygiene smell with no current functional
consequence but a clear documentation issue. Now we dns.Copy() the RR
before any header mutation.

L4 — README expanded with an "Operational constraints" section
documenting the contracts and limits operators should understand
before relying on this in production:
  - Single-process atomicity only (with rsync-race mitigation)
  - Process-global MsgAcceptFunc override
  - No-op UPDATE doesn't bump SOA (with touch-UPDATE workaround)
  - SOA invariants enforced strictly (zero, multi, non-apex SOA all
    refused)
  - Serial counter NNNN=9999 rollover semantics
  - TSIG replay window dependency on miekg/dns default
  - Git commit failure logged at ERROR, not rolled back
  - Per-key rate limit knobs

Every constraint maps to a Hamilton review finding; documenting the
contract in operator-facing prose closes the gap between code and
expectation that the review identified.
2026-05-22 21:33:37 -06:00


coredns-rfc2136

A CoreDNS plugin that accepts RFC 2136 dynamic DNS updates (TSIG-authenticated), filling a gap in the official plugin set.

CoreDNS as-shipped has no plugin for accepting dynamic updates — its plugin model treats authoritative data as read-only (loaded from auto, file, secondary, etc.). This plugin adds the missing piece.

Primary use case: self-hosted ACME DNS-01

The motivating problem: automate Let's Encrypt cert issuance for many domains without depending on registrar APIs (Vultr/Route53/Cloudflare). The architecture:

```
_acme-challenge.example.com  CNAME  <uuid>.auth.supported.systems
                                      │
                                      │ delegated NS to your CoreDNS host
                                      ▼
                              CoreDNS + rfc2136 plugin
                                      │
                                      │ accepts TSIG UPDATEs from Caddy
                                      │ (caddy-dns/rfc2136) or any other
                                      │ ACME client
                                      ▼
                                  Let's Encrypt validates
```

One-time per protected domain: add a single CNAME record to your static zones. After that, all cert issuance and renewal happens via UPDATE messages, with zero static zone-file churn.

Status

v2026.05.22.2: production-ready. Handles UPDATE messages against file-backed zones, TSIG-authenticates, bumps SOA serial in CalVer YYMMDD*10000+NNNN form, atomically writes the zone file, optionally git-commits each change for an audit trail. Designed to coexist with CoreDNS's auto plugin (which serves queries from the same zone files on its reload cycle).

Configuration

```
rfc2136 <zone> [<zone>...] {
    zones-dir <path>                              # required
    tsig-key <name> <algorithm> <base64-secret>   # may repeat
    ttl <seconds>                                 # default 60
    auto-commit <true|false>                      # default true
    git-author <name> <email>                     # optional
    rate-limit <burst> <period-seconds>           # default 100 / 60s
    rate-limit off                                # disable rate-limit
}
```

Example:

```
.:53 auth.example.com {
    rfc2136 auth.example.com {
        zones-dir /var/lib/coredns/zones
        tsig-key acme-key. hmac-sha256 BASE64SECRET==
        ttl 60
        auto-commit true
        git-author "coredns-rfc2136" "rfc2136@coredns.example.com"
    }
    auto {
        directory /var/lib/coredns/zones (.*)\.zone {1}
        reload 30s
    }
    errors
    log
}
```

Operational constraints

A few behaviors operators should know before relying on this plugin:

Single-process atomicity only

The per-zone mutex serializes UPDATEs within one CoreDNS process. It does NOT coordinate with external file edits: if you rsync a zone file from a workstation while the plugin is mid-UPDATE, the two writers race. The plugin mitigates this with a snapshot-and-recheck. loadRRs records the file's (mtime, size), and the plugin re-stats the file immediately before writing back; if either value changed, the UPDATE is refused with SERVFAIL, and the client's retry (Caddy etc.) is processed against a fresh load. The remaining window, between the re-stat and the write, is narrow but non-zero.

Recommendation: don't rsync zone files into a directory the plugin is actively writing to. If you must, expect occasional SERVFAILs that resolve on retry.

MsgAcceptFunc is process-global

CoreDNS 1.14.3 doesn't expose a per-Config MsgAcceptFunc, so this plugin overrides the miekg/dns package-level default at init() time. Every server block in the process will accept the UPDATE opcode at the wire layer — but only blocks with rfc2136 in their plugin chain do anything useful with it (others pass through and return FormatError). The actual security boundary is TSIG, enforced both in ServeDNS and as a defense-in-depth check inside handleUpdate.
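miekg/dns lets a program replace its package-level DefaultMsgAcceptFunc, and that hook receives only the wire header. The stdlib-only sketch below (acceptOpcode is a hypothetical name, not the plugin's code) reproduces just the opcode part of such a check on the raw 16-bit flags word. It illustrates why the override cannot be scoped to one server block: nothing in the header says which block the message is for. The real default function also checks section counts, which this sketch omits.

```go
package main

import "fmt"

// DNS opcodes (RFC 1035, RFC 1996, RFC 2136) and the QR (response) flag.
const (
	opcodeQuery  = 0
	opcodeNotify = 4
	opcodeUpdate = 5
	qrBit        = 1 << 15
)

// acceptOpcode is a hypothetical accept hook that sees only the header
// flags word: no zone, no server block, so its decision is process-wide.
func acceptOpcode(bits uint16) bool {
	if bits&qrBit != 0 {
		return false // ignore responses
	}
	switch (bits >> 11) & 0xF { // opcode occupies bits 11-14
	case opcodeQuery, opcodeNotify, opcodeUpdate:
		return true
	default:
		return false
	}
}

func main() {
	update := uint16(opcodeUpdate << 11)
	fmt.Println(acceptOpcode(update))          // true
	fmt.Println(acceptOpcode(update | qrBit))  // false: a response
	fmt.Println(acceptOpcode(uint16(2 << 11))) // false: STATUS (opcode 2)
}
```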

No-op UPDATEs do not bump the SOA serial

If an UPDATE adds an RR that's already present (deduped per RFC 2136 §3.4.2.2) or deletes one that doesn't exist, the file is not rewritten and the SOA serial is not advanced. We return NOERROR. Downstream secondaries are not asked to AXFR for a no-change.
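A sketch of how an UPDATE can report whether it changed anything, so the caller can skip the file rewrite and serial bump on a no-op. It assumes each RR is reduced to a hypothetical canonical string key (rrKey); a real implementation would compare parsed RRs. Operations are applied in message order.

```go
package main

import "fmt"

// rrKey is a hypothetical canonical form of an RR (owner, type, rdata).
type rrKey string

// zoneSet holds a zone's RRs as a set, so duplicate adds collapse.
type zoneSet map[rrKey]struct{}

// op is one add or delete from an UPDATE's update section.
type op struct {
	add bool
	rr  rrKey
}

// apply performs the ops in order and reports whether the zone changed.
// Only a true result should trigger a rewrite and SOA serial bump.
func (z zoneSet) apply(ops []op) (changed bool) {
	for _, o := range ops {
		_, present := z[o.rr]
		switch {
		case o.add && !present:
			z[o.rr] = struct{}{}
			changed = true
		case !o.add && present:
			delete(z, o.rr)
			changed = true
		}
		// Otherwise: duplicate add or delete of an absent RR, a no-op.
	}
	return changed
}

func main() {
	z := zoneSet{`_acme-challenge TXT "abc"`: {}}
	fmt.Println(z.apply([]op{{add: true, rr: `_acme-challenge TXT "abc"`}}))  // false: already present
	fmt.Println(z.apply([]op{{add: false, rr: `_acme-challenge TXT "zzz"`}})) // false: nothing to delete
	fmt.Println(z.apply([]op{{add: true, rr: `_acme-challenge TXT "def"`}}))  // true
}
```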

If you need to force a serial bump (rare), send a touch-UPDATE: add a throwaway RR then delete it.

SOA invariants are enforced strictly

loadRRs refuses zone files with zero SOAs, multiple SOAs, or an SOA whose owner doesn't match the zone apex. The check runs both at startup (via validateZoneFiles) and on every UPDATE, so zone-file corruption fails loudly at boot rather than mysteriously on the first ACME activity.
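The three refusal cases reduce to one check over the parsed records. This sketch uses a hypothetical record type standing in for parsed RRs, and checkSOA is not the plugin's function name. Owner names are compared case-insensitively, as DNS names are.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// record is a hypothetical, minimal stand-in for a parsed RR.
type record struct {
	owner string // fully qualified, e.g. "auth.example.com."
	rtype string // e.g. "SOA", "TXT"
}

// checkSOA enforces exactly one SOA, owned by the zone apex.
func checkSOA(apex string, rrs []record) error {
	var soas []record
	for _, r := range rrs {
		if r.rtype == "SOA" {
			soas = append(soas, r)
		}
	}
	switch {
	case len(soas) == 0:
		return errors.New("no SOA record")
	case len(soas) > 1:
		return fmt.Errorf("%d SOA records, want 1", len(soas))
	case !strings.EqualFold(soas[0].owner, apex):
		return fmt.Errorf("SOA owner %q is not the apex %q", soas[0].owner, apex)
	}
	return nil
}

func main() {
	apex := "auth.example.com."
	fmt.Println(checkSOA(apex, []record{{apex, "SOA"}, {"_acme-challenge." + apex, "TXT"}})) // <nil>
	fmt.Println(checkSOA(apex, []record{{"www." + apex, "SOA"}}))                              // non-apex SOA error
}
```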

Serial counter rolls over at NNNN=9999

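Format is YYMMDD*10000 + NNNN. At NNNN=9999, the next bump rolls to the next encoded day with NNNN=0001. On heavy days the encoded date drifts ahead of wall time; on quiet days wall time catches back up. Monotonic ordering (the only DNS requirement) holds. Because the serial encodes the calendar date, the uint32 ceiling is set by the date rather than by burn rate: 2042-12-31 is the last day that fits (421231*10000 + 9999 = 4,212,319,999), while 2043-01-01 would encode as 4,301,010,000, above the uint32 maximum of 4,294,967,295. Heavy-day drift only moves that point earlier.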

TSIG replay window is miekg/dns's default (currently 300s)

The fudge window enforced by miekg/dns's TsigVerify is what gates replay. If miekg/dns ever changes its default, this plugin's behavior changes with it. A possible future enhancement is a tsig-fudge Corefile directive.

Git commit failure is logged at ERROR, not rolled back

If git commit fails after a successful writeAtomic, the zone file is correct but the audit trail diverges. We log at ERROR with a recovery hint (git -C <dir> status + manual commit). We do NOT roll back the file write — the auto plugin may have already noticed the new mtime, and rolling back creates more races than it solves.

Per-key rate limit

UPDATE traffic is token-bucket capped per TSIG key. Default 100 UPDATEs per 60 seconds. ACME storms are well within this; anything beyond is suspicious. Tune via rate-limit <burst> <period>.

Building

This plugin is consumed by a custom CoreDNS build via plugin.cfg:

```
# In CoreDNS source's plugin.cfg, BEFORE the `cache` plugin:
rfc2136:git.supported.systems/rsp2k/coredns-rfc2136
```

Then, from the CoreDNS source tree, run go get git.supported.systems/rsp2k/coredns-rfc2136 && make.

License

MIT (TODO: add LICENSE file).