Per RFC 1996, a master that mutates a zone SHOULD notify its
secondaries so they can immediately AXFR rather than wait for their
next SOA-refresh poll. Without this, propagation lag from UPDATE to
public DNS is bounded by the secondary's refresh interval (300s for
us) — which is borderline for ACME validation timing.
New Corefile directive:
notify <host[:port]> [<host[:port]>...]
Targets accept bare hostnames (port 53 default), host:port, or
[ipv6]:port. The same list applies to every zone in the rfc2136
block.
Implementation: fire-and-forget UDP per target, each in its own
goroutine, capped by a 2s timeout. The UPDATE response to the client
is never held pending NOTIFY acks (RFC 1996 §4 explicitly decouples
them). Failures log at DEBUG only — a briefly-unreachable secondary
is normal and would otherwise spam logs.
Retires the external scripts/notify-secondaries.py workflow for any
deployment that wires the directive: secondaries now hear about
changes within seconds of the UPDATE landing, no cron or manual
invocation needed.
New tests:
- TestSendNotify_DeliversToTarget — packet arrives, opcode + zone correct
- TestSendNotify_NoTargets_NoCrash — empty list short-circuits
- TestSendNotify_BadTarget_LogsButDoesNotBlock — fire-and-forget timing
- TestNotifyOne_AppendsDefaultPort — host vs host:port normalization
L1 — Replace hand-rolled atoi/parseUint with strconv.ParseUint wrapped
in mustParseUint. Hamilton's reasoning: the comment "strconv adds
overhead we don't need" is the Lauren-Bug shape — we already validated
the input. Until we hadn't, on a path we couldn't predict. Stdlib's
edge-case coverage is the safer default; the wrapper panics on
malformed input so any future regression surfaces in CI, not as a
silent 0 serial.
L2 — applyUpdate no longer mutates the caller's RR header TTL. miekg/
dns parses the UPDATE message into RRs the caller still owns; silently
rewriting hdr.Ttl was a hygiene smell with no current functional
consequence but a clear documentation issue. Now we dns.Copy() the RR
before any header mutation.
L4 — README expanded with an "Operational constraints" section
documenting the contracts and limits operators should understand
before relying on this in production:
- Single-process atomicity only (with rsync-race mitigation)
- Process-global MsgAcceptFunc override
- No-op UPDATE doesn't bump SOA (with touch-UPDATE workaround)
- SOA invariants enforced strictly (zero, multi, non-apex SOA all
refused)
- Serial counter NNNN=9999 rollover semantics
- TSIG replay window dependency on miekg/dns default
- Git commit failure logged at ERROR, not rolled back
- Per-key rate limit knobs
Every constraint maps to a Hamilton review finding; documenting the
contract in operator-facing prose closes the gap between code and
expectation that the review identified.
Hamilton M8: a compromised TSIG key — or a misconfigured client
retrying forever — must not be able to drive unbounded UPDATE traffic.
Each UPDATE costs disk IOPS, a git commit, and a slot in the SOA
serial counter (now 9999/day per zone). Without a cap, a few hours of
runaway traffic could exhaust the SOA serial counter and brick the
zone for the day.
Implementation: per-key token bucket in ratelimit.go. Default 100
tokens / 60 seconds. New keys start full so legitimate clients see no
delay at boot. Refill is continuous, capped at the burst value.
Configurable in Corefile:
rate-limit off # disable entirely
rate-limit <burst> <period-secs> # e.g., rate-limit 200 60
Enforcement runs in ServeDNS after TSIG verification — a request that
fails auth doesn't consume a token (and a forged TSIG can't be used to
deny service to a real key holder, since we never reached the rate
check).
100/min is well above ACME's needs: a worst-case full-renewal storm
across our ~84 zones emits maybe 200 UPDATEs total over several
minutes. Anything beyond is suspicious by definition.
New tests covering: first-call allowed, burst exhaustion, refill
behavior, per-key isolation, refill-cap (no idle-accumulation
overflow).
H6 — TSIG replay-window test. New TestCheckTSIG_BadStatus_Refused
verifies that when miekg/dns reports a TSIG verification failure via
ResponseWriter.TsigStatus (the channel for fudge-window violations,
bad MACs, expired timestamps), our plugin refuses. The fudge tolerance
itself is miekg/dns's default (300s); documented in tsig.go so
operators know the dependency.
H7 — No-op UPDATE policy: documented explicitly in update.go. We do
NOT bump the SOA on a no-op (deduped) UPDATE — forcing downstream
secondaries to AXFR identical content wastes bandwidth and contradicts
RFC 2136's intent. Callers wanting to force a serial bump can send a
throwaway add+delete pair (touch-UPDATE pattern).
M3 — Delete-by-exact-match ignores TTL and class per RFC 2136 §2.5.4.
The previous rr.String() comparison included TTL, so an UPDATE with
CLASS=NONE TTL=0 (the protocol-required encoding for a delete) failed
to match stored RRs at CLASS=IN with non-zero TTL. Now we normalize
both sides (TTL=0, class=IN) before invoking dns.IsDuplicate.
M4 — validateZoneFiles now actually parses each zone at startup
(loadRRs invocation). Previously it only stat()'d the file; corrupt
zone content sailed through startup and produced SERVFAIL on the first
UPDATE with no startup-time signal. Combined with H3+H4's invariant
checks, this turns silent zone corruption into immediate startup
failure.
M7 — Commit-message sanitization. RR names are attacker-controlled
(TSIG only authenticates the sender; the payload is hostile by
default). Control characters in commit messages could inject newlines
into git log or ANSI sequences into downstream log renderers. New
sanitizeForCommitMessage escapes \n, \r, \t, and other C0 controls.
New tests:
- TestCheckTSIG_BadStatus_Refused (H6)
- TestUpdate_DeleteRR_IgnoresTTL (M3)
- TestSanitizeForCommitMessage (M7)
H3+H4 — Zone SOA invariant. After parsing, loadRRs enforces:
exactly one SOA, owned by the zone apex. Catches three failure modes
with a single guard:
- Missing SOA (H4): a malformed line earlier in the file may have
tripped miekg/dns's ZoneParser into dropping records without
reporting an error via parser.Err(). If the SOA went missing, we
refuse rather than treat the partial parse as authoritative.
- Multiple SOAs (H3): zone files with accidental duplicate SOA
records produce inconsistent zone state visible to AXFR clients.
The old code's first-match SOA-bump would silently propagate the
inconsistency. Now we refuse.
- Non-apex SOA (H3): an SOA whose owner doesn't match the zone
origin is either a parse error or a hand-edit mistake; bumping
it would leave the real apex unchanged. Now we refuse.
assertSingleApexSOA returns a descriptive error so the failure mode
is actionable from logs alone.
H5 — MaxUint32 guard in bumpSerial. The old "+1 defensive advance"
branch would wrap to 0 if soa.Serial == MaxUint32, and downstream
secondaries per RFC 1982 §3.2 treat 0-after-MaxUint32 as "older"
(they refuse to AXFR and the zone goes dark). Now we explicitly check
and refuse with a loud message; operator must reset the serial
manually. Practical reach is zero for our deployment (10000 bumps/day
× 117 years would still fit uint32) but the defensive ceiling matters
for fuzz, hand-edit, or future code-path errors.
The full RFC 1982 wraparound-aware comparison was prototyped but
removed: it broke the legacy-format migration case where a tiny
non-CalVer serial (e.g., 12345) is "more than 2^31 distant" from a
new-format serial (~2.6B), which RFC 1982 reads as "going backwards"
and would block migration. Naive `>` is correct in practice; the
MaxUint32 case is the only real failure mode worth guarding.
New tests:
- TestBumpSerial_MaxUint32_RefusesWrap
- TestLoadRRs_NoSOA_Refused
- TestLoadRRs_MultipleSOAs_Refused
- TestLoadRRs_NonApexSOA_Refused
H1 — Concurrent-modification detection. loadRRs now returns a
fileSnapshot capturing (mtime, size) at read time. handleUpdate calls
zf.checkUnchanged(snap) immediately before writeAtomic. If anything
modified the file between load and write — rsync push, manual edit,
`git checkout` — the UPDATE is refused with SERVFAIL. Caddy retries
with a fresh load. Protects against the CLAUDE.md-documented rsync
workflow racing the plugin.
H2 — Git commit-failure policy. The previous code logged at WARN and
continued, breaking the documented "file + git both updated" contract.
Now logs at ERROR with structured fields (zone, path, error, recovery
command) so operators discover the divergence. We do NOT roll back the
file write: by the time the commit fails, the auto plugin may have
already noticed the new mtime and reloaded; rolling back creates more
races than it solves. Recovery is `git -C <dir> status` + manual
commit.
M1 — exec.CommandContext with 10s timeout on git invocations. If git
hangs (NFS stall, gpg-sign prompt, broken pre-commit hook waiting on
stdin), the per-zone mutex would otherwise be held forever and queue
all subsequent UPDATEs. gitCommandTimeout caps the hang.
M2 deferred. Dropping the separate `git add` cleanly requires either
`-a` (wrong scope: auto-stages all tracked modifications) or `--include`
(still needs prior staging). The race window between add and commit is
theoretical for our setup (single-writer plugin + occasional `git
status`). M1's timeout already mitigates the worst hang case.
New tests:
- TestZoneFile_CheckUnchanged_DetectsExternalModification (H1)
C1 — Document the process-global MsgAcceptFunc mutation:
CoreDNS 1.14.3 doesn't expose per-Config MsgAcceptFunc (server.go:159
hardcodes the dns.Server struct), so the override has to be global. The
init()-level comment now explains the operational consequences in
detail, and setup() emits a loud INFO log calling out the global scope
for operator audit. Upstream support for per-Config MsgAcceptFunc would
let us delete the whole stanza.
C2 — handleUpdate now requires the caller to assert TSIG verification
via an explicit `verified bool` parameter. The security contract is
encoded in the function signature, not in convention. ServeDNS passes
verified=true after checkTSIG succeeds; verified=false produces an
immediate Refused with no state mutation. Future internal callers
(NOTIFY relay, admin RPC, refactor) physically cannot reach the
mutation code without proving the request was authenticated.
M9 — Don't sign TSIG-failure rejection responses. Per Hamilton's
finding, signing a rejection with the named key attests "yes, this
server holds that key" — useful intel for an attacker probing key
existence. Unsigned Refused is the right shape: nsupdate sees "no TSIG
on reply" and treats as auth failure, which is what actually happened.
New test TestUpdate_UnverifiedCaller_Refused proves the C2 contract:
handleUpdate(w, msg, false) refuses, zone file unchanged.
The previous YYYYMMDDNN encoding capped at NN=99 (100 bumps/day) and
hard-failed UPDATEs once the day's counter was exhausted — confirmed
in production on 2026-05-22 when ACME activity across the supported.
systems zone hit the cap and SERVFAILed every subsequent UPDATE.
New format: YYMMDD*10000+NNNN. With 4-digit NNNN we get 10000/day, and
dropping the century keeps a 2026-dated serial (2,605,229,999 max) under
uint32's 4,294,967,295 ceiling. A 4-digit year (e.g., 20260522*10000)
would overflow uint32 — RFC 1035's SOA serial type bounds this.
Three behavior changes:
1. On NNNN=9999, roll forward to the next encoded day with NNNN=0001
rather than erroring. The encoded date drifts ahead of wall time on
heavy churn days and catches up on quiet days; monotonic ordering
(the only DNS requirement) holds.
2. Future-encoded serials (from a prior rollover) are honoured — the
previous "older date" branch downgraded them back to today*100+1,
producing a backwards serial. This bug also tripped a manual
workaround on the same day. Now: future encoded dates bump their
own NNNN.
3. Legacy YYYYMMDDNN serials migrate automatically on first bump. A
value like 2026052299 (~2.026B) is numerically smaller than today's
new-format minimum 2605220001 (~2.605B), so the older-or-unparseable
branch fires and rewrites in place. New > old, so AXFR receivers
treat it as a clean forward bump.
Tests cover same-day, rollover, future-encoded no-regress, legacy
migration, non-CalVer reset, and no-SOA error.
When a request arrives with TSIG, attach a TSIG record to the response
so dns.ResponseWriter computes the MAC at write time using the secret
in TsigSecret. Without this, BIND nsupdate complains "expected a TSIG
or SIG(0)" on every UPDATE, even when the update applies successfully.
Two response paths fixed:
- handleUpdate success/per-rcode replies (update.go)
- ServeDNS rejection when TSIG verification fails (plugin.go)
The new helper in tsig.go is a no-op for unsigned requests. Unknown
keys still silently skip signing — we can't authenticate to a peer we
don't share a key with.
Tests verify both branches: signed request → response carries matching
TSIG (key name + algorithm); unsigned request → response stays plain.
Major architectural pivot per the user's "RFC 2136 mechanism for the
existing zonefiles, not a new in-memory thing" framing. The plugin no
longer maintains its own in-memory state OR serves any queries -- both
of those are now the auto plugin's job, reading the same zone files.
The plugin's sole responsibility is now: receive TSIG-authed UPDATE
messages, edit the matching zones/<zone>.zone file, bump the SOA
serial in CalVer (YYYYMMDDNN) form, and optionally auto-commit to git.
What changed:
- DELETED: store.go (in-memory recordStore), store_test.go (12 tests),
plugin_test.go (10 ServeDNS query tests), old update_test.go.
- NEW: zonefile.go -- file-backed authority for one zone. loadRRs via
miekg/dns zone parser; mutation helpers (lookupIn/nameExistsIn/
removeRRsetFrom/removeRRFrom/removeNameFrom/addRRTo) on []dns.RR
slices; bumpSerial with CalVer semantics + NN exhaustion handling;
writeAtomic via temp-file rename; commit shells to `git add && git
commit` with configurable author.
- NEW: zonefile_test.go -- 17 tests covering load/lookup/mutate/bump/
write paths.
- REWRITTEN: plugin.go -- ServeDNS is now thin: UPDATE → TSIG → handler;
everything else → Next. No synthetic SOA/NS, no query serving.
- REWRITTEN: update.go -- handleUpdate now opens the zoneFile, loads,
applies (with prereq checks against the loaded RRs), bumps serial,
writes, commits. Detects no-op updates to avoid spurious file writes.
- REWRITTEN: setup.go -- new directives: `zones-dir` (required),
`auto-commit` (default true), `git-author <name> <email>`. Dropped
`nameserver` and `persist`. Validates each declared zone has a file
on disk via os.Stat before CoreDNS finishes starting.
- REWRITTEN: setup_test.go -- 17 cases for the new grammar.
- REWRITTEN: update_test.go -- 11 cases using real temp zone files
via t.TempDir().
Total: 30 tests passing, 0 failures.
Next: Phase 2c (custom CoreDNS image, deploy, smoke test with nsupdate).
Replaces the Phase-1.3 refuseUpdate() stub with a real RFC 2136 handler.
Caddy via caddy-dns/rfc2136 can now inject and remove records.
UPDATE message handling (update.go):
- Zone section validation: must be exactly one SOA-typed record naming
a zone we're authoritative for. Returns FORMERR/NOTAUTH otherwise.
- Prerequisites (§3.2): name-exists, RRset-exists, name-NOT-exists,
RRset-NOT-exists semantics implemented. First failure short-circuits
with the spec's rcode (NXDOMAIN/NXRRSET/YXDOMAIN/YXRRSET).
- Updates (§3.4.2): add RR, delete RRset (CLASS=ANY+RDLEN=0), delete
all RRsets at name (CLASS=ANY+TYPE=ANY), delete specific RR (CLASS=
NONE).
- Apex SOA/NS protected: synthetic and cannot be added or removed via
UPDATE. Apex wipe (TYPE=ANY at apex) also refused.
- Default TTL applied to incoming records with TTL=0.
TSIG (tsig.go + setup.go):
- setup() now populates dnsserver.Config.TsigSecret so the underlying
dns.Server auto-verifies signatures via miekg/dns.
- checkTSIG() in ServeDNS gates UPDATEs: rejects if no TSIG, unknown
key name, algorithm-downgrade attempt, or w.TsigStatus() != nil.
- No TSIG keys configured → all UPDATEs refused (safety default).
- Algorithm pinning prevents downgrade attacks (e.g. forced HMAC-MD5).
Tests (update_test.go): 11 new cases covering happy paths and every
error rcode. Total: 35 top-level test passes, 0 failures.
ServeDNS dispatch now calls handleUpdate after auth gate. The
refuseUpdate() stub is gone. UPDATE end-to-end via nsupdate requires
the custom CoreDNS image (Phase 2) to verify TSIG plumbing on the
dns.Server side.
Sets up the package layout for a CoreDNS plugin that will accept RFC 2136
dynamic updates with TSIG authentication, primarily targeting self-hosted
ACME DNS-01 cert automation.
What this commit gives us:
- go.mod against coredns/caddy v1.1.4, coredns/coredns v1.14.3, miekg/dns v1.1.72
- plugin.go: RFC2136 struct + Handler interface (ServeDNS is pass-through)
- setup.go: init() registration + Corefile parser (skeleton — recognizes
tsig-key, ttl, persist directives but doesn't yet wire them)
- README.md, .gitignore
go build ./... clean. No tests yet — those come with Phase 1.2 alongside
the actual UPDATE handler and in-memory store.
Plan: ~/.claude/plans/dood-does-coredns-offer-enumerated-piglet.md