16 Commits

Author SHA1 Message Date
7367401734 Send DNS NOTIFY to secondaries after every UPDATE
Per RFC 1996, a master that mutates a zone SHOULD notify its
secondaries so they can immediately AXFR rather than wait for their
next SOA-refresh poll. Without this, propagation lag from UPDATE to
public DNS is bounded by the secondary's refresh interval (300s for
us) — which is borderline for ACME validation timing.

New Corefile directive:
    notify <host[:port]> [<host[:port]>...]

Targets accept bare hostnames (port 53 default), host:port, or
[ipv6]:port. The same list applies to every zone in the rfc2136
block.

Implementation: fire-and-forget UDP per target, each in its own
goroutine, capped by a 2s timeout. The UPDATE response to the client
is never held pending NOTIFY acks (RFC 1996 §4 explicitly decouples
them). Failures log at DEBUG only — a briefly-unreachable secondary
is normal and would otherwise spam logs.

Retires the external scripts/notify-secondaries.py workflow for any
deployment that wires the directive: secondaries now hear about
changes within seconds of the UPDATE landing, no cron or manual
invocation needed.

New tests:
- TestSendNotify_DeliversToTarget — packet arrives, opcode + zone correct
- TestSendNotify_NoTargets_NoCrash — empty list short-circuits
- TestSendNotify_BadTarget_LogsButDoesNotBlock — fire-and-forget timing
- TestNotifyOne_AppendsDefaultPort — host vs host:port normalization
v2026.05.22.3
2026-05-23 00:54:45 -06:00
89993ca207 L1/L2/L4: cleanup + README operational guide
L1 — Replace hand-rolled atoi/parseUint with strconv.ParseUint wrapped
in mustParseUint. Hamilton's reasoning: the comment "strconv adds
overhead we don't need" is the Lauren-Bug shape — we already validated
the input. Until we hadn't, on a path we couldn't predict. Stdlib's
edge-case coverage is the safer default; the wrapper panics on
malformed input so any future regression surfaces in CI, not as a
silent 0 serial.

L2 — applyUpdate no longer mutates the caller's RR header TTL. miekg/
dns parses the UPDATE message into RRs the caller still owns; silently
rewriting hdr.Ttl was a hygiene smell with no current functional
consequence but a clear documentation issue. Now we dns.Copy() the RR
before any header mutation.

L4 — README expanded with an "Operational constraints" section
documenting the contracts and limits operators should understand
before relying on this in production:
  - Single-process atomicity only (with rsync-race mitigation)
  - Process-global MsgAcceptFunc override
  - No-op UPDATE doesn't bump SOA (with touch-UPDATE workaround)
  - SOA invariants enforced strictly (zero, multi, non-apex SOA all
    refused)
  - Serial counter NNNN=9999 rollover semantics
  - TSIG replay window dependency on miekg/dns default
  - Git commit failure logged at ERROR, not rolled back
  - Per-key rate limit knobs

Every constraint maps to a Hamilton review finding; documenting the
contract in operator-facing prose closes the gap between code and
expectation that the review identified.
v2026.05.22.2
2026-05-22 21:33:37 -06:00
8d1477350a M8: per-key UPDATE rate limiting (token bucket)
Hamilton M8: a compromised TSIG key — or a misconfigured client
retrying forever — must not be able to drive unbounded UPDATE traffic.
Each UPDATE costs disk IOPS, a git commit, and a slot in the SOA
serial counter (now 9999/day per zone). Without a cap, a few hours of
runaway traffic could exhaust the SOA serial counter and brick the
zone for the day.

Implementation: per-key token bucket in ratelimit.go. Default 100
tokens / 60 seconds. New keys start full so legitimate clients see no
delay at boot. Refill is continuous, capped at the burst value.

Configurable in Corefile:
  rate-limit off                    # disable entirely
  rate-limit <burst> <period-secs>  # e.g., rate-limit 200 60

Enforcement runs in ServeDNS after TSIG verification — a request that
fails auth doesn't consume a token (and a forged TSIG can't be used to
deny service to a real key holder, since we never reached the rate
check).

100/min is well above ACME's needs: a worst-case full-renewal storm
across our ~84 zones emits maybe 200 UPDATEs total over several
minutes. Anything beyond is suspicious by definition.

New tests covering: first-call allowed, burst exhaustion, refill
behavior, per-key isolation, refill-cap (no idle-accumulation
overflow).
2026-05-22 21:31:17 -06:00
6ab2b6af6d H6/H7/M3/M4/M7: hardening + behavior documentation
H6 — TSIG replay-window test. New TestCheckTSIG_BadStatus_Refused
verifies that when miekg/dns reports a TSIG verification failure via
ResponseWriter.TsigStatus (the channel for fudge-window violations,
bad MACs, expired timestamps), our plugin refuses. The fudge tolerance
itself is miekg/dns's default (300s); documented in tsig.go so
operators know the dependency.

H7 — No-op UPDATE policy: documented explicitly in update.go. We do
NOT bump the SOA on a no-op (deduped) UPDATE — forcing downstream
secondaries to AXFR identical content wastes bandwidth and contradicts
RFC 2136's intent. Callers wanting to force a serial bump can send a
throwaway add+delete pair (touch-UPDATE pattern).

M3 — Delete-by-exact-match ignores TTL and class per RFC 2136 §2.5.4.
The previous rr.String() comparison included TTL, so an UPDATE with
CLASS=NONE TTL=0 (the protocol-required encoding for a delete) failed
to match stored RRs at CLASS=IN with non-zero TTL. Now we normalize
both sides (TTL=0, class=IN) before invoking dns.IsDuplicate.

M4 — validateZoneFiles now actually parses each zone at startup
(loadRRs invocation). Previously it only stat()'d the file; corrupt
zone content sailed through startup and produced SERVFAIL on the first
UPDATE with no startup-time signal. Combined with H3+H4's invariant
checks, this turns silent zone corruption into immediate startup
failure.

M7 — Commit-message sanitization. RR names are attacker-controlled
(TSIG only authenticates the sender; the payload is hostile by
default). Control characters in commit messages could inject newlines
into git log or ANSI sequences into downstream log renderers. New
sanitizeForCommitMessage escapes \n, \r, \t, and other C0 controls.

New tests:
- TestCheckTSIG_BadStatus_Refused (H6)
- TestUpdate_DeleteRR_IgnoresTTL (M3)
- TestSanitizeForCommitMessage (M7)
2026-05-22 21:29:13 -06:00
d9dad01798 H3/H4/H5: zone-integrity invariants
H3+H4 — Zone SOA invariant. After parsing, loadRRs enforces:
exactly one SOA, owned by the zone apex. Catches three failure modes
with a single guard:

  - Missing SOA (H4): a malformed line earlier in the file may have
    tripped miekg/dns's ZoneParser into dropping records without
    reporting an error via parser.Err(). If the SOA went missing, we
    refuse rather than treat the partial parse as authoritative.

  - Multiple SOAs (H3): zone files with accidental duplicate SOA
    records produce inconsistent zone state visible to AXFR clients.
    The old code's first-match SOA-bump would silently propagate the
    inconsistency. Now we refuse.

  - Non-apex SOA (H3): an SOA whose owner doesn't match the zone
    origin is either a parse error or a hand-edit mistake; bumping
    it would leave the real apex unchanged. Now we refuse.

assertSingleApexSOA returns a descriptive error so the failure mode
is actionable from logs alone.

H5 — MaxUint32 guard in bumpSerial. The old "+1 defensive advance"
branch would wrap to 0 if soa.Serial == MaxUint32, and downstream
secondaries per RFC 1982 §3.2 treat 0-after-MaxUint32 as "older"
(they refuse to AXFR and the zone goes dark). Now we explicitly check
and refuse with a loud message; operator must reset the serial
manually. Practical reach is zero for our deployment (10000 bumps/day
× 117 years would still fit uint32) but the defensive ceiling matters
for fuzz, hand-edit, or future code-path errors.

The full RFC 1982 wraparound-aware comparison was prototyped but
removed: it broke the legacy-format migration case where a tiny
non-CalVer serial (e.g., 12345) is "more than 2^31 distant" from a
new-format serial (~2.6B), which RFC 1982 reads as "going backwards"
and would block migration. Naive `>` is correct in practice; the
MaxUint32 case is the only real failure mode worth guarding.

New tests:
- TestBumpSerial_MaxUint32_RefusesWrap
- TestLoadRRs_NoSOA_Refused
- TestLoadRRs_MultipleSOAs_Refused
- TestLoadRRs_NonApexSOA_Refused
2026-05-22 21:25:35 -06:00
93ed180d8f H1/H2/M1: atomicity at the file boundary
H1 — Concurrent-modification detection. loadRRs now returns a
fileSnapshot capturing (mtime, size) at read time. handleUpdate calls
zf.checkUnchanged(snap) immediately before writeAtomic. If anything
modified the file between load and write — rsync push, manual edit,
`git checkout` — the UPDATE is refused with SERVFAIL. Caddy retries
with a fresh load. Protects against the CLAUDE.md-documented rsync
workflow racing the plugin.

H2 — Git commit-failure policy. The previous code logged at WARN and
continued, breaking the documented "file + git both updated" contract.
Now logs at ERROR with structured fields (zone, path, error, recovery
command) so operators discover the divergence. We do NOT roll back the
file write: by the time the commit fails, the auto plugin may have
already noticed the new mtime and reloaded; rolling back creates more
races than it solves. Recovery is `git -C <dir> status` + manual
commit.

M1 — exec.CommandContext with 10s timeout on git invocations. If git
hangs (NFS stall, gpg-sign prompt, broken pre-commit hook waiting on
stdin), the per-zone mutex would otherwise be held forever and queue
all subsequent UPDATEs. gitCommandTimeout caps the hang.

M2 deferred. Dropping the separate `git add` cleanly requires either
`-a` (wrong scope: auto-stages all tracked modifications) or `--include`
(still needs prior staging). The race window between add and commit is
theoretical for our setup (single-writer plugin + occasional `git
status`). M1's timeout already mitigates the worst hang case.

New tests:
- TestZoneFile_CheckUnchanged_DetectsExternalModification (H1)
2026-05-22 21:22:11 -06:00
8e421f925e C1/C2/M9: tighten security boundary at handleUpdate
C1 — Document the process-global MsgAcceptFunc mutation:
CoreDNS 1.14.3 doesn't expose per-Config MsgAcceptFunc (server.go:159
hardcodes the dns.Server struct), so the override has to be global. The
init()-level comment now explains the operational consequences in
detail, and setup() emits a loud INFO log calling out the global scope
for operator audit. Upstream support for per-Config MsgAcceptFunc would
let us delete the whole stanza.

C2 — handleUpdate now requires the caller to assert TSIG verification
via an explicit `verified bool` parameter. The security contract is
encoded in the function signature, not in convention. ServeDNS passes
verified=true after checkTSIG succeeds; verified=false produces an
immediate Refused with no state mutation. Future internal callers
(NOTIFY relay, admin RPC, refactor) physically cannot reach the
mutation code without proving the request was authenticated.

M9 — Don't sign TSIG-failure rejection responses. Per Hamilton's
finding, signing a rejection with the named key attests "yes, this
server holds that key" — useful intel for an attacker probing key
existence. Unsigned Refused is the right shape: nsupdate sees "no TSIG
on reply" and treats as auth failure, which is what actually happened.

New test TestUpdate_UnverifiedCaller_Refused proves the C2 contract:
handleUpdate(w, msg, false) refuses, zone file unchanged.
2026-05-22 21:18:47 -06:00
8466f08780 Widen SOA serial counter: YYMMDDNNNN, 10000 bumps/day
The previous YYYYMMDDNN encoding capped at NN=99 (100 bumps/day) and
hard-failed UPDATEs once the day's counter was exhausted — confirmed
in production on 2026-05-22 when ACME activity across the supported.
systems zone hit the cap and SERVFAILed every subsequent UPDATE.

New format: YYMMDD*10000+NNNN. With 4-digit NNNN we get 10000/day, and
dropping the century keeps a 2026-dated serial (2,605,229,999 max) under
uint32's 4,294,967,295 ceiling. A 4-digit year (e.g., 20260522*10000)
would overflow uint32 — RFC 1035's SOA serial type bounds this.

Three behavior changes:

1. On NNNN=9999, roll forward to the next encoded day with NNNN=0001
   rather than erroring. The encoded date drifts ahead of wall time on
   heavy churn days and catches up on quiet days; monotonic ordering
   (the only DNS requirement) holds.

2. Future-encoded serials (from a prior rollover) are honoured — the
   previous "older date" branch downgraded them back to today*100+1,
   producing a backwards serial. This bug also tripped a manual
   workaround on the same day. Now: future encoded dates bump their
   own NNNN.

3. Legacy YYYYMMDDNN serials migrate automatically on first bump. A
   value like 2026052299 (~2.026B) is numerically smaller than today's
   new-format minimum 2605220001 (~2.605B), so the older-or-unparseable
   branch fires and rewrites in place. New > old, so AXFR receivers
   treat it as a clean forward bump.

Tests cover same-day, rollover, future-encoded no-regress, legacy
migration, non-CalVer reset, and no-SOA error.
v2026.05.22.1
2026-05-22 11:51:45 -06:00
6268e6eafd Sign responses to TSIG-signed UPDATEs (RFC 8945 §5.4.2)
When a request arrives with TSIG, attach a TSIG record to the response
so dns.ResponseWriter computes the MAC at write time using the secret
in TsigSecret. Without this, BIND nsupdate complains "expected a TSIG
or SIG(0)" on every UPDATE, even when the update applies successfully.

Two response paths fixed:
  - handleUpdate success/per-rcode replies (update.go)
  - ServeDNS rejection when TSIG verification fails (plugin.go)

The new helper in tsig.go is a no-op for unsigned requests. Unknown
keys still silently skip signing — we can't authenticate to a peer we
don't share a key with.

Tests verify both branches: signed request → response carries matching
TSIG (key name + algorithm); unsigned request → response stays plain.
v2026.05.22
2026-05-22 09:24:12 -06:00
1fe95e3f6c Revert diagnostic ServeDNS log line 2026-05-21 11:46:49 -06:00
19e93e39e1 DIAGNOSTIC: log every ServeDNS call (will revert) 2026-05-21 11:35:20 -06:00
0f28127284 Phase 2b: refactor to file-backed storage; UPDATE writes zones/*.zone
Major architectural pivot per the user's "RFC 2136 mechanism for the
existing zonefiles, not a new in-memory thing" framing. The plugin no
longer maintains its own in-memory state OR serves any queries -- both
of those are now the auto plugin's job, reading the same zone files.

The plugin's sole responsibility is now: receive TSIG-authed UPDATE
messages, edit the matching zones/<zone>.zone file, bump the SOA
serial in CalVer (YYYYMMDDNN) form, and optionally auto-commit to git.

What changed:
- DELETED: store.go (in-memory recordStore), store_test.go (12 tests),
  plugin_test.go (10 ServeDNS query tests), old update_test.go.
- NEW: zonefile.go -- file-backed authority for one zone. loadRRs via
  miekg/dns zone parser; mutation helpers (lookupIn/nameExistsIn/
  removeRRsetFrom/removeRRFrom/removeNameFrom/addRRTo) on []dns.RR
  slices; bumpSerial with CalVer semantics + NN exhaustion handling;
  writeAtomic via temp-file rename; commit shells to `git add && git
  commit` with configurable author.
- NEW: zonefile_test.go -- 17 tests covering load/lookup/mutate/bump/
  write paths.
- REWRITTEN: plugin.go -- ServeDNS is now thin: UPDATE → TSIG → handler;
  everything else → Next. No synthetic SOA/NS, no query serving.
- REWRITTEN: update.go -- handleUpdate now opens the zoneFile, loads,
  applies (with prereq checks against the loaded RRs), bumps serial,
  writes, commits. Detects no-op updates to avoid spurious file writes.
- REWRITTEN: setup.go -- new directives: `zones-dir` (required),
  `auto-commit` (default true), `git-author <name> <email>`. Dropped
  `nameserver` and `persist`. Validates each declared zone has a file
  on disk via os.Stat before CoreDNS finishes starting.
- REWRITTEN: setup_test.go -- 17 cases for the new grammar.
- REWRITTEN: update_test.go -- 11 cases using real temp zone files
  via t.TempDir().

Total: 30 tests passing, 0 failures.

Next: Phase 2c (custom CoreDNS image, deploy, smoke test with nsupdate).
2026-05-21 11:26:50 -06:00
1d2d919728 Phase 1.4: UPDATE opcode handler + TSIG verification
Replaces the Phase-1.3 refuseUpdate() stub with a real RFC 2136 handler.
Caddy via caddy-dns/rfc2136 can now inject and remove records.

UPDATE message handling (update.go):
- Zone section validation: must be exactly one SOA-typed record naming
  a zone we're authoritative for. Returns FORMERR/NOTAUTH otherwise.
- Prerequisites (§3.2): name-exists, RRset-exists, name-NOT-exists,
  RRset-NOT-exists semantics implemented. First failure short-circuits
  with the spec's rcode (NXDOMAIN/NXRRSET/YXDOMAIN/YXRRSET).
- Updates (§3.4.2): add RR, delete RRset (CLASS=ANY+RDLEN=0), delete
  all RRsets at name (CLASS=ANY+TYPE=ANY), delete specific RR (CLASS=
  NONE).
- Apex SOA/NS protected: synthetic and cannot be added or removed via
  UPDATE. Apex wipe (TYPE=ANY at apex) also refused.
- Default TTL applied to incoming records with TTL=0.

TSIG (tsig.go + setup.go):
- setup() now populates dnsserver.Config.TsigSecret so the underlying
  dns.Server auto-verifies signatures via miekg/dns.
- checkTSIG() in ServeDNS gates UPDATEs: rejects if no TSIG, unknown
  key name, algorithm-downgrade attempt, or w.TsigStatus() != nil.
- No TSIG keys configured → all UPDATEs refused (safety default).
- Algorithm pinning prevents downgrade attacks (e.g. forced HMAC-MD5).

Tests (update_test.go): 11 new cases covering happy paths and every
error rcode. Total: 35 top-level test passes, 0 failures.

ServeDNS dispatch now calls handleUpdate after auth gate. The
refuseUpdate() stub is gone. UPDATE end-to-end via nsupdate requires
the custom CoreDNS image (Phase 2) to verify TSIG plumbing on the
dns.Server side.
2026-05-21 10:51:18 -06:00
1cca9a5aa7 Phase 1.3: in-memory store + ServeDNS query dispatch
ServeDNS now answers authoritatively for the configured zone(s):
- Apex SOA → synthetic SOA (serial = store generation counter)
- Apex NS  → synthetic NS pointing at p.Nameserver
- In-store lookups for any qtype
- NODATA vs NXDOMAIN correctly distinguished (SOA in authority section)
- UPDATE opcode → REFUSED (Phase 1.4 implements properly)
- Queries outside our zones pass through to Next

Added:
- store.go: recordStore with sync.RWMutex + atomic generation counter.
  Operations: Add (de-dupes), RemoveRRset, RemoveRR, RemoveName, Lookup
  (returns a copy so callers can't corrupt internal state), NameExists.
  All keyed on canonical lowercase + trailing-dot names.
- plugin.go: ServeDNS dispatch, findZone (longest-suffix match),
  syntheticSOA, syntheticNS. New Nameserver field.
- setup.go: nameserver directive. Default Nameserver = first zone apex.
  Store initialised at parse time.
- store_test.go: 12 unit tests covering add/dedupe/remove/lookup/
  generation/case-insensitivity/copy-safety.
- plugin_test.go: 10 dispatch tests covering pass-through, apex
  synthetics, in-store lookups, NXDOMAIN/NODATA semantics, UPDATE
  refusal, findZone longest-suffix-wins and case behavior.
- setup_test.go: 3 new cases for the nameserver directive + store init.

Total: 38 tests passing.

Module: git.supported.systems/rsp2k/coredns-rfc2136
2026-05-21 10:37:48 -06:00
eba6313ec0 Phase 1.2: wire parser → typed config + 13 unit tests
The Corefile parser now fully populates typed fields on RFC2136 instead
of just recognising directives. Validation happens at parse-time so
configuration errors fail loud at CoreDNS startup rather than silent at
request time.

Added:
- config.go: tsigKey type, TSIG algorithm allowlist (rejects HMAC-MD5
  deliberately), base64 secret decoder with 8-byte minimum length check,
  canonical-key-name normalisation (lowercase + trailing dot).
- plugin.go: RFC2136 struct now carries TSIGKeys map, TTL uint32,
  PersistPath string. DefaultTTL=60.
- setup.go: parse() validates and stores tsig-key/ttl/persist directives.
  Duplicate key names rejected. Multiple TSIG keys allowed (for rotation).
  At-least-one-zone is enforced.
- setup_test.go: 13 table-driven cases (5 happy + 8 error paths) using
  caddy.NewTestController. All pass.

ServeDNS still passes through — UPDATE handling lands in Phase 1.4.

Module path: git.supported.systems/rsp2k/coredns-rfc2136
2026-05-21 10:31:22 -06:00
e9d37f483c Initial commit: plugin skeleton, compiles against CoreDNS 1.14.3
Sets up the package layout for a CoreDNS plugin that will accept RFC 2136
dynamic updates with TSIG authentication, primarily targeting self-hosted
ACME DNS-01 cert automation.

What this commit gives us:
- go.mod against coredns/caddy v1.1.4, coredns/coredns v1.14.3, miekg/dns v1.1.72
- plugin.go: RFC2136 struct + Handler interface (ServeDNS is pass-through)
- setup.go: init() registration + Corefile parser (skeleton — recognizes
  tsig-key, ttl, persist directives but doesn't yet wire them)
- README.md, .gitignore

go build ./... clean. No tests yet — those come with Phase 1.2 alongside
the actual UPDATE handler and in-memory store.

Plan: ~/.claude/plans/dood-does-coredns-offer-enumerated-piglet.md
2026-05-20 18:25:36 -06:00