6 Commits

Author SHA1 Message Date
89993ca207 L1/L2/L4: cleanup + README operational guide
L1 — Replace hand-rolled atoi/parseUint with strconv.ParseUint wrapped
in mustParseUint. Hamilton's reasoning: the comment "strconv adds
overhead we don't need" is the Lauren-Bug shape — we already validated
the input. Until we hadn't, on a path we couldn't predict. Stdlib's
edge-case coverage is the safer default; the wrapper panics on
malformed input so any future regression surfaces in CI, not as a
silent 0 serial.

L2 — applyUpdate no longer mutates the caller's RR header TTL. miekg/
dns parses the UPDATE message into RRs the caller still owns; silently
rewriting hdr.Ttl was a hygiene smell with no current functional
consequence but a clear documentation issue. Now we dns.Copy() the RR
before any header mutation.

L4 — README expanded with an "Operational constraints" section
documenting the contracts and limits operators should understand
before relying on this in production:
  - Single-process atomicity only (with rsync-race mitigation)
  - Process-global MsgAcceptFunc override
  - No-op UPDATE doesn't bump SOA (with touch-UPDATE workaround)
  - SOA invariants enforced strictly (zero, multi, non-apex SOA all
    refused)
  - Serial counter NNNN=9999 rollover semantics
  - TSIG replay window dependency on miekg/dns default
  - Git commit failure logged at ERROR, not rolled back
  - Per-key rate limit knobs

Every constraint maps to a Hamilton review finding; documenting the
contract in operator-facing prose closes the gap between code and
expectation that the review identified.
2026-05-22 21:33:37 -06:00
6ab2b6af6d H6/H7/M3/M4/M7: hardening + behavior documentation
H6 — TSIG replay-window test. New TestCheckTSIG_BadStatus_Refused
verifies that when miekg/dns reports a TSIG verification failure via
ResponseWriter.TsigStatus (the channel for fudge-window violations,
bad MACs, expired timestamps), our plugin refuses. The fudge tolerance
itself is miekg/dns's default (300s); documented in tsig.go so
operators know the dependency.

H7 — No-op UPDATE policy: documented explicitly in update.go. We do
NOT bump the SOA on a no-op (deduped) UPDATE — forcing downstream
secondaries to AXFR identical content wastes bandwidth and contradicts
RFC 2136's intent. Callers wanting to force a serial bump can send a
throwaway add+delete pair (touch-UPDATE pattern).

M3 — Delete-by-exact-match ignores TTL and class per RFC 2136 §2.5.4.
The previous rr.String() comparison included TTL, so an UPDATE with
CLASS=NONE TTL=0 (the protocol-required encoding for a delete) failed
to match stored RRs at CLASS=IN with non-zero TTL. Now we normalize
both sides (TTL=0, class=IN) before invoking dns.IsDuplicate.

M4 — validateZoneFiles now actually parses each zone at startup
(loadRRs invocation). Previously it only stat()'d the file; corrupt
zone content sailed through startup and produced SERVFAIL on the first
UPDATE with no startup-time signal. Combined with H3+H4's invariant
checks, this turns silent zone corruption into immediate startup
failure.

M7 — Commit-message sanitization. RR names are attacker-controlled
(TSIG only authenticates the sender; the payload is hostile by
default). Control characters in commit messages could inject newlines
into git log or ANSI sequences into downstream log renderers. New
sanitizeForCommitMessage escapes \n, \r, \t, and other C0 controls.

New tests:
- TestCheckTSIG_BadStatus_Refused (H6)
- TestUpdate_DeleteRR_IgnoresTTL (M3)
- TestSanitizeForCommitMessage (M7)
2026-05-22 21:29:13 -06:00
d9dad01798 H3/H4/H5: zone-integrity invariants
H3+H4 — Zone SOA invariant. After parsing, loadRRs enforces:
exactly one SOA, owned by the zone apex. Catches three failure modes
with a single guard:

  - Missing SOA (H4): a malformed line earlier in the file may have
    tripped miekg/dns's ZoneParser into dropping records without
    reporting an error via parser.Err(). If the SOA went missing, we
    refuse rather than treat the partial parse as authoritative.

  - Multiple SOAs (H3): zone files with accidental duplicate SOA
    records produce inconsistent zone state visible to AXFR clients.
    The old code's first-match SOA-bump would silently propagate the
    inconsistency. Now we refuse.

  - Non-apex SOA (H3): an SOA whose owner doesn't match the zone
    origin is either a parse error or a hand-edit mistake; bumping
    it would leave the real apex unchanged. Now we refuse.

assertSingleApexSOA returns a descriptive error so the failure mode
is actionable from logs alone.

H5 — MaxUint32 guard in bumpSerial. The old "+1 defensive advance"
branch would wrap to 0 if soa.Serial == MaxUint32, and downstream
secondaries per RFC 1982 §3.2 treat 0-after-MaxUint32 as "older"
(they refuse to AXFR and the zone goes dark). Now we explicitly check
and refuse with a loud message; operator must reset the serial
manually. Practical reach is zero for our deployment (10000 bumps/day
× 117 years would still fit uint32) but the defensive ceiling matters
for fuzz, hand-edit, or future code-path errors.

The full RFC 1982 wraparound-aware comparison was prototyped but
removed: it broke the legacy-format migration case where a tiny
non-CalVer serial (e.g., 12345) is "more than 2^31 distant" from a
new-format serial (~2.6B), which RFC 1982 reads as "going backwards"
and would block migration. Naive `>` is correct in practice; the
MaxUint32 case is the only real failure mode worth guarding.

New tests:
- TestBumpSerial_MaxUint32_RefusesWrap
- TestLoadRRs_NoSOA_Refused
- TestLoadRRs_MultipleSOAs_Refused
- TestLoadRRs_NonApexSOA_Refused
2026-05-22 21:25:35 -06:00
93ed180d8f H1/H2/M1: atomicity at the file boundary
H1 — Concurrent-modification detection. loadRRs now returns a
fileSnapshot capturing (mtime, size) at read time. handleUpdate calls
zf.checkUnchanged(snap) immediately before writeAtomic. If anything
modified the file between load and write — rsync push, manual edit,
`git checkout` — the UPDATE is refused with SERVFAIL. Caddy retries
with a fresh load. Protects against the CLAUDE.md-documented rsync
workflow racing the plugin.

H2 — Git commit-failure policy. The previous code logged at WARN and
continued, breaking the documented "file + git both updated" contract.
Now logs at ERROR with structured fields (zone, path, error, recovery
command) so operators discover the divergence. We do NOT roll back the
file write: by the time the commit fails, the auto plugin may have
already noticed the new mtime and reloaded; rolling back creates more
races than it solves. Recovery is `git -C <dir> status` + manual
commit.

M1 — exec.CommandContext with 10s timeout on git invocations. If git
hangs (NFS stall, gpg-sign prompt, broken pre-commit hook waiting on
stdin), the per-zone mutex would otherwise be held forever and queue
all subsequent UPDATEs. gitCommandTimeout caps the hang.

M2 deferred. Dropping the separate `git add` cleanly requires either
`-a` (wrong scope: auto-stages all tracked modifications) or `--include`
(still needs prior staging). The race window between add and commit is
theoretical for our setup (single-writer plugin + occasional `git
status`). M1's timeout already mitigates the worst hang case.

New tests:
- TestZoneFile_CheckUnchanged_DetectsExternalModification (H1)
2026-05-22 21:22:11 -06:00
8466f08780 Widen SOA serial counter: YYMMDDNNNN, 10000 bumps/day
The previous YYYYMMDDNN encoding capped at NN=99 (100 bumps/day) and
hard-failed UPDATEs once the day's counter was exhausted — confirmed
in production on 2026-05-22 when ACME activity across the supported.
systems zone hit the cap and SERVFAILed every subsequent UPDATE.

New format: YYMMDD*10000+NNNN. With 4-digit NNNN we get 10000/day, and
dropping the century keeps a 2026-dated serial (2,605,229,999 max) under
uint32's 4,294,967,295 ceiling. A 4-digit year (e.g., 20260522*10000)
would overflow uint32 — RFC 1035's SOA serial type bounds this.

Three behavior changes:

1. On NNNN=9999, roll forward to the next encoded day with NNNN=0001
   rather than erroring. The encoded date drifts ahead of wall time on
   heavy churn days and catches up on quiet days; monotonic ordering
   (the only DNS requirement) holds.

2. Future-encoded serials (from a prior rollover) are honoured — the
   previous "older date" branch downgraded them back to today*100+1,
   producing a backwards serial. This bug also tripped a manual
   workaround on the same day. Now: future encoded dates bump their
   own NNNN.

3. Legacy YYYYMMDDNN serials migrate automatically on first bump. A
   value like 2026052299 (~2.026B) is numerically smaller than today's
   new-format minimum 2605220001 (~2.605B), so the older-or-unparseable
   branch fires and rewrites in place. New > old, so AXFR receivers
   treat it as a clean forward bump.

Tests cover same-day, rollover, future-encoded no-regress, legacy
migration, non-CalVer reset, and no-SOA error.
2026-05-22 11:51:45 -06:00
0f28127284 Phase 2b: refactor to file-backed storage; UPDATE writes zones/*.zone
Major architectural pivot per the user's "RFC 2136 mechanism for the
existing zonefiles, not a new in-memory thing" framing. The plugin no
longer maintains its own in-memory state OR serves any queries -- both
of those are now the auto plugin's job, reading the same zone files.

The plugin's sole responsibility is now: receive TSIG-authed UPDATE
messages, edit the matching zones/<zone>.zone file, bump the SOA
serial in CalVer (YYYYMMDDNN) form, and optionally auto-commit to git.

What changed:
- DELETED: store.go (in-memory recordStore), store_test.go (12 tests),
  plugin_test.go (10 ServeDNS query tests), old update_test.go.
- NEW: zonefile.go -- file-backed authority for one zone. loadRRs via
  miekg/dns zone parser; mutation helpers (lookupIn/nameExistsIn/
  removeRRsetFrom/removeRRFrom/removeNameFrom/addRRTo) on []dns.RR
  slices; bumpSerial with CalVer semantics + NN exhaustion handling;
  writeAtomic via temp-file rename; commit shells to `git add && git
  commit` with configurable author.
- NEW: zonefile_test.go -- 17 tests covering load/lookup/mutate/bump/
  write paths.
- REWRITTEN: plugin.go -- ServeDNS is now thin: UPDATE → TSIG → handler;
  everything else → Next. No synthetic SOA/NS, no query serving.
- REWRITTEN: update.go -- handleUpdate now opens the zoneFile, loads,
  applies (with prereq checks against the loaded RRs), bumps serial,
  writes, commits. Detects no-op updates to avoid spurious file writes.
- REWRITTEN: setup.go -- new directives: `zones-dir` (required),
  `auto-commit` (default true), `git-author <name> <email>`. Dropped
  `nameserver` and `persist`. Validates each declared zone has a file
  on disk via os.Stat before CoreDNS finishes starting.
- REWRITTEN: setup_test.go -- 17 cases for the new grammar.
- REWRITTEN: update_test.go -- 11 cases using real temp zone files
  via t.TempDir().

Total: 30 tests passing, 0 failures.

Next: Phase 2c (custom CoreDNS image, deploy, smoke test with nsupdate).
2026-05-21 11:26:50 -06:00