4 Commits

Author SHA1 Message Date
d9dad01798 H3/H4/H5: zone-integrity invariants
H3+H4 — Zone SOA invariant. After parsing, loadRRs enforces:
exactly one SOA, owned by the zone apex. Catches three failure modes
with a single guard:

  - Missing SOA (H4): a malformed line earlier in the file may have
    tripped miekg/dns's ZoneParser into dropping records without
    reporting an error via parser.Err(). If the SOA went missing, we
    refuse rather than treat the partial parse as authoritative.

  - Multiple SOAs (H3): zone files with accidental duplicate SOA
    records produce inconsistent zone state visible to AXFR clients.
    The old code's first-match SOA-bump would silently propagate the
    inconsistency. Now we refuse.

  - Non-apex SOA (H3): an SOA whose owner doesn't match the zone
    origin is either a parse error or a hand-edit mistake; bumping
    it would leave the real apex unchanged. Now we refuse.

assertSingleApexSOA returns a descriptive error so the failure mode
is actionable from logs alone.

H5 — MaxUint32 guard in bumpSerial. The old "+1 defensive advance"
branch would wrap to 0 if soa.Serial == MaxUint32, and downstream
secondaries per RFC 1982 §3.2 treat 0-after-MaxUint32 as "older"
(they refuse to AXFR and the zone goes dark). Now we explicitly check
and refuse with a loud message; operator must reset the serial
manually. Practical reach is zero for our deployment (10000 bumps/day
× 117 years would still fit uint32) but the defensive ceiling matters
for fuzz, hand-edit, or future code-path errors.

The full RFC 1982 wraparound-aware comparison was prototyped but
removed: it broke the legacy-format migration case where a tiny
non-CalVer serial (e.g., 12345) is "more than 2^31 distant" from a
new-format serial (~2.6B), which RFC 1982 reads as "going backwards"
and would block migration. Naive `>` is correct in practice; the
MaxUint32 case is the only real failure mode worth guarding.

New tests:
- TestBumpSerial_MaxUint32_RefusesWrap
- TestLoadRRs_NoSOA_Refused
- TestLoadRRs_MultipleSOAs_Refused
- TestLoadRRs_NonApexSOA_Refused
2026-05-22 21:25:35 -06:00
93ed180d8f H1/H2/M1: atomicity at the file boundary
H1 — Concurrent-modification detection. loadRRs now returns a
fileSnapshot capturing (mtime, size) at read time. handleUpdate calls
zf.checkUnchanged(snap) immediately before writeAtomic. If anything
modified the file between load and write — rsync push, manual edit,
`git checkout` — the UPDATE is refused with SERVFAIL. Caddy retries
with a fresh load. Protects against the CLAUDE.md-documented rsync
workflow racing the plugin.

H2 — Git commit-failure policy. The previous code logged at WARN and
continued, breaking the documented "file + git both updated" contract.
Now logs at ERROR with structured fields (zone, path, error, recovery
command) so operators discover the divergence. We do NOT roll back the
file write: by the time the commit fails, the auto plugin may have
already noticed the new mtime and reloaded; rolling back creates more
races than it solves. Recovery is `git -C <dir> status` + manual
commit.

M1 — exec.CommandContext with 10s timeout on git invocations. If git
hangs (NFS stall, gpg-sign prompt, broken pre-commit hook waiting on
stdin), the per-zone mutex would otherwise be held forever and queue
all subsequent UPDATEs. gitCommandTimeout caps the hang.

M2 deferred. Dropping the separate `git add` cleanly requires either
`-a` (wrong scope: auto-stages all tracked modifications) or `--include`
(still needs prior staging). The race window between add and commit is
theoretical for our setup (single-writer plugin + occasional `git
status`). M1's timeout already mitigates the worst hang case.

New tests:
- TestZoneFile_CheckUnchanged_DetectsExternalModification (H1)
2026-05-22 21:22:11 -06:00
8466f08780 Widen SOA serial counter: YYMMDDNNNN, 10000 bumps/day
The previous YYYYMMDDNN encoding capped at NN=99 (100 bumps/day) and
hard-failed UPDATEs once the day's counter was exhausted — confirmed
in production on 2026-05-22 when ACME activity across the supported.
systems zone hit the cap and SERVFAILed every subsequent UPDATE.

New format: YYMMDD*10000+NNNN. With 4-digit NNNN we get 10000/day, and
dropping the century keeps a 2026-dated serial (2,605,229,999 max) under
uint32's 4,294,967,295 ceiling. A 4-digit year (e.g., 20260522*10000)
would overflow uint32 — RFC 1035's SOA serial type bounds this.

Three behavior changes:

1. On NNNN=9999, roll forward to the next encoded day with NNNN=0001
   rather than erroring. The encoded date drifts ahead of wall time on
   heavy churn days and catches up on quiet days; monotonic ordering
   (the only DNS requirement) holds.

2. Future-encoded serials (from a prior rollover) are honoured — the
   previous "older date" branch downgraded them back to today*100+1,
   producing a backwards serial. This bug also tripped a manual
   workaround on the same day. Now: future encoded dates bump their
   own NNNN.

3. Legacy YYYYMMDDNN serials migrate automatically on first bump. A
   value like 2026052299 (~2.026B) is numerically smaller than today's
   new-format minimum 2605220001 (~2.605B), so the older-or-unparseable
   branch fires and rewrites in place. New > old, so AXFR receivers
   treat it as a clean forward bump.

Tests cover same-day, rollover, future-encoded no-regress, legacy
migration, non-CalVer reset, and no-SOA error.
2026-05-22 11:51:45 -06:00
0f28127284 Phase 2b: refactor to file-backed storage; UPDATE writes zones/*.zone
Major architectural pivot per the user's "RFC 2136 mechanism for the
existing zonefiles, not a new in-memory thing" framing. The plugin no
longer maintains its own in-memory state OR serves any queries -- both
of those are now the auto plugin's job, reading the same zone files.

The plugin's sole responsibility is now: receive TSIG-authed UPDATE
messages, edit the matching zones/<zone>.zone file, bump the SOA
serial in CalVer (YYYYMMDDNN) form, and optionally auto-commit to git.

What changed:
- DELETED: store.go (in-memory recordStore), store_test.go (12 tests),
  plugin_test.go (10 ServeDNS query tests), old update_test.go.
- NEW: zonefile.go -- file-backed authority for one zone. loadRRs via
  miekg/dns zone parser; mutation helpers (lookupIn/nameExistsIn/
  removeRRsetFrom/removeRRFrom/removeNameFrom/addRRTo) on []dns.RR
  slices; bumpSerial with CalVer semantics + NN exhaustion handling;
  writeAtomic via temp-file rename; commit shells to `git add && git
  commit` with configurable author.
- NEW: zonefile_test.go -- 17 tests covering load/lookup/mutate/bump/
  write paths.
- REWRITTEN: plugin.go -- ServeDNS is now thin: UPDATE → TSIG → handler;
  everything else → Next. No synthetic SOA/NS, no query serving.
- REWRITTEN: update.go -- handleUpdate now opens the zoneFile, loads,
  applies (with prereq checks against the loaded RRs), bumps serial,
  writes, commits. Detects no-op updates to avoid spurious file writes.
- REWRITTEN: setup.go -- new directives: `zones-dir` (required),
  `auto-commit` (default true), `git-author <name> <email>`. Dropped
  `nameserver` and `persist`. Validates each declared zone has a file
  on disk via os.Stat before CoreDNS finishes starting.
- REWRITTEN: setup_test.go -- 17 cases for the new grammar.
- REWRITTEN: update_test.go -- 11 cases using real temp zone files
  via t.TempDir().

Total: 30 tests passing, 0 failures.

Next: Phase 2c (custom CoreDNS image, deploy, smoke test with nsupdate).
2026-05-21 11:26:50 -06:00