L1/L2/L4: cleanup + README operational guide

L1 — Replace hand-rolled atoi/parseUint with strconv.ParseUint wrapped
in mustParseUint. Hamilton's reasoning: the comment "strconv adds
overhead we don't need" is the Lauren-Bug shape — we already validated
the input. Until we hadn't, on a path we couldn't predict. Stdlib's
edge-case coverage is the safer default; the wrapper panics on
malformed input so any future regression surfaces in CI, not as a
silent 0 serial.

L2 — applyUpdate no longer mutates the caller's RR header TTL. miekg/
dns parses the UPDATE message into RRs the caller still owns; silently
rewriting hdr.Ttl was a hygiene smell with no current functional
consequence but a clear documentation issue. Now we dns.Copy() the RR
before any header mutation.

L4 — README expanded with an "Operational constraints" section
documenting the contracts and limits operators should understand
before relying on this in production:
  - Single-process atomicity only (with rsync-race mitigation)
  - Process-global MsgAcceptFunc override
  - No-op UPDATE doesn't bump SOA (with touch-UPDATE workaround)
  - SOA invariants enforced strictly (zero, multi, non-apex SOA all
    refused)
  - Serial counter NNNN=9999 rollover semantics
  - TSIG replay window dependency on miekg/dns default
  - Git commit failure logged at ERROR, not rolled back
  - Per-key rate limit knobs

Every constraint maps to a Hamilton review finding; documenting the
contract in operator-facing prose closes the gap between code and
expectation that the review identified.
This commit is contained in:
Ryan Malloy 2026-05-22 21:33:37 -06:00
parent 8d1477350a
commit 89993ca207
3 changed files with 120 additions and 28 deletions

101
README.md
View File

@ -33,17 +33,24 @@ messages — zero static zone-file churn.
## Status
**Phase 1 (skeleton)**: compiles, registers with CoreDNS, parses the
Corefile directive. Does not yet handle UPDATE messages or serve any
records. ServeDNS is a pass-through. See `phases.md` for the roadmap.
**v2026.05.22.2**: production-ready. Handles UPDATE messages against
file-backed zones, TSIG-authenticates, bumps SOA serial in CalVer
YYMMDD*10000+NNNN form, atomically writes the zone file, optionally
git-commits each change for an audit trail. Designed to coexist with
CoreDNS's `auto` plugin (which serves queries from the same zone files
on its reload cycle).
## Configuration
```
rfc2136 <zone> [<zone>...] {
tsig-key <key-name> <algorithm> <base64-secret>
ttl <seconds>
persist <path>
zones-dir <path> # required
tsig-key <name> <algorithm> <base64-secret> # may repeat
ttl <seconds> # default 60
auto-commit <true|false> # default true
git-author <name> <email> # optional
rate-limit <burst> <period-seconds> # default 100 / 60s
rate-limit off # disable rate-limit
}
```
@ -52,14 +59,96 @@ Example:
```
.:53 auth.example.com {
rfc2136 auth.example.com {
zones-dir /var/lib/coredns/zones
tsig-key acme-key. hmac-sha256 BASE64SECRET==
ttl 60
auto-commit true
git-author "coredns-rfc2136" "rfc2136@coredns.example.com"
}
auto {
directory /var/lib/coredns/zones (.*)\.zone {1}
reload 30s
}
errors
log
}
```
## Operational constraints
A few behaviors operators should know before relying on this plugin:
### Single-process atomicity only
The per-zone mutex serializes UPDATEs *within one CoreDNS process*. It
does NOT coordinate with external file edits. If you `rsync` a zone
file from a workstation while the plugin is mid-UPDATE, you get a
race. The plugin defends against this with a snapshot-and-recheck:
loadRRs captures (mtime, size), and immediately before writing back,
we re-stat; if the file changed, the UPDATE is refused with SERVFAIL
and the client (Caddy etc.) retries on a fresh load. The window is
narrow but non-zero.
**Recommendation**: don't rsync zone files into a directory the plugin
is actively writing to. If you must, expect occasional SERVFAILs that
resolve on retry.
### MsgAcceptFunc is process-global
CoreDNS 1.14.3 doesn't expose a per-Config `MsgAcceptFunc`, so this
plugin overrides the miekg/dns package-level default at `init()` time.
**Every server block** in the process will accept the UPDATE opcode
at the wire layer — but only blocks with `rfc2136` in their plugin
chain do anything useful with it (others pass through and return
FormatError). The actual security boundary is TSIG, enforced both in
`ServeDNS` and as a defense-in-depth check inside `handleUpdate`.
### No-op UPDATEs do not bump the SOA serial
If an UPDATE adds an RR that's already present (deduped per RFC 2136
§3.4.2.2) or deletes one that doesn't exist, the file is not rewritten
and the SOA serial is not advanced. We return NOERROR. Downstream
secondaries are not asked to AXFR for a no-change.
If you need to force a serial bump (rare), send a touch-UPDATE: add a
throwaway RR then delete it.
### SOA invariants are enforced strictly
`loadRRs` refuses zone files with: zero SOAs, multiple SOAs, or an
SOA whose owner doesn't match the zone apex. Both at startup (via
`validateZoneFiles`) and on every UPDATE. Zone-file corruption fails
loud at boot rather than mysteriously on first ACME activity.
### Serial counter rolls over at NNNN=9999
Format is `YYMMDD*10000 + NNNN`. At NNNN=9999, the next bump rolls to
the next encoded day with NNNN=0001. On heavy days the encoded date
drifts ahead of wall time; on quiet days it catches back up. Monotonic
ordering (the only DNS requirement) holds. uint32 won't wrap for ~117
years at full 10000/day burn.
### TSIG replay window is miekg/dns's default (currently 300s)
The fudge window enforced by miekg/dns's TsigVerify is what gates
replay. If miekg/dns ever changes its default, this plugin's behavior
changes with it. A future enhancement is `tsig-fudge` as a Corefile
directive.
### Git commit failure is logged at ERROR, not rolled back
If `git commit` fails after a successful `writeAtomic`, the zone file
is correct but the audit trail diverges. We log at ERROR with a
recovery hint (`git -C <dir> status` + manual commit). We do NOT roll
back the file write — the auto plugin may have already noticed the
new mtime, and rolling back creates more races than it solves.
### Per-key rate limit
UPDATE traffic is token-bucket capped per TSIG key. Default 100
UPDATEs per 60 seconds. ACME storms are well within this; anything
beyond is suspicious. Tune via `rate-limit <burst> <period>`.
## Building
This plugin is consumed by a custom CoreDNS build via `plugin.cfg`:

View File

@ -297,8 +297,13 @@ func applyUpdate(zone string, defaultTTL uint32, rrs []dns.RR, rr dns.RR) ([]dns
log.Debugf("apex %s add refused", dns.TypeToString[hdr.Rrtype])
return rrs, dns.RcodeRefused, false
}
// Hamilton L2: don't mutate the caller's RR header. miekg/dns
// parses the UPDATE message into RRs the caller still owns;
// silently rewriting their TTL is a hygiene smell. Copy first,
// then apply our default if needed.
if hdr.Ttl == 0 {
hdr.Ttl = defaultTTL
rr = dns.Copy(rr)
rr.Header().Ttl = defaultTTL
}
before := len(rrs)
rrs = addRRTo(rrs, rr)

View File

@ -7,6 +7,7 @@ import (
"os"
"os/exec"
"path/filepath"
"strconv"
"strings"
"sync"
"time"
@ -346,9 +347,9 @@ func bumpSerial(rrs []dns.RR, now time.Time) error {
// serial that happens to look like a valid YYMMDD prefix but is in
// the past, which is the same handling: jump to today).
if curDate := cur[:6]; isValidYYMMDD(curDate) && curDate >= today {
nnnn := atoi(cur[6:10])
nnnn := int(mustParseUint(cur[6:10]))
if nnnn < 9999 {
soa.Serial = uint32(parseUint(curDate)*serialCounterMul + uint64(nnnn+1))
soa.Serial = uint32(mustParseUint(curDate)*serialCounterMul + uint64(nnnn+1))
return nil
}
// NNNN=9999: roll to next encoded day, NNNN=0001.
@ -357,13 +358,13 @@ func bumpSerial(rrs []dns.RR, now time.Time) error {
return fmt.Errorf("serial date %q unparseable: %w", curDate, err)
}
next := d.AddDate(0, 0, 1).Format("060102")
soa.Serial = uint32(parseUint(next)*serialCounterMul + 1)
soa.Serial = uint32(mustParseUint(next)*serialCounterMul + 1)
return nil
}
// Older or unparseable: jump to today*10000+1. Migration path for
// legacy YYYYMMDDNN serials lives here.
candidate := uint32(parseUint(today)*serialCounterMul + 1)
candidate := uint32(mustParseUint(today)*serialCounterMul + 1)
// H5 — explicit MaxUint32 guard. Plain `>` comparison is correct in
// practice (we'd never wrap during the zone's lifetime: 10000
@ -399,23 +400,20 @@ func isValidYYMMDD(s string) bool {
return err == nil
}
// atoi is a tiny helper that ignores errors — only called on a
// substring we already validated is two digits.
func atoi(s string) int {
n := 0
for _, c := range s {
n = n*10 + int(c-'0')
}
return n
}
// parseUint parses an all-digits string into a uint64. Used because
// strconv.ParseUint adds error-handling overhead we don't need on
// internally-controlled inputs.
func parseUint(s string) uint64 {
var n uint64
for _, c := range s {
n = n*10 + uint64(c-'0')
// mustParseUint parses an all-digit string into uint64. Panics on
// malformed input — the caller is responsible for passing only strings
// that were validated as digit-substrings (e.g., a fixed-width slice of
// a YYMMDDNNNN-formatted serial). Using strconv via this thin wrapper
// keeps the panic behavior explicit while sharing stdlib's robust
// parsing (Hamilton L1).
func mustParseUint(s string) uint64 {
n, err := strconv.ParseUint(s, 10, 64)
if err != nil {
// Programmer error if we ever hit this — every caller passes
// digits-only strings derived from time.Format or a sliced
// SOA serial. Panic so the bug surfaces in tests/CI rather
// than silently producing a 0 serial.
panic(fmt.Sprintf("mustParseUint(%q): %v", s, err))
}
return n
}