From 890a4214d66570799c58a81fdcdf09093d18d2d5 Mon Sep 17 00:00:00 2001
From: Ryan Malloy
Date: Wed, 20 May 2026 11:32:25 -0600
Subject: [PATCH] =?UTF-8?q?CLAUDE.md:=20project=20knowledge=20=E2=80=94=20?=
 =?UTF-8?q?architecture,=20NOTIFY,=20SSH=20deploy,=20HE=20quirks?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 CLAUDE.md | 276 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 276 insertions(+)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..278543c
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,276 @@
+# coredns — hidden-primary DNS for ~91 zones
+
+CoreDNS running on **dell01.mer.idahomuellers.net** (LAN `172.16.1.15`,
+public `154.27.180.210`) acts as a hidden primary. Hurricane Electric's
+free secondary service (`dns.he.net`) pulls each zone via AXFR and is
+what the public actually sees. Git/this repo is the source of truth.
+
+## Architecture at a glance
+
+```
+edit zones/*.zone → make prep → CoreDNS auto-reloads (30s)
+ ↓
+ scripts/notify-he.py
+ ↓
+ NOTIFY → ns1.he.net (216.218.130.2)
+ ↓
+ HE slave-puller (216.218.133.2) does AXFR
+ ↓
+ HE anycast cluster replicates internally
+ ↓
+ public sees new data
+```
+
+End-to-end propagation: typically **under 10 minutes** after `make prep`.
+Worst case ~1 hour (HE's poll-only fallback if NOTIFY is missed).
+
+## Source of truth
+
+- **`zones/*.zone`** — 91 raw Vultr-style zone files. **Edit here.**
+- **`zones-prepared/*.zone`** — generated by `scripts/prepare-zones.sh`:
+ injects SOA, replaces NS with `ns1-5.he.net`, dot-terminates rdata,
+ bumps serial. **Never edit directly.** Gitignored.
+- **`Corefile`** — CoreDNS config with `(common)` snippet imported by
+ plain DNS (`. {}`), DoT (`tls://.:853`), and DoH (`https://.:443`)
+ server blocks.
+
+## Daily workflow — adding/changing a record
+
+```bash
+# 1. Edit the source zone
+$EDITOR zones/homestar.ink.zone
+
+# 2.
Push, prep (auto-bumps serial), NOTIFY HE
+rsync -avz -e "ssh -A" zones/homestar.ink.zone \
+ rpm@dell01.mer.idahomuellers.net:~/coredns/zones/homestar.ink.zone
+ssh -A rpm@dell01.mer.idahomuellers.net 'cd ~/coredns && make prep'
+
+# 3. Commit locally
+git add -A && git commit -m "homestar.ink: add foo A 1.2.3.4"
+
+# 4. Verify
+./scripts/check-he.sh foo.homestar.ink A
+```
+
+Wait ≤5 minutes for HE to AXFR. If serial doesn't flip on HE,
+re-run NOTIFY: `ssh -A dell01... 'cd ~/coredns && ./scripts/notify-he.py'`
+
+## Publishing to dell01
+
+The repo lives in two places:
+- **Local** (`~/claude/coredns`): where you edit
+- **Remote** (`rpm@dell01.mer.idahomuellers.net:~/coredns`): where
+ CoreDNS reads zone files via Docker bind-mount
+
+To push the whole project:
+
+```bash
+rsync -avz --delete \
+ --exclude '.git/' --exclude 'caddy-data/' --exclude 'caddy-config/' \
+ --exclude 'certs/*.pem' --exclude 'zones-prepared/*.zone' \
+ --exclude '.env.local' \
+ -e "ssh -A" \
+ ./ rpm@dell01.mer.idahomuellers.net:~/coredns/
+```
+
+Per-file push for single-zone changes is also fine:
+```bash
+rsync -avz -e "ssh -A" zones/.zone \
+ rpm@dell01.mer.idahomuellers.net:~/coredns/zones/.zone
+```
+
+`-A` forwards your ssh agent so `gh` and other remote git ops work
+inside the dell01 session.
+
+## On dell01
+
+```bash
+ssh -A rpm@dell01.mer.idahomuellers.net
+cd ~/coredns
+make prep # re-prep zones (auto-bumps SOA + sends NOTIFY)
+make logs # tail CoreDNS logs
+make ps # container status
+```
+
+The Docker stack: `coredns` (server) + `coredns-caddy` (LE cert for
+`dns.supported.systems`, used for DoT/DoH).
+
+## NOTIFY: external script, not CoreDNS-native
+
+We use `scripts/notify-he.py` to send NOTIFY messages to
+`216.218.130.2` (ns1.he.net) on every `make prep`. Pure stdlib Python,
+no deps.
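The source of `scripts/notify-he.py` is not included in this file, so the following is only a minimal sketch of what a stdlib-only NOTIFY sender could look like, assuming the RFC 1996 message layout (opcode 4, AA bit set, a single SOA question for the zone). The function names `encode_name`, `build_notify`, `parse_rcode`, and `send_notify` are hypothetical and are not taken from the real script.

```python
import secrets
import socket
import struct

NOTIFY_OPCODE = 4   # RFC 1996
TYPE_SOA = 6
CLASS_IN = 1


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not raw or len(raw) > 63:
            raise ValueError(f"bad label in {name!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_notify(zone: str, msg_id=None) -> bytes:
    """Build a NOTIFY request: 12-byte header plus one SOA/IN question."""
    if msg_id is None:
        msg_id = secrets.randbelow(65536)
    flags = (NOTIFY_OPCODE << 11) | (1 << 10)  # opcode=NOTIFY, AA=1 -> 0x2400
    header = struct.pack("!HHHHHH", msg_id, flags, 1, 0, 0, 0)
    question = encode_name(zone) + struct.pack("!HH", TYPE_SOA, CLASS_IN)
    return header + question


def parse_rcode(response: bytes) -> int:
    """Return the RCODE (low 4 bits of the flags word) from a DNS response."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def send_notify(zone: str, server: str = "216.218.130.2",
                port: int = 53, timeout: float = 3.0) -> int:
    """Send one NOTIFY over UDP and return the responder's RCODE."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_notify(zone), (server, port))
        response, _ = sock.recvfrom(512)
    return parse_rcode(response)
```

Under these assumptions, `send_notify("homestar.ink")` would return `0` when the zone is acknowledged and `5` (REFUSED) when the receiving server does not host it, which is the distinction the script's `✓` / `✗ rcode=5` output reports.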
+
+**Why a script instead of CoreDNS's built-in `transfer { to }`?**
+
+CoreDNS 1.11.3 and 1.12.2 both have a bug where `transfer { to }`
+with **any specific IP** (single, multi-line, or space-separated) makes
+the server blocks silently fail to start their listeners — zones load,
+plugin loads, then `.:53` / `tls://.:853` / `https://.:443` never bind.
+Only `transfer { to * }` works.
+
+So:
+- `Corefile`: `transfer { to * }` — open AXFR (firewall does the
+ source-IP filtering on TCP/53 NAT anyway)
+- `notify-he.py`: sends NOTIFY explicitly to the right IP
+
+NOTIFY happens automatically on `make prep`. To NOTIFY manually:
+```bash
+ssh -A dell01... 'cd ~/coredns && ./scripts/notify-he.py'
+```
+
+The script's output doubles as a **"what's on HE" inventory** — `✓`
+for zones HE hosts, `✗ rcode=5` for zones HE doesn't yet host.
+
+**HE's NOTIFY behavior**: HE acks NOTIFY at the protocol level (rcode=0),
+and *usually* triggers an immediate AXFR. Sometimes the batch NOTIFY
+fired from `make prep` doesn't seem to wake them; re-running
+`notify-he.py` manually almost always does. Per-zone NOTIFY is more
+reliable than batch.
+
+## HE asymmetric IPs
+
+Hurricane Electric requires:
+- **AXFR pull source**: `216.218.133.2` (`slave.dns.he.net` /
+ `ns4.he.net`) — does NOT serve public queries, only does AXFR pulls
+- **NOTIFY destination**: `216.218.130.2` (`ns1.he.net`)
+- **Public-facing anycast**: `ns1`, `ns2`, `ns3`, `ns5` (`.130.2`,
+ `.131.2`, `.132.2`, `216.66.1.2`)
+
+`scripts/check-he.sh [type]` queries all 4 public anycast IPs
+in parallel and flags divergence.
+
+## HE two-stage propagation
+
+When you bump a serial, HE goes through:
+1. **slave-puller pulls AXFR** — happens quickly after NOTIFY (~seconds)
+2. **internal anycast replication** — propagates to public-facing PoPs
+ on HE's clock (1-15 min usually, can be longer)
+
+`check-he.sh` shows when stage 2 has completed (all 4 anycast NS report
+the same serial + answer).
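The "stage 2 complete" condition can be illustrated with a small sketch. It is not how `check-he.sh` is implemented (that script is not shown here); it only assumes one `dig +short SOA` answer per anycast name server has been collected, with the serial as the third whitespace-separated field. The function names and the sample SOA strings are hypothetical.

```python
from typing import Dict, Set


def serial_from_short_soa(line: str) -> int:
    """Extract the serial (third field) from a `dig +short SOA` answer."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"not a +short SOA answer: {line!r}")
    return int(fields[2])


def distinct_serials(answers: Dict[str, str]) -> Set[int]:
    """Set of serials seen across name servers; more than one means divergence."""
    return {serial_from_short_soa(answer) for answer in answers.values()}


def stage2_complete(answers: Dict[str, str], expected_serial: int) -> bool:
    """True only when every name server reports exactly the expected serial."""
    return distinct_serials(answers) == {expected_serial}


# Hypothetical sample data: ns3 has not yet received the new serial.
sample = {
    "ns1.he.net": "ns1.he.net. hostmaster.he.net. 2026052002 300 120 604800 60",
    "ns2.he.net": "ns1.he.net. hostmaster.he.net. 2026052002 300 120 604800 60",
    "ns3.he.net": "ns1.he.net. hostmaster.he.net. 2026052001 300 120 604800 60",
    "ns5.he.net": "ns1.he.net. hostmaster.he.net. 2026052002 300 120 604800 60",
}
```

With the sample above, `distinct_serials(sample)` contains two values, so the zone would still be in stage 2; once all four answers carry `2026052002`, `stage2_complete(sample, 2026052002)` becomes true.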
+
+## SOA timers
+
+`scripts/prepare-zones.sh` writes these for every zone:
+```
+serial   YYYYMMDDNN  (auto-incrementing per-day counter)
+refresh  300         (5 min — HE polls SOA this often)
+retry    120         (2 min — HE retries failed polls)
+expire   604800      (1 week)
+minimum  60          (1 min — NXDOMAIN negative-cache TTL)
+```
+
+These are aggressive but appropriate for the hidden-primary pattern.
+The 60s minimum keeps stale NXDOMAIN cache windows short after adding
+a new name.
+
+## Empty-non-terminal trap (RFC 4592)
+
+If a name X has children in the zone (especially stale
+`_acme-challenge..X` TXT records), X becomes an "empty
+non-terminal." HE strictly follows RFC 4592 §2.2.3: wildcards do NOT
+synthesize for empty non-terminals. So `*.` skips X even though
+the wildcard would otherwise have caught it.
+
+**Symptom**: dell01 returns the wildcard answer (CoreDNS is lenient),
+HE returns NODATA. Public clients see "broken" for X.
+
+**Fix**: add an explicit record at X (`X A 1.2.3.4` or `X CNAME @`).
+
+To find empty-non-terminals across zones:
+```bash
+# For each zone with a wildcard, find _acme-challenge. entries
+# where has no explicit record at that exact name.
+# See git log for 5afdb05 / f6111c2 / f8363e5 for the audit pattern.
+```
+
+## Wildcard depth
+
+HE follows RFC 4592 fully: `*.` matches **any depth** of names
+under `` as long as no intermediate names exist in the zone
+tree. So `*.demo` catches `something.demo` AND `deep.path.demo` (the
+latter only if `path.demo` doesn't exist as a node).
+
+Intermediate empty non-terminals **do** block synthesis below them.
+
+## Zone-by-zone HE status
+
+`./scripts/notify-he.py` prints `✓` / `✗` per zone — `✓` means HE
+hosts that zone as a secondary, `✗` (rcode=5) means HE doesn't yet
+host it. As of the last NOTIFY run, ~11 of 91 zones are slaved on HE.
+The other 80 are still served from Vultr at the registrar level.
+
+To migrate a zone fully to HE:
+1.
Add as Secondary DNS at `dns.he.net` with master IP `154.27.180.210`
+2. Update registrar NS records: replace `ns1/ns2.vultr.com` with
+ `ns1-ns5.he.net` (some registrars limit to 4 NS — drop ns5 if so)
+3. Wait for TLD propagation (minutes for gTLDs, hours for `.us` etc.)
+4. Optionally clean up Vultr-side zone records
+
+`scripts/check-he.sh` will then show this zone live across HE's anycast.
+
+## TLS for DoT/DoH
+
+DoT (`:8853` external, `:853` internal) and DoH (`:8443` external,
+`:443` internal) are terminated by CoreDNS using a Let's Encrypt cert
+for `dns.supported.systems`. The cert is provisioned and auto-renewed
+by `coredns-caddy` sidecar, which uses DNS-01 challenge via Vultr API
+(needs `VULTR_API_KEY` in shell env at startup).
+
+Renewal happens automatically; Caddy uses ACME ARI to schedule it.
+
+## Key files
+
+| Path | Purpose |
+|---|---|
+| `zones/*.zone` | Source-of-truth zone files (edit here) |
+| `zones-prepared/*.zone` | Generated, served by CoreDNS (gitignored) |
+| `Corefile` | CoreDNS config |
+| `scripts/prepare-zones.sh` | Zone prep + auto-bump serial |
+| `scripts/notify-he.py` | Send NOTIFY to ns1.he.net |
+| `scripts/check-he.sh` | Parallel HE anycast verification |
+| `caddy/Caddyfile` + `caddy/Dockerfile` | Caddy sidecar config |
+| `docker-compose.yml` | CoreDNS + Caddy stack |
+| `Makefile` | `make prep`, `make up`, `make down`, `make logs`, etc. |
+| `.env` | Image pins, ports |
+
+## Known operational quirks
+
+- **`make prep` errors first run of new day**: fixed in
+ `prepare-zones.sh` (grep with `|| true` for the serial-detection
+ step). Don't revert that.
+- **Full `docker compose down + up` needed after Corefile changes that
+ touch `transfer`**: `restart` alone leaves sticky state that prevents
+ listener binding.
+- **Vultr DNS still authoritative for ~80 zones** (registrar NS hasn't
+ been migrated to HE). The hidden-primary stack still serves them
+ locally and on dell01, but public DNS uses Vultr until you migrate.
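The first quirk above comes down to the case where no serial for today exists yet. As a rough illustration of the `YYYYMMDDNN` scheme, here is a sketch of the bump logic. It is not the code in `prepare-zones.sh` (which is a shell script and is not shown here); the function name `next_serial` and the choice to start each day's counter at `01` are assumptions.

```python
from datetime import date
from typing import Optional


def next_serial(current: Optional[int], today: date) -> int:
    """Return the next YYYYMMDDNN serial, never going backwards.

    `current` is the serial found in the previously prepared zone, or None
    when no serial was detected (for example the first run for a zone).
    """
    base = int(today.strftime("%Y%m%d")) * 100  # YYYYMMDD00
    if current is None or current < base:
        # First bump of the day: assumed to start the counter at 01.
        return base + 1
    if current % 100 == 99:
        raise OverflowError("per-day counter NN exhausted (99 bumps)")
    # Same day (or a serial already dated ahead): just increment.
    return current + 1
```

For example, `next_serial(2026051907, date(2026, 5, 20))` yields `2026052001`, and a second bump on the same day yields `2026052002`. Handling the `None` case explicitly plays the same role as the `|| true` guard: a missing match is treated as "no serial yet" instead of an error.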
+
+## Useful one-liners
+
+```bash
+# Find records pointing at a specific IP
+grep -rE '\b1\.2\.3\.4\b' zones/
+
+# Find all _acme-challenge records (potential empty-non-terminal sources)
+grep -E "_acme-challenge\." zones/.zone
+
+# Compare dell01 vs HE for a specific zone
+ZONE=homestar.ink
+echo "dell01: $(dig @dell01.mer.idahomuellers.net -p 5353 $ZONE SOA +short | awk '{print $3}')"
+echo "HE: $(dig @ns1.he.net $ZONE SOA +short | awk '{print $3}')"
+
+# What's the current SOA serial across all HE anycast for a zone?
+./scripts/check-he.sh SOA
+```
+
+## Don't do
+
+- **Don't edit `zones-prepared/`** — it's regenerated by `make prep`
+- **Don't put `transfer { to }`** in Corefile — CoreDNS bug,
+ silently breaks listener startup. Stick to `transfer { to * }`.
+- **Don't commit `.env.local`, `caddy-data/`, `certs/*.pem`** — these
+ are gitignored for a reason
+- **Don't manually bump serials in zones-prepared** — `make prep`
+ handles it correctly via `prepare-zones.sh`'s auto-bumper
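As a companion to the empty-non-terminal notes and the `_acme-challenge` one-liner, here is a hedged sketch of an audit over a zone's owner names. It is not the audit from the commits referenced earlier (those are not shown here). It assumes the owner names have already been extracted from a zone file as fully qualified names, and it only checks the wildcard at an empty non-terminal's immediate parent. All function names are hypothetical.

```python
from typing import Iterable, Set


def _norm(name: str) -> str:
    """Lower-case a name and drop any trailing dot."""
    return name.lower().rstrip(".")


def empty_non_terminals(zone: str, owners: Iterable[str]) -> Set[str]:
    """Names below the apex that have descendants but no records of their own."""
    apex = _norm(zone)
    present = {_norm(o) for o in owners}
    suffix = "." + apex
    found = set()
    for owner in present:
        if not owner.endswith(suffix):
            continue  # the apex itself, or a name outside this zone
        labels = owner[: -len(suffix)].split(".")
        # Walk every ancestor strictly between the owner and the apex.
        for i in range(1, len(labels)):
            ancestor = ".".join(labels[i:]) + suffix
            if ancestor not in present:
                found.add(ancestor)
    return found


def wildcard_shadowed(zone: str, owners: Iterable[str]) -> Set[str]:
    """Empty non-terminals whose immediate parent carries a `*` record.

    These are the names a strict RFC 4592 server answers with NODATA
    even though a sibling wildcard exists.
    """
    owner_list = list(owners)
    present = {_norm(o) for o in owner_list}
    hits = set()
    for ent in empty_non_terminals(zone, owner_list):
        parent = ent.split(".", 1)[1]
        if "*." + parent in present:
            hits.add(ent)
    return hits
```

Given owners `example.com`, `*.example.com`, and `_acme-challenge.foo.x.example.com`, the sketch reports `x.example.com` as shadowed; adding an explicit record at that name, as in the fix described earlier, removes it from the result.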