coredns/CLAUDE.md
Ryan Malloy 618e9504e7 secondary: scaffold public CoreDNS secondary on ns.supported.systems
Adds a second non-HE public secondary that pulls AXFR from dell01 (the
hidden primary at 154.27.180.210) and answers public queries on
ns.supported.systems (64.177.113.227, 2001:19f0:5c00:4daa:5400:6ff:fe2d:38fa).

  secondary/
    Corefile                            generated, 84 zones + REFUSED catch-all
    docker-compose.yml                  CoreDNS in host-net mode
    Makefile                            up/down/logs/regen/test/axfr-test
    .env / .env.example                 image pin + bind IPs
    scripts/generate-secondary-corefile.sh  reads ../zones/*.zone

  scripts/notify-he.py → notify-secondaries.py
                                        adds 64.177.113.227 as a second
                                        NOTIFY target alongside HE's
                                        216.218.130.2

Uses CoreDNS's `bind` plugin to avoid colliding with systemd-resolved
on loopback :53. Authoritative-only — non-listed zones get REFUSED, no
recursion. AXFR pull requires opening TCP/53 on dell01's FortiWiFi for
the secondary's IP (manual step, separate from this commit).
2026-05-20 18:40:11 -06:00


coredns — hidden-primary DNS for ~91 zones

CoreDNS running on dell01.mer.idahomuellers.net (LAN 172.16.1.15, public 154.27.180.210) acts as a hidden primary. Hurricane Electric's free secondary service (dns.he.net) pulls each zone via AXFR and is what the public actually sees. This Git repo is the source of truth.

Architecture at a glance

edit zones/*.zone  →  make prep  →  CoreDNS auto-reloads (30s)
                            ↓
                  scripts/notify-secondaries.py
                            ↓
                  NOTIFY → ns1.he.net (216.218.130.2)
                            ↓
                  HE slave-puller (216.218.133.2) does AXFR
                            ↓
                  HE anycast cluster replicates internally
                            ↓
                  public sees new data

End-to-end propagation: typically under 10 minutes after make prep. Worst case ~1 hour (HE's poll-only fallback if NOTIFY is missed).

Source of truth

  • zones/*.zone — 91 raw Vultr-style zone files. Edit here.
  • zones-prepared/*.zone — generated by scripts/prepare-zones.sh: injects SOA, replaces NS with ns1-5.he.net, dot-terminates rdata, bumps serial. Never edit directly. Gitignored.
  • Corefile — CoreDNS config with (common) snippet imported by plain DNS (. {}), DoT (tls://.:853), and DoH (https://.:443) server blocks.

Daily workflow — adding/changing a record

# 1. Edit the source zone
$EDITOR zones/homestar.ink.zone

# 2. Push, prep (auto-bumps serial), NOTIFY HE
rsync -avz -e "ssh -A" zones/homestar.ink.zone \
    rpm@dell01.mer.idahomuellers.net:~/coredns/zones/homestar.ink.zone
ssh -A rpm@dell01.mer.idahomuellers.net 'cd ~/coredns && make prep'

# 3. Commit locally
git add -A && git commit -m "homestar.ink: add foo A 1.2.3.4"

# 4. Verify
./scripts/check-he.sh foo.homestar.ink A

Wait up to 5 minutes for HE to AXFR. If the serial doesn't flip on HE, re-run NOTIFY: ssh -A dell01... 'cd ~/coredns && ./scripts/notify-secondaries.py'

Publishing to dell01

The repo lives in two places:

  • Local (~/claude/coredns): where you edit
  • Remote (rpm@dell01.mer.idahomuellers.net:~/coredns): where CoreDNS reads zone files via Docker bind-mount

To push the whole project:

rsync -avz --delete \
  --exclude '.git/' --exclude 'caddy-data/' --exclude 'caddy-config/' \
  --exclude 'certs/*.pem' --exclude 'zones-prepared/*.zone' \
  --exclude '.env.local' \
  -e "ssh -A" \
  ./ rpm@dell01.mer.idahomuellers.net:~/coredns/

Per-file push for single-zone changes is also fine:

rsync -avz -e "ssh -A" zones/<zone>.zone \
    rpm@dell01.mer.idahomuellers.net:~/coredns/zones/<zone>.zone

-A forwards your ssh agent so gh and other remote git ops work inside the dell01 session.

On dell01

ssh -A rpm@dell01.mer.idahomuellers.net
cd ~/coredns
make prep      # re-prep zones (auto-bumps SOA + sends NOTIFY)
make logs      # tail CoreDNS logs
make ps        # container status

The Docker stack: coredns (server) + coredns-caddy (LE cert for dns.supported.systems, used for DoT/DoH).

NOTIFY: external script, not CoreDNS-native

We use scripts/notify-secondaries.py to send NOTIFY messages to each secondary, 216.218.130.2 (ns1.he.net) and 64.177.113.227 (ns.supported.systems), on every make prep. Pure stdlib Python, no deps.
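For orientation, here is a minimal sketch of what a stdlib-only NOTIFY sender can look like. It is not the repo's notify-secondaries.py; the function names (encode_name, build_notify, parse_rcode, send_notify) are hypothetical, and it assumes plain UDP to port 53 with a single SOA question, as described in RFC 1996.

```python
import random
import socket
import struct

OPCODE_NOTIFY = 4  # RFC 1996
TYPE_SOA = 6
CLASS_IN = 1


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if not label:
            continue
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_notify(zone: str, msg_id: int) -> bytes:
    """Build a NOTIFY message for `zone`: opcode 4, AA set, one SOA question."""
    flags = (OPCODE_NOTIFY << 11) | (1 << 10)  # opcode is bits 11-14, AA is bit 10
    header = struct.pack("!HHHHHH", msg_id, flags, 1, 0, 0, 0)
    question = encode_name(zone) + struct.pack("!HH", TYPE_SOA, CLASS_IN)
    return header + question


def parse_rcode(response: bytes) -> int:
    """Return the RCODE (low 4 bits of the flags word) from a DNS response."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def send_notify(zone: str, server_ip: str, timeout: float = 3.0) -> int:
    """Send one NOTIFY over UDP and return the secondary's RCODE (hypothetical helper)."""
    msg_id = random.randrange(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_notify(zone, msg_id), (server_ip, 53))
        response, _ = sock.recvfrom(512)
    return parse_rcode(response)
```

An RCODE of 0 corresponds to the ack described below, and 5 (REFUSED) to a zone the secondary does not host.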

Why a script instead of CoreDNS's built-in transfer { to <IP> }?

CoreDNS 1.11.3 and 1.12.2 both have a bug where transfer { to <IP> } with any specific IP (single, multi-line, or space-separated) makes the server blocks silently fail to start their listeners — zones load, plugin loads, then .:53 / tls://.:853 / https://.:443 never bind. Only transfer { to * } works.

So:

  • Corefile: transfer { to * } — open AXFR (firewall does the source-IP filtering on TCP/53 NAT anyway)
  • notify-secondaries.py: sends NOTIFY explicitly to each secondary's IP

NOTIFY happens automatically on make prep. To NOTIFY manually:

ssh -A dell01... 'cd ~/coredns && ./scripts/notify-secondaries.py'

The script's output doubles as a "what's on HE" inventory: ✓ for zones HE hosts, ✗ rcode=5 for zones HE doesn't yet host.

HE's NOTIFY behavior: HE acks NOTIFY at the protocol level (rcode=0), and usually triggers an immediate AXFR. Sometimes the batch NOTIFY fired from make prep doesn't seem to wake them; re-running notify-secondaries.py manually almost always does. Per-zone NOTIFY is more reliable than batch.

HE asymmetric IPs

Hurricane Electric requires:

  • AXFR pull source: 216.218.133.2 (slave.dns.he.net / ns4.he.net) — does NOT serve public queries, only does AXFR pulls
  • NOTIFY destination: 216.218.130.2 (ns1.he.net)
  • Public-facing anycast: ns1, ns2, ns3, ns5 (.130.2, .131.2, .132.2, 216.66.1.2)

scripts/check-he.sh <name> [type] queries all 4 public anycast IPs in parallel and flags divergence.

HE two-stage propagation

When you bump a serial, HE goes through:

  1. slave-puller pulls AXFR — happens quickly after NOTIFY (~seconds)
  2. internal anycast replication — propagates to public-facing PoPs on HE's clock (1-15 min usually, can be longer)

check-he.sh shows when stage 2 has completed (all 4 anycast NS report the same serial + answer).
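The two stages can be told apart by comparing the serial each public nameserver reports. The sketch below is not check-he.sh; it is a hypothetical classifier that assumes the serials have already been collected (for example with dig) into a dict keyed by nameserver, with None for a server that did not answer.

```python
from typing import Dict, Optional


def propagation_status(serials: Dict[str, Optional[int]], expected: int) -> str:
    """Classify HE's public view of one zone.

    serials maps each anycast NS name to the SOA serial it returned,
    or None if it gave no answer. expected is the serial on the primary.
    """
    seen = set(serials.values())
    if seen == {expected}:
        return "converged"    # stage 2 complete: every NS has the new serial
    if None in seen:
        return "unreachable"  # at least one NS did not answer
    if len(seen) > 1:
        return "diverged"     # stage 2 in progress: NSes disagree
    return "stale"            # every NS agrees, but on a different serial
```

For example, three servers on the new serial and one on the old one yields "diverged", which is the state worth waiting out before re-sending NOTIFY.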

SOA timers

scripts/prepare-zones.sh writes these for every zone:

serial    YYYYMMDDNN   (auto-incrementing per-day counter)
refresh   300          (5 min — HE polls SOA this often)
retry     120          (2 min — HE retries failed polls)
expire    604800       (1 week)
minimum   60           (1 min — NXDOMAIN negative-cache TTL)

These are aggressive but appropriate for the hidden-primary pattern. The 60s minimum keeps stale NXDOMAIN cache windows short after adding a new name.
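As an illustration of the YYYYMMDDNN scheme, a per-day counter bump could look like the sketch below. This is not the logic in prepare-zones.sh; next_serial is a hypothetical name, and starting the counter at 01 and refusing to go past 99 are assumptions.

```python
from datetime import date


def next_serial(current: int, today: date) -> int:
    """Return the next YYYYMMDDNN serial after `current`."""
    base = int(today.strftime("%Y%m%d")) * 100
    if current < base:
        # First bump of the day (assumption: the counter starts at 01).
        return base + 1
    if current - base >= 99:
        # Counter exhausted, or the stored serial is dated in the future.
        raise ValueError("cannot bump %d within %d" % (current, base))
    return current + 1
```

Any serial of this shape stays well inside the 32-bit range that SOA serials allow.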

Empty-non-terminal trap (RFC 4592)

If a name X has children in the zone (especially stale _acme-challenge.<sub>.X TXT records), X becomes an "empty non-terminal." HE strictly follows RFC 4592 §2.2.3: wildcards do NOT synthesize for empty non-terminals. So *.<parent> skips X even though the wildcard would otherwise have caught it.

Symptom: dell01 returns the wildcard answer (CoreDNS is lenient), HE returns NODATA. Public clients see "broken" for X.

Fix: add an explicit record at X (X A 1.2.3.4 or X CNAME @).

To find empty-non-terminals across zones:

# For each zone with a wildcard, find _acme-challenge.<X> entries
# where <X> has no explicit record at that exact name.
# See git log for 5afdb05 / f6111c2 / f8363e5 for the audit pattern.
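A rough sketch of such an audit, independent of the commits referenced above: given the owner names of a zone as fully qualified names (parsing the zone file is left out, and the function name is hypothetical), list every name that has descendants but no records of its own.

```python
from typing import Iterable, List, Set


def empty_non_terminals(owners: Iterable[str], zone: str) -> List[str]:
    """Return names inside `zone` that have descendants but no records of their own."""
    zone = zone.rstrip(".").lower()
    owned: Set[str] = {o.rstrip(".").lower() for o in owners}
    ents: Set[str] = set()
    for name in owned:
        labels = name.split(".")
        # Walk every proper ancestor of `name` that is still below the zone apex.
        for i in range(1, len(labels)):
            ancestor = ".".join(labels[i:])
            if ancestor == zone or not ancestor.endswith("." + zone):
                break
            if ancestor not in owned:
                ents.add(ancestor)
    return sorted(ents)
```

Any name in the result that sits under a wildcard is a candidate for the explicit-record fix described above.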

Wildcard depth

HE follows RFC 4592 fully: *.<parent> matches any depth of names under <parent> as long as no intermediate names exist in the zone tree. So *.demo catches something.demo AND deep.path.demo (the latter only if path.demo doesn't exist as a node).

Intermediate empty non-terminals do block synthesis below them.
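The depth rule and the blocking rule both fall out of the closest-encloser lookup in RFC 4592. The sketch below is a simplified model of that lookup, not HE's or CoreDNS's code: it ignores record types, CNAMEs, and delegations, assumes the query name is inside the zone, and uses example.com as a stand-in zone.

```python
from typing import Iterable, Set


def existing_names(owners: Iterable[str]) -> Set[str]:
    """Owner names plus all their ancestors (ancestors without records are empty non-terminals)."""
    names: Set[str] = set()
    for owner in owners:
        labels = owner.rstrip(".").lower().split(".")
        for i in range(len(labels)):
            names.add(".".join(labels[i:]))
    return names


def lookup(qname: str, owners: Iterable[str]) -> str:
    """Return 'exact', 'nodata', 'wildcard', or 'nxdomain' for qname."""
    owned = {o.rstrip(".").lower() for o in owners}
    exists = existing_names(owned)
    qname = qname.rstrip(".").lower()
    if qname in owned:
        return "exact"
    if qname in exists:
        return "nodata"  # empty non-terminal: the name exists, so no wildcard applies
    labels = qname.split(".")
    for i in range(1, len(labels)):
        encloser = ".".join(labels[i:])
        if encloser in exists:
            # Closest encloser found; only its own wildcard child may synthesize.
            return "wildcard" if "*." + encloser in owned else "nxdomain"
    return "nxdomain"
```

With owners example.com, *.demo.example.com, and _acme-challenge.path.demo.example.com, both something.demo and deep.other.demo resolve via the wildcard, path.demo is NODATA, and deep.path.demo is NXDOMAIN because path.demo exists as a node.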

Zone-by-zone HE status

./scripts/notify-secondaries.py prints ✓/✗ per zone: ✓ means HE hosts that zone as a secondary, ✗ (rcode=5) means HE doesn't yet host it. As of the last NOTIFY run, ~11 of 91 zones are slaved on HE. The other 80 are still served from Vultr at the registrar level.

To migrate a zone fully to HE:

  1. Add as Secondary DNS at dns.he.net with master IP 154.27.180.210
  2. Update registrar NS records: replace ns1/ns2.vultr.com with ns1-ns5.he.net (some registrars limit to 4 NS — drop ns5 if so)
  3. Wait for TLD propagation (minutes for gTLDs, hours for .us etc.)
  4. Optionally clean up Vultr-side zone records

scripts/check-he.sh will then show this zone live across HE's anycast.

TLS for DoT/DoH

DoT (:8853 external, :853 internal) and DoH (:8443 external, :443 internal) are terminated by CoreDNS using a Let's Encrypt cert for dns.supported.systems. The cert is provisioned and auto-renewed by the coredns-caddy sidecar, which uses the DNS-01 challenge via the Vultr API (needs VULTR_API_KEY in the shell env at startup).

Renewal happens automatically; Caddy uses ACME ARI to schedule it.

Key files

Path                                Purpose
zones/*.zone                        Source-of-truth zone files (edit here)
zones-prepared/*.zone               Generated, served by CoreDNS (gitignored)
Corefile                            CoreDNS config
scripts/prepare-zones.sh            Zone prep + auto-bump serial
scripts/notify-secondaries.py       Send NOTIFY to ns1.he.net + ns.supported.systems
secondary/                          Public secondary (CoreDNS in Docker) deployed to ns.supported.systems
scripts/check-he.sh                 Parallel HE anycast verification
caddy/Caddyfile + caddy/Dockerfile  Caddy sidecar config
docker-compose.yml                  CoreDNS + Caddy stack
Makefile                            make prep, make up, make down, make logs, etc.
.env                                Image pins, ports

Known operational quirks

  • make prep used to error on the first run of a new day: fixed in prepare-zones.sh (grep with || true for the serial-detection step). Don't revert that.
  • Full docker compose down + up needed after Corefile changes that touch transfer: restart alone leaves sticky state that prevents listener binding.
  • Vultr DNS still authoritative for ~80 zones (registrar NS hasn't been migrated to HE). The hidden-primary stack still serves them locally and on dell01, but public DNS uses Vultr until you migrate.

Useful one-liners

# Find records pointing at a specific IP
grep -rE '\b1\.2\.3\.4\b' zones/

# Find all _acme-challenge records (potential empty-non-terminal sources)
grep -E "_acme-challenge\." zones/<zone>.zone

# Compare dell01 vs HE for a specific zone
ZONE=homestar.ink
echo "dell01: $(dig @dell01.mer.idahomuellers.net -p 5353 $ZONE SOA +short | awk '{print $3}')"
echo "HE:     $(dig @ns1.he.net $ZONE SOA +short | awk '{print $3}')"

# What's the current SOA serial across all HE anycast for a zone?
./scripts/check-he.sh <zone> SOA

Don't do

  • Don't edit zones-prepared/ — it's regenerated by make prep
  • Don't put transfer { to <IP> } in Corefile — CoreDNS bug, silently breaks listener startup. Stick to transfer { to * }.
  • Don't commit .env.local, caddy-data/, certs/*.pem — these are gitignored for a reason
  • Don't manually bump serials in zones-prepared/; make prep handles it correctly via prepare-zones.sh's auto-bumper