Wires Caddy as the ACME client side of our new self-hosted DNS-01
flow. Proves the design end-to-end: caddy-dns/rfc2136 -> our
CoreDNS rfc2136 plugin -> zone file write -> git auto-commit -> HE
AXFR -> LE validates -> cert issued.
Changes:
- caddy/Dockerfile: --with github.com/caddy-dns/rfc2136 added
alongside the existing caddy-dns/vultr.
- caddy/Caddyfile: new test-rfc2136.supported.systems site that uses
the new provider. server coredns:53 (docker internal), key from
env, propagation_delay 60s + timeout 600s to accommodate HE pull.
- docker-compose.yml: ACME_TSIG_SECRET passed to the caddy container
(the same secret CoreDNS verifies on the other side of the loop).
First cert issued in production: 2026-05-21 ~13:23 UTC. ~5.5 min
end-to-end from Caddy starting to cert in hand. Documented in
session notes; the cert sits unused in caddy-data/ until/unless
something publishes ports 80/443 for that hostname.
The final set of fixes to make the rfc2136 plugin truly operational
in production:
- coredns/Dockerfile: switch runtime stage from gcr.io/distroless to
alpine:3.20. Distroless ships no git binary, and has no shell or
package manager to add one, so the `git commit` called by the
plugin's auto-commit code path had nothing to execute. Alpine adds
~10 MB of image size but gives us git + a usable shell for debugging.
- docker-compose.yml: `user: "${COREDNS_UID:-1003}:${COREDNS_GID:-1004}"`.
The container runs as the host's rpm user (uid 1003/gid 1004 on
dell01) so zone files the plugin writes are owned by rpm:rpm on
the host -- not root. Without this the plugin would write
root-owned files we couldn't read or git-edit. Defaults match
dell01; override per-host via env if needed.
- .env.example: documents COREDNS_IMAGE_TAG (CalVer; bump per build).
Add COREDNS_UID/GID if you need to override on a host where rpm
has different numeric ids.
Combined with the bumped image tag (2026.05.21.2), the full
end-to-end flow works: caddy/nsupdate -> TSIG verify -> plugin
handler -> atomic file write -> git auto-commit -> auto plugin
reload -> query returns new record.
Per standard Docker convention, the active `.env` is per-host
(contains the actual TSIG secret + any host-specific port/hostname
overrides). The `.env.example` template documents the expected
variables with stub values so a fresh checkout knows what to copy.
Also: docker-compose.yml now passes ACME_TSIG_SECRET to the coredns
container via plain `environment:` directive -- compose auto-reads
`.env` for substitution. No --env-file gymnastics needed at the
invocation level.
Brings up a parallel CoreDNS instance on ports 11053/19153 with a
single test.example.com zone. Useful for verifying the custom image
builds and the rfc2136 plugin accepts/applies UPDATEs end-to-end
before touching production zones.
Already validated the msgAcceptFunc override fix end-to-end via
nsupdate, with the auto plugin re-serving the new record within 5s.
Note: zones/test.example.com.zone gets rewritten by the plugin
during testing. If perms get hosed (docker writes as root), run
`sudo chown -R rpm:rpm test/zones/` to reclaim ownership.
Production-readiness pass on the Dockerfile after the test stack
proved out the build. Four changes:
- FROM golang:1.22-alpine → golang:1.25-alpine (plugin's go.mod
resolved to go 1.25, base image needed to keep up).
- COREDNS_REF v1.12.2 → v1.14.3 (matches what our plugin compiles
against; older CoreDNS pulled an outdated quic-go API).
- GOPROXY=direct + GOSUMDB=off so `go get` talks straight to the
Gitea instance hosting our plugin (proxy.golang.org can't fetch
private repos).
- Dropped the broken GOFLAGS="-ldflags=-w -s" passthrough that
miekg parses incorrectly. Resulting binary is ~10MB larger than
a stripped build but functionally identical.
Wires the custom CoreDNS image (built via coredns/Dockerfile, source
includes git.supported.systems/rsp2k/coredns-rfc2136) into production:
- docker-compose.yml: switch coredns service from upstream image pin
to a build target. New `image: coredns-rfc2136:${COREDNS_IMAGE_TAG}`
is locally-built; `up -d coredns` triggers the build.
- .env: COREDNS_IMAGE_TAG=2026.05.21 (CalVer). Old COREDNS_IMAGE kept
as a comment for emergency rollback to upstream 1.11.3.
- Corefile: new rfc2136 directive inside (common) snippet enumerating
all 84 zones currently in zones/. Plugin is now in the chain for
every server block (plain DNS, DoT, DoH). UPDATE opcode lands in
the plugin handler; auto-commit on, CalVer SOA serial bumping on,
zones-dir /zones matches the existing bind-mount.
TSIG key is read from ${ACME_TSIG_SECRET} which lives in .env.local
(gitignored). Production deployment needs that file synced to dell01
separately.
This commit DOESN'T trigger the deployment by itself -- the image
must be built on dell01 and the container recreated to apply.
ssh repointed to new host. vpn and web-bmh-servicedesk.bmh now CNAME at
ssh so future host moves only require one record change. SOA serial
bumped manually (2026052103) since prepare-zones.sh no longer in the
loop after the Phase 2a migration.
Caddy needs this only for DNS-01 cert renewal via Vultr's API, which
happens within the final 30 days of the cert's 90-day lifetime --
roughly every two months. Requiring it to be exported on every `docker
compose up` was friction for routine ops (CoreDNS recreations during
unrelated config changes).
Empty default keeps the stack startable without the key in scope. When
renewal is imminent, set the var properly OR (preferred long-term)
migrate Caddy to caddy-dns/rfc2136 pointing at our own plugin and
retire the Vultr dependency entirely.
Big migration: the source/prepared split is gone. Each zones/*.zone is
now an RFC-compliant zone file that CoreDNS reads directly. Editing a
record is just edit + bump SOA + commit. CoreDNS auto-reloads within
30s; HE pulls on its own 300s SOA-refresh cycle.
Why: groundwork for the coredns-rfc2136 plugin to edit zones in place
without juggling a source/prepared transformation step. Also reduces
the mental model from "edit source, run prep, push" to just "edit".
Changes:
- zones/*.zone: 84 files migrated from Vultr-export form to RFC-compliant
form (SOA injected, Vultr NS replaced with HE NS, CNAME/MX/NS rdata
dot-terminated, apex lines get explicit @ prefix). Diff is mechanical
and byte-count is unchanged (~340K) -- pure formatting promotion.
- docker-compose.yml: bind ./zones:/zones:ro (was ./zones-prepared)
- Makefile: dropped 'prep' target. 'reload' is now a no-op explainer.
'tls-up' no longer depends on prep. 'clean' no longer wipes prepared.
- scripts/prepare-zones.sh moved to scripts/archive/ (kept for reference).
- .gitignore: updated comment for zones-prepared/ (now legacy).
NOT in this commit (follow-ups):
- CLAUDE.md updates documenting the new workflow.
- scripts/bump-serials.sh helper for manual-edit SOA bumping.
- coredns-rfc2136 plugin refactor (Phase 2b in the plan).
Aligns the placeholder with the actual plugin repo at
https://git.supported.systems/rsp2k/coredns-rfc2136 (created and
populated via tea in Phase 1.2). Originally written as a guess at
git.supportedsystems.net; correcting now that the repo exists.
Adds a second non-HE public secondary that pulls AXFR from dell01 (the
hidden primary at 154.27.180.210) and answers public queries on
ns.supported.systems (64.177.113.227, 2001:19f0:5c00:4daa:5400:6ff:fe2d:38fa).
secondary/
Corefile                                generated, 84 zones + REFUSED catch-all
docker-compose.yml                      CoreDNS in host-net mode
Makefile                                up/down/logs/regen/test/axfr-test
.env / .env.example                     image pin + bind IPs
scripts/generate-secondary-corefile.sh  reads ../zones/*.zone
scripts/notify-he.py → notify-secondaries.py
                                        adds 64.177.113.227 as a second
                                        NOTIFY target alongside HE's
                                        216.218.130.2
Uses CoreDNS's `bind` plugin to avoid colliding with systemd-resolved
on loopback :53. Authoritative-only — non-listed zones get REFUSED, no
recursion. AXFR pull requires opening TCP/53 on dell01's FortiWiFi for
the secondary's IP (manual step, separate from this commit).
Lays the groundwork for a future CoreDNS rfc2136 plugin that will accept
TSIG-authenticated dynamic DNS updates from Caddy (via caddy-dns/rfc2136),
enabling self-hosted ACME DNS-01 cert automation without depending on
registrar APIs.
Nothing in this commit is active at runtime:
- Corefile additions are commented out
- coredns/Dockerfile references a plugin repo that doesn't exist yet
- scripts/acme-add-domain.sh just appends CNAME glue but has nothing
to talk to until the plugin is built
Architecture and implementation plan:
~/.claude/plans/dood-does-coredns-offer-enumerated-piglet.md
Secret management: TSIG key generated and stored in .env.local
(gitignored). .env.local.example documents the expected shape.
The cache 30 directive in the (common) snippet was clamping
authoritative TTLs to 30s max — every record HE pulled showed TTL≈5
because the cache plugin intercepts responses regardless of source
(auto plugin authoritative answers AND forward plugin resolver answers).
Switching to bare 'cache' uses the plugin's 3600s default, which
preserves our source TTLs: most records at 300s, _dmarc/dkim/SRV at
3600s, wildcards at 60s.
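A simplified model of the clamping described above, as a sketch only. It assumes the cache cap acts as an upper bound on the TTL stored at insert time and ignores the countdown while an entry sits in the cache; `served_ttl` is a hypothetical helper, not CoreDNS code.

```python
def served_ttl(source_ttl, cache_max_ttl):
    """Upper bound on the TTL a client can see once the cache caps it."""
    return min(source_ttl, cache_max_ttl)


# Source TTLs named in this commit message.
SOURCE_TTLS = {"wildcards": 60, "most records": 300, "_dmarc/dkim/SRV": 3600}

for label, ttl in SOURCE_TTLS.items():
    capped = served_ttl(ttl, 30)       # behaviour with `cache 30`
    default = served_ttl(ttl, 3600)    # behaviour with bare `cache`
    print(f"{label}: cache 30 -> {capped}s, bare cache -> {default}s")
```

With the 3600s cap none of the source TTLs above are reduced, which is the point of the switch.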
Decouples git.supported.systems from the legacy host record. Resolves
through git.supportedsystems.net (64.177.112.188) — the new git server
under the .supportedsystems.net infrastructure namespace. Old gitea box
at 66.42.70.188 has no more named-DNS reference here.
Same pattern as autoconfig/autodiscover/imap/smtp/pop — webmail was
being caught by the wildcard (* 60 IN A 108.61.23.129) and resolving
to the docker host. Explicit CNAME points it at the mail server FQDN
where the webmail UI actually runs.
These 4 mail-discovery hostnames were silently caught by the wildcard
(* 60 IN A 108.61.23.129), resolving to the docker host instead of
the mail server. CNAMEs to mail.supported.systems make their resolution
explicit and follow the mail server's A record automatically.
Mail server migration cutover. mail.supported.systems flips inbound mail
for all 20 MX-referring zones to the new server. old-mailu.supported.systems
preserves a name pointing at the old IP (66.42.75.247) during the
migration window for IMAP drain, mailbox sync, and parallel verification.
Decouples the 6 dependent services (dignity.ink:kayla, septic.report:permits,
supported.systems:{docs, *.docs, mcbluetooth, s120}) from the legacy host
record. Services now follow the new-canonical .supportedsystems.net naming
and resolve directly to the new docker host.
Bulk swap of the old docker-2 host IP to the new one across 4 zones.
docker-2.supported.systems intentionally preserved at the old IP — 6
CNAMEs depend on the FQDN; the old box keeps its identity until
decommissioned.
These were leftover from a past cert renewal — timelinize.l isn't an
active service. Their presence made timelinize.l an empty non-terminal
that suppressed *.l wildcard synthesis at HE per RFC 4592 §2.2.2.
Bulk swap of the old docker host IP to the new one across 13 zones.
docker-1.supported.systems intentionally preserved at the old IP — the
hostname stays tied to the old box until decommissioned.
cubeseptic.com, flonhoney.com, hydrushydroponics.com,
idahogreendreams.com, qube-construction.com, qube-septic.com,
qubeseptic.com — all were hosted on 108.61.229.209 (docker-1, old)
and are being decommissioned, not migrated to the replacement host.
Wildcards in DNS only synthesize for names that don't already exist
in the zone tree. A `_acme-challenge.<sub>` TXT record makes <sub>
an "empty non-terminal" — exists in the tree (as a parent node) but
has no records of its own. Per RFC 4592 §2.2.2, wildcards skip these,
so RFC-compliant authoritative servers (HE, BIND) return NODATA for
<sub> even when the zone has `* CNAME @`.
Fix: for each <sub> that's an empty non-terminal in a zone with a
wildcard, add an explicit `<sub> CNAME @` so the resolution outcome
matches what the wildcard would have produced. Zero-knowledge — no
need to identify the specific service IP per name.
30 records added across 13 zones:
acrazy.org (langfuse.dootie)
context.bet (studio)
copper-springs.online (docs.butler.dev)
demostar.io (cw.cw, doom, meet)
home-inspector.store (api, dashboard, mailpit)
inspect.pics (admin)
log.doctor (app, docs)
malloys.us (cp, cp-sandbox, mary)
nielsen-inspections.com (calendar, cw, files, v2-calendar)
qubeseptic.com (api.dispatch, dispatch, leads, mail.dispatch,
rentcache.dispatch)
ryanmalloy.com (c4ai)
sidejob.pro (api)
upc.llc (catalog, minio.or, or, s3)
CoreDNS (lenient) was returning the wildcard CNAME for these names
anyway; HE (strict RFC-compliant) was returning empty. After this
change, both behave identically.
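A minimal sketch of how the empty non-terminals above could be found mechanically. It assumes owner names are given relative to the zone apex (with `@` for the apex itself); `empty_non_terminals` is a hypothetical helper and not part of this repo.

```python
def empty_non_terminals(owner_names):
    """Return names that exist only as ancestors of other owner names.

    owner_names: iterable of owner names relative to the zone apex,
    e.g. "_acme-challenge.studio", "www", "@", "*".
    """
    owners = {name.lower().rstrip(".") for name in owner_names}
    ents = set()
    for name in owners:
        labels = name.split(".")
        # Every proper ancestor below the apex exists in the zone tree.
        for i in range(1, len(labels)):
            parent = ".".join(labels[i:])
            if parent not in owners:
                ents.add(parent)
    return ents


owners = ["@", "*", "www", "_acme-challenge.studio",
          "api.dispatch", "_acme-challenge.api.dispatch"]
for name in sorted(empty_non_terminals(owners)):
    # The explicit record that mirrors what the wildcard would have produced.
    print(f"{name} CNAME @")
```

In the sample input `studio` and `dispatch` come out as empty non-terminals, while `api.dispatch` does not because it owns a record of its own.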
Previously: refresh=3600 retry=1800 minimum=300 (RFC-conformant but
slow). With HE's free secondary service exhibiting puller→anycast
replication lag of up to ~1 hour, we want to give them every signal
to refresh faster.
New: refresh=300 retry=120 minimum=60.
- refresh 300s: slaves poll our SOA every 5 minutes. ~91 zones polled
by HE = ~0.3 queries/sec to dell01:53, trivial load. If HE honors the
master's refresh internally (some secondary providers do, some
don't), this also nudges their puller→anycast sync.
- retry 120s: kept < refresh per RFC 1912 §2.2.
- minimum 60s: tightens NXDOMAIN negative-cache TTL on public
resolvers from 5 min to 1 min. The dominant window when a newly-
added name is briefly NX-cached on Cloudflare/Google/Quad9 before
they re-ask HE.
expire stays at 604800 (1 week) — that's "how long HE keeps serving
stale data if we vanish," unrelated to fresh-data propagation.
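Back-of-envelope check of the polling load, assuming a single poller that asks for each zone's SOA once per refresh interval (an assumption; HE's real polling pattern isn't documented here). Either way the load stays well under one query per second.

```python
def soa_polls_per_second(zone_count, refresh_seconds):
    """Average SOA queries/sec if every zone is polled once per refresh."""
    return zone_count / refresh_seconds


ZONES = 91  # approximate zone count from the note above

for label, refresh in [("old", 3600), ("new", 300)]:
    rate = soa_polls_per_second(ZONES, refresh)
    print(f"{label}: refresh={refresh}s -> {rate:.2f} queries/sec")
# old: refresh=3600s -> 0.03 queries/sec
# new: refresh=300s -> 0.30 queries/sec
```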
Hurricane Electric requires asymmetric transfer config:
- AXFR pull from 216.218.133.2 (slave.dns.he.net / ns4.he.net)
- NOTIFY destination 216.218.130.2 (ns1.he.net)
CoreDNS's transfer plugin uses a single bidirectional `to` list for
both, which is fine in principle but hits a bug we reproduced: any
`to` with more than one specific IPv4 silently kills server-block
listener startup (no error, zones load, but :53 never binds).
Reproduced on 1.11.3 + 1.12.2 even with a minimal fresh `docker run`.
Workaround:
- Corefile keeps `transfer { to * }` (open AXFR; firewall does the
real source-IP filtering on TCP/53)
- scripts/notify-he.py crafts and sends NOTIFY messages directly to
216.218.130.2 (only). Pure-stdlib Python — no dependencies.
- Makefile `prep` target runs notify-he.py after prepare-zones.sh
so every zone-bump fires NOTIFY automatically.
Verified end-to-end: HE acks NOTIFY (rcode=0) for the 10 zones it
hosts as secondaries; remaining 81 return REFUSED (rcode=5) because
HE doesn't have them configured yet. Note: HE's free slave service
acks NOTIFY but only actually re-pulls AXFR on its hourly poll cycle
(observed behavior — they're poll-based by design). NOTIFY still
useful long-term in case HE changes that behavior; harmless either way.
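For reference, a stdlib-only sketch of what crafting a NOTIFY looks like on the wire (RFC 1996: opcode 4, AA set, one SOA question). This is an illustration under those assumptions, not the contents of scripts/notify-he.py; all function names here are hypothetical.

```python
import random
import socket
import struct

OPCODE_NOTIFY = 4
TYPE_SOA = 6
CLASS_IN = 1


def encode_name(name):
    """Encode a domain name as length-prefixed labels plus the root label."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_notify(zone, msg_id):
    """Build a NOTIFY request: 12-byte header plus one SOA/IN question."""
    flags = (OPCODE_NOTIFY << 11) | (1 << 10)  # opcode=NOTIFY, AA bit set
    header = struct.pack("!HHHHHH", msg_id, flags, 1, 0, 0, 0)
    question = encode_name(zone) + struct.pack("!HH", TYPE_SOA, CLASS_IN)
    return header + question


def rcode_of(response):
    """Extract the 4-bit RCODE from a DNS response header."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def send_notify(zone, server, port=53, timeout=3.0):
    """Send one NOTIFY over UDP and return the RCODE of the reply."""
    msg_id = random.randrange(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_notify(zone, msg_id), (server, port))
        response, _ = sock.recvfrom(512)
    return rcode_of(response)
```

A call such as `send_notify("example.com", "216.218.130.2")` would return 0 for an accepted NOTIFY and 5 for REFUSED, matching the rcodes reported above.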
27 records across 15 zones converted from direct A records pointing at
the Tailscale endpoint (100.79.95.190) to CNAMEs pointing at the
Tailscale-named alias. Now if the underlying Tailscale node's IP
changes, only the rpm-bullet record needs updating instead of
chasing 27 records across 15 zones.
Affected zones (all *.l labels + a handful of dev / dev.mary names):
acrazy.org copper-springs.online demostar.io flonhoney.com
homestar.ink kg7q.cc malloys.us ourjob.site
qubeseptic.com ryanmalloy.com septic.report sidejob.pro
supported.systems warehack.ing zmesh.systems
No CNAME collisions: none of the converted names had other records
(MX/TXT/SRV/CAA/AAAA) at the same exact name. _acme-challenge.<sub>.l
records sit at distinct subdomains and continue to resolve independently
(verified: TXT lookups for known _acme-challenge.l.* names still return
the original values).
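The collision check above can be sketched as follows, assuming the zone is available as (owner name, record type) pairs. `cname_collisions` is a hypothetical helper, not something in this repo.

```python
def cname_collisions(records, candidates):
    """Map each candidate name to the non-A record types it also owns.

    records: iterable of (owner_name, rtype) pairs from one zone.
    candidates: names whose A record is about to become a CNAME.
    A CNAME cannot coexist with other data at the same owner name,
    so any name returned here needs attention before conversion.
    """
    types_by_name = {}
    for name, rtype in records:
        types_by_name.setdefault(name.lower(), set()).add(rtype.upper())
    collisions = {}
    for name in candidates:
        others = types_by_name.get(name.lower(), set()) - {"A"}
        if others:
            collisions[name] = sorted(others)
    return collisions


zone = [("dev.l", "A"), ("_acme-challenge.dev.l", "TXT"),
        ("mail.l", "A"), ("mail.l", "MX")]
print(cname_collisions(zone, ["dev.l", "mail.l"]))
```

In the sample, `dev.l` is safe to convert because its `_acme-challenge` TXT record sits at a different owner name, while `mail.l` is flagged for its MX record.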
Also fixed prepare-zones.sh: added `|| true` after the serial-detection
grep so a zero-match (first run of a new day) doesn't trip `set -e`
and abort the whole prep.
Previously: `SERIAL=$(date +%Y%m%d)01` — same-day re-runs produced the
same serial. HE polled, saw no change, never pulled the update.
Now: scan zones-prepared/ for the highest `YYYYMMDDNN` matching today's
date and increment the NN counter. First run of the day starts at NN=01.
Caps at NN=99 with a clear error message (set SERIAL manually if you
genuinely need >99 changes per day).
`SERIAL=<value> make prep` still overrides the auto-detection, useful
for forcing a specific serial during recovery or for testing.
Verified end-to-end on dell01: prep bumped 2026051601 → 2026051602,
CoreDNS auto-reload picked it up within 30s, all queried zones serve
the new serial. HE will pull on its next refresh poll (SOA refresh
= 3600s, so worst case 1 hour).
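The serial selection described above, sketched in Python for clarity. prepare-zones.sh itself is a shell script; this `next_serial` helper is hypothetical and only mirrors the stated rules (highest NN for today plus one, start at 01, error past 99).

```python
import re
from datetime import date


def next_serial(existing_serials, today):
    """Return the next YYYYMMDDNN serial for `today`.

    existing_serials: serial strings already present in the prepared zones.
    """
    prefix = today.strftime("%Y%m%d")
    counters = [
        int(serial[8:])
        for serial in existing_serials
        if re.fullmatch(r"\d{10}", serial) and serial.startswith(prefix)
    ]
    counter = max(counters, default=0) + 1
    if counter > 99:
        raise ValueError(
            f"serial counter exhausted for {prefix}; set SERIAL manually"
        )
    return f"{prefix}{counter:02d}"


print(next_serial(["2026051501", "2026051601"], date(2026, 5, 16)))
# 2026051602
```

Serials from earlier days are ignored, so the first run of a new day always yields NN=01.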
Goal was to restrict AXFR to Hurricane Electric's five secondary
nameserver IPs. Tried several CoreDNS Corefile syntaxes:
    transfer { to 216.218.130.2 ... 216.66.1.2 }        # space-separated
    transfer { to 216.218.130.2 \n to 216.218.131.2 }   # multi-line
    transfer { to 216.218.130.2 }                       # single IP
    transfer { to * 216.218.130.2 ... }                 # mixed
Every form with a specific IPv4 address silently breaks server-block
startup — the auto plugin still loads zones into memory but the
:53/:443/:853 listeners never bind. Reproducible on coredns/coredns
1.11.3 AND 1.12.2 with the (common) snippet + auto + forward shape.
Only `to *` results in healthy listener startup.
Even if we got CoreDNS-side filtering to work, Docker's default
userland-proxy rewrites source IPs to the bridge gateway, which would
break IP-based filtering anyway short of `network_mode: host`.
Decision: keep `to *` in CoreDNS, push HE-only filtering to the
FortiWiFi firewall (source-IP-restricted VIP/DNAT for WAN:53/tcp).
This is correctly layered defense: the perimeter does the IP work
before packets ever reach dell01.