
Caddy SIP Guardian - Matt Holt Code Review

Date: 2025-12-24
Reviewer: Matt Holt (simulated)
Module Version: caddy-sip-guardian v0.1.0


Executive Summary

The SIP Guardian module demonstrates solid understanding of Caddy's module system and implements valuable SIP protection functionality. However, there are several critical architectural issues that violate Caddy best practices and could cause problems in production, especially around:

  1. Global mutable state (registry, metrics, feature flags)
  2. Resource leaks during config reloads
  3. Prometheus integration that doesn't follow Caddy patterns
  4. Performance anti-patterns (O(n²) sorting, inefficient string operations)

Severity Levels

  • 🔴 CRITICAL - Will cause panics, leaks, or data corruption
  • 🟠 HIGH - Violates Caddy best practices, will cause issues at scale
  • 🟡 MEDIUM - Suboptimal but functional, should be improved
  • 🔵 LOW - Nice-to-have improvements

1. Module Architecture & State Management

🔴 CRITICAL: Global Registry Without Cleanup Mechanism

File: registry.go:13-16

var (
	guardianRegistry = make(map[string]*SIPGuardian)
	registryMu       sync.RWMutex
)

Problem:

  • Global package-level map that grows unbounded across config reloads
  • When Caddy reloads config, new guardians are created but old ones stay in memory
  • Old guardians have running goroutines (cleanupLoop) that never stop
  • After 10 config reloads, you have 10 guardians with 10 cleanup goroutines

Impact: Memory leak, goroutine leak, eventual OOM in production

Fix: Caddy modules should be self-contained and not rely on global registries. Instead:

// Option 1: Use Caddy's app system
type SIPGuardianApp struct {
	guardians map[string]*SIPGuardian
	mu        sync.RWMutex
}

func (app *SIPGuardianApp) CaddyModule() caddy.ModuleInfo {
	return caddy.ModuleInfo{
		ID:  "sip_guardian",
		New: func() caddy.Module { return &SIPGuardianApp{guardians: make(map[string]*SIPGuardian)} },
	}
}

// Option 2: Look the app up from Caddy's context in the handler's Provision
func (h *SIPHandler) Provision(ctx caddy.Context) error {
	// Get the guardian from Caddy's app, not a global var.
	// ctx.App returns (any, error), so the result needs a type assertion.
	appIface, err := ctx.App("sip_guardian")
	if err != nil {
		return err
	}
	app := appIface.(*SIPGuardianApp)
	h.guardian = app.GetOrCreateGuardian("default", &h.SIPGuardian)
	return nil
}

Why this matters: Caddy can reload config every 5 minutes in some setups. This leak accumulates fast.
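
If removing the registry outright is not feasible right away, an interim mitigation is to deregister on teardown. A minimal sketch, assuming the global registry stays for now (the helper name is hypothetical):

func unregisterGuardian(name string) {
	registryMu.Lock()
	g, ok := guardianRegistry[name]
	delete(guardianRegistry, name)
	registryMu.Unlock()

	if ok {
		_ = g.Cleanup() // stops cleanupLoop and other background goroutines
	}
}

Calling this from the module's Cleanup at least stops each reload from leaking the previous instance; the real fix is still to move the state into a Caddy app.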


🔴 CRITICAL: Feature Flags as Global Mutable Variables

File: sipguardian.go:17-21

var (
	enableMetrics  = true
	enableWebhooks = true
	enableStorage  = true
)

Problem:

  • Mutable package-level globals shared across ALL guardian instances
  • Not thread-safe (no mutex protection)
  • Can't have different settings per guardian
  • Violates Caddy's philosophy of "config is everything"

Fix: Move to struct fields with proper configuration:

type SIPGuardian struct {
	// ... existing fields ...

	// Feature toggles (configurable per instance)
	EnableMetrics  bool `json:"enable_metrics,omitempty"`
	EnableWebhooks bool `json:"enable_webhooks,omitempty"`
	EnableStorage  bool `json:"enable_storage,omitempty"`
}

Or use build tags if these are compile-time options:

//go:build !nometrics

func (g *SIPGuardian) recordMetric() {
	// metrics code here
}

🟠 HIGH: Prometheus Metrics Use MustRegister

File: metrics.go:158-181

func RegisterMetrics() {
	if metricsRegistered {
		return
	}
	metricsRegistered = true

	prometheus.MustRegister(...)  // WILL PANIC on second call
}

Problem:

  • MustRegister panics if metrics already registered (e.g., during test runs, or if multiple modules register)
  • The metricsRegistered guard helps but is package-level state that doesn't reset
  • Not idempotent across tests or different Caddy instances in same process

Fix: Use Caddy's metrics integration or custom registry:

// Option 1: Use Caddy's admin metrics (recommended)
// https://caddyserver.com/docs/json/admin/config/metrics/

// Option 2: Custom registry per module instance
type SIPGuardian struct {
	metricsRegistry *prometheus.Registry
	// ... other fields ...
}

func (g *SIPGuardian) Provision(ctx caddy.Context) error {
	g.metricsRegistry = prometheus.NewRegistry()
	g.metricsRegistry.MustRegister(
		// your metrics
	)
	return nil
}

Reference: See how caddy-prometheus module integrates: https://github.com/caddyserver/caddy/blob/master/modules/metrics/metrics.go
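
If you go with Option 2, the per-instance registry also has to be exposed somewhere. A minimal sketch using promhttp (the handler method and the endpoint wiring are assumptions, not existing code):

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// metricsHandler serves only this guardian's registry, so nothing is
// registered globally and repeated provisioning cannot panic.
func (g *SIPGuardian) metricsHandler() http.Handler {
	return promhttp.HandlerFor(g.metricsRegistry, promhttp.HandlerOpts{})
}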


🟠 HIGH: mergeGuardianConfig Modifies Running Instance

File: registry.go:68-202

func mergeGuardianConfig(ctx caddy.Context, g *SIPGuardian, config *SIPGuardian) {
	g.mu.Lock()
	defer g.mu.Unlock()

	// Modifies MaxFailures, BanTime, whitelists while guardian is actively processing traffic
	if config.MaxFailures > 0 && config.MaxFailures != g.MaxFailures {
		g.MaxFailures = config.MaxFailures  // Race with RecordFailure()
	}
}

Problem:

  • Changes runtime behavior of actively-processing guardian
  • RecordFailure() reads g.MaxFailures without lock: if tracker.count >= g.MaxFailures (sipguardian.go:408)
  • Could miss bans or ban too aggressively during config merge

Fix: Caddy modules should be immutable once provisioned. Config changes should create NEW instances:

func GetOrCreateGuardianWithConfig(...) (*SIPGuardian, error) {
	registryMu.Lock()
	defer registryMu.Unlock()

	if g, exists := guardianRegistry[name]; exists {
		// Don't merge - existing instance is immutable
		// Log warning if configs differ
		if !configsMatch(g, config) {
			log.Warn("Cannot change guardian config after provision - config reload required")
		}
		return g, nil
	}

	// Create fresh instance
	g := createGuardian(config)
	guardianRegistry[name] = g
	return g, nil
}
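
configsMatch above is left undefined; a minimal sketch comparing only the user-facing fields (the field list is illustrative, not exhaustive):

import "reflect"

// configsMatch compares configuration, not runtime state, so a reload
// with an identical config is not reported as a conflict.
func configsMatch(a, b *SIPGuardian) bool {
	return a.MaxFailures == b.MaxFailures &&
		a.BanTime == b.BanTime &&
		a.FindTime == b.FindTime &&
		reflect.DeepEqual(a.WhitelistCIDR, b.WhitelistCIDR)
}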

2. Thread Safety & Concurrency

🟡 MEDIUM: RecordFailure Reads MaxFailures Without Lock

File: sipguardian.go:408

func (g *SIPGuardian) RecordFailure(ip, reason string) bool {
	// ...
	g.mu.Lock()
	defer g.mu.Unlock()
	// ...

	// Check if we should ban
	if tracker.count >= g.MaxFailures {  // RACE: g.MaxFailures can change during mergeGuardianConfig
		g.banIP(ip, reason)
		return true
	}
}

Problem:

  • g.MaxFailures read without lock protection
  • mergeGuardianConfig can modify it concurrently (registry.go:98-100)

Fix: Either:

  1. Make config immutable (recommended - see above)
  2. Or protect with RWMutex:
type SIPGuardian struct {
	// Separate locks for config vs runtime state
	configMu sync.RWMutex
	stateMu  sync.RWMutex

	// Config fields (protected by configMu)
	maxFailures int
	// Runtime fields (protected by stateMu)
	bannedIPs map[string]*BanEntry
}

func (g *SIPGuardian) RecordFailure(...) {
	g.configMu.RLock()
	maxFails := g.maxFailures
	g.configMu.RUnlock()

	g.stateMu.Lock()
	// ... use maxFails ...
	g.stateMu.Unlock()
}
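
A lighter-weight alternative, assuming only a few scalar settings ever change at runtime, is to hold them in atomic values so the hot path never blocks on configuration reads (the helper type is an assumption):

import "sync/atomic"

// guardianLimits holds runtime-tunable scalars; readers call get() on the
// hot path, writers call set() during provisioning or a config merge.
type guardianLimits struct {
	maxFailures atomic.Int64
}

func (l *guardianLimits) set(maxFailures int) { l.maxFailures.Store(int64(maxFailures)) }
func (l *guardianLimits) get() int            { return int(l.maxFailures.Load()) }

RecordFailure would then read limits.get() instead of g.MaxFailures, and mergeGuardianConfig (if it survives) would write through limits.set().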

🟡 MEDIUM: Storage Writes in Unbounded Goroutines

File: sipguardian.go:397-399, 464-468, 495-499

if g.storage != nil {
	go func() {
		g.storage.RecordFailure(ip, reason, nil)  // Fire and forget
	}()
}

Problem:

  • No limit on concurrent goroutines during attack (could spawn 100k goroutines)
  • No error handling if storage is closed
  • If storage write is slow, goroutines pile up

Fix: Use worker pool pattern:

type SIPGuardian struct {
	storageQueue chan storageTask
	storageWg    sync.WaitGroup
}

func (g *SIPGuardian) Provision(ctx caddy.Context) error {
	if g.storage != nil {
		g.storageQueue = make(chan storageTask, 1000)

		// Fixed pool of storage workers
		for i := 0; i < 4; i++ {
			g.storageWg.Add(1)
			go g.storageWorker()
		}
	}
	return nil
}

func (g *SIPGuardian) storageWorker() {
	defer g.storageWg.Done()
	for task := range g.storageQueue {
		task.execute(g.storage)
	}
}

func (g *SIPGuardian) Cleanup() error {
	if g.storageQueue != nil {
		close(g.storageQueue) // Stop accepting tasks
		g.storageWg.Wait()    // Wait for workers to finish
	}
	// ... rest of cleanup ...
	return nil
}
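
The sketch leaves storageTask and the enqueue side implicit. A minimal version (names are assumptions) drops writes instead of blocking the hot path when the queue is full:

// storageTask wraps one persistence call; the *Storage type name is an
// assumption based on InitStorage elsewhere in the module.
type storageTask struct {
	execute func(s *Storage)
}

func (g *SIPGuardian) enqueueStorage(task storageTask) {
	select {
	case g.storageQueue <- task:
	default:
		// Queue full (e.g. during an attack): drop the write rather than
		// block RecordFailure or spawn an unbounded goroutine.
	}
}

The fire-and-forget goroutine then becomes g.enqueueStorage(storageTask{execute: func(s *Storage) { s.RecordFailure(ip, reason, nil) }}).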

3. Performance Issues

🟠 HIGH: Bubble Sort for Eviction (O(n²))

File: sipguardian.go:635-641, 689-695

// Sort by time (oldest first)
for i := 0; i < len(entries)-1; i++ {
	for j := i + 1; j < len(entries); j++ {
		if entries[j].time.Before(entries[i].time) {
			entries[i], entries[j] = entries[j], entries[i]  // Bubble sort!
		}
	}
}

Problem:

  • O(n²) time complexity
  • For 100,000 entries (maxTrackedIPs), that is roughly 5 billion comparisons (n²/2)
  • Will lock the mutex for seconds during high load

Fix: Use stdlib sort.Slice():

import "sort"

func (g *SIPGuardian) evictOldestTrackers(count int) {
	type ipTime struct {
		ip   string
		time time.Time
	}

	entries := make([]ipTime, 0, len(g.failureCounts))
	for ip, tracker := range g.failureCounts {
		entries = append(entries, ipTime{ip: ip, time: tracker.firstSeen})
	}

	// O(n log n) sort
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].time.Before(entries[j].time)
	})

	// Evict oldest
	for i := 0; i < count && i < len(entries); i++ {
		delete(g.failureCounts, entries[i].ip)
	}
}

Impact: 100x-1000x faster for large maps
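
If eviction only ever removes a small count relative to the map size, a top-k selection with container/heap avoids sorting everything: O(n log count) instead of O(n log n). A sketch, assuming the ipTime type is hoisted to package level:

import (
	"container/heap"
	"time"
)

type ipTime struct {
	ip   string
	time time.Time
}

// newestFirst is a max-heap keyed on time: the root is the newest entry
// currently kept, i.e. the first candidate to be displaced.
type newestFirst []ipTime

func (h newestFirst) Len() int           { return len(h) }
func (h newestFirst) Less(i, j int) bool { return h[i].time.After(h[j].time) }
func (h newestFirst) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *newestFirst) Push(x any)        { *h = append(*h, x.(ipTime)) }
func (h *newestFirst) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// oldestTrackers returns the count oldest entries without a full sort.
func (g *SIPGuardian) oldestTrackers(count int) []ipTime {
	h := make(newestFirst, 0, count)
	for ip, tracker := range g.failureCounts {
		entry := ipTime{ip: ip, time: tracker.firstSeen}
		switch {
		case len(h) < count:
			heap.Push(&h, entry)
		case entry.time.Before(h[0].time):
			h[0] = entry
			heap.Fix(&h, 0)
		}
	}
	return h
}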


🟡 MEDIUM: String.ToLower() Allocates on Hot Path

File: l4handler.go:351

func detectSuspiciousPattern(data []byte) string {
	lower := strings.ToLower(string(data))  // Allocates new string + ToLower allocates again

	for _, def := range suspiciousPatternDefs {
		if strings.Contains(lower, def.pattern) {
			return def.name
		}
	}
	return ""
}

Problem:

  • Called on EVERY SIP message
  • string(data) allocates (copies 4KB)
  • ToLower() allocates another 4KB
  • Under load (10k msg/sec), this is 80MB/sec of garbage

Fix: Use bytes package or case-insensitive search:

func detectSuspiciousPattern(data []byte) string {
	// Work on []byte directly: one ToLower allocation, no string copy
	lower := bytes.ToLower(data)
	for _, def := range suspiciousPatternDefs {
		if bytes.Contains(lower, []byte(def.pattern)) {
			return def.name
		}
	}
	return ""
}

// Or for better performance, pre-compile lowercase patterns:
var suspiciousPatternDefs = []struct {
	name    string
	pattern []byte  // lowercase pattern
}{
	{"sipvicious", []byte("sipvicious")},
	// ...
}

func detectSuspiciousPattern(data []byte) string {
	lower := bytes.ToLower(data)  // Single allocation
	for _, def := range suspiciousPatternDefs {
		if bytes.Contains(lower, def.pattern) {
			return def.name
		}
	}
	return ""
}
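
To eliminate the remaining per-message ToLower allocation entirely, a small ASCII case-folding search is enough for these patterns (sketch; asciiContainsFold is a new helper, not existing code):

// asciiContainsFold reports whether sub (already lowercase ASCII) occurs
// in data, ignoring ASCII case, with zero allocations.
func asciiContainsFold(data, sub []byte) bool {
	if len(sub) == 0 {
		return true
	}
	for i := 0; i+len(sub) <= len(data); i++ {
		match := true
		for j := 0; j < len(sub); j++ {
			c := data[i+j]
			if 'A' <= c && c <= 'Z' {
				c += 'a' - 'A'
			}
			if c != sub[j] {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

detectSuspiciousPattern then becomes a loop over suspiciousPatternDefs calling asciiContainsFold(data, def.pattern), with no copies at all.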

🔵 LOW: min() Function Defined Locally

File: l4handler.go:367-372

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

Problem: Go 1.21+ has a builtin min() function, and your go.mod declares go 1.25, so this local helper is redundant.

Fix:

// Delete the function, use builtin
sample := buf[:min(64, len(buf))]

4. Resource Management & Lifecycle

🟠 HIGH: Cleanup() Doesn't Respect Context

File: sipguardian.go:891-932

func (g *SIPGuardian) Cleanup() error {
	// ...
	select {
	case <-done:
		g.logger.Debug("All goroutines stopped cleanly")
	case <-time.After(5 * time.Second):  // Hardcoded timeout
		g.logger.Warn("Timeout waiting for goroutines to stop")
	}
}

Problem:

  • Caddy's shutdown might have different timeout
  • Should respect caddy.Context shutdown signal

Fix:

func (g *SIPGuardian) Cleanup() error {
	// Use the context from Provision or a configurable timeout
	ctx, cancel := context.WithTimeout(context.Background(), g.shutdownTimeout())
	defer cancel()

	// Signal stop
	close(g.stopCh)

	// Wait with context
	done := make(chan struct{})
	go func() {
		g.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		g.logger.Debug("Cleanup completed")
	case <-ctx.Done():
		g.logger.Warn("Cleanup timed out", zap.Duration("timeout", g.shutdownTimeout()))
		return ctx.Err()
	}

	// ... rest of cleanup ...
	return nil
}

func (g *SIPGuardian) shutdownTimeout() time.Duration {
	if g.CleanupTimeout > 0 {
		return time.Duration(g.CleanupTimeout)
	}
	return 10 * time.Second  // reasonable default
}
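
shutdownTimeout() assumes a CleanupTimeout config field that does not exist yet; a minimal addition (the field name is a suggestion):

type SIPGuardian struct {
	// ... existing fields ...

	// CleanupTimeout bounds how long Cleanup waits for goroutines to stop.
	CleanupTimeout caddy.Duration `json:"cleanup_timeout,omitempty"`
}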

🟡 MEDIUM: DNS Whitelist Doesn't Stop on Cleanup

File: sipguardian.go:900-902

func (g *SIPGuardian) Cleanup() error {
	// Stop DNS whitelist background refresh
	if g.dnsWhitelist != nil {
		g.dnsWhitelist.Stop()  // Is this method implemented? Does it block?
	}
}

Question: Does dnsWhitelist.Stop() wait for goroutines to finish? Should add to WaitGroup.

Fix:

// In Provision
if g.dnsWhitelist != nil {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		g.dnsWhitelist.Run(g.stopCh)  // Respect stopCh
	}()
}

// Cleanup will automatically wait via g.wg.Wait()
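
For that to work, the whitelist needs a Run loop that actually watches stopCh. A hypothetical sketch (Run, refresh, the type name, and the interval are all assumptions about the dnsWhitelist implementation):

func (d *DNSWhitelist) Run(stopCh <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			d.refresh() // hypothetical: re-resolve whitelisted hostnames
		case <-stopCh:
			return // lets g.wg.Wait() in Cleanup return promptly
		}
	}
}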

5. Configuration & Validation

🟡 MEDIUM: Validate() Ranges Too Restrictive

File: sipguardian.go:936-966

func (g *SIPGuardian) Validate() error {
	if g.MaxFailures > 1000 {
		return fmt.Errorf("max_failures exceeds reasonable limit (1000), got %d", g.MaxFailures)
	}

	if time.Duration(g.FindTime) > 24*time.Hour {
		return fmt.Errorf("find_time exceeds reasonable limit (24h), got %v", time.Duration(g.FindTime))
	}
}

Question: Why are these "reasonable limits" hardcoded?

  • Some users might want max_failures = 2 for high-security environments
  • Some might want find_time = 7d for long-term tracking

Fix: Either document WHY these limits exist, or make them configurable:

const (
	maxFailuresLimit = 10000  // Prevent integer overflow in counter logic
	maxFindTimeLimit = 30 * 24 * time.Hour  // 30 days max for memory reasons
)

func (g *SIPGuardian) Validate() error {
	if g.MaxFailures < 1 || g.MaxFailures > maxFailuresLimit {
		return fmt.Errorf("max_failures must be 1-%d, got %d", maxFailuresLimit, g.MaxFailures)
	}
	// ... etc
}

🔵 LOW: UnmarshalCaddyfile Doesn't Validate Inline

File: sipguardian.go:746-889

Observation: Parsing and validation are separate (Validate called after Provision). This is fine per Caddy conventions, but some basic validation could happen during unmarshaling to give better error messages.

Example:

case "max_failures":
	// ... parse ...
	if val < 1 {
		return d.Errf("max_failures must be positive, got %d", val)
	}
	g.MaxFailures = val

6. Error Handling & Robustness

🟡 MEDIUM: Inconsistent Error Handling in Provision

File: sipguardian.go:103-241

func (g *SIPGuardian) Provision(ctx caddy.Context) error {
	// Parse whitelist CIDRs
	for _, cidr := range g.WhitelistCIDR {
		_, network, err := net.ParseCIDR(cidr)
		if err != nil {
			return fmt.Errorf("invalid whitelist CIDR %s: %v", cidr, err)  // FATAL
		}
		g.whitelistNets = append(g.whitelistNets, network)
	}

	// Initialize storage
	if enableStorage && g.StoragePath != "" {
		storage, err := InitStorage(g.logger, StorageConfig{Path: g.StoragePath})
		if err != nil {
			g.logger.Warn("Failed to initialize storage, continuing without persistence",
				zap.Error(err),  // WARNING - continues
			)
		}
	}
}

Problem: Why is invalid CIDR fatal but storage failure is just a warning?

Fix: Be consistent. Either:

  1. Make optional features non-fatal (recommended for flexibility)
  2. Or make all failures fatal (strict validation)

Document in config:

sip_guardian {
	whitelist 10.0.0/8  # Invalid - Provision will FAIL
	storage /bad/path   # Invalid - Provision will WARN and continue
}
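
One way to make the split explicit in code (helper name is an assumption): required settings return errors from Provision, while optional subsystems go through a wrapper that downgrades failures to warnings.

// provisionOptional runs an optional subsystem's init and logs a warning
// instead of failing Provision, keeping the fatal/non-fatal split explicit.
func (g *SIPGuardian) provisionOptional(feature string, init func() error) {
	if err := init(); err != nil {
		g.logger.Warn("optional feature disabled",
			zap.String("feature", feature),
			zap.Error(err),
		)
	}
}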

🟡 MEDIUM: BanIP Creates Fake Failure Tracker

File: sipguardian.go:968-991

func (g *SIPGuardian) BanIP(ip, reason string) {
	// ...
	if _, exists := g.failureCounts[ip]; !exists {
		g.failureCounts[ip] = &failureTracker{
			count:     g.MaxFailures,  // Fake failures
			firstSeen: time.Now(),
			lastSeen:  time.Now(),
		}
	}
	g.banIP(ip, reason)
}

Problem: Why does manually banning an IP create failure tracking data? This pollutes metrics and storage.

Fix: Separate manual bans from automatic bans:

type BanEntry struct {
	IP        string
	Reason    string
	Source    string  // "automatic", "manual", "api"
	BannedAt  time.Time
	ExpiresAt time.Time
	HitCount  int  // 0 for manual bans
}

func (g *SIPGuardian) BanIP(ip, reason string) {
	// Don't create fake failure tracker
	entry := &BanEntry{
		IP:        ip,
		Source:    "manual",
		Reason:    reason,
		BannedAt:  time.Now(),
		ExpiresAt: time.Now().Add(time.Duration(g.BanTime)),
		HitCount:  0,  // No failures recorded
	}
	g.bannedIPs[ip] = entry
}

7. API Design

🟡 MEDIUM: Admin Handler URL Routing is Fragile

File: admin.go:56-70

func (h *AdminHandler) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {
	path := r.URL.Path

	switch {
	case strings.HasSuffix(path, "/bans"):  // Matches "/foo/bans" too
		return h.handleBans(w, r)
	case strings.Contains(path, "/unban/"):  // Matches "/foo/unban/bar/baz"
		return h.handleUnban(w, r, path)
	}
}

Problem:

  • strings.HasSuffix matches unintended paths
  • Path extraction is manual and error-prone

Fix: Use proper HTTP router or at least exact matching:

// Option 1: Use http.ServeMux patterns (Go 1.22+)
mux := http.NewServeMux()
mux.HandleFunc("GET /api/sip-guardian/bans", h.handleBans)
mux.HandleFunc("POST /api/sip-guardian/ban/{ip}", h.handleBan)
mux.HandleFunc("DELETE /api/sip-guardian/unban/{ip}", h.handleUnban)

// Option 2: Exact prefix matching
switch {
case r.URL.Path == "/api/sip-guardian/bans":
	return h.handleBans(w, r)
case strings.HasPrefix(r.URL.Path, "/api/sip-guardian/ban/"):
	ip := strings.TrimPrefix(r.URL.Path, "/api/sip-guardian/ban/")
	ip = strings.TrimSuffix(ip, "/")
	return h.handleBan(w, r, ip)
}

🔵 LOW: No IP Validation in Admin Endpoints

File: admin.go:196, 210

ip := strings.TrimSuffix(parts[1], "/")

// Use public BanIP method for proper encapsulation
h.guardian.BanIP(ip, body.Reason)  // What if ip = "not-an-ip"?

Fix:

ip := strings.TrimSuffix(parts[1], "/")
if net.ParseIP(ip) == nil {
	http.Error(w, "Invalid IP address", http.StatusBadRequest)
	return nil
}
h.guardian.BanIP(ip, body.Reason)

8. Testing Observations

🟡 MEDIUM: Tests Use Global State

Observation: Since metrics and registry are global, tests might interfere with each other.

Recommendation: Add test helpers to reset state:

// In sipguardian_test.go
func resetGlobalState() {
	registryMu.Lock()
	guardianRegistry = make(map[string]*SIPGuardian)
	registryMu.Unlock()

	metricsRegistered = false
	// ... reset other globals ...
}

func TestSomething(t *testing.T) {
	t.Cleanup(resetGlobalState)
	// ... test code ...
}

Summary of Recommendations

Must Fix (🔴 Critical)

  1. Eliminate global registry - use Caddy app system or context storage
  2. Remove global feature flags - make them config options
  3. Fix Prometheus integration - use custom registry or Caddy's metrics

Should Fix (🟠 High)

  1. Make configs immutable after provisioning
  2. Fix bubble sort - use sort.Slice()
  3. Bound storage goroutines - use worker pool

Nice to Have (🟡 Medium)

  1. Add context to Cleanup timeout
  2. Validate IPs in admin API
  3. Fix string allocation in hot path
  4. Separate manual vs automatic bans

Polish (🔵 Low)

  1. Remove custom min() function
  2. Improve URL routing in admin handler
  3. Add inline validation in UnmarshalCaddyfile

Positive Observations 👍

  1. Excellent Cleanup() implementation - proper WaitGroup usage, timeout handling
  2. Good Validate() implementation - validates config before use
  3. Interface guards - ensure compile-time type checking
  4. Comprehensive feature set - GeoIP, DNS whitelist, enumeration detection, etc.
  5. Good logging - structured logging with zap
  6. Test coverage - multiple test files covering features

Final Verdict

This is solid work with good understanding of Caddy modules. The core functionality is sound, but the global state issues are critical and must be addressed before production use. Once those are fixed, this would be a great module for the Caddy community.

The main pattern to change is: Think per-instance, not global. Every piece of state should live on the struct, not in package-level vars.


Estimated effort to fix critical issues: 4-6 hours
Estimated effort for all recommendations: 8-12 hours

Would you like me to prioritize any specific issues or help implement fixes?