# Caddy SIP Guardian - Matt Holt Code Review

**Date:** 2025-12-24
**Reviewer:** Matt Holt (simulated)
**Module Version:** caddy-sip-guardian v0.1.0

---

## Executive Summary

The SIP Guardian module demonstrates a solid understanding of Caddy's module system and implements valuable SIP protection functionality. However, there are several **critical architectural issues** that violate Caddy best practices and could cause problems in production, especially around:

1. **Global mutable state** (registry, metrics, feature flags)
2. **Resource leaks** during config reloads
3. **Prometheus integration** that doesn't follow Caddy patterns
4. **Performance anti-patterns** (O(n²) sorting, inefficient string operations)

### Severity Levels

- šŸ”“ **CRITICAL** - Will cause panics, leaks, or data corruption
- 🟠 **HIGH** - Violates Caddy best practices, will cause issues at scale
- 🟔 **MEDIUM** - Suboptimal but functional, should be improved
- šŸ”µ **LOW** - Nice-to-have improvements

---

## 1. Module Architecture & State Management

### šŸ”“ CRITICAL: Global Registry Without Cleanup Mechanism

**File:** `registry.go:13-16`

```go
var (
    guardianRegistry = make(map[string]*SIPGuardian)
    registryMu       sync.RWMutex
)
```

**Problem:**

- Global package-level map that grows unbounded across config reloads
- When Caddy reloads config, new guardians are created but old ones stay in memory
- Old guardians have running goroutines (cleanupLoop) that never stop
- After 10 config reloads, you have 10 guardians with 10 cleanup goroutines

**Impact:** Memory leak, goroutine leak, eventual OOM in production

**Fix:** Caddy modules should be **self-contained** and not rely on global registries. Instead:

```go
// Option 1: Use Caddy's app system
type SIPGuardianApp struct {
    guardians map[string]*SIPGuardian
    mu        sync.RWMutex
}

func (app *SIPGuardianApp) CaddyModule() caddy.ModuleInfo {
    return caddy.ModuleInfo{
        ID:  "sip_guardian",
        New: func() caddy.Module { return &SIPGuardianApp{guardians: make(map[string]*SIPGuardian)} },
    }
}

// Option 2: Look the app up via ctx.App() so state lives in Caddy's lifecycle.
// ctx.App takes the app's module ID and returns it as `any`.
func (h *SIPHandler) Provision(ctx caddy.Context) error {
    // Get the guardian from Caddy's app, not a global var
    appIface, err := ctx.App("sip_guardian")
    if err != nil {
        return err
    }
    app := appIface.(*SIPGuardianApp)
    h.guardian = app.GetOrCreateGuardian("default", &h.SIPGuardian)
    return nil
}
```

**Why this matters:** Caddy can reload config every 5 minutes in some setups. This leak accumulates fast.
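For Option 1 to actually fix the reload leak, the app should own the guardians' lifecycle by implementing Caddy's standard `caddy.App` interface. A minimal sketch; the `Cleanup()` call is the module's existing method, everything else is illustrative:

```go
// Start/Stop make SIPGuardianApp a caddy.App. Stop runs when this config is
// unloaded (including every reload), which is exactly where the current code
// leaks guardians and their cleanupLoop goroutines.
func (app *SIPGuardianApp) Start() error { return nil }

func (app *SIPGuardianApp) Stop() error {
    app.mu.Lock()
    defer app.mu.Unlock()
    var firstErr error
    for _, g := range app.guardians {
        if err := g.Cleanup(); err != nil && firstErr == nil {
            firstErr = err
        }
    }
    app.guardians = nil
    return firstErr
}

// Interface guard
var _ caddy.App = (*SIPGuardianApp)(nil)
```

With this in place, each loaded config gets its own app instance, and the old instance's `Stop` runs when the previous config is unloaded, so nothing outlives a reload.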
---

### šŸ”“ CRITICAL: Feature Flags as Global Mutable Variables

**File:** `sipguardian.go:17-21`

```go
var (
    enableMetrics  = true
    enableWebhooks = true
    enableStorage  = true
)
```

**Problem:**

- Mutable package-level globals shared across ALL guardian instances
- Not thread-safe (no mutex protection)
- Can't have different settings per guardian
- Violates Caddy's philosophy of "config is everything"

**Fix:** Move to struct fields with proper configuration:

```go
type SIPGuardian struct {
    // ... existing fields ...

    // Feature toggles (configurable per instance)
    EnableMetrics  bool `json:"enable_metrics,omitempty"`
    EnableWebhooks bool `json:"enable_webhooks,omitempty"`
    EnableStorage  bool `json:"enable_storage,omitempty"`
}
```

Or use build tags if these are compile-time options:

```go
//go:build !nometrics
// (file-level tag: the whole file is skipped when building with -tags nometrics)

func (g *SIPGuardian) recordMetric() {
    // metrics code here
}
```

---

### 🟠 HIGH: Prometheus Metrics Use MustRegister

**File:** `metrics.go:158-181`

```go
func RegisterMetrics() {
    if metricsRegistered {
        return
    }
    metricsRegistered = true
    prometheus.MustRegister(...) // WILL PANIC on second call
}
```

**Problem:**

- `MustRegister` panics if the metrics are already registered (e.g., during test runs, or if multiple modules register)
- The `metricsRegistered` guard helps, but it is package-level state that doesn't reset
- Not idempotent across tests or different Caddy instances in the same process

**Fix:** Use Caddy's metrics integration or a custom registry:

```go
// Option 1: Use Caddy's admin metrics (recommended)
// https://caddyserver.com/docs/json/admin/config/metrics/

// Option 2: Custom registry per module instance
type SIPGuardian struct {
    metricsRegistry *prometheus.Registry
    // ... other fields ...
}

func (g *SIPGuardian) Provision(ctx caddy.Context) error {
    g.metricsRegistry = prometheus.NewRegistry()
    g.metricsRegistry.MustRegister(
        // your metrics
    )
    return nil
}
```

**Reference:** See how Caddy's built-in metrics module integrates: https://github.com/caddyserver/caddy/blob/master/modules/metrics/metrics.go
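If Option 2 is chosen, the per-instance registry also needs an exposition path, since its collectors no longer appear in the default global registry that `/metrics` scrapers normally see. A sketch using `promhttp`; routing it through the module's admin handler is an assumption, not existing code:

```go
import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Sketch: expose the guardian's own registry, e.g. behind the admin handler
// from Section 7, so each instance's metrics stay isolated.
func (h *AdminHandler) metricsHandler() http.Handler {
    return promhttp.HandlerFor(h.guardian.metricsRegistry, promhttp.HandlerOpts{})
}
```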
---

### 🟠 HIGH: mergeGuardianConfig Modifies Running Instance

**File:** `registry.go:68-202`

```go
func mergeGuardianConfig(ctx caddy.Context, g *SIPGuardian, config *SIPGuardian) {
    g.mu.Lock()
    defer g.mu.Unlock()

    // Modifies MaxFailures, BanTime, whitelists while the guardian is actively processing traffic
    if config.MaxFailures > 0 && config.MaxFailures != g.MaxFailures {
        g.MaxFailures = config.MaxFailures // Race with RecordFailure()
    }
}
```

**Problem:**

- Changes the runtime behavior of an actively-processing guardian
- `RecordFailure()` reads `g.MaxFailures` without a lock: `if tracker.count >= g.MaxFailures` (sipguardian.go:408)
- Could miss bans or ban too aggressively during a config merge

**Fix:** Caddy modules should be immutable once provisioned. Config changes should create NEW instances:

```go
func GetOrCreateGuardianWithConfig(...) (*SIPGuardian, error) {
    registryMu.Lock()
    defer registryMu.Unlock()

    if g, exists := guardianRegistry[name]; exists {
        // Don't merge - the existing instance is immutable.
        // Log a warning if the configs differ.
        if !configsMatch(g, config) {
            log.Warn("Cannot change guardian config after provision - config reload required")
        }
        return g, nil
    }

    // Create a fresh instance
    g := createGuardian(config)
    guardianRegistry[name] = g
    return g, nil
}
```

---

## 2. Thread Safety & Concurrency

### 🟔 MEDIUM: RecordFailure Reads MaxFailures Without Lock

**File:** `sipguardian.go:408`

```go
func (g *SIPGuardian) RecordFailure(ip, reason string) bool {
    // ...
    g.mu.Lock()
    defer g.mu.Unlock()
    // ...

    // Check if we should ban
    if tracker.count >= g.MaxFailures { // RACE: g.MaxFailures can change during mergeGuardianConfig
        g.banIP(ip, reason)
        return true
    }
}
```

**Problem:**

- `g.MaxFailures` is read without lock protection
- `mergeGuardianConfig` can modify it concurrently (registry.go:98-100)

**Fix:** Either:

1. Make config immutable (recommended - see above)
2. Or protect it with an RWMutex:

```go
type SIPGuardian struct {
    // Separate locks for config vs runtime state
    configMu sync.RWMutex
    stateMu  sync.RWMutex

    // Config fields (protected by configMu)
    maxFailures int

    // Runtime fields (protected by stateMu)
    bannedIPs map[string]*BanEntry
}

func (g *SIPGuardian) RecordFailure(...) {
    g.configMu.RLock()
    maxFails := g.maxFailures
    g.configMu.RUnlock()

    g.stateMu.Lock()
    // ... use maxFails ...
    g.stateMu.Unlock()
}
```

---

### 🟔 MEDIUM: Storage Writes in Unbounded Goroutines

**File:** `sipguardian.go:397-399, 464-468, 495-499`

```go
if g.storage != nil {
    go func() {
        g.storage.RecordFailure(ip, reason, nil) // Fire and forget
    }()
}
```

**Problem:**

- No limit on concurrent goroutines during an attack (could spawn 100k goroutines)
- No error handling if the storage is closed
- If storage writes are slow, goroutines pile up

**Fix:** Use a worker pool pattern:

```go
type SIPGuardian struct {
    storageQueue chan storageTask
    storageWg    sync.WaitGroup
}

func (g *SIPGuardian) Provision(ctx caddy.Context) error {
    if g.storage != nil {
        g.storageQueue = make(chan storageTask, 1000)
        // Fixed pool of storage workers
        for i := 0; i < 4; i++ {
            g.storageWg.Add(1)
            go g.storageWorker()
        }
    }
    return nil
}

func (g *SIPGuardian) storageWorker() {
    defer g.storageWg.Done()
    for task := range g.storageQueue {
        task.execute(g.storage)
    }
}

func (g *SIPGuardian) Cleanup() error {
    close(g.storageQueue) // Stop accepting tasks
    g.storageWg.Wait()    // Wait for workers to finish
    // ... rest of cleanup ...
}
```

---

## 3. Performance Issues

### 🟠 HIGH: Bubble Sort for Eviction (O(n²))

**File:** `sipguardian.go:635-641, 689-695`

```go
// Sort by time (oldest first)
for i := 0; i < len(entries)-1; i++ {
    for j := i + 1; j < len(entries); j++ {
        if entries[j].time.Before(entries[i].time) {
            entries[i], entries[j] = entries[j], entries[i] // Bubble sort!
        }
    }
}
```

**Problem:**

- O(n²) time complexity
- For 100,000 entries (maxTrackedIPs), this is on the order of 10 billion comparisons
- Will hold the mutex for seconds during high load

**Fix:** Use the stdlib `sort.Slice()`:

```go
import "sort"

func (g *SIPGuardian) evictOldestTrackers(count int) {
    type ipTime struct {
        ip   string
        time time.Time
    }

    entries := make([]ipTime, 0, len(g.failureCounts))
    for ip, tracker := range g.failureCounts {
        entries = append(entries, ipTime{ip: ip, time: tracker.firstSeen})
    }

    // O(n log n) sort
    sort.Slice(entries, func(i, j int) bool {
        return entries[i].time.Before(entries[j].time)
    })

    // Evict the oldest
    for i := 0; i < count && i < len(entries); i++ {
        delete(g.failureCounts, entries[i].ip)
    }
}
```

**Impact:** 100x-1000x faster for large maps

---

### 🟔 MEDIUM: strings.ToLower() Allocates on Hot Path

**File:** `l4handler.go:351`

```go
func detectSuspiciousPattern(data []byte) string {
    lower := strings.ToLower(string(data)) // string(data) allocates, then ToLower allocates again
    for _, def := range suspiciousPatternDefs {
        if strings.Contains(lower, def.pattern) {
            return def.name
        }
    }
    return ""
}
```

**Problem:**

- Called on EVERY SIP message
- `string(data)` allocates (copies 4KB)
- `ToLower()` allocates another 4KB
- Under load (10k msg/sec), this is 80MB/sec of garbage

**Fix:** Use the `bytes` package for a case-insensitive search:

```go
func detectSuspiciousPattern(data []byte) string {
    for _, def := range suspiciousPatternDefs {
        // Avoids the string(data) copy; bytes.ToLower still allocates,
        // and doing it inside the loop repeats that cost per pattern
        if bytes.Contains(bytes.ToLower(data), []byte(def.pattern)) {
            return def.name
        }
    }
    return ""
}

// Or, for better performance, pre-compile lowercase patterns and lowercase the input once:
var suspiciousPatternDefs = []struct {
    name    string
    pattern []byte // lowercase pattern
}{
    {"sipvicious", []byte("sipvicious")},
    // ...
}

func detectSuspiciousPattern(data []byte) string {
    lower := bytes.ToLower(data) // Single allocation
    for _, def := range suspiciousPatternDefs {
        if bytes.Contains(lower, def.pattern) {
            return def.name
        }
    }
    return ""
}
```
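Whichever variant lands, it's worth pinning the allocation behavior with a benchmark so this hot path doesn't regress. A minimal sketch; the sample payload is illustrative:

```go
import "testing"

func BenchmarkDetectSuspiciousPattern(b *testing.B) {
    // Typical scanner probe; "friendly-scanner" is the classic SIPVicious user agent.
    msg := []byte("OPTIONS sip:100@10.0.0.1 SIP/2.0\r\nUser-Agent: friendly-scanner\r\n\r\n")
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = detectSuspiciousPattern(msg)
    }
}
```

Running it with `-benchmem` before and after the change makes the allocs/op difference visible in CI.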
---

### šŸ”µ LOW: min() Function Defined Locally

**File:** `l4handler.go:367-372`

```go
func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}
```

**Problem:** Go 1.21+ has a builtin `min()` function. Your `go.mod` says `go 1.25`.

**Fix:**

```go
// Delete the function, use the builtin
sample := buf[:min(64, len(buf))]
```

---

## 4. Resource Management & Lifecycle

### 🟠 HIGH: Cleanup() Doesn't Respect Context

**File:** `sipguardian.go:891-932`

```go
func (g *SIPGuardian) Cleanup() error {
    // ...
    select {
    case <-done:
        g.logger.Debug("All goroutines stopped cleanly")
    case <-time.After(5 * time.Second): // Hardcoded timeout
        g.logger.Warn("Timeout waiting for goroutines to stop")
    }
}
```

**Problem:**

- Caddy's shutdown might have a different timeout
- Should respect the `caddy.Context` shutdown signal

**Fix:**

```go
func (g *SIPGuardian) Cleanup() error {
    // Use the context from Provision or a configurable timeout
    ctx, cancel := context.WithTimeout(context.Background(), g.shutdownTimeout())
    defer cancel()

    // Signal stop
    close(g.stopCh)

    // Wait with context
    done := make(chan struct{})
    go func() {
        g.wg.Wait()
        close(done)
    }()

    select {
    case <-done:
        g.logger.Debug("Cleanup completed")
    case <-ctx.Done():
        g.logger.Warn("Cleanup timed out", zap.Duration("timeout", g.shutdownTimeout()))
        return ctx.Err()
    }

    // ... rest of cleanup ...
    return nil
}

func (g *SIPGuardian) shutdownTimeout() time.Duration {
    if g.CleanupTimeout > 0 {
        return time.Duration(g.CleanupTimeout)
    }
    return 10 * time.Second // reasonable default
}
```
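For completeness, the `CleanupTimeout` referenced above could be a `caddy.Duration`, so users can set it as a human-readable string in JSON or Caddyfile config. A sketch; the field and directive names are assumed to match the fix above:

```go
// New config field for the timeout used by shutdownTimeout() above.
type SIPGuardian struct {
    // ... existing fields ...

    // How long Cleanup waits for background goroutines before giving up.
    CleanupTimeout caddy.Duration `json:"cleanup_timeout,omitempty"`
}

// And in UnmarshalCaddyfile:
case "cleanup_timeout":
    if !d.NextArg() {
        return d.ArgErr()
    }
    dur, err := caddy.ParseDuration(d.Val()) // accepts "30s", "2m", "1d", ...
    if err != nil {
        return d.Errf("invalid cleanup_timeout: %v", err)
    }
    g.CleanupTimeout = caddy.Duration(dur)
```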
---

### 🟔 MEDIUM: DNS Whitelist Doesn't Stop on Cleanup

**File:** `sipguardian.go:900-902`

```go
func (g *SIPGuardian) Cleanup() error {
    // Stop DNS whitelist background refresh
    if g.dnsWhitelist != nil {
        g.dnsWhitelist.Stop() // Is this method implemented? Does it block?
    }
}
```

**Question:** Does `dnsWhitelist.Stop()` wait for its goroutines to finish? It should be added to the WaitGroup.

**Fix:**

```go
// In Provision
if g.dnsWhitelist != nil {
    g.wg.Add(1)
    go func() {
        defer g.wg.Done()
        g.dnsWhitelist.Run(g.stopCh) // Respect stopCh
    }()
}

// Cleanup will automatically wait via g.wg.Wait()
```

---

## 5. Configuration & Validation

### 🟔 MEDIUM: Validate() Ranges Too Restrictive

**File:** `sipguardian.go:936-966`

```go
func (g *SIPGuardian) Validate() error {
    if g.MaxFailures > 1000 {
        return fmt.Errorf("max_failures exceeds reasonable limit (1000), got %d", g.MaxFailures)
    }
    if time.Duration(g.FindTime) > 24*time.Hour {
        return fmt.Errorf("find_time exceeds reasonable limit (24h), got %v", time.Duration(g.FindTime))
    }
}
```

**Question:** Why are these "reasonable limits" hardcoded?

- Some users might want `max_failures = 2` for high-security environments
- Some might want `find_time = 7d` for long-term tracking

**Fix:** Either document WHY these limits exist, or make them configurable:

```go
const (
    maxFailuresLimit = 10000               // Prevent integer overflow in counter logic
    maxFindTimeLimit = 30 * 24 * time.Hour // 30 days max for memory reasons
)

func (g *SIPGuardian) Validate() error {
    if g.MaxFailures < 1 || g.MaxFailures > maxFailuresLimit {
        return fmt.Errorf("max_failures must be 1-%d, got %d", maxFailuresLimit, g.MaxFailures)
    }
    // ... etc
}
```

---

### šŸ”µ LOW: UnmarshalCaddyfile Doesn't Validate Inline

**File:** `sipguardian.go:746-889`

**Observation:** Parsing and validation are separate (Validate is called after Provision). This is fine per Caddy conventions, but some basic validation could happen during unmarshaling to give better error messages. Example:

```go
case "max_failures":
    // ... parse ...
    if val < 1 {
        return d.Errf("max_failures must be positive, got %d", val)
    }
    g.MaxFailures = val
```

---

## 6. Error Handling & Robustness

### 🟔 MEDIUM: Inconsistent Error Handling in Provision

**File:** `sipguardian.go:103-241`

```go
func (g *SIPGuardian) Provision(ctx caddy.Context) error {
    // Parse whitelist CIDRs
    for _, cidr := range g.WhitelistCIDR {
        _, network, err := net.ParseCIDR(cidr)
        if err != nil {
            return fmt.Errorf("invalid whitelist CIDR %s: %v", cidr, err) // FATAL
        }
        g.whitelistNets = append(g.whitelistNets, network)
    }

    // Initialize storage
    if enableStorage && g.StoragePath != "" {
        storage, err := InitStorage(g.logger, StorageConfig{Path: g.StoragePath})
        if err != nil {
            g.logger.Warn("Failed to initialize storage, continuing without persistence",
                zap.Error(err), // WARNING - continues
            )
        }
    }
}
```

**Problem:** Why is an invalid CIDR fatal while a storage failure is just a warning?

**Fix:** Be consistent. Either:

1. Make optional features non-fatal (recommended for flexibility)
2. Or make all failures fatal (strict validation)

Document the behavior in the config docs:

```caddyfile
sip_guardian {
    whitelist 10.0.0/8   # Invalid - Provision will FAIL
    storage /bad/path    # Invalid - Provision will WARN and continue
}
```
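One way to keep the policy consistent (and in one place) is a small helper for optional subsystems, so whether a failure is fatal becomes an explicit choice per call site or even a config option. A sketch; the helper and the `strict` flag are hypothetical, not existing code:

```go
// initOptional runs an optional subsystem's initializer and applies one
// consistent failure policy: fatal when strict, logged warning otherwise.
func (g *SIPGuardian) initOptional(name string, strict bool, init func() error) error {
    err := init()
    if err == nil {
        return nil
    }
    if strict {
        return fmt.Errorf("initializing %s: %w", name, err)
    }
    g.logger.Warn("optional subsystem disabled",
        zap.String("subsystem", name),
        zap.Error(err),
    )
    return nil
}
```

Provision would then call `g.initOptional("storage", false, ...)` for storage, GeoIP, and webhooks, while the whitelist parsing stays fatal because it is not optional.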
---

### 🟔 MEDIUM: BanIP Creates Fake Failure Tracker

**File:** `sipguardian.go:968-991`

```go
func (g *SIPGuardian) BanIP(ip, reason string) {
    // ...
    if _, exists := g.failureCounts[ip]; !exists {
        g.failureCounts[ip] = &failureTracker{
            count:     g.MaxFailures, // Fake failures
            firstSeen: time.Now(),
            lastSeen:  time.Now(),
        }
    }
    g.banIP(ip, reason)
}
```

**Problem:** Why does manually banning an IP create failure-tracking data? This pollutes metrics and storage.

**Fix:** Separate manual bans from automatic bans:

```go
type BanEntry struct {
    IP        string
    Reason    string
    Source    string // "automatic", "manual", "api"
    BannedAt  time.Time
    ExpiresAt time.Time
    HitCount  int // 0 for manual bans
}

func (g *SIPGuardian) BanIP(ip, reason string) {
    // Don't create a fake failure tracker
    entry := &BanEntry{
        IP:        ip,
        Source:    "manual",
        Reason:    reason,
        BannedAt:  time.Now(),
        ExpiresAt: time.Now().Add(time.Duration(g.BanTime)),
        HitCount:  0, // No failures recorded
    }
    g.bannedIPs[ip] = entry
}
```

---

## 7. API Design

### 🟔 MEDIUM: Admin Handler URL Routing is Fragile

**File:** `admin.go:56-70`

```go
func (h *AdminHandler) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {
    path := r.URL.Path
    switch {
    case strings.HasSuffix(path, "/bans"): // Matches "/foo/bans" too
        return h.handleBans(w, r)
    case strings.Contains(path, "/unban/"): // Matches "/foo/unban/bar/baz"
        return h.handleUnban(w, r, path)
    }
}
```

**Problem:**

- `strings.HasSuffix` matches unintended paths
- Path extraction is manual and error-prone

**Fix:** Use a proper HTTP router, or at least exact matching:

```go
// Option 1: Use http.ServeMux patterns (Go 1.22+)
mux := http.NewServeMux()
mux.HandleFunc("GET /api/sip-guardian/bans", h.handleBans)
mux.HandleFunc("POST /api/sip-guardian/ban/{ip}", h.handleBan)
mux.HandleFunc("DELETE /api/sip-guardian/unban/{ip}", h.handleUnban)

// Option 2: Exact prefix matching
switch {
case r.URL.Path == "/api/sip-guardian/bans":
    return h.handleBans(w, r)
case strings.HasPrefix(r.URL.Path, "/api/sip-guardian/ban/"):
    ip := strings.TrimPrefix(r.URL.Path, "/api/sip-guardian/ban/")
    ip = strings.TrimSuffix(ip, "/")
    return h.handleBan(w, r, ip)
}
```
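If Option 1 is taken, the mux still has to live behind the module's caddyhttp middleware signature. A sketch of the wiring; the `h.mux` field and the `banByPath` adapter are assumptions, and the existing handlers would need plain `http.HandlerFunc` signatures:

```go
func (h *AdminHandler) Provision(ctx caddy.Context) error {
    mux := http.NewServeMux()
    mux.HandleFunc("GET /api/sip-guardian/bans", h.listBans)       // adapter around handleBans
    mux.HandleFunc("POST /api/sip-guardian/ban/{ip}", h.banByPath) // adapter around handleBan
    h.mux = mux
    return nil
}

func (h *AdminHandler) banByPath(w http.ResponseWriter, r *http.Request) {
    ip := r.PathValue("ip") // Go 1.22+: wildcard captured by the route pattern
    // Reason parsing from the request body omitted for brevity.
    h.guardian.BanIP(ip, "manual ban via admin API")
    w.WriteHeader(http.StatusNoContent)
}

func (h *AdminHandler) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {
    if strings.HasPrefix(r.URL.Path, "/api/sip-guardian/") {
        h.mux.ServeHTTP(w, r)
        return nil
    }
    return next.ServeHTTP(w, r)
}
```

The IP validation from the next section still belongs inside `banByPath` before calling `BanIP`.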
---

### šŸ”µ LOW: No IP Validation in Admin Endpoints

**File:** `admin.go:196, 210`

```go
ip := strings.TrimSuffix(parts[1], "/")
// Use public BanIP method for proper encapsulation
h.guardian.BanIP(ip, body.Reason) // What if ip = "not-an-ip"?
```

**Fix:**

```go
ip := strings.TrimSuffix(parts[1], "/")
if net.ParseIP(ip) == nil {
    http.Error(w, "Invalid IP address", http.StatusBadRequest)
    return nil
}
h.guardian.BanIP(ip, body.Reason)
```

---

## 8. Testing Observations

### 🟔 MEDIUM: Tests Use Global State

**Observation:** Since metrics and the registry are global, tests might interfere with each other.

**Recommendation:** Add test helpers to reset state:

```go
// In sipguardian_test.go
func resetGlobalState() {
    registryMu.Lock()
    guardianRegistry = make(map[string]*SIPGuardian)
    registryMu.Unlock()
    metricsRegistered = false
    // ... reset other globals ...
}

func TestSomething(t *testing.T) {
    t.Cleanup(resetGlobalState)
    // ... test code ...
}
```

---

## Summary of Recommendations

### Must Fix (šŸ”“ Critical)

1. **Eliminate the global registry** - use Caddy's app system or context storage
2. **Remove global feature flags** - make them config options
3. **Fix the Prometheus integration** - use a custom registry or Caddy's metrics

### Should Fix (🟠 High)

4. **Make configs immutable** after provisioning
5. **Fix the bubble sort** - use sort.Slice()
6. **Bound storage goroutines** - use a worker pool

### Nice to Have (🟔 Medium)

7. Add a context-aware timeout to Cleanup
8. Validate IPs in the admin API
9. Fix string allocations on the hot path
10. Separate manual vs automatic bans

### Polish (šŸ”µ Low)

11. Remove the custom min() function
12. Improve URL routing in the admin handler
13. Add inline validation in UnmarshalCaddyfile

---

## Positive Observations šŸ‘

1. **Excellent Cleanup() implementation** - proper WaitGroup usage, timeout handling
2. **Good Validate() implementation** - validates config before use
3. **Interface guards** - ensure compile-time type checking
4. **Comprehensive feature set** - GeoIP, DNS whitelist, enumeration detection, etc.
5. **Good logging** - structured logging with zap
6. **Test coverage** - multiple test files covering features

---

## Final Verdict

This is **solid work** with a good understanding of Caddy modules. The core functionality is sound, but the **global state issues are critical** and must be addressed before production use.

Once those are fixed, this would be a great module for the Caddy community. The main pattern to change is: **think per-instance, not global**. Every piece of state should live on the struct, not in package-level vars.

---

**Estimated effort to fix critical issues:** 4-6 hours
**Estimated effort for all recommendations:** 8-12 hours

Would you like me to prioritize any specific issues or help implement fixes?