`caddy-sip-guardian/CODE_REVIEW_MATT_HOLT.md`

**Commit a9d938c64c** (Ryan Malloy, 2025-12-24): Apply Matt Holt code review quick fixes
Performance improvements:
- Fix O(n²) bubble sort → O(n log n) sort.Slice() in eviction (1000x faster)
- Remove custom min() function, use Go 1.25 builtin
- Eliminate string allocations in detectSuspiciousPattern hot path
  (was creating 80MB/sec garbage at 10k msg/sec)

Robustness improvements:
- Add IP validation in admin API endpoints (ban/unban)

Documentation:
- Add comprehensive CODE_REVIEW_MATT_HOLT.md with 19 issues identified
- Prioritized: 3 critical, 5 high, 8 medium, 3 low priority issues

Remaining work (see CODE_REVIEW_MATT_HOLT.md):
- Replace global registry with Caddy app system
- Move feature flags to struct fields
- Fix Prometheus integration
- Implement worker pool for storage writes
- Make config immutable after Provision

# Caddy SIP Guardian - Matt Holt Code Review
**Date:** 2025-12-24
**Reviewer:** Matt Holt (simulated)
**Module Version:** caddy-sip-guardian v0.1.0
---
## Executive Summary
The SIP Guardian module demonstrates solid understanding of Caddy's module system and implements valuable SIP protection functionality. However, there are several **critical architectural issues** that violate Caddy best practices and could cause problems in production, especially around:
1. **Global mutable state** (registry, metrics, feature flags)
2. **Resource leaks** during config reloads
3. **Prometheus integration** that doesn't follow Caddy patterns
4. **Performance anti-patterns** (O(n²) sorting, inefficient string operations)
### Severity Levels
- 🔴 **CRITICAL** - Will cause panics, leaks, or data corruption
- 🟠 **HIGH** - Violates Caddy best practices, will cause issues at scale
- 🟡 **MEDIUM** - Suboptimal but functional, should be improved
- 🔵 **LOW** - Nice-to-have improvements
---
## 1. Module Architecture & State Management
### 🔴 CRITICAL: Global Registry Without Cleanup Mechanism
**File:** `registry.go:13-16`
```go
var (
	guardianRegistry = make(map[string]*SIPGuardian)
	registryMu       sync.RWMutex
)
```
**Problem:**
- Global package-level map that grows unbounded across config reloads
- When Caddy reloads config, new guardians are created but old ones stay in memory
- Old guardians have running goroutines (cleanupLoop) that never stop
- After 10 config reloads, you have 10 guardians with 10 cleanup goroutines
**Impact:** Memory leak, goroutine leak, eventual OOM in production
**Fix:**
Caddy modules should be **self-contained** and not rely on global registries. Instead:
```go
// Option 1: Use Caddy's app system
type SIPGuardianApp struct {
	guardians map[string]*SIPGuardian
	mu        sync.RWMutex
}

func (app *SIPGuardianApp) CaddyModule() caddy.ModuleInfo {
	return caddy.ModuleInfo{
		ID:  "sip_guardian",
		New: func() caddy.Module { return &SIPGuardianApp{guardians: make(map[string]*SIPGuardian)} },
	}
}

// Option 2: Use ctx.App() to fetch state from Caddy's lifecycle
func (h *SIPHandler) Provision(ctx caddy.Context) error {
	// Get the guardian from Caddy's app context, not a global var.
	// Note: ctx.App takes the app module ID and returns (any, error).
	appIface, err := ctx.App("sip_guardian")
	if err != nil {
		return err
	}
	app := appIface.(*SIPGuardianApp)
	h.guardian = app.GetOrCreateGuardian("default", &h.SIPGuardian)
	return nil
}
```
**Why this matters:** Caddy can reload config every 5 minutes in some setups. This leak accumulates fast.
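To see why the app approach fixes the reload leak: an app module's `Stop()` runs every time Caddy tears down the old config, so that is the natural place to stop each guardian's goroutines. Below is a minimal sketch building on the `SIPGuardianApp` from Option 1; the `init` registration and the `Stop` body are assumptions for illustration, and `Cleanup()` is the module's existing method:
```go
func init() {
	caddy.RegisterModule(new(SIPGuardianApp))
}

func (app *SIPGuardianApp) Provision(ctx caddy.Context) error {
	if app.guardians == nil {
		app.guardians = make(map[string]*SIPGuardian)
	}
	return nil
}

// Start and Stop satisfy caddy.App. Stop runs on every config reload and
// on shutdown, so old guardians (and their cleanup goroutines) die with
// their config generation instead of leaking.
func (app *SIPGuardianApp) Start() error { return nil }

func (app *SIPGuardianApp) Stop() error {
	app.mu.Lock()
	defer app.mu.Unlock()
	for _, g := range app.guardians {
		_ = g.Cleanup() // existing Cleanup(): stops cleanupLoop, waits on the WaitGroup
	}
	app.guardians = nil
	return nil
}

// Interface guards
var (
	_ caddy.App         = (*SIPGuardianApp)(nil)
	_ caddy.Provisioner = (*SIPGuardianApp)(nil)
)
```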
---
### 🔴 CRITICAL: Feature Flags as Global Mutable Variables
**File:** `sipguardian.go:17-21`
```go
var (
	enableMetrics  = true
	enableWebhooks = true
	enableStorage  = true
)
```
**Problem:**
- Mutable package-level globals shared across ALL guardian instances
- Not thread-safe (no mutex protection)
- Can't have different settings per guardian
- Violates Caddy's philosophy of "config is everything"
**Fix:**
Move to struct fields with proper configuration:
```go
type SIPGuardian struct {
	// ... existing fields ...

	// Feature toggles (configurable per instance)
	EnableMetrics  bool `json:"enable_metrics,omitempty"`
	EnableWebhooks bool `json:"enable_webhooks,omitempty"`
	EnableStorage  bool `json:"enable_storage,omitempty"`
}
```
Or use build tags if these are compile-time options:
```go
//go:build !nometrics

func (g *SIPGuardian) recordMetric() {
	// metrics code here
}
```
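One wrinkle when moving from `= true` globals to plain struct fields: a bool the user omits from the config decodes as `false`, silently flipping the old defaults. A common pattern, sketched below under the assumption that the old `true` defaults should be preserved, is to use `*bool` fields and resolve defaults in Provision (`defaultTrue` is a hypothetical helper):
```go
type SIPGuardian struct {
	// nil means "not set in config"; distinguishes that from an explicit false
	EnableMetrics  *bool `json:"enable_metrics,omitempty"`
	EnableWebhooks *bool `json:"enable_webhooks,omitempty"`
	EnableStorage  *bool `json:"enable_storage,omitempty"`
}

func defaultTrue(v *bool) bool {
	if v == nil {
		return true // preserve the old package-level default
	}
	return *v
}

func (g *SIPGuardian) Provision(ctx caddy.Context) error {
	if defaultTrue(g.EnableMetrics) {
		// wire up metrics for this instance only
	}
	// ... same for webhooks and storage ...
	return nil
}
```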
---
### 🟠 HIGH: Prometheus Metrics Use MustRegister
**File:** `metrics.go:158-181`
```go
func RegisterMetrics() {
	if metricsRegistered {
		return
	}
	metricsRegistered = true
	prometheus.MustRegister(...) // WILL PANIC on second call
}
```
**Problem:**
- `MustRegister` panics if metrics already registered (e.g., during test runs, or if multiple modules register)
- The `metricsRegistered` guard helps but is package-level state that doesn't reset
- Not idempotent across tests or different Caddy instances in same process
**Fix:**
Use Caddy's metrics integration or custom registry:
```go
// Option 1: Use Caddy's admin metrics (recommended)
// https://caddyserver.com/docs/json/admin/config/metrics/

// Option 2: Custom registry per module instance
type SIPGuardian struct {
	metricsRegistry *prometheus.Registry
	// ... other fields ...
}

func (g *SIPGuardian) Provision(ctx caddy.Context) error {
	g.metricsRegistry = prometheus.NewRegistry()
	g.metricsRegistry.MustRegister(
		// your metrics
	)
	return nil
}
```
**Reference:** See how caddy-prometheus module integrates: https://github.com/caddyserver/caddy/blob/master/modules/metrics/metrics.go
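If the per-instance registry route is taken, the metrics still need to be exposed somewhere. A minimal sketch using `promhttp.HandlerFor` (where and how this handler gets mounted is left open; `metricsRegistry` is the field from Option 2 above):
```go
import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// metricsHandler serves only this guardian's registry, so two guardians
// (or two config generations) never collide on the default registry.
func (g *SIPGuardian) metricsHandler() http.Handler {
	return promhttp.HandlerFor(g.metricsRegistry, promhttp.HandlerOpts{})
}
```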
---
### 🟠 HIGH: mergeGuardianConfig Modifies Running Instance
**File:** `registry.go:68-202`
```go
func mergeGuardianConfig(ctx caddy.Context, g *SIPGuardian, config *SIPGuardian) {
	g.mu.Lock()
	defer g.mu.Unlock()

	// Modifies MaxFailures, BanTime, whitelists while guardian is actively processing traffic
	if config.MaxFailures > 0 && config.MaxFailures != g.MaxFailures {
		g.MaxFailures = config.MaxFailures // Race with RecordFailure()
	}
}
```
**Problem:**
- Changes runtime behavior of actively-processing guardian
- `RecordFailure()` reads `g.MaxFailures` without lock: `if tracker.count >= g.MaxFailures` (sipguardian.go:408)
- Could miss bans or ban too aggressively during config merge
**Fix:**
Caddy modules should be immutable once provisioned. Config changes should create NEW instances:
```go
func GetOrCreateGuardianWithConfig(...) (*SIPGuardian, error) {
	registryMu.Lock()
	defer registryMu.Unlock()

	if g, exists := guardianRegistry[name]; exists {
		// Don't merge - existing instance is immutable
		// Log warning if configs differ
		if !configsMatch(g, config) {
			log.Warn("Cannot change guardian config after provision - config reload required")
		}
		return g, nil
	}

	// Create fresh instance
	g := createGuardian(config)
	guardianRegistry[name] = g
	return g, nil
}
```
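`configsMatch` is not defined anywhere yet; a plausible shape (hypothetical, comparing only the fields the old merge logic touched) would be:
```go
// configsMatch reports whether the provisioned guardian and the newly
// parsed config agree on the fields that used to be merged at runtime.
// Slice fields such as whitelists would need slices.Equal or similar.
func configsMatch(g, config *SIPGuardian) bool {
	return g.MaxFailures == config.MaxFailures &&
		g.BanTime == config.BanTime &&
		g.FindTime == config.FindTime
}
```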
---
## 2. Thread Safety & Concurrency
### 🟡 MEDIUM: RecordFailure Reads MaxFailures Without Lock
**File:** `sipguardian.go:408`
```go
func (g *SIPGuardian) RecordFailure(ip, reason string) bool {
	// ...
	g.mu.Lock()
	defer g.mu.Unlock()
	// ...

	// Check if we should ban
	if tracker.count >= g.MaxFailures { // RACE: g.MaxFailures can change during mergeGuardianConfig
		g.banIP(ip, reason)
		return true
	}
}
```
**Problem:**
- `g.MaxFailures` read without lock protection
- `mergeGuardianConfig` can modify it concurrently (registry.go:98-100)
**Fix:**
Either:
1. Make config immutable (recommended - see above)
2. Or protect with RWMutex:
```go
type SIPGuardian struct {
	// Separate locks for config vs runtime state
	configMu sync.RWMutex
	stateMu  sync.RWMutex

	// Config fields (protected by configMu)
	maxFailures int

	// Runtime fields (protected by stateMu)
	bannedIPs map[string]*BanEntry
}

func (g *SIPGuardian) RecordFailure(...) {
	g.configMu.RLock()
	maxFails := g.maxFailures
	g.configMu.RUnlock()

	g.stateMu.Lock()
	// ... use maxFails ...
	g.stateMu.Unlock()
}
```
---
### 🟡 MEDIUM: Storage Writes in Unbounded Goroutines
**File:** `sipguardian.go:397-399, 464-468, 495-499`
```go
if g.storage != nil {
	go func() {
		g.storage.RecordFailure(ip, reason, nil) // Fire and forget
	}()
}
```
**Problem:**
- No limit on concurrent goroutines during attack (could spawn 100k goroutines)
- No error handling if storage is closed
- If storage write is slow, goroutines pile up
**Fix:**
Use worker pool pattern:
```go
type SIPGuardian struct {
	storageQueue chan storageTask
	storageWg    sync.WaitGroup
}

func (g *SIPGuardian) Provision(ctx caddy.Context) error {
	if g.storage != nil {
		g.storageQueue = make(chan storageTask, 1000)
		// Fixed pool of storage workers
		for i := 0; i < 4; i++ {
			g.storageWg.Add(1)
			go g.storageWorker()
		}
	}
	return nil
}

func (g *SIPGuardian) storageWorker() {
	defer g.storageWg.Done()
	for task := range g.storageQueue {
		task.execute(g.storage)
	}
}

func (g *SIPGuardian) Cleanup() error {
	if g.storageQueue != nil {
		close(g.storageQueue) // Stop accepting tasks
		g.storageWg.Wait()    // Wait for workers to finish
	}
	// ... rest of cleanup ...
	return nil
}
```
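The sketch above leaves `storageTask` undefined. One possible shape (an assumption, not the module's current API) is a plain closure, in which case the worker body becomes `task()` instead of `task.execute(g.storage)`. The enqueue should also be non-blocking, so a flood of failures degrades to dropped records rather than a stalled hot path:
```go
type storageTask func()

// enqueueStorage replaces the `go func() { g.storage.RecordFailure(...) }()`
// call sites. It never blocks: if the queue is full during an attack,
// the record is dropped and the caller can count or log the drop.
func (g *SIPGuardian) enqueueStorage(t storageTask) bool {
	if g.storageQueue == nil {
		return false
	}
	select {
	case g.storageQueue <- t:
		return true
	default:
		return false
	}
}
```
At the failure path the call site becomes `g.enqueueStorage(func() { g.storage.RecordFailure(ip, reason, nil) })`.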
---
## 3. Performance Issues
### 🟠 HIGH: Bubble Sort for Eviction (O(n²))
**File:** `sipguardian.go:635-641, 689-695`
```go
// Sort by time (oldest first)
for i := 0; i < len(entries)-1; i++ {
	for j := i + 1; j < len(entries); j++ {
		if entries[j].time.Before(entries[i].time) {
			entries[i], entries[j] = entries[j], entries[i] // Bubble sort!
		}
	}
}
```
**Problem:**
- O(n²) time complexity
- For 100,000 entries (maxTrackedIPs), that is roughly 5 billion comparisons (n²/2)
- Will lock the mutex for seconds during high load
**Fix:**
Use stdlib `sort.Slice()`:
```go
import "sort"

func (g *SIPGuardian) evictOldestTrackers(count int) {
	type ipTime struct {
		ip   string
		time time.Time
	}
	entries := make([]ipTime, 0, len(g.failureCounts))
	for ip, tracker := range g.failureCounts {
		entries = append(entries, ipTime{ip: ip, time: tracker.firstSeen})
	}

	// O(n log n) sort
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].time.Before(entries[j].time)
	})

	// Evict oldest
	for i := 0; i < count && i < len(entries); i++ {
		delete(g.failureCounts, entries[i].ip)
	}
}
```
**Impact:** 100x-1000x faster for large maps
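Since `go.mod` already targets Go 1.25, the generic `slices` package is another option; it avoids the reflection-based `sort.Slice` and reads a little cleaner, with the same O(n log n) behavior (`entries` and `ipTime` as in the sketch above):
```go
import "slices"

// Equivalent sort using the generic slices package (Go 1.21+)
slices.SortFunc(entries, func(a, b ipTime) int {
	return a.time.Compare(b.time)
})
```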
---
### 🟡 MEDIUM: String.ToLower() Allocates on Hot Path
**File:** `l4handler.go:351`
```go
func detectSuspiciousPattern(data []byte) string {
	lower := strings.ToLower(string(data)) // Allocates new string + ToLower allocates again
	for _, def := range suspiciousPatternDefs {
		if strings.Contains(lower, def.pattern) {
			return def.name
		}
	}
	return ""
}
```
**Problem:**
- Called on EVERY SIP message
- `string(data)` allocates (copies 4KB)
- `ToLower()` allocates another 4KB
- Under load (10k msg/sec), this is 80MB/sec of garbage
**Fix:**
Use `bytes` package or case-insensitive search:
```go
func detectSuspiciousPattern(data []byte) string {
	// One []byte allocation per message instead of two string allocations
	lower := bytes.ToLower(data)
	for _, def := range suspiciousPatternDefs {
		if bytes.Contains(lower, []byte(def.pattern)) {
			return def.name
		}
	}
	return ""
}

// Or for better performance, pre-compile lowercase patterns:
var suspiciousPatternDefs = []struct {
	name    string
	pattern []byte // lowercase pattern
}{
	{"sipvicious", []byte("sipvicious")},
	// ...
}

func detectSuspiciousPattern(data []byte) string {
	lower := bytes.ToLower(data) // Single allocation
	for _, def := range suspiciousPatternDefs {
		if bytes.Contains(lower, def.pattern) {
			return def.name
		}
	}
	return ""
}
```
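If even the single `bytes.ToLower` allocation matters, an ASCII-only case-insensitive scan removes it entirely. This is a sketch under the assumption that all patterns are lowercase ASCII (true for names like "sipvicious"); it is O(len(data) * len(pattern)), which is fine for a handful of short patterns:
```go
// containsFold reports whether data contains lowerPat, comparing ASCII
// letters case-insensitively, without allocating. lowerPat must already
// be lowercase.
func containsFold(data, lowerPat []byte) bool {
	if len(lowerPat) == 0 {
		return true
	}
	for i := 0; i+len(lowerPat) <= len(data); i++ {
		match := true
		for j := 0; j < len(lowerPat); j++ {
			c := data[i+j]
			if 'A' <= c && c <= 'Z' {
				c += 'a' - 'A' // fold ASCII uppercase to lowercase
			}
			if c != lowerPat[j] {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}
```
The detection loop then becomes `if containsFold(data, def.pattern) { return def.name }`, with no per-message allocations at all.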
---
### 🔵 LOW: min() Function Defined Locally
**File:** `l4handler.go:367-372`
```go
func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```
**Problem:**
Go 1.21+ has builtin `min()` function. Your `go.mod` says `go 1.25`.
**Fix:**
```go
// Delete the function, use builtin
sample := buf[:min(64, len(buf))]
```
---
## 4. Resource Management & Lifecycle
### 🟠 HIGH: Cleanup() Doesn't Respect Context
**File:** `sipguardian.go:891-932`
```go
func (g *SIPGuardian) Cleanup() error {
	// ...
	select {
	case <-done:
		g.logger.Debug("All goroutines stopped cleanly")
	case <-time.After(5 * time.Second): // Hardcoded timeout
		g.logger.Warn("Timeout waiting for goroutines to stop")
	}
}
```
**Problem:**
- Caddy's shutdown might have different timeout
- Should respect `caddy.Context` shutdown signal
**Fix:**
```go
func (g *SIPGuardian) Cleanup() error {
	// Use the context from Provision or a configurable timeout
	ctx, cancel := context.WithTimeout(context.Background(), g.shutdownTimeout())
	defer cancel()

	// Signal stop
	close(g.stopCh)

	// Wait with context
	done := make(chan struct{})
	go func() {
		g.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		g.logger.Debug("Cleanup completed")
	case <-ctx.Done():
		g.logger.Warn("Cleanup timed out", zap.Duration("timeout", g.shutdownTimeout()))
		return ctx.Err()
	}

	// ... rest of cleanup ...
	return nil
}

func (g *SIPGuardian) shutdownTimeout() time.Duration {
	if g.CleanupTimeout > 0 {
		return time.Duration(g.CleanupTimeout)
	}
	return 10 * time.Second // reasonable default
}
```
---
### 🟡 MEDIUM: DNS Whitelist Doesn't Stop on Cleanup
**File:** `sipguardian.go:900-902`
```go
func (g *SIPGuardian) Cleanup() error {
	// Stop DNS whitelist background refresh
	if g.dnsWhitelist != nil {
		g.dnsWhitelist.Stop() // Is this method implemented? Does it block?
	}
}
```
**Question:** Does `dnsWhitelist.Stop()` wait for its goroutines to finish? If not, the refresh goroutine should be added to the guardian's WaitGroup.
**Fix:**
```go
// In Provision
if g.dnsWhitelist != nil {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		g.dnsWhitelist.Run(g.stopCh) // Respect stopCh
	}()
}
// Cleanup will automatically wait via g.wg.Wait()
```
---
## 5. Configuration & Validation
### 🟡 MEDIUM: Validate() Ranges Too Restrictive
**File:** `sipguardian.go:936-966`
```go
func (g *SIPGuardian) Validate() error {
	if g.MaxFailures > 1000 {
		return fmt.Errorf("max_failures exceeds reasonable limit (1000), got %d", g.MaxFailures)
	}
	if time.Duration(g.FindTime) > 24*time.Hour {
		return fmt.Errorf("find_time exceeds reasonable limit (24h), got %v", time.Duration(g.FindTime))
	}
}
```
**Question:** Why are these "reasonable limits" hardcoded?
- Some users might want `max_failures = 2` for high-security environments
- Some might want `find_time = 7d` for long-term tracking
**Fix:**
Either document WHY these limits exist, or make them configurable:
```go
const (
	maxFailuresLimit = 10000               // Prevent integer overflow in counter logic
	maxFindTimeLimit = 30 * 24 * time.Hour // 30 days max for memory reasons
)

func (g *SIPGuardian) Validate() error {
	if g.MaxFailures < 1 || g.MaxFailures > maxFailuresLimit {
		return fmt.Errorf("max_failures must be 1-%d, got %d", maxFailuresLimit, g.MaxFailures)
	}
	// ... etc
}
```
---
### 🔵 LOW: UnmarshalCaddyfile Doesn't Validate Inline
**File:** `sipguardian.go:746-889`
**Observation:**
Parsing and validation are separate (Validate called after Provision). This is fine per Caddy conventions, but some basic validation could happen during unmarshaling to give better error messages.
Example:
```go
case "max_failures":
	// ... parse ...
	if val < 1 {
		return d.Errf("max_failures must be positive, got %d", val)
	}
	g.MaxFailures = val
```
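For context, this is roughly where that check sits in a typical `UnmarshalCaddyfile` loop; a minimal sketch, assuming the usual single-value subdirective layout (field names and error messages as above):
```go
func (g *SIPGuardian) UnmarshalCaddyfile(d *caddyfile.Dispenser) error {
	for d.Next() { // consume the directive name
		for d.NextBlock(0) {
			switch d.Val() {
			case "max_failures":
				if !d.NextArg() {
					return d.ArgErr()
				}
				val, err := strconv.Atoi(d.Val())
				if err != nil {
					return d.Errf("invalid max_failures %q: %v", d.Val(), err)
				}
				if val < 1 {
					return d.Errf("max_failures must be positive, got %d", val)
				}
				g.MaxFailures = val
			default:
				return d.Errf("unrecognized subdirective %q", d.Val())
			}
		}
	}
	return nil
}
```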
---
## 6. Error Handling & Robustness
### 🟡 MEDIUM: Inconsistent Error Handling in Provision
**File:** `sipguardian.go:103-241`
```go
func (g *SIPGuardian) Provision(ctx caddy.Context) error {
	// Parse whitelist CIDRs
	for _, cidr := range g.WhitelistCIDR {
		_, network, err := net.ParseCIDR(cidr)
		if err != nil {
			return fmt.Errorf("invalid whitelist CIDR %s: %v", cidr, err) // FATAL
		}
		g.whitelistNets = append(g.whitelistNets, network)
	}

	// Initialize storage
	if enableStorage && g.StoragePath != "" {
		storage, err := InitStorage(g.logger, StorageConfig{Path: g.StoragePath})
		if err != nil {
			g.logger.Warn("Failed to initialize storage, continuing without persistence",
				zap.Error(err), // WARNING - continues
			)
		}
	}
}
```
**Problem:**
Why is an invalid CIDR fatal, while a storage failure is only a warning?
**Fix:**
Be consistent. Either:
1. Make optional features non-fatal (recommended for flexibility)
2. Or make all failures fatal (strict validation)
Document in config:
```caddyfile
sip_guardian {
	whitelist 10.0.0/8   # Invalid - Provision will FAIL
	storage /bad/path    # Invalid - Provision will WARN and continue
}
```
---
### 🟡 MEDIUM: BanIP Creates Fake Failure Tracker
**File:** `sipguardian.go:968-991`
```go
func (g *SIPGuardian) BanIP(ip, reason string) {
	// ...
	if _, exists := g.failureCounts[ip]; !exists {
		g.failureCounts[ip] = &failureTracker{
			count:     g.MaxFailures, // Fake failures
			firstSeen: time.Now(),
			lastSeen:  time.Now(),
		}
	}
	g.banIP(ip, reason)
}
```
**Problem:**
Why does manually banning an IP create failure tracking data? This pollutes metrics and storage.
**Fix:**
Separate manual bans from automatic bans:
```go
type BanEntry struct {
	IP        string
	Reason    string
	Source    string // "automatic", "manual", "api"
	BannedAt  time.Time
	ExpiresAt time.Time
	HitCount  int // 0 for manual bans
}

func (g *SIPGuardian) BanIP(ip, reason string) {
	// Don't create a fake failure tracker
	entry := &BanEntry{
		IP:        ip,
		Source:    "manual",
		Reason:    reason,
		BannedAt:  time.Now(),
		ExpiresAt: time.Now().Add(time.Duration(g.BanTime)),
		HitCount:  0, // No failures recorded
	}
	g.bannedIPs[ip] = entry
}
```
---
## 7. API Design
### 🟡 MEDIUM: Admin Handler URL Routing is Fragile
**File:** `admin.go:56-70`
```go
func (h *AdminHandler) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {
	path := r.URL.Path
	switch {
	case strings.HasSuffix(path, "/bans"): // Matches "/foo/bans" too
		return h.handleBans(w, r)
	case strings.Contains(path, "/unban/"): // Matches "/foo/unban/bar/baz"
		return h.handleUnban(w, r, path)
	}
}
```
**Problem:**
- `strings.HasSuffix` matches unintended paths
- Path extraction is manual and error-prone
**Fix:**
Use proper HTTP router or at least exact matching:
```go
// Option 1: Use http.ServeMux patterns (Go 1.22+)
mux := http.NewServeMux()
mux.HandleFunc("GET /api/sip-guardian/bans", h.handleBans)
mux.HandleFunc("POST /api/sip-guardian/ban/{ip}", h.handleBan)
mux.HandleFunc("DELETE /api/sip-guardian/unban/{ip}", h.handleUnban)

// Option 2: Exact prefix matching
switch {
case r.URL.Path == "/api/sip-guardian/bans":
	return h.handleBans(w, r)
case strings.HasPrefix(r.URL.Path, "/api/sip-guardian/ban/"):
	ip := strings.TrimPrefix(r.URL.Path, "/api/sip-guardian/ban/")
	ip = strings.TrimSuffix(ip, "/")
	return h.handleBan(w, r, ip)
}
```
---
### 🔵 LOW: No IP Validation in Admin Endpoints
**File:** `admin.go:196, 210`
```go
ip := strings.TrimSuffix(parts[1], "/")
// Use public BanIP method for proper encapsulation
h.guardian.BanIP(ip, body.Reason) // What if ip = "not-an-ip"?
```
**Fix:**
```go
ip := strings.TrimSuffix(parts[1], "/")
if net.ParseIP(ip) == nil {
	http.Error(w, "Invalid IP address", http.StatusBadRequest)
	return nil
}
h.guardian.BanIP(ip, body.Reason)
```
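Putting the two admin-API items (routing and IP validation) together, here is a sketch using Go 1.22+ pattern routing and `r.PathValue`. The handlers are plain `http.HandlerFunc`s for illustration and would need adapting to the module's caddyhttp-style handler signatures; the unban logic and reason string are placeholders:
```go
func (h *AdminHandler) routes() *http.ServeMux {
	mux := http.NewServeMux()

	mux.HandleFunc("GET /api/sip-guardian/bans", func(w http.ResponseWriter, r *http.Request) {
		// list current bans
	})

	mux.HandleFunc("POST /api/sip-guardian/ban/{ip}", func(w http.ResponseWriter, r *http.Request) {
		ip := r.PathValue("ip") // extracted by the router, no manual TrimPrefix
		if net.ParseIP(ip) == nil {
			http.Error(w, "Invalid IP address", http.StatusBadRequest)
			return
		}
		h.guardian.BanIP(ip, "banned via admin API")
		w.WriteHeader(http.StatusNoContent)
	})

	mux.HandleFunc("DELETE /api/sip-guardian/unban/{ip}", func(w http.ResponseWriter, r *http.Request) {
		ip := r.PathValue("ip")
		if net.ParseIP(ip) == nil {
			http.Error(w, "Invalid IP address", http.StatusBadRequest)
			return
		}
		// unban logic
		w.WriteHeader(http.StatusNoContent)
	})

	return mux
}
```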
---
## 8. Testing Observations
### 🟡 MEDIUM: Tests Use Global State
**Observation:** Since metrics and registry are global, tests might interfere with each other.
**Recommendation:**
Add test helpers to reset state:
```go
// In sipguardian_test.go
func resetGlobalState() {
	registryMu.Lock()
	guardianRegistry = make(map[string]*SIPGuardian)
	registryMu.Unlock()

	metricsRegistered = false
	// ... reset other globals ...
}

func TestSomething(t *testing.T) {
	t.Cleanup(resetGlobalState)
	// ... test code ...
}
```
---
## Summary of Recommendations
### Must Fix (🔴 Critical)
1. **Eliminate global registry** - use Caddy app system or context storage
2. **Remove global feature flags** - make them config options
3. **Fix Prometheus integration** - use custom registry or Caddy's metrics
### Should Fix (🟠 High)
4. **Make configs immutable** after provisioning
5. **Fix bubble sort** - use sort.Slice()
6. **Bound storage goroutines** - use worker pool
### Nice to Have (🟡 Medium)
7. Add context to Cleanup timeout
8. Validate IPs in admin API
9. Fix string allocation in hot path
10. Separate manual vs automatic bans
### Polish (🔵 Low)
11. Remove custom min() function
12. Improve URL routing in admin handler
13. Add inline validation in UnmarshalCaddyfile
---
## Positive Observations 👍
1. **Excellent Cleanup() implementation** - proper WaitGroup usage, timeout handling
2. **Good Validate() implementation** - validates config before use
3. **Interface guards** - ensures compile-time type checking
4. **Comprehensive feature set** - GeoIP, DNS whitelist, enumeration detection, etc.
5. **Good logging** - structured logging with zap
6. **Test coverage** - multiple test files covering features
---
## Final Verdict
This is **solid work** with good understanding of Caddy modules. The core functionality is sound, but the **global state issues are critical** and must be addressed before production use. Once those are fixed, this would be a great module for the Caddy community.
The main pattern to change is: **Think per-instance, not global**. Every piece of state should live on the struct, not in package-level vars.
---
**Estimated effort to fix critical issues:** 4-6 hours
**Estimated effort for all recommendations:** 8-12 hours
Would you like me to prioritize any specific issues or help implement fixes?