Hamilton review fixes: validator literal preservation, cache cluster id, CSS impact partial-failure reporting

Three findings from a margaret-hamilton-style review of the MCP server, fixed with regression tests written first (red → green). One bonus finding (huntpilotqueue column name) was surfaced by the third fix itself, which is exactly the audit-trust failure mode that fix exists to expose.

CRITICAL #1: sql_validator comment-strip mutated string literals.

The cleaned query returned by validate_select() is what travels to AXL. Previously, the comment-strip pass ran before the literal-aware pass, so `--` or `/* */` markers inside a string literal were silently eaten:

    input:  WHERE description = 'Smith -- old line'
    to AXL: WHERE description = 'Smith     (truncated mid-literal)

The LLM saw rows that looked plausible but were not what its query asked for. "Confidently wrong" is exactly the failure mode the review was hunting.

Fix: only strip comments on the analysis-only copy used for keyword detection. The cleaned output preserves the input verbatim (modulo trailing semicolon and outer whitespace).

6 new tests covering literal preservation across `--`, `/* */`, LIKE patterns with embedded comment markers, and forbidden keywords inside real comments.

CRITICAL #2: cache key omitted cluster identity.

The on-disk cache key was `method::args_json`. An operator swapping AXL_URL between test and prod (or between two clusters) would silently serve stale data from cluster A as if from cluster B. The audit report would be confidently wrong with no signal that anything had happened.

Fix: AxlCache now takes cluster_id and prefixes all keys with it. Server bootstrap derives cluster_id as a 12-char SHA-256 prefix of AXL_URL. cache_stats() surfaces both the current cluster_id and a `foreign_cluster_entries` count so an env-swap is visible. Schema migration handles pre-fix cache files via PRAGMA table_info introspection plus a one-shot ALTER TABLE ADD COLUMN.

6 new tests covering isolation, shared-id sharing, stats reporting, the no-cluster_id default, legacy DB upgrade, and per-cluster clear() scoping.

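As a rough illustration of the CRITICAL #2 fix, the sketch below derives a 12-character SHA-256 prefix from a URL and composes a `cluster_id::method::args_json` key. The function names `derive_cluster_id` and `make_key` and both example URLs are hypothetical; only the hashing step and the key shape follow the description above.

```python
import hashlib
import json


def derive_cluster_id(axl_url: str) -> str:
    """Return the first 12 hex characters of SHA-256(axl_url)."""
    return hashlib.sha256(axl_url.encode()).hexdigest()[:12]


def make_key(cluster_id: str, method: str, kwargs: dict) -> str:
    """Compose a key of the form cluster_id::method::args_json."""
    # sort_keys makes the key independent of dict insertion order.
    args_json = json.dumps(kwargs, sort_keys=True, default=str)
    return f"{cluster_id}::{method}::{args_json}"


# Hypothetical endpoints, used only to show that two URLs get distinct keys.
test_id = derive_cluster_id("https://cucm-test.example.com:8443/axl/")
prod_id = derive_cluster_id("https://cucm-prod.example.com:8443/axl/")

key_test = make_key(test_id, "listPhone", {"name": "SEP1"})
key_prod = make_key(prod_id, "listPhone", {"name": "SEP1"})
```

With distinct cluster ids, the same `listPhone` call yields two different keys, so an entry written under one endpoint cannot be read back under the other.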
MAJOR #3: find_devices_using_css summary undercounted partial failures.

The function is per-category resilient (one failed query doesn't kill the whole impact analysis), but the resilience never propagated up to the response. total_returned and any_truncated only reflected SUCCESSFUL categories. An LLM consuming "47 references" had no way to know 5 categories errored and the real number was likely much higher.

Fix: response now includes complete: bool, categories_with_errors: int, and error_categories: [list]. The LLM/auditor sees the partial-failure state and can decide whether to act on incomplete data.

5 new tests using a FakeAxlClient stand-in to simulate per-category failures.

BONUS finding (uncovered by Major #3 fix): huntpilotqueue join used the wrong column.

Three CSS impact categories (huntpilot_max_wait_css, huntpilot_no_agent_css, huntpilot_queue_full_css) were silently erroring with "Column (fknumplan) not found" because huntpilotqueue joins via fknumplan_pilot, not fknumplan. With the Major #3 fix in place, this surfaced immediately as `complete: False, error_categories: [3 huntpilot_*]` against the live cluster. Fixed inline; live re-run now reports `complete: True, total_returned: 163` for Internal-CSS.

87 unit tests passing (up from 70). Live cluster smoke test (cucm-pub.binghammemorial.org, CUCM 15.0.1.12900-234) verifies all three fixes plus the bonus finding work end-to-end.
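The partial-failure reporting described under MAJOR #3 can be sketched in isolation as follows. This is a minimal stand-alone version, not the project's function: `summarize` is a hypothetical name, the category labels `device_css` and `line_css` and their counts are made up for illustration, and the per-category dict shape (`returned_count` on success, `error` on failure) is assumed from the diff.

```python
def summarize(grouped: dict[str, dict]) -> dict:
    """Build a top-level summary that reports its own incompleteness."""
    # Errored categories have no returned_count, so they contribute 0.
    total_returned = sum(c.get("returned_count", 0) for c in grouped.values())
    # Any category carrying an "error" key is listed by name.
    error_categories = sorted(
        label for label, cat in grouped.items() if "error" in cat
    )
    return {
        "total_returned": total_returned,
        "complete": not error_categories,
        "categories_with_errors": len(error_categories),
        "error_categories": error_categories,
    }


# Illustrative input: two successful categories and one failed query.
grouped = {
    "device_css": {"returned_count": 40},
    "line_css": {"returned_count": 7},
    "huntpilot_max_wait_css": {"error": "Column (fknumplan) not found"},
}
summary = summarize(grouped)
```

Here `total_returned` is still 47, but `complete` is false and the failed category is named, so a consumer can tell the count is a lower bound rather than a final figure.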
2026-04-25 23:09:55 -06:00
commit dee5fdacda
parent 82d8fbe563
7 changed files with 407 additions and 38 deletions
--- a/src/mcp_cucm_axl/cache.py
+++ b/src/mcp_cucm_axl/cache.py
@@ -1,9 +1,14 @@
 """SQLite-backed TTL cache for AXL responses.

-Keyed on (method_name, sorted_kwargs_json). Cache survives server restarts,
-which makes exploratory audit sessions dramatically faster — the LLM can
-re-run the same `listPhone` queries across conversations without paying
+Keyed on (cluster_id, method_name, sorted_kwargs_json). Cache survives server
+restarts, which makes exploratory audit sessions dramatically faster — the LLM
+can re-run the same `listPhone` queries across conversations without paying
 the SOAP round-trip every time.
+
+Hamilton review CRITICAL #2: cache key now includes a `cluster_id` so that
+the same on-disk database can hold entries from multiple clusters without
+silently serving cluster A's data when bound to cluster B. Operators who
+swap `AXL_URL` between test and prod no longer see cross-cluster contamination.
 """

 from __future__ import annotations
@@ -15,30 +20,70 @@ from pathlib import Path
 from typing import Any


-SCHEMA = """
+# Split into TABLE_DDL (idempotent table creation) and INDEX_DDL (run AFTER
+# any column-adding migration, so indexes that reference newer columns don't
+# fail against legacy databases).
+TABLE_DDL = """
 CREATE TABLE IF NOT EXISTS axl_cache (
    cache_key   TEXT PRIMARY KEY,
+    cluster_id  TEXT NOT NULL DEFAULT '',
    method      TEXT NOT NULL,
    args_json   TEXT NOT NULL,
    result_json TEXT NOT NULL,
    created_at  REAL NOT NULL,
    expires_at  REAL NOT NULL
 );
+"""

+INDEX_DDL = """
 CREATE INDEX IF NOT EXISTS axl_cache_method_idx ON axl_cache(method);
 CREATE INDEX IF NOT EXISTS axl_cache_expires_idx ON axl_cache(expires_at);
+CREATE INDEX IF NOT EXISTS axl_cache_cluster_idx ON axl_cache(cluster_id);
 """


 class AxlCache:
    """SQLite TTL cache. Thread-safe via per-call connections."""

-    def __init__(self, db_path: Path, default_ttl: int):
+    def __init__(
+        self,
+        db_path: Path,
+        default_ttl: int,
+        cluster_id: str | None = None,
+    ):
        self.db_path = db_path
        self.default_ttl = default_ttl
+        # Empty string when unset — matches the column DEFAULT and keeps
+        # SQL filtering simple. Pre-fix databases will have '' for legacy
+        # entries, which is fine: a server now passing cluster_id="prod"
+        # won't see them, which is the correct cautious behavior.
+        self.cluster_id = cluster_id or ""
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        with self._conn() as c:
-            c.executescript(SCHEMA)
+            # 1) Make sure table exists (no-op if already present)
+            c.executescript(TABLE_DDL)
+            # 2) Bring legacy schemas forward (adds cluster_id if missing)
+            self._migrate(c)
+            # 3) NOW create indexes — safe because all columns exist
+            c.executescript(INDEX_DDL)
+
+    @staticmethod
+    def _migrate(c: sqlite3.Connection) -> None:
+        """Bring pre-existing databases up to the current schema.
+
+        `CREATE TABLE IF NOT EXISTS` is idempotent for table existence but
+        does not add columns to an already-existing table. Pre-fix caches
+        lack `cluster_id`; rather than failing the next INSERT with
+        `no such column`, we add it here. Defaults to '' which makes the
+        legacy entries belong to the "unknown cluster" — invisible to any
+        new client passing an actual cluster_id, which is the cautious
+        outcome.
+        """
+        cols = {row[1] for row in c.execute("PRAGMA table_info(axl_cache)").fetchall()}
+        if "cluster_id" not in cols:
+            c.execute(
+                "ALTER TABLE axl_cache ADD COLUMN cluster_id TEXT NOT NULL DEFAULT ''"
+            )

    def _conn(self) -> sqlite3.Connection:
        conn = sqlite3.connect(self.db_path, isolation_level=None)
@@ -46,10 +91,13 @@ class AxlCache:
        conn.execute("PRAGMA synchronous=NORMAL")
        return conn

-    @staticmethod
-    def _make_key(method: str, kwargs: dict) -> str:
-        # sort_keys gives us a deterministic key regardless of dict order
-        return f"{method}::{json.dumps(kwargs, sort_keys=True, default=str)}"
+    def _make_key(self, method: str, kwargs: dict) -> str:
+        # cluster_id prefix isolates entries by cluster identity. sort_keys
+        # gives us a deterministic key regardless of dict order.
+        return (
+            f"{self.cluster_id}::{method}::"
+            f"{json.dumps(kwargs, sort_keys=True, default=str)}"
+        )

    def get(self, method: str, kwargs: dict) -> Any | None:
        if self.default_ttl <= 0:
@@ -75,11 +123,13 @@ class AxlCache:
            c.execute(
                """
                INSERT OR REPLACE INTO axl_cache
-                  (cache_key, method, args_json, result_json, created_at, expires_at)
-                VALUES (?, ?, ?, ?, ?, ?)
+                  (cache_key, cluster_id, method, args_json, result_json,
+                   created_at, expires_at)
+                VALUES (?, ?, ?, ?, ?, ?, ?)
                """,
                (
                    key,
+                    self.cluster_id,
                    method,
                    json.dumps(kwargs, sort_keys=True, default=str),
                    json.dumps(result, default=str),
@@ -91,39 +141,66 @@ class AxlCache:
    def stats(self) -> dict:
        now = time.time()
        with self._conn() as c:
-            total = c.execute("SELECT COUNT(*) FROM axl_cache").fetchone()[0]
+            # Entries scoped to THIS cluster_id. The on-disk file may also
+            # contain entries from other clusters; those are intentionally
+            # invisible here.
+            total = c.execute(
+                "SELECT COUNT(*) FROM axl_cache WHERE cluster_id = ?",
+                (self.cluster_id,),
+            ).fetchone()[0]
            live = c.execute(
-                "SELECT COUNT(*) FROM axl_cache WHERE expires_at > ?", (now,)
+                "SELECT COUNT(*) FROM axl_cache "
+                "WHERE cluster_id = ? AND expires_at > ?",
+                (self.cluster_id, now),
            ).fetchone()[0]
            by_method = {
                row[0]: row[1]
                for row in c.execute(
                    "SELECT method, COUNT(*) FROM axl_cache "
-                    "WHERE expires_at > ? GROUP BY method ORDER BY 2 DESC",
-                    (now,),
+                    "WHERE cluster_id = ? AND expires_at > ? "
+                    "GROUP BY method ORDER BY 2 DESC",
+                    (self.cluster_id, now),
                ).fetchall()
            }
+            # Diagnostic: how many entries from OTHER clusters live in the
+            # same file. Useful for spotting an env-var swap that would
+            # otherwise be invisible.
+            foreign = c.execute(
+                "SELECT COUNT(*) FROM axl_cache WHERE cluster_id != ?",
+                (self.cluster_id,),
+            ).fetchone()[0]
        return {
            "db_path": str(self.db_path),
+            "cluster_id": self.cluster_id,
            "default_ttl_seconds": self.default_ttl,
            "total_entries": total,
            "live_entries": live,
            "expired_entries": total - live,
+            "foreign_cluster_entries": foreign,
            "by_method": by_method,
        }

    def clear(self, method_pattern: str | None = None) -> int:
+        # Only clears entries for THIS cluster — never touches a sibling
+        # cluster's cached data even if it lives in the same file.
        with self._conn() as c:
            if method_pattern:
                cursor = c.execute(
-                    "DELETE FROM axl_cache WHERE method LIKE ?",
-                    (method_pattern.replace("*", "%"),),
+                    "DELETE FROM axl_cache "
+                    "WHERE cluster_id = ? AND method LIKE ?",
+                    (self.cluster_id, method_pattern.replace("*", "%")),
                )
            else:
-                cursor = c.execute("DELETE FROM axl_cache")
+                cursor = c.execute(
+                    "DELETE FROM axl_cache WHERE cluster_id = ?",
+                    (self.cluster_id,),
+                )
            return cursor.rowcount

    def purge_expired(self) -> int:
+        # Purges expired entries across ALL clusters in this file.
+        # Expired entries are never useful regardless of which cluster
+        # they belong to, so per-cluster scoping isn't needed here.
        with self._conn() as c:
            cursor = c.execute("DELETE FROM axl_cache WHERE expires_at <= ?", (time.time(),))
            return cursor.rowcount
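The `_migrate` step in the cache.py diff above can be tried on its own against an in-memory database. The sketch below is an illustration under simplifying assumptions: the legacy table is cut down to two columns, and `column_names` is a hypothetical helper rather than part of the module.

```python
import sqlite3

# Simplified "legacy" table without cluster_id (the real schema has more columns).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE axl_cache (cache_key TEXT PRIMARY KEY, method TEXT NOT NULL);
    INSERT INTO axl_cache VALUES ('legacy-key', 'oldMethod');
    """
)


def column_names(c: sqlite3.Connection) -> set[str]:
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return {row[1] for row in c.execute("PRAGMA table_info(axl_cache)")}


before = column_names(conn)
if "cluster_id" not in before:
    # SQLite requires a non-NULL default when adding a NOT NULL column.
    conn.execute(
        "ALTER TABLE axl_cache ADD COLUMN cluster_id TEXT NOT NULL DEFAULT ''"
    )
after = column_names(conn)

# Existing rows pick up the default, i.e. the "unknown cluster" bucket.
legacy_cluster = conn.execute(
    "SELECT cluster_id FROM axl_cache WHERE cache_key = 'legacy-key'"
).fetchone()[0]
conn.close()
```

Running the check a second time is a no-op, because the column is then present in `PRAGMA table_info`; that is what makes the migration safe to call on every startup.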
--- a/src/mcp_cucm_axl/route_plan.py
+++ b/src/mcp_cucm_axl/route_plan.py
@@ -575,7 +575,7 @@ _CSS_REFERENCE_QUERIES: dict[str, dict] = {
        "sql": """
            SELECT np.dnorpattern AS name, rp.name AS context, np.description AS description
            FROM huntpilotqueue hpq
-            JOIN numplan np ON hpq.fknumplan = np.pkid
+            JOIN numplan np ON hpq.fknumplan_pilot = np.pkid
            LEFT OUTER JOIN routepartition rp ON np.fkroutepartition = rp.pkid
            WHERE hpq.fkcallingsearchspace_maxwaittime = '{pkid}'
        """,
@@ -585,7 +585,7 @@ _CSS_REFERENCE_QUERIES: dict[str, dict] = {
        "sql": """
            SELECT np.dnorpattern AS name, rp.name AS context, np.description AS description
            FROM huntpilotqueue hpq
-            JOIN numplan np ON hpq.fknumplan = np.pkid
+            JOIN numplan np ON hpq.fknumplan_pilot = np.pkid
            LEFT OUTER JOIN routepartition rp ON np.fkroutepartition = rp.pkid
            WHERE hpq.fkcallingsearchspace_noagent = '{pkid}'
        """,
@@ -595,7 +595,7 @@ _CSS_REFERENCE_QUERIES: dict[str, dict] = {
        "sql": """
            SELECT np.dnorpattern AS name, rp.name AS context, np.description AS description
            FROM huntpilotqueue hpq
-            JOIN numplan np ON hpq.fknumplan = np.pkid
+            JOIN numplan np ON hpq.fknumplan_pilot = np.pkid
            LEFT OUTER JOIN routepartition rp ON np.fkroutepartition = rp.pkid
            WHERE hpq.fkcallingsearchspace_pilotqueuefull = '{pkid}'
        """,
@@ -678,11 +678,22 @@ def find_devices_using_css(

    total_returned = sum(c.get("returned_count", 0) for c in grouped.values())
    any_truncated = any(c.get("truncated") for c in grouped.values())
+    # Hamilton review MAJOR #3: per-category errors must propagate to the
+    # top-level summary, otherwise an LLM consuming `total_returned: 47`
+    # has no way to know that 5 categories errored and the real count is
+    # higher. "Software that understands itself reports its own degradation."
+    error_categories = sorted(
+        label for label, cat in grouped.items() if "error" in cat
+    )
+    complete = len(error_categories) == 0
    return {
        "css_name": css_name,
        "css_pkid": css_pkid,
        "total_returned": total_returned,
        "any_truncated": any_truncated,
+        "complete": complete,
+        "categories_with_errors": len(error_categories),
+        "error_categories": error_categories,
        "max_per_category": max_per_category,
        "references_by_category": grouped,
    }
--- a/src/mcp_cucm_axl/server.py
+++ b/src/mcp_cucm_axl/server.py
@@ -558,9 +558,20 @@ def main() -> None:
    )
    cache_dir.mkdir(parents=True, exist_ok=True)
    ttl = int(os.environ.get("AXL_CACHE_TTL", "3600"))
-    _cache = AxlCache(cache_dir / "axl_responses.sqlite", default_ttl=ttl)
+    # Cluster-id derived from AXL_URL. Hash keeps the key compact and
+    # avoids leaking the URL into log output where the cache key gets
+    # printed. Fixed placeholder fallback when AXL_URL is unset (test mode).
+    import hashlib
+    axl_url_for_id = os.environ.get("AXL_URL", "no-axl-url-configured")
+    cluster_id = hashlib.sha256(axl_url_for_id.encode()).hexdigest()[:12]
+    _cache = AxlCache(
+        cache_dir / "axl_responses.sqlite",
+        default_ttl=ttl,
+        cluster_id=cluster_id,
+    )
    print(
-        f"[mcp-cucm-axl] cache: {_cache.db_path} (ttl={ttl}s)",
+        f"[mcp-cucm-axl] cache: {_cache.db_path} "
+        f"(ttl={ttl}s, cluster_id={cluster_id})",
        file=sys.stderr,
        flush=True,
    )
--- a/src/mcp_cucm_axl/sql_validator.py
+++ b/src/mcp_cucm_axl/sql_validator.py
@@ -33,28 +33,36 @@ def validate_select(query: str) -> str:

    Accepts SELECT and WITH (CTEs that ultimately return SELECT). Rejects
    anything else, and any query containing forbidden keywords as standalone
-    tokens *outside* string literals.
+    tokens *outside* string literals and comments.

-    The cleaned query (with comments stripped) is what gets returned and sent
-    to AXL — string literals are NOT modified, only ignored during keyword
-    tokenization. So a query selecting WHERE name = 'Call Forward-CSS' is
-    safe: the literal "Call" inside quotes is invisible to the keyword check,
-    while the actual SQL with the unmodified literal travels intact to AXL.
+    Hamilton review CRITICAL #1: the output we return MUST preserve the input
+    byte-for-byte (modulo trailing semicolon and outer whitespace). Earlier
+    versions ran a non-literal-aware comment strip on the output, which would
+    silently eat `--` and `/* */` markers that legitimately appeared inside
+    string literals like `WHERE description = 'Smith -- old line'`. The query
+    going to AXL must be exactly what the caller intended — comment stripping
+    is an analysis-only operation, never a mutation of the wire query.
    """
    if not query or not query.strip():
        raise SqlValidationError("Query is empty.")

-    cleaned = _COMMENT_BLOCK.sub(" ", query)
-    cleaned = _COMMENT_LINE.sub(" ", cleaned).strip().rstrip(";").strip()
+    # The query we'll send to AXL: original input, with only outer whitespace
+    # and trailing semicolons trimmed. NO mutation of literals or
+    # in-string comment markers.
+    cleaned = query.strip().rstrip(";").strip()
    if not cleaned:
-        raise SqlValidationError("Query is empty after stripping comments.")
+        raise SqlValidationError("Query is empty after trimming.")

-    # Strip string literals before tokenizing so that words inside quoted
-    # values (e.g. CSS names containing "Call", DN descriptions containing
-    # "DELETE") don't trip the forbidden-keyword check. The cleaned query
-    # we return still contains the literals — only the analysis copy strips
-    # them.
+    # Analysis-only copy: strip string literals, then comments. This buffer is
+    # never sent to AXL, so stripping here cannot alter the wire query. Literals
+    # go first so that `--` or `/* */` inside a literal is not mistaken for a
+    # comment, which would hide the tokens that follow it from the keyword check.
    for_analysis = _STRING_LITERAL.sub(" ", cleaned)
+    for_analysis = _COMMENT_BLOCK.sub(" ", for_analysis)
+    for_analysis = _COMMENT_LINE.sub(" ", for_analysis)
+
+    if not for_analysis.strip():
+        raise SqlValidationError("Query is empty after stripping comments.")

    upper_tokens = [t.upper() for t in _WORD_RE.findall(for_analysis)]
    if not upper_tokens:
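The split between the wire query and the analysis-only copy in the sql_validator.py diff above can be sketched as a small stand-alone function. The regexes and the forbidden-keyword set below are assumptions made for illustration (the module's own `_STRING_LITERAL`, `_COMMENT_BLOCK`, `_COMMENT_LINE` and `_WORD_RE` definitions are not shown in this diff), and `check` is a hypothetical name.

```python
import re

# Assumed patterns; the real module's regexes are not visible in this diff.
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
COMMENT_BLOCK = re.compile(r"/\*.*?\*/", re.DOTALL)
COMMENT_LINE = re.compile(r"--[^\n]*")
WORD = re.compile(r"[A-Za-z_]+")
FORBIDDEN = {"DELETE", "DROP", "UPDATE", "INSERT"}  # illustrative subset


def check(query: str) -> str:
    """Return the query unchanged apart from outer trimming, or raise."""
    # Wire copy: only outer whitespace and trailing semicolons are removed.
    wire = query.strip().rstrip(";").strip()
    # Analysis copy: literals first, then comments; never sent anywhere.
    analysis = STRING_LITERAL.sub(" ", wire)
    analysis = COMMENT_BLOCK.sub(" ", analysis)
    analysis = COMMENT_LINE.sub(" ", analysis)
    bad = {t.upper() for t in WORD.findall(analysis)} & FORBIDDEN
    if bad:
        raise ValueError(f"forbidden keyword(s): {sorted(bad)}")
    return wire


literal_query = "SELECT * FROM numplan WHERE description = 'Smith -- old line';"
literal_result = check(literal_query)

comment_query = "SELECT 1 FROM device  -- TODO: someone DELETE the old test data"
comment_result = check(comment_query)
```

Both calls return their input with the literal and the trailing comment intact, while a bare `DELETE FROM device` would raise, because in that case the keyword survives into the analysis copy.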
--- a/tests/test_cache.py
+++ b/tests/test_cache.py
@@ -85,3 +85,95 @@ def test_purge_expired(tmp_path: Path):
    purged = c.purge_expired()
    assert purged == 1
    assert c.stats()["live_entries"] == 1
+
+
+class TestClusterIsolation:
+    """Hamilton review CRITICAL #2: cache key omitted cluster identity.
+
+    Prior to the fix, an `AXL_URL` swap (test → prod, or one cluster to another)
+    served stale results from cluster A as if from cluster B. The cache
+    couldn't tell the data came from a different mission. Now each cache
+    handle is bound to a cluster_id, and entries from a different cluster
+    must miss.
+    """
+
+    def test_different_cluster_ids_isolate_get(self, tmp_path: Path):
+        # Both caches point at the same DB file, but bound to different
+        # cluster IDs. A's writes must not be visible to B.
+        db = tmp_path / "shared.sqlite"
+        a = AxlCache(db, default_ttl=60, cluster_id="cluster-A")
+        b = AxlCache(db, default_ttl=60, cluster_id="cluster-B")
+
+        a.set("getCCMVersion", {}, {"version": "12.5"})
+        assert a.get("getCCMVersion", {}) == {"version": "12.5"}
+        assert b.get("getCCMVersion", {}) is None, (
+            "cluster-B must not see cluster-A's cached value"
+        )
+
+    def test_same_cluster_id_shares_cache(self, tmp_path: Path):
+        # Two handles with the SAME cluster_id should share results.
+        db = tmp_path / "shared.sqlite"
+        a = AxlCache(db, default_ttl=60, cluster_id="cluster-X")
+        a.set("listPhone", {"name": "SEP1"}, {"rows": ["one"]})
+        b = AxlCache(db, default_ttl=60, cluster_id="cluster-X")
+        assert b.get("listPhone", {"name": "SEP1"}) == {"rows": ["one"]}
+
+    def test_cluster_id_in_stats(self, tmp_path: Path):
+        c = AxlCache(tmp_path / "s.sqlite", default_ttl=60, cluster_id="cluster-Y")
+        c.set("getCCMVersion", {}, {"v": "15"})
+        stats = c.stats()
+        assert stats.get("cluster_id") == "cluster-Y", (
+            "stats must surface cluster_id so operators can verify which cluster they're caching"
+        )
+
+    def test_no_cluster_id_still_works_legacy(self, tmp_path: Path):
+        # Backward compat: no cluster_id keeps the old (but now risky) shape.
+        # The cache still functions; we just don't get isolation.
+        c = AxlCache(tmp_path / "legacy.sqlite", default_ttl=60)
+        c.set("x", {}, "y")
+        assert c.get("x", {}) == "y"
+
+    def test_clear_only_affects_current_cluster(self, tmp_path: Path):
+        db = tmp_path / "shared.sqlite"
+        a = AxlCache(db, default_ttl=60, cluster_id="cluster-A")
+        b = AxlCache(db, default_ttl=60, cluster_id="cluster-B")
+        a.set("x", {}, "from-A")
+        b.set("x", {}, "from-B")
+        deleted = a.clear()
+        assert deleted == 1, "clear() must only affect this cluster's entries"
+        assert b.get("x", {}) == "from-B", "cluster-B's entry must survive A's clear"
+
+    def test_migrate_legacy_database(self, tmp_path: Path):
+        """A cache database created before the cluster_id fix must
+        upgrade transparently — no `no such column` error on next INSERT.
+        """
+        import sqlite3
+        db = tmp_path / "legacy.sqlite"
+        # Manually create the OLD schema (no cluster_id column)
+        conn = sqlite3.connect(db)
+        conn.executescript(
+            """
+            CREATE TABLE axl_cache (
+                cache_key   TEXT PRIMARY KEY,
+                method      TEXT NOT NULL,
+                args_json   TEXT NOT NULL,
+                result_json TEXT NOT NULL,
+                created_at  REAL NOT NULL,
+                expires_at  REAL NOT NULL
+            );
+            INSERT INTO axl_cache VALUES
+              ('legacy-key', 'oldMethod', '{}', '"old-value"', 0, 9999999999);
+            """
+        )
+        conn.commit()
+        conn.close()
+
+        # Open with the new code — must not raise, must add the column
+        c = AxlCache(db, default_ttl=60, cluster_id="new-cluster")
+        # The new client should NOT see the legacy entry (it has no cluster_id)
+        # — this is the cautious behavior; legacy entries are isolated to the
+        # "unknown cluster" bucket.
+        assert c.get("oldMethod", {}) is None
+        # And it must be able to write/read its own entries
+        c.set("newMethod", {"a": 1}, "new-value")
+        assert c.get("newMethod", {"a": 1}) == "new-value"
--- a/tests/test_css_impact.py
+++ b/tests/test_css_impact.py
@@ -0,0 +1,119 @@
+"""Hamilton review MAJOR #3: find_devices_using_css must surface partial failures.
+
+The function is per-category resilient by design — if one schema query fails,
+the others still produce results. But the top-level summary previously hid
+that some categories errored out: `total_returned` and `any_truncated` only
+reflected the SUCCESSFUL categories. An LLM consuming "47 references, low
+impact" wouldn't know that 5 categories errored and the real number is
+likely much higher.
+
+After the fix: the response includes `complete: bool`, `categories_with_errors`,
+and `error_categories`, so an LLM (or human auditor) can see the partial-failure
+state and act on it.
+"""
+
+import pytest
+
+from mcp_cucm_axl.route_plan import find_devices_using_css
+
+
+class FakeAxlClient:
+    """Minimal stand-in for AxlClient that lets us simulate per-query failures.
+
+    Returns a fake CSS pkid for the lookup query, then either a single fake row
+    or an exception based on substring matching.
+    """
+
+    def __init__(self, error_on_columns: list[str] | None = None):
+        self.error_on_columns = error_on_columns or []
+        self.queries: list[str] = []
+
+    def execute_sql_query(self, sql: str) -> dict:
+        self.queries.append(sql)
+        # The CSS lookup query — return a fake pkid
+        if "callingsearchspace WHERE name" in sql:
+            return {"row_count": 1, "rows": [{"pkid": "fake-css-pkid"}]}
+        # Any query referencing an "error trigger" column → simulate failure
+        for trigger in self.error_on_columns:
+            if trigger in sql:
+                raise RuntimeError(f"simulated cluster failure on {trigger}")
+        # Otherwise return one fake reference row so the category isn't empty
+        return {
+            "row_count": 1,
+            "rows": [{"name": "FakeRef", "context": "FakePart", "description": "fake"}],
+        }
+
+
+def test_no_errors_reports_complete():
+    """Baseline: when every category succeeds, complete=True and no error fields populated."""
+    client = FakeAxlClient()
+    result = find_devices_using_css(client, "Some-CSS")
+    assert result["complete"] is True
+    assert result["categories_with_errors"] == 0
+    assert result["error_categories"] == []
+    # And total_returned reflects the successful categories
+    assert result["total_returned"] >= 1
+
+
+def test_one_errored_category_marks_incomplete():
+    """The audit-trust failure mode: one category errors out and the summary lies.
+    Fix: complete=False, categories_with_errors >= 1.
+    """
+    client = FakeAxlClient(error_on_columns=["fkcallingsearchspace_cgpnunknown"])
+    result = find_devices_using_css(client, "Some-CSS")
+    assert result["complete"] is False, (
+        "complete must be False when any category errored"
+    )
+    assert result["categories_with_errors"] >= 1
+    assert "device_cgpn_unknown_css" in result["error_categories"]
+
+
+def test_multiple_errors_all_listed():
+    """All errored categories must be enumerated in error_categories."""
+    client = FakeAxlClient(
+        error_on_columns=[
+            "fkcallingsearchspace_cgpnunknown",
+            "fkcallingsearchspace_reroute",
+            "fkcallingsearchspace_pilotqueuefull",
+        ]
+    )
+    result = find_devices_using_css(client, "Some-CSS")
+    assert result["complete"] is False
+    assert result["categories_with_errors"] == 3
+    assert set(result["error_categories"]) == {
+        "device_cgpn_unknown_css",
+        "device_reroute_css",
+        "huntpilot_queue_full_css",
+    }
+
+
+def test_total_returned_does_not_include_error_categories():
+    """An errored category contributes 0 to total_returned (correct behavior).
+    What's NEW: the response also flags that the count is partial.
+    """
+    client = FakeAxlClient(error_on_columns=["fkcallingsearchspace_cgpnunknown"])
+    result = find_devices_using_css(client, "Some-CSS")
+    # The count itself is unchanged from before — what's new is the warning
+    assert result["complete"] is False
+    # The error category has no rows in references_by_category
+    err_cat = result["references_by_category"].get("device_cgpn_unknown_css", {})
+    assert "error" in err_cat
+
+
+def test_css_not_found_returns_error_not_partial():
+    """If the CSS lookup itself fails (CSS doesn't exist), we return the
+    'not found' error early, NOT a partial-failure response. Distinct
+    failure modes deserve distinct shapes.
+    """
+
+    class CssNotFoundClient:
+        def execute_sql_query(self, sql):
+            if "callingsearchspace WHERE name" in sql:
+                return {"row_count": 0, "rows": []}
+            return {"row_count": 1, "rows": [{}]}
+
+    result = find_devices_using_css(CssNotFoundClient(), "Nonexistent-CSS")
+    assert "error" in result
+    assert "complete" not in result, (
+        "CSS-not-found is a hard error; we shouldn't dress it up as partial"
+    )
--- a/tests/test_sql_validator.py
+++ b/tests/test_sql_validator.py
@@ -124,3 +124,54 @@ class TestStringLiterals:
    def test_multiple_literals(self):
        q = "SELECT 1 FROM numplan WHERE name = 'CALL' AND description = 'UPDATE pending'"
        assert validate_select(q)
+
+
+class TestLiteralPreservedInOutput:
+    """Hamilton review CRITICAL #1: comment-strip mutated string literals.
+
+    The query SENT to AXL must preserve the literal contents byte-for-byte.
+    Previously, the comment-strip pass ran before the literal-aware pass,
+    so `--` or `/* */` inside a quoted string were silently eaten on the
+    way to the cluster. An LLM dialing `description LIKE '%-- old%'` got
+    a different query than it asked for.
+    """
+
+    def test_dash_dash_inside_literal_preserved(self):
+        q = "SELECT * FROM numplan WHERE description = 'Smith -- old line'"
+        result = validate_select(q)
+        assert "Smith -- old line" in result, (
+            f"line-comment marker inside literal must NOT be stripped; got: {result!r}"
+        )
+
+    def test_block_comment_marker_inside_literal_preserved(self):
+        q = "SELECT * FROM device WHERE name = 'before /* still in literal */ after'"
+        result = validate_select(q)
+        assert "/* still in literal */" in result
+        assert "before" in result and "after" in result
+
+    def test_like_pattern_with_dash_dash_preserved(self):
+        # Real-world case: an LLM searches for descriptions containing "--"
+        q = "SELECT pkid FROM numplan WHERE description LIKE '%-- old%'"
+        result = validate_select(q)
+        assert "'%-- old%'" in result
+
+    def test_actual_line_comment_outside_literal_still_handled(self):
+        # An actual --comment outside any literal is fine (AXL handles it),
+        # and the keyword check ignores it.
+        q = "SELECT 1 FROM device  -- a real comment at the end"
+        result = validate_select(q)
+        # We don't strip from output, so the comment stays in the returned text.
+        # The important thing is the validator passes and a forbidden keyword
+        # in the comment wouldn't trip the check (covered separately).
+        assert "SELECT 1 FROM device" in result
+
+    def test_forbidden_keyword_inside_real_comment_does_not_trip(self):
+        # Real comment, with a forbidden keyword in it, should not trip the validator
+        q = "SELECT 1 FROM device  -- TODO: someone DELETE the old test data"
+        result = validate_select(q)
+        assert "SELECT 1" in result
+
+    def test_block_literal_with_drop_inside_preserved(self):
+        q = "SELECT 1 FROM numplan WHERE description = 'log: DROP detected'"
+        result = validate_select(q)
+        assert "'log: DROP detected'" in result