2 Commits

Author SHA1 Message Date
39d4b29392 Add RisPort70 for real-time registration state + rate-limit backoff
Two ideas borrowed from cisco-cucm-mcp (calltelemetry/cisco-cucm-mcp,
MIT licensed): real-time device registration via RisPort70, and
exponential-backoff retry on transient HTTP 5xx errors. Both are
purpose-built for the audit use case rather than general-purpose
ports — RisPort tools exist to inform audit findings, not as a
standalone "look at my devices" interface.

Rate limit / 503 backoff (~30 lines + 3 tests):
  AxlClient now mounts an HTTPAdapter with a urllib3 Retry policy
  (3 retries, exponential backoff, status_forcelist=[502,503,504]).
  Configurable via AXL_RATE_LIMIT_RETRIES (default 3, 0 disables).
  Surfaces in connection_status() so operators can see the policy.
  Closes a real reliability gap: CUCM SOAP rate-limits under load
  during change windows or with multiple concurrent admins; pre-fix
  any 503 was a hard failure.

RisPort70 (new src/risport.py + 2 tools + prompt update):
  Hand-coded SOAP client for /realtimeservice2/services/RISService70
  (avoids dragging in another zeep instance for one operation).
  Reuses AXL_URL/USER/PASS env vars — RisPort lives on the same host.

  New tools:
    device_registration_status(device_class, status, name_filter, page_size)
    device_registration_summary()  — cluster-wide breakdown by class

  Live-cluster verification (cucm-pub.binghammemorial.org):
    Phone:    803  registered=679  unregistered=123  rejected=1
    Gateway:   85  registered=41   rejected=44   ← real audit finding
    SIPTrunk:  22  registered=18   unregistered=4
    HuntList:  28  registered=28
    H323/CTI:  0   (cluster doesn't use these)

  Discovered while live-verifying: CUCM 15 wraps the RisPort response
  in an extra <SelectCmDeviceResult> element inside <selectCmDeviceReturn>.
  Older CUCM versions exposed the fields directly. The parser falls
  back to either shape; tests cover both (test_legacy_response_shape_still_parses
  asserts the older shape still works).

phone_inventory_report prompt updated:
  New Step 3 — "Cross-reference with real-time registration" — recommends
  device_registration_summary() + device_registration_status(status="UnRegistered")
  to surface configured-but-never-registered phones (strongest orphan signal),
  PartiallyRegistered phones (firewall/cert/version mismatch indicator),
  and registration-state vs config-state mismatches.

Tooling delta worth noting:
  AXL device count:    1,377 phones
  RisPort device count:   803 phones
  Delta (~574)         likely templates, hidden phones, or stale config —
                       itself an audit finding the new tool will surface
                       to anyone running phone_inventory_report.

README updated:
  - Added health(), device_registration_status, device_registration_summary
  - Added "Scope and complement" section recommending @calltelemetry/cisco-cucm-mcp
    alongside for operational debugging (logs, perfmon, packet capture,
    service control). The two servers answer different questions; the LLM
    with both can compose audit findings with operational state.
  - Listed all 10 prompts (was 4 outdated entries).

Tests: 134 → 155 (+21).
2026-04-26 10:28:04 -06:00
90227ab391 Hamilton review fixes part 2: bounded regex, connection recovery, _to_int diagnostic, consistent error shapes
Closes the four remaining findings from the margaret-hamilton review.
13 new regression tests; all 100 pass; live cluster smoke verified.

MAJOR #4 — wildcard regex catastrophic backtracking + silent malformed.

Two changes to _wildcard_to_regex():

a) Bounded the `!` and `@` wildcards to \d{1,50} (was \d+). Adjacent
   `!` patterns previously compiled to (\d+)(\d+)... which has
   exponential backtracking on near-miss inputs. CUCM dial strings
   are practically capped well below 50 digits; the bound keeps
   complexity polynomial without losing real-world coverage.
   Verified: 10 adjacent `!` against a 30-digit near-miss now finishes
   in ~240ms (was unbounded; could have been minutes on real
   pathological cases).

b) Unclosed `[` now raises ValueError instead of silently treating the
   bracket as a literal. _pattern_matches_number catches the error
   and returns False so a single bad pattern doesn't crash
   translation_chain — but the bad pattern is no longer invisibly
   producing wrong matches. The previous silent fallback meant a
   pattern like `[0-9` (typo, missing `]`) would match input
   containing the literal characters `[` `0` `-` `9`.

3 new tests covering: bounded-regex shape (`\d{1,N}`), pathological
input completes quickly, unclosed bracket raises explicitly,
well-formed character class still works.

MAJOR #5 — distinguish config errors from operational errors.

Pre-fix: any first-time connection failure set `_connection_error`
and pinned it forever. A transient network blip or session timeout
required restarting the MCP server. Hamilton's framing: Apollo's
software was *designed* to recover from transient faults; pinning
forever is the antithesis of "design the error path first."

Fix: split into two state fields:
  _config_error  — permanent until restart (missing env vars only)
  _last_error    — last operational failure, NOT a pin

Operational failures (zeep Client construction, network, TLS, session)
clear from the next call's perspective: the next call attempts fresh.
Configuration errors (missing AXL_URL etc.) stay pinned because
they don't get better on retry.

Added _ConfigError as a private subclass to make the distinction
explicit at the raise site, and connection_status() to expose
connected/connected_at/config_error/last_error for diagnostic
transparency.

3 new tests: config errors pin, operational errors don't pin,
connection_status() reports state.

MINOR #6 — _to_int silent coercion of bad data.

Pre-fix: a non-numeric value from the cluster (data corruption,
schema drift across CUCM versions) silently became None, which
downstream sort logic defaulted to 0 — jumbling the failover order
in the displayed result with no warning.

Fix: still returns None on bad data (caller error path unchanged),
but logs the offending value to stderr so an operator notices
something's wrong at the data layer. None itself is silent
(legitimately-unset column).

2 new tests: real None is silent, bad string logs to stderr with
the offending value visible.

MINOR #7 — standardize tool failure shapes; add health() tool.

Pre-fix: cache_stats and cache_clear returned `{"error": "..."}`
when _cache was None, while AXL-touching tools raised RuntimeError.
LLM consumers had to handle two shapes.

Fix: _require_cache() helper raises RuntimeError consistently with
_client(). All tool failures now use the same exception shape.
Added health() tool that reports cache/axl/docs initialization
status plus the AXL connection_status — gives operators a
self-diagnostic when something fails at bootstrap.

3 new tests: cache_stats raises, cache_clear raises, health()
reports each subsystem.
2026-04-25 23:19:32 -06:00