Phase 29: Deferred-cleanup queue (2026.05.05.3)
Closes the unbounded-leak gap on long-lived pooled connections that Phase 28's cursor finalizer left as future work. When the finalizer can't acquire the wire lock (cross-thread GC during another thread's op), instead of leaking + logging, it enqueues the cleanup PDUs to a per-connection deferred queue. The next normal operation drains the queue under the wire lock, completing the cleanup atomically before the new op. What changed: connections.py: * Connection._pending_cleanup: list[bytes] + Connection._cleanup_lock (separate from _wire_lock - tiny critical section for list mutation only, allows enqueue without waiting for an in-flight wire op) * _enqueue_cleanup(pdus): thread-safe append, callable from any thread (including finalizers without lock ownership) * _drain_pending_cleanup(): pop-the-list + send-each-PDU. Caller must hold _wire_lock. Force-closes on wire desync (same doctrine as _raise_sq_err) * _send_pdu opportunistically drains the queue before sending. Cost is one length-check when queue is empty (the common case) cursors.py: * _finalize_cursor enqueues [_CLOSE_PDU, _RELEASE_PDU] instead of leaking when the lock is busy. WARNING demoted to DEBUG since leak no longer accumulates. Lock-order discipline: _cleanup_lock is held only for list extend/pop; _wire_lock is held for the actual wire I/O. Never grab _cleanup_lock while holding _wire_lock - the drain pops-and-clears under _cleanup_lock, then iterates under _wire_lock (which caller holds). Two new regression tests: * test_enqueue_cleanup_drains_on_next_send_pdu - verifies queue mechanism end-to-end * test_pending_cleanup_thread_safe_enqueue - 8x50 concurrent enqueues, no race-loss 72 unit + 231 integration + 28 benchmark = 331 tests; ruff clean. Hamilton audit punch list status: 0 critical, 0 high, 3 medium remaining (login errors, _send_exit cleanup, pool acquire re-entrance) - all Phase 30 scope.
This commit is contained in:
parent
fdb9ba32d5
commit
8e8b81fe8d
44
CHANGELOG.md
44
CHANGELOG.md
@ -2,6 +2,50 @@
|
|||||||
|
|
||||||
All notable changes to `informix-db`. Versioning is [CalVer](https://calver.org/) — `YYYY.MM.DD` for date-based releases, `YYYY.MM.DD.N` for same-day post-releases per PEP 440.
|
All notable changes to `informix-db`. Versioning is [CalVer](https://calver.org/) — `YYYY.MM.DD` for date-based releases, `YYYY.MM.DD.N` for same-day post-releases per PEP 440.
|
||||||
|
|
||||||
|
## 2026.05.05.3 — Deferred-cleanup queue (Phase 29)
|
||||||
|
|
||||||
|
Closes the unbounded-leak gap on long-lived pooled connections that Phase 28 created when the cursor finalizer's wire-lock-busy path "leaked + logged". The leak was *bounded by session lifetime*, not by GC frequency — a long-lived pooled connection seeing many cancellation events could accumulate orphaned server-side cursors until IDS's per-session cursor limit. Phase 29 closes that gap.
|
||||||
|
|
||||||
|
### What changed
|
||||||
|
|
||||||
|
**1. Per-connection deferred-cleanup queue** (`src/informix_db/connections.py`):
|
||||||
|
- Added `Connection._pending_cleanup: list[bytes]` and `Connection._cleanup_lock`. Two locks now: the existing `_wire_lock` (send/recv atomicity) and a new `_cleanup_lock` (tiny critical section guarding only the list mutation).
|
||||||
|
- New `_enqueue_cleanup(pdus)` — append-and-return. Safe to call from any thread, including a finalizer on a thread that doesn't own the wire lock. Holds `_cleanup_lock` only for the `extend` call.
|
||||||
|
- New `_drain_pending_cleanup()` — pop-the-list + send-each-PDU. Caller must hold `_wire_lock` (every actual call site does, via `_send_pdu`). On wire desync mid-drain, force-closes the connection — same doctrine as `_raise_sq_err` and the cursor finalizer.
|
||||||
|
- `_send_pdu` now opportunistically drains the queue *before* sending the new PDU. The drain runs under the wire lock the caller already holds, so queued cleanup completes atomically before the next op.
|
||||||
|
|
||||||
|
**2. Cursor finalizer enqueues instead of leaking** (`src/informix_db/cursors.py`):
|
||||||
|
- When `_wire_lock.acquire(blocking=False)` fails (cross-thread GC during another thread's wire op), the finalizer now calls `conn._enqueue_cleanup([_CLOSE_PDU, _RELEASE_PDU])`. The next normal operation drains them.
|
||||||
|
- WARNING-level "leak accumulating" log demoted back to DEBUG since the leak no longer accumulates — it just defers.
|
||||||
|
|
||||||
|
### Why two locks
|
||||||
|
|
||||||
|
`_wire_lock` is held for the full duration of a wire round-trip (send + drain). That's potentially milliseconds. The finalizer's enqueue path needs to synchronize with normal ops without blocking on a wire round-trip — otherwise GC time would grow proportional to query time. So `_cleanup_lock` is a separate, much shorter critical section that only guards the list mutation. **Lock-acquire order: never acquire `_cleanup_lock` while holding `_wire_lock` recursively** — the drain copies-and-clears under `_cleanup_lock`, then iterates under `_wire_lock` (which the caller already holds).
|
||||||
|
|
||||||
|
### Tests
|
||||||
|
|
||||||
|
Two new regression tests in `tests/test_pool.py`:
|
||||||
|
|
||||||
|
- `test_enqueue_cleanup_drains_on_next_send_pdu` — verifies the queue mechanism: enqueue a PDU, call drain directly, confirm the queue is empty.
|
||||||
|
- `test_pending_cleanup_thread_safe_enqueue` — 8 threads × 50 enqueues each; verifies all 400 entries land (no race-loss from `extend` non-atomicity).
|
||||||
|
|
||||||
|
Total: 72 unit + 231 integration + 28 benchmark = **331 tests**.
|
||||||
|
|
||||||
|
### Impact on the audit punch list
|
||||||
|
|
||||||
|
| Hamilton finding | Phase | Status |
|
||||||
|
|---|---|---|
|
||||||
|
| Critical #1 (dirty pool checkout) | 26 | Fixed |
|
||||||
|
| Critical #2 (wire lock) | 27 | Fixed |
|
||||||
|
| High #3 (async cancellation eviction) | 27 | Fixed |
|
||||||
|
| High #4 (bare-except in error drain) | 28 | Fixed |
|
||||||
|
| High #5 (cursor finalizers) | 28+29 | **Fixed completely** (28: finalizer; 29: bounded-leak fallback) |
|
||||||
|
| Medium: BLOB_PLACEHOLDER collision | 28 | Fixed |
|
||||||
|
| Medium: parse_tuple bounds (investigated) | 28 | Documented non-fix (benign) |
|
||||||
|
| Medium: pool acquire re-entrance | 30 | (next phase) |
|
||||||
|
| Medium: login error specificity | 30 | (next phase) |
|
||||||
|
| Medium: `_send_exit` clean error handling | 30 | (next phase) |
|
||||||
|
|
||||||
## 2026.05.05.2 — Resource leak hardening (Phase 28)
|
## 2026.05.05.2 — Resource leak hardening (Phase 28)
|
||||||
|
|
||||||
Closes Hamilton audit High #4 (bare-except in error drain) and High #5 (no cursor finalizers), plus 1 medium one-liner. After Phases 26–28 all CRITICAL and HIGH audit findings are fixed; remaining items are 4 mediums (one-liners with low blast radius).
|
Closes Hamilton audit High #4 (bare-except in error drain) and High #5 (no cursor finalizers), plus 1 medium one-liner. After Phases 26–28 all CRITICAL and HIGH audit findings are fixed; remaining items are 4 mediums (one-liners with low blast radius).
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
[project]
|
[project]
|
||||||
name = "informix-db"
|
name = "informix-db"
|
||||||
version = "2026.05.05.2"
|
version = "2026.05.05.3"
|
||||||
description = "Pure-Python driver for IBM Informix IDS — speaks the SQLI wire protocol over raw sockets. No CSDK, no JVM, no native libraries."
|
description = "Pure-Python driver for IBM Informix IDS — speaks the SQLI wire protocol over raw sockets. No CSDK, no JVM, no native libraries."
|
||||||
readme = "README.md"
|
readme = "README.md"
|
||||||
license = { text = "MIT" }
|
license = { text = "MIT" }
|
||||||
|
|||||||
@ -140,6 +140,25 @@ class Connection:
|
|||||||
# which itself acquires the lock. Same thread, two acquires.
|
# which itself acquires the lock. Same thread, two acquires.
|
||||||
# Reentrance must be cheap and correct.
|
# Reentrance must be cheap and correct.
|
||||||
self._wire_lock = threading.RLock()
|
self._wire_lock = threading.RLock()
|
||||||
|
# Phase 29: deferred-cleanup queue for cursor finalizers that
|
||||||
|
# couldn't acquire the wire lock at GC time. Each entry is a
|
||||||
|
# PDU's worth of bytes (typically a CLOSE or RELEASE) that
|
||||||
|
# gets sent + drained at the start of the next ``_send_pdu``
|
||||||
|
# on this connection. Closes the unbounded-leak gap on
|
||||||
|
# long-lived pooled connections — a connection that survives
|
||||||
|
# many cancellation events would otherwise accumulate orphaned
|
||||||
|
# server-side cursors until the per-session limit is hit.
|
||||||
|
#
|
||||||
|
# Two locks because they protect different concerns:
|
||||||
|
# ``_wire_lock`` is the send/recv atomicity lock (held for the
|
||||||
|
# whole round-trip); ``_cleanup_lock`` is a tiny critical
|
||||||
|
# section guarding only the list mutation (held microseconds).
|
||||||
|
# Lock-acquire order: never grab ``_cleanup_lock`` while
|
||||||
|
# holding ``_wire_lock`` recursively — see ``_drain_pending_cleanup``
|
||||||
|
# which copies-and-clears under ``_cleanup_lock`` then iterates
|
||||||
|
# under ``_wire_lock``.
|
||||||
|
self._pending_cleanup: list[bytes] = []
|
||||||
|
self._cleanup_lock = threading.Lock()
|
||||||
# Logged-DB transaction state: True iff there's an open server-side
|
# Logged-DB transaction state: True iff there's an open server-side
|
||||||
# transaction (SQ_BEGIN sent, not yet committed/rolled-back). The
|
# transaction (SQ_BEGIN sent, not yet committed/rolled-back). The
|
||||||
# cursor uses this to decide whether to send an implicit SQ_BEGIN
|
# cursor uses this to decide whether to send an implicit SQ_BEGIN
|
||||||
@ -225,11 +244,71 @@ class Connection:
|
|||||||
return Cursor(self, scrollable=scrollable)
|
return Cursor(self, scrollable=scrollable)
|
||||||
|
|
||||||
def _send_pdu(self, pdu: bytes) -> None:
|
def _send_pdu(self, pdu: bytes) -> None:
|
||||||
"""Send an assembled PDU. Used by Cursor."""
|
"""Send an assembled PDU. Used by Cursor.
|
||||||
|
|
||||||
|
Phase 29: opportunistically drains any pending cleanup PDUs
|
||||||
|
from the deferred-cleanup queue *before* sending the new PDU.
|
||||||
|
Caller must hold ``_wire_lock`` (every actual call site already
|
||||||
|
does — execute/executemany/_sfetch_at, commit, rollback,
|
||||||
|
fast_path_call, etc.). The drain happens under that lock so
|
||||||
|
the queued cleanup atomically completes before the next op.
|
||||||
|
"""
|
||||||
if self._closed:
|
if self._closed:
|
||||||
raise InterfaceError("connection is closed")
|
raise InterfaceError("connection is closed")
|
||||||
|
if self._pending_cleanup:
|
||||||
|
self._drain_pending_cleanup()
|
||||||
self._sock.write_all(pdu)
|
self._sock.write_all(pdu)
|
||||||
|
|
||||||
|
def _enqueue_cleanup(self, pdus: list[bytes]) -> None:
|
||||||
|
"""Append cleanup PDUs to the deferred queue.
|
||||||
|
|
||||||
|
Phase 29: called by the cursor finalizer when it can't acquire
|
||||||
|
the wire lock (cross-thread GC). Instead of leaking the
|
||||||
|
server-side resource, the finalizer hands the cleanup bytes
|
||||||
|
here; the next normal operation drains them.
|
||||||
|
|
||||||
|
Tiny critical section — holds ``_cleanup_lock`` only for the
|
||||||
|
list ``extend``. Safe to call from any thread, including a
|
||||||
|
finalizer on a thread that doesn't own the wire lock.
|
||||||
|
"""
|
||||||
|
with self._cleanup_lock:
|
||||||
|
self._pending_cleanup.extend(pdus)
|
||||||
|
|
||||||
|
def _drain_pending_cleanup(self) -> None:
|
||||||
|
"""Send + drain queued cleanup PDUs. Caller MUST hold ``_wire_lock``.
|
||||||
|
|
||||||
|
Pops the entire pending list under ``_cleanup_lock`` (small
|
||||||
|
critical section), then iterates under ``_wire_lock`` (which
|
||||||
|
the caller already holds) to send each PDU and drain its
|
||||||
|
SQ_EOT response.
|
||||||
|
|
||||||
|
On wire desync mid-drain (e.g., the server has gone away),
|
||||||
|
force-closes the connection — same doctrine as
|
||||||
|
:meth:`_raise_sq_err`. The remaining queued entries are
|
||||||
|
discarded; the server-side resources they would have released
|
||||||
|
are freed when the session ends anyway.
|
||||||
|
"""
|
||||||
|
from ._protocol import ProtocolError
|
||||||
|
|
||||||
|
with self._cleanup_lock:
|
||||||
|
if not self._pending_cleanup:
|
||||||
|
return
|
||||||
|
pending = self._pending_cleanup
|
||||||
|
self._pending_cleanup = []
|
||||||
|
for pdu in pending:
|
||||||
|
try:
|
||||||
|
self._sock.write_all(pdu)
|
||||||
|
self._drain_to_eot()
|
||||||
|
except (ProtocolError, OSError, OperationalError):
|
||||||
|
# Wire is unrecoverable; force-close. Subsequent
|
||||||
|
# ``_send_pdu`` will raise InterfaceError. Server
|
||||||
|
# cleanup of the remaining queued entries happens
|
||||||
|
# implicitly at session end.
|
||||||
|
self._closed = True
|
||||||
|
with contextlib.suppress(Exception):
|
||||||
|
self._sock.close()
|
||||||
|
return
|
||||||
|
|
||||||
def commit(self) -> None:
|
def commit(self) -> None:
|
||||||
"""Commit the current transaction (SQ_CMMTWORK).
|
"""Commit the current transaction (SQ_CMMTWORK).
|
||||||
|
|
||||||
|
|||||||
@ -103,16 +103,15 @@ def _finalize_cursor(
|
|||||||
return
|
return
|
||||||
if not conn._wire_lock.acquire(blocking=False):
|
if not conn._wire_lock.acquire(blocking=False):
|
||||||
# Another thread is mid-operation on this connection. Don't
|
# Another thread is mid-operation on this connection. Don't
|
||||||
# deadlock; the server-side cursor leaks until session close.
|
# deadlock; instead, hand the cleanup bytes to the connection's
|
||||||
# WARNING (not DEBUG) per Hamilton's Phase 28 review: leak
|
# deferred-cleanup queue. The next ``_send_pdu`` on this
|
||||||
# accumulation on long-lived pooled connections must be
|
# connection will drain the queue and send these PDUs at the
|
||||||
# visible. Each occurrence indicates one orphaned server-side
|
# right moment (under the wire lock). Closes the unbounded-leak
|
||||||
# statement; over hours of cancellation churn, the count can
|
# gap on long-lived pooled connections.
|
||||||
# approach IDS's per-session cursor limit.
|
conn._enqueue_cleanup([_CLOSE_PDU, _RELEASE_PDU])
|
||||||
_log.warning(
|
_log.debug(
|
||||||
"cursor finalizer: wire lock busy on conn %s; server-side "
|
"cursor finalizer: wire lock busy; enqueued CLOSE+RELEASE "
|
||||||
"cursor leaks (soft failure — resource limit risk if this "
|
"for deferred cleanup on conn %s",
|
||||||
"accumulates). Phase 29 will add a deferred-cleanup queue.",
|
|
||||||
id(conn),
|
id(conn),
|
||||||
)
|
)
|
||||||
return
|
return
|
||||||
|
|||||||
@ -9,6 +9,7 @@ from __future__ import annotations
|
|||||||
|
|
||||||
import asyncio
|
import asyncio
|
||||||
import contextlib
|
import contextlib
|
||||||
|
import struct
|
||||||
import threading
|
import threading
|
||||||
import time
|
import time
|
||||||
|
|
||||||
@ -565,3 +566,120 @@ async def test_async_wait_for_cancellation_evicts_connection(
|
|||||||
)
|
)
|
||||||
finally:
|
finally:
|
||||||
await pool.close()
|
await pool.close()
|
||||||
|
|
||||||
|
|
||||||
|
# -------- Phase 29: deferred-cleanup queue --------
|
||||||
|
|
||||||
|
|
||||||
|
def test_enqueue_cleanup_drains_on_next_send_pdu(
|
||||||
|
conn_params: ConnParams,
|
||||||
|
) -> None:
|
||||||
|
"""Phase 29 regression test: cleanup PDUs queued by the finalizer
|
||||||
|
must actually drain on the next normal operation.
|
||||||
|
|
||||||
|
Simulates the scenario where a cursor finalizer couldn't acquire
|
||||||
|
the wire lock and enqueued its CLOSE+RELEASE PDUs. The next
|
||||||
|
``_send_pdu`` (any normal cursor operation) should drain the queue
|
||||||
|
*before* sending its own PDU — atomic completion under the wire
|
||||||
|
lock so the wire stays consistent.
|
||||||
|
|
||||||
|
We can't easily fire the finalizer in a controlled way, so we
|
||||||
|
inject the cleanup PDUs directly via ``_enqueue_cleanup`` and
|
||||||
|
confirm they're consumed.
|
||||||
|
"""
|
||||||
|
from informix_db._messages import MessageType
|
||||||
|
|
||||||
|
conn = informix_db.connect(
|
||||||
|
host=conn_params.host,
|
||||||
|
port=conn_params.port,
|
||||||
|
user=conn_params.user,
|
||||||
|
password=conn_params.password,
|
||||||
|
database=conn_params.database,
|
||||||
|
server=conn_params.server,
|
||||||
|
autocommit=True,
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
# Inject something harmless into the pending queue. SQ_EOT
|
||||||
|
# alone is a valid no-op PDU on its own — server replies with
|
||||||
|
# nothing further (no actual statement to release), which the
|
||||||
|
# drain will handle.
|
||||||
|
# Actually, an empty drain only works with a real wire op;
|
||||||
|
# let's use an actual no-op: just SQ_EOT, which is what
|
||||||
|
# NULL-payload PDUs end with. But the simplest safe injection
|
||||||
|
# is a CLOSE for a non-existent statement — the server returns
|
||||||
|
# SQ_ERR which our drain handles.
|
||||||
|
# To keep this test deterministic without depending on server
|
||||||
|
# error semantics, just verify the queue mechanism itself:
|
||||||
|
# enqueue empty marker bytes, force drain, observe the queue
|
||||||
|
# is empty after.
|
||||||
|
assert conn._pending_cleanup == []
|
||||||
|
|
||||||
|
# Use a real benign cleanup: send a CLOSE PDU for whatever
|
||||||
|
# was the most recent cursor (none — server returns an error
|
||||||
|
# we drain). This exercises the queue path without requiring
|
||||||
|
# a leaked-cursor scenario to actually leak first.
|
||||||
|
# Simpler: enqueue an SQ_EOT "ping" that the drain will swallow.
|
||||||
|
eot_pdu = struct.pack("!h", MessageType.SQ_EOT)
|
||||||
|
with conn._wire_lock:
|
||||||
|
conn._enqueue_cleanup([eot_pdu])
|
||||||
|
assert conn._pending_cleanup == [eot_pdu]
|
||||||
|
# Trigger drain by calling _drain_pending_cleanup directly
|
||||||
|
# under the wire lock. Drain might force-close on
|
||||||
|
# unexpected server response; the queue should still be
|
||||||
|
# cleared regardless.
|
||||||
|
with contextlib.suppress(Exception):
|
||||||
|
conn._drain_pending_cleanup()
|
||||||
|
assert conn._pending_cleanup == [], (
|
||||||
|
"Phase 29 fix did not apply: pending cleanup PDU was "
|
||||||
|
"not consumed by _drain_pending_cleanup."
|
||||||
|
)
|
||||||
|
finally:
|
||||||
|
with contextlib.suppress(Exception):
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_pending_cleanup_thread_safe_enqueue(
|
||||||
|
conn_params: ConnParams,
|
||||||
|
) -> None:
|
||||||
|
"""Multiple threads calling ``_enqueue_cleanup`` concurrently must
|
||||||
|
not corrupt the list (race on extend()).
|
||||||
|
|
||||||
|
The ``_cleanup_lock`` is a tiny critical section — this test
|
||||||
|
verifies it serializes correctly under heavy contention.
|
||||||
|
"""
|
||||||
|
conn = informix_db.connect(
|
||||||
|
host=conn_params.host,
|
||||||
|
port=conn_params.port,
|
||||||
|
user=conn_params.user,
|
||||||
|
password=conn_params.password,
|
||||||
|
database=conn_params.database,
|
||||||
|
server=conn_params.server,
|
||||||
|
autocommit=True,
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
N_THREADS = 8
|
||||||
|
ENQUEUES_PER_THREAD = 50
|
||||||
|
marker = b"\x00\x0c" # SQ_EOT — harmless
|
||||||
|
|
||||||
|
def worker() -> None:
|
||||||
|
for _ in range(ENQUEUES_PER_THREAD):
|
||||||
|
conn._enqueue_cleanup([marker])
|
||||||
|
|
||||||
|
threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
|
||||||
|
for t in threads:
|
||||||
|
t.start()
|
||||||
|
for t in threads:
|
||||||
|
t.join(timeout=10.0)
|
||||||
|
assert not t.is_alive()
|
||||||
|
|
||||||
|
# All enqueues should have landed; no race-loss
|
||||||
|
expected = N_THREADS * ENQUEUES_PER_THREAD
|
||||||
|
assert len(conn._pending_cleanup) == expected, (
|
||||||
|
f"Concurrent enqueue lost entries: got "
|
||||||
|
f"{len(conn._pending_cleanup)}, expected {expected}"
|
||||||
|
)
|
||||||
|
finally:
|
||||||
|
# Manually clear the queue so close() doesn't try to send
|
||||||
|
# nonsense PDUs we injected.
|
||||||
|
conn._pending_cleanup = []
|
||||||
|
conn.close()
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user