Adds three things to test_scaling_perf.py:
1. 100-column wide-row SELECT - codec stress test at extreme widths.
1k rows x 100 cols = 19.4 ms (~19.4 us/row, ~194 ns/column-decode).
Per-column cost continues to drop with width thanks to loop
amortization (5 cols: 480 ns/col -> 100 cols: 194 ns/col).
2. 100k-row memory profile - samples RSS pre-execute, post-execute
(materialization cost), and during iteration. Real numbers:
pre-execute: 45.8 MB
post-execute: 71.2 MB (+25.4 MB = ~259 bytes/row materialization)
iteration: 0 KB extra (just walks the existing list)
Documents the in-memory cursor's actual cost: 100k rows = ~25 MB,
1M rows = ~250 MB - a fair regression baseline (the wall trips at 500 MB).
3. 1M-row scaling gated behind IFX_BENCH_1M=1 env var. Default off
because the dev container's rootdbs runs out of space. For
production-sized servers users can opt in (gate sketch below). Linear
extrapolation from the 100k numbers puts executemany(1M) at ~15s and
the 1M-row SELECT at ~3s.
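The gate itself is small in pytest terms (a sketch; the real test
module may spell it differently):

```python
import os

import pytest

# Sketch of the opt-in gate; the marker name is illustrative.
requires_1m = pytest.mark.skipif(
    os.environ.get("IFX_BENCH_1M") != "1",
    reason="1M-row scaling needs a production-sized rootdbs; set IFX_BENCH_1M=1",
)
```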
Note on the dev-container size limit: the dev image's rootdbs is sized
for typical developer workloads, not stress testing. A 1M-row
INSERT exceeds the available pages and fails with -242 ISAM -113
(out of space). This is correct behavior - the limit is enforced
at the storage layer.
Switched RSS sampling from ru_maxrss (peak, monotonic) to
/proc/self/status VmRSS (current). Earlier runs showed flat RSS because
a peak recorded earlier in the test session masked the current
fluctuation.
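Reading the current value is a few lines (a sketch of the sampling
approach, not necessarily the benchmark's exact helper):

```python
def _current_rss_kb() -> int:
    """Current (not peak) resident set size; Linux-only."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # /proc reports the value in kB
    raise RuntimeError("VmRSS not found in /proc/self/status")
```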
Extends the IfxPy comparison bench script with scaling workloads
(1k/10k/100k rows for both executemany and SELECT). Re-runs the
full comparison with consistent measurement methodology and updates
the README with the actually-correct numbers.
Earlier comparison runs reported informix-db winning all 5
benchmarks. Re-running select_bench_table_all with consistent
measurement gives 3.04 ms, not the 891 us I cited earlier - a
3.4x discrepancy attributable to noisy warmup + small-fixture
artifacts. The "we win everything" framing was wrong.
Corrected comparison reveals two clear stories:
Bulk-insert: pure-Python wins 1.6x at scale.
executemany(10k): IfxPy 259ms -> us 161ms (1.6x faster)
executemany(100k): IfxPy 2376ms -> us 1487ms (1.6x faster)
Reason: Phase 33's pipelining eliminates per-row RTT. IfxPy's
per-call API can't pipeline.
Large-fetch: IfxPy wins 2.3-2.4x at scale.
SELECT 1k rows: IfxPy 1.2ms / us 2.7ms (IfxPy 2.3x)
SELECT 10k rows: IfxPy 11.3ms / us 25.8ms (IfxPy 2.3x)
SELECT 100k rows: IfxPy 112ms / us 271ms (IfxPy 2.4x)
Reason: C-level fetch_tuple at ~1.1us/row beats Python
parse_tuple_payload at ~2.7us/row. Real C-vs-Python codec gap
showing up at scale.
For everyday workloads (single SELECT in a request, INSERT a
handful of rows), drivers are within 5-25%. For workloads where
the gap widens, direction depends on what you're doing - bulk-
write favors us, bulk-read favors IfxPy.
README's "Compared to IfxPy" section rewritten with the corrected
numbers and an honest "when to prefer which" subsection.
tests/benchmarks/compare/README.md mirror updated.
Net narrative: a "faster at bulk-write, slower at bulk-read,
comparable elsewhere" comparison story is more honest and more
durable than a "we win everything" claim that would have collapsed
the first time a user ran their own benchmark.
Side note (lint): one ambiguous unicode `×` in cursors.py replaced
with `x`.
Phase 37 ticket: parse_tuple_payload is the bottleneck at scale.
Closing the 1.6 us/row gap to IfxPy would make us competitive on
bulk-fetch too. Possible approaches: Cython codec, deeper inlining,
per-column dispatch pre-bake.
The serial-loop executemany paid one wire round-trip per row (~30us/row
on loopback). It was the one benchmark where IfxPy beat us in the
comparison work - we were 10% slower at executemany(1000) in txn.
Phase 33 pipelines the BIND+EXECUTE PDUs: build all N PDUs, send
them back-to-back, then drain all N responses. Eliminates per-row
RTT entirely.
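In outline (every helper name below is a placeholder, not the driver's
actual internals):

```python
def _executemany_pipelined(self, rows):
    # Phase 1: build all N BIND+EXECUTE PDUs up front - no socket I/O yet.
    pdus = [self._build_bind_execute_pdu(row) for row in rows]

    # Phase 2: one back-to-back write. The per-row RTT disappears because
    # the server works on row 0 while rows 1..N-1 are still in flight.
    self._sock.sendall(b"".join(pdus))

    # Phase 3: drain exactly N responses to keep the wire aligned. This
    # relies on the server answering every PDU (the C1 review concern).
    for index in range(len(pdus)):
        self._read_execute_response(row_index=index)
```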
Performance impact:
* executemany(1000) in txn: 31.3 ms -> 11.0 ms (2.85x faster)
* executemany(100) autocommit: 173 ms -> 154 ms (11% faster)
* executemany(1000) autocommit: 1740 ms -> 1590 ms (9% faster)
(Autocommit gets smaller wins because server-side log flushes
dominate - Phase 21.1's "autocommit cliff".)
IfxPy comparison flipped: us 10% slower -> us 2.05x faster on bulk
inserts. We now win all 5 head-to-head benchmarks against the C-bound
driver.
Margaret Hamilton review surfaced one CRITICAL concern (C1) - the
pipeline assumes Informix sends N responses for N pipelined PDUs
even when one fails. If the server cut the stream short, the drain
loop would deadlock on the next read.
Verified by 3 new integration tests in tests/test_executemany_pipeline.py:
* test_pipelined_executemany_mid_batch_constraint_violation (row 500/1000)
* test_pipelined_executemany_first_row_fails (row 0/100)
* test_pipelined_executemany_last_row_fails (row 99/100)
All confirm Informix sends N responses; wire stays aligned; connection
is usable after.
Plus 4 lower-priority fixes Hamilton recommended:
* H1: documented _raise_sq_err self-drains-SQ_EOT invariant + tripwire
* H2: docstring warning about O(N) lock duration; chunk for huge batches
* M1: prepend row-index to exception message rather than reformat
* M2: documented sendall-no-timeout caveat on hostile networks
77 unit + 239 integration + 33 benchmark = 349 tests; ruff clean.
Note: Phase 32 (Tier 1+2 benchmarks) was tagged without bumping
pyproject.toml's version string. .5 was git-tag-only; .6 is the next
published version increment.
Tier 1 — make existing benchmarks reliable:
* Bumped slow-bench rounds: cold_connect_disconnect 5->15, executemany
series 3->10. Single-round outliers no longer dominate.
* Switched bench reporting to median + IQR (sketch after this list).
The mean was being moved by individual GC pauses / scheduler hiccups
(IfxPy executemany IQR was 8.2 ms on a 28 ms median - a 29% spread -
so the mean was unreliable).
* Updated ifxpy_bench.py to also report median + IQR alongside mean
for cross-comparable numbers.
* Makefile bench targets now show median, iqr, mean, stddev, ops, rounds.
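The robust summary is tiny (a sketch of the statistic itself, not the
harness wiring):

```python
import statistics

def robust_summary(rounds_ms: list[float]) -> tuple[float, float]:
    """Median + IQR: one GC pause or scheduler hiccup barely moves either."""
    median = statistics.median(rounds_ms)
    q1, _, q3 = statistics.quantiles(rounds_ms, n=4)  # quartile cut points
    return median, q3 - q1

# One 36.5 ms outlier in 10 rounds: the mean jumps ~3%, the median barely moves.
med, iqr = robust_summary([28.1, 27.9, 28.4, 28.3, 36.5, 28.0, 28.6, 28.2, 27.8, 28.5])
```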
The robust statistics flipped the comparison story:
Old (mean, 3 rounds): us faster on 2 of 5, IfxPy faster on 2 (30-32%)
New (median, 10+ rounds): us faster on 4 of 5 benchmarks
| Benchmark | IfxPy | informix-db | Δ |
|---|---|---|---|
| select_one_row | 170us | 119us | us 30% faster |
| select_systables_first_10 | 186us | 142us | us 24% faster |
| select_bench_table_all 1k | 980us | 832us | us 15% faster |
| executemany 1k in txn | 28.3ms | 31.3ms | us 10% slower |
| cold_connect_disconnect | 12.0ms | 10.7ms | us 11% faster |
Tier 2 — add benchmarks for claims we make but don't verify:
tests/benchmarks/test_observability_perf.py:
* test_streaming_fetch_memory_profile — RSS sampling during a
cursor iteration. Documents memory growth shape; regression
wall at 100 MB / 1k rows. Currently flat (in-memory cursor
doesn't grow detectably for 278 rows).
* test_select_1_latency_percentiles — 1000-query distribution
with p50/p90/p95/p99/max (sketch after this list). Result:
p99/p50 = 1.42x (tight tail); p50=108us, p99=153us.
* test_concurrent_pool_throughput[2,4,8] — N worker threads
through pool, measures aggregate QPS + per-thread fairness.
Plateaus at ~6K QPS (server-bound); per-thread latency scales
~linearly with N (server serialization expected).
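Percentile extraction is straightforward (a sketch; run_query stands in
for the actual SELECT 1 benchmark body):

```python
import statistics
import time

def latency_percentiles(run_query, n=1000):
    """p50/p90/p95/p99 over n wall-clock samples, in microseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - t0) * 1e6)
    qs = statistics.quantiles(samples, n=100, method="inclusive")
    return qs[49], qs[89], qs[94], qs[98]  # p50, p90, p95, p99
```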
README.md (project root): updated Compared-to-IfxPy table with
the median-based numbers + IQR awareness note.
tests/benchmarks/compare/README.md: added "Statistical robustness"
section explaining why median over mean for fair comparison.
236 integration tests pass; ruff clean.
Adds a paired benchmark of informix-db (pure Python) against IfxPy
3.0.5 (IBM's C-bound driver via OneDB ODBC) on identical workloads
against the same Informix dev container.
Headline result: pure Python is competitive — and faster on 2/5
benchmarks where wire round-trip dominates over codec/marshaling.
| Benchmark | IfxPy | informix-db | Result |
|---|---:|---:|---:|
| select_one_row (single-row latency) | 128 us | 116 us | us 9% faster |
| select_systables_first_10 | 126 us | 184 us | IfxPy 32% faster |
| select_bench_table_all (1k rows) | 969 us | 855 us | us 12% faster |
| executemany(1000) in txn | 21.5 ms | 30.8 ms | IfxPy 30% faster |
| cold_connect_disconnect | 11.0 ms | 10.9 ms | comparable |
Why the surprising wins: IfxPy's path is Python -> OneDB ODBC ->
libifdmr -> wire. Ours is Python -> wire. When wire round-trip
dominates (single-row, bulk fetch), the missing abstraction layer
makes us faster. When per-row marshaling dominates (executemany),
IfxPy's C-level execute(stmt, tuple) beats Python BIND-PDU build.
Files added under tests/benchmarks/compare/:
* Dockerfile.ifxpy — Ubuntu 20.04 base with IfxPy + OneDB drivers
* ifxpy_bench.py — IfxPy benchmark workloads matching test_*_perf.py
* README.md — methodology, results, install gauntlet, reproduction
The IfxPy install gauntlet itself is part of the comparison story:
pinned to Python 3.11 (not 3.13), setuptools <58, permissive CFLAGS,
a manual download of the 92 MB OneDB ODBC tarball, four LD_LIBRARY_PATH
directories, and libcrypt.so.1 (deprecated since 2018; missing on Arch /
Fedora 35+ / RHEL 9). Versus our `pip install informix-db`.
README.md (project root): added "Compared to IfxPy" section under
Performance with the headline numbers and a pointer to the full
methodology.
.gitignore: keep Dockerfile/script/README under tests/benchmarks/
compare/, exclude the 92MB OneDB tarball and the local venv.
Third-pass optimization on parse_tuple_payload's hot loop. Previous
phases removed redundant work; this one removes correct-but-wasteful
work: the if/elif chain checked branches in implementation order, not
frequency order. Fixed-width types (INT, FLOAT, DATE, BIGINT - the most
common columns in real queries) sat at the bottom, paying ~7 failed
frozenset membership checks per column.
Changes (src/informix_db/_resultset.py):
* Added _FIXED_WIDTH_TYPES = frozenset(FIXED_WIDTHS.keys()) at module
load.
* New fast-path branch at the TOP of parse_tuple_payload's loop body
that handles every _FIXED_WIDTH_TYPES column inline: one frozenset
check, one dict lookup, one decode, continue. Skips every other
branch (sketch after this list).
* Cleaned up the bottom fall-through; it now genuinely only catches
unknown types.
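The shape of the change, with toy type codes standing in for the real
tables (everything below is illustrative, not the driver's code):

```python
import struct

# Toy stand-ins for the module's real dispatch tables.
TC_INT, TC_FLOAT, TC_VARCHAR = 0x02, 0x03, 0x0D
FIXED_WIDTHS = {TC_INT: 4, TC_FLOAT: 8}
DECODERS = {TC_INT: struct.Struct("!i").unpack, TC_FLOAT: struct.Struct("!d").unpack}
_FIXED_WIDTH_TYPES = frozenset(FIXED_WIDTHS)  # hoisted once at module load

def parse_row(type_codes, payload):
    pos, out = 0, []
    for tc in type_codes:
        # Fast path FIRST: the most common column types pay exactly one
        # membership check + one dict lookup + one decode, then continue.
        if tc in _FIXED_WIDTH_TYPES:
            width = FIXED_WIDTHS[tc]
            out.append(DECODERS[tc](payload[pos:pos + width])[0])
            pos += width
            continue
        # Rarer, slower branches sit below the fast path.
        if tc == TC_VARCHAR:
            n = payload[pos]  # 1-byte length prefix (toy encoding)
            out.append(payload[pos + 1:pos + 1 + n].decode())
            pos += 1 + n
            continue
        raise ValueError(f"unknown type code {tc:#x}")  # genuine fall-through
    return out
```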
Performance vs Phase 24 baseline:
* parse_tuple_5cols_iso8859: 1659 ns -> 1400 ns (-16%)
* parse_tuple_5cols_utf8: 1649 ns -> 1341 ns (-19%)
Cumulative vs Phase 21 baseline (before any optimization):
* parse_tuple_5cols: 2796 ns -> 1400 ns (-50%) - HALF the time
* decode_int: 230 ns -> 139 ns (-40%)
Margaret Hamilton review surfaced one HIGH finding addressed before
tagging:
* H: The fast-path optimization assumes every FIXED_WIDTHS key is
decodable WITHOUT qualifier inspection (encoded_length etc.). True
today, but a future contributor adding a fixed-width type that
needs qualifier bits (like DATETIME does) would silently get wrong
decode behavior - Lauren-Bug class failure.
Fix: added INVARIANT comment to FIXED_WIDTHS in converters.py AND
added tests/test_resultset_invariants.py with three CI tripwire
tests:
- _FIXED_WIDTH_TYPES is disjoint from every other dispatch branch
- Every FIXED_WIDTHS key has a DECODERS entry
- DECODERS keys stay < 0x100 (Phase 24 collision-free guarantee)
The tests carry instructions: if one fires, don't update the test
to match - either restore the property or refactor the optimization.
Comments rot when nobody reads them; tests fail loudly.
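One of the three, sketched (the real tests live in
tests/test_resultset_invariants.py; the body shape here is illustrative):

```python
from informix_db.converters import DECODERS, FIXED_WIDTHS  # assumed import path

def test_every_fixed_width_key_has_a_decoder():
    missing = sorted(tc for tc in FIXED_WIDTHS if tc not in DECODERS)
    assert not missing, (
        f"FIXED_WIDTHS keys without DECODERS entries: {missing}. "
        "Don't update this test to match - restore the property or "
        "refactor the fast-path optimization."
    )
```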
baseline.json refreshed; 72 unit + 224 integration + 28 bench = 324
tests; ruff clean.
Second pass of hot-path optimization on parse_tuple_payload. Two changes
to converters.py:
1. Split decode() into public + internal. Added _decode_base(base_tc,
raw, encoding) that takes an already-base-typed code and skips the
redundant base_type() call. Public decode() is now a one-line
wrapper. parse_tuple_payload's 4 call sites swapped to use
_decode_base directly. _fastpath.py's external decode() caller is
unaffected.
2. Pre-compiled struct.Struct unpackers. The fixed-width integer/float
decoders (_decode_smallint, _decode_int, _decode_bigint,
_decode_smfloat, _decode_float, _decode_date) switched from per-call
struct.unpack(fmt, raw) to module-level bound methods like
_UNPACK_INT = struct.Struct("!i").unpack. Format-string parsed once
at module load. Measured 37% faster than per-call struct.unpack in a
CPython 3.13 microbenchmark (pattern sketch below).
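The pattern (the `!i` format is from this change; the other format
strings are plausible wire layouts shown only for illustration):

```python
import struct

# Format strings parsed once at import; the bound .unpack skips the
# per-call format parsing/cache lookup that struct.unpack(fmt, raw) pays.
_UNPACK_SMALLINT = struct.Struct("!h").unpack  # format assumed
_UNPACK_INT = struct.Struct("!i").unpack       # per this change
_UNPACK_BIGINT = struct.Struct("!q").unpack    # format assumed
_UNPACK_FLOAT = struct.Struct("!d").unpack     # format assumed

def _decode_int(raw: bytes) -> int:
    return _UNPACK_INT(raw)[0]
```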
Performance vs Phase 23 baseline:
* decode_int: 173 ns -> 139 ns (-20%)
* decode_bigint: 188 ns -> 150 ns (-20%)
* parse_tuple_5cols: 2047 ns -> 1592 ns (-22%)
* 1k-row SELECT: 1255 us -> 989 us (-21%)
Cumulative vs original Phase 21 baseline:
* decode_int: 230 ns -> 139 ns (-40%)
* parse_tuple_5cols: 2796 ns -> 1592 ns (-43%)
* 1k-row SELECT: 1477 us -> 989 us (-33%)
Real-world fetch ceiling: 358K rows/sec -> ~620K rows/sec.
Margaret Hamilton review surfaced one HIGH-severity finding addressed
before tagging:
* H: The no-collision guarantee that makes _decode_base safe is
structural but undocumented (all DECODERS keys are ≤ 0xFF, all flag
bits are ≥ 0x100, so flagged inputs cannot coincidentally match).
Added load-bearing INVARIANT comment at DECODERS dict explaining
the constraint and what to do if violated. Cross-referenced from
_decode_base's docstring for bidirectional traceability.
baseline.json refreshed; all 224 integration tests pass; ruff clean.
Per-row decode is hit on every row of every SELECT. The original code
had three forms of waste in the inner loop:
1. Redundant base_type() call. ColumnInfo.type_code is already
base-typed by parse_describe at construction; calling base_type()
again per column per row was pure waste. Single largest savings.
2. IntFlag->int conversions inline (~10x per iteration). Lifted to
module-level _TC_X constants.
3. Lazy imports inside the loop body (_decode_datetime, _decode_interval,
BlobLocator, ClobLocator, RowValue, CollectionValue). Moved to top.
Plus three precomputed frozensets (_LENGTH_PREFIXED_SHORT_TYPES,
_COMPOSITE_UDT_TYPES, _NUMERIC_TYPES) replace inline tuple-membership
checks. _COLLECTION_KIND_MAP is now MappingProxyType (actually frozen).
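The hoisting pattern in miniature (enum members, codes, and names here
are illustrative):

```python
from enum import IntFlag
from types import MappingProxyType

class TypeCode(IntFlag):  # toy stand-in for the driver's enum
    DATETIME = 0x0A
    INTERVAL = 0x0E

# Hoisted to module load: no IntFlag->int conversion inside the row loop.
_TC_DATETIME = int(TypeCode.DATETIME)
_TC_INTERVAL = int(TypeCode.INTERVAL)

# Precomputed frozenset: O(1) membership instead of an inline tuple scan.
_NUMERIC_TYPES = frozenset({0x01, 0x02, 0x05})

# MappingProxyType: a genuinely read-only view, so "frozen" actually holds.
_COLLECTION_KIND_MAP = MappingProxyType({0x13: list, 0x14: set})
```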
Performance:
* parse_tuple_5cols: 2796 ns -> 2030 ns (-27%)
* select_bench_table_all (1k rows): 1477 us -> 1198 us (-19%)
* Codec micro-bench, cold connect, executemany: unchanged
Real-world fetch ceiling on a single connection: 350K rows/sec ->
490K rows/sec.
Margaret Hamilton review surfaced four cleanup items, all addressed
before tagging:
* H1: cursor._dereference_blob_columns had the same redundant
base_type() call - stripped for consistency.
* M1: documented the load-bearing invariant at parse_describe (the
single producer site) so future contributors have a grep target.
* M2: _COLLECTION_KIND_MAP wrapped in MappingProxyType.
* L1: stale line-number comment fixed to point at the INVARIANT
comment instead.
baseline.json refreshed; all 224 integration tests pass; ruff clean.
Investigation of the Phase 21 baseline finding that executemany(N) cost
scaled linearly per-row (1.74 ms x N) regardless of batch size.
Root cause: every autocommit=True INSERT forces a server-side
transaction-log flush. Not a wire-protocol bug.
Numbers:
* executemany(1000) autocommit=True: 1.72 s (1.72 ms/row)
* executemany(1000) in single txn: 32 ms (32 us/row)
53x speedup from changing the transaction boundary, not the driver.
Pure protocol overhead is ~32 us/row -> ~31K rows/sec sustained
throughput on a single connection. Comparable to pg8000.
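The user-facing difference is one transaction boundary (a DB-API-style
sketch; connection kwargs, table name, and the autocommit spelling are
illustrative):

```python
import informix_db  # the driver under discussion

conn = informix_db.connect(host="localhost", database="bench")  # kwargs illustrative
conn.autocommit = False  # one txn instead of one log flush per row
cur = conn.cursor()
cur.executemany(
    "INSERT INTO bench_table VALUES (?, ?)",
    [(i, f"name-{i}") for i in range(1000)],
)
conn.commit()  # single flush for the batch: ~32 us/row vs ~1.72 ms/row
```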
Added test_executemany_1000_rows_in_txn benchmark to make this
visible. Updated README headline numbers and added a "Performance
gotchas" section explaining when autocommit=False matters.
Decision: don't pipeline. The remaining 32 us is already excellent;
the autocommit gotcha is the real user-facing footgun. Docs > code.
If someone reports needing >31K rows/sec single-connection, that
becomes Phase 22.