Adds three things to test_scaling_perf.py:
1. 100-column wide-row SELECT - codec stress test at extreme widths.
1k rows x 100 cols = 19.4 ms (~19.4 us/row, ~194 ns/column-decode).
Per-column cost continues to drop with width thanks to loop
amortization (5 cols: 480 ns/col -> 100 cols: 194 ns/col).
2. 100k-row memory profile - samples RSS pre-execute, post-execute
(materialization cost), and during iteration. Real numbers:
pre-execute: 45.8 MB
post-execute: 71.2 MB (+25.4 MB = ~259 bytes/row materialization)
iteration: 0 KB extra (just walks the existing list)
Documents the in-memory cursor's actual cost: 100k rows = ~25 MB,
1M rows = ~250 MB - a fair regression baseline (the wall trips at 500 MB).
3. 1M-row scaling gated behind IFX_BENCH_1M=1 env var. Default off
because the dev container's rootdbs runs out of space. For
production-sized servers users can opt in (gate sketch below). Linear
extrapolation from the 100k numbers puts executemany(1M) at ~15s and
the 1M-row SELECT at ~3s.
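The gate itself is small in pytest terms (a sketch; the real test
module may spell it differently):

```python
import os

import pytest

# Sketch of the opt-in gate; the marker name is illustrative.
requires_1m = pytest.mark.skipif(
    os.environ.get("IFX_BENCH_1M") != "1",
    reason="1M-row scaling needs a production-sized rootdbs; set IFX_BENCH_1M=1",
)
```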
Note on the dev-container size limit: the dev image's rootdbs is sized
for typical developer workloads, not stress testing. A 1M-row
INSERT exceeds the available pages and fails with -242 ISAM -113
(out of space). This is correct behavior - the limit is enforced
at the storage layer.
Switched RSS sampling from ru_maxrss (peak, monotonic) to
/proc/self/status VmRSS (current). Earlier runs showed flat RSS because
a peak recorded earlier in the test session masked the current
fluctuation.
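Reading the current value is a few lines (a sketch of the sampling
approach, not necessarily the benchmark's exact helper):

```python
def _current_rss_kb() -> int:
    """Current (not peak) resident set size; Linux-only."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # /proc reports the value in kB
    raise RuntimeError("VmRSS not found in /proc/self/status")
```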
Extends the IfxPy comparison bench script with scaling workloads
(1k/10k/100k rows for both executemany and SELECT). Re-runs the
full comparison with consistent measurement methodology and updates
the README with the actually-correct numbers.
Earlier comparison runs reported informix-db winning all 5
benchmarks. Re-running select_bench_table_all with consistent
measurement gives 3.04 ms, not the 891 us I cited earlier - a
3.4x discrepancy attributable to noisy warmup + small-fixture
artifacts. The "we win everything" framing was wrong.
Corrected comparison reveals two clear stories:
Bulk-insert: pure-Python wins 1.6x at scale.
executemany(10k): IfxPy 259ms -> us 161ms (1.6x faster)
executemany(100k): IfxPy 2376ms -> us 1487ms (1.6x faster)
Reason: Phase 33's pipelining eliminates per-row RTT. IfxPy's
per-call API can't pipeline.
Large-fetch: IfxPy wins 2.3-2.4x at scale.
SELECT 1k rows: IfxPy 1.2ms / us 2.7ms (IfxPy 2.3x)
SELECT 10k rows: IfxPy 11.3ms / us 25.8ms (IfxPy 2.3x)
SELECT 100k rows: IfxPy 112ms / us 271ms (IfxPy 2.4x)
Reason: C-level fetch_tuple at ~1.1us/row beats Python
parse_tuple_payload at ~2.7us/row. Real C-vs-Python codec gap
showing up at scale.
For everyday workloads (single SELECT in a request, INSERT a
handful of rows), drivers are within 5-25%. For workloads where
the gap widens, direction depends on what you're doing - bulk-
write favors us, bulk-read favors IfxPy.
README's "Compared to IfxPy" section rewritten with the corrected
numbers and an honest "when to prefer which" subsection.
tests/benchmarks/compare/README.md mirror updated.
Net narrative: a "faster at bulk-write, slower at bulk-read,
comparable elsewhere" comparison story is more honest and more
durable than a "we win everything" claim that would have collapsed
the first time a user ran their own benchmark.
Side note (lint): one ambiguous unicode `×` in cursors.py replaced
with `x`.
Phase 37 ticket: parse_tuple_payload is the bottleneck at scale.
Closing the 1.6 us/row gap to IfxPy would make us competitive on
bulk-fetch too. Possible approaches: Cython codec, deeper inlining,
per-column dispatch pre-bake.
The serial-loop executemany paid one wire round-trip per row (~30us/row
on loopback). It was the one benchmark where IfxPy beat us in the
comparison work - we were 10% slower at executemany(1000) in txn.
Phase 33 pipelines the BIND+EXECUTE PDUs: build all N PDUs, send
them back-to-back, then drain all N responses. Eliminates per-row
RTT entirely.
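In outline (every helper name below is a placeholder, not the driver's
actual internals):

```python
def _executemany_pipelined(self, rows):
    # Phase 1: build all N BIND+EXECUTE PDUs up front - no socket I/O yet.
    pdus = [self._build_bind_execute_pdu(row) for row in rows]

    # Phase 2: one back-to-back write. The per-row RTT disappears because
    # the server works on row 0 while rows 1..N-1 are still in flight.
    self._sock.sendall(b"".join(pdus))

    # Phase 3: drain exactly N responses to keep the wire aligned. This
    # relies on the server answering every PDU (the C1 review concern).
    for index in range(len(pdus)):
        self._read_execute_response(row_index=index)
```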
Performance impact:
* executemany(1000) in txn: 31.3 ms -> 11.0 ms (2.85x faster)
* executemany(100) autocommit: 173 ms -> 154 ms (11% faster)
* executemany(1000) autocommit: 1740 ms -> 1590 ms (9% faster)
(Autocommit gets smaller wins because server-side log flushes
dominate - Phase 21.1's "autocommit cliff".)
IfxPy comparison flipped: us 10% slower -> us 2.05x faster on bulk
inserts. We now win all 5 head-to-head benchmarks against the C-bound
driver.
Margaret Hamilton review surfaced one CRITICAL concern (C1) - the
pipeline assumes Informix sends N responses for N pipelined PDUs
even when one fails. If the server cut the stream short, the drain
loop would deadlock on the next read.
Verified by 3 new integration tests in tests/test_executemany_pipeline.py:
* test_pipelined_executemany_mid_batch_constraint_violation (row 500/1000)
* test_pipelined_executemany_first_row_fails (row 0/100)
* test_pipelined_executemany_last_row_fails (row 99/100)
All confirm Informix sends N responses; wire stays aligned; connection
is usable after.
Plus 4 lower-priority fixes Hamilton recommended:
* H1: documented _raise_sq_err self-drains-SQ_EOT invariant + tripwire
* H2: docstring warning about O(N) lock duration; chunk for huge batches
* M1: prepend row-index to exception message rather than reformat
* M2: documented sendall-no-timeout caveat on hostile networks
77 unit + 239 integration + 33 benchmark = 349 tests; ruff clean.
Note: Phase 32 (Tier 1+2 benchmarks) was tagged without bumping
pyproject.toml's version string. .5 was git-tag-only; .6 is the next
published version increment.
Tier 1 — make existing benchmarks reliable:
* Bumped slow-bench rounds: cold_connect_disconnect 5->15, executemany
series 3->10. Single-round outliers no longer dominate.
* Switched bench reporting to median + IQR (sketch after this list).
The mean was being moved by individual GC pauses / scheduler hiccups
(IfxPy executemany IQR was 8.2 ms on a 28 ms median - a 29% spread -
so the mean was unreliable).
* Updated ifxpy_bench.py to also report median + IQR alongside mean
for cross-comparable numbers.
* Makefile bench targets now show median, iqr, mean, stddev, ops, rounds.
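The robust summary is tiny (a sketch of the statistic itself, not the
harness wiring):

```python
import statistics

def robust_summary(rounds_ms: list[float]) -> tuple[float, float]:
    """Median + IQR: one GC pause or scheduler hiccup barely moves either."""
    median = statistics.median(rounds_ms)
    q1, _, q3 = statistics.quantiles(rounds_ms, n=4)  # quartile cut points
    return median, q3 - q1

# One 36.5 ms outlier in 10 rounds: the mean jumps ~3%, the median barely moves.
med, iqr = robust_summary([28.1, 27.9, 28.4, 28.3, 36.5, 28.0, 28.6, 28.2, 27.8, 28.5])
```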
The robust statistics flipped the comparison story:
Old (mean, 3 rounds): us faster on 2 of 5, IfxPy faster on 2 (30-32%)
New (median, 10+ rounds): us faster on 4 of 5 benchmarks
| Benchmark | IfxPy | informix-db | Δ |
|---|---|---|---|
| select_one_row | 170us | 119us | us 30% faster |
| select_systables_first_10 | 186us | 142us | us 24% faster |
| select_bench_table_all 1k | 980us | 832us | us 15% faster |
| executemany 1k in txn | 28.3ms | 31.3ms | us 10% slower |
| cold_connect_disconnect | 12.0ms | 10.7ms | us 11% faster |
Tier 2 — add benchmarks for claims we make but don't verify:
tests/benchmarks/test_observability_perf.py:
* test_streaming_fetch_memory_profile — RSS sampling during a
cursor iteration. Documents memory growth shape; regression
wall at 100 MB / 1k rows. Currently flat (in-memory cursor
doesn't grow detectably for 278 rows).
* test_select_1_latency_percentiles — 1000-query distribution
with p50/p90/p95/p99/max (sketch after this list). Result:
p99/p50 = 1.42x (tight tail); p50=108us, p99=153us.
* test_concurrent_pool_throughput[2,4,8] — N worker threads
through pool, measures aggregate QPS + per-thread fairness.
Plateaus at ~6K QPS (server-bound); per-thread latency scales
~linearly with N (server serialization expected).
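Percentile extraction is straightforward (a sketch; run_query stands in
for the actual SELECT 1 benchmark body):

```python
import statistics
import time

def latency_percentiles(run_query, n=1000):
    """p50/p90/p95/p99 over n wall-clock samples, in microseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - t0) * 1e6)
    qs = statistics.quantiles(samples, n=100, method="inclusive")
    return qs[49], qs[89], qs[94], qs[98]  # p50, p90, p95, p99
```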
README.md (project root): updated Compared-to-IfxPy table with
the median-based numbers + IQR awareness note.
tests/benchmarks/compare/README.md: added "Statistical robustness"
section explaining why median over mean for fair comparison.
236 integration tests pass; ruff clean.
Adds a paired benchmark of informix-db (pure Python) against IfxPy
3.0.5 (IBM's C-bound driver via OneDB ODBC) on identical workloads
against the same Informix dev container.
Headline result: pure Python is competitive — and faster on 2/5
benchmarks where wire round-trip dominates over codec/marshaling.
| Benchmark | IfxPy | informix-db | Result |
|---|---:|---:|---:|
| select_one_row (single-row latency) | 128 us | 116 us | us 9% faster |
| select_systables_first_10 | 126 us | 184 us | IfxPy 32% faster |
| select_bench_table_all (1k rows) | 969 us | 855 us | us 12% faster |
| executemany(1000) in txn | 21.5 ms | 30.8 ms | IfxPy 30% faster |
| cold_connect_disconnect | 11.0 ms | 10.9 ms | comparable |
Why the surprising wins: IfxPy's path is Python -> OneDB ODBC ->
libifdmr -> wire. Ours is Python -> wire. When wire round-trip
dominates (single-row, bulk fetch), the missing abstraction layer
makes us faster. When per-row marshaling dominates (executemany),
IfxPy's C-level execute(stmt, tuple) beats Python BIND-PDU build.
Files added under tests/benchmarks/compare/:
* Dockerfile.ifxpy — Ubuntu 20.04 base with IfxPy + OneDB drivers
* ifxpy_bench.py — IfxPy benchmark workloads matching test_*_perf.py
* README.md — methodology, results, install gauntlet, reproduction
The IfxPy install gauntlet itself is part of the comparison story:
pinned to Python 3.11 (not 3.13), setuptools <58, permissive CFLAGS,
a manual download of the 92 MB OneDB ODBC tarball, four LD_LIBRARY_PATH
directories, and libcrypt.so.1 (deprecated since 2018; missing on Arch /
Fedora 35+ / RHEL 9). Versus our `pip install informix-db`.
README.md (project root): added "Compared to IfxPy" section under
Performance with the headline numbers and a pointer to the full
methodology.
.gitignore: keep Dockerfile/script/README under tests/benchmarks/
compare/, exclude the 92MB OneDB tarball and the local venv.
Third-pass optimization on parse_tuple_payload's hot loop. Previous
phases removed redundant work; this one removes correct-but-wasteful
work: the if/elif chain checked branches in implementation order, not
frequency order. Fixed-width types (INT, FLOAT, DATE, BIGINT - the most
common columns in real queries) sat at the bottom, paying ~7 failed
frozenset membership checks per column.
Changes (src/informix_db/_resultset.py):
* Added _FIXED_WIDTH_TYPES = frozenset(FIXED_WIDTHS.keys()) at module
load.
* New fast-path branch at the TOP of parse_tuple_payload's loop body
that handles every _FIXED_WIDTH_TYPES column inline: one frozenset
check, one dict lookup, one decode, continue. Skips every other
branch (sketch after this list).
* Cleaned up the bottom fall-through; it now genuinely only catches
unknown types.
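The shape of the change, with toy type codes standing in for the real
tables (everything below is illustrative, not the driver's code):

```python
import struct

# Toy stand-ins for the module's real dispatch tables.
TC_INT, TC_FLOAT, TC_VARCHAR = 0x02, 0x03, 0x0D
FIXED_WIDTHS = {TC_INT: 4, TC_FLOAT: 8}
DECODERS = {TC_INT: struct.Struct("!i").unpack, TC_FLOAT: struct.Struct("!d").unpack}
_FIXED_WIDTH_TYPES = frozenset(FIXED_WIDTHS)  # hoisted once at module load

def parse_row(type_codes, payload):
    pos, out = 0, []
    for tc in type_codes:
        # Fast path FIRST: the most common column types pay exactly one
        # membership check + one dict lookup + one decode, then continue.
        if tc in _FIXED_WIDTH_TYPES:
            width = FIXED_WIDTHS[tc]
            out.append(DECODERS[tc](payload[pos:pos + width])[0])
            pos += width
            continue
        # Rarer, slower branches sit below the fast path.
        if tc == TC_VARCHAR:
            n = payload[pos]  # 1-byte length prefix (toy encoding)
            out.append(payload[pos + 1:pos + 1 + n].decode())
            pos += 1 + n
            continue
        raise ValueError(f"unknown type code {tc:#x}")  # genuine fall-through
    return out
```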
Performance vs Phase 24 baseline:
* parse_tuple_5cols_iso8859: 1659 ns -> 1400 ns (-16%)
* parse_tuple_5cols_utf8: 1649 ns -> 1341 ns (-19%)
Cumulative vs Phase 21 baseline (before any optimization):
* parse_tuple_5cols: 2796 ns -> 1400 ns (-50%) - HALF the time
* decode_int: 230 ns -> 139 ns (-40%)
Margaret Hamilton review surfaced one HIGH finding addressed before
tagging:
* H: The fast-path optimization assumes every FIXED_WIDTHS key is
decodable WITHOUT qualifier inspection (encoded_length etc.). True
today, but a future contributor adding a fixed-width type that
needs qualifier bits (like DATETIME does) would silently get wrong
decode behavior - Lauren-Bug class failure.
Fix: added INVARIANT comment to FIXED_WIDTHS in converters.py AND
added tests/test_resultset_invariants.py with three CI tripwire
tests:
- _FIXED_WIDTH_TYPES is disjoint from every other dispatch branch
- Every FIXED_WIDTHS key has a DECODERS entry
- DECODERS keys stay < 0x100 (Phase 24 collision-free guarantee)
The tests carry instructions: if one fires, don't update the test
to match - either restore the property or refactor the optimization.
Comments rot when nobody reads them; tests fail loudly.
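One of the three, sketched (the real tests live in
tests/test_resultset_invariants.py; the body shape here is illustrative):

```python
from informix_db.converters import DECODERS, FIXED_WIDTHS  # assumed import path

def test_every_fixed_width_key_has_a_decoder():
    missing = sorted(tc for tc in FIXED_WIDTHS if tc not in DECODERS)
    assert not missing, (
        f"FIXED_WIDTHS keys without DECODERS entries: {missing}. "
        "Don't update this test to match - restore the property or "
        "refactor the fast-path optimization."
    )
```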
baseline.json refreshed; 72 unit + 224 integration + 28 bench = 324
tests; ruff clean.
Second pass of hot-path optimization on parse_tuple_payload. Two changes
to converters.py:
1. Split decode() into public + internal. Added _decode_base(base_tc,
raw, encoding) that takes an already-base-typed code and skips the
redundant base_type() call. Public decode() is now a one-line
wrapper. parse_tuple_payload's 4 call sites swapped to use
_decode_base directly. _fastpath.py's external decode() caller is
unaffected.
2. Pre-compiled struct.Struct unpackers. The fixed-width integer/float
decoders (_decode_smallint, _decode_int, _decode_bigint,
_decode_smfloat, _decode_float, _decode_date) switched from per-call
struct.unpack(fmt, raw) to module-level bound methods like
_UNPACK_INT = struct.Struct("!i").unpack. Format-string parsed once
at module load. Measured 37% faster than per-call struct.unpack in a
CPython 3.13 microbenchmark (pattern sketch below).
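The pattern (the `!i` format is from this change; the other format
strings are plausible wire layouts shown only for illustration):

```python
import struct

# Format strings parsed once at import; the bound .unpack skips the
# per-call format parsing/cache lookup that struct.unpack(fmt, raw) pays.
_UNPACK_SMALLINT = struct.Struct("!h").unpack  # format assumed
_UNPACK_INT = struct.Struct("!i").unpack       # per this change
_UNPACK_BIGINT = struct.Struct("!q").unpack    # format assumed
_UNPACK_FLOAT = struct.Struct("!d").unpack     # format assumed

def _decode_int(raw: bytes) -> int:
    return _UNPACK_INT(raw)[0]
```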
Performance vs Phase 23 baseline:
* decode_int: 173 ns -> 139 ns (-20%)
* decode_bigint: 188 ns -> 150 ns (-20%)
* parse_tuple_5cols: 2047 ns -> 1592 ns (-22%)
* 1k-row SELECT: 1255 us -> 989 us (-21%)
Cumulative vs original Phase 21 baseline:
* decode_int: 230 ns -> 139 ns (-40%)
* parse_tuple_5cols: 2796 ns -> 1592 ns (-43%)
* 1k-row SELECT: 1477 us -> 989 us (-33%)
Real-world fetch ceiling: 358K rows/sec -> ~620K rows/sec.
Margaret Hamilton review surfaced one HIGH-severity finding addressed
before tagging:
* H: The no-collision guarantee that makes _decode_base safe is
structural but undocumented (all DECODERS keys are ≤ 0xFF, all flag
bits are ≥ 0x100, so flagged inputs cannot coincidentally match).
Added load-bearing INVARIANT comment at DECODERS dict explaining
the constraint and what to do if violated. Cross-referenced from
_decode_base's docstring for bidirectional traceability.
baseline.json refreshed; all 224 integration tests pass; ruff clean.
Per-row decode is hit on every row of every SELECT. The original code
had three forms of waste in the inner loop:
1. Redundant base_type() call. ColumnInfo.type_code is already
base-typed by parse_describe at construction; calling base_type()
again per column per row was pure waste. Single largest savings.
2. IntFlag->int conversions inline (~10x per iteration). Lifted to
module-level _TC_X constants.
3. Lazy imports inside the loop body (_decode_datetime, _decode_interval,
BlobLocator, ClobLocator, RowValue, CollectionValue). Moved to top.
Plus three precomputed frozensets (_LENGTH_PREFIXED_SHORT_TYPES,
_COMPOSITE_UDT_TYPES, _NUMERIC_TYPES) replace inline tuple-membership
checks. _COLLECTION_KIND_MAP is now MappingProxyType (actually frozen).
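The hoisting pattern in miniature (enum members, codes, and names here
are illustrative):

```python
from enum import IntFlag
from types import MappingProxyType

class TypeCode(IntFlag):  # toy stand-in for the driver's enum
    DATETIME = 0x0A
    INTERVAL = 0x0E

# Hoisted to module load: no IntFlag->int conversion inside the row loop.
_TC_DATETIME = int(TypeCode.DATETIME)
_TC_INTERVAL = int(TypeCode.INTERVAL)

# Precomputed frozenset: O(1) membership instead of an inline tuple scan.
_NUMERIC_TYPES = frozenset({0x01, 0x02, 0x05})

# MappingProxyType: a genuinely read-only view, so "frozen" actually holds.
_COLLECTION_KIND_MAP = MappingProxyType({0x13: list, 0x14: set})
```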
Performance:
* parse_tuple_5cols: 2796 ns -> 2030 ns (-27%)
* select_bench_table_all (1k rows): 1477 us -> 1198 us (-19%)
* Codec micro-bench, cold connect, executemany: unchanged
Real-world fetch ceiling on a single connection: 350K rows/sec ->
490K rows/sec.
Margaret Hamilton review surfaced four cleanup items, all addressed
before tagging:
* H1: cursor._dereference_blob_columns had the same redundant
base_type() call - stripped for consistency.
* M1: documented the load-bearing invariant at parse_describe (the
single producer site) so future contributors have a grep target.
* M2: _COLLECTION_KIND_MAP wrapped in MappingProxyType.
* L1: stale line-number comment fixed to point at the INVARIANT
comment instead.
baseline.json refreshed; all 224 integration tests pass; ruff clean.
Investigation of the Phase 21 baseline finding that executemany(N) cost
scaled linearly per-row (1.74 ms x N) regardless of batch size.
Root cause: every autocommit=True INSERT forces a server-side
transaction-log flush. Not a wire-protocol bug.
Numbers:
* executemany(1000) autocommit=True: 1.72 s (1.72 ms/row)
* executemany(1000) in single txn: 32 ms (32 us/row)
53x speedup from changing the transaction boundary, not the driver.
Pure protocol overhead is ~32 us/row -> ~31K rows/sec sustained
throughput on a single connection. Comparable to pg8000.
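The user-facing difference is one transaction boundary (a DB-API-style
sketch; connection kwargs, table name, and the autocommit spelling are
illustrative):

```python
import informix_db  # the driver under discussion

conn = informix_db.connect(host="localhost", database="bench")  # kwargs illustrative
conn.autocommit = False  # one txn instead of one log flush per row
cur = conn.cursor()
cur.executemany(
    "INSERT INTO bench_table VALUES (?, ?)",
    [(i, f"name-{i}") for i in range(1000)],
)
conn.commit()  # single flush for the batch: ~32 us/row vs ~1.72 ms/row
```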
Added test_executemany_1000_rows_in_txn benchmark to make this
visible. Updated README headline numbers and added a "Performance
gotchas" section explaining when autocommit=False matters.
Decision: don't pipeline. The remaining 32 us is already excellent;
the autocommit gotcha is the real user-facing footgun. Docs > code.
If someone reports needing >31K rows/sec single-connection, that
becomes Phase 22.