Ryan Malloy 495128c679 Phase 21.1: executemany perf - it was the autocommit cliff (2026.05.04.6)
Investigation of the Phase 21 baseline finding that executemany(N) cost
scaled linearly per-row (1.74 ms x N) regardless of batch size.

Root cause: every autocommit=True INSERT forces a server-side
transaction-log flush. Not a wire-protocol bug.

Numbers:
* executemany(1000) autocommit=True: 1.72 s (1.72 ms/row)
* executemany(1000) in single txn:    32 ms (32 us/row)

53x speedup from changing the transaction boundary, not the driver.
Pure protocol overhead is ~32 us/row -> ~31K rows/sec sustained
throughput on a single connection. Comparable to pg8000.

Added test_executemany_1000_rows_in_txn benchmark to make this
visible. Updated README headline numbers and added a "Performance
gotchas" section explaining when autocommit=False matters.

Decision: don't pipeline. The remaining 32 us is already excellent;
the autocommit gotcha is the real user-facing footgun. Docs > code.
If someone reports needing >31K rows/sec single-connection, that
becomes Phase 22.
2026-05-04 17:26:16 -06:00


# Benchmarks (Phase 21)
Performance baselines for `informix-db`. Two layers:
1. **Codec micro-benchmarks** (`test_codec_perf.py`) — pure CPU, no
server. These set the *ceiling* for what end-to-end can achieve.
Run with `make bench-codec`. Suitable for CI's pre-merge job.
2. **End-to-end benchmarks** — exercise the full
PREPARE → BIND → EXECUTE → FETCH → CLOSE → RELEASE round-trip.
Need an Informix container (`make ifx-up`). Run with `make bench`.
## Headline numbers (baseline 2026-05-04, x86_64 Linux, dev container on loopback)
| Operation | Mean | Ops/sec |
|-|-:|-:|
| `decode(int)` (per cell) | 181 ns | 5.5M |
| `parse_tuple_payload(5 cols)` (per row) | 2.87 µs | 350K |
| `encode_param(int)` (per param) | 103 ns | 9.7M |
| `SELECT 1` round-trip | 177 µs | 5,650 |
| Pool acquire + tiny query + release | 295 µs | 3,400 |
| **Cold connect + close** (login handshake) | **11.2 ms** | **89** |
| 1000-row SELECT * | 1.56 ms | 640 |
| INSERT (single, prepared) | 1.88 ms | 530 |
| `executemany(100)` autocommit=True | 181 ms | ~550 rows/sec |
| `executemany(1000)` autocommit=True | 1.72 s | ~580 rows/sec |
| **`executemany(1000)` in single transaction** | **32 ms** | **~31,000 rows/sec** |
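
The Ops/sec column is derived directly from the Mean column: per-operation rows use `1 / mean`, and the `executemany` rows use `batch_size / mean`. A quick sanity check on two of the rows above:

```python
# Recompute two table entries from their means (values taken from the
# table above; this is arithmetic, not a new measurement).
mean_select1 = 177e-6            # `SELECT 1` round-trip, in seconds
print(round(1 / mean_select1))   # 5650 ops/sec

mean_batch_txn = 32e-3           # executemany(1000) in one transaction
print(round(1000 / mean_batch_txn))  # 31250, reported as ~31,000 rows/sec
```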
### What these tell you
- **Pool gives ~38× speedup** over cold connect (11.2 ms vs 295 µs to
  acquire, run a tiny query, and release). If your app opens a
  connection per request, fix that first.
- **Wrap bulk INSERTs in a transaction.** That's a **53× speedup** over
the autocommit-True default. With autocommit on, each row forces the
server to flush its transaction log; in transaction mode the flush
happens once at COMMIT. Per-row cost drops from 1.72 ms (storage-bound)
  to 32 µs (pure protocol). PEP 249 specifies that connections open in
  non-autocommit mode for exactly this reason, and `informix-db` follows
  that default.
- **Codec is not the bottleneck.** Per-row decode (2.87 µs) is ~60× faster
  than the `SELECT 1` wire round-trip (177 µs), and per-cell decode
  (181 ns) is ~1000× faster. Network and server-side
cost dominate.
- **UTF-8 carries no measurable cost.** `decode_varchar_utf8` runs at
216 ns vs `decode_varchar_short` at 170 ns — the 27% delta is the
multibyte string walk inherent in UTF-8 decoding, not Phase 20 overhead.
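
The transaction-boundary fix is plain PEP 249. A minimal sketch, using stdlib `sqlite3` as a stand-in (any PEP 249 driver exposes the same shape; the table and columns here are made up, and `informix-db`'s connect call will differ):

```python
import sqlite3

# sqlite3 stands in for a PEP 249 driver: connect, cursor, executemany,
# then exactly one commit for the whole batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")  # hypothetical table

rows = [(i, f"name-{i}") for i in range(1000)]

# One transaction for the whole batch: the server flushes its
# transaction log once at COMMIT instead of once per row.
cur = conn.cursor()
cur.executemany("INSERT INTO t (id, name) VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 1000
```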
### Performance gotchas
- **`autocommit=True` + `executemany` is the slowest reasonable pattern.**
Use it only when each row genuinely needs to land independently. For
bulk loads, default `autocommit=False` and call `conn.commit()` at the
end of the batch.
- **Single `INSERT` in a tight loop is 1.88 ms each** — strictly worse
than `executemany` (which saves PREPARE/RELEASE overhead). If you find
yourself looping over `cur.execute("INSERT...")` hundreds of times,
switch to `executemany`.
- **Cold connect is 11 ms.** The login handshake is *expensive* compared
to anything you'll do with the connection. Pool everything in
long-lived processes.
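
The loop-to-`executemany` conversion from the second gotcha, again sketched with stdlib `sqlite3` as a PEP 249 stand-in (the `events` schema is illustrative, not from this project):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, msg TEXT)")
batch = [(i, f"event {i}") for i in range(500)]
cur = conn.cursor()

# Slow: one execute() per row pays the per-statement overhead every time.
# for ts, msg in batch:
#     cur.execute("INSERT INTO events (ts, msg) VALUES (?, ?)", (ts, msg))

# Fast: one prepared statement reused across the batch, then one commit.
cur.executemany("INSERT INTO events (ts, msg) VALUES (?, ?)", batch)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 500
```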
## Regression policy
`baseline.json` is committed and represents the dev-container baseline.
Compare a current run against it with:
```bash
uv run pytest tests/benchmarks/ -m benchmark --benchmark-only \
--benchmark-compare=tests/benchmarks/baseline.json \
--benchmark-compare-fail=mean:25%
```
A 25% mean regression fails the run. Tune the threshold to your CI's
noise profile: loopback networking on a shared runner is noisier than a
dev container on a quiet box, so start permissive and tighten as you
collect runs.
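
What `--benchmark-compare-fail=mean:25%` enforces, in essence: the run fails when a benchmark's current mean exceeds the baseline mean by more than 25%. A sketch of that check (my own illustration, not pytest-benchmark's code):

```python
def mean_regressed(baseline_s: float, current_s: float,
                   threshold: float = 0.25) -> bool:
    """True when the current mean is more than `threshold` above baseline."""
    return (current_s - baseline_s) / baseline_s > threshold

# Against the 177 µs `SELECT 1` baseline:
print(mean_regressed(177e-6, 210e-6))  # False (~18.6% slower, passes)
print(mean_regressed(177e-6, 230e-6))  # True  (~29.9% slower, fails)
```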
## Updating the baseline
When you intentionally change performance (landing an optimization, or
accepting a regression for correctness), refresh the baseline:
```bash
make bench-save # writes .results/0001_run.json
cp tests/benchmarks/.results/Linux-CPython-*/0001_run.json tests/benchmarks/baseline.json
git add tests/benchmarks/baseline.json
```
Document the change in CHANGELOG so reviewers know why the floor moved.
## Files
- `test_codec_perf.py` — codec dispatch (decode, encode_param, parse_tuple_payload)
- `test_select_perf.py` — SELECT round-trips, single + multi-row
- `test_insert_perf.py` — INSERT single + executemany throughput
- `test_pool_perf.py` — cold connect vs pool acquire/release
- `test_async_perf.py` — async-path latency + concurrent throughput
- `conftest.py` — long-lived `bench_conn` and 1k-row `bench_table` fixtures
- `baseline.json` — committed baseline for regression comparison
- `.results/` — gitignored; per-run output from `make bench-save`