Phase 22: User-facing documentation refresh (2026.05.04.7)

The docs/USAGE.md predated Phases 17-21, so anyone landing on PyPI was
missing scrollable cursors, locale/Unicode, the autocommit cliff
finding, and the type-mapping reference.

Added sections to docs/USAGE.md:
* Locale and Unicode - client_locale, Connection.encoding, CLIENT_LOCALE
  vs DB_LOCALE, when characters can't fit the codec
* Type mapping reference - full SQL <-> Python type table, NULL
  sentinels subsection, IntervalYM
* Performance tips - 53x autocommit-cliff fix, 100x executemany win,
  72x pool win, with the actual benchmark numbers from Phase 21.1
* Scrollable cursors - fetch_* API, in-memory vs server-side trade-off,
  edge cases (past-end semantics, negative indexing, rownumber)
* Timeouts and keepalive subsection - production starting points
* Environment dictionary subsection - env={} parameter
* Known limitations - explicit table of what doesn't work (named
  params, complex UDT bind, GSSAPI, XA) with workarounds; "things
  that might surprise you" notes

README.md - added Documentation section linking to docs/USAGE.md
and tests/benchmarks/README.md.

Doc corrections caught during review:
* cursor.rownumber is 0-indexed (impl has always been correct; only
  the original docstring wording was loose)
* fetch_* methods work on BOTH scrollable=True and default cursors;
  the in-memory path supports them too

USAGE.md grew from 345 lines to 633.
This commit is contained in:
Ryan Malloy 2026-05-04 17:33:37 -06:00
parent 495128c679
commit 0e0dfcba26
5 changed files with 321 additions and 4 deletions


@ -2,6 +2,29 @@
All notable changes to `informix-db`. Versioning is [CalVer](https://calver.org/) — `YYYY.MM.DD` for date-based releases, `YYYY.MM.DD.N` for same-day post-releases per PEP 440.
## 2026.05.04.7 — User-facing documentation refresh (Phase 22)
The `docs/USAGE.md` predated Phases 17-21, so anyone landing on PyPI was missing scrollable cursors, locale/Unicode, the autocommit cliff finding, and the type-mapping reference. This release closes that gap.
### Added (in `docs/USAGE.md`)
- **Locale and Unicode** — full section on `client_locale`, `Connection.encoding`, the CLIENT_LOCALE vs DB_LOCALE distinction, what happens when characters can't fit the codec, how to create a UTF-8 database. Bridges the gap between Phase 20's plumbing and a user's first multibyte INSERT.
- **Type mapping reference** — full SQL ↔ Python type table covering integer widths, DECIMAL, all string types, DATE/DATETIME/INTERVAL, BYTE/TEXT, BLOB/CLOB, ROW/COLLECTION, and `NULL`. Plus subsections on NULL sentinels and `IntervalYM`.
- **Performance tips** — three numbered patterns: wrap bulk INSERTs in a transaction (53× speedup), use `executemany` not a loop (≈100× speedup), use a connection pool (72× speedup over cold connect). Quotes the actual benchmark numbers from Phase 21.1.
- **Scrollable cursors** — `fetch_first` / `fetch_last` / `fetch_prior` / `fetch_absolute` / `fetch_relative` / `scroll()` API; in-memory vs `cursor(scrollable=True)` server-side trade-offs; edge cases (past-end semantics, negative indexing, `rownumber` indexing).
- **Timeouts and keepalive** subsection — `connect_timeout` / `read_timeout` / `keepalive` semantics with a "reasonable production starting point" recommendation.
- **Environment dictionary** subsection — the `env={}` parameter, with examples (OPT_GOAL, OPTOFC, IFX_AUTOFREE).
- **Known limitations** — explicit table of what doesn't work yet (named parameters, complex UDT bind, GSSAPI, XA, listener failover, etc.) with workarounds where they exist. Plus "things that work but might surprise you" (autocommit default, no-op commit on unlogged DB, SERIAL retrieval).
### Changed
- **`README.md`** — added a "Documentation" section linking to `docs/USAGE.md` and `tests/benchmarks/README.md`. Bumped phase count.
### Doc corrections caught during review
- `cursor.rownumber` is **0-indexed**, not 1-indexed (the implementation has been correct; only the original docstring wording was loose).
- `fetch_*` methods work on **both** scrollable=True and the default (in-memory) cursor — the original Phase 17 docs implied scrollable=True was required, but the in-memory path supports them too.
## 2026.05.04.6 — `executemany` perf finding: it was the autocommit cliff
Investigation of the Phase 21 finding that `executemany(N)` cost scaled linearly per-row (1.74 ms × N) regardless of batch size. **Root cause: every autocommit-True INSERT forces a server-side transaction-log flush.** Not a wire-protocol bug.


@ -164,9 +164,15 @@ docker compose -f tests/docker-compose.yml up -d
For the smart-LOB tests specifically, the dev container needs additional one-time setup (blobspace + sbspace + level-0 archive). See [`docs/DECISION_LOG.md`](docs/DECISION_LOG.md) §10 for the exact `onspaces` / `onmode` / `ontape` commands.
## Documentation
- [**`docs/USAGE.md`**](docs/USAGE.md) — practical recipes: connections, parameter binding, type mapping, transactions, performance tips, scrollable cursors, BLOBs, async, TLS, locale/Unicode, error handling, known limitations
- [`tests/benchmarks/README.md`](tests/benchmarks/README.md) — performance baselines, headline numbers, how to run regressions
- `CHANGELOG.md` — phase-by-phase release notes
## Project history & design rationale
This driver was built incrementally across 22+ phases, each with a focused scope and decision log. The reasoning trail lives in:
- [`docs/PROTOCOL_NOTES.md`](docs/PROTOCOL_NOTES.md) — byte-level SQLI wire-format reference
- [`docs/JDBC_NOTES.md`](docs/JDBC_NOTES.md) — index into the decompiled IBM JDBC driver, used as a clean-room reference


@ -26,6 +26,95 @@ conn = informix_db.connect(
`database` may be `None` to log in without selecting a database; the server still completes a successful login. Useful for cross-database queries that fully qualify table names.
### Timeouts and keepalive
| Parameter | Purpose | Default |
|---|---|---|
| `connect_timeout` | Time-bound for the TCP connect + login handshake. `None` uses the OS default (typically minutes). | `None` |
| `read_timeout` | Per-read timeout on subsequent socket reads. Fires `OperationalError` on a hung server. | `None` |
| `keepalive` | Set `SO_KEEPALIVE` on the socket. Useful for long-lived idle connections behind aggressive NAT/firewalls. | `False` |
A reasonable production starting point: `connect_timeout=10.0, read_timeout=30.0, keepalive=True`. The connect timeout protects startup; the read timeout protects against a frozen server; keepalive protects against silent idle disconnection.
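At the socket layer these three knobs map onto standard TCP machinery. A minimal stdlib sketch of what the settings amount to — an assumption about the general approach, not the driver's actual internals (the helper name is invented for illustration):

```python
import socket

def make_tuned_socket(connect_timeout=10.0, keepalive=True):
    # Rough stdlib analogue of connect_timeout / keepalive handling.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(connect_timeout)  # bounds connect() + login-phase reads
    if keepalive:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock

sock = make_tuned_socket()
# After the handshake, the deadline is relaxed to the per-read timeout:
sock.settimeout(30.0)  # read_timeout: each recv() may block this long
sock.close()
```

The point of the split is that a connect should fail fast (nothing is cached yet), while an established connection can tolerate longer pauses between reads.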
### Environment dictionary
The `env={}` parameter sets server-side session variables sent in the login PDU. Useful for things like `OPTOFC` (optimize-on-fetch-close), `IFX_AUTOFREE`, `OPT_GOAL`, or any other runtime knob the server reads from the session env block.
```python
informix_db.connect(
...,
env={
"OPT_GOAL": "-1", # optimize for first-row return
"OPTOFC": "1", # auto-free cursors at fetch-close
"IFX_AUTOFREE": "1",
},
)
```
`CLIENT_LOCALE` is set automatically from the `client_locale=` parameter — don't put it in `env=`.
## Locale and Unicode
The connection's `client_locale` controls how Python `str` values are encoded to bytes (and back) for CHAR / VARCHAR / NCHAR / NVARCHAR / LVARCHAR / CLOB columns. The default `"en_US.8859-1"` is safe for ASCII + Western European text. **For multibyte text (CJK, Cyrillic, Arabic, emoji), set `client_locale="en_US.utf8"` AND make sure the database's `DB_LOCALE` is also UTF-8.**
```python
conn = informix_db.connect(..., client_locale="en_US.utf8")
print(conn.encoding) # "utf-8"
cur = conn.cursor()
cur.execute("INSERT INTO docs (body) VALUES (?)", ("你好世界",))
```
The `Connection.encoding` property reports the resolved Python codec name. Common mappings:
| Locale | Python codec |
|---|---|
| `en_US.8859-1` (default) | `iso-8859-1` |
| `en_US.utf8` / `en_US.UTF-8` | `utf-8` |
| `en_US.8859-15` | `iso-8859-15` |
| Anything without a codeset suffix, or unknown | falls back to `iso-8859-1` |
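The resolution rule in the table reduces to a small pure function. A hypothetical sketch (the driver's real helper is internal and not named in these docs):

```python
# Map the locale's codeset suffix to a Python codec; anything unrecognized
# (or a locale with no suffix at all) falls back to iso-8859-1.
_CODESET_TO_CODEC = {
    "8859-1": "iso-8859-1",
    "8859-15": "iso-8859-15",
    "utf8": "utf-8",
    "utf-8": "utf-8",
}

def resolve_codec(client_locale: str) -> str:
    # "en_US.utf8" -> codeset "utf8"; "en_US" -> "" -> fallback
    _, _, codeset = client_locale.partition(".")
    return _CODESET_TO_CODEC.get(codeset.lower(), "iso-8859-1")

print(resolve_codec("en_US.utf8"))    # utf-8
print(resolve_codec("en_US.UTF-8"))   # utf-8
print(resolve_codec("en_US"))         # iso-8859-1 (no codeset suffix)
```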
### CLIENT_LOCALE vs DB_LOCALE
* **`CLIENT_LOCALE`** is what *your code* uses to encode / decode string parameters and column values. Set per-connection.
* **`DB_LOCALE`** is what *the database* uses to store string columns. Set at `CREATE DATABASE` time, immutable afterwards.
If they match, no transcoding happens. If they differ, the server transcodes between them at the storage boundary — and any character in your data that doesn't exist in `DB_LOCALE`'s codeset is either replaced with `?` (lossy) or rejected with sqlcode `-1820` (depends on the server version). The IBM Developer Edition Docker image's default `testdb` is created with `DB_LOCALE=en_US.8859-1`; storing `"你好"` there will fail server-side regardless of `CLIENT_LOCALE`.
To create a UTF-8 database for full multibyte support:
```bash
# Inside the container, before CREATE DATABASE:
export DB_LOCALE=en_US.utf8
export CLIENT_LOCALE=en_US.utf8
```
```sql
CREATE DATABASE my_utf8db WITH LOG IN rootdbs;
```
### When characters can't fit the codec
Passing a `str` containing characters that can't be encoded under `client_locale` raises `informix_db.DataError` cleanly — the connection survives:
```python
conn = informix_db.connect(..., client_locale="en_US.8859-1")
cur = conn.cursor()
try:
cur.execute("INSERT INTO t VALUES (?)", ("你好",))
except informix_db.DataError as e:
print(e)
# cannot encode parameter under client_locale codec 'iso-8859-1':
# ordinal not in range(256) at position 0-2.
# Connect with a wider locale (e.g., 'en_US.utf8') if your data
# contains characters outside this codec.
# Connection is still good
cur.execute("SELECT 1 FROM systables WHERE tabid = 1")
```
Protocol-level strings (cursor names, function signatures, error "near tokens", SQL keywords) are always ASCII and stay `iso-8859-1` regardless of `client_locale`.
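The `DataError` above wraps an ordinary codec failure — you can reproduce the underlying condition with nothing but the stdlib:

```python
# The condition the driver's DataError reports: the parameter simply
# cannot be represented under the connection's codec.
try:
    "你好".encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print(exc.reason)          # ordinal not in range(256)
    print(exc.start, exc.end)  # 0 2 — matches the "position" in the message

# A wider codec handles the same text fine:
print("你好".encode("utf-8"))  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
```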
## Cursor lifecycle
```python
@ -71,7 +160,66 @@ cur.execute(
)
```
Supported parameter types: `int`, `float`, `str`, `bool`, `None`, `datetime.date`, `datetime.datetime`, `datetime.timedelta`, `decimal.Decimal`, `informix_db.IntervalYM`, `bytes` (BYTE/TEXT params).
## Type mapping reference
What you put in vs. what comes out:
| SQL type | Param accepts | Result returns |
|---|---|---|
| `SMALLINT` (16-bit) | `int` (range -32,767..32,767) | `int` |
| `INT` / `INTEGER` (32-bit) | `int` (range -2³¹+1..2³¹-1) | `int` |
| `BIGINT` (64-bit) | `int` | `int` |
| `SERIAL` / `BIGSERIAL` | `int` (omit for auto-assign) | `int` |
| `SMALLFLOAT` / `REAL` | `float` | `float` |
| `FLOAT` / `DOUBLE PRECISION` | `float` | `float` |
| `DECIMAL(p,s)` / `NUMERIC` | `decimal.Decimal` | `decimal.Decimal` |
| `MONEY(p,s)` | `decimal.Decimal` | `decimal.Decimal` |
| `CHAR(N)` | `str` (right-trimmed of trailing spaces) | `str` |
| `VARCHAR(N)` / `NVARCHAR(N)` | `str` | `str` |
| `NCHAR(N)` | `str` | `str` |
| `LVARCHAR(N)` | `str` | `str` |
| `BOOLEAN` | `bool` | `bool` |
| `DATE` | `datetime.date` | `datetime.date` |
| `DATETIME YEAR TO DAY` | `datetime.datetime` | `datetime.date` |
| `DATETIME YEAR TO SECOND` (etc.) | `datetime.datetime` | `datetime.datetime` |
| `DATETIME HOUR TO SECOND` | not yet | `datetime.time` |
| `INTERVAL DAY TO FRACTION(5)` | `datetime.timedelta` | `datetime.timedelta` |
| `INTERVAL YEAR TO MONTH` | `informix_db.IntervalYM` | `informix_db.IntervalYM` |
| `BYTE` (legacy in-row blob) | `bytes` | `bytes` |
| `TEXT` (legacy in-row clob) | `bytes` (or `str`, encoded under `conn.encoding`) | `str` |
| `BLOB` (smart-LOB) | use `cursor.write_blob_column` with `BLOB_PLACEHOLDER` | `informix_db.BlobLocator` (use `cursor.read_blob_column` for bytes) |
| `CLOB` (smart-LOB) | use `cursor.write_blob_column(..., clob=True)` | `informix_db.ClobLocator` |
| `ROW(...)` | not yet | `informix_db.RowValue` (raw payload + schema) |
| `SET(...)` / `MULTISET(...)` / `LIST(...)` | not yet | `informix_db.CollectionValue` |
| `NULL` (any type) | `None` | `None` |
### NULL sentinels
Informix encodes NULL inline rather than as a separate flag for fixed-width types:
* `INT`: `0x80000000` (`INT_MIN`)
* `SMALLINT`: `0x8000` (`SHORT_MIN`)
* `BIGINT`: `0x8000000000000000` (`LONG_MIN`)
* `REAL` / `FLOAT`: all-`0xff` bytes
* `DATE`: `0x80000000` (Day_MIN)
If your data legitimately contains these values, you'll see them surface as `None` on the Python side. (Real-world usage rarely hits this — `INT_MIN` as a valid value is uncommon — but it's documented behavior, not a bug.)
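The sentinel check for a 32-bit INT column can be sketched in a few lines — assumptions: big-endian wire order (as is usual for SQLI), and the helper name is invented for illustration:

```python
import struct

INT_NULL = -0x80000000  # 0x80000000 read as a signed 32-bit value

def decode_int_column(raw: bytes):
    # Unpack big-endian signed 32-bit; the reserved sentinel decodes to None.
    (value,) = struct.unpack(">i", raw)
    return None if value == INT_NULL else value

print(decode_int_column(b"\x80\x00\x00\x00"))  # None  (the NULL sentinel)
print(decode_int_column(b"\x00\x00\x00\x2a"))  # 42
```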
### `IntervalYM`
Year-month intervals can't collapse into `datetime.timedelta` because months have variable length. Provided as a small dataclass:
```python
from informix_db import IntervalYM
iv = IntervalYM(months=18)
print(iv.years, iv.remainder_months) # 1 6
print(str(iv)) # "1-06"
cur.execute("INSERT INTO leases (term) VALUES (?)", (iv,))
```
## Transactions
@ -115,6 +263,123 @@ cur.executemany(
conn.commit()
```
## Performance tips
Three patterns dominate real-world performance. They're all about **batching the right thing**:
### 1. Wrap bulk INSERTs in a transaction (53× speedup)
Under `autocommit=True`, **every INSERT forces a server-side transaction-log flush**. Under `autocommit=False`, the flush happens once at COMMIT.
| Pattern | 1000 rows | Per row | Throughput |
|---|---|---|---|
| `executemany` autocommit=True | 1.72 s | 1.72 ms | ~580 rows/sec |
| `executemany` in single txn | 32 ms | **32 µs** | **~31,000 rows/sec** |
```python
# Slow — every row commits independently
conn = informix_db.connect(..., autocommit=True)
conn.cursor().executemany("INSERT ...", rows)
# Fast — one log flush at the end
conn = informix_db.connect(..., autocommit=False) # default
cur = conn.cursor()
cur.executemany("INSERT ...", rows)
conn.commit()
```
This is the single biggest win for any bulk-load workload.
### 2. Use `executemany`, not a loop of `execute` (≈100× speedup)
`executemany` PREPAREs once and BIND+EXECUTEs per row. A naive loop PREPAREs and RELEASEs per row — paying the server-side parse cost N times.
```python
# Slow: 1.88 ms per row, dominated by PREPARE/RELEASE overhead
for row in rows:
cur.execute("INSERT INTO t VALUES (?, ?, ?)", row)
# Fast: shares the prepared statement across all rows
cur.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
```
### 3. Use a connection pool (72× speedup over cold connect)
Cold connect takes ~11 ms (TCP + login handshake). Pool acquire takes ~150 µs. If your application opens a fresh connection per request, fix that first.
```python
# In a long-lived process (FastAPI, Django, worker), open the pool once
pool = informix_db.create_pool(host="...", min_size=2, max_size=10)
# Per request:
with pool.connection() as conn:
cur = conn.cursor()
cur.execute(...)
```
### Other tips
* **Cursor reuse is fine across queries** — but each `execute()` resets `description`, `rowcount`, and the materialized result set. If you need the prior query's data, capture it before re-executing.
* **`fetchall()` materializes the whole result set in memory.** For large queries, iterate (`for row in cur:`) or use `fetchmany(N)`. Internally the cursor still buffers a server-fetch worth of rows at a time.
* **The `fast_path_call` API is dramatically cheaper than equivalent SQL** for repeated UDF invocations — routine handles are cached per-connection, so the second call onwards skips the `SQ_GETROUTINE` round-trip.
For raw numbers (codec speed, round-trip latencies, full bench results), see `tests/benchmarks/README.md`.
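The iterate-instead-of-`fetchall` advice is plain PEP 249 and doesn't depend on Informix at all, so it can be sketched against the stdlib `sqlite3` module as a stand-in driver:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (n INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)])

cur.execute("SELECT n FROM t")
total = 0
while True:
    batch = cur.fetchmany(1_000)  # bounded memory: one batch resident at a time
    if not batch:
        break
    total += sum(n for (n,) in batch)

print(total)  # 49995000 == sum(range(10_000))
conn.close()
```

The same loop (or simply `for row in cur:`) keeps client memory flat no matter how many rows the query returns.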
## Scrollable cursors
A regular cursor walks rows forward only via `fetchone` / `fetchmany` / iteration. The **`fetch_*` family** lets you move backwards, jump to absolute positions, fetch the last row directly, and revisit rows already seen.
```python
cur = conn.cursor()
cur.execute("SELECT id, name FROM users ORDER BY id")
# Standard methods still work
first = cur.fetchone() # row 0
second = cur.fetchone() # row 1
# Plus the scroll surface
last = cur.fetch_last() # last row
prev = cur.fetch_prior() # one back from current
specific = cur.fetch_absolute(50) # row 50 (0-indexed)
relative = cur.fetch_relative(-3) # 3 rows back from current
back_to_start = cur.fetch_first() # row 0
# PEP 249 scroll()
cur.scroll(5, mode="relative") # forward 5 from current
cur.scroll(0, mode="absolute") # to row 0
# Where am I?
print(cur.rownumber) # 0-indexed; None at before-first / after-last
```
### Two modes: in-memory vs server-side
The default cursor materializes the full result set into Python memory on `execute`, then `fetch_*` methods operate on the buffer. Random access is essentially free, but memory grows with row count.
Pass `scrollable=True` to `cursor()` to get a **server-side** scroll cursor:
```python
cur = conn.cursor(scrollable=True)
cur.execute("SELECT id, name FROM big_table")
last_row = cur.fetch_last() # one round-trip, no buffer
row_500 = cur.fetch_absolute(500) # one round-trip
```
Server-side mode keeps the cursor open on the server and issues a `SQ_SFETCH` round-trip per scroll operation. Constant client memory, network round-trip per move. Use it when your result set is large enough that materializing it would be wasteful.
| Mode | When to use |
|---|---|
| `cursor()` (default) | Result fits comfortably in memory (~thousands of rows). All `fetch_*` methods are local; fastest random access. |
| `cursor(scrollable=True)` | Large result sets where memory matters. Each scroll operation is a round-trip; cursor stays open server-side until `close()`. |
Server-side scroll cursors require non-autocommit mode (the server needs an open transaction to keep the cursor alive across fetches).
### Edge cases
* `fetch_prior()` from past-end returns the **last** row (SQL standard semantics — the first prior from "after-last" is the last actual row, not the second-to-last).
* `fetch_absolute(0)` is the first row; `fetch_absolute(-1)` is the last row (Python-style negative indexing).
* `cursor.rownumber` is 0-indexed; returns `None` when positioned before-first or after-last, or when no result set exists.
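The `fetch_absolute` positioning rules above can be modeled on a plain Python list — a toy model of the documented semantics (assuming past-end returns `None`, as `fetchone` does at end-of-set), not driver code:

```python
rows = ["r0", "r1", "r2", "r3"]

def fetch_absolute(i):
    # 0-indexed; negative counts from the end, Python-style; out of range -> None
    if -len(rows) <= i < len(rows):
        return rows[i]
    return None

print(fetch_absolute(0))   # r0 — first row
print(fetch_absolute(-1))  # r3 — last row, no length query needed
print(fetch_absolute(99))  # None — past-end
```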
## Smart-LOBs (BLOB / CLOB)
### Read
@ -319,6 +584,29 @@ CREATE DATABASE mydb WITH LOG;
These steps are detailed in the [DECISION_LOG](DECISION_LOG.md) §6.f and §10.
## Known limitations
Things that don't work yet (and the workaround when one exists):
| Limitation | Workaround |
|---|---|
| **Named parameters** (`paramstyle="named"` or `dict` parameters) | Use positional `?` / `:1` / `:2`. PEP 249 declares one paramstyle per module. |
| **Binding `ROW(...)` / `SET / MULTISET / LIST`** as a parameter | Decode side surfaces these as `RowValue` / `CollectionValue`. For *writes*, use SQL projections to build them server-side. |
| **GSSAPI / Kerberos / LDAP auth** | Username/password (plain or password obfuscation) only. |
| **Distributed transactions (XA)** | Out of scope for the current driver. |
| **Bulk-load via COPY** | Use `executemany` inside a transaction (≈31K rows/sec — see Performance tips). |
| **`executemany` on SELECT** | Loop `execute(select_sql, params)`; `executemany` is DML-only by design. |
| **Listener failover / sqlhosts groups** | Connect to a specific host:port. Implement failover at the application layer or behind a load balancer. |
| **DATETIME `HOUR TO FRACTION` as a parameter** | Use `DATETIME YEAR TO SECOND` (full datetime). Read side handles all qualifier ranges. |
| **`BlobLocator` / `ClobLocator` as a parameter** | The `read_blob_column` / `write_blob_column` cursor methods cover the BLOB / CLOB I/O cases. Direct locator-as-param will follow when there's a real use case. |
| **UDT-typed parameters / returns in `fast_path_call`** | Scalar params and returns only (INT / SMALLINT / BIGINT / FLOAT / REAL / CHAR / VARCHAR). Complex UDT bind needs the IfxComplexInput protocol layer (~700 lines, deferred). |
Things that work but might surprise you:
* **`autocommit=True` is opt-in.** PEP 249's default is `autocommit=False`, and that's our default too. Many users coming from `IfxPy` (which defaults to autocommit-on) will find this different — and dramatically faster for bulk loads (see Performance tips).
* **`commit()` / `rollback()` on an unlogged DB are silent no-ops.** The server returns sqlcode `-201` to `SQ_BEGIN`; the connection caches that and skips the round-trip on subsequent calls. Same client code works against logged and unlogged databases.
* **`SERIAL` / `BIGSERIAL` columns omitted from INSERT** auto-assign on the server. The auto-assigned value isn't currently exposed via `cursor.lastrowid` (PEP 249 optional surface) — round-trip via `SELECT DBINFO('sqlca.sqlerrd1') FROM systables WHERE tabid=1` if you need it.
## Migration from `IfxPy` / legacy `informixdb`
The PEP 249 surface is identical — most code Just Works after switching the import:


@ -1,6 +1,6 @@
[project]
name = "informix-db"
version = "2026.05.04.7"
description = "Pure-Python driver for IBM Informix IDS — speaks the SQLI wire protocol over raw sockets. No CSDK, no JVM, no native libraries."
readme = "README.md"
license = { text = "MIT" }

uv.lock generated

@ -34,7 +34,7 @@ wheels = [
[[package]]
name = "informix-db"
version = "2026.5.4.7"
source = { editable = "." }
[package.optional-dependencies]