11 Commits

Author SHA1 Message Date
f3e589c5bf Phase 23: Hot-path optimization for parse_tuple_payload (2026.05.04.8)
Per-row decode is hit on every row of every SELECT. The original code
had three forms of waste in the inner loop:

1. Redundant base_type() call. ColumnInfo.type_code is already
   base-typed by parse_describe at construction; calling base_type()
   again per column per row was pure waste. Single largest savings.
2. IntFlag->int conversions inline (~10x per iteration). Lifted to
   module-level _TC_X constants.
3. Lazy imports inside the loop body (_decode_datetime, _decode_interval,
   BlobLocator, ClobLocator, RowValue, CollectionValue). Moved to top.

Plus three precomputed frozensets (_LENGTH_PREFIXED_SHORT_TYPES,
_COMPOSITE_UDT_TYPES, _NUMERIC_TYPES) replace inline tuple-membership
checks. _COLLECTION_KIND_MAP is now MappingProxyType (actually frozen).

Performance:
* parse_tuple_5cols: 2796 ns -> 2030 ns (-27%)
* select_bench_table_all (1k rows): 1477 us -> 1198 us (-19%)
* Codec micro-bench, cold connect, executemany: unchanged

Real-world fetch ceiling on a single connection: 350K rows/sec ->
490K rows/sec.

Margaret Hamilton review surfaced four cleanup items, all addressed
before tagging:
* H1: cursor._dereference_blob_columns had the same redundant
  base_type() call - stripped for consistency.
* M1: documented the load-bearing invariant at parse_describe (the
  single producer site) so future contributors have a grep target.
* M2: _COLLECTION_KIND_MAP wrapped in MappingProxyType.
* L1: stale line-number comment fixed to point at the INVARIANT
  comment instead.

baseline.json refreshed; all 224 integration tests pass; ruff clean.
2026-05-04 17:52:20 -06:00
bea1a1cd0c Phase 20: UTF-8/multibyte locale support (2026.05.04.4)
Thread CLIENT_LOCALE through to user-data string codecs. Driver previously
hardcoded iso-8859-1 for all string conversions, which broke any locale
outside Western European code points.

* Connection.encoding property derived from client_locale via
  _python_encoding_from_locale (en_US.utf8 -> utf-8, en_US.8859-1 ->
  iso-8859-1, etc.)
* encode_param / decode / parse_tuple_payload accept an encoding
  parameter; cursor and fast-path call sites forward conn.encoding
* Smart-LOB CLOB encode/decode and TEXT decode honor connection encoding
* DataError raised for non-representable chars; cursor releases the
  prepared statement before propagating so connection state stays clean

Boundary discipline: protocol-level strings (cursor names, function
signatures, SQ_FILE fnames, error near-tokens, SQL text) stay
iso-8859-1 (always ASCII, never user-controlled).

9 new integration tests in tests/test_unicode.py covering ASCII
round-trip, Latin-1 high-bit, full byte range, locale-mapping,
encoding property, UTF-8 negotiation, multibyte (skipped without
IFX_UTF8_DATABASE), DataError on non-representable, CLOB round-trip.

Total: 69 unit + 212 integration = 281 tests.
2026-05-04 17:13:19 -06:00
9048335462 Phase 12: ROW / COLLECTION type recognition
Composite UDTs (ROW=22, COLLECTION=23, SET=19, MULTISET=20, LIST=21)
now decode into typed wrapper objects (informix_db.RowValue,
informix_db.CollectionValue) that expose schema + raw payload bytes.

The wire format is the now-familiar [byte ind][int length][bytes]
pattern (same as UDTVAR(lvarchar) from Phase 10). The bytes are a
TEXTUAL representation of the value when selected without the
extended-binary opt-in JDBC uses:
  ROW value:  b"ROW('Alice',30         )"
  SET value:  b"SET{'red','green','blue'}"
  LIST value: b"LIST{10         ,20         ,30         }"

JDBC's binary-with-schema format runs ~30x larger (1420 bytes for a
2-field ROW vs. our 24). We don't request it — the textual form is
what the server returns by default and is sufficient for type
recognition.

Phase 12 ships type recognition only. Full recursive parsing into
Python tuples/lists/sets is deferred to Phase 13 (would require a
SQL-literal lexer + recursive type-driven decoding). Production
workloads that need typed field access today can project via SQL:

  cur.execute("SELECT id, r.name, r.age FROM tbl")

Tests: 8 integration tests in test_composite_types.py covering ROW
recognition, NULL, sub-field projection workaround, long values
(>255 bytes — verifies 4-byte length prefix), SET/MULTISET/LIST
recognition, and null collections.

Total: 64 unit + 134 integration = 198 tests.

Lesson reinforced: once one UDT-shaped type is implemented (UDTVAR
in Phase 10, smart-LOB in Phase 9), every subsequent UDT-shaped type
is mostly a copy of the existing decoder branch. The hard part is
payload semantics, not framing.
2026-05-04 14:30:44 -06:00
a9a3cfc38e Phase 10: smart-LOB BLOB read via SQ_FILE / lotofile
Implements end-to-end BLOB reading by leveraging the server's
lotofile() function and intercepting the SQ_FILE protocol with
in-memory file emulation. Avoids implementing the heavier
SQ_FPROUTINE + SQ_LODATA stack initially planned for Phase 10.

Strategy: SELECT lotofile(blob_col, '/path', 'client') causes the
server to orchestrate a SQ_FILE (98) protocol round-trip — it tells
the client to "open file X, write these bytes, close". Our handler
buffers the writes in memory keyed by filename instead of touching
disk. The bytes appear in cursor.blob_files dict.

Wire protocol (per IfxSqli.receiveSQFILE line 4980):
* SQ_FILE optype 0 (open): server sends filename + mode/flags/offset
* SQ_FILE optype 3 (write): chunked SQ_FILE_WRITE (107) blocks of
  data, terminated by SQ_EOT. Client responds with total size.
* SQ_FILE optype 1 (close): bare SQ_EOT both ways.

API:
* Low-level: cur.execute("SELECT lotofile(col, '/tmp/X', 'client') ...")
  followed by cur.blob_files[returned_filename] for the bytes.
* High-level: cur.read_blob_column("SELECT col FROM ... WHERE ...", params)
  returns bytes directly, wrapping the user's SQL with lotofile.

Bonus: row decoder now handles UDTVAR (type 40) with extended_name=
"lvarchar" — the wire format that lotofile() returns its result as.
Format: [byte indicator][int length][bytes].

Tests: 6 integration tests in test_smart_lob_read.py covering
low-level + high-level paths, NULL/no-match, multi-chunk (30KB),
and validation. Test data seeded via JDBC reference client since
smart-LOB writes still need Phase 11.

Total: 64 unit + 117 integration = 181 tests.

Strategic insight from this phase: don't estimate protocol-
implementation cost from JDBC's class hierarchy alone. JDBC's
IfxSmBlob is 600+ lines but the wire-level READ path reduces to one
SQL function call + one new tag handler. The wire is often simpler
than the SDK suggests.

Deferred to Phase 11+:
* Smart-LOB write (still needs SQ_FPROUTINE + SQ_LODATA)
* BlobLocator.read() OO API (requires locator-to-source mapping)
* SQ_FILE optype 2 (filetoblob client→server path)
2026-05-04 13:46:18 -06:00
389c32434c Phase 9: smart-LOB BLOB/CLOB locator decoding (Phase 10 deferred for fetch)
SELECT on BLOB or CLOB columns no longer requires raw byte interpretation.
The 72-byte server-side locator is wrapped in a typed BlobLocator or
ClobLocator (frozen dataclass) so the column is recognizable as
"server-side reference, not actual bytes".

Wire-protocol findings:
* Smart-LOB columns DON'T appear with their nominal type codes (102/101)
  in SQ_DESCRIBE. They surface as UDTFIXED (41) with extended_id 10
  (BLOB) or 11 (CLOB) and encoded_length=72 (locator size).
* Retrieving the actual bytes requires SQ_FPROUTINE (103) RPC to
  invoke ifx_lo_open, plus SQ_LODATA (97) for chunked transfer, plus
  another SQ_FPROUTINE for ifx_lo_close. That's a Phase 10 lift —
  roughly 2x the protocol surface of Phase 8.

Server config needed (added to Phase 7 setup):
* sbspace: onspaces -c -S sbspace1 ...
* default sbspace: onmode -wm SBSPACENAME=sbspace1

What ships in Phase 9:
* informix_db.BlobLocator(raw: bytes) — 72-byte frozen wrapper
* informix_db.ClobLocator(raw: bytes) — distinct type, same shape
* Row decoder branch in _resultset.parse_tuple_payload
* Wire constants SQ_LODATA=97, SQ_FPROUTINE=103, SQ_FPARAM=104

Tests:
* 11 unit tests in test_blob_locator_unit.py (no Informix needed) —
  construction, immutability, equality, hash, repr safety, size
  validation.
* 4 integration tests in test_smart_lob.py — fixture seeds via JDBC
  reference client (smart-LOB writes also need deferred protocols).
* RefBlob.java helper in tests/reference/ for seeding via JDBC.

Total: 64 unit + 111 integration = 175 tests.

Locator design note: __repr__ omits the raw bytes (they're opaque to
the client). Same-bytes locators of different families compare
unequal — BlobLocator(x) != ClobLocator(x) — to avoid silent type
confusion.
2026-05-04 13:26:15 -06:00
4dafbf8ce9 Phase 6.d: INTERVAL decoding (both qualifier families)
Implements row-decoding for IDS INTERVAL, the last common temporal type.
The qualifier short bisects the type at the wire level: start_TU >= DAY
maps to datetime.timedelta (day-fraction), start_TU <= MONTH maps to a
new informix_db.IntervalYM (year-month).

Wire format mirrors DATETIME exactly — `[head byte][digit pairs in
base-100]`, with the qualifier dictating field interpretation. The
fraction-to-nanoseconds scaling (`scale_exp = 18 - end_TU`, forced odd)
is the JDBC pattern from `Decimal.fromIfxToArray`.

IntervalYM is a frozen dataclass holding signed total months, with
`years` and `remainder_months` as derived properties. Matches JDBC's
`IntervalYM.months` shape rather than a (years, months) tuple — avoids
ambiguity around what "negative" means for a multi-field tuple.

Tests: 13 unit (synthetic byte streams covering all decoder branches)
+ 9 integration (real Informix queries spanning DAY TO SECOND, HOUR TO
SECOND, YEAR TO MONTH, negatives, NULL, and mixed-family rows).

Total test count: 53 unit + 82 integration = 135.

Encoder for INTERVAL parameter binding is deferred to a later phase
(same arc as DECIMAL/DATETIME — decode lands first).
2026-05-04 12:22:07 -06:00
6819dd4cb0 Phase 6.b: DATETIME decoding for all qualifier ranges
Before:
  cur.execute("SELECT CURRENT YEAR TO SECOND ...")
  cur.fetchone()  # → (b'\xc7\x14\x1a\x05\x04...',) raw BCD bytes

After:
  cur.execute("SELECT CURRENT YEAR TO SECOND ...")
  cur.fetchone()  # → (datetime.datetime(2026, 5, 4, 12, 34, 56),)

Decoder picks the right Python type by qualifier:
  YEAR/MONTH/DAY-only → datetime.date
  HOUR/MIN/SEC-only   → datetime.time
  spans across both   → datetime.datetime

Wire format (per IfxToJavaDateTime + Decimal.init treating as packed BCD):
  byte[0] = sign + biased exponent (in base-100 digit pairs)
  byte[1..] = BCD digit pairs: YYYY (2 bytes) + MM + DD + HH + MI + SS + FFFFF

Qualifier extraction from column descriptor:
  encoded_length = (digit_count << 8) | (start_TU << 4) | end_TU
  TU codes: YEAR=0, MONTH=2, DAY=4, HOUR=6, MIN=8, SEC=10,
            FRAC1=11..FRAC5=15

Verified against four DATETIME columns of different qualifiers in
one tuple — see test_datetime_multiple_columns_in_one_row:
  YEAR TO SECOND       → datetime.datetime(2026, 5, 4, 12, 34, 56)
  YEAR TO DAY          → datetime.date(2026, 5, 4)
  HOUR TO SECOND       → datetime.time(12, 34, 56)
  YEAR TO FRACTION(3)  → datetime.datetime(...)

Module changes:
  src/informix_db/converters.py:
    + _decode_datetime(raw, encoded_length) — qualifier-driven BCD walk
    + TU constants (_TU_YEAR, _TU_MONTH, ..., _TU_SECOND)
  src/informix_db/_resultset.py:
    + DATETIME row-decoder branch — computes width from digit_count
      in encoded_length high byte, calls _decode_datetime with the
      packed qualifier so it can pick the right Python type

Tests: 40 unit + 70 integration (7 new DATETIME tests) = 110 total,
all green, ruff clean. Tests cover:
  - YEAR TO SECOND → datetime.datetime
  - YEAR TO DAY → datetime.date
  - HOUR TO SECOND → datetime.time
  - CURRENT YEAR TO FRACTION(3) → datetime.datetime
  - Mixed qualifiers in one row
  - DATETIME stored in a real table column (round-trip via SELECT)
  - NULL DATETIME → Python None

DATETIME parameter binding (encoder) is Phase 6.x — same status as
DECIMAL encoder.
2026-05-04 12:02:40 -06:00
2bacbc4e53 Phase 6.a: DECIMAL/MONEY row decoding works (COUNT/SUM/AVG return Decimal)
Before:
  cur.execute('SELECT COUNT(*) FROM systables')
  cur.fetchone()  # → (b'\xc2\x02\x00\x00\x00\x00\x00\x00\x00',) raw bytes

After:
  cur.execute('SELECT COUNT(*) FROM systables')
  cur.fetchone()  # → (Decimal('276'),)

The trickiest decode of the project so far. IDS DECIMAL/MONEY wire format:

  byte[0] = (sign << 7) | biased_exponent_base100
    bit 7 = sign (1=positive, 0=negative)
    bits 0-6 = (exponent + 64), XOR'd with 0x7F if negative
  byte[1..] = digit-pair bytes (each 0..99 = two BCD digits)
    if negative: asymmetric base-100 complement applied:
      walk digits right→left, trailing zeros stay zero,
      first non-zero subtracts from 100, rest from 99

Initial naive "99 - d for all digits" decoder gave artifacts like
-1234.559999 instead of -1234.56. The asymmetric complement rule
(from Decimal.decComplement line 447) is what makes negatives
round-trip exactly.

Width on the wire: per-column encoded_length packed as
(precision << 8) | scale; byte width = ceil(precision/2) + 1.
parse_tuple_payload uses this to slice DECIMAL columns correctly.

Tested cases all decode correctly:
  COUNT(*)             → Decimal('276')
  SUM(tabid)           → Decimal('55')
  AVG(tabid)           → Decimal('5.5')
  1234.56::DECIMAL     → Decimal('1234.56')
  -1234.56::DECIMAL    → Decimal('-1234.56')
  -0.5::DECIMAL        → Decimal('-0.5')
  -99.99::DECIMAL      → Decimal('-99.99')
  -12345678.9::DECIMAL → Decimal('-12345678.9')
  NULL                 → None

Encoder (_encode_decimal) is implemented but disabled — server rejects
the produced bytes (precision packing not quite right). Phase 6.x will
fix. Workaround: cast Decimal to float, or pass via SQL literal.

Module changes:
  src/informix_db/converters.py:
    + decimal module import
    + _decode_decimal — full BCD decoder with asymmetric complement
    + _encode_decimal (Phase 6.x stub — present but unreached)
    + DECIMAL/MONEY added to DECODERS dispatch
  src/informix_db/_resultset.py:
    + DECIMAL/MONEY width computation from encoded_length

Tests: 40 unit + 55 integration (8 new DECIMAL) = 95 total, all
green, ruff clean.
2026-05-04 11:17:59 -06:00
34ad04a872 Phase 2.x: VARCHAR row decoding works — three byte-level fixes
Three findings, each caught by a different debugging technique,
documented in DECISION_LOG.md:

1. CURNAME+NFETCH PDU: trailing reserved field is SHORT not INT.
   Caught by byte-diffing our 44-byte PDU against JDBC's 42-byte
   reference under socat. The server tolerated the longer version
   for INT-only SELECTs (silently consuming extra zeros) but
   rejected it for VARCHAR queries. Lesson: server tolerance varies
   by query type — always match JDBC byte-for-byte.

2. SQ_TUPLE payload pads to even byte alignment. An 11-byte
   "syscolumns" VARCHAR payload had a trailing 0x00 between it and
   the next SQ_TUPLE tag. JDBC's IfxRowColumn.readTuple consumes
   this pad silently; we weren't, so any odd-length variable-width
   row desynced the parser.

3. VARCHAR/NCHAR/NVCHAR in tuple data use a SINGLE-byte length
   prefix (max 255 chars — IDS VARCHAR's hard limit). NOT a 2-byte
   short as I'd initially assumed. CHAR is fixed-width per
   encoded_length. LVARCHAR uses a 4-byte int prefix for >255 byte
   values.

Module changes:
  src/informix_db/_resultset.py — _LENGTH_PREFIXED_SHORT_TYPES set,
    branched VARCHAR/NCHAR/NVCHAR (1-byte prefix) vs CHAR (fixed)
    vs LVARCHAR (4-byte prefix); even-byte alignment pad consumed
    after each SQ_TUPLE payload.
  src/informix_db/cursors.py — CURNAME+NFETCH and standalone NFETCH
    PDUs now write_short(0) for the reserved trailing field.

Tests: 40 unit + 18 integration (3 new VARCHAR tests) = 58 total,
all green, ruff clean. New tests cover:
  - VARCHAR single-column SELECT
  - Odd-length VARCHAR row (regression for the pad-byte bug)
  - Mixed INT + VARCHAR + FLOAT three-column SELECT

Sample output:
  SELECT FIRST 5 tabname FROM systables → ('systables',),
    ('syscolumns',), ('sysindices',), ('systabauth',), ('syscolauth',)
  SELECT FIRST 3 tabname, tabid, nrows → ('systables', 1, 276.0), ...

VARCHAR was the last known gap from the Phase 2 commit. Phase 2
now reads INT, BIGINT, REAL, FLOAT, CHAR, VARCHAR end-to-end. Phase
6+ types (DATETIME, INTERVAL, DECIMAL, BLOBs) remain.
2026-05-04 07:55:13 -06:00
a1bd52788d Phase 2: SELECT works end-to-end — pure-Python Informix fully reads data
cursor.execute("SELECT 1 FROM systables WHERE tabid = 1")
  cursor.fetchone() == (1,)

To my knowledge, this is the first time a pure-Python implementation
has read data from Informix without wrapping IBM's CSDK or JDBC.

Three breakthroughs in this commit:

1. Login PDU's database field is BROKEN. Passing a database name there
   makes the server reject subsequent SQ_DBOPEN with sqlcode -759
   ("database not available"). JDBC always sends NULL in the login
   PDU's database slot — we now do the same. The user-supplied database
   opens via SQ_DBOPEN in _init_session.

2. Post-login session init dance: SQ_PROTOCOLS (8-byte feature mask
   replayed verbatim from JDBC) → SQ_INFO with INFO_ENV + env vars
   (48-byte PDU replayed verbatim — DBTEMP=/tmp, SUBQCACHESZ=10) →
   SQ_DBOPEN. Without all three steps in this exact order, the server
   silently ignores SELECTs.

3. SQ_DESCRIBE per-column block has 10 fields per column (not the
   simple "name + type" my best-effort parser assumed): fieldIndex,
   columnStartPos, columnType, columnExtendedId, ownerName,
   extendedName, reference, alignment, sourceType, encodedLength.
   The string table at the end is offset-indexed (fieldIndex points
   into it), which is how JDBC handles disambiguation.

Cursor lifecycle implementation in cursors.py mirrors JDBC exactly:
  PREPARE+NDESCRIBE+WANTDONE → DESCRIBE+DONE+COST+EOT
  CURNAME+NFETCH(4096) → TUPLE*+DONE+COST+EOT
  NFETCH(4096) → DONE+COST+EOT (drain)
  CLOSE → EOT
  RELEASE → EOT

Five round trips per SELECT — same as JDBC.

Module changes:
  src/informix_db/connections.py — added _init_session(), _send_protocols(),
    _send_dbopen(), _drain_to_eot(), _raise_sq_err(); login PDU now
    forces database=None always; SQ_INFO PDU replayed verbatim from
    JDBC capture (offsets-indexed env-var format too gnarly to derive
    in MVP).
  src/informix_db/cursors.py — full rewrite: real PDU builders for
    PREPARE/CURNAME+NFETCH/NFETCH/CLOSE/RELEASE; tag-dispatched
    response readers; cursor-name generator matching JDBC's "_ifxc"
    convention.
  src/informix_db/_resultset.py — proper SQ_DESCRIBE parser per
    JDBC's receiveDescribe (USVER mode); offset-indexed string table
    with name lookup by fieldIndex; ColumnInfo dataclass with raw
    type-code preserved for null-flag extraction.
  src/informix_db/_messages.py — added SQ_NDESCRIBE=22, SQ_WANTDONE=49.

Test coverage: 40 unit + 15 integration tests (7 smoke + 8 new SELECT)
= 55 total, all green, ruff clean. New tests cover:
  - SELECT 1 returns (1,)
  - cursor.description shape per PEP 249
  - Multi-row INT SELECT
  - Multi-column mixed types (INT + FLOAT)
  - Iterator protocol (for row in cursor)
  - fetchmany(n)
  - Re-executing on same cursor resets state
  - Two cursors on one connection (sequential)

Known gap: VARCHAR row decoding doesn't yet handle the variable-width
on-wire encoding correctly. Phase 2.x will address — for now NotImpl
errors surface raw bytes in the row tuple.
2026-05-03 15:37:10 -06:00
e2c48f855e Phase 2 progress: cursor scaffolding + protocol findings (SELECT path WIP)
Cursor class scaffolded with full PEP 249 surface:
  src/informix_db/cursors.py — Cursor with execute, fetchone, fetchmany,
    fetchall, description, rowcount, arraysize, close, iterator,
    context manager. Sends SQ_COMMAND chains for parameterless SQL
    (Phase 4 adds SQ_BIND/SQ_EXECUTE for params).
  src/informix_db/_resultset.py — ColumnInfo, parse_describe,
    parse_tuple_payload. Best-effort SQ_DESCRIBE parser; refines in
    Phase 2.1.
  src/informix_db/connections.py — Connection.cursor() now returns a
    real Cursor; new _send_pdu() lets Cursor share the connection's
    socket without violating encapsulation.

Protocol findings landed in PROTOCOL_NOTES.md §6:
  §6a — SQ_PREPARE format with named tags (the "trailing 22, 49"
    are SQ_NDESCRIBE and SQ_WANTDONE chained into the same PDU).
    Confirmed against IfxSqli.sendPrepare line 1062.
  §6c — Server requires post-login init sequence (SQ_PROTOCOLS →
    SQ_INFO → SQ_ID(env vars) → SQ_DBOPEN) BEFORE any PREPARE works.
    Discovered the hard way: PREPARE without this sequence gets no
    response; SQ_DBOPEN without SQ_PROTOCOLS gets sqlcode=-759
    ("Database not available"). The login PDU's database field is
    a hint, not an open.
  §6e — SQ_TUPLE corrected: [short warn][int size][bytes payload]
    (not [int 0][short payloadLen] as earlier draft claimed).
  Two more constants added to _messages.MessageType:
    SQ_NDESCRIBE = 22, SQ_WANTDONE = 49

Tests: 40 unit + 7 integration (added 2 new — cursor() returns a
Cursor, parameter binding raises NotSupportedError). All green, ruff
clean. Removed obsolete "cursor() raises NotImplementedError" test.

What works end-to-end now: connect, cursor(), close, parameter-attempt
gating. What doesn't yet: cursor.execute("SELECT 1") — server requires
the post-login init sequence we don't yet send.

Discovered captures (kept for next session's analysis):
  docs/CAPTURES/06-py-select1-attempt.socat.log
  docs/CAPTURES/07-py-replay-jdbc-prepare.socat.log
  docs/CAPTURES/08-py-with-dbopen.socat.log
  docs/CAPTURES/09-py-full-replay.socat.log

Three new tasks created tracking the remaining Phase 2 blockers:
post-login init sequence, proper SQ_DESCRIBE parser, SQ_ID action
vocabulary helpers.
2026-05-02 21:04:30 -06:00