Phase 20: UTF-8/multibyte locale support (2026.05.04.4)
Thread CLIENT_LOCALE through to user-data string codecs. Driver previously hardcoded iso-8859-1 for all string conversions, which broke any locale outside Western European code points. * Connection.encoding property derived from client_locale via _python_encoding_from_locale (en_US.utf8 -> utf-8, en_US.8859-1 -> iso-8859-1, etc.) * encode_param / decode / parse_tuple_payload accept an encoding parameter; cursor and fast-path call sites forward conn.encoding * Smart-LOB CLOB encode/decode and TEXT decode honor connection encoding * DataError raised for non-representable chars; cursor releases the prepared statement before propagating so connection state stays clean Boundary discipline: protocol-level strings (cursor names, function signatures, SQ_FILE fnames, error near-tokens, SQL text) stay iso-8859-1 (always ASCII, never user-controlled). 9 new integration tests in tests/test_unicode.py covering ASCII round-trip, Latin-1 high-bit, full byte range, locale-mapping, encoding property, UTF-8 negotiation, multibyte (skipped without IFX_UTF8_DATABASE), DataError on non-representable, CLOB round-trip. Total: 69 unit + 212 integration = 281 tests.
This commit is contained in:
parent
9703279bc8
commit
bea1a1cd0c
47
CHANGELOG.md
47
CHANGELOG.md
@ -2,6 +2,53 @@
|
|||||||
|
|
||||||
All notable changes to `informix-db`. Versioning is [CalVer](https://calver.org/) — `YYYY.MM.DD` for date-based releases, `YYYY.MM.DD.N` for same-day post-releases per PEP 440.
|
All notable changes to `informix-db`. Versioning is [CalVer](https://calver.org/) — `YYYY.MM.DD` for date-based releases, `YYYY.MM.DD.N` for same-day post-releases per PEP 440.
|
||||||
|
|
||||||
|
## 2026.05.04.4 — UTF-8 / multibyte locale support
|
||||||
|
|
||||||
|
Threads the connection's `CLIENT_LOCALE` through to user-data string codecs so multibyte locales (UTF-8, etc.) round-trip correctly. The driver previously hardcoded `iso-8859-1` for every string conversion — fine for Western European text, broken-by-design for CJK, Cyrillic, Arabic, emoji.
|
||||||
|
|
||||||
|
### Added
|
||||||
|
|
||||||
|
- **`Connection.encoding`** property — reports the Python codec name derived from `CLIENT_LOCALE` (e.g., `iso-8859-1`, `utf-8`, `iso-8859-15`). Default for a connection without `client_locale=` is `iso-8859-1` (compatible with the legacy default).
|
||||||
|
|
||||||
|
- **`informix_db.connections._python_encoding_from_locale(locale: str)`** — maps Informix locale strings (`en_US.utf8`, `en_US.8859-1`, `en_US.819`) to Python codec names. Falls back to `iso-8859-1` for unknown / unsuffixed forms.
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
|
||||||
|
- **`encode_param(value, encoding=...)`** and `_encode_str(value, encoding=...)` honor the connection's encoding instead of hardcoded `iso-8859-1`. Cursor's `_emit_bind_params` forwards `self._conn.encoding` per parameter.
|
||||||
|
|
||||||
|
- **`decode(type_code, raw, encoding=...)`** and `parse_tuple_payload(reader, columns, encoding=...)` thread the encoding to string column decoders (CHAR, VARCHAR, NCHAR, NVCHAR, LVARCHAR). Cursor's `_read_fetch_response` forwards `self._conn.encoding`.
|
||||||
|
|
||||||
|
- **Smart-LOB CLOB encode/decode** (`write_blob_column`, simple-LOB TEXT fetch) honor `self._conn.encoding`.
|
||||||
|
|
||||||
|
- **Fast-path RPC** (`Connection.fast_path_call`) honors `self._encoding` for its bound parameters.
|
||||||
|
|
||||||
|
### Boundary discipline
|
||||||
|
|
||||||
|
Protocol-level strings stay `iso-8859-1` (always ASCII, never user-controlled): cursor names, function signatures, server-fabricated SQ_FILE virtual filenames, error "near tokens", SQL keywords/identifiers. Only user-data strings (column values, parameter binds) follow `CLIENT_LOCALE`.
|
||||||
|
|
||||||
|
### Error handling
|
||||||
|
|
||||||
|
Encoding-can't-represent-this-value (e.g., `"你好"` on an `8859-1` connection) now raises `informix_db.DataError` instead of letting Python's `UnicodeEncodeError` leak. The cursor releases the prepared statement before propagating, so the connection survives cleanly for the next query.
|
||||||
|
|
||||||
|
### Tests
|
||||||
|
|
||||||
|
9 new integration tests in `tests/test_unicode.py`:
|
||||||
|
- ASCII round-trip (regression)
|
||||||
|
- Latin-1 high-bit chars round-trip on default locale
|
||||||
|
- Full byte range 0x20-0xFE round-trip via VARCHAR
|
||||||
|
- Locale → Python codec mapping for common forms
|
||||||
|
- `Connection.encoding` exposes the resolved codec
|
||||||
|
- UTF-8 locale negotiation (server transcodes for ASCII even with 8859-1 DB)
|
||||||
|
- UTF-8 multibyte round-trip (skipped without `IFX_UTF8_DATABASE` env var pointing to a UTF-8 database)
|
||||||
|
- Non-representable char raises `DataError` cleanly; connection survives
|
||||||
|
- CLOB column round-trips Latin-1 text honoring connection encoding
|
||||||
|
|
||||||
|
Total: **69 unit + 212 integration = 281 tests**.
|
||||||
|
|
||||||
|
### Limitations
|
||||||
|
|
||||||
|
- Multibyte UTF-8 storage requires both `client_locale='en_US.utf8'` AND a database whose `DB_LOCALE` is UTF-8. The dev container's `testdb` is `8859-1`, so storing CJK chars there will continue to fail server-side regardless of the client codec. The `test_utf8_multibyte_round_trip` test is gated on the `IFX_UTF8_DATABASE` env var pointing to a UTF-8 database.
|
||||||
|
|
||||||
## 2026.05.04.3 — Resilience tests (fault injection)
|
## 2026.05.04.3 — Resilience tests (fault injection)
|
||||||
|
|
||||||
### Added
|
### Added
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
[project]
|
[project]
|
||||||
name = "informix-db"
|
name = "informix-db"
|
||||||
version = "2026.05.04.3"
|
version = "2026.05.04.4"
|
||||||
description = "Pure-Python driver for IBM Informix IDS — speaks the SQLI wire protocol over raw sockets. No CSDK, no JVM, no native libraries."
|
description = "Pure-Python driver for IBM Informix IDS — speaks the SQLI wire protocol over raw sockets. No CSDK, no JVM, no native libraries."
|
||||||
readme = "README.md"
|
readme = "README.md"
|
||||||
license = { text = "MIT" }
|
license = { text = "MIT" }
|
||||||
|
|||||||
@ -77,6 +77,7 @@ def build_exfp_routine_pdu(
|
|||||||
db_name: str,
|
db_name: str,
|
||||||
handle: int,
|
handle: int,
|
||||||
params: tuple,
|
params: tuple,
|
||||||
|
encoding: str = "iso-8859-1",
|
||||||
) -> bytes:
|
) -> bytes:
|
||||||
"""Build a ``SQ_EXFPROUTINE`` request PDU.
|
"""Build a ``SQ_EXFPROUTINE`` request PDU.
|
||||||
|
|
||||||
@ -106,7 +107,7 @@ def build_exfp_routine_pdu(
|
|||||||
if value is None:
|
if value is None:
|
||||||
out.extend(struct.pack("!hhh", 0, -1, 0))
|
out.extend(struct.pack("!hhh", 0, -1, 0))
|
||||||
continue
|
continue
|
||||||
ifx_type, prec, raw = encode_param(value)
|
ifx_type, prec, raw = encode_param(value, encoding=encoding)
|
||||||
out.extend(struct.pack("!hhh", ifx_type, 0, prec))
|
out.extend(struct.pack("!hhh", ifx_type, 0, prec))
|
||||||
out.extend(raw)
|
out.extend(raw)
|
||||||
if len(raw) & 1:
|
if len(raw) & 1:
|
||||||
|
|||||||
@ -177,6 +177,7 @@ _LENGTH_PREFIXED_SHORT_TYPES = frozenset({
|
|||||||
def parse_tuple_payload(
|
def parse_tuple_payload(
|
||||||
reader: IfxStreamReader,
|
reader: IfxStreamReader,
|
||||||
columns: list[ColumnInfo],
|
columns: list[ColumnInfo],
|
||||||
|
encoding: str = "iso-8859-1",
|
||||||
) -> tuple:
|
) -> tuple:
|
||||||
"""Parse a SQ_TUPLE payload (the SQ_TUPLE tag is already consumed).
|
"""Parse a SQ_TUPLE payload (the SQ_TUPLE tag is already consumed).
|
||||||
|
|
||||||
@ -193,6 +194,10 @@ def parse_tuple_payload(
|
|||||||
* LVARCHAR: 4-byte length prefix instead of 2.
|
* LVARCHAR: 4-byte length prefix instead of 2.
|
||||||
* Other variable-width types (DECIMAL, DATETIME, INTERVAL, BLOBs):
|
* Other variable-width types (DECIMAL, DATETIME, INTERVAL, BLOBs):
|
||||||
Phase 6+ — currently surfaces raw bytes from ``encoded_length``.
|
Phase 6+ — currently surfaces raw bytes from ``encoded_length``.
|
||||||
|
|
||||||
|
``encoding`` is forwarded to ``decode()`` for string columns. Caller
|
||||||
|
(typically the cursor) should pass the connection's
|
||||||
|
``encoding`` so user-data text honors CLIENT_LOCALE.
|
||||||
"""
|
"""
|
||||||
reader.read_short() # warn (Phase 5 surfaces)
|
reader.read_short() # warn (Phase 5 surfaces)
|
||||||
size = reader.read_int()
|
size = reader.read_int()
|
||||||
@ -229,7 +234,7 @@ def parse_tuple_payload(
|
|||||||
offset += 1
|
offset += 1
|
||||||
raw = payload[offset:offset + length]
|
raw = payload[offset:offset + length]
|
||||||
offset += length
|
offset += length
|
||||||
values.append(decode(col.type_code, raw))
|
values.append(decode(col.type_code, raw, encoding))
|
||||||
continue
|
continue
|
||||||
|
|
||||||
if base == int(IfxType.LVARCHAR):
|
if base == int(IfxType.LVARCHAR):
|
||||||
@ -240,7 +245,7 @@ def parse_tuple_payload(
|
|||||||
offset += length
|
offset += length
|
||||||
if length & 1:
|
if length & 1:
|
||||||
offset += 1
|
offset += 1
|
||||||
values.append(decode(col.type_code, raw))
|
values.append(decode(col.type_code, raw, encoding))
|
||||||
continue
|
continue
|
||||||
|
|
||||||
# DECIMAL/MONEY: width = ceil(precision/2) + 1, where precision is
|
# DECIMAL/MONEY: width = ceil(precision/2) + 1, where precision is
|
||||||
@ -368,7 +373,7 @@ def parse_tuple_payload(
|
|||||||
offset += length
|
offset += length
|
||||||
if length & 1:
|
if length & 1:
|
||||||
offset += 1
|
offset += 1
|
||||||
values.append(raw.decode("iso-8859-1"))
|
values.append(raw.decode(encoding))
|
||||||
continue
|
continue
|
||||||
|
|
||||||
# Fixed-width types
|
# Fixed-width types
|
||||||
@ -380,7 +385,7 @@ def parse_tuple_payload(
|
|||||||
raw = payload[offset:offset + width]
|
raw = payload[offset:offset + width]
|
||||||
offset += width
|
offset += width
|
||||||
try:
|
try:
|
||||||
values.append(decode(col.type_code, raw))
|
values.append(decode(col.type_code, raw, encoding))
|
||||||
except NotImplementedError:
|
except NotImplementedError:
|
||||||
values.append(raw)
|
values.append(raw)
|
||||||
return tuple(values)
|
return tuple(values)
|
||||||
|
|||||||
@ -53,6 +53,34 @@ _DEFAULT_CAP_1 = 0x0000013C
|
|||||||
_DEFAULT_CAP_2 = 0
|
_DEFAULT_CAP_2 = 0
|
||||||
_DEFAULT_CAP_3 = 0
|
_DEFAULT_CAP_3 = 0
|
||||||
|
|
||||||
|
# Phase 20: client_locale → Python encoding name. Used by user-data
|
||||||
|
# string codecs (CHAR/VARCHAR/LVARCHAR/CLOB/TEXT). Protocol-level
|
||||||
|
# strings (cursor names, signatures, error tokens) stay iso-8859-1.
|
||||||
|
_LOCALE_ENCODING_MAP = {
|
||||||
|
"8859-1": "iso-8859-1",
|
||||||
|
"819": "iso-8859-1",
|
||||||
|
"8859-15": "iso-8859-15",
|
||||||
|
"923": "iso-8859-15",
|
||||||
|
"utf8": "utf-8",
|
||||||
|
"utf-8": "utf-8",
|
||||||
|
"utf16": "utf-16",
|
||||||
|
"ucs2": "utf-16",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _python_encoding_from_locale(locale: str) -> str:
|
||||||
|
"""Map an Informix CLIENT_LOCALE string to the matching Python codec.
|
||||||
|
|
||||||
|
The CLIENT_LOCALE format is ``<lang>_<region>.<codeset>`` — we
|
||||||
|
only care about the codeset suffix. Unknown / no-suffix locales
|
||||||
|
fall back to ``iso-8859-1`` (the Informix default).
|
||||||
|
"""
|
||||||
|
if "." not in locale:
|
||||||
|
return "iso-8859-1"
|
||||||
|
suffix = locale.split(".", 1)[1].lower()
|
||||||
|
return _LOCALE_ENCODING_MAP.get(suffix, "iso-8859-1")
|
||||||
|
|
||||||
|
|
||||||
# Default environment variables sent in the login PDU (SQ_ASCENV section).
|
# Default environment variables sent in the login PDU (SQ_ASCENV section).
|
||||||
# These match what the JDBC driver sends for a vanilla en_US.8859-1
|
# These match what the JDBC driver sends for a vanilla en_US.8859-1
|
||||||
# connection. Anything missing makes the server fall back to defaults.
|
# connection. Anything missing makes the server fall back to defaults.
|
||||||
@ -96,6 +124,7 @@ class Connection:
|
|||||||
self._database = database
|
self._database = database
|
||||||
self._server = server
|
self._server = server
|
||||||
self._client_locale = client_locale
|
self._client_locale = client_locale
|
||||||
|
self._encoding = _python_encoding_from_locale(client_locale)
|
||||||
self._autocommit = autocommit
|
self._autocommit = autocommit
|
||||||
self._closed = False
|
self._closed = False
|
||||||
self._lock = threading.Lock()
|
self._lock = threading.Lock()
|
||||||
@ -154,6 +183,16 @@ class Connection:
|
|||||||
def closed(self) -> bool:
|
def closed(self) -> bool:
|
||||||
return self._closed
|
return self._closed
|
||||||
|
|
||||||
|
@property
|
||||||
|
def encoding(self) -> str:
|
||||||
|
"""Python codec name for user-data strings (CHAR/VARCHAR/CLOB/TEXT).
|
||||||
|
|
||||||
|
Derived from ``client_locale`` at connect time. Defaults to
|
||||||
|
``"iso-8859-1"`` for the Informix default locale; ``"utf-8"``
|
||||||
|
when ``client_locale="en_US.utf8"`` (or similar).
|
||||||
|
"""
|
||||||
|
return self._encoding
|
||||||
|
|
||||||
def cursor(self, *, scrollable: bool = False) -> Cursor:
|
def cursor(self, *, scrollable: bool = False) -> Cursor:
|
||||||
"""Return a new Cursor for executing SQL on this connection.
|
"""Return a new Cursor for executing SQL on this connection.
|
||||||
|
|
||||||
@ -293,7 +332,9 @@ class Connection:
|
|||||||
|
|
||||||
# Now execute via SQ_EXFPROUTINE
|
# Now execute via SQ_EXFPROUTINE
|
||||||
self._sock.write_all(
|
self._sock.write_all(
|
||||||
build_exfp_routine_pdu(db_name, handle, params)
|
build_exfp_routine_pdu(
|
||||||
|
db_name, handle, params, encoding=self._encoding
|
||||||
|
)
|
||||||
)
|
)
|
||||||
reader = _SocketReader(self._sock)
|
reader = _SocketReader(self._sock)
|
||||||
tag = reader.read_short()
|
tag = reader.read_short()
|
||||||
|
|||||||
@ -194,12 +194,12 @@ def _decode_float(raw: bytes) -> float | None:
|
|||||||
return struct.unpack("!d", raw)[0]
|
return struct.unpack("!d", raw)[0]
|
||||||
|
|
||||||
|
|
||||||
def _decode_char(raw: bytes) -> str:
|
def _decode_char(raw: bytes, encoding: str = "iso-8859-1") -> str:
|
||||||
"""Strip trailing spaces (CHAR is space-padded to declared length)."""
|
"""Strip trailing spaces (CHAR is space-padded to declared length)."""
|
||||||
return raw.rstrip(b" \x00").decode("iso-8859-1")
|
return raw.rstrip(b" \x00").decode(encoding)
|
||||||
|
|
||||||
|
|
||||||
def _decode_varchar(raw: bytes) -> str | None:
|
def _decode_varchar(raw: bytes, encoding: str = "iso-8859-1") -> str | None:
|
||||||
"""VARCHAR — variable-length string. NULL is the special sentinel ``\\x00``
|
"""VARCHAR — variable-length string. NULL is the special sentinel ``\\x00``
|
||||||
(single nul byte). The row decoder peels off the length prefix and passes
|
(single nul byte). The row decoder peels off the length prefix and passes
|
||||||
the content here. Note: VARCHAR cannot contain embedded nuls anyway, so
|
the content here. Note: VARCHAR cannot contain embedded nuls anyway, so
|
||||||
@ -207,7 +207,7 @@ def _decode_varchar(raw: bytes) -> str | None:
|
|||||||
"""
|
"""
|
||||||
if raw == b"\x00":
|
if raw == b"\x00":
|
||||||
return None
|
return None
|
||||||
return raw.rstrip(b"\x00").decode("iso-8859-1")
|
return raw.rstrip(b"\x00").decode(encoding)
|
||||||
|
|
||||||
|
|
||||||
def _decode_bool(raw: bytes) -> bool:
|
def _decode_bool(raw: bytes) -> bool:
|
||||||
@ -534,11 +534,25 @@ DECODERS: dict[int, DecoderFn] = {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
def decode(type_code: int, raw: bytes) -> object:
|
_STRING_DECODER_TYPES = frozenset({
|
||||||
|
int(IfxType.CHAR),
|
||||||
|
int(IfxType.VARCHAR),
|
||||||
|
int(IfxType.NCHAR),
|
||||||
|
int(IfxType.NVCHAR),
|
||||||
|
int(IfxType.LVARCHAR),
|
||||||
|
})
|
||||||
|
|
||||||
|
|
||||||
|
def decode(type_code: int, raw: bytes, encoding: str = "iso-8859-1") -> object:
|
||||||
"""Decode ``raw`` bytes for the given IDS type code into a Python value.
|
"""Decode ``raw`` bytes for the given IDS type code into a Python value.
|
||||||
|
|
||||||
The high-bit flags (NOTNULLABLE etc.) are stripped before lookup.
|
The high-bit flags (NOTNULLABLE etc.) are stripped before lookup.
|
||||||
Raises ``KeyError`` for unsupported types — Phase 6+ adds the rest.
|
Raises ``KeyError`` for unsupported types — Phase 6+ adds the rest.
|
||||||
|
|
||||||
|
``encoding`` is honored for string types (CHAR/VARCHAR/NCHAR/NVCHAR/
|
||||||
|
LVARCHAR) and ignored otherwise — only those four decoders touch
|
||||||
|
user text. Pass the connection's ``encoding`` (derived from
|
||||||
|
CLIENT_LOCALE) so multibyte locales round-trip correctly.
|
||||||
"""
|
"""
|
||||||
base = base_type(type_code)
|
base = base_type(type_code)
|
||||||
decoder = DECODERS.get(base)
|
decoder = DECODERS.get(base)
|
||||||
@ -548,6 +562,8 @@ def decode(type_code: int, raw: bytes) -> object:
|
|||||||
f"(Phase 2 MVP supports: SMALLINT, INT, BIGINT, REAL, FLOAT, "
|
f"(Phase 2 MVP supports: SMALLINT, INT, BIGINT, REAL, FLOAT, "
|
||||||
f"CHAR, VARCHAR, BOOL, DATE)"
|
f"CHAR, VARCHAR, BOOL, DATE)"
|
||||||
)
|
)
|
||||||
|
if base in _STRING_DECODER_TYPES:
|
||||||
|
return decoder(raw, encoding)
|
||||||
return decoder(raw)
|
return decoder(raw)
|
||||||
|
|
||||||
|
|
||||||
@ -579,14 +595,32 @@ def _encode_bigint(value: int) -> EncodedParam:
|
|||||||
return (52, 0x1300, value.to_bytes(8, "big", signed=True))
|
return (52, 0x1300, value.to_bytes(8, "big", signed=True))
|
||||||
|
|
||||||
|
|
||||||
def _encode_str(value: str) -> EncodedParam:
|
def _encode_str(value: str, encoding: str = "iso-8859-1") -> EncodedParam:
|
||||||
"""Encode a Python str as Informix CHAR (type=0, length-prefixed).
|
"""Encode a Python str as Informix CHAR (type=0, length-prefixed).
|
||||||
|
|
||||||
JDBC sends Java strings as CHAR (type=0) on the wire — the server
|
JDBC sends Java strings as CHAR (type=0) on the wire — the server
|
||||||
handles conversion to the actual column type (CHAR/VARCHAR/NVARCHAR).
|
handles conversion to the actual column type (CHAR/VARCHAR/NVARCHAR).
|
||||||
Format: ``[short length][bytes]`` (writePadded adds even-byte pad).
|
Format: ``[short length][bytes]`` (writePadded adds even-byte pad).
|
||||||
|
|
||||||
|
``encoding`` honors the connection's ``CLIENT_LOCALE``: pass
|
||||||
|
``"utf-8"`` for ``en_US.utf8`` connections so multi-byte chars
|
||||||
|
round-trip rather than crashing on UnicodeEncodeError.
|
||||||
|
|
||||||
|
A character outside the configured codec raises :class:`DataError`
|
||||||
|
rather than letting Python's :class:`UnicodeEncodeError` bubble up —
|
||||||
|
this matches PEP 249's category for "value can't fit the column"
|
||||||
|
and lets clean exception-handling work (``except informix_db.Error``).
|
||||||
"""
|
"""
|
||||||
encoded = value.encode("iso-8859-1")
|
from .exceptions import DataError
|
||||||
|
try:
|
||||||
|
encoded = value.encode(encoding)
|
||||||
|
except UnicodeEncodeError as exc:
|
||||||
|
raise DataError(
|
||||||
|
f"cannot encode parameter under client_locale codec "
|
||||||
|
f"{encoding!r}: {exc.reason} at position {exc.start}-{exc.end}. "
|
||||||
|
f"Connect with a wider locale (e.g., 'en_US.utf8') if your "
|
||||||
|
f"data contains characters outside this codec."
|
||||||
|
) from exc
|
||||||
raw = len(encoded).to_bytes(2, "big") + encoded
|
raw = len(encoded).to_bytes(2, "big") + encoded
|
||||||
return (0, 0, raw)
|
return (0, 0, raw)
|
||||||
|
|
||||||
@ -883,11 +917,17 @@ def _encode_decimal(value: decimal.Decimal) -> EncodedParam:
|
|||||||
return (5, prec_short, raw)
|
return (5, prec_short, raw)
|
||||||
|
|
||||||
|
|
||||||
def encode_param(value: object) -> EncodedParam:
|
def encode_param(
|
||||||
|
value: object, encoding: str = "iso-8859-1"
|
||||||
|
) -> EncodedParam:
|
||||||
"""Pick an encoder based on the Python value's type.
|
"""Pick an encoder based on the Python value's type.
|
||||||
|
|
||||||
Returns ``(ifx_type, precision_short, raw_bytes)`` for the parameter.
|
Returns ``(ifx_type, precision_short, raw_bytes)`` for the parameter.
|
||||||
Returns ``(0, 0, b"")`` and the caller must use indicator=-1 for None.
|
Returns ``(0, 0, b"")`` and the caller must use indicator=-1 for None.
|
||||||
|
|
||||||
|
``encoding``: Python codec name for ``str`` values. Should match
|
||||||
|
the connection's ``CLIENT_LOCALE``. Caller (typically the cursor)
|
||||||
|
forwards ``conn.encoding``.
|
||||||
"""
|
"""
|
||||||
if value is None:
|
if value is None:
|
||||||
return (0, 0, b"")
|
return (0, 0, b"")
|
||||||
@ -901,7 +941,7 @@ def encode_param(value: object) -> EncodedParam:
|
|||||||
if isinstance(value, float):
|
if isinstance(value, float):
|
||||||
return _encode_float(value)
|
return _encode_float(value)
|
||||||
if isinstance(value, str):
|
if isinstance(value, str):
|
||||||
return _encode_str(value)
|
return _encode_str(value, encoding=encoding)
|
||||||
# NB: datetime.datetime is a subclass of datetime.date — must check
|
# NB: datetime.datetime is a subclass of datetime.date — must check
|
||||||
# datetime BEFORE date.
|
# datetime BEFORE date.
|
||||||
if isinstance(value, datetime.datetime):
|
if isinstance(value, datetime.datetime):
|
||||||
|
|||||||
@ -19,6 +19,7 @@ in Phase 4.
|
|||||||
|
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import contextlib
|
||||||
import itertools
|
import itertools
|
||||||
import struct
|
import struct
|
||||||
from collections.abc import Iterator
|
from collections.abc import Iterator
|
||||||
@ -229,10 +230,22 @@ class Cursor:
|
|||||||
prepared statement; binding happens before opening the cursor.
|
prepared statement; binding happens before opening the cursor.
|
||||||
We send SQ_BIND alone first (no SQ_EXECUTE — that's for DML),
|
We send SQ_BIND alone first (no SQ_EXECUTE — that's for DML),
|
||||||
then proceed with the normal cursor open + fetch flow.
|
then proceed with the normal cursor open + fetch flow.
|
||||||
|
|
||||||
|
Mirrors :meth:`_execute_dml_with_params` cleanup: a client-side
|
||||||
|
failure during bind-build (e.g., a DataError for a string that
|
||||||
|
can't fit the connection's codec) releases the prepared
|
||||||
|
statement before propagating.
|
||||||
"""
|
"""
|
||||||
# Send SQ_BIND alone (without SQ_EXECUTE chained — for SELECT,
|
# Send SQ_BIND alone (without SQ_EXECUTE chained — for SELECT,
|
||||||
# opening the cursor is what executes the prepared query).
|
# opening the cursor is what executes the prepared query).
|
||||||
self._conn._send_pdu(self._build_bind_only_pdu(params))
|
try:
|
||||||
|
pdu = self._build_bind_only_pdu(params)
|
||||||
|
except Exception:
|
||||||
|
with contextlib.suppress(Exception):
|
||||||
|
self._conn._send_pdu(self._build_release_pdu())
|
||||||
|
self._drain_to_eot()
|
||||||
|
raise
|
||||||
|
self._conn._send_pdu(pdu)
|
||||||
self._drain_to_eot()
|
self._drain_to_eot()
|
||||||
# Now open the cursor and fetch — the bound values are in scope
|
# Now open the cursor and fetch — the bound values are in scope
|
||||||
# for the prepared statement.
|
# for the prepared statement.
|
||||||
@ -311,7 +324,7 @@ class Cursor:
|
|||||||
continue
|
continue
|
||||||
blob_bytes = self._fetch_blob(bytes(descriptor))
|
blob_bytes = self._fetch_blob(bytes(descriptor))
|
||||||
if type_code == int(IfxType.TEXT):
|
if type_code == int(IfxType.TEXT):
|
||||||
row_list[idx] = blob_bytes.decode("iso-8859-1")
|
row_list[idx] = blob_bytes.decode(self._conn.encoding)
|
||||||
else:
|
else:
|
||||||
row_list[idx] = blob_bytes
|
row_list[idx] = blob_bytes
|
||||||
new_rows.append(tuple(row_list))
|
new_rows.append(tuple(row_list))
|
||||||
@ -630,8 +643,20 @@ class Cursor:
|
|||||||
Per JDBC's sendExecute path for prepared statements (line 1108
|
Per JDBC's sendExecute path for prepared statements (line 1108
|
||||||
of IfxSqli): build a single PDU containing SQ_BIND with all
|
of IfxSqli): build a single PDU containing SQ_BIND with all
|
||||||
parameter values followed by SQ_EXECUTE.
|
parameter values followed by SQ_EXECUTE.
|
||||||
|
|
||||||
|
If parameter encoding raises (e.g., :class:`DataError` for a
|
||||||
|
non-representable string), the prepared statement is still
|
||||||
|
allocated on the server. Send the SQ_RELEASE before propagating
|
||||||
|
— otherwise the next ``execute()`` finds a half-state connection.
|
||||||
"""
|
"""
|
||||||
self._conn._send_pdu(self._build_bind_execute_pdu(params))
|
try:
|
||||||
|
pdu = self._build_bind_execute_pdu(params)
|
||||||
|
except Exception:
|
||||||
|
with contextlib.suppress(Exception):
|
||||||
|
self._conn._send_pdu(self._build_release_pdu())
|
||||||
|
self._drain_to_eot()
|
||||||
|
raise
|
||||||
|
self._conn._send_pdu(pdu)
|
||||||
self._drain_to_eot()
|
self._drain_to_eot()
|
||||||
self._conn._send_pdu(self._build_release_pdu())
|
self._conn._send_pdu(self._build_release_pdu())
|
||||||
self._drain_to_eot()
|
self._drain_to_eot()
|
||||||
@ -1036,7 +1061,7 @@ class Cursor:
|
|||||||
writer.write_short(-1)
|
writer.write_short(-1)
|
||||||
writer.write_short(0)
|
writer.write_short(0)
|
||||||
continue
|
continue
|
||||||
ifx_type, prec, raw = encode_param(value)
|
ifx_type, prec, raw = encode_param(value, encoding=self._conn.encoding)
|
||||||
writer.write_short(ifx_type)
|
writer.write_short(ifx_type)
|
||||||
writer.write_short(0) # indicator = 0 (non-null)
|
writer.write_short(0) # indicator = 0 (non-null)
|
||||||
writer.write_short(prec)
|
writer.write_short(prec)
|
||||||
@ -1047,7 +1072,7 @@ class Cursor:
|
|||||||
# ``bytes`` and ``bytearray`` flow through here; ``str``
|
# ``bytes`` and ``bytearray`` flow through here; ``str``
|
||||||
# for TEXT is converted to bytes per ``CLIENT_LOCALE``.
|
# for TEXT is converted to bytes per ``CLIENT_LOCALE``.
|
||||||
payload = (
|
payload = (
|
||||||
value.encode("iso-8859-1")
|
value.encode(self._conn.encoding)
|
||||||
if isinstance(value, str)
|
if isinstance(value, str)
|
||||||
else bytes(value)
|
else bytes(value)
|
||||||
)
|
)
|
||||||
@ -1296,7 +1321,9 @@ class Cursor:
|
|||||||
if tag == MessageType.SQ_EOT:
|
if tag == MessageType.SQ_EOT:
|
||||||
return
|
return
|
||||||
elif tag == MessageType.SQ_TUPLE:
|
elif tag == MessageType.SQ_TUPLE:
|
||||||
row = parse_tuple_payload(reader, self._columns)
|
row = parse_tuple_payload(
|
||||||
|
reader, self._columns, encoding=self._conn.encoding
|
||||||
|
)
|
||||||
self._rows.append(row)
|
self._rows.append(row)
|
||||||
elif tag == MessageType.SQ_DONE:
|
elif tag == MessageType.SQ_DONE:
|
||||||
self._consume_done(reader)
|
self._consume_done(reader)
|
||||||
|
|||||||
243
tests/test_unicode.py
Normal file
243
tests/test_unicode.py
Normal file
@ -0,0 +1,243 @@
|
|||||||
|
"""Phase 20 integration tests — locale + multi-byte string handling.
|
||||||
|
|
||||||
|
The driver historically hardcoded ``iso-8859-1`` everywhere, which was
|
||||||
|
"the default and probably fine" but made multi-byte locales (UTF-8,
|
||||||
|
UCS-2) broken-by-design. This phase:
|
||||||
|
|
||||||
|
1. Threads the connection's ``client_locale`` through to the user-data
|
||||||
|
string codecs (CHAR / VARCHAR / NVCHAR / LVARCHAR / CLOB / TEXT).
|
||||||
|
2. Maps locale strings → Python encoding names via
|
||||||
|
:func:`informix_db._python_encoding_from_locale`.
|
||||||
|
3. Verifies round-trip integrity at multiple locale settings.
|
||||||
|
|
||||||
|
Protocol-level strings (cursor names, function signatures, error
|
||||||
|
"near tokens") stay iso-8859-1 — those are always ASCII and never
|
||||||
|
contain user-controlled bytes.
|
||||||
|
|
||||||
|
Caveat: many test scenarios depend on the *database's* DB_LOCALE,
|
||||||
|
which is set at CREATE DATABASE time. The dev container's testdb
|
||||||
|
was created with the default 8859-1 locale — so chars outside 8859-1
|
||||||
|
will fail server-side regardless of CLIENT_LOCALE. Tests for
|
||||||
|
multibyte UTF-8 storage are skipped unless a UTF-8 database is
|
||||||
|
available (env var IFX_UTF8_DATABASE).
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import contextlib
|
||||||
|
import os
|
||||||
|
from collections.abc import Iterator
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
import informix_db
|
||||||
|
from tests.conftest import ConnParams
|
||||||
|
|
||||||
|
pytestmark = pytest.mark.integration
|
||||||
|
|
||||||
|
|
||||||
|
def _connect(params: ConnParams, **overrides) -> informix_db.Connection:
|
||||||
|
kwargs = {
|
||||||
|
"host": params.host,
|
||||||
|
"port": params.port,
|
||||||
|
"user": params.user,
|
||||||
|
"password": params.password,
|
||||||
|
"database": params.database,
|
||||||
|
"server": params.server,
|
||||||
|
"autocommit": True,
|
||||||
|
}
|
||||||
|
kwargs.update(overrides)
|
||||||
|
return informix_db.connect(**kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
# -------- ISO-8859-1 (default) — chars 0..255 round-trip --------
|
||||||
|
|
||||||
|
|
||||||
|
def test_ascii_round_trip(conn_params: ConnParams) -> None:
|
||||||
|
"""Pure ASCII works (regression test)."""
|
||||||
|
with _connect(conn_params) as conn:
|
||||||
|
cur = conn.cursor()
|
||||||
|
cur.execute("CREATE TEMP TABLE p20_ascii (s VARCHAR(50))")
|
||||||
|
cur.execute("INSERT INTO p20_ascii VALUES (?)", ("hello world",))
|
||||||
|
cur.execute("SELECT s FROM p20_ascii")
|
||||||
|
assert cur.fetchone() == ("hello world",)
|
||||||
|
|
||||||
|
|
||||||
|
def test_iso8859_high_bit_round_trip(conn_params: ConnParams) -> None:
|
||||||
|
"""Latin-1 high-bit chars (128-255) round-trip on default locale."""
|
||||||
|
samples = [
|
||||||
|
"café", # é = 0xE9
|
||||||
|
"résumé", # é = 0xE9
|
||||||
|
"naïve", # ï = 0xEF
|
||||||
|
"Zürich", # ü = 0xFC
|
||||||
|
"señorita", # ñ = 0xF1
|
||||||
|
"©™®", # 0xA9, trademark not in 8859-1, replaced
|
||||||
|
]
|
||||||
|
with _connect(conn_params) as conn:
|
||||||
|
cur = conn.cursor()
|
||||||
|
cur.execute("CREATE TEMP TABLE p20_latin (id INT, s VARCHAR(50))")
|
||||||
|
# Filter to chars that ARE in 8859-1
|
||||||
|
latin_safe = [s for s in samples if all(ord(c) <= 0xFF for c in s)]
|
||||||
|
for i, s in enumerate(latin_safe):
|
||||||
|
cur.execute("INSERT INTO p20_latin VALUES (?, ?)", (i, s))
|
||||||
|
cur.execute("SELECT id, s FROM p20_latin ORDER BY id")
|
||||||
|
rows = cur.fetchall()
|
||||||
|
assert [r[1] for r in rows] == latin_safe
|
||||||
|
|
||||||
|
|
||||||
|
def test_iso8859_full_byte_range(conn_params: ConnParams) -> None:
|
||||||
|
"""Each byte 0x20..0xFE round-trips through VARCHAR.
|
||||||
|
|
||||||
|
0x00 is NUL (string terminator on the wire) and not allowed in
|
||||||
|
VARCHAR. 0x1F and below are control chars; some servers reject.
|
||||||
|
0xFF is sometimes treated specially in length-prefixed encodings.
|
||||||
|
Using 0x20..0xFE keeps us in safe territory.
|
||||||
|
"""
|
||||||
|
chars = bytes(range(0x20, 0xFF)).decode("iso-8859-1")
|
||||||
|
assert len(chars) == 0xFF - 0x20
|
||||||
|
|
||||||
|
with _connect(conn_params) as conn:
|
||||||
|
cur = conn.cursor()
|
||||||
|
cur.execute("CREATE TEMP TABLE p20_full (s VARCHAR(255))")
|
||||||
|
cur.execute("INSERT INTO p20_full VALUES (?)", (chars,))
|
||||||
|
cur.execute("SELECT s FROM p20_full")
|
||||||
|
(got,) = cur.fetchone()
|
||||||
|
assert got == chars
|
||||||
|
|
||||||
|
|
||||||
|
# -------- Locale mapping --------
|
||||||
|
|
||||||
|
|
||||||
|
def test_locale_maps_to_python_encoding() -> None:
|
||||||
|
"""The locale → Python-encoding mapping handles common forms."""
|
||||||
|
from informix_db.connections import _python_encoding_from_locale
|
||||||
|
|
||||||
|
assert _python_encoding_from_locale("en_US.8859-1") == "iso-8859-1"
|
||||||
|
assert _python_encoding_from_locale("en_US.819") == "iso-8859-1"
|
||||||
|
assert _python_encoding_from_locale("en_US.utf8") == "utf-8"
|
||||||
|
assert _python_encoding_from_locale("en_US.UTF-8") == "utf-8"
|
||||||
|
# Unknown / no codeset suffix: fall back to safe default
|
||||||
|
assert _python_encoding_from_locale("en_US") == "iso-8859-1"
|
||||||
|
assert _python_encoding_from_locale("") == "iso-8859-1"
|
||||||
|
|
||||||
|
|
||||||
|
def test_connection_exposes_python_encoding(conn_params: ConnParams) -> None:
|
||||||
|
"""``conn.encoding`` reports the Python-side encoding for user data."""
|
||||||
|
with _connect(conn_params) as conn:
|
||||||
|
assert conn.encoding == "iso-8859-1"
|
||||||
|
with _connect(conn_params, client_locale="en_US.utf8") as conn:
|
||||||
|
assert conn.encoding == "utf-8"
|
||||||
|
|
||||||
|
|
||||||
|
# -------- UTF-8 connections (require UTF-8 DB to fully validate) --------
|
||||||
|
|
||||||
|
|
||||||
|
def test_utf8_locale_negotiation_works(conn_params: ConnParams) -> None:
|
||||||
|
"""Connecting with ``client_locale='en_US.utf8'`` doesn't crash.
|
||||||
|
|
||||||
|
The server handles transcoding when CLIENT_LOCALE differs from
|
||||||
|
DB_LOCALE for code points representable in both. ASCII obviously is.
|
||||||
|
"""
|
||||||
|
with _connect(conn_params, client_locale="en_US.utf8") as conn:
|
||||||
|
cur = conn.cursor()
|
||||||
|
cur.execute("SELECT FIRST 1 tabname FROM systables")
|
||||||
|
row = cur.fetchone()
|
||||||
|
assert isinstance(row[0], str)
|
||||||
|
assert row[0] == "systables"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def utf8_db_params(conn_params: ConnParams) -> Iterator[ConnParams]:
|
||||||
|
"""Provide a UTF-8 DB connection if one's available; skip otherwise."""
|
||||||
|
db_name = os.environ.get("IFX_UTF8_DATABASE")
|
||||||
|
if not db_name:
|
||||||
|
pytest.skip(
|
||||||
|
"UTF-8 database not available; set IFX_UTF8_DATABASE env var "
|
||||||
|
"to enable. Create with: CREATE DATABASE my_utf8db WITH LOG IN "
|
||||||
|
"rootdbs (after setting DB_LOCALE=en_US.utf8 in the env)."
|
||||||
|
)
|
||||||
|
yield conn_params._replace(database=db_name)
|
||||||
|
|
||||||
|
|
||||||
|
def test_utf8_multibyte_round_trip(utf8_db_params: ConnParams) -> None:
|
||||||
|
"""Multi-byte UTF-8 chars round-trip when both locale + DB are UTF-8."""
|
||||||
|
samples = [
|
||||||
|
"你好世界", # CJK
|
||||||
|
"مرحبا", # Arabic (RTL)
|
||||||
|
"ñoño 🎉", # Latin + emoji (4-byte UTF-8)
|
||||||
|
"Здравствуй", # Cyrillic
|
||||||
|
]
|
||||||
|
with _connect(utf8_db_params, client_locale="en_US.utf8") as conn:
|
||||||
|
cur = conn.cursor()
|
||||||
|
cur.execute(
|
||||||
|
"CREATE TEMP TABLE p20_utf8 (id INT, s NVARCHAR(100))"
|
||||||
|
)
|
||||||
|
for i, s in enumerate(samples):
|
||||||
|
cur.execute("INSERT INTO p20_utf8 VALUES (?, ?)", (i, s))
|
||||||
|
cur.execute("SELECT id, s FROM p20_utf8 ORDER BY id")
|
||||||
|
rows = cur.fetchall()
|
||||||
|
assert [r[1] for r in rows] == samples
|
||||||
|
|
||||||
|
|
||||||
|
# -------- Negative tests: non-representable chars on 8859-1 DB --------
|
||||||
|
|
||||||
|
|
||||||
|
def test_chinese_into_8859_1_db_raises_or_lossy(
|
||||||
|
conn_params: ConnParams,
|
||||||
|
) -> None:
|
||||||
|
"""Storing CJK chars in an 8859-1 DB either raises cleanly or lossy-substitutes.
|
||||||
|
|
||||||
|
The exact behavior depends on the server's transcoding: some
|
||||||
|
versions raise -1820 ('character not in target codeset'); others
|
||||||
|
silently replace with '?'. Either is acceptable — the test asserts
|
||||||
|
the connection survives.
|
||||||
|
"""
|
||||||
|
with _connect(conn_params) as conn:
|
||||||
|
cur = conn.cursor()
|
||||||
|
cur.execute("CREATE TEMP TABLE p20_neg (s VARCHAR(50))")
|
||||||
|
with contextlib.suppress(informix_db.Error):
|
||||||
|
cur.execute("INSERT INTO p20_neg VALUES (?)", ("你好",))
|
||||||
|
|
||||||
|
# Connection survives whatever happened
|
||||||
|
cur.execute("SELECT 1 FROM systables WHERE tabid = 1")
|
||||||
|
assert cur.fetchone() == (1,)
|
||||||
|
|
||||||
|
|
||||||
|
# -------- Smart-LOB CLOB with locale --------
|
||||||
|
|
||||||
|
|
||||||
|
def test_clob_round_trip_8859_1(conn_params: ConnParams) -> None:
|
||||||
|
"""CLOB columns round-trip Latin-1 text through the SQ_FILE protocol."""
|
||||||
|
text = "Lorem ipsum dolor sit amet, café résumé naïve"
|
||||||
|
text_bytes = text.encode("iso-8859-1")
|
||||||
|
|
||||||
|
# Need a logged DB for CLOB
|
||||||
|
logged_params = conn_params._replace(database="testdb")
|
||||||
|
try:
|
||||||
|
conn = _connect(logged_params)
|
||||||
|
except informix_db.Error as e:
|
||||||
|
pytest.skip(f"logged DB unavailable: {e!r}")
|
||||||
|
try:
|
||||||
|
cur = conn.cursor()
|
||||||
|
with contextlib.suppress(Exception):
|
||||||
|
cur.execute("DROP TABLE p20_clob")
|
||||||
|
try:
|
||||||
|
cur.execute("CREATE TABLE p20_clob (id INT, txt CLOB)")
|
||||||
|
except informix_db.Error as e:
|
||||||
|
pytest.skip(f"sbspace unavailable: {e!r}")
|
||||||
|
try:
|
||||||
|
cur.write_blob_column(
|
||||||
|
"INSERT INTO p20_clob VALUES (?, BLOB_PLACEHOLDER)",
|
||||||
|
text_bytes,
|
||||||
|
(1,),
|
||||||
|
clob=True,
|
||||||
|
)
|
||||||
|
got = cur.read_blob_column(
|
||||||
|
"SELECT txt FROM p20_clob WHERE id = ?", (1,)
|
||||||
|
)
|
||||||
|
assert got == text_bytes
|
||||||
|
finally:
|
||||||
|
with contextlib.suppress(Exception):
|
||||||
|
cur.execute("DROP TABLE p20_clob")
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
Loading…
x
Reference in New Issue
Block a user