mcarchive-org

warehack.ing/mcarchive-org

Commit Graph

Author	SHA1	Message	Date

Ryan Malloy

6198defeca

Resilience: address Hamilton tier-2 findings

H7 — Process-wide shared httpx.AsyncClient via get_shared_client().
Each tool call no longer pays a TCP+TLS handshake; connection pool is
reused across the server's lifetime. Tests inject mock transports
directly via ArchiveClient(transport=...) so the singleton stays clean.

M1 — Retry/backoff on 429/502/503/504 with Retry-After honored
(both delta-seconds and HTTP-date forms). Exponential backoff with
jitter, capped at 30s, max 3 attempts. Applied to both _fetch_json
and stream_file (retry happens BEFORE any bytes are yielded so it
can't corrupt a partial write).
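The retry policy in M1 can be sketched as a pure delay calculation. This is a minimal illustration, not the project's code: the names `parse_retry_after`, `retry_delay`, and `should_retry` are hypothetical, the 1-second base and the additive jitter scheme are assumptions, and capping a server-supplied Retry-After at 30s is also an assumption (the commit only states that backoff is capped at 30s).

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

RETRY_STATUSES = {429, 502, 503, 504}  # from the commit message
MAX_ATTEMPTS = 3                       # from the commit message
MAX_DELAY = 30.0                       # seconds, from the commit message


def parse_retry_after(value, now=None):
    """Return seconds to wait from a Retry-After header, or None if unusable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdecimal():  # delta-seconds form, e.g. "120"
        return float(int(value))
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def retry_delay(attempt, retry_after=None, base=1.0):
    """Delay before retrying after failed attempt number `attempt` (1-based)."""
    hinted = parse_retry_after(retry_after)
    if hinted is not None:
        return min(hinted, MAX_DELAY)  # assumption: server hint is capped too
    backoff = base * 2 ** (attempt - 1)
    jittered = backoff + random.uniform(0, backoff)  # assumed jitter scheme
    return min(jittered, MAX_DELAY)


def should_retry(status, attempt):
    """True if `status` is retryable and attempts remain."""
    return status in RETRY_STATUSES and attempt < MAX_ATTEMPTS
```

In a streaming path, a check like `should_retry` would run on the response status before the first chunk is yielded, which is what keeps a retry from corrupting a partial write.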

M2 — Per-(identifier, filename) asyncio.Lock in download_file
serializes concurrent downloads of the same file inside one process.
Different files still download in parallel.
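One way to get the per-file serialization described in M2 is a dictionary of locks keyed by `(identifier, filename)`. The sketch below is an assumption about the shape of the technique, not the project's implementation: `_download_locks` and the `log` parameter are hypothetical, and `asyncio.sleep` stands in for the real streaming download.

```python
import asyncio
from collections import defaultdict

# One lock per (identifier, filename); defaultdict creates each on first use.
_download_locks = defaultdict(asyncio.Lock)


async def download_file(identifier, filename, log):
    """Hold the per-file lock for the duration of the (simulated) download."""
    async with _download_locks[(identifier, filename)]:
        log.append(("start", filename))
        await asyncio.sleep(0.01)  # stand-in for the real download
        log.append(("end", filename))
```

Two concurrent calls for the same file run back to back, while calls for different files interleave. A long-running server would probably also want to evict idle locks so the dictionary does not grow without bound.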

M5 — collection field normalized to list[str] in all output paths
(search docs, scrape items, item metadata). LLMs can write
`if 'foo' in doc['collection']` without checking the type first.
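A normalizer along these lines would produce the `list[str]` shape M5 describes. The function name `normalize_collection` and the exact set of input shapes handled (missing, single string, sequence) are assumptions for illustration.

```python
def normalize_collection(value):
    """Coerce a `collection` field to list[str] regardless of its input shape."""
    if value is None:
        return []            # field missing or null
    if isinstance(value, str):
        return [value]       # single collection given as a bare string
    return [str(v) for v in value]  # already a sequence of collections
```

With that in place, `'foo' in doc['collection']` is a list membership test in every case, rather than a substring test when the field happens to be a string.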

M7 — `is_collection: bool` derived from mediatype on every doc /
metadata response, so LLMs can route collection containers vs.
real media items without re-querying.
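The derivation in M7 is presumably a comparison against archive.org's `collection` mediatype. The helper name `with_is_collection` is hypothetical, and returning a copy rather than mutating the input is a design assumption.

```python
def with_is_collection(doc):
    """Return a copy of `doc` with an `is_collection` flag derived from mediatype."""
    out = dict(doc)
    out["is_collection"] = out.get("mediatype") == "collection"
    return out
```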

H1 — Stream-abort errors (httpx.ReadError, RemoteProtocolError,
ConnectError, ReadTimeout) caught and re-raised as ArchiveError
with bytes-written context so the caller knows where the partial
download ended. Bytes already on disk remain valid for resume.

19 new regression tests (52 total, all green, ruff clean):
- 4 tests covering retry/backoff, exhaustion, HTTP-date Retry-After
- 1 test for stream-abort byte-count surfacing
- 6 tests for collection normalization shapes
- 4 tests for is_collection in real tool flow + shared client lifecycle
- 2 tests verifying download lock: same-file serialized, different files parallel

2026-04-21 20:24:21 -06:00

Ryan Malloy

4a03af1675

Hardening: address Hamilton review ship-blockers

Critical fixes:
- Validate identifier (^[A-Za-z0-9._-]+$) and filename (no '..', absolute
  paths, NUL bytes, drive letters) at the client boundary
- Confine download destinations under MCARCHIVE_DOWNLOAD_ROOT via
  Path.resolve() + is_relative_to() check; reject symlinked dirs
- Use O_NOFOLLOW on the destination open() to refuse symlink substitution
- Detect Range-ignored responses: if resume requested but server returns 200
  (or 206 with wrong Content-Range start), raise ArchiveError BEFORE writing
  any bytes — closes the silent file-corruption hole
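The critical fixes above can be sketched with the standard library alone. Everything here is illustrative: the function names are hypothetical, `ArchiveError` is a stand-in for the project's error type, and the drive-letter and Content-Range regexes are assumptions. The identifier pattern is the one quoted in the commit, applied with `fullmatch` so a trailing newline cannot slip past `$`.

```python
import os
import re
from pathlib import Path

IDENTIFIER_RE = re.compile(r"^[A-Za-z0-9._-]+$")  # pattern from the commit


class ArchiveError(Exception):
    """Stand-in for the project's error type."""


def validate_identifier(identifier):
    if not IDENTIFIER_RE.fullmatch(identifier):
        raise ArchiveError(f"invalid identifier: {identifier!r}")
    return identifier


def validate_filename(filename):
    parts = filename.replace("\\", "/").split("/")
    if (
        not filename
        or "\x00" in filename                  # NUL byte
        or filename.startswith(("/", "\\"))    # absolute path
        or re.match(r"[A-Za-z]:", filename)    # drive letter (assumed check)
        or ".." in parts                       # parent traversal
    ):
        raise ArchiveError(f"invalid filename: {filename!r}")
    return filename


def confine(root, filename):
    """Resolve the destination and require it to stay under `root`."""
    root = Path(root).resolve()
    dest = (root / filename).resolve()  # follows symlinks before the check
    if not dest.is_relative_to(root):   # Python 3.9+
        raise ArchiveError(f"destination escapes download root: {dest}")
    return dest


def open_destination(dest):
    """Open for append, refusing a symlink at the final path component."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    flags |= getattr(os, "O_NOFOLLOW", 0)  # not available on every platform
    return os.open(dest, flags, 0o644)


def check_resume_response(status, content_range, offset):
    """Raise unless the server honored a `Range: bytes=<offset>-` request."""
    if offset == 0:
        return  # fresh download, nothing to verify
    if status != 206:
        raise ArchiveError(f"server ignored Range (status {status})")
    m = re.match(r"bytes (\d+)-", content_range or "")
    if not m or int(m.group(1)) != offset:
        raise ArchiveError(f"unexpected Content-Range: {content_range!r}")
```

Calling `check_resume_response` before the first write is the important ordering: appending a full 200 body to a partial file is what produces silent corruption.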

Usability:
- Wrap raise_for_status everywhere with ArchiveError that includes the
  response body preview — 4xx Solr errors now tell you what's wrong
- URL-encode filenames in download URLs (handles spaces and special chars)
- Map archive.org's {"error": ...} payloads on /metadata/{id}/files to
  ArchiveError with the server's message
- Lazy-resolve download root so env-var changes after import are honored
- Refactor item_resource to a shared async helper (drops .fn type-ignore)
- Rename result key 'bytes' -> 'bytes_written' (avoids shadowing builtin)
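Two of the usability items above lend themselves to a short sketch: URL-encoding filenames and surfacing `{"error": ...}` payloads. The helper names and the `ArchiveError` stand-in are hypothetical, and keeping `/` unescaped so nested file paths survive is an assumption about the intended behavior.

```python
from urllib.parse import quote


class ArchiveError(Exception):
    """Stand-in for the project's error type."""


def download_url(identifier, filename):
    """Build a download URL with the filename percent-encoded."""
    # quote() leaves "/" intact by default, so nested paths keep their shape.
    return f"https://archive.org/download/{identifier}/{quote(filename)}"


def raise_on_error_payload(payload):
    """Turn an {"error": ...} JSON body into an exception with the server's message."""
    if isinstance(payload, dict) and "error" in payload:
        raise ArchiveError(str(payload["error"]))
    return payload
```

For example, `download_url("item", "my file #1.txt")` yields a path ending in `my%20file%20%231.txt`, so the `#` is not mistaken for a URL fragment.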

Tests:
- New tests/test_client_mocked.py: 29 regression tests using
  httpx.MockTransport covering every Hamilton finding above (path traversal,
  symlink refusal, Range-ignored, Content-Range mismatch, error body
  surfacing, malformed JSON, dark items, etc.)
- Set asyncio_mode = "auto" in pyproject for cleaner test markers

33/33 tests pass (4 live + 29 mocked), ruff clean.

2026-04-21 15:34:30 -06:00

2 Commits