2 Commits

Author SHA1 Message Date
4a03af1675 Hardening: address Hamilton review ship-blockers
Critical fixes:
- Validate identifier (^[A-Za-z0-9._-]+$) and filename (no '..' components,
  absolute paths, NUL bytes, or drive letters) at the client boundary
- Confine download destinations under MCARCHIVE_DOWNLOAD_ROOT via
  Path.resolve() + is_relative_to() check; reject symlinked dirs
- Use O_NOFOLLOW on the destination open() to refuse symlink substitution
- Detect Range-ignored responses: if a resume is requested but the server
  returns 200 (or 206 with the wrong Content-Range start), raise ArchiveError
  BEFORE writing any bytes; this closes the silent file-corruption hole
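
The path-related fixes above could be sketched roughly as follows. This is a
minimal illustration under assumptions, not the code from this commit: the
helper names (validate_identifier, validate_filename, confine,
open_destination) and the ArchiveError stand-in are hypothetical, and only the
identifier pattern is taken from the message.

```python
import os
import re
from pathlib import Path

# Pattern quoted in the commit message; fullmatch() also rejects a trailing "\n".
# The pattern alone still admits "." and "..", so confine() must catch those.
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9._-]+")
DRIVE_RE = re.compile(r"[A-Za-z]:")


class ArchiveError(Exception):
    """Stand-in for the project's error type (real definition not shown)."""


def validate_identifier(identifier: str) -> str:
    if not IDENTIFIER_RE.fullmatch(identifier):
        raise ArchiveError(f"invalid identifier: {identifier!r}")
    return identifier


def validate_filename(filename: str) -> str:
    if not filename or "\x00" in filename:
        raise ArchiveError("empty filename or NUL byte")
    if filename.startswith(("/", "\\")) or DRIVE_RE.match(filename):
        raise ArchiveError(f"absolute path or drive letter: {filename!r}")
    if ".." in re.split(r"[\\/]", filename):
        raise ArchiveError(f"'..' component in filename: {filename!r}")
    return filename


def confine(root: Path, identifier: str, filename: str) -> Path:
    """Resolve the destination and require it to stay under root."""
    root = root.resolve()
    dest = (root / validate_identifier(identifier)
            / validate_filename(filename)).resolve()
    if not dest.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ArchiveError(f"destination escapes download root: {dest}")
    return dest


def open_destination(dest: Path) -> int:
    """Open for writing; O_NOFOLLOW makes open() fail if dest is a symlink.

    The flag only guards the final path component and is missing on some
    platforms (e.g. Windows), where this sketch silently falls back to 0.
    """
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_NOFOLLOW", 0)
    return os.open(dest, flags, 0o644)
```

In this sketch resolve() follows symlinks before the containment check, so a
symlinked directory pointing outside the root is rejected, while one pointing
back inside it is not; the real change may be stricter.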

Usability:
- Wrap raise_for_status everywhere in an ArchiveError that includes a preview
  of the response body, so 4xx Solr errors now tell you what's wrong
- URL-encode filenames in download URLs (handles spaces and special chars)
- Map archive.org's {"error": ...} payloads on /metadata/{id}/files to
  ArchiveError with the server's message
- Lazy-resolve download root so env-var changes after import are honored
- Refactor item_resource to a shared async helper (drops .fn type-ignore)
- Rename result key 'bytes' -> 'bytes_written' (avoids shadowing the builtin)
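
As a rough illustration of the URL-encoding and error-surfacing items above,
the following sketch uses only the standard library. The function names, the
200-character preview limit, and the ArchiveError stand-in are assumptions
made for the example; the actual client wraps HTTP responses and is not
reproduced here.

```python
from urllib.parse import quote


class ArchiveError(Exception):
    """Stand-in for the project's error type (real definition not shown)."""


def build_download_url(identifier: str, filename: str) -> str:
    # Keep "/" so files in subdirectories still map to nested URL paths;
    # spaces, "#", "?" and similar characters are percent-encoded.
    return (
        "https://archive.org/download/"
        f"{quote(identifier, safe='')}/{quote(filename, safe='/')}"
    )


def error_with_preview(status_code: int, body: str, limit: int = 200) -> ArchiveError:
    # Hypothetical helper: attach the start of the response body to the error.
    return ArchiveError(f"HTTP {status_code}: {body[:limit]}")


def check_error_payload(payload: object) -> object:
    # Surface an {"error": ...} payload as ArchiveError with the server's message.
    if isinstance(payload, dict) and "error" in payload:
        raise ArchiveError(f"archive.org error: {payload['error']}")
    return payload
```

For example, build_download_url("item", "dir/my file#1.txt") yields
https://archive.org/download/item/dir/my%20file%231.txt.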

Tests:
- New tests/test_client_mocked.py: 29 regression tests using
  httpx.MockTransport covering every Hamilton finding above (path traversal,
  symlink refusal, Range-ignored, Content-Range mismatch, error body
  surfacing, malformed JSON, dark items, etc.)
- Set asyncio_mode = "auto" in pyproject for cleaner test markers

33/33 tests pass (4 live + 29 mocked), ruff clean.
2026-04-21 15:34:30 -06:00
5265a6440b Initial mcarchive-org MCP server
FastMCP server wrapping archive.org's public read APIs:
- search_items / scrape_items: advanced search + bulk cursor pagination
- get_item_metadata / list_files: progressive disclosure with filtering
- get_file_url / download_file: canonical URLs and streaming downloads
  with HTTP Range resume + optional MD5 verification
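
The resume and verification flow might look something like the sketch below,
which also includes the Range-ignored check described in the later hardening
commit. It is a hedged approximation: range_header, check_resume_response,
md5_of, and the ArchiveError stand-in are hypothetical names, and the network
and streaming code is left out so the logic stays self-contained.

```python
import hashlib
import re
from typing import Dict, Optional


class ArchiveError(Exception):
    """Stand-in for the project's error type (real definition not shown)."""


CONTENT_RANGE_RE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def range_header(offset: int) -> Dict[str, str]:
    # offset = size of the partial file already on disk; 0 means a fresh download.
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def check_resume_response(
    offset: int, status_code: int, content_range: Optional[str]
) -> None:
    """Raise before any bytes are written if the server ignored the Range."""
    if offset == 0:
        return  # nothing was resumed, so a plain 200 is expected
    if status_code != 206:
        raise ArchiveError(
            f"resume at byte {offset} requested but server returned {status_code}"
        )
    match = CONTENT_RANGE_RE.fullmatch((content_range or "").strip())
    if match is None or int(match.group(1)) != offset:
        raise ArchiveError(
            f"Content-Range {content_range!r} does not start at byte {offset}"
        )


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Hash in chunks so large downloads are never loaded into memory at once.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Appending a full 200 body to a partial file would duplicate the bytes already
on disk, which is why the check has to run before the first write.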

Smoke-tested end-to-end via claude -p headless MCP and pytest against
live archive.org endpoints.
2026-04-21 09:41:20 -06:00