mcarchive-org/CHANGELOG.md
Ryan Malloy 52a2be7cc6
Release prep: CHANGELOG, CI workflow, Gitea project URLs
- CHANGELOG.md documents the 2026.04.21 initial release: full tool
  inventory, every reliability claim, and test count (66/66 green).
- .github/workflows/ci.yml runs ruff check + pytest -m 'not network'
  across Python 3.10/3.11/3.12/3.13 on push and PR. Skips live archive.org
  tests in CI to keep runs fast and avoid hammering archive.org.
- pyproject.toml [project.urls]: point Homepage / Repository / Bug Tracker
  / Changelog at git.supported.systems/rsp2k/mcarchive-org. Keep the
  archive.org developer docs link for context.
2026-04-21 21:20:56 -06:00

# Changelog
Versioning is date-based: `YYYY.MM.DD` for normal releases, `YYYY.MM.DD.N` for same-day fixes (under PEP 440 the extra release segment `N` sorts after the base `YYYY.MM.DD` version).
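
As an illustration of the scheme, here is a minimal sketch that parses and validates such a version string. `parse_version` is a hypothetical helper written for this example, not something the project is known to ship.

```python
import re
from datetime import date

# Hypothetical helper for illustration only.
# Matches YYYY.MM.DD with an optional .N suffix for same-day fixes.
_VERSION_RE = re.compile(r"(\d{4})\.(\d{2})\.(\d{2})(?:\.(\d+))?")


def parse_version(version: str) -> tuple[date, int]:
    """Return (release_date, fix_number); fix_number 0 means a normal release."""
    match = _VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"not a date-based version: {version!r}")
    year, month, day, fix = match.groups()
    # date() itself raises ValueError for impossible dates such as 2026.02.30.
    return date(int(year), int(month), int(day)), int(fix or 0)


print(parse_version("2026.04.21"))    # normal release
print(parse_version("2026.04.21.1"))  # first same-day fix
```

Because the result is a `(date, int)` tuple, ordinary tuple comparison orders a same-day fix after the release it corrects.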
## 2026.04.21 — initial release
First public release. An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the [Internet Archive](https://archive.org). No API key required.
### Tools
- `search_items` — Solr-style search via `advancedsearch.php` (1200 rows, paginated)
- `scrape_items` — bulk cursor-paginated search via the Scrape API (count ≥ 100)
- `get_item_metadata` — item metadata; skips the (potentially huge) files list by default
- `list_files` — files array with optional format / fnmatch glob filtering, includes pre-built `download_url` per file
- `get_file_url` — build a canonical download URL without hitting the network
- `download_file` — stream a file to disk with HTTP Range resume + optional MD5 verification
- `get_download_root` — report current download root and its source (env var vs default)
- `set_download_root` — change the download root mid-session (useful in stdio mode where env vars can't be re-exported)

Plus an MCP resource template: `archive://item/{identifier}`.
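
To make the "without hitting the network" point concrete, here is a rough sketch of how a download URL can be built from an identifier and a filename alone. It assumes the public `https://archive.org/download/<identifier>/<filename>` layout; `build_download_url` and the sample identifier are made up for this example, and the project's own `get_file_url` may differ in its details.

```python
from urllib.parse import quote


def build_download_url(identifier: str, filename: str) -> str:
    """Build a download URL by string formatting only; no request is made."""
    # Hypothetical helper. "/" is kept literal in the filename because files
    # inside an item can live in subdirectories.
    return (
        "https://archive.org/download/"
        f"{quote(identifier, safe='')}/{quote(filename, safe='/')}"
    )


# "example-item" is a placeholder identifier, not a real item.
print(build_download_url("example-item", "disc 1/track 01.mp3"))
# https://archive.org/download/example-item/disc%201/track%2001.mp3
```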
### Reliability features
- **Input validation**: identifiers must match `^[A-Za-z0-9._-]+$`; filenames containing `..` components, absolute paths, NUL bytes, or Windows drive letters are rejected before any filesystem or network I/O
- **Path confinement**: download destinations are resolved and asserted to live under `MCARCHIVE_DOWNLOAD_ROOT`; symlinks at the destination are refused
- **`O_NOFOLLOW`**: defense-in-depth against symlink-substitution races on the destination file
- **Range-correctness check**: when resuming, the server's response must be HTTP 206 with a matching `Content-Range` start byte — otherwise the download aborts before any byte is written, eliminating silent file corruption
- **Atomic write staging**: downloads write to `<dest>.part` and are renamed to `<dest>` only on successful completion (POSIX-atomic). Failed downloads leave only `.part`, never an empty `dest`
- **Already-complete short-circuit**: re-downloading an already-complete file skips the network entirely (still re-verifies MD5 if asked)
- **Retry with backoff**: 429/502/503/504 responses are retried up to 3 times with `Retry-After` honored (delta-seconds and HTTP-date forms), using exponential backoff with jitter, capped at 30s. Retries happen *before* any bytes are yielded, so a retry can never corrupt a partial write
- **Concurrent-download serialization**: per-`(identifier, filename)` `asyncio.Lock` prevents two parallel calls from racing on the same destination file. Different files still download in parallel
- **Stream-abort surfacing**: `httpx.ReadError`/`RemoteProtocolError`/`ConnectError`/`ReadTimeout` mid-stream are caught and re-raised as `ArchiveError` with byte-count context so the caller knows where the partial download ended
- **Error body surfacing**: 4xx/5xx responses include a body preview in the exception message — invaluable for an LLM trying to fix a bad query
- **Process-wide shared `httpx.AsyncClient`**: one connection pool reused across the server's lifetime (no TCP+TLS handshake per tool call)
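
Two of the checks above, input validation and the Range-correctness check, can be sketched in a few lines of standard-library Python. This is an illustrative approximation only: the function names are hypothetical, the sketch raises `ValueError` rather than the project's `ArchiveError`, and only the identifier pattern is taken verbatim from this changelog.

```python
import re
from typing import Optional

IDENTIFIER_RE = re.compile(r"^[A-Za-z0-9._-]+$")  # pattern quoted above


def validate_identifier(identifier: str) -> str:
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    if not IDENTIFIER_RE.fullmatch(identifier):
        raise ValueError(f"invalid identifier: {identifier!r}")
    return identifier


def validate_filename(filename: str) -> str:
    if not filename or "\x00" in filename:
        raise ValueError("empty filename or NUL byte")
    if filename.startswith(("/", "\\")) or re.match(r"[A-Za-z]:", filename):
        raise ValueError("absolute paths and drive letters are not allowed")
    # Split on both separator styles so "a\\..\\b" is caught as well.
    if ".." in re.split(r"[\\/]", filename):
        raise ValueError("'..' components are not allowed")
    return filename


def check_resume_response(
    status_code: int, content_range: Optional[str], offset: int
) -> None:
    """Raise unless the response really resumes at `offset`."""
    if status_code != 206:
        raise ValueError(f"expected HTTP 206 for a resume, got {status_code}")
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", content_range or "")
    if match is None or int(match.group(1)) != offset:
        raise ValueError(f"Content-Range {content_range!r} does not start at {offset}")


validate_identifier("example-item")                   # placeholder identifier
validate_filename("disc 1/track 01.mp3")
check_resume_response(206, "bytes 100-199/200", 100)  # passes silently
```

Running the resume check before the first byte is appended matters because a server that ignores the `Range` header answers 200 with the whole file, and appending that body to a partial file would corrupt it.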
### Output normalization
- `collection` field is always `list[str]` (archive.org returns string OR list inconsistently)
- Every search doc / metadata response includes a derived `is_collection: bool` so LLMs can route collection containers vs. real media items without re-querying
- File entries always include a ready-to-use `download_url` plus `size_human` ("12.3 MB") alongside raw `size` in bytes
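
The normalization rules above could look roughly like the following sketch. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that `is_collection` is derived from `mediatype == "collection"`, and that `size_human` uses 1024-based units.

```python
def normalize_collection(value) -> list[str]:
    """Coerce the `collection` field (missing, str, or list) to list[str]."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]


def size_human(size: int) -> str:
    """Format a byte count such as 12897485 as "12.3 MB" (assumes 1024-based units)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(size)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{int(value)} B" if unit == "B" else f"{value:.1f} {unit}"
        value /= 1024
    raise AssertionError("unreachable: the last unit always returns")


def normalize_doc(doc: dict) -> dict:
    """Return a copy of a search doc with the normalized and derived fields."""
    out = dict(doc)
    out["collection"] = normalize_collection(doc.get("collection"))
    # Assumption: collection containers are identified by their mediatype.
    out["is_collection"] = doc.get("mediatype") == "collection"
    return out


print(normalize_doc({"identifier": "example-item", "collection": "opensource"}))
print(size_human(12897485))
```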
### Tests
66 tests total (4 live integration against archive.org + 62 mock-transport regression tests). Mock tests cover every reliability claim above so future refactors can't silently regress safety.