mcarchive-org/CHANGELOG.md
Ryan Malloy a3c7b69ba8
Some checks failed
CI / test (3.10) (push) Has been cancelled
CI / test (3.11) (push) Has been cancelled
CI / test (3.12) (push) Has been cancelled
CI / test (3.13) (push) Has been cancelled
Release 2026.4.21.1: refresh project URLs after warehack.ing transfer
PyPI metadata is immutable per version, so this post-release exists solely
to refresh the [project.urls] block: Homepage / Repository / Bug Tracker /
Changelog now point at git.supported.systems/warehack.ing/mcarchive-org
(the new canonical home after the org transfer).

No code changes. Same wheel contents as 2026.4.21, only METADATA URLs
differ.
2026-04-21 22:17:50 -06:00

49 lines
4.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Changelog
Versioning is date-based: `YYYY.MM.DD` for normal releases, `YYYY.MM.DD.N` (PEP 440 post-release) for same-day fixes.
## 2026.4.21.1 — metadata refresh
Project URLs in package metadata updated to point at the new canonical home: `git.supported.systems/warehack.ing/mcarchive-org`. No code changes — same wheel contents, just refreshed `Project-URL` fields. PyPI metadata is immutable per version, hence the post-release bump rather than an in-place edit.
## 2026.04.21 — initial release
First public release. An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the [Internet Archive](https://archive.org). No API key required.
### Tools
- `search_items` — Solr-style search via `advancedsearch.php` (1200 rows, paginated)
- `scrape_items` — bulk cursor-paginated search via the Scrape API (count ≥ 100)
- `get_item_metadata` — item metadata; skips the (potentially huge) files list by default
- `list_files` — files array with optional format / fnmatch glob filtering, includes pre-built `download_url` per file
- `get_file_url` — build a canonical download URL without hitting the network
- `download_file` — stream a file to disk with HTTP Range resume + optional MD5 verification
- `get_download_root` — report current download root and its source (env var vs default)
- `set_download_root` — change the download root mid-session (useful in stdio mode where env vars can't be re-exported)
Plus an MCP resource template: `archive://item/{identifier}`.
### Reliability features
- **Input validation**: identifiers must match `^[A-Za-z0-9._-]+$`; filenames reject `..` components, absolute paths, NUL bytes, and Windows drive letters before any FS or network I/O
- **Path confinement**: download destinations are resolved and asserted to live under `MCARCHIVE_DOWNLOAD_ROOT`; symlinks at the destination are refused
- **`O_NOFOLLOW`**: defense-in-depth against symlink-substitution races on the destination file
- **Range-correctness check**: when resuming, the server's response must be HTTP 206 with a matching `Content-Range` start byte — otherwise the download aborts before any byte is written, eliminating silent file corruption
- **Atomic write staging**: downloads write to `<dest>.part` and are renamed to `<dest>` only on successful completion (POSIX-atomic). Failed downloads leave only `.part`, never an empty `dest`
- **Already-complete short-circuit**: re-downloading an already-complete file skips the network entirely (still re-verifies MD5 if asked)
- **Retry with backoff**: 429/502/503/504 retried up to 3 times with `Retry-After` honored (delta-seconds and HTTP-date forms), exponential backoff with jitter, capped at 30s. Retries happen *before* any bytes are yielded, so retry can never corrupt a partial write
- **Concurrent-download serialization**: per-`(identifier, filename)` `asyncio.Lock` prevents two parallel calls from racing on the same destination file. Different files still download in parallel
- **Stream-abort surfacing**: `httpx.ReadError`/`RemoteProtocolError`/`ConnectError`/`ReadTimeout` mid-stream are caught and re-raised as `ArchiveError` with a byte-count context so the caller knows where the partial download ended
- **Error body surfacing**: 4xx/5xx responses include a body preview in the exception message — invaluable for an LLM trying to fix a bad query
- **Process-wide shared `httpx.AsyncClient`**: one connection pool reused across the server's lifetime (no TCP+TLS handshake per tool call)
### Output normalization
- `collection` field is always `list[str]` (archive.org returns string OR list inconsistently)
- Every search doc / metadata response includes a derived `is_collection: bool` so LLMs can route collection containers vs. real media items without re-querying
- File entries always include a ready-to-use `download_url` plus `size_human` ("12.3 MB") alongside raw `size` in bytes
### Tests
66 tests total (4 live integration against archive.org + 62 mock-transport regression tests). Mock tests cover every reliability claim above so future refactors can't silently regress safety.