Ryan Malloy 6198defeca Resilience: address Hamilton tier-2 findings
H7 — Process-wide shared httpx.AsyncClient via get_shared_client().
Tool calls no longer each pay a TCP+TLS handshake; the connection pool is
reused across the server's lifetime. Tests inject mock transports
directly via ArchiveClient(transport=...) so the singleton stays clean.

M1 — Retry/backoff on 429/502/503/504 with Retry-After honored
(both delta-seconds and HTTP-date forms). Exponential backoff with
jitter, capped at 30s, max 3 attempts. Applied to both _fetch_json
and stream_file (retry happens BEFORE any bytes are yielded so it
can't corrupt a partial write).
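
A minimal sketch of the delay calculation described in M1, not the project's actual code. The names parse_retry_after and retry_delay, the 1-second base, and the choice to cap a server-supplied Retry-After at the same 30s are assumptions made for illustration.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

RETRY_STATUSES = {429, 502, 503, 504}  # statuses named in the commit message
MAX_ATTEMPTS = 3
MAX_DELAY = 30.0  # seconds


def parse_retry_after(value, now=None):
    """Return the delay in seconds asked for by a Retry-After header, or None."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable header: fall back to plain backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def retry_delay(attempt, retry_after=None, base=1.0):
    """Delay before retry number `attempt` (1-based), never above MAX_DELAY."""
    hinted = parse_retry_after(retry_after)
    if hinted is not None:
        return min(hinted, MAX_DELAY)  # assumption: the cap also applies to hints
    backoff = base * 2 ** (attempt - 1)
    # Additive jitter spreads out clients that failed at the same moment.
    return min(MAX_DELAY, backoff + random.uniform(0, backoff))
```

A caller would loop up to MAX_ATTEMPTS times, sleeping for retry_delay(attempt, response_header) whenever the status is in RETRY_STATUSES, and would do so before yielding any bytes.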

M2 — Per-(identifier, filename) asyncio.Lock in download_file
serializes concurrent downloads of the same file inside one process.
Different files still download in parallel.
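
The locking scheme in M2 can be sketched as follows. This is an illustration only: the module-level _locks dict and the log parameter are hypothetical, and the sleep stands in for the real streaming work.

```python
import asyncio
from collections import defaultdict

# One lock per (identifier, filename) pair, created lazily on first use.
_locks = defaultdict(asyncio.Lock)


async def download_file(identifier, filename, log):
    """Record start/end events; same-file calls run one at a time."""
    async with _locks[(identifier, filename)]:
        log.append(("start", filename))
        await asyncio.sleep(0.01)  # placeholder for streaming the file to disk
        log.append(("end", filename))
```

Because the key includes the filename, two calls for different files take different locks and overlap freely, while two calls for the same file queue behind each other.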

M5 — collection field normalized to list[str] in all output paths
(search docs, scrape items, item metadata). LLMs can write
`if 'foo' in doc['collection']` without checking the type first.
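
A sketch of the normalization in M5, assuming the upstream value may arrive as a missing field, a single string, or a list. The helper name normalize_collection is hypothetical.

```python
def normalize_collection(value):
    """Coerce a raw `collection` field into a list of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string would otherwise make `in` do substring matching.
        return [value] if value else []
    return [str(v) for v in value]
```

Without this step, `'foo' in doc['collection']` on a string value such as 'foobar' would be a substring test and return True by accident.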

M7 — `is_collection: bool` derived from mediatype on every doc /
metadata response, so LLMs can route collection containers vs.
real media items without re-querying.
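
A sketch of the derived flag in M7. It assumes that a mediatype of "collection" is what marks a container; the helper name with_is_collection is hypothetical.

```python
def with_is_collection(doc):
    """Return a copy of `doc` with a boolean `is_collection` field added."""
    out = dict(doc)
    # Assumption: containers carry mediatype == "collection".
    out["is_collection"] = out.get("mediatype") == "collection"
    return out
```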

H1 — Stream-abort errors (httpx.ReadError, RemoteProtocolError,
ConnectError, ReadTimeout) caught and re-raised as ArchiveError
with bytes-written context so the caller knows where the partial
download ended. Bytes already on disk remain valid for resume.
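
The error wrapping in H1 can be sketched with the standard library alone. This version is synchronous for brevity, uses ConnectionError and TimeoutError as stand-ins for the httpx exceptions listed above, and assumes a shape for ArchiveError (a bytes_written attribute) that the commit message does not spell out.

```python
class ArchiveError(Exception):
    """Hypothetical shape: carries how many bytes reached disk before the abort."""

    def __init__(self, message, bytes_written):
        super().__init__(f"{message} (after {bytes_written} bytes)")
        self.bytes_written = bytes_written


# Stand-ins for httpx.ReadError, RemoteProtocolError, ConnectError, ReadTimeout.
STREAM_ERRORS = (ConnectionError, TimeoutError)


def write_stream(chunks, out):
    """Write chunks to `out`; on a stream abort, re-raise with the byte count."""
    written = 0
    try:
        for chunk in chunks:
            out.write(chunk)
            written += len(chunk)
    except STREAM_ERRORS as exc:
        # Bytes already written stay in place, so a later call can resume from them.
        raise ArchiveError(f"stream aborted: {exc!r}", written) from exc
    return written
```

A caller that catches ArchiveError can pass bytes_written along as the offset for a resumed download.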

19 new regression tests (52 total, all green, ruff clean):
- 4 tests covering retry/backoff, exhaustion, HTTP-date Retry-After
- 1 test for stream-abort byte-count surfacing
- 6 tests for collection normalization shapes
- 4 tests for is_collection in real tool flow + shared client lifecycle
- 2 tests verifying download lock: same-file serialized, different files parallel
2026-04-21 20:24:21 -06:00
2026-04-21 09:41:20 -06:00

mcarchive-org

An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the Internet Archive.

Built on FastMCP + httpx. No API key required — archive.org's read endpoints are public.

Tools

Tool               Purpose
search_items       Small Solr-style search via advancedsearch.php (1200 rows, paginated)
scrape_items       Bulk cursor-paginated search via Scrape API (count ≥ 100)
get_item_metadata  Metadata for one item; skips the (possibly huge) files list by default
list_files         Files array with optional format / glob filtering; includes download_url per file
get_file_url       Build a canonical download URL without hitting the network
download_file      Stream a file to disk with resume support and optional MD5 verification

Also exposes an MCP resource template: archive://item/{identifier}.
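
As an illustration of what get_file_url can do without a network call, here is a minimal sketch. It assumes the public https://archive.org/download/{identifier}/{filename} URL pattern; the signature and the quoting choices are guesses, not the tool's confirmed behavior.

```python
from urllib.parse import quote


def get_file_url(identifier, filename):
    """Build a download URL by string formatting only (no HTTP request)."""
    # Slashes in `filename` are kept so files in subdirectories still resolve.
    return (
        "https://archive.org/download/"
        f"{quote(identifier, safe='')}/{quote(filename)}"
    )
```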

Install & run

# From a checkout:
uv sync
uv run mcarchive-org

# Or from PyPI (once published):
uvx mcarchive-org

Register with Claude Code:

claude mcp add archive-org -- uvx mcarchive-org
# or, from a local checkout:
claude mcp add archive-org -- uv run --directory /path/to/mcarchive-org mcarchive-org

Environment

Variable                 Default      Purpose
MCARCHIVE_DOWNLOAD_ROOT  ./downloads  Base directory for download_file
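
A sketch of how a download path might be derived from this variable. The root/identifier/filename layout matches the path shown in the example flow; the helper name resolve_download_path and the path-traversal check are assumptions added for illustration.

```python
import os
from pathlib import Path


def resolve_download_path(identifier, filename, env=None):
    """Return the absolute target path under the configured download root."""
    env = os.environ if env is None else env
    root = Path(env.get("MCARCHIVE_DOWNLOAD_ROOT", "./downloads")).resolve()
    target = (root / identifier / filename).resolve()
    # Assumed safety check: refuse names like "../../x" that leave the root.
    if root not in target.parents:
        raise ValueError(f"{filename!r} escapes the download root")
    return target
```

Passing env explicitly keeps the function easy to test without touching the process environment.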

Example flow

search_items(query='mediatype:audio AND creator:"Grateful Dead"', sort=['downloads desc'])
  → identifier 'gd77-05-08.sbd.hicks.4982.sbeok.shnf' (among others)

list_files(identifier='gd77-05-08.sbd.hicks.4982.sbeok.shnf', formats=['VBR MP3'])
  → [{ name: 'gd1977-05-08d1t01.mp3', size: 6342912, md5: '…', download_url: '…' }, …]

download_file(identifier='gd77-…', filename='gd1977-05-08d1t01.mp3', verify_md5='…')
  → { path: './downloads/gd77-…/gd1977-…mp3', bytes: 6342912, md5_ok: True }
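
The md5_ok field in the last step could be computed along these lines. This is a sketch under the assumption that verification hashes the finished file in chunks; the helper name md5_matches is hypothetical.

```python
import hashlib


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks and compare against the expected hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Chunked reads keep memory flat even for multi-gigabyte files.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.strip().lower()
```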

Query syntax notes

archive.org uses a Solr/Lucene dialect:

  • mediatype:(audio OR movies) — restrict to media types
  • collection:etree — items in a specific collection
  • date:[1977-01-01 TO 1977-12-31] — date ranges
  • creator:"Grateful Dead" — phrase match
  • -subject:bootleg — exclusion
  • Sort by downloads desc, date asc, addeddate desc, etc.

See archive.org's search docs for the full grammar.
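
To show how the clauses above combine, here is a sketch that joins them with AND and builds an advancedsearch.php request URL. The helper name build_search_url and its defaults are hypothetical; the parameter names (q, fl[], sort[], rows, page, output) are those of archive.org's advanced search endpoint, and nothing here claims to match what search_items sends.

```python
from urllib.parse import urlencode


def build_search_url(clauses, sort=(), rows=50, page=1,
                     fields=("identifier", "title")):
    """Join query clauses with AND and encode them as a search URL."""
    params = [("q", " AND ".join(clauses))]
    params += [("fl[]", field) for field in fields]   # fields to return
    params += [("sort[]", key) for key in sort]       # e.g. "downloads desc"
    params += [("rows", rows), ("page", page), ("output", "json")]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


url = build_search_url(
    ["collection:etree", "date:[1977-01-01 TO 1977-12-31]", "-subject:bootleg"],
    sort=["downloads desc"],
)
```

urlencode takes care of the brackets, colons, and spaces, so the clauses can be written exactly as they appear in the list above.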

License

MIT

Description
MCP server for searching and downloading files from the Internet Archive (archive.org)