Critical fixes:
- Validate identifier (^[A-Za-z0-9._-]+$) and filename (no '..', absolute
paths, NUL bytes, drive letters) at the client boundary
- Confine download destinations under MCARCHIVE_DOWNLOAD_ROOT via
Path.resolve() + is_relative_to() check; reject symlinked dirs
- Use O_NOFOLLOW on the destination open() to refuse symlink substitution
- Detect Range-ignored responses: if resume requested but server returns 200
(or 206 with wrong Content-Range start), raise ArchiveError BEFORE writing
any bytes — closes the silent file-corruption hole
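The Range-ignored check described above could look roughly like the following. This is a minimal sketch under assumptions, not the project's actual code: `check_resume_response` and its parameters are hypothetical names, and `ArchiveError` is redefined locally as a stand-in for the real exception class.

```python
import re
from typing import Optional


class ArchiveError(Exception):
    """Local stand-in for the project's error type (assumption)."""


# Matches e.g. "bytes 100-199/200" or "bytes 100-199/*".
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def check_resume_response(
    status_code: int, content_range: Optional[str], offset: int
) -> None:
    """Raise ArchiveError unless the response really resumes at `offset`.

    Intended to run before any bytes are appended to the partial file.
    """
    if offset == 0:
        return  # fresh download, no Range header was sent
    if status_code == 200:
        raise ArchiveError("server ignored the Range header (200 instead of 206)")
    if status_code != 206:
        raise ArchiveError(f"unexpected status {status_code} for a ranged request")
    match = _CONTENT_RANGE.fullmatch((content_range or "").strip())
    if match is None:
        raise ArchiveError(f"missing or malformed Content-Range: {content_range!r}")
    start = int(match.group(1))
    if start != offset:
        raise ArchiveError(f"Content-Range starts at {start}, expected {offset}")
```

Keeping the check as a pure function of status code, header value, and offset makes it easy to exercise without a network connection.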
Usability:
- Wrap raise_for_status everywhere with ArchiveError that includes the
response body preview — 4xx Solr errors now tell you what's wrong
- URL-encode filenames in download URLs (handles spaces and special chars)
- Map archive.org's {"error": ...} payloads on /metadata/{id}/files to
ArchiveError with the server's message
- Lazy-resolve download root so env-var changes after import are honored
- Refactor item_resource to a shared async helper (drops .fn type-ignore)
- Rename result key 'bytes' -> 'bytes_written' (avoids shadowing builtin)
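As an illustration of the URL-encoding item above, here is a minimal sketch using only the standard library. The helper name `build_download_url` is hypothetical, and the `https://archive.org/download/{identifier}/{filename}` layout is assumed to be what the project targets.

```python
from urllib.parse import quote

# Assumed base URL for archive.org file downloads.
DOWNLOAD_BASE = "https://archive.org/download"


def build_download_url(identifier: str, filename: str) -> str:
    """Percent-encode both path components of a download URL."""
    # safe="/" keeps sub-directory separators inside the item intact,
    # while spaces, '#', '?' and similar characters are escaped.
    return f"{DOWNLOAD_BASE}/{quote(identifier, safe='')}/{quote(filename, safe='/')}"


print(build_download_url("some-item", "disc 1/track #01.mp3"))
# https://archive.org/download/some-item/disc%201/track%20%2301.mp3
```

Without the encoding step, a `#` in a filename would be read as a URL fragment and the request would silently target a different path.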
Tests:
- New tests/test_client_mocked.py: 29 regression tests using
httpx.MockTransport covering every Hamilton finding above (path traversal,
symlink refusal, Range-ignored, Content-Range mismatch, error body
surfacing, malformed JSON, dark items, etc.)
- Set asyncio_mode = "auto" in pyproject for cleaner test markers
33/33 tests pass (4 live + 29 mocked), ruff clean.
mcarchive-org
An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the Internet Archive.
Built on FastMCP + httpx. No API key required — archive.org's read endpoints are public.
Tools
| Tool | Purpose |
|---|---|
| `search_items` | Small Solr-style search via `advancedsearch.php` (1–200 rows, paginated) |
| `scrape_items` | Bulk cursor-paginated search via Scrape API (count ≥ 100) |
| `get_item_metadata` | Metadata for one item; skips the (possibly huge) files list by default |
| `list_files` | Files array with optional format / glob filtering; includes `download_url` per file |
| `get_file_url` | Build a canonical download URL without hitting the network |
| `download_file` | Stream a file to disk with resume support and optional MD5 verification |
Also exposes an MCP resource template: archive://item/{identifier}.
Install & run
# From a checkout:
uv sync
uv run mcarchive-org
# Or from PyPI (once published):
uvx mcarchive-org
Register with Claude Code:
claude mcp add archive-org -- uvx mcarchive-org
# or, from a local checkout:
claude mcp add archive-org -- uv run --directory /path/to/mcarchive-org mcarchive-org
Environment
| Variable | Default | Purpose |
|---|---|---|
| `MCARCHIVE_DOWNLOAD_ROOT` | `./downloads` | Base directory for `download_file` |
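To make the role of the download root concrete, the sketch below shows one way a destination path could be validated and confined beneath it. The function names `download_root` and `safe_destination` are hypothetical and the checks are an approximation of the behaviour described in the change notes, not the project's actual implementation. It needs Python 3.9+ for `Path.is_relative_to()`.

```python
import os
import re
from pathlib import Path, PureWindowsPath

_IDENTIFIER_RE = re.compile(r"[A-Za-z0-9._-]+")


def download_root() -> Path:
    """Resolve the root on every call so env-var changes after import are honored."""
    return Path(os.environ.get("MCARCHIVE_DOWNLOAD_ROOT", "./downloads")).resolve()


def safe_destination(identifier: str, filename: str) -> Path:
    """Return an absolute path under the download root, or raise ValueError."""
    # "." and ".." match the character class, so they are rejected explicitly.
    if identifier in {".", ".."} or not _IDENTIFIER_RE.fullmatch(identifier):
        raise ValueError(f"invalid identifier: {identifier!r}")
    if not filename or "\x00" in filename:
        raise ValueError("filename is empty or contains a NUL byte")
    # PureWindowsPath splits on both '/' and '\\' and recognises drive letters,
    # so one parse covers POSIX-style and Windows-style tricks on any host OS.
    parsed = PureWindowsPath(filename)
    if parsed.anchor or ".." in parsed.parts:
        raise ValueError(f"unsafe filename: {filename!r}")
    root = download_root()
    # resolve() follows symlinks, so a symlinked directory that points outside
    # the root fails the containment check below.
    dest = (root / identifier / filename).resolve()
    if not dest.is_relative_to(root):
        raise ValueError(f"destination escapes download root: {dest}")
    return dest
```

Resolving the path and then checking containment is a second line of defence: even if a string check misses something, the final path still has to sit under the root.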
Example flow
search_items(query='mediatype:audio AND creator:"Grateful Dead"', sort=['downloads desc'])
→ identifier 'gd77-05-08.sbd.hicks.4982.sbeok.shnf' (among others)
list_files(identifier='gd77-05-08.sbd.hicks.4982.sbeok.shnf', formats=['VBR MP3'])
→ [{ name: 'gd1977-05-08d1t01.mp3', size: 6342912, md5: '…', download_url: '…' }, …]
download_file(identifier='gd77-…', filename='gd1977-05-08d1t01.mp3', verify_md5='…')
→ { path: './downloads/gd77-…/gd1977-…mp3', bytes_written: 6342912, md5_ok: True }
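The `md5_ok` flag in the result above implies a checksum comparison after the download. A minimal sketch of such a check is shown here. `md5_matches` is a hypothetical helper, and the real tool may compute the digest while streaming rather than re-reading the file.

```python
import hashlib
from pathlib import Path


def md5_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks and compare against the expected hex digest."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

The expected value would typically come from the `md5` field that `list_files` returns for each file.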
Query syntax notes
archive.org uses a Solr/Lucene dialect:
- `mediatype:(audio OR movies)`: restrict to media types
- `collection:etree`: items in a specific collection
- `date:[1977-01-01 TO 1977-12-31]`: date ranges
- `creator:"Grateful Dead"`: phrase match
- `-subject:bootleg`: exclusion
- Sort by `downloads desc`, `date asc`, `addeddate desc`, etc.
See archive.org's search docs for the full grammar.
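To show how such a query ends up on the wire, the sketch below assembles an `advancedsearch.php` URL with the standard library. `build_search_url` is a hypothetical helper, not part of this package, and the parameter names (`q`, `fl[]`, `sort[]`, `rows`, `page`, `output`) are assumed from archive.org's public search form.

```python
from typing import Sequence
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(
    query: str,
    sort: Sequence[str] = (),
    rows: int = 50,
    page: int = 1,
    fields: Sequence[str] = ("identifier",),
) -> str:
    """Build a search URL; rows is limited to 1-200 as in the tool table."""
    if not 1 <= rows <= 200:
        raise ValueError("rows must be between 1 and 200")
    params = [("q", query), ("rows", rows), ("page", page), ("output", "json")]
    # Repeated keys express list-valued parameters.
    params += [("fl[]", name) for name in fields]
    params += [("sort[]", clause) for clause in sort]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url(
    'mediatype:audio AND creator:"Grateful Dead"', sort=["downloads desc"]
)
```

`urlencode` takes care of quoting the colons, quotes, and spaces that Lucene syntax relies on, so the query string can be written exactly as in the examples above.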
License
MIT