mcarchive-org/README.md
Ryan Malloy 5265a6440b Initial mcarchive-org MCP server
FastMCP server wrapping archive.org's public read APIs:
- search_items / scrape_items: advanced search + bulk cursor pagination
- get_item_metadata / list_files: progressive disclosure with filtering
- get_file_url / download_file: canonical URLs and streaming downloads
  with HTTP Range resume + optional MD5 verification

Smoke-tested end-to-end via claude -p headless MCP and pytest against
live archive.org endpoints.
2026-04-21 09:41:20 -06:00


# mcarchive-org
An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the [Internet Archive](https://archive.org).
Built on [FastMCP](https://gofastmcp.com) + [httpx](https://www.python-httpx.org/). No API key required — archive.org's read endpoints are public.
## Tools
| Tool | Purpose |
|------|---------|
| `search_items` | Small Solr-style search via `advancedsearch.php` (1200 rows, paginated) |
| `scrape_items` | Bulk cursor-paginated search via Scrape API (count ≥ 100) |
| `get_item_metadata` | Metadata for one item; skips the (possibly huge) files list by default |
| `list_files` | Files array with optional format / glob filtering — includes `download_url` per file |
| `get_file_url` | Build a canonical download URL without hitting the network |
| `download_file` | Stream a file to disk with resume support and optional MD5 verification |
The server also exposes an MCP resource template: `archive://item/{identifier}`.
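
As a rough illustration of what "without hitting the network" means for `get_file_url`, the sketch below assembles a download URL purely from strings. It is not the server's code: the helper name `build_download_url` is hypothetical, and it assumes archive.org's public `https://archive.org/download/{identifier}/{filename}` layout, with the file path percent-encoded.

```python
from urllib.parse import quote

ARCHIVE_DOWNLOAD_BASE = "https://archive.org/download"


def build_download_url(identifier: str, filename: str) -> str:
    """Hypothetical helper: build a download URL with no network I/O."""
    # Leave "/" unescaped in the filename so files inside
    # subdirectories of an item keep their path structure.
    safe_identifier = quote(identifier, safe="")
    safe_filename = quote(filename, safe="/")
    return f"{ARCHIVE_DOWNLOAD_BASE}/{safe_identifier}/{safe_filename}"


print(build_download_url("gd77-05-08.sbd.hicks.4982.sbeok.shnf",
                         "gd1977-05-08d1t01.mp3"))
```

Because the URL is derived rather than fetched, this kind of helper cannot confirm that the file exists; `list_files` is the tool that reports what an item actually contains.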
## Install & run
```bash
# From a checkout:
uv sync
uv run mcarchive-org
# Or from PyPI (once published):
uvx mcarchive-org
```
Register with Claude Code:
```bash
claude mcp add archive-org -- uvx mcarchive-org
# or, from a local checkout:
claude mcp add archive-org -- uv run --directory /path/to/mcarchive-org mcarchive-org
```
## Environment
| Variable | Default | Purpose |
|----------|---------|---------|
| `MCARCHIVE_DOWNLOAD_ROOT` | `./downloads` | Base directory for `download_file` |
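
One plausible way a download path could be derived from this variable is sketched below. The function name `resolve_download_path` is hypothetical, the `<root>/<identifier>/<filename>` layout is inferred from the example flow in this README, and the path-traversal check is an assumption about sensible behavior rather than a documented feature of the server.

```python
import os
from pathlib import Path


def resolve_download_path(identifier: str, filename: str) -> Path:
    """Hypothetical helper: map (identifier, filename) under the download root."""
    # Fall back to the documented default when the variable is unset.
    root = Path(os.environ.get("MCARCHIVE_DOWNLOAD_ROOT", "./downloads")).resolve()
    target = (root / identifier / filename).resolve()
    # Assumed safeguard: reject names such as "../../etc/passwd"
    # that would resolve to a location outside the root.
    if not target.is_relative_to(root):
        raise ValueError(f"{filename!r} escapes the download root")
    return target


print(resolve_download_path("some-item", "track01.mp3"))
```

`Path.is_relative_to` requires Python 3.9 or newer.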
## Example flow
```
search_items(query='mediatype:audio AND creator:"Grateful Dead"', sort=['downloads desc'])
→ identifier 'gd77-05-08.sbd.hicks.4982.sbeok.shnf' (among others)
list_files(identifier='gd77-05-08.sbd.hicks.4982.sbeok.shnf', formats=['VBR MP3'])
→ [{ name: 'gd1977-05-08d1t01.mp3', size: 6342912, md5: '…', download_url: '…' }, …]
download_file(identifier='gd77-…', filename='gd1977-05-08d1t01.mp3', verify_md5='…')
→ { path: './downloads/gd77-…/gd1977-…mp3', bytes: 6342912, md5_ok: True }
```
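
The resume and verification steps behind `download_file` can be approximated as follows. This is a minimal sketch under stated assumptions, not the server's implementation: `resume_headers` and `md5_matches` are hypothetical names, resume is assumed to use a standard HTTP `Range: bytes=<offset>-` request header based on the size of the partial file, and the MD5 is computed in chunks so large files are never loaded into memory at once.

```python
import hashlib
from pathlib import Path


def resume_headers(partial: Path) -> dict[str, str]:
    """Hypothetical helper: request only the bytes not yet on disk."""
    offset = partial.stat().st_size if partial.exists() else 0
    # With nothing on disk, send no Range header and fetch the whole file.
    return {"Range": f"bytes={offset}-"} if offset else {}


def md5_matches(path: Path, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Hypothetical helper: compare a file's MD5 against an expected hex digest."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

A client using this approach would append the response body to the partial file when the server answers `206 Partial Content`, and start over when it answers `200 OK`, since that status means the range request was not honored.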
## Query syntax notes
archive.org uses a Solr/Lucene dialect:
- `mediatype:(audio OR movies)` — restrict to media types
- `collection:etree` — items in a specific collection
- `date:[1977-01-01 TO 1977-12-31]` — date ranges
- `creator:"Grateful Dead"` — phrase match
- `-subject:bootleg` — exclusion
- Sort by `downloads desc`, `date asc`, `addeddate desc`, etc.
See [archive.org's search docs](https://archive.org/advancedsearch.php) for the full grammar.
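
To show how a query like the ones above turns into a request, here is a small sketch that builds an `advancedsearch.php` URL. The function `build_search_url` and its defaults are hypothetical and may not match how `search_items` builds its requests; the query parameters (`q`, `fl[]`, `sort[]`, `rows`, `page`, `output`) are the ones the public endpoint accepts.

```python
from urllib.parse import urlencode

ADVANCED_SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, fields=("identifier",), sort=(), rows=50, page=1):
    """Hypothetical helper: encode a Lucene-style query as a search URL."""
    params = [("q", query)]
    # Repeated keys: one fl[] per returned field, one sort[] per sort clause.
    params += [("fl[]", field) for field in fields]
    params += [("sort[]", clause) for clause in sort]
    params += [("rows", rows), ("page", page), ("output", "json")]
    return f"{ADVANCED_SEARCH_URL}?{urlencode(params)}"


print(build_search_url(
    'collection:etree AND date:[1977-01-01 TO 1977-12-31]',
    fields=("identifier", "title"),
    sort=("downloads desc",),
))
```

Passing the parameters as a list of pairs, rather than a dict, is what allows `fl[]` and `sort[]` to appear more than once in the query string.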
## License
MIT