FastMCP server wrapping archive.org's public read APIs:

- search_items / scrape_items: advanced search + bulk cursor pagination
- get_item_metadata / list_files: progressive disclosure with filtering
- get_file_url / download_file: canonical URLs and streaming downloads with HTTP Range resume + optional MD5 verification

Smoke-tested end-to-end via claude -p headless MCP and pytest against live archive.org endpoints.
# mcarchive-org

An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the [Internet Archive](https://archive.org).

Built on [FastMCP](https://gofastmcp.com) + [httpx](https://www.python-httpx.org/). No API key required: archive.org's read endpoints are public.

## Tools
| Tool | Purpose |
|------|---------|
| `search_items` | Small Solr-style search via `advancedsearch.php` (1–200 rows, paginated) |
| `scrape_items` | Bulk cursor-paginated search via the Scrape API (count ≥ 100) |
| `get_item_metadata` | Metadata for one item; skips the (possibly huge) files list by default |
| `list_files` | Files array with optional format / glob filtering; includes `download_url` per file |
| `get_file_url` | Build a canonical download URL without hitting the network |
| `download_file` | Stream a file to disk with resume support and optional MD5 verification |

Also exposes an MCP resource template: `archive://item/{identifier}`.
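
To make the "bulk cursor-paginated" behaviour of `scrape_items` concrete, here is a minimal sketch of cursor pagination against archive.org's Scrape API. It is not this server's implementation: the names `scrape_all`, `Fetch`, and `httpx_fetch` are hypothetical, and the sketch assumes each response carries an `items` list plus a `cursor` field that is absent on the last page.

```python
from typing import Callable, Iterator, Optional

SCRAPE_URL = "https://archive.org/services/search/v1/scrape"

# A "fetch" callable takes query parameters and returns the decoded JSON body.
# Injecting it keeps the pagination logic testable without network access.
Fetch = Callable[[dict], dict]


def scrape_all(fetch: Fetch, query: str, fields: str = "identifier",
               count: int = 100, max_items: Optional[int] = None) -> Iterator[dict]:
    """Yield items page by page, following the cursor until it disappears."""
    if count < 100:
        raise ValueError("the Scrape API expects count >= 100")
    cursor = None
    yielded = 0
    while True:
        params = {"q": query, "fields": fields, "count": count}
        if cursor is not None:
            params["cursor"] = cursor  # resume where the previous page ended
        page = fetch(params)
        for item in page.get("items", []):
            yield item
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return
        cursor = page.get("cursor")
        if not cursor:  # no cursor in the response: this was the last page
            return


def httpx_fetch(params: dict) -> dict:
    """Live fetch using httpx (imported lazily so the sketch runs without it)."""
    import httpx

    response = httpx.get(SCRAPE_URL, params=params, timeout=30.0)
    response.raise_for_status()
    return response.json()
```

Something like `for item in scrape_all(httpx_fetch, "collection:etree", max_items=500): ...` would then walk the result set lazily, requesting a new page only when the previous one is exhausted.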
## Install & run

```bash
# From a checkout:
uv sync
uv run mcarchive-org

# Or from PyPI (once published):
uvx mcarchive-org
```

Register with Claude Code:

```bash
claude mcp add archive-org -- uvx mcarchive-org
# or, from a local checkout:
claude mcp add archive-org -- uv run --directory /path/to/mcarchive-org mcarchive-org
```
## Environment

| Variable | Default | Purpose |
|----------|---------|---------|
| `MCARCHIVE_DOWNLOAD_ROOT` | `./downloads` | Base directory for `download_file` |
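
As an illustration of how a base directory like this can be resolved, the sketch below reads `MCARCHIVE_DOWNLOAD_ROOT` with the documented `./downloads` fallback and maps a file to `<root>/<identifier>/<filename>`. The function names `download_root` and `target_path` are hypothetical, and the check that rejects paths escaping the root is an assumption about sensible behaviour, not something this README states the server does.

```python
import os
from pathlib import Path


def download_root() -> Path:
    """Return the base directory, honouring MCARCHIVE_DOWNLOAD_ROOT if set."""
    return Path(os.environ.get("MCARCHIVE_DOWNLOAD_ROOT", "./downloads")).resolve()


def target_path(identifier: str, filename: str) -> Path:
    """Map an item file to <root>/<identifier>/<filename>, refusing escapes."""
    root = download_root()
    candidate = (root / identifier / filename).resolve()
    # Assumed safety check: a filename such as "../../x" must not leave the root.
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{filename!r} would be written outside {root}")
    return candidate
```

Resolving both paths before comparing them means symlinks and `..` segments are normalised first, so the containment check compares like with like.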
## Example flow

```
search_items(query='mediatype:audio AND creator:"Grateful Dead"', sort=['downloads desc'])
→ identifier 'gd77-05-08.sbd.hicks.4982.sbeok.shnf' (among others)

list_files(identifier='gd77-05-08.sbd.hicks.4982.sbeok.shnf', formats=['VBR MP3'])
→ [{ name: 'gd1977-05-08d1t01.mp3', size: 6342912, md5: '…', download_url: '…' }, …]

download_file(identifier='gd77-…', filename='gd1977-05-08d1t01.mp3', verify_md5='…')
→ { path: './downloads/gd77-…/gd1977-…mp3', bytes: 6342912, md5_ok: True }
```
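
The last two steps of the flow rest on three small ideas: a predictable download URL, an HTTP `Range` header for resuming, and a chunked MD5 check. The sketch below shows one way each could look. The helper names (`build_download_url`, `resume_headers`, `md5_matches`) are hypothetical and are not taken from this project's source; the URL shape is archive.org's public `https://archive.org/download/<identifier>/<filename>` pattern.

```python
import hashlib
from pathlib import Path
from urllib.parse import quote


def build_download_url(identifier: str, filename: str) -> str:
    """Build the download URL offline; '/' in filenames is kept for subdirectories."""
    return f"https://archive.org/download/{quote(identifier)}/{quote(filename)}"


def resume_headers(partial: Path) -> dict:
    """Ask only for the bytes we do not have yet (empty dict = full download)."""
    have = partial.stat().st_size if partial.exists() else 0
    return {"Range": f"bytes={have}-"} if have else {}


def md5_matches(path: Path, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks so large downloads are never loaded whole."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

When a `Range` header is sent, a server that honours it replies with `206 Partial Content` and the new bytes can be appended to the partial file. A plain `200 OK` means the full body is being sent again, so a client should start the file over instead of appending.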
## Query syntax notes

archive.org uses a Solr/Lucene dialect:

- `mediatype:(audio OR movies)`: restrict to media types
- `collection:etree`: items in a specific collection
- `date:[1977-01-01 TO 1977-12-31]`: date ranges
- `creator:"Grateful Dead"`: phrase match
- `-subject:bootleg`: exclusion
- Sort by `downloads desc`, `date asc`, `addeddate desc`, etc.

See [archive.org's search docs](https://archive.org/advancedsearch.php) for the full grammar.
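
For readers who want to see how such a query travels over the wire, here is a rough sketch of assembling an `advancedsearch.php` request URL with the standard library. `build_search_url` is a hypothetical helper, not this server's code; the `q`, `fl[]`, `sort[]`, `rows`, `page`, and `output` parameters are the ones archive.org's advanced search form uses, and the 1 to 200 row limit mirrors the `search_items` limit listed under Tools.

```python
from urllib.parse import urlencode

ADVANCED_SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, fields=("identifier",), sort=(), rows=50, page=1):
    """Encode a Lucene-style query plus paging options into a request URL."""
    if not 1 <= rows <= 200:
        raise ValueError("rows must be between 1 and 200")
    params = [("q", query)]
    params += [("fl[]", field) for field in fields]  # one fl[] per returned field
    params += [("sort[]", key) for key in sort]      # e.g. "downloads desc"
    params += [("rows", rows), ("page", page), ("output", "json")]
    # urlencode percent-escapes quotes, colons, and brackets for us.
    return f"{ADVANCED_SEARCH_URL}?{urlencode(params)}"
```

Passing the parameters as a list of pairs, not a dict, is what allows repeated keys such as several `fl[]` or `sort[]` entries in one request.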
## License

MIT