# mcarchive-org

An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the [Internet Archive](https://archive.org).

Built on [FastMCP](https://gofastmcp.com) + [httpx](https://www.python-httpx.org/). No API key required — archive.org's read endpoints are public.

## Tools

| Tool | Purpose |
|------|---------|
| `search_items` | Small Solr-style search via `advancedsearch.php` (1–200 rows, paginated) |
| `scrape_items` | Bulk cursor-paginated search via the Scrape API (count ≥ 100) |
| `get_item_metadata` | Metadata for one item; skips the (possibly huge) files list by default |
| `list_files` | Files array with optional format / glob filtering — includes `download_url` per file |
| `get_file_url` | Build a canonical download URL without hitting the network |
| `download_file` | Stream a file to disk with resume support and optional MD5 verification |

Also exposes an MCP resource template: `archive://item/{identifier}`.
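
As a rough illustration of what `get_file_url` does offline, the sketch below builds a URL from the public `https://archive.org/download/{identifier}/{filename}` pattern. The function name `build_download_url` and the quoting choices are assumptions made for this example, not the server's actual code.

```python
from urllib.parse import quote


def build_download_url(identifier: str, filename: str) -> str:
    """Build a download URL for one file of an item (hypothetical helper)."""
    # Percent-encode both parts; "/" stays literal in the filename so that
    # files stored in subdirectories of an item keep their path.
    return (
        "https://archive.org/download/"
        f"{quote(identifier, safe='')}/{quote(filename, safe='/')}"
    )


print(build_download_url("gd77-05-08.sbd.hicks.4982.sbeok.shnf", "gd1977-05-08d1t01.mp3"))
# → https://archive.org/download/gd77-05-08.sbd.hicks.4982.sbeok.shnf/gd1977-05-08d1t01.mp3
```

No request is made here, which is why the tool can return a URL without touching the network.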

## Install & run

```bash
# From a checkout:
uv sync
uv run mcarchive-org

# Or from PyPI (once published):
uvx mcarchive-org
```

Register with Claude Code:

```bash
claude mcp add archive-org -- uvx mcarchive-org
# or, from a local checkout:
claude mcp add archive-org -- uv run --directory /path/to/mcarchive-org mcarchive-org
```

## Environment

| Variable | Default | Purpose |
|----------|---------|---------|
| `MCARCHIVE_DOWNLOAD_ROOT` | `./downloads` | Base directory for `download_file` |
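
A minimal sketch of how a download path could be derived from this variable, assuming the `<root>/<identifier>/<filename>` layout shown in the example flow. The helper name `resolve_download_path` and the path-escape check are assumptions for illustration; the server may handle this differently.

```python
import os
from pathlib import Path


def resolve_download_path(identifier: str, filename: str) -> Path:
    """Map an item file to a path under the download root (hypothetical helper)."""
    # Fall back to ./downloads when the variable is unset, as in the table above.
    root = Path(os.environ.get("MCARCHIVE_DOWNLOAD_ROOT", "./downloads")).resolve()
    target = (root / identifier / filename).resolve()
    # Assumed safety check: refuse names such as "../x" that would escape the root.
    if not target.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{target} is outside the download root {root}")
    return target


print(resolve_download_path("gd77-05-08.sbd.hicks.4982.sbeok.shnf", "gd1977-05-08d1t01.mp3"))
```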

## Example flow

```
search_items(query='mediatype:audio AND creator:"Grateful Dead"', sort=['downloads desc'])
  → identifier 'gd77-05-08.sbd.hicks.4982.sbeok.shnf' (among others)

list_files(identifier='gd77-05-08.sbd.hicks.4982.sbeok.shnf', formats=['VBR MP3'])
  → [{ name: 'gd1977-05-08d1t01.mp3', size: 6342912, md5: '…', download_url: '…' }, …]

download_file(identifier='gd77-…', filename='gd1977-05-08d1t01.mp3', verify_md5='…')
  → { path: './downloads/gd77-…/gd1977-…mp3', bytes: 6342912, md5_ok: True }
```
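
The resume and MD5 steps of `download_file` can be sketched with the standard library alone. This is an illustration under assumptions, not the server's implementation: `resume_headers` and `md5_matches` are hypothetical names, and resuming is assumed to use an HTTP `Range` request starting at the size of the partial file.

```python
import hashlib
from pathlib import Path


def resume_headers(partial: Path) -> dict[str, str]:
    """Return a Range header asking for the bytes not yet on disk."""
    offset = partial.stat().st_size if partial.exists() else 0
    # An empty dict means "start from the beginning".
    return {"Range": f"bytes={offset}-"} if offset else {}


def md5_matches(path: Path, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks so large downloads are never loaded into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.strip().lower()
```

The expected hash would be the `md5` value that `list_files` reports for the file, passed through as `verify_md5`.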

## Query syntax notes

archive.org uses a Solr/Lucene dialect:

- `mediatype:(audio OR movies)` — restrict to media types
- `collection:etree` — items in a specific collection
- `date:[1977-01-01 TO 1977-12-31]` — date ranges
- `creator:"Grateful Dead"` — phrase match
- `-subject:bootleg` — exclusion
- Sort by `downloads desc`, `date asc`, `addeddate desc`, etc.

See [archive.org's search docs](https://archive.org/advancedsearch.php) for the full grammar.
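
To show how the clauses above combine into one request, here is a small sketch that assembles an `advancedsearch.php` URL using that endpoint's `q`, `fl[]`, `sort[]`, `rows`, `page`, and `output` parameters. The helper `build_search_url` and its defaults are assumptions for this example; `search_items` may construct its request differently.

```python
from urllib.parse import urlencode


def build_search_url(query, fields=("identifier", "title"), sort=(), rows=50, page=1):
    """Assemble an advancedsearch.php URL (hypothetical helper, no network access)."""
    params = [("q", query)]
    params += [("fl[]", name) for name in fields]  # one fl[] entry per returned field
    params += [("sort[]", key) for key in sort]    # e.g. "downloads desc"
    params += [("rows", rows), ("page", page), ("output", "json")]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


query = (
    'collection:etree AND creator:"Grateful Dead" '
    "AND date:[1977-01-01 TO 1977-12-31] AND -subject:bootleg"
)
print(build_search_url(query, sort=["downloads desc"], rows=20))
```

`urlencode` takes care of escaping the quotes, brackets, and spaces in the query, so the Lucene expression can be written as plain text.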

## License

MIT