Initial mcarchive-org MCP server
FastMCP server wrapping archive.org's public read APIs:

- search_items / scrape_items: advanced search + bulk cursor pagination
- get_item_metadata / list_files: progressive disclosure with filtering
- get_file_url / download_file: canonical URLs and streaming downloads
  with HTTP Range resume + optional MD5 verification

Smoke-tested end-to-end via claude -p headless MCP and pytest against
live archive.org endpoints.
commit
5265a6440b
14
.gitignore
vendored
Normal file
@@ -0,0 +1,14 @@
__pycache__/
*.py[cod]
*.egg-info/
.venv/
.ruff_cache/
.pytest_cache/
dist/
build/
.mypy_cache/
*.log

# downloads from test runs
downloads/
tmp/
73
README.md
Normal file
@@ -0,0 +1,73 @@
# mcarchive-org

An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the [Internet Archive](https://archive.org).

Built on [FastMCP](https://gofastmcp.com) + [httpx](https://www.python-httpx.org/). No API key required: archive.org's read endpoints are public.

## Tools

| Tool | Purpose |
|------|---------|
| `search_items` | Small Solr-style search via `advancedsearch.php` (1–200 rows, paginated) |
| `scrape_items` | Bulk cursor-paginated search via Scrape API (count ≥ 100) |
| `get_item_metadata` | Metadata for one item; skips the (possibly huge) files list by default |
| `list_files` | Files array with optional format / glob filtering; includes `download_url` per file |
| `get_file_url` | Build a canonical download URL without hitting the network |
| `download_file` | Stream a file to disk with resume support and optional MD5 verification |

Also exposes an MCP resource template: `archive://item/{identifier}`.

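As a rough illustration of the format / glob filtering that `list_files` offers, the sketch below filters a hand-written list of file records. The record shape (`name`, `format`) follows archive.org's files array, but the sample data and the helper name `filter_files` are made up for this example and are not the server's actual code path.

```python
import fnmatch


def filter_files(files, formats=None, name_glob=None):
    """Keep records that match the optional glob and the optional format list."""
    wanted = {f.lower() for f in formats} if formats else None
    kept = []
    for rec in files:
        name = rec.get("name", "")
        fmt = (rec.get("format") or "").lower()
        if name_glob and not fnmatch.fnmatchcase(name, name_glob):
            continue  # the glob comparison is case-sensitive
        if wanted is not None and fmt not in wanted:
            continue  # the format comparison is case-insensitive
        kept.append(rec)
    return kept


# Hypothetical sample records, not real archive.org data.
SAMPLE = [
    {"name": "track01.mp3", "format": "VBR MP3"},
    {"name": "track01.flac", "format": "Flac"},
    {"name": "cover.jpg", "format": "JPEG"},
]

mp3_only = filter_files(SAMPLE, formats=["vbr mp3"])
covers = filter_files(SAMPLE, name_glob="cover.*")
```

Both filters are optional and combine with AND, so passing neither returns every record.
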
## Install & run

```bash
# From a checkout:
uv sync
uv run mcarchive-org

# Or from PyPI (once published):
uvx mcarchive-org
```

Register with Claude Code:

```bash
claude mcp add archive-org -- uvx mcarchive-org
# or, from a local checkout:
claude mcp add archive-org -- uv run --directory /path/to/mcarchive-org mcarchive-org
```

## Environment

| Variable | Default | Purpose |
|----------|---------|---------|
| `MCARCHIVE_DOWNLOAD_ROOT` | `./downloads` | Base directory for `download_file` |

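How a destination path could be derived from `MCARCHIVE_DOWNLOAD_ROOT` is sketched below. The function name `resolve_dest`, the explicit `env` mapping (standing in for `os.environ`), and the filename `globe.jpg` are assumptions made so the example runs without touching the real environment.

```python
from pathlib import Path


def resolve_dest(identifier, filename, env, dest_dir=None):
    """Return the path a download would be written to.

    `env` is a plain mapping used in place of os.environ (an assumption that
    keeps the example free of side effects).
    """
    root = Path(env.get("MCARCHIVE_DOWNLOAD_ROOT", "./downloads")).expanduser()
    # An explicit dest_dir wins; otherwise files land in <root>/<identifier>/.
    target_dir = Path(dest_dir).expanduser() if dest_dir else root / identifier
    return target_dir / filename


default_path = resolve_dest("nasa", "globe.jpg", env={})
custom_path = resolve_dest("nasa", "globe.jpg", env={"MCARCHIVE_DOWNLOAD_ROOT": "/data/ia"})
```

With no variable set, the file ends up under `downloads/nasa/`; with the variable set, the same item lands under the configured root.
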
## Example flow

```
search_items(query='mediatype:audio AND creator:"Grateful Dead"', sort=['downloads desc'])
  → identifier 'gd77-05-08.sbd.hicks.4982.sbeok.shnf' (among others)

list_files(identifier='gd77-05-08.sbd.hicks.4982.sbeok.shnf', formats=['VBR MP3'])
  → [{ name: 'gd1977-05-08d1t01.mp3', size: 6342912, md5: '…', download_url: '…' }, …]

download_file(identifier='gd77-…', filename='gd1977-05-08d1t01.mp3', verify_md5='…')
  → { path: './downloads/gd77-…/gd1977-…mp3', bytes: 6342912, md5_ok: True }
```

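The resume-then-verify behaviour of the final `download_file` step can be sketched locally, without any network access. Here the "remote" file is just a bytes object and `fetch_range` is a hypothetical stand-in for an HTTP Range request. The point is that bytes already on disk are re-hashed before new chunks are appended, so the final digest covers the whole file.

```python
import hashlib
import tempfile
from pathlib import Path

REMOTE = b"internet archive " * 1000  # pretend remote file contents (17000 bytes)


def fetch_range(start, chunk_size=4096):
    """Hypothetical stand-in for `Range: bytes=<start>-`; yields the remaining chunks."""
    for offset in range(start, len(REMOTE), chunk_size):
        yield REMOTE[offset:offset + chunk_size]


def resume_download(dest, expected_md5):
    """Append the missing bytes to `dest` and verify the MD5 of the complete file."""
    resume_from = dest.stat().st_size if dest.exists() else 0
    hasher = hashlib.md5()
    if resume_from:
        # Re-hash what is already on disk so the digest covers every byte.
        with dest.open("rb") as f:
            while chunk := f.read(1 << 16):
                hasher.update(chunk)
    written = resume_from
    with dest.open("ab" if resume_from else "wb") as f:
        for chunk in fetch_range(resume_from):
            f.write(chunk)
            hasher.update(chunk)
            written += len(chunk)
    return {
        "bytes": written,
        "resumed_from": resume_from,
        "md5_ok": hasher.hexdigest() == expected_md5.lower(),
    }


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "partial.bin"
    dest.write_bytes(REMOTE[:5000])  # simulate an interrupted download
    result = resume_download(dest, hashlib.md5(REMOTE).hexdigest())
```

In this run the helper picks up at byte 5000 and finishes with a matching checksum. A real client also has to handle a server that ignores the Range header and resends the whole file.
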
## Query syntax notes

archive.org uses a Solr/Lucene dialect:

- `mediatype:(audio OR movies)`: restrict to media types
- `collection:etree`: items in a specific collection
- `date:[1977-01-01 TO 1977-12-31]`: date ranges
- `creator:"Grateful Dead"`: phrase match
- `-subject:bootleg`: exclusion
- Sort by `downloads desc`, `date asc`, `addeddate desc`, etc.

See [archive.org's search docs](https://archive.org/advancedsearch.php) for the full grammar.

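To show how these clauses combine, here is a small sketch that assembles a query string from parts. The helper `build_query` and its parameters are invented for this example; the server itself simply passes the query string through to archive.org.

```python
def build_query(mediatypes=(), collection=None, creator=None, year=None, exclude_subjects=()):
    """Join the optional clauses with AND, using the Lucene-style syntax from the notes."""
    clauses = []
    if mediatypes:
        clauses.append(f"mediatype:({' OR '.join(mediatypes)})")
    if collection:
        clauses.append(f"collection:{collection}")
    if creator:
        clauses.append(f'creator:"{creator}"')  # quotes force a phrase match
    if year:
        clauses.append(f"date:[{year}-01-01 TO {year}-12-31]")
    clauses.extend(f"-subject:{s}" for s in exclude_subjects)
    return " AND ".join(clauses)


query = build_query(mediatypes=["audio"], creator="Grateful Dead", year=1977)
print(query)
# mediatype:(audio) AND creator:"Grateful Dead" AND date:[1977-01-01 TO 1977-12-31]
```

The resulting string can be passed as the `query` argument of `search_items` or `scrape_items`.
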
## License

MIT
54
pyproject.toml
Normal file
@@ -0,0 +1,54 @@
[project]
name = "mcarchive-org"
version = "2026.04.21"
description = "MCP server for searching and downloading files from the Internet Archive (archive.org)"
readme = "README.md"
requires-python = ">=3.10"
license = { text = "MIT" }
authors = [
    { name = "Ryan Malloy", email = "ryan@supported.systems" },
]
keywords = ["mcp", "archive.org", "internet-archive", "fastmcp", "llm"]
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Topic :: Internet :: WWW/HTTP",
]
dependencies = [
    "fastmcp>=3.2.4",
    "httpx>=0.28.1",
]

[project.scripts]
mcarchive-org = "mcarchive_org.server:main"

[project.urls]
Homepage = "https://archive.org/developers/"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/mcarchive_org"]

[tool.ruff]
line-length = 100
target-version = "py310"

[tool.ruff.lint]
select = ["E", "F", "W", "I", "UP", "B", "SIM", "RUF"]
ignore = ["E501"]

[dependency-groups]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.23",
    "ruff>=0.5",
]
8
src/mcarchive_org/__init__.py
Normal file
@@ -0,0 +1,8 @@
"""MCP server for the Internet Archive (archive.org)."""

from importlib.metadata import PackageNotFoundError, version

try:
    __version__ = version("mcarchive-org")
except PackageNotFoundError:
    __version__ = "0.0.0"
4
src/mcarchive_org/__main__.py
Normal file
@@ -0,0 +1,4 @@
from mcarchive_org.server import main

if __name__ == "__main__":
    main()
196
src/mcarchive_org/client.py
Normal file
@@ -0,0 +1,196 @@
"""Low-level archive.org HTTP client (pure httpx, no MCP dependencies)."""

from __future__ import annotations

import hashlib
from collections.abc import AsyncIterator, Callable
from pathlib import Path
from typing import Any
from urllib.parse import quote

import httpx

ARCHIVE_BASE = "https://archive.org"
DEFAULT_UA = "mcarchive-org/2026.04.21 (+https://archive.org/developers/)"
DEFAULT_TIMEOUT = httpx.Timeout(30.0, read=60.0)


class ArchiveError(RuntimeError):
    """Raised when archive.org returns an error payload or unexpected status."""


class ArchiveClient:
    """Async client for the four archive.org endpoints we care about.

    - advancedsearch.php        : small Solr-style queries (<= ~10,000 rows paginated)
    - services/search/v1/scrape : bulk cursor-based iteration (count >= 100)
    - metadata/{id}             : full item manifest including files[]
    - download/{id}/{file}      : byte stream with Range support
    """

    def __init__(
        self,
        base_url: str = ARCHIVE_BASE,
        user_agent: str = DEFAULT_UA,
        timeout: httpx.Timeout | float = DEFAULT_TIMEOUT,
    ) -> None:
        self._base = base_url.rstrip("/")
        self._client = httpx.AsyncClient(
            headers={"User-Agent": user_agent, "Accept": "application/json"},
            timeout=timeout,
            follow_redirects=True,
        )

    async def aclose(self) -> None:
        await self._client.aclose()

    async def __aenter__(self) -> ArchiveClient:
        return self

    async def __aexit__(self, *exc: object) -> None:
        await self.aclose()

    # ---------- search ----------

    async def search(
        self,
        query: str,
        fields: list[str] | None = None,
        sort: list[str] | None = None,
        rows: int = 25,
        page: int = 1,
    ) -> dict[str, Any]:
        """Advanced search, best for small result sets (<=10k total)."""
        params: list[tuple[str, str]] = [
            ("q", query),
            ("output", "json"),
            ("rows", str(rows)),
            ("page", str(page)),
        ]
        for f in fields or ["identifier", "title", "mediatype", "creator", "date"]:
            params.append(("fl[]", f))
        for s in sort or []:
            params.append(("sort[]", s))

        r = await self._client.get(f"{self._base}/advancedsearch.php", params=params)
        r.raise_for_status()
        data = r.json()
        resp = data.get("response", {})
        return {
            "num_found": resp.get("numFound", 0),
            "start": resp.get("start", 0),
            "page": page,
            "rows": rows,
            "docs": resp.get("docs", []),
        }

    async def scrape(
        self,
        query: str,
        fields: list[str] | None = None,
        sorts: list[str] | None = None,
        count: int = 100,
        cursor: str | None = None,
    ) -> dict[str, Any]:
        """Scrape API: cursor-paginated; count must be >= 100."""
        if count < 100:
            raise ValueError("scrape count must be >= 100; use search() for smaller queries")

        params: dict[str, str] = {"q": query, "count": str(count)}
        if fields:
            params["fields"] = ",".join(fields)
        if sorts:
            params["sorts"] = ",".join(sorts)
        if cursor:
            params["cursor"] = cursor

        r = await self._client.get(f"{self._base}/services/search/v1/scrape", params=params)
        r.raise_for_status()
        data = r.json()
        if "error" in data:
            raise ArchiveError(f"{data.get('errorType', 'ScrapeError')}: {data['error']}")
        return data  # keys: items, count, total, cursor (if more pages)

    # ---------- metadata ----------

    async def metadata(self, identifier: str) -> dict[str, Any]:
        """Full metadata blob for an item."""
        r = await self._client.get(f"{self._base}/metadata/{identifier}")
        r.raise_for_status()
        data = r.json()
        if not data:
            raise ArchiveError(f"item not found: {identifier}")
        return data

    async def files(self, identifier: str) -> list[dict[str, Any]]:
        """Just the files[] slice, a smaller payload when that's all you want."""
        r = await self._client.get(f"{self._base}/metadata/{identifier}/files")
        r.raise_for_status()
        data = r.json()
        if isinstance(data, dict) and "result" in data:
            return data["result"]
        if isinstance(data, list):
            return data
        raise ArchiveError(f"unexpected files response for {identifier}")

    # ---------- download ----------

    def download_url(self, identifier: str, filename: str) -> str:
        # Percent-encode the filename (keeping '/') so spaces, '#', and '?' survive.
        return f"{self._base}/download/{identifier}/{quote(filename)}"

    async def stream_file(
        self,
        identifier: str,
        filename: str,
        resume_from: int = 0,
    ) -> AsyncIterator[bytes]:
        """Async byte iterator; the caller is responsible for writing to disk."""
        headers: dict[str, str] = {}
        if resume_from > 0:
            headers["Range"] = f"bytes={resume_from}-"
        url = self.download_url(identifier, filename)
        async with self._client.stream("GET", url, headers=headers) as r:
            if resume_from > 0 and r.status_code == 416:
                # Range starts at or past EOF; assume the local copy is already complete.
                return
            r.raise_for_status()
            if resume_from > 0 and r.status_code != 206:
                # The server ignored Range and is resending everything; appending would corrupt the file.
                raise ArchiveError(f"server ignored Range request for {filename}")
            async for chunk in r.aiter_bytes(chunk_size=1 << 16):
                yield chunk

    async def download_to_file(
        self,
        identifier: str,
        filename: str,
        dest: Path,
        verify_md5: str | None = None,
        chunk_cb: Callable[[int], None] | None = None,
    ) -> dict[str, Any]:
        """Download with resume support. Returns stats + md5 verification result."""
        dest.parent.mkdir(parents=True, exist_ok=True)
        resume_from = dest.stat().st_size if dest.exists() else 0

        hasher = hashlib.md5() if verify_md5 else None
        if hasher and resume_from:
            # re-hash existing bytes so the final digest is correct
            with dest.open("rb") as f:
                while chunk := f.read(1 << 16):
                    hasher.update(chunk)

        bytes_written = resume_from
        mode = "ab" if resume_from else "wb"
        with dest.open(mode) as f:
            async for chunk in self.stream_file(identifier, filename, resume_from=resume_from):
                f.write(chunk)
                bytes_written += len(chunk)
                if hasher:
                    hasher.update(chunk)
                if chunk_cb:
                    chunk_cb(bytes_written)

        result: dict[str, Any] = {
            "path": str(dest),
            "bytes": bytes_written,
            "resumed_from": resume_from,
        }
        if verify_md5 and hasher:
            actual = hasher.hexdigest()
            result["md5_actual"] = actual
            result["md5_expected"] = verify_md5
            result["md5_ok"] = actual.lower() == verify_md5.lower()
        return result
258
src/mcarchive_org/server.py
Normal file
@@ -0,0 +1,258 @@
"""FastMCP server exposing archive.org search, metadata, and download."""

from __future__ import annotations

import fnmatch
import os
import sys
from pathlib import Path
from typing import Annotated, Any
from urllib.parse import quote

from fastmcp import FastMCP
from pydantic import Field

from mcarchive_org import __version__
from mcarchive_org.client import ArchiveClient

DEFAULT_DOWNLOAD_ROOT = Path(
    os.environ.get("MCARCHIVE_DOWNLOAD_ROOT", Path.cwd() / "downloads")
).expanduser()

mcp = FastMCP(
    name="mcarchive-org",
    instructions=(
        "Search and download files from the Internet Archive (archive.org). "
        "Typical flow: search_items -> get_item_metadata -> list_files -> download_file. "
        "Use scrape_items (count>=100) only for bulk cursor-paginated iteration."
    ),
)


# ---------- helpers (not exposed as tools) ----------


def _human_size(n: int | str | None) -> str:
    try:
        x = float(n)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return "?"
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if x < 1024:
            return f"{x:.1f} {unit}" if unit != "B" else f"{int(x)} B"
        x /= 1024
    return f"{x:.1f} PB"


def _enrich_file(identifier: str, f: dict[str, Any]) -> dict[str, Any]:
    name = f.get("name", "")
    return {
        "name": name,
        "format": f.get("format"),
        "size": int(f["size"]) if f.get("size") and str(f["size"]).isdigit() else None,
        "size_human": _human_size(f.get("size")),
        "md5": f.get("md5"),
        "sha1": f.get("sha1"),
        "mtime": f.get("mtime"),
        "source": f.get("source"),
        # Percent-encode the filename so names with spaces or '#' yield a usable URL.
        "download_url": f"https://archive.org/download/{identifier}/{quote(name)}",
    }


def _matches(name: str, format_: str | None, name_glob: str | None, formats: list[str] | None) -> bool:
    if name_glob and not fnmatch.fnmatchcase(name, name_glob):
        return False
    return not (formats and (format_ or "").lower() not in {f.lower() for f in formats})


# ---------- tools ----------


@mcp.tool
async def search_items(
    query: Annotated[str, Field(description="Lucene/Solr query, e.g. 'mediatype:audio AND creator:\"Grateful Dead\"'")],
    fields: Annotated[
        list[str] | None,
        Field(description="Which metadata fields to return per doc. Defaults to identifier,title,mediatype,creator,date."),
    ] = None,
    sort: Annotated[
        list[str] | None,
        Field(description="Sort expressions like 'downloads desc' or 'date asc'."),
    ] = None,
    rows: Annotated[int, Field(ge=1, le=200, description="Results per page (1-200).")] = 25,
    page: Annotated[int, Field(ge=1, description="1-indexed page number.")] = 1,
) -> dict[str, Any]:
    """Search archive.org items. Good for small/interactive queries.

    Returns up to `rows` matching items plus `num_found` (total hits) and `has_more`.
    Use scrape_items for bulk iteration over large result sets.
    """
    async with ArchiveClient() as c:
        result = await c.search(query=query, fields=fields, sort=sort, rows=rows, page=page)
    total = result["num_found"]
    seen = (page - 1) * rows + len(result["docs"])
    return {
        "query": query,
        "num_found": total,
        "page": page,
        "rows": rows,
        "has_more": seen < total,
        "docs": result["docs"],
    }


@mcp.tool
async def scrape_items(
    query: Annotated[str, Field(description="Lucene/Solr query.")],
    fields: Annotated[list[str] | None, Field(description="Metadata fields per item.")] = None,
    sorts: Annotated[list[str] | None, Field(description="Sort expressions, e.g. ['date asc'].")] = None,
    count: Annotated[int, Field(ge=100, le=10000, description="Items per page (>=100 required by API).")] = 500,
    cursor: Annotated[str | None, Field(description="Pass the `cursor` from a prior response to fetch next page.")] = None,
) -> dict[str, Any]:
    """Scrape API: high-throughput cursor-paginated search. count >= 100.

    Response includes `next_cursor` (for the next page) when more results exist; it is None when done.
    """
    async with ArchiveClient() as c:
        data = await c.scrape(query=query, fields=fields, sorts=sorts, count=count, cursor=cursor)
    return {
        "items": data.get("items", []),
        "count": data.get("count"),
        "total": data.get("total"),
        "next_cursor": data.get("cursor"),
    }


@mcp.tool
async def get_item_metadata(
    identifier: Annotated[str, Field(description="Archive.org item identifier, e.g. 'nasa'.")],
    include_files: Annotated[
        bool, Field(description="If true, include the full files[] array. Can be large.")
    ] = False,
) -> dict[str, Any]:
    """Get metadata for a single item.

    By default omits the (potentially huge) files[] array; call list_files for that.
    """
    async with ArchiveClient() as c:
        data = await c.metadata(identifier)

    md = data.get("metadata", {})
    out: dict[str, Any] = {
        "identifier": md.get("identifier", identifier),
        "title": md.get("title"),
        "mediatype": md.get("mediatype"),
        "collection": md.get("collection"),
        "creator": md.get("creator"),
        "date": md.get("date"),
        "description": md.get("description"),
        "publicdate": md.get("publicdate"),
        "uploader": md.get("uploader"),
        "subject": md.get("subject"),
        "licenseurl": md.get("licenseurl"),
        "item_size_bytes": data.get("item_size"),
        "item_size_human": _human_size(data.get("item_size")),
        "files_count": data.get("files_count"),
        "server": data.get("server"),
        "dir": data.get("dir"),
        "item_url": f"https://archive.org/details/{identifier}",
    }
    if include_files:
        out["files"] = [_enrich_file(identifier, f) for f in data.get("files", [])]
    return out


@mcp.tool
async def list_files(
    identifier: Annotated[str, Field(description="Archive.org item identifier.")],
    formats: Annotated[
        list[str] | None,
        Field(description="Filter by format, e.g. ['MP3','VBR MP3','JPEG']. Case-insensitive."),
    ] = None,
    name_glob: Annotated[
        str | None,
        Field(description="fnmatch-style glob on filename, e.g. '*.mp3' or 'cover.*'."),
    ] = None,
    limit: Annotated[int, Field(ge=1, le=1000, description="Max files to return.")] = 100,
) -> dict[str, Any]:
    """List files in an item, with optional format/glob filtering.

    Each entry includes a ready-to-use `download_url`.
    """
    async with ArchiveClient() as c:
        files = await c.files(identifier)

    matches = [
        _enrich_file(identifier, f)
        for f in files
        if _matches(f.get("name", ""), f.get("format"), name_glob, formats)
    ]
    return {
        "identifier": identifier,
        "total_matching": len(matches),
        "returned": min(len(matches), limit),
        "files": matches[:limit],
    }


@mcp.tool
def get_file_url(
    identifier: Annotated[str, Field(description="Item identifier.")],
    filename: Annotated[str, Field(description="Exact filename as shown in list_files.")],
) -> dict[str, str]:
    """Build the canonical download URL for a file without fetching anything."""
    return {
        "url": f"https://archive.org/download/{identifier}/{quote(filename)}",
        "item_url": f"https://archive.org/details/{identifier}",
    }


@mcp.tool
async def download_file(
    identifier: Annotated[str, Field(description="Item identifier.")],
    filename: Annotated[str, Field(description="Exact filename from list_files.")],
    dest_dir: Annotated[
        str | None,
        Field(description="Directory to save into. Defaults to $MCARCHIVE_DOWNLOAD_ROOT/{identifier}."),
    ] = None,
    verify_md5: Annotated[
        str | None,
        Field(description="Expected MD5 hex digest (from list_files). If provided, checksum is verified."),
    ] = None,
    overwrite: Annotated[
        bool,
        Field(description="If false and file exists, resume the download (Range request)."),
    ] = False,
) -> dict[str, Any]:
    """Download a file to disk. Supports resume via HTTP Range when overwrite=false."""
    target_dir = Path(dest_dir).expanduser() if dest_dir else (DEFAULT_DOWNLOAD_ROOT / identifier)
    dest = target_dir / filename
    # Refuse filenames such as '../x' or '/etc/x' that would escape the target directory.
    if not dest.resolve().is_relative_to(target_dir.resolve()):
        raise ValueError(f"filename escapes destination directory: {filename}")
    if overwrite and dest.exists():
        dest.unlink()

    async with ArchiveClient() as c:
        result = await c.download_to_file(identifier, filename, dest, verify_md5=verify_md5)

    result["identifier"] = identifier
    result["filename"] = filename
    result["size_human"] = _human_size(result.get("bytes"))
    return result


# ---------- resources ----------


@mcp.resource("archive://item/{identifier}")
async def item_resource(identifier: str) -> dict[str, Any]:
    """Expose item metadata as a readable MCP resource."""
    return await get_item_metadata.fn(identifier=identifier, include_files=False)  # type: ignore[attr-defined]


# ---------- entry point ----------


def main() -> None:
    # On the default stdio transport, stdout carries the MCP protocol, so the banner goes to stderr.
    print(f"mcarchive-org v{__version__} - Internet Archive MCP server", file=sys.stderr)
    mcp.run()


if __name__ == "__main__":
    main()
14
tests/conftest.py
Normal file
@@ -0,0 +1,14 @@
import pytest


def pytest_collection_modifyitems(config, items):
    for item in items:
        if "asyncio" in item.keywords or item.get_closest_marker("asyncio"):
            continue


pytest_plugins = ["pytest_asyncio"]


def pytest_configure(config: pytest.Config) -> None:
    config.addinivalue_line("markers", "network: test hits live archive.org")
52
tests/test_client.py
Normal file
@@ -0,0 +1,52 @@
"""End-to-end smoke tests against live archive.org (network required).

Run with: uv run pytest -v
Skip with: uv run pytest -v -m 'not network'
"""

from __future__ import annotations

from pathlib import Path

import pytest

from mcarchive_org.client import ArchiveClient

pytestmark = [pytest.mark.asyncio, pytest.mark.network]


async def test_search_nasa_item():
    async with ArchiveClient() as c:
        result = await c.search(query="identifier:nasa", rows=5)
    assert result["num_found"] >= 1
    assert any(d["identifier"] == "nasa" for d in result["docs"])


async def test_metadata_nasa():
    async with ArchiveClient() as c:
        data = await c.metadata("nasa")
    assert data["metadata"]["identifier"] == "nasa"
    assert isinstance(data["files"], list) and data["files"]


async def test_download_small_file(tmp_path: Path):
    async with ArchiveClient() as c:
        files = await c.files("nasa")
        # pick the smallest file to keep the test fast
        small = min(
            (f for f in files if f.get("size") and str(f["size"]).isdigit()),
            key=lambda f: int(f["size"]),
        )
        dest = tmp_path / small["name"]
        result = await c.download_to_file(
            "nasa", small["name"], dest, verify_md5=small.get("md5")
        )
    assert result["bytes"] > 0
    if small.get("md5"):
        assert result["md5_ok"] is True


async def test_scrape_requires_min_count():
    async with ArchiveClient() as c:
        with pytest.raises(ValueError):
            await c.scrape(query="identifier:nasa", count=10)