mcp-office-tools/docs/RESOURCE_DESIGN.md
Ryan Malloy d569034fa3
Add MCP resource system for embedded document content
Implements URI-based access to document content with:

- ResourceStore for caching extracted images, chapters, sheets, slides
- Content-based document IDs (SHA256 hash) for stable URIs across sessions
- 11 resource templates with flexible URI patterns:
  - Binary: image://, chart://, media://, embed://
  - Text: chapter://, section://, sheet://, slide://
  - Ranges: chapters://doc/1-5, slides://doc/1,3,5
  - Hierarchical: paragraph://doc/3/5

- Format suffixes for output control:
  - chapter://doc/3.md (default markdown)
  - chapter://doc/3.txt (plain text)
  - chapter://doc/3.html (basic HTML)

- index_document tool scans and populates resources:
  - Word: chapters as markdown, embedded images
  - Excel: sheets as markdown tables
  - PowerPoint: slides as markdown

Tool responses return URIs instead of blobs - clients fetch only what they need.
2026-01-11 09:04:29 -07:00

# MCP Resources Design for Embedded Office Content
## Overview
Expose embedded content from Office documents as MCP resources so that clients can fetch specific items on demand instead of receiving every blob inline in tool responses.
## URI Scheme
```
office://{doc_id}/{resource_type}/{resource_id}
```
**Examples:**
- `office://abc123/image/0` - First image from document abc123
- `office://abc123/chart/revenue-q4` - Named chart
- `office://abc123/media/video-1` - Embedded video
- `office://abc123/embed/attached.pdf` - Embedded PDF
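
To make the scheme concrete, the following is a minimal sketch of how such a URI could be split into its three parts using only the standard library. The names `OfficeUri` and `parse_office_uri` are hypothetical and do not come from the project; the sketch assumes exactly two path segments after the document ID.

```python
from typing import NamedTuple
from urllib.parse import unquote, urlparse


class OfficeUri(NamedTuple):
    doc_id: str
    resource_type: str
    resource_id: str


def parse_office_uri(uri: str) -> OfficeUri:
    """Split office://{doc_id}/{resource_type}/{resource_id} into its parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "office":
        raise ValueError(f"Not an office:// URI: {uri!r}")
    # urlparse puts the segment right after '//' in netloc and the rest in path.
    parts = parsed.path.lstrip("/").split("/")
    if not parsed.netloc or len(parts) != 2 or not all(parts):
        raise ValueError(f"Expected office://doc_id/type/id, got {uri!r}")
    # Percent-decode the ID so names such as 'my%20chart' round-trip.
    return OfficeUri(parsed.netloc, parts[0], unquote(parts[1]))
```

With this sketch, `parse_office_uri("office://abc123/embed/attached.pdf")` yields the document ID `abc123`, the type `embed`, and the ID `attached.pdf`.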
## Supported Resource Types
| Type | MIME Types | Sources |
|------|-----------|---------|
| `image` | image/png, image/jpeg, image/gif, image/wmf, image/emf | All Office formats |
| `chart` | image/png (rendered), application/json (data) | Excel, Word, PowerPoint |
| `media` | audio/*, video/* | PowerPoint, Word |
| `embed` | application/pdf, application/msword, etc. | OLE embedded objects |
| `font` | font/ttf, font/otf | Embedded fonts |
| `slide` | image/png (rendered) | PowerPoint slides as images |
## Document ID Strategy
Documents need stable IDs for resource URIs. Options:

1. **Content hash** - SHA256 of file content (stable across sessions)
2. **Path hash** - Hash of file path (simpler, works for local files)
3. **Session ID** - Random ID per extraction (only valid during session)

**Recommendation:** Use a content hash prefix (the first 12 hex characters of the SHA256 digest, 48 bits) so the same file gets the same ID across sessions and locations.
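
The trade-off between the first two options can be sketched as follows. Both helper names (`content_doc_id`, `path_doc_id`) and the chunk size are assumptions for illustration, not part of the project; the content variant reads the file in chunks so that large documents are not loaded into memory at once.

```python
import hashlib
from pathlib import Path


def content_doc_id(file_path: str, length: int = 12, chunk_size: int = 1 << 20) -> str:
    """Hash the file content chunk by chunk and keep a hex prefix."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def path_doc_id(file_path: str, length: int = 12) -> str:
    """Hash the resolved path: cheaper, but the ID changes when the file moves."""
    resolved = str(Path(file_path).resolve())
    return hashlib.sha256(resolved.encode("utf-8")).hexdigest()[:length]
```

Two copies of the same file in different directories share a `content_doc_id` but get different `path_doc_id` values, which is why the content hash gives URIs that survive a move or rename.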
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                         MCP Client                          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  Resource Template: office://{doc_id}/{type}/{resource_id}  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Resource Manager                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ ImageStore  │  │ ChartStore  │  │ MediaStore  │  ...     │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Document Cache                        │
│  { doc_id: { images: [...], charts: [...], media: [...] } } │
└─────────────────────────────────────────────────────────────┘
```
## Implementation
### 1. Resource Store (in-memory cache)
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import hashlib


@dataclass
class EmbeddedResource:
    """Represents an embedded resource from an Office document."""
    resource_id: str
    resource_type: str  # image, chart, media, embed
    mime_type: str
    data: bytes
    name: Optional[str] = None  # Original filename if available
    metadata: Optional[dict] = None  # Size, dimensions, etc.


class ResourceStore:
    """Manages extracted resources from Office documents."""

    def __init__(self):
        self._documents: Dict[str, Dict[str, List[EmbeddedResource]]] = {}

    @staticmethod
    def get_doc_id(file_path: str) -> str:
        """Generate stable document ID from file content."""
        with open(file_path, 'rb') as f:
            content_hash = hashlib.sha256(f.read()).hexdigest()
        return content_hash[:12]

    def store(self, doc_id: str, resource: EmbeddedResource):
        """Store an extracted resource."""
        if doc_id not in self._documents:
            self._documents[doc_id] = {}
        rtype = resource.resource_type
        if rtype not in self._documents[doc_id]:
            self._documents[doc_id][rtype] = []
        self._documents[doc_id][rtype].append(resource)

    def get(self, doc_id: str, resource_type: str, resource_id: str) -> Optional[EmbeddedResource]:
        """Retrieve a specific resource."""
        if doc_id not in self._documents:
            return None
        resources = self._documents[doc_id].get(resource_type, [])
        # Try by index first
        if resource_id.isdigit():
            idx = int(resource_id)
            if 0 <= idx < len(resources):
                return resources[idx]
        # Try by name
        for r in resources:
            if r.resource_id == resource_id or r.name == resource_id:
                return r
        return None

    def list_resources(self, doc_id: str) -> Dict[str, List[dict]]:
        """List all resources for a document."""
        if doc_id not in self._documents:
            return {}
        result = {}
        for rtype, resources in self._documents[doc_id].items():
            result[rtype] = [
                {
                    "id": r.resource_id,
                    "name": r.name,
                    "mime_type": r.mime_type,
                    "uri": f"office://{doc_id}/{rtype}/{r.resource_id}"
                }
                for r in resources
            ]
        return result


# Global instance
resource_store = ResourceStore()
```
### 2. Resource Template Registration
```python
from fastmcp import FastMCP

app = FastMCP("MCP Office Tools")


@app.resource(
    "office://{doc_id}/{resource_type}/{resource_id}",
    name="office_embedded_resource",
    description="Embedded content from Office documents (images, charts, media, etc.)"
)
def get_office_resource(doc_id: str, resource_type: str, resource_id: str) -> bytes:
    """Retrieve embedded resource from an Office document."""
    resource = resource_store.get(doc_id, resource_type, resource_id)
    if resource is None:
        raise ValueError(
            f"Resource not found: office://{doc_id}/{resource_type}/{resource_id}"
        )
    return resource.data
```
### 3. Integration with extract_images Tool
Modify `extract_images` to populate the resource store:
```python
@mcp_tool(name="extract_images")
async def extract_images(self, file_path: str, ...) -> dict:
    # ... existing extraction logic ...

    doc_id = ResourceStore.get_doc_id(resolved_path)

    for idx, image_data in enumerate(extracted_images):
        resource = EmbeddedResource(
            resource_id=str(idx),
            resource_type="image",
            mime_type=image_data["mime_type"],
            data=image_data["bytes"],
            name=image_data.get("filename"),
            metadata={"width": ..., "height": ...}
        )
        resource_store.store(doc_id, resource)

    # Return URIs instead of base64 data
    return {
        "doc_id": doc_id,
        "images": [
            {
                "uri": f"office://{doc_id}/image/{idx}",
                "mime_type": img["mime_type"],
                "dimensions": {...}
            }
            for idx, img in enumerate(extracted_images)
        ],
        "message": "Use resource URIs to fetch image data"
    }
```
### 4. New Tool: list_embedded_resources
```python
@mcp_tool(name="list_embedded_resources")
async def list_embedded_resources(
    self,
    file_path: str,
    resource_types: str = "all"  # "all", "image", "chart", "media", etc.
) -> dict:
    """
    Scan document and return URIs for all embedded resources.
    Does not extract content - just identifies what's available.
    """
    # resolved_path is assumed to come from the tool's existing path
    # resolution of file_path (not shown here)
    doc_id = ResourceStore.get_doc_id(resolved_path)

    # Scan document for resources
    resources = scan_for_resources(resolved_path, resource_types)

    # Store metadata (not content yet - lazy loading)
    for r in resources:
        resource_store.store(doc_id, r)

    listing = resource_store.list_resources(doc_id)
    return {
        "doc_id": doc_id,
        "resources": listing,
        # Count from the per-type listing; `resources` is iterated as a flat
        # sequence above, so it has no .values()
        "total_count": sum(len(v) for v in listing.values())
    }
```
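
The `scan_for_resources` helper is not defined in this document. As one possible starting point for the scan step, the sketch below lists media entries in an OOXML file (`.docx`, `.xlsx`, `.pptx`) by reading only the ZIP directory, without decompressing any payloads. The function name `scan_media_entries` and the returned dict shape are assumptions; legacy binary formats (`.doc`, `.xls`, `.ppt`) are not ZIP archives and would need a different approach.

```python
import mimetypes
import zipfile
from typing import Dict, List

# Media folders used by the Office Open XML packaging of each format.
MEDIA_PREFIXES = ("word/media/", "xl/media/", "ppt/media/")


def scan_media_entries(source) -> List[Dict[str, object]]:
    """List embedded media in an OOXML file (path or file-like object)."""
    entries = []
    with zipfile.ZipFile(source) as archive:
        # infolist() reads the central directory only; no payload is extracted.
        for info in archive.infolist():
            if info.is_dir() or not info.filename.startswith(MEDIA_PREFIXES):
                continue
            mime, _ = mimetypes.guess_type(info.filename)
            entries.append({
                "name": info.filename.rsplit("/", 1)[-1],
                "mime_type": mime or "application/octet-stream",
                "size": info.file_size,  # uncompressed size in bytes
            })
    return entries
```

Because only names and sizes are read, a listing of this kind stays cheap even for documents with many large images, which fits the "identify what's available" contract of the tool.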
## Usage Flow
1. **Client extracts images or lists resources:**
   ```
   → list_embedded_resources("report.docx")
   ← { "doc_id": "a1b2c3d4e5f6", "resources": { "image": [...], "chart": [...] } }
   ```
2. **Client fetches specific resource via URI:**
   ```
   → read_resource("office://a1b2c3d4e5f6/image/0")
   ← <binary PNG data>
   ```
3. **Resources remain available for the session** (or until cache expires)
## Benefits
1. **Smaller tool responses** - URIs instead of base64 blobs
2. **On-demand fetching** - Client only loads what it needs
3. **Unified access** - Same pattern for images, charts, media, embeds
4. **Cacheable** - Document ID enables client-side caching
5. **Discoverable** - `list_embedded_resources` shows what's available
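
To put a number on the first benefit, the short sketch below compares the size of an inline base64 payload with the size of a URI for the same resource. The 300,000-byte random buffer is a stand-in for an embedded image, not measured data from a real document.

```python
import base64
import os

image_bytes = os.urandom(300_000)  # stand-in for a ~300 KB embedded image
inline_payload = base64.b64encode(image_bytes)
uri = "office://a1b2c3d4e5f6/image/0"

# base64 emits 4 output characters for every 3 input bytes.
print(len(inline_payload))  # 400000
print(len(uri))             # 29
```

Base64 adds roughly one third on top of the raw size, and that cost is paid for every image in the response, whereas the URI stays a few dozen bytes regardless of the resource size.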
## Future Extensions
- **Lazy extraction** - Only extract when resource is read, not when listed
- **Thumbnails** - `office://{doc_id}/image/{id}?size=thumb`
- **Format conversion** - `office://{doc_id}/image/{id}?format=webp`
- **Expiration** - TTL on cached resources
- **Persistence** - Optional disk-backed store for large documents
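
As an illustration of the expiration idea, here is a minimal sketch of a TTL cache keyed by resource URI. The class name `ExpiringCache`, the lazy eviction on read, and the injectable clock are all assumptions chosen to keep the example small and testable; they are not part of the current design.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringCache:
    """Tiny TTL cache keyed by resource URI; entries expire lazily on read."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, uri: str, data: bytes) -> None:
        """Store data together with its absolute expiry time."""
        self._entries[uri] = (self._clock() + self._ttl, data)

    def get(self, uri: str) -> Optional[bytes]:
        """Return cached data, or None if it is missing or has expired."""
        entry = self._entries.get(uri)
        if entry is None:
            return None
        expires_at, data = entry
        if self._clock() >= expires_at:
            del self._entries[uri]  # drop the stale entry on access
            return None
        return data
```

Lazy eviction keeps the sketch free of background threads, at the cost of stale entries occupying memory until they are next requested; a periodic sweep would be needed if that matters for large documents.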