# MCP Resources Design for Embedded Office Content

## Overview

Expose embedded content from Office documents as MCP resources, allowing clients to fetch specific items on-demand rather than bloating tool responses.

## URI Scheme

```
office://{doc_id}/{resource_type}/{resource_id}
```

**Examples:**
- `office://abc123/image/0` - First image from document abc123
- `office://abc123/chart/revenue-q4` - Named chart
- `office://abc123/media/video-1` - Embedded video
- `office://abc123/embed/attached.pdf` - Embedded PDF

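
The URI layout above can be taken apart with the standard library. The following is a minimal sketch, not part of the design itself: `OfficeUri` and `parse_office_uri` are hypothetical names, and it assumes the `doc_id` sits in the authority position of the URI, as the examples suggest.

```python
from typing import NamedTuple
from urllib.parse import unquote, urlsplit


class OfficeUri(NamedTuple):
    """Hypothetical container for the three URI components."""
    doc_id: str
    resource_type: str
    resource_id: str


def parse_office_uri(uri: str) -> OfficeUri:
    """Split office://{doc_id}/{resource_type}/{resource_id} into its parts."""
    parts = urlsplit(uri)
    if parts.scheme != "office":
        raise ValueError(f"Not an office:// URI: {uri!r}")
    # urlsplit places the doc_id in netloc and the rest in path.
    segments = [unquote(s) for s in parts.path.split("/") if s]
    if not parts.netloc or len(segments) != 2:
        raise ValueError(f"Expected office://doc_id/type/id, got {uri!r}")
    return OfficeUri(parts.netloc, segments[0], segments[1])


print(parse_office_uri("office://abc123/chart/revenue-q4"))
```

Returning a named tuple keeps the three components addressable by name, which matches the arguments that a lookup by document, type, and ID would need.
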
## Supported Resource Types

| Type | MIME Types | Sources |
|------|-----------|---------|
| `image` | image/png, image/jpeg, image/gif, image/wmf, image/emf | All Office formats |
| `chart` | image/png (rendered), application/json (data) | Excel, Word, PowerPoint |
| `media` | audio/*, video/* | PowerPoint, Word |
| `embed` | application/pdf, application/msword, etc. | OLE embedded objects |
| `font` | font/ttf, font/otf | Embedded fonts |
| `slide` | image/png (rendered) | PowerPoint slides as images |

## Document ID Strategy

Documents need stable IDs for resource URIs. Options:

1. **Content hash** - SHA256 of file content (stable across sessions)
2. **Path hash** - Hash of file path (simpler, works for local files)
3. **Session ID** - Random ID per extraction (only valid during session)

**Recommendation:** Use a content hash prefix (the first 12 hex characters of the SHA256 digest) so the same file maps to the same ID across sessions.

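
As an illustration of the recommended option, the sketch below computes the 12-character prefix while reading the file in chunks, so large documents are not loaded into memory at once. The function name `content_doc_id` and the 1 MiB chunk size are assumptions made for this example, not part of the design.

```python
import hashlib


def content_doc_id(file_path: str, prefix_len: int = 12, chunk_size: int = 1 << 20) -> str:
    """Return the first `prefix_len` hex characters of the file's SHA256 digest."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Feed the hash one chunk at a time until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:prefix_len]
```

A 12-character hex prefix carries 48 bits of the digest. That keeps URIs short, but a deployment indexing very large numbers of documents may prefer a longer prefix to reduce the chance of collisions.
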
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         MCP Client                          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  Resource Template: office://{doc_id}/{type}/{resource_id}  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Resource Manager                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ ImageStore  │  │ ChartStore  │  │ MediaStore  │  ...     │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Document Cache                        │
│  { doc_id: { images: [...], charts: [...], media: [...] } } │
└─────────────────────────────────────────────────────────────┘
```

## Implementation

### 1. Resource Store (in-memory cache)

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import hashlib

@dataclass
class EmbeddedResource:
    """Represents an embedded resource from an Office document."""
    resource_id: str
    resource_type: str  # image, chart, media, embed
    mime_type: str
    data: bytes
    name: Optional[str] = None  # Original filename if available
    metadata: Optional[dict] = None  # Size, dimensions, etc.

class ResourceStore:
    """Manages extracted resources from Office documents."""

    def __init__(self):
        self._documents: Dict[str, Dict[str, List[EmbeddedResource]]] = {}

    @staticmethod
    def get_doc_id(file_path: str) -> str:
        """Generate stable document ID from file content."""
        with open(file_path, 'rb') as f:
            content_hash = hashlib.sha256(f.read()).hexdigest()
        return content_hash[:12]

    def store(self, doc_id: str, resource: EmbeddedResource):
        """Store an extracted resource."""
        if doc_id not in self._documents:
            self._documents[doc_id] = {}
        rtype = resource.resource_type
        if rtype not in self._documents[doc_id]:
            self._documents[doc_id][rtype] = []
        self._documents[doc_id][rtype].append(resource)

    def get(self, doc_id: str, resource_type: str, resource_id: str) -> Optional[EmbeddedResource]:
        """Retrieve a specific resource."""
        if doc_id not in self._documents:
            return None
        resources = self._documents[doc_id].get(resource_type, [])

        # Try by index first
        if resource_id.isdigit():
            idx = int(resource_id)
            if 0 <= idx < len(resources):
                return resources[idx]

        # Try by name
        for r in resources:
            if r.resource_id == resource_id or r.name == resource_id:
                return r
        return None

    def list_resources(self, doc_id: str) -> Dict[str, List[dict]]:
        """List all resources for a document."""
        if doc_id not in self._documents:
            return {}

        result = {}
        for rtype, resources in self._documents[doc_id].items():
            result[rtype] = [
                {
                    "id": r.resource_id,
                    "name": r.name,
                    "mime_type": r.mime_type,
                    "uri": f"office://{doc_id}/{rtype}/{r.resource_id}"
                }
                for r in resources
            ]
        return result

# Global instance
resource_store = ResourceStore()
```

### 2. Resource Template Registration

```python
from fastmcp import FastMCP

app = FastMCP("MCP Office Tools")

@app.resource(
    "office://{doc_id}/{resource_type}/{resource_id}",
    name="office_embedded_resource",
    description="Embedded content from Office documents (images, charts, media, etc.)"
)
def get_office_resource(doc_id: str, resource_type: str, resource_id: str) -> bytes:
    """Retrieve embedded resource from an Office document."""
    resource = resource_store.get(doc_id, resource_type, resource_id)
    if resource is None:
        raise ValueError(
            f"Resource not found: office://{doc_id}/{resource_type}/{resource_id}"
        )
    return resource.data
```

### 3. Integration with extract_images Tool

Modify `extract_images` to populate the resource store:

```python
@mcp_tool(name="extract_images")
async def extract_images(self, file_path: str, ...) -> dict:
    # ... existing extraction logic ...

    doc_id = ResourceStore.get_doc_id(resolved_path)

    for idx, image_data in enumerate(extracted_images):
        resource = EmbeddedResource(
            resource_id=str(idx),
            resource_type="image",
            mime_type=image_data["mime_type"],
            data=image_data["bytes"],
            name=image_data.get("filename"),
            metadata={"width": ..., "height": ...}
        )
        resource_store.store(doc_id, resource)

    # Return URIs instead of base64 data
    return {
        "doc_id": doc_id,
        "images": [
            {
                "uri": f"office://{doc_id}/image/{idx}",
                "mime_type": img["mime_type"],
                "dimensions": {...}
            }
            for idx, img in enumerate(extracted_images)
        ],
        "message": "Use resource URIs to fetch image data"
    }
```

### 4. New Tool: list_embedded_resources

```python
@mcp_tool(name="list_embedded_resources")
async def list_embedded_resources(
    self,
    file_path: str,
    resource_types: str = "all"  # "all", "image", "chart", "media", etc.
) -> dict:
    """
    Scan document and return URIs for all embedded resources.
    Does not extract content - just identifies what's available.
    """
    # ... resolve file_path to resolved_path, as in extract_images ...

    doc_id = ResourceStore.get_doc_id(resolved_path)

    # Scan document for resources
    resources = scan_for_resources(resolved_path, resource_types)

    # Store metadata (not content yet - lazy loading)
    for r in resources:
        resource_store.store(doc_id, r)

    listing = resource_store.list_resources(doc_id)
    return {
        "doc_id": doc_id,
        "resources": listing,
        # Count from the per-type listing: `resources` is iterated as a flat
        # sequence above, so it has no .values() to sum over
        "total_count": sum(len(v) for v in listing.values())
    }
```

## Usage Flow

1. **Client extracts images or lists resources:**
   ```
   → list_embedded_resources("report.docx")
   ← { "doc_id": "a1b2c3d4e5f6", "resources": { "image": [...], "chart": [...] } }
   ```

2. **Client fetches specific resource via URI:**
   ```
   → read_resource("office://a1b2c3d4e5f6/image/0")
   ← <binary PNG data>
   ```

3. **Resources remain available for the session** (or until cache expires)

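
The flow above can be walked through without an MCP server. The sketch below uses a plain dictionary as a stand-in for the resource store; the two functions only borrow the names from the flow and are simplified, hypothetical versions (the real tool takes a file path and scans the document, while this one is handed fake image bytes directly).

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for the server side: URI -> (mime_type, bytes).
_cache: Dict[str, Tuple[str, bytes]] = {}


def list_embedded_resources(doc_id: str, images: List[bytes]) -> dict:
    """Step 1: register payloads and return URIs only, no binary data."""
    entries = []
    for idx, data in enumerate(images):
        uri = f"office://{doc_id}/image/{idx}"
        _cache[uri] = ("image/png", data)
        entries.append({"uri": uri, "mime_type": "image/png"})
    return {"doc_id": doc_id, "resources": {"image": entries}}


def read_resource(uri: str) -> bytes:
    """Step 2: fetch a single payload by its URI."""
    if uri not in _cache:
        raise ValueError(f"Resource not found: {uri}")
    return _cache[uri][1]


listing = list_embedded_resources("a1b2c3d4e5f6", [b"\x89PNG-first", b"\x89PNG-second"])
first_uri = listing["resources"]["image"][0]["uri"]
payload = read_resource(first_uri)  # only the first image is ever transferred
```
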
## Benefits

1. **Smaller tool responses** - URIs instead of base64 blobs
2. **On-demand fetching** - Client only loads what it needs
3. **Unified access** - Same pattern for images, charts, media, embeds
4. **Cacheable** - Document ID enables client-side caching
5. **Discoverable** - `list_embedded_resources` shows what's available

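
To give a rough sense of the first benefit: base64 encodes every 3 bytes as 4 characters, so an inlined image is about a third larger than the raw data, while a URI entry stays a few dozen characters regardless of image size. The sketch below compares the two serialized forms; `response_sizes` and the 300,000-byte placeholder payload are made up for this illustration.

```python
import base64
import json
from typing import Tuple


def response_sizes(payload: bytes, uri: str) -> Tuple[int, int]:
    """Return (inline_size, uri_size): character counts of the two JSON forms."""
    inline = json.dumps({
        "mime_type": "image/png",
        "data": base64.b64encode(payload).decode("ascii"),
    })
    by_uri = json.dumps({"mime_type": "image/png", "uri": uri})
    return len(inline), len(by_uri)


# Placeholder payload of zero bytes standing in for a real image.
inline_size, uri_size = response_sizes(bytes(300_000), "office://a1b2c3d4e5f6/image/0")
print(f"inline: {inline_size} chars, by URI: {uri_size} chars")
```
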
## Future Extensions

- **Lazy extraction** - Only extract when resource is read, not when listed
- **Thumbnails** - `office://{doc_id}/image/{id}?size=thumb`
- **Format conversion** - `office://{doc_id}/image/{id}?format=webp`
- **Expiration** - TTL on cached resources
- **Persistence** - Optional disk-backed store for large documents
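
As one possible shape for the lazy extraction and expiration items, the sketch below stores a loader callable instead of bytes and drops entries after a time-to-live. `LazyEntry`, `LazyTtlStore`, the 600-second default, and the injectable clock are all assumptions made for this example; nothing here is part of the current design.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class LazyEntry:
    loader: Callable[[], bytes]   # extracts the bytes on first read
    expires_at: float             # deadline on the store's clock
    data: Optional[bytes] = None  # filled in after the first read


class LazyTtlStore:
    """Hypothetical store: content is extracted on first read and expires after a TTL."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, LazyEntry] = {}

    def register(self, uri: str, loader: Callable[[], bytes]) -> None:
        """Record that a resource exists without extracting it yet."""
        self._entries[uri] = LazyEntry(loader, self._clock() + self._ttl)

    def read(self, uri: str) -> Optional[bytes]:
        """Return the bytes, extracting on first access; None if unknown or expired."""
        entry = self._entries.get(uri)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            del self._entries[uri]
            return None
        if entry.data is None:
            entry.data = entry.loader()
        return entry.data
```

Passing the clock in as a parameter is a testing convenience: expiry can be exercised with a fake clock instead of sleeping.
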