MCP Resources Design for Embedded Office Content
Overview
Expose embedded content from Office documents as MCP resources, allowing clients to fetch specific items on-demand rather than bloating tool responses.
URI Scheme
office://{doc_id}/{resource_type}/{resource_id}
Examples:
- office://abc123/image/0 - First image from document abc123
- office://abc123/chart/revenue-q4 - Named chart
- office://abc123/media/video-1 - Embedded video
- office://abc123/embed/attached.pdf - Embedded PDF
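The URI layout above can be taken apart with the standard library. The following is a minimal sketch, not part of the existing codebase: the names `OfficeUri` and `parse_office_uri` are hypothetical, and it assumes the doc ID sits in the authority position of the URI.

```python
from typing import NamedTuple
from urllib.parse import unquote, urlsplit


class OfficeUri(NamedTuple):
    """Hypothetical container for the three parts of an office:// URI."""
    doc_id: str
    resource_type: str
    resource_id: str


def parse_office_uri(uri: str) -> OfficeUri:
    """Split office://{doc_id}/{resource_type}/{resource_id} into its parts."""
    parts = urlsplit(uri)
    if parts.scheme != "office":
        raise ValueError(f"Not an office:// URI: {uri!r}")
    # The doc_id occupies the authority (netloc) position.
    doc_id = parts.netloc
    # Partition once so a resource ID may itself contain slashes.
    resource_type, _, resource_id = parts.path.lstrip("/").partition("/")
    if not (doc_id and resource_type and resource_id):
        raise ValueError(f"Incomplete office:// URI: {uri!r}")
    return OfficeUri(doc_id, resource_type, unquote(resource_id))
```

For example, `parse_office_uri("office://abc123/embed/attached.pdf")` yields the doc ID `abc123`, the type `embed`, and the resource ID `attached.pdf`.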
Supported Resource Types
| Type | MIME Types | Sources |
|---|---|---|
| `image` | image/png, image/jpeg, image/gif, image/wmf, image/emf | All Office formats |
| `chart` | image/png (rendered), application/json (data) | Excel, Word, PowerPoint |
| `media` | `audio/*`, `video/*` | PowerPoint, Word |
| `embed` | application/pdf, application/msword, etc. | OLE embedded objects |
| `font` | font/ttf, font/otf | Embedded fonts |
| `slide` | image/png (rendered) | PowerPoint slides as images |
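One way to enforce the table is a lookup of MIME patterns per resource type. This is a sketch only: `SUPPORTED_MIME_TYPES` and `is_supported` are hypothetical names, the media row is read as the wildcards `audio/*` and `video/*`, and `application/*` for embeds is an assumption standing in for the table's "etc.".

```python
from fnmatch import fnmatchcase

# MIME patterns per resource type, transcribed from the table.
# "application/*" is an assumed stand-in for the embed row's "etc.".
SUPPORTED_MIME_TYPES = {
    "image": ["image/png", "image/jpeg", "image/gif", "image/wmf", "image/emf"],
    "chart": ["image/png", "application/json"],
    "media": ["audio/*", "video/*"],
    "embed": ["application/*"],
    "font": ["font/ttf", "font/otf"],
    "slide": ["image/png"],
}


def is_supported(resource_type: str, mime_type: str) -> bool:
    """Return True if mime_type matches a pattern listed for resource_type."""
    patterns = SUPPORTED_MIME_TYPES.get(resource_type, [])
    # MIME types are case-insensitive, so compare in lowercase.
    return any(fnmatchcase(mime_type.lower(), pattern) for pattern in patterns)
```

A check like this could run when a resource is stored, so that a mislabelled item is rejected before it receives a URI.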
Document ID Strategy
Documents need stable IDs for resource URIs. Options:
- Content hash - SHA256 of file content (stable across sessions)
- Path hash - Hash of file path (simpler, works for local files)
- Session ID - Random ID per extraction (only valid during session)
Recommendation: Use content hash prefix (first 12 chars of SHA256) for stability.
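The three options can be compared side by side. This is an illustrative sketch with hypothetical function names; it hashes the file in chunks rather than reading it whole, which is an assumption about how large documents might be handled.

```python
import hashlib
import os
import uuid


def content_hash_id(file_path: str, length: int = 12) -> str:
    """Content hash: identical bytes give the same ID wherever the file lives."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read in 1 MiB chunks so large documents are not loaded at once.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def path_hash_id(file_path: str, length: int = 12) -> str:
    """Path hash: cheap, but changes when the file is moved or renamed."""
    absolute = os.path.abspath(file_path)
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:length]


def session_id(length: int = 12) -> str:
    """Session ID: random, so only meaningful within one extraction session."""
    return uuid.uuid4().hex[:length]
```

A 12-character hex prefix carries 48 bits. Two copies of the same file at different paths share a content hash ID but get different path hash IDs, which is the stability property the recommendation relies on.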
Architecture
┌─────────────────────────────────────────────────────────────┐
│                         MCP Client                          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  Resource Template: office://{doc_id}/{type}/{resource_id}  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Resource Manager                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ ImageStore  │  │ ChartStore  │  │ MediaStore  │  ...     │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Document Cache                        │
│  { doc_id: { images: [...], charts: [...], media: [...] } } │
└─────────────────────────────────────────────────────────────┘
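The routing layer in the diagram could look roughly like the sketch below. `TypeStore` and `ResourceManager` are hypothetical stand-ins for the boxes in the diagram; the Implementation section uses a single `ResourceStore` keyed by type, which achieves the same routing with one class.

```python
from typing import Dict, Optional, Tuple


class TypeStore:
    """Holds resources of one type, keyed by (doc_id, resource_id)."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], bytes] = {}

    def put(self, doc_id: str, resource_id: str, data: bytes) -> None:
        self._items[(doc_id, resource_id)] = data

    def get(self, doc_id: str, resource_id: str) -> Optional[bytes]:
        return self._items.get((doc_id, resource_id))


class ResourceManager:
    """Routes a (doc_id, type, resource_id) lookup to the matching store."""

    def __init__(self) -> None:
        # One store per resource type, mirroring ImageStore / ChartStore / MediaStore.
        self._stores: Dict[str, TypeStore] = {
            "image": TypeStore(),
            "chart": TypeStore(),
            "media": TypeStore(),
        }

    def put(self, doc_id: str, resource_type: str, resource_id: str, data: bytes) -> None:
        if resource_type not in self._stores:
            raise ValueError(f"Unknown resource type: {resource_type}")
        self._stores[resource_type].put(doc_id, resource_id, data)

    def resolve(self, doc_id: str, resource_type: str, resource_id: str) -> Optional[bytes]:
        store = self._stores.get(resource_type)
        # Unknown types and missing resources both resolve to None.
        return store.get(doc_id, resource_id) if store else None
```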
Implementation
1. Resource Store (in-memory cache)
from dataclasses import dataclass
from typing import Dict, List, Optional
import hashlib


@dataclass
class EmbeddedResource:
    """Represents an embedded resource from an Office document."""
    resource_id: str
    resource_type: str  # image, chart, media, embed
    mime_type: str
    data: bytes
    name: Optional[str] = None  # Original filename if available
    metadata: Optional[dict] = None  # Size, dimensions, etc.


class ResourceStore:
    """Manages extracted resources from Office documents."""

    def __init__(self):
        # doc_id -> resource_type -> resources in extraction order
        self._documents: Dict[str, Dict[str, List[EmbeddedResource]]] = {}

    @staticmethod
    def get_doc_id(file_path: str) -> str:
        """Generate stable document ID from file content."""
        with open(file_path, 'rb') as f:
            content_hash = hashlib.sha256(f.read()).hexdigest()
        return content_hash[:12]

    def store(self, doc_id: str, resource: EmbeddedResource):
        """Store an extracted resource."""
        if doc_id not in self._documents:
            self._documents[doc_id] = {}
        rtype = resource.resource_type
        if rtype not in self._documents[doc_id]:
            self._documents[doc_id][rtype] = []
        self._documents[doc_id][rtype].append(resource)

    def get(self, doc_id: str, resource_type: str, resource_id: str) -> Optional[EmbeddedResource]:
        """Retrieve a specific resource."""
        if doc_id not in self._documents:
            return None
        resources = self._documents[doc_id].get(resource_type, [])
        # Try by index first
        if resource_id.isdigit():
            idx = int(resource_id)
            if 0 <= idx < len(resources):
                return resources[idx]
        # Try by name
        for r in resources:
            if r.resource_id == resource_id or r.name == resource_id:
                return r
        return None

    def list_resources(self, doc_id: str) -> Dict[str, List[dict]]:
        """List all resources for a document."""
        if doc_id not in self._documents:
            return {}
        result = {}
        for rtype, resources in self._documents[doc_id].items():
            result[rtype] = [
                {
                    "id": r.resource_id,
                    "name": r.name,
                    "mime_type": r.mime_type,
                    "uri": f"office://{doc_id}/{rtype}/{r.resource_id}"
                }
                for r in resources
            ]
        return result


# Global instance
resource_store = ResourceStore()
2. Resource Template Registration
from fastmcp import FastMCP

app = FastMCP("MCP Office Tools")


@app.resource(
    "office://{doc_id}/{resource_type}/{resource_id}",
    name="office_embedded_resource",
    description="Embedded content from Office documents (images, charts, media, etc.)"
)
def get_office_resource(doc_id: str, resource_type: str, resource_id: str) -> bytes:
    """Retrieve embedded resource from an Office document."""
    resource = resource_store.get(doc_id, resource_type, resource_id)
    if resource is None:
        raise ValueError(
            f"Resource not found: office://{doc_id}/{resource_type}/{resource_id}"
        )
    return resource.data
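On the wire, MCP carries binary resource contents as a base64 string in a `blob` field next to `uri` and `mimeType`. The sketch below shows that shape and the client-side decode using only the standard library. The helper names are hypothetical, and whether the server framework performs this encoding automatically for a handler returning `bytes` is an assumption worth verifying against the FastMCP version in use.

```python
import base64


def to_blob_contents(uri: str, mime_type: str, data: bytes) -> dict:
    """Shape binary data like an MCP blob resource entry (base64 in 'blob')."""
    return {
        "uri": uri,
        "mimeType": mime_type,
        "blob": base64.b64encode(data).decode("ascii"),
    }


def from_blob_contents(contents: dict) -> bytes:
    """Recover the raw bytes on the client side."""
    return base64.b64decode(contents["blob"])
```

Because the template returns only `resource.data`, a client that needs the MIME type may have to take it from the listing rather than from the read result.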
3. Integration with extract_images Tool
Modify extract_images to populate the resource store:
@mcp_tool(name="extract_images")
async def extract_images(self, file_path: str, ...) -> dict:
    # ... existing extraction logic ...
    doc_id = ResourceStore.get_doc_id(resolved_path)

    for idx, image_data in enumerate(extracted_images):
        resource = EmbeddedResource(
            resource_id=str(idx),
            resource_type="image",
            mime_type=image_data["mime_type"],
            data=image_data["bytes"],
            name=image_data.get("filename"),
            metadata={"width": ..., "height": ...}
        )
        resource_store.store(doc_id, resource)

    # Return URIs instead of base64 data
    return {
        "doc_id": doc_id,
        "images": [
            {
                "uri": f"office://{doc_id}/image/{idx}",
                "mime_type": img["mime_type"],
                "dimensions": {...}
            }
            for idx, img in enumerate(extracted_images)
        ],
        "message": "Use resource URIs to fetch image data"
    }
4. New Tool: list_embedded_resources
@mcp_tool(name="list_embedded_resources")
async def list_embedded_resources(
    self,
    file_path: str,
    resource_types: str = "all"  # "all", "image", "chart", "media", etc.
) -> dict:
    """
    Scan document and return URIs for all embedded resources.
    Does not extract content - just identifies what's available.
    """
    # ... resolve file_path to resolved_path, as in extract_images ...
    doc_id = ResourceStore.get_doc_id(resolved_path)

    # Scan document for resources
    resources = scan_for_resources(resolved_path, resource_types)

    # Store metadata (not content yet - lazy loading)
    for r in resources:
        resource_store.store(doc_id, r)

    # Count from the per-type listing; `resources` is a flat iterable here
    listing = resource_store.list_resources(doc_id)
    return {
        "doc_id": doc_id,
        "resources": listing,
        "total_count": sum(len(v) for v in listing.values())
    }
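The listing tool is meant to store metadata without content, but `EmbeddedResource.data` is a required `bytes` field. One way to reconcile the two is a resource that holds a loader and extracts on first access. This is a sketch under that assumption: `LazyResource` and its `loader` callable are hypothetical and not part of the existing code.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LazyResource:
    """Metadata now, bytes later: content is extracted on first access."""
    resource_id: str
    resource_type: str
    mime_type: str
    loader: Callable[[], bytes]  # hypothetical: pulls the bytes out of the document
    name: Optional[str] = None
    _data: Optional[bytes] = field(default=None, repr=False)

    @property
    def data(self) -> bytes:
        # Run the loader only once, then reuse the cached bytes.
        if self._data is None:
            self._data = self.loader()
        return self._data
```

Since it exposes the same `resource_id`, `resource_type`, `mime_type`, `name`, and `data` attributes, an object like this could be stored in `ResourceStore` in place of an `EmbeddedResource`.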
Usage Flow
- Client extracts images or lists resources:
  → list_embedded_resources("report.docx")
  ← { "doc_id": "a1b2c3d4e5f6", "resources": { "image": [...], "chart": [...] } }
- Client fetches specific resource via URI:
  → read_resource("office://a1b2c3d4e5f6/image/0")
  ← <binary PNG data>
- Resources remain available for the session (or until cache expires)
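The "until cache expires" behaviour in the last step is not specified further. A minimal time-to-live sketch could look like the following, where `ExpiringCache` and its injectable `clock` parameter are hypothetical and chosen so the expiry logic can be tested without waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringCache:
    """Keeps resource bytes per URI and drops them after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # uri -> (expiry timestamp, data)
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, uri: str, data: bytes) -> None:
        self._entries[uri] = (self._clock() + self._ttl, data)

    def get(self, uri: str) -> Optional[bytes]:
        entry = self._entries.get(uri)
        if entry is None:
            return None
        expires_at, data = entry
        if self._clock() >= expires_at:
            # Expired entries are evicted lazily, on the next read.
            del self._entries[uri]
            return None
        return data
```

A client that reads an expired URI would then need to call the extraction or listing tool again to repopulate the store.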
Benefits
- Smaller tool responses - URIs instead of base64 blobs
- On-demand fetching - Client only loads what it needs
- Unified access - Same pattern for images, charts, media, embeds
- Cacheable - Document ID enables client-side caching
- Discoverable - list_embedded_resources shows what's available
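The size difference behind the first benefit can be made concrete. The sketch below compares a response entry that inlines an image as base64 with one that carries only a URI. The 300,000-byte image is an assumed stand-in, not a measurement from a real document.

```python
import base64
import json

# Stand-in for a 300,000-byte embedded image (assumed size, all zero bytes).
image_bytes = bytes(300_000)

# Inline variant: base64 turns every 3 bytes into 4 characters.
encoded = base64.b64encode(image_bytes).decode("ascii")
inline_entry = json.dumps({"mime_type": "image/png", "data": encoded})

# URI variant: the payload is a short reference, independent of image size.
uri_entry = json.dumps(
    {"mime_type": "image/png", "uri": "office://a1b2c3d4e5f6/image/0"}
)

print(len(encoded))       # 400000 characters of base64
print(len(uri_entry))     # well under 100 characters
```

The inline entry grows with every image in the document, while the URI entry stays the same size no matter how large the image is.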
Future Extensions
- Lazy extraction - Only extract when resource is read, not when listed
- Thumbnails - office://{doc_id}/image/{id}?size=thumb
- Format conversion - office://{doc_id}/image/{id}?format=webp
- Expiration - TTL on cached resources
- Persistence - Optional disk-backed store for large documents
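The `?size=` and `?format=` extensions could be read from the URI query string as sketched below. `ALLOWED_OPTIONS` and `parse_options` are hypothetical, and restricting values to `thumb` and `webp` is an assumption based on the only two examples given.

```python
from urllib.parse import parse_qs, urlsplit

# Only the values named in the list above; a real design might allow more.
ALLOWED_OPTIONS = {"size": {"thumb"}, "format": {"webp"}}


def parse_options(uri: str) -> dict:
    """Extract and validate rendering options from an office:// URI query."""
    options = {}
    for key, values in parse_qs(urlsplit(uri).query).items():
        if key not in ALLOWED_OPTIONS:
            raise ValueError(f"Unknown option: {key}")
        value = values[-1]  # if a key repeats, the last occurrence wins
        if value not in ALLOWED_OPTIONS[key]:
            raise ValueError(f"Unsupported value for {key}: {value}")
        options[key] = value
    return options
```

A URI without a query string yields an empty dict, so existing URIs would keep their current meaning.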