# MCP Resources Design for Embedded Office Content

## Overview

Expose embedded content from Office documents as MCP resources, so that clients fetch specific items on demand instead of receiving every item inline in tool responses.

## URI Scheme

```
office://{doc_id}/{resource_type}/{resource_id}
```

**Examples:**
- `office://abc123/image/0` - First image from document abc123
- `office://abc123/chart/revenue-q4` - Named chart
- `office://abc123/media/video-1` - Embedded video
- `office://abc123/embed/attached.pdf` - Embedded PDF

## Supported Resource Types

| Type | MIME Types | Sources |
|------|-----------|---------|
| `image` | image/png, image/jpeg, image/gif, image/wmf, image/emf | All Office formats |
| `chart` | image/png (rendered), application/json (data) | Excel, Word, PowerPoint |
| `media` | audio/*, video/* | PowerPoint, Word |
| `embed` | application/pdf, application/msword, etc. | OLE embedded objects |
| `font` | font/ttf, font/otf | Embedded fonts |
| `slide` | image/png (rendered) | PowerPoint slides as images |

## Document ID Strategy

Documents need stable IDs for resource URIs. Options:

1. **Content hash** - SHA256 of file content (stable across sessions)
2. **Path hash** - Hash of file path (simpler, works for local files)
3. **Session ID** - Random ID per extraction (only valid during session)

**Recommendation:** Use a content hash prefix (the first 12 hex characters of the SHA256 digest), so the same file content always maps to the same ID regardless of path or session.

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         MCP Client                          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  Resource Template: office://{doc_id}/{type}/{resource_id}  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Resource Manager                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ ImageStore  │  │ ChartStore  │  │ MediaStore  │  ...     │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Document Cache                        │
│  { doc_id: { images: [...], charts: [...], media: [...] } } │
└─────────────────────────────────────────────────────────────┘
```

## Implementation

### 1. Resource Store (in-memory cache)

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import hashlib


@dataclass
class EmbeddedResource:
    """Represents an embedded resource from an Office document."""
    resource_id: str
    resource_type: str  # image, chart, media, embed, font, slide
    mime_type: str
    data: bytes
    name: Optional[str] = None  # Original filename if available
    metadata: Optional[dict] = None  # Size, dimensions, etc.


class ResourceStore:
    """Manages extracted resources from Office documents."""

    def __init__(self):
        # doc_id -> resource_type -> resources in extraction order
        self._documents: Dict[str, Dict[str, List[EmbeddedResource]]] = {}

    @staticmethod
    def get_doc_id(file_path: str) -> str:
        """Generate stable document ID from file content."""
        with open(file_path, 'rb') as f:
            content_hash = hashlib.sha256(f.read()).hexdigest()
        return content_hash[:12]

    def store(self, doc_id: str, resource: EmbeddedResource):
        """Store an extracted resource."""
        if doc_id not in self._documents:
            self._documents[doc_id] = {}

        rtype = resource.resource_type
        if rtype not in self._documents[doc_id]:
            self._documents[doc_id][rtype] = []

        self._documents[doc_id][rtype].append(resource)

    def get(self, doc_id: str, resource_type: str, resource_id: str) -> Optional[EmbeddedResource]:
        """Retrieve a specific resource."""
        if doc_id not in self._documents:
            return None

        resources = self._documents[doc_id].get(resource_type, [])

        # Try by index first
        if resource_id.isdigit():
            idx = int(resource_id)
            if 0 <= idx < len(resources):
                return resources[idx]

        # Try by resource ID or original filename
        for r in resources:
            if r.resource_id == resource_id or r.name == resource_id:
                return r

        return None

    def list_resources(self, doc_id: str) -> Dict[str, List[dict]]:
        """List all resources for a document."""
        if doc_id not in self._documents:
            return {}

        result = {}
        for rtype, resources in self._documents[doc_id].items():
            result[rtype] = [
                {
                    "id": r.resource_id,
                    "name": r.name,
                    "mime_type": r.mime_type,
                    "uri": f"office://{doc_id}/{rtype}/{r.resource_id}"
                }
                for r in resources
            ]
        return result


# Global instance
resource_store = ResourceStore()
```

### 2. Resource Template Registration

```python
from fastmcp import FastMCP

app = FastMCP("MCP Office Tools")


@app.resource(
    "office://{doc_id}/{resource_type}/{resource_id}",
    name="office_embedded_resource",
    description="Embedded content from Office documents (images, charts, media, etc.)"
)
def get_office_resource(doc_id: str, resource_type: str, resource_id: str) -> bytes:
    """Retrieve embedded resource from an Office document."""
    resource = resource_store.get(doc_id, resource_type, resource_id)
    if resource is None:
        raise ValueError(
            f"Resource not found: office://{doc_id}/{resource_type}/{resource_id}"
        )
    return resource.data
```

### 3. Integration with extract_images Tool

Modify `extract_images` to populate the resource store:

```python
@mcp_tool(name="extract_images")
async def extract_images(self, file_path: str, ...) -> dict:
    # ... existing extraction logic ...

    doc_id = ResourceStore.get_doc_id(resolved_path)

    for idx, image_data in enumerate(extracted_images):
        resource = EmbeddedResource(
            resource_id=str(idx),
            resource_type="image",
            mime_type=image_data["mime_type"],
            data=image_data["bytes"],
            name=image_data.get("filename"),
            metadata={"width": ..., "height": ...}
        )
        resource_store.store(doc_id, resource)

    # Return URIs instead of base64 data
    return {
        "doc_id": doc_id,
        "images": [
            {
                "uri": f"office://{doc_id}/image/{idx}",
                "mime_type": img["mime_type"],
                "dimensions": {...}
            }
            for idx, img in enumerate(extracted_images)
        ],
        "message": "Use resource URIs to fetch image data"
    }
```

### 4. New Tool: list_embedded_resources

```python
@mcp_tool(name="list_embedded_resources")
async def list_embedded_resources(
    self,
    file_path: str,
    resource_types: str = "all"  # "all", "image", "chart", "media", etc.
) -> dict:
    """
    Scan document and return URIs for all embedded resources.
    Does not extract content - just identifies what's available.
    """
    # ... resolve file_path to resolved_path, as in extract_images ...

    doc_id = ResourceStore.get_doc_id(resolved_path)

    # Scan document for resources
    resources = scan_for_resources(resolved_path, resource_types)

    # Store metadata (not content yet - lazy loading)
    for r in resources:
        resource_store.store(doc_id, r)

    # Count from the stored listing, which is grouped by resource type
    listing = resource_store.list_resources(doc_id)
    return {
        "doc_id": doc_id,
        "resources": listing,
        "total_count": sum(len(v) for v in listing.values())
    }
```

## Usage Flow

1. **Client extracts images or lists resources:**
   ```
   → list_embedded_resources("report.docx")
   ← { "doc_id": "a1b2c3d4e5f6", "resources": { "image": [...], "chart": [...] } }
   ```

2. **Client fetches specific resource via URI:**
   ```
   → read_resource("office://a1b2c3d4e5f6/image/0")
   ← <image bytes>
   ```

3. **Resources remain available for the session** (or until the cache expires, once expiration is implemented)

## Benefits

1. **Smaller tool responses** - URIs instead of base64 blobs
2. **On-demand fetching** - Client only loads what it needs
3. **Unified access** - Same pattern for images, charts, media, embeds
4. **Cacheable** - Content-derived document IDs let clients cache fetched resources
5. **Discoverable** - `list_embedded_resources` shows what's available

## Future Extensions

- **Lazy extraction** - Only extract when resource is read, not when listed
- **Thumbnails** - `office://{doc_id}/image/{id}?size=thumb`
- **Format conversion** - `office://{doc_id}/image/{id}?format=webp`
- **Expiration** - TTL on cached resources
- **Persistence** - Optional disk-backed store for large documents
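
The "Lazy extraction" and "Expiration" items above could be combined in one small store. The following is a minimal sketch under stated assumptions, not part of the current design: `LazyTTLStore`, `LazyEntry`, the `loader` callback, and the `ttl_seconds` and `clock` parameters are all hypothetical names introduced for illustration. Each resource is registered with a callable that produces its bytes, extraction runs only on the first read, and entries stop being served once their time-to-live has elapsed.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# (doc_id, resource_type, resource_id), mirroring the office:// URI parts
Key = Tuple[str, str, str]


@dataclass
class LazyEntry:
    """A registered resource whose bytes are produced only on first read."""
    loader: Callable[[], bytes]   # hypothetical callback that extracts the bytes
    expires_at: float             # deadline on the store's clock
    data: Optional[bytes] = None  # filled in by the first successful read


class LazyTTLStore:
    """Illustrative store combining lazy loading with per-entry expiration."""

    def __init__(self, ttl_seconds: float = 900.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Key, LazyEntry] = {}

    def register(self, key: Key, loader: Callable[[], bytes]) -> None:
        """Record that a resource exists without extracting it."""
        self._entries[key] = LazyEntry(loader, self._clock() + self._ttl)

    def read(self, key: Key) -> Optional[bytes]:
        """Return the resource bytes, or None if unknown or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        if entry.data is None:
            entry.data = entry.loader()  # extraction happens here, once
        return entry.data

    def purge_expired(self) -> int:
        """Drop every expired entry and return how many were removed."""
        now = self._clock()
        stale = [k for k, e in self._entries.items() if now >= e.expires_at]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

In this sketch `list_embedded_resources` would call `register` for each resource it finds, and the resource handler would call `read`. A fixed TTL measured from registration is the simplest policy; refreshing the deadline on each read (a sliding TTL) is an equally plausible choice that the design above leaves open.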