mcp-office-tools/docs/RESOURCE_DESIGN.md

# MCP Resources Design for Embedded Office Content

## Overview

Expose embedded content from Office documents as MCP resources, so that clients can fetch specific items on demand instead of receiving every payload inline in tool responses.

## URI Scheme

```
office://{doc_id}/{resource_type}/{resource_id}
```

**Examples:**
- `office://abc123/image/0` - First image from document abc123
- `office://abc123/chart/revenue-q4` - Named chart
- `office://abc123/media/video-1` - Embedded video
- `office://abc123/embed/attached.pdf` - Embedded PDF
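
A minimal sketch of how such a URI might be split into its parts with the standard library. The names `OfficeURI`, `parse_office_uri`, and `build_office_uri` are illustrative and not part of this design, and percent-encoding the `resource_id` is an assumption made here so that names containing spaces or slashes survive a round trip.

```python
from typing import NamedTuple
from urllib.parse import quote, unquote, urlsplit


class OfficeURI(NamedTuple):
    doc_id: str
    resource_type: str
    resource_id: str


def parse_office_uri(uri: str) -> OfficeURI:
    """Split office://{doc_id}/{resource_type}/{resource_id} into its parts."""
    parts = urlsplit(uri)
    if parts.scheme != "office" or not parts.netloc:
        raise ValueError(f"Not an office:// URI: {uri!r}")
    # The path looks like "/image/0"; expect exactly two non-empty segments.
    segments = parts.path.lstrip("/").split("/")
    if len(segments) != 2 or not all(segments):
        raise ValueError(f"Expected office://doc_id/type/id, got: {uri!r}")
    resource_type, resource_id = segments
    return OfficeURI(parts.netloc, resource_type, unquote(resource_id))


def build_office_uri(doc_id: str, resource_type: str, resource_id: str) -> str:
    """Inverse of parse_office_uri; percent-encodes the resource ID (assumption)."""
    return f"office://{doc_id}/{resource_type}/{quote(resource_id, safe='')}"
```

With this sketch, `parse_office_uri("office://abc123/image/0")` yields the three fields separately, and `build_office_uri` produces a URI that parses back to the same values.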

## Supported Resource Types

| Type | MIME Types | Sources |
|------|-----------|---------|
| `image` | image/png, image/jpeg, image/gif, image/wmf, image/emf | All Office formats |
| `chart` | image/png (rendered), application/json (data) | Excel, Word, PowerPoint |
| `media` | audio/*, video/* | PowerPoint, Word |
| `embed` | application/pdf, application/msword, etc. | OLE embedded objects |
| `font` | font/ttf, font/otf | Embedded fonts |
| `slide` | image/png (rendered) | PowerPoint slides as images |
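
The table above could be encoded as a lookup so that a store rejects mismatched type/MIME pairs. This is only a sketch: `ALLOWED_MIME_TYPES` and `is_allowed` are hypothetical names, and because the `embed` row ends with "etc.", the `application/*` wildcard used for it is an assumption rather than a documented rule.

```python
from fnmatch import fnmatchcase

# Allowed MIME patterns per resource type, transcribed from the table above.
# The "embed" row is open-ended ("etc."), so application/* is an assumption.
ALLOWED_MIME_TYPES = {
    "image": ["image/png", "image/jpeg", "image/gif", "image/wmf", "image/emf"],
    "chart": ["image/png", "application/json"],
    "media": ["audio/*", "video/*"],
    "embed": ["application/*"],
    "font": ["font/ttf", "font/otf"],
    "slide": ["image/png"],
}


def is_allowed(resource_type: str, mime_type: str) -> bool:
    """Return True if mime_type matches one of the patterns for resource_type."""
    patterns = ALLOWED_MIME_TYPES.get(resource_type, [])
    return any(fnmatchcase(mime_type.lower(), pattern) for pattern in patterns)
```

Note that `image/png` appears under `image`, `chart`, and `slide`, so the resource type cannot be inferred from the MIME type alone; it has to come from where the item was found in the document.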

## Document ID Strategy

Documents need stable IDs for resource URIs. Options:

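1. **Content hash** - SHA-256 of the file content (stable across sessions and file moves; changes whenever the file is edited)
2. **Path hash** - Hash of the file path (simpler, works for local files, but stays the same when the file's content changes)
3. **Session ID** - Random ID per extraction (only valid during the session)

**Recommendation:** Use a content hash prefix (the first 12 hex characters of the SHA-256 digest, i.e. 48 bits) for stability.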
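
Truncating the hash trades uniqueness for shorter URIs. The sketch below estimates the chance that any two documents share a 12-character ID, using the standard birthday-bound approximation and assuming SHA-256 output is uniformly distributed. The function name `collision_probability` is illustrative.

```python
import math


def collision_probability(num_documents: int, hex_chars: int = 12) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / (2N)), with N = 16**hex_chars."""
    id_space = 16 ** hex_chars
    pairs = num_documents * (num_documents - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-pairs / id_space)


for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>10,} documents -> p(collision) ~ {collision_probability(n):.2e}")
```

For a cache holding thousands of documents the probability is negligible; it only becomes non-trivial for stores holding millions of documents, at which point a longer prefix would be the simple remedy.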

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     MCP Client                               │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  Resource Template: office://{doc_id}/{type}/{resource_id}  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Resource Manager                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ ImageStore  │  │ ChartStore  │  │ MediaStore  │  ...     │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Document Cache                            │
│  { doc_id: { images: [...], charts: [...], media: [...] } } │
└─────────────────────────────────────────────────────────────┘
```

## Implementation

### 1. Resource Store (in-memory cache)

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import hashlib

@dataclass
class EmbeddedResource:
    """Represents an embedded resource from an Office document."""
    resource_id: str
    resource_type: str  # image, chart, media, embed, font, slide
    mime_type: str
    data: bytes
    name: Optional[str] = None  # Original filename if available
    metadata: Optional[dict] = None  # Size, dimensions, etc.

class ResourceStore:
    """Manages extracted resources from Office documents."""

    def __init__(self):
        self._documents: Dict[str, Dict[str, List[EmbeddedResource]]] = {}

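    @staticmethod
    def get_doc_id(file_path: str) -> str:
        """Generate stable document ID from file content."""
        digest = hashlib.sha256()
        with open(file_path, 'rb') as f:
            # Hash in 1 MiB chunks so large documents are not read into memory at once
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()[:12]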

    def store(self, doc_id: str, resource: EmbeddedResource):
        """Store an extracted resource, replacing any earlier entry with the same ID."""
        by_type = self._documents.setdefault(doc_id, {})
        resources = by_type.setdefault(resource.resource_type, [])
        # Re-running a tool on the same document must not append duplicates
        for i, existing in enumerate(resources):
            if existing.resource_id == resource.resource_id:
                resources[i] = resource
                return
        resources.append(resource)

    def get(self, doc_id: str, resource_type: str, resource_id: str) -> Optional[EmbeddedResource]:
        """Retrieve a specific resource."""
        if doc_id not in self._documents:
            return None
        resources = self._documents[doc_id].get(resource_type, [])

        # Try by index first
        if resource_id.isdigit():
            idx = int(resource_id)
            if 0 <= idx < len(resources):
                return resources[idx]

        # Try by name
        for r in resources:
            if r.resource_id == resource_id or r.name == resource_id:
                return r
        return None

    def list_resources(self, doc_id: str) -> Dict[str, List[dict]]:
        """List all resources for a document."""
        if doc_id not in self._documents:
            return {}

        result = {}
        for rtype, resources in self._documents[doc_id].items():
            result[rtype] = [
                {
                    "id": r.resource_id,
                    "name": r.name,
                    "mime_type": r.mime_type,
                    "uri": f"office://{doc_id}/{rtype}/{r.resource_id}"
                }
                for r in resources
            ]
        return result

# Global instance
resource_store = ResourceStore()
```

### 2. Resource Template Registration

```python
from fastmcp import FastMCP

app = FastMCP("MCP Office Tools")

@app.resource(
    "office://{doc_id}/{resource_type}/{resource_id}",
    name="office_embedded_resource",
    description="Embedded content from Office documents (images, charts, media, etc.)"
)
def get_office_resource(doc_id: str, resource_type: str, resource_id: str) -> bytes:
    """Retrieve embedded resource from an Office document."""
    resource = resource_store.get(doc_id, resource_type, resource_id)
    if resource is None:
        raise ValueError(
            f"Resource not found: office://{doc_id}/{resource_type}/{resource_id}"
        )
    return resource.data
```

### 3. Integration with extract_images Tool

Modify `extract_images` to populate the resource store:

```python
@mcp_tool(name="extract_images")
async def extract_images(self, file_path: str, ...) -> dict:
    # ... existing extraction logic (produces resolved_path and extracted_images) ...

    doc_id = ResourceStore.get_doc_id(resolved_path)

    for idx, image_data in enumerate(extracted_images):
        resource = EmbeddedResource(
            resource_id=str(idx),
            resource_type="image",
            mime_type=image_data["mime_type"],
            data=image_data["bytes"],
            name=image_data.get("filename"),
            metadata={"width": ..., "height": ...}
        )
        resource_store.store(doc_id, resource)

    # Return URIs instead of base64 data
    return {
        "doc_id": doc_id,
        "images": [
            {
                "uri": f"office://{doc_id}/image/{idx}",
                "mime_type": img["mime_type"],
                "dimensions": {...}
            }
            for idx, img in enumerate(extracted_images)
        ],
        "message": "Use resource URIs to fetch image data"
    }
```

### 4. New Tool: list_embedded_resources

```python
@mcp_tool(name="list_embedded_resources")
async def list_embedded_resources(
    self,
    file_path: str,
    resource_types: str = "all"  # "all", "image", "chart", "media", etc.
) -> dict:
    """
    Scan document and return URIs for all embedded resources.
    Returns metadata and URIs only; content is fetched via the resource URIs.
    """
    # resolved_path: local path produced by the existing path-resolution logic
    doc_id = ResourceStore.get_doc_id(resolved_path)

    # Scan document for resources.
    # Assumed to return {resource_type: [EmbeddedResource, ...]}
    resources = scan_for_resources(resolved_path, resource_types)

    # Register every resource so its URI resolves
    # (deferring content extraction is listed under Future Extensions)
    for group in resources.values():
        for r in group:
            resource_store.store(doc_id, r)

    return {
        "doc_id": doc_id,
        "resources": resource_store.list_resources(doc_id),
        "total_count": sum(len(v) for v in resources.values())
    }
```
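
For the OOXML formats (`.docx`, `.pptx`, `.xlsx`), which are ZIP packages, a scan that identifies resources without reading their bytes could look roughly like the sketch below. It is not the `scan_for_resources` helper referenced above: `scan_ooxml_parts` is a hypothetical name, it returns plain dicts rather than `EmbeddedResource` objects, the folder prefixes are the conventional locations rather than a guarantee, and legacy binary formats (`.doc`, `.xls`, `.ppt`) are not covered.

```python
import mimetypes
import os
import zipfile
from typing import Dict, List

# Conventional part-name prefixes for media and embedded objects in OOXML packages.
MEDIA_PREFIXES = ("word/media/", "ppt/media/", "xl/media/")
EMBED_PREFIXES = ("word/embeddings/", "ppt/embeddings/", "xl/embeddings/")
# Extensions the mimetypes module may not know on every platform.
EXTRA_MIME = {".emf": "image/emf", ".wmf": "image/wmf"}


def scan_ooxml_parts(package) -> Dict[str, List[dict]]:
    """List media and embedded parts without decompressing their content.

    `package` is a file path or a binary file-like object.
    """
    found: Dict[str, List[dict]] = {"image": [], "media": [], "embed": []}
    with zipfile.ZipFile(package) as zf:
        for info in zf.infolist():  # uses only the ZIP central directory
            name = info.filename
            ext = os.path.splitext(name)[1].lower()
            mime = (EXTRA_MIME.get(ext)
                    or mimetypes.guess_type(name)[0]
                    or "application/octet-stream")
            if name.startswith(MEDIA_PREFIXES):
                rtype = "image" if mime.startswith("image/") else "media"
            elif name.startswith(EMBED_PREFIXES):
                rtype = "embed"
            else:
                continue  # document XML, relationships, styles, etc.
            found[rtype].append(
                {"part": name, "mime_type": mime, "size": info.file_size}
            )
    return found
```

Because only the central directory is consulted, the cost of listing is independent of how large the embedded items are, which fits the intent of returning URIs first and content later.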

## Usage Flow

1. **Client extracts images or lists resources:**
   ```
   → list_embedded_resources("report.docx")
   ← { "doc_id": "a1b2c3d4e5f6", "resources": { "image": [...], "chart": [...] } }
   ```

2. **Client fetches specific resource via URI:**
   ```
   → read_resource("office://a1b2c3d4e5f6/image/0")
   ← <binary PNG data>
   ```

3. **Resources remain available for the session** (the store is in-memory, so they are lost when the server restarts; TTL-based expiry is listed under Future Extensions)

## Benefits

1. **Smaller tool responses** - URIs instead of base64 blobs
2. **On-demand fetching** - Client only loads what it needs
3. **Unified access** - Same pattern for images, charts, media, embeds
4. **Cacheable** - Document IDs are derived from file content, so a given URI keeps pointing at the same data and clients can cache by URI
5. **Discoverable** - `list_embedded_resources` shows what's available
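
To put a rough number on the first benefit: Base64 encodes every 3 bytes as 4 characters, so inlined binary data grows by about a third before any JSON overhead. The sketch below compares the two response shapes for a single image; the 300,000-byte payload is a made-up stand-in, not a measurement.

```python
import base64
import json

image_bytes = bytes(300_000)  # stand-in for an embedded PNG of 300,000 bytes

# Response shape that inlines the image as base64
inline = {
    "images": [
        {
            "mime_type": "image/png",
            "data": base64.b64encode(image_bytes).decode("ascii"),
        }
    ]
}
# Response shape that returns only a resource URI
by_uri = {
    "images": [
        {"mime_type": "image/png", "uri": "office://a1b2c3d4e5f6/image/0"}
    ]
}

inline_size = len(json.dumps(inline))
uri_size = len(json.dumps(by_uri))
print(f"inline base64: {inline_size:,} bytes, URI only: {uri_size:,} bytes")
```

The URI-only response stays the same size no matter how large the image is, while the inline response scales with every embedded item in the document.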

## Future Extensions

- **Lazy extraction** - Only extract when resource is read, not when listed
- **Thumbnails** - `office://{doc_id}/image/{id}?size=thumb`
- **Format conversion** - `office://{doc_id}/image/{id}?format=webp`
- **Expiration** - TTL on cached resources
- **Persistence** - Optional disk-backed store for large documents
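
As an illustration of the expiration idea, here is a minimal sketch of a TTL cache keyed by resource URI. `ExpiringCache` is a hypothetical helper, not part of the current `ResourceStore`; it evicts lazily on access and takes an injectable clock so that expiry can be tested without sleeping.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringCache:
    """Keeps values for `ttl_seconds` after they were stored (hypothetical helper)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, uri: str, data: bytes) -> None:
        """Store data under uri, valid until now + ttl."""
        self._entries[uri] = (self._clock() + self._ttl, data)

    def get(self, uri: str) -> Optional[bytes]:
        """Return cached data, or None if missing or expired."""
        entry = self._entries.get(uri)
        if entry is None:
            return None
        expires_at, data = entry
        if self._clock() >= expires_at:
            del self._entries[uri]  # evict lazily on access
            return None
        return data
```

Lazy eviction keeps the sketch short, but entries that are never read again stay in memory; a periodic sweep or a size cap would be needed to bound memory use in practice.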