
Add MCP resource system for embedded document content

Implements URI-based access to document content with:

- ResourceStore for caching extracted images, chapters, sheets, slides
- Content-based document IDs (SHA256 hash) for stable URIs across sessions
- 11 resource templates with flexible URI patterns:
  - Binary: image://, chart://, media://, embed://
  - Text: chapter://, section://, sheet://, slide://
  - Ranges: chapters://doc/1-5, slides://doc/1,3,5
  - Hierarchical: paragraph://doc/3/5

- Format suffixes for output control:
  - chapter://doc/3.md (default markdown)
  - chapter://doc/3.txt (plain text)
  - chapter://doc/3.html (basic HTML)

- index_document tool scans and populates resources:
  - Word: chapters as markdown, embedded images
  - Excel: sheets as markdown tables
  - PowerPoint: slides as markdown

Tool responses return URIs instead of blobs - clients fetch only what they need.

2026-01-11 09:04:29 -07:00


MCP Resources Design for Embedded Office Content

Overview

Expose embedded content from Office documents as MCP resources, allowing clients to fetch specific items on-demand rather than bloating tool responses.

URI Scheme

office://{doc_id}/{resource_type}/{resource_id}

Examples:

office://abc123/image/0 - First image from document abc123
office://abc123/chart/revenue-q4 - Named chart
office://abc123/media/video-1 - Embedded video
office://abc123/embed/attached.pdf - Embedded PDF
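A minimal sketch of how such a URI could be split into its three parts, using only the standard library. The names `OfficeUri` and `parse_office_uri` are hypothetical and not part of the design above; the sketch also assumes the resource ID never contains a further `/`.

```python
from typing import NamedTuple
from urllib.parse import unquote, urlsplit


class OfficeUri(NamedTuple):
    doc_id: str
    resource_type: str
    resource_id: str


def parse_office_uri(uri: str) -> OfficeUri:
    """Split office://{doc_id}/{resource_type}/{resource_id} into its parts."""
    parts = urlsplit(uri)
    if parts.scheme != "office":
        raise ValueError(f"Not an office:// URI: {uri!r}")
    # urlsplit places the doc_id in netloc and the remainder in path.
    segments = parts.path.strip("/").split("/")
    if not parts.netloc or len(segments) != 2 or not all(segments):
        raise ValueError(f"Expected office://doc_id/type/id, got {uri!r}")
    resource_type, resource_id = segments
    return OfficeUri(parts.netloc, resource_type, unquote(resource_id))


named_chart = parse_office_uri("office://abc123/chart/revenue-q4")
embedded_pdf = parse_office_uri("office://abc123/embed/attached.pdf")
```

Numeric IDs such as `0` stay as strings here; deciding whether an ID is an index or a name is left to the store.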

Supported Resource Types

Type	MIME Types	Sources
`image`	image/png, image/jpeg, image/gif, image/wmf, image/emf	All Office formats
`chart`	image/png (rendered), application/json (data)	Excel, Word, PowerPoint
`media`	audio/*, video/*	PowerPoint, Word
`embed`	application/pdf, application/msword, etc.	OLE embedded objects
`font`	font/ttf, font/otf	Embedded fonts
`slide`	image/png (rendered)	PowerPoint slides as images

Document ID Strategy

Documents need stable IDs for resource URIs. Options:

Content hash - SHA256 of file content (stable across sessions)
Path hash - Hash of file path (simpler, works for local files)
Session ID - Random ID per extraction (only valid during session)

Recommendation: Use content hash prefix (first 12 chars of SHA256) for stability.
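To illustrate the trade-off between the first two options, here is a small sketch comparing a content-based ID with a path-based one. The helper names `content_doc_id` and `path_doc_id` are assumptions for illustration, and the chunked read is one possible choice for large files rather than something this design prescribes. A 12-character hex prefix carries 48 bits of the hash.

```python
import hashlib
import os
import shutil
import tempfile


def content_doc_id(file_path: str, length: int = 12) -> str:
    """Doc ID from file bytes, read in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def path_doc_id(file_path: str, length: int = 12) -> str:
    """Doc ID from the absolute path only; it changes whenever the file moves."""
    absolute = os.path.abspath(file_path)
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:length]


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "report.docx")
    with open(original, "wb") as f:
        f.write(b"placeholder document bytes")
    moved = os.path.join(tmp, "renamed.docx")
    shutil.copyfile(original, moved)

    # Same bytes under a different name: content ID matches, path ID does not.
    same_content_id = content_doc_id(original) == content_doc_id(moved)
    same_path_id = path_doc_id(original) == path_doc_id(moved)
```

The flip side is that editing a document changes its content-based ID, so URIs issued for the old version no longer resolve to the new one.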

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     MCP Client                               │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  Resource Template: office://{doc_id}/{type}/{resource_id}  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Resource Manager                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ ImageStore  │  │ ChartStore  │  │ MediaStore  │  ...     │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Document Cache                            │
│  { doc_id: { images: [...], charts: [...], media: [...] } } │
└─────────────────────────────────────────────────────────────┘
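The layering in the diagram could be sketched roughly as follows. `TypeStore` and `ResourceManager` are hypothetical names standing in for the per-type stores and the manager box; the Implementation section below uses a single `ResourceStore` instead, so treat this only as an illustration of the routing idea.

```python
from typing import Dict, Optional, Tuple


class TypeStore:
    """Holds the payloads for one resource type (image, chart, media, ...)."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], bytes] = {}

    def put(self, doc_id: str, resource_id: str, data: bytes) -> None:
        self._items[(doc_id, resource_id)] = data

    def get(self, doc_id: str, resource_id: str) -> Optional[bytes]:
        return self._items.get((doc_id, resource_id))


class ResourceManager:
    """Routes a (doc_id, type, resource_id) request to the matching store."""

    def __init__(self) -> None:
        self._stores: Dict[str, TypeStore] = {}

    def store_for(self, resource_type: str) -> TypeStore:
        # Return the per-type store, creating it on first use.
        return self._stores.setdefault(resource_type, TypeStore())

    def read(self, doc_id: str, resource_type: str, resource_id: str) -> Optional[bytes]:
        store = self._stores.get(resource_type)
        return None if store is None else store.get(doc_id, resource_id)


manager = ResourceManager()
manager.store_for("image").put("abc123", "0", b"png-bytes")
manager.store_for("chart").put("abc123", "revenue-q4", b"chart-bytes")
```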

Implementation

1. Resource Store (in-memory cache)

from dataclasses import dataclass
from typing import Dict, List, Optional
import hashlib

@dataclass
class EmbeddedResource:
    """Represents an embedded resource from an Office document."""
    resource_id: str
    resource_type: str  # image, chart, media, embed, font, slide
    mime_type: str
    data: bytes
    name: Optional[str] = None  # Original filename if available
    metadata: Optional[dict] = None  # Size, dimensions, etc.

class ResourceStore:
    """Manages extracted resources from Office documents."""

    def __init__(self):
        self._documents: Dict[str, Dict[str, List[EmbeddedResource]]] = {}

    @staticmethod
    def get_doc_id(file_path: str) -> str:
        """Generate stable document ID from file content."""
        with open(file_path, 'rb') as f:
            content_hash = hashlib.sha256(f.read()).hexdigest()
        return content_hash[:12]

    def store(self, doc_id: str, resource: EmbeddedResource):
        """Store an extracted resource."""
        if doc_id not in self._documents:
            self._documents[doc_id] = {}
        rtype = resource.resource_type
        if rtype not in self._documents[doc_id]:
            self._documents[doc_id][rtype] = []
        self._documents[doc_id][rtype].append(resource)

    def get(self, doc_id: str, resource_type: str, resource_id: str) -> Optional[EmbeddedResource]:
        """Retrieve a specific resource."""
        if doc_id not in self._documents:
            return None
        resources = self._documents[doc_id].get(resource_type, [])

        # Try by index first
        if resource_id.isdigit():
            idx = int(resource_id)
            if 0 <= idx < len(resources):
                return resources[idx]

        # Try by name
        for r in resources:
            if r.resource_id == resource_id or r.name == resource_id:
                return r
        return None

    def list_resources(self, doc_id: str) -> Dict[str, List[dict]]:
        """List all resources for a document."""
        if doc_id not in self._documents:
            return {}

        result = {}
        for rtype, resources in self._documents[doc_id].items():
            result[rtype] = [
                {
                    "id": r.resource_id,
                    "name": r.name,
                    "mime_type": r.mime_type,
                    "uri": f"office://{doc_id}/{rtype}/{r.resource_id}"
                }
                for r in resources
            ]
        return result

# Global instance
resource_store = ResourceStore()
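One thing to watch in the store above: store() appends unconditionally, so indexing the same document twice would list every resource twice. A minimal sketch of one way to make storing idempotent, assuming resource_id is unique within a type. The helper `store_once` is hypothetical, and plain dicts stand in for EmbeddedResource to keep the example short.

```python
from typing import List


def store_once(bucket: List[dict], resource: dict) -> None:
    """Append resource, or replace the entry that already has its resource_id."""
    for i, existing in enumerate(bucket):
        if existing["resource_id"] == resource["resource_id"]:
            bucket[i] = resource
            return
    bucket.append(resource)


images: List[dict] = []
for _ in range(2):  # simulate running extract_images twice on one document
    store_once(images, {"resource_id": "0", "data": b"first"})
    store_once(images, {"resource_id": "1", "data": b"second"})
```

After both passes the list still holds two entries, so index-based lookups and the listing stay stable.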

2. Resource Template Registration

from fastmcp import FastMCP

app = FastMCP("MCP Office Tools")

@app.resource(
    "office://{doc_id}/{resource_type}/{resource_id}",
    name="office_embedded_resource",
    description="Embedded content from Office documents (images, charts, media, etc.)"
)
def get_office_resource(doc_id: str, resource_type: str, resource_id: str) -> bytes:
    """Retrieve embedded resource from an Office document."""
    resource = resource_store.get(doc_id, resource_type, resource_id)
    if resource is None:
        raise ValueError(
            f"Resource not found: office://{doc_id}/{resource_type}/{resource_id}"
        )
    return resource.data

3. Integration with extract_images Tool

Modify extract_images to populate the resource store:

@mcp_tool(name="extract_images")
async def extract_images(self, file_path: str, ...) -> dict:
    # ... existing extraction logic ...

    doc_id = ResourceStore.get_doc_id(resolved_path)

    for idx, image_data in enumerate(extracted_images):
        resource = EmbeddedResource(
            resource_id=str(idx),
            resource_type="image",
            mime_type=image_data["mime_type"],
            data=image_data["bytes"],
            name=image_data.get("filename"),
            metadata={"width": ..., "height": ...}
        )
        resource_store.store(doc_id, resource)

    # Return URIs instead of base64 data
    return {
        "doc_id": doc_id,
        "images": [
            {
                "uri": f"office://{doc_id}/image/{idx}",
                "mime_type": img["mime_type"],
                "dimensions": {...}
            }
            for idx, img in enumerate(extracted_images)
        ],
        "message": "Use resource URIs to fetch image data"
    }

4. New Tool: list_embedded_resources

@mcp_tool(name="list_embedded_resources")
async def list_embedded_resources(
    self,
    file_path: str,
    resource_types: str = "all"  # "all", "image", "chart", "media", etc.
) -> dict:
    """
    Scan document and return URIs for all embedded resources.
    Does not extract content - just identifies what's available.
    """
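    # ... resolve file_path to resolved_path, as in extract_images ...
    doc_id = ResourceStore.get_doc_id(resolved_path)

    # Scan document for resources (an iterable of EmbeddedResource items)
    resources = scan_for_resources(resolved_path, resource_types)

    # Store metadata (not content yet - lazy loading)
    for r in resources:
        resource_store.store(doc_id, r)

    # Count from the listing: `resources` is iterated as a sequence above,
    # so it has no .values()
    listing = resource_store.list_resources(doc_id)
    return {
        "doc_id": doc_id,
        "resources": listing,
        "total_count": sum(len(v) for v in listing.values())
    }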

Usage Flow

Client extracts images or lists resources:

→ list_embedded_resources("report.docx")
← { "doc_id": "a1b2c3d4e5f6", "resources": { "image": [...], "chart": [...] } }

Client fetches specific resource via URI:

→ read_resource("office://a1b2c3d4e5f6/image/0")
← <binary PNG data>

Resources remain available for the session (or until cache expires)
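The flow above can be simulated in-process without an MCP server. Both functions below are hypothetical stand-ins (a stub for the tool and a stub for the resource read), and the module-level dict plays the role of the session cache.

```python
from typing import Dict, List, Tuple

# Stand-in for the server-side cache: (doc_id, type, id) -> bytes.
_cache: Dict[Tuple[str, str, str], bytes] = {}


def list_embedded_resources_stub(doc_id: str, images: List[bytes]) -> dict:
    """Stand-in for the tool: caches the data and returns only URIs."""
    entries = []
    for idx, data in enumerate(images):
        _cache[(doc_id, "image", str(idx))] = data
        entries.append({"uri": f"office://{doc_id}/image/{idx}", "size": len(data)})
    return {"doc_id": doc_id, "resources": {"image": entries}}


def read_resource_stub(uri: str) -> bytes:
    """Stand-in for an MCP resource read: resolves a URI against the cache."""
    _, _, rest = uri.partition("://")
    doc_id, rtype, rid = rest.split("/", 2)
    return _cache[(doc_id, rtype, rid)]


# Step 1: the tool call returns URIs, not image bytes.
listing = list_embedded_resources_stub("a1b2c3d4e5f6", [b"\x89PNG-one", b"\x89PNG-two"])
# Step 2: the client fetches only the resource it needs.
first_uri = listing["resources"]["image"][0]["uri"]
payload = read_resource_stub(first_uri)
```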

Benefits

Smaller tool responses - URIs instead of base64 blobs
On-demand fetching - Client only loads what it needs
Unified access - Same pattern for images, charts, media, embeds
Cacheable - Document ID enables client-side caching
Discoverable - list_embedded_resources shows what's available
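To put a rough number on the first benefit: base64 encodes every 3 bytes as 4 characters, so an inline image grows by about a third, while a URI response stays a few dozen bytes regardless of image size. The 300,000-byte payload below is a placeholder, not a measurement from a real document.

```python
import base64
import json

image_bytes = bytes(300_000)  # placeholder for a 300,000-byte embedded image

# Response that inlines the image as base64.
inline_response = json.dumps({
    "mime_type": "image/png",
    "data": base64.b64encode(image_bytes).decode("ascii"),
})

# Response that only points at the resource.
uri_response = json.dumps({
    "mime_type": "image/png",
    "uri": "office://a1b2c3d4e5f6/image/0",
})

encoded_len = len(base64.b64encode(image_bytes))  # 4 * ceil(300000 / 3) = 400000
```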

Future Extensions

Lazy extraction - Only extract when resource is read, not when listed
Thumbnails - office://{doc_id}/image/{id}?size=thumb
Format conversion - office://{doc_id}/image/{id}?format=webp
Expiration - TTL on cached resources
Persistence - Optional disk-backed store for large documents
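As a rough sketch of how lazy extraction and expiration might combine: a cache that records a loader per URI at listing time, runs it on first read, and re-runs it once the entry is older than a TTL. `LazyTtlCache` and its methods are hypothetical names, and the injectable clock exists only so the behaviour can be shown without sleeping.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class LazyTtlCache:
    """Caches loader results per URI and re-runs the loader after ttl seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._loaders: Dict[str, Callable[[], bytes]] = {}
        self._data: Dict[str, Tuple[float, bytes]] = {}

    def register(self, uri: str, loader: Callable[[], bytes]) -> None:
        """Record how to extract a resource without extracting it yet."""
        self._loaders[uri] = loader

    def read(self, uri: str) -> Optional[bytes]:
        """Return cached bytes, extracting on first read or after expiry."""
        loader = self._loaders.get(uri)
        if loader is None:
            return None
        now = self._clock()
        cached = self._data.get(uri)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        data = loader()
        self._data[uri] = (now, data)
        return data


extraction_log: List[str] = []
fake_now = [0.0]  # mutable cell used as a controllable clock


def extract_first_image() -> bytes:
    extraction_log.append("extract")
    return b"image-bytes"


cache = LazyTtlCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.register("office://abc123/image/0", extract_first_image)
count_after_register = len(extraction_log)   # listing alone extracts nothing

cache.read("office://abc123/image/0")        # first read runs the loader
cache.read("office://abc123/image/0")        # second read is served from cache
count_before_expiry = len(extraction_log)

fake_now[0] = 61.0                           # move past the 60-second TTL
cache.read("office://abc123/image/0")        # expired entry is extracted again
count_after_expiry = len(extraction_log)
```

This variant refreshes expired entries on the next read rather than evicting them in the background, which keeps the sketch free of timers or threads.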
