- Add complete PDF form lifecycle management - Create new forms with text, checkbox, dropdown, signature fields - Fill existing forms with JSON data and optional flattening - Add fields to existing PDFs with flexible positioning - Advanced field types: radio groups, textareas, date fields - Comprehensive validation engine with regex patterns - Email, phone, number, date format validation - Required field checking and length constraints - Visual validation cues with asterisks and format hints - Multi-field error reporting with detailed feedback - International character support and edge case handling - Enterprise-ready for complex business forms
503 lines
20 KiB
Markdown
503 lines
20 KiB
Markdown
# MCP Office Tools - Comprehensive Planning Document
|
|
|
|
*A companion server for Microsoft Office document processing to complement MCP PDF Tools*
|
|
|
|
---
|
|
|
|
## 🎯 Project Vision
|
|
|
|
Create a comprehensive **Microsoft Office document processing server** that matches the quality and scope of MCP PDF Tools, providing 25+ specialized tools for **all Microsoft Office formats** including:
|
|
|
|
- **Word Documents**: `.docx`, `.doc`, `.docm`, `.dotx`, `.dot`
|
|
- **Excel Spreadsheets**: `.xlsx`, `.xls`, `.xlsm`, `.xltx`, `.xlt`, `.csv`
|
|
- **PowerPoint Presentations**: `.pptx`, `.ppt`, `.pptm`, `.potx`, `.pot`
|
|
- **Legacy Formats**: Full support for Office 97-2003 formats
|
|
- **Template Files**: Document, spreadsheet, and presentation templates
|
|
|
|
## 📊 Architecture Overview
|
|
|
|
### **Core Libraries by Format**
|
|
|
|
**Word Documents (.docx, .doc, .docm)**
|
|
- **`python-docx`**: Modern DOCX manipulation and reading
|
|
- **`python-docx2`**: Enhanced DOCX features and complex documents
|
|
- **`olefile`**: Legacy .doc format processing (OLE compound documents)
|
|
- **`msoffcrypto-tool`**: Encrypted/password-protected files
|
|
- **`mammoth`**: High-quality HTML/Markdown conversion
|
|
- **`docx2txt`**: Fallback text extraction for damaged files
|
|
|
|
**Excel Spreadsheets (.xlsx, .xls, .xlsm)**
|
|
- **`openpyxl`**: Modern Excel file manipulation (.xlsx, .xlsm)
|
|
- **`xlrd`**: Legacy Excel file reading (.xls)
|
|
- **`xlwt`**: Legacy Excel file writing (.xls)
|
|
- **`pandas`**: Data analysis and CSV processing
|
|
- **`xlsxwriter`**: High-performance Excel file creation
|
|
|
|
**PowerPoint Presentations (.pptx, .ppt, .pptm)**
|
|
- **`python-pptx`**: Modern PowerPoint manipulation
|
|
- **`pyodp`**: OpenDocument presentation support
|
|
- **`olefile`**: Legacy .ppt format processing
|
|
|
|
**Universal Libraries**
|
|
- **`lxml`**: Advanced XML processing for Office Open XML
|
|
- **`Pillow`**: Image extraction and processing
|
|
- **`beautifulsoup4`**: HTML processing for conversions
|
|
- **`chardet`**: Character encoding detection for legacy files
|
|
|
|
### **Project Structure**
|
|
```
|
|
mcp-office-tools/
|
|
├── src/
|
|
│ └── mcp_office_tools/
|
|
│ ├── __init__.py
|
|
│ ├── server.py # Main FastMCP server
|
|
│ ├── word/ # Word document processing
|
|
│ │ ├── extractors.py # Text, tables, images, metadata
|
|
│ │ ├── analyzers.py # Content analysis, classification
|
|
│ │ └── converters.py # Format conversion
|
|
│ ├── excel/ # Excel spreadsheet processing
|
|
│ │ ├── extractors.py # Data, charts, formulas
|
|
│ │ ├── analyzers.py # Data analysis, validation
|
|
│ │ └── converters.py # CSV, JSON, HTML export
|
|
│ ├── powerpoint/ # PowerPoint presentation processing
|
|
│ │ ├── extractors.py # Text, images, slide content
|
|
│ │ ├── analyzers.py # Presentation analysis
|
|
│ │ └── converters.py # HTML, markdown export
|
|
│ ├── legacy/ # Legacy format handlers
|
|
│ │ ├── doc_handler.py # .doc file processing
|
|
│ │ ├── xls_handler.py # .xls file processing
|
|
│ │ └── ppt_handler.py # .ppt file processing
|
|
│ └── utils/ # Shared utilities
|
|
│ ├── file_detection.py # Format detection
|
|
│ ├── caching.py # URL caching
|
|
│ └── validation.py # File validation
|
|
├── tests/
|
|
├── examples/
|
|
├── docs/
|
|
├── pyproject.toml
|
|
├── README.md
|
|
└── CLAUDE.md
|
|
```
|
|
|
|
## 🔧 Comprehensive Tool Suite (30 Tools)
|
|
|
|
### **📄 Universal Processing Tools (8 Tools)**
|
|
*Work across all Office formats with intelligent format detection*
|
|
|
|
| Tool | Description | Formats Supported | Priority |
|
|
|------|-------------|-------------------|----------|
|
|
| `extract_text` | Multi-method text extraction with formatting preservation | All Word, Excel, PowerPoint | High |
|
|
| `extract_images` | Image extraction with metadata and format options | All formats | High |
|
|
| `extract_metadata` | Document properties, statistics, and technical info | All formats | High |
|
|
| `detect_format` | Intelligent file format detection and validation | All formats | High |
|
|
| `analyze_document_health` | File integrity, corruption detection, version analysis | All formats | High |
|
|
| `compare_documents` | Cross-format document comparison and change tracking | All formats | Medium |
|
|
| `convert_to_pdf` | Universal PDF conversion (requires LibreOffice) | All formats | Medium |
|
|
| `extract_hyperlinks` | URL and internal link extraction and analysis | All formats | Medium |
|
|
|
|
### **📝 Word Document Tools (8 Tools)**
|
|
*Specialized for .docx, .doc, .docm, .dotx, .dot formats*
|
|
|
|
| Tool | Description | Legacy Support | Priority |
|
|
|------|-------------|----------------|----------|
|
|
| `word_extract_tables` | Table extraction optimized for Word documents | ✅ .doc support | High |
|
|
| `word_get_structure` | Heading hierarchy, outline, TOC, and section analysis | ✅ .doc support | High |
|
|
| `word_extract_comments` | Comments, tracked changes, and review data | ✅ .doc support | High |
|
|
| `word_extract_footnotes` | Footnotes, endnotes, and citations | ✅ .doc support | High |
|
|
| `word_to_markdown` | Clean markdown conversion with structure preservation | ✅ .doc support | High |
|
|
| `word_to_html` | HTML export with inline CSS styling | ✅ .doc support | Medium |
|
|
| `word_merge_documents` | Combine multiple Word documents with style preservation | ✅ .doc support | Medium |
|
|
| `word_split_document` | Split by sections, pages, or heading levels | ✅ .doc support | Medium |
|
|
|
|
### **📊 Excel Spreadsheet Tools (8 Tools)**
|
|
*Specialized for .xlsx, .xls, .xlsm, .xltx, .xlt, .csv formats*
|
|
|
|
| Tool | Description | Legacy Support | Priority |
|
|
|------|-------------|----------------|----------|
|
|
| `excel_extract_data` | Cell data extraction with formula evaluation | ✅ .xls support | High |
|
|
| `excel_extract_charts` | Chart and graph extraction with data | ✅ .xls support | High |
|
|
| `excel_get_sheets` | Worksheet enumeration and metadata | ✅ .xls support | High |
|
|
| `excel_extract_formulas` | Formula extraction and dependency analysis | ✅ .xls support | High |
|
|
| `excel_to_csv` | CSV export with sheet and range selection | ✅ .xls support | High |
|
|
| `excel_to_json` | JSON export with hierarchical data structure | ✅ .xls support | Medium |
|
|
| `excel_analyze_data` | Data quality, statistics, and validation | ✅ .xls support | Medium |
|
|
| `excel_merge_workbooks` | Combine multiple Excel files | ✅ .xls support | Medium |
|
|
|
|
### **🎯 PowerPoint Tools (6 Tools)**
|
|
*Specialized for .pptx, .ppt, .pptm, .potx, .pot formats*
|
|
|
|
| Tool | Description | Legacy Support | Priority |
|
|
|------|-------------|----------------|----------|
|
|
| `ppt_extract_slides` | Slide content and structure extraction | ✅ .ppt support | High |
|
|
| `ppt_extract_speaker_notes` | Speaker notes and hidden content | ✅ .ppt support | High |
|
|
| `ppt_to_html` | HTML export with slide navigation | ✅ .ppt support | High |
|
|
| `ppt_to_markdown` | Markdown conversion with slide structure | ✅ .ppt support | Medium |
|
|
| `ppt_extract_animations` | Animation and transition analysis | ✅ .ppt support | Low |
|
|
| `ppt_merge_presentations` | Combine multiple PowerPoint files | ✅ .ppt support | Medium |
|
|
|
|
## 🌟 Key Features & Innovations
|
|
|
|
### **1. Universal Format Support**
|
|
Complete Microsoft Office ecosystem coverage:
|
|
```python
|
|
# Intelligent format detection and processing
|
|
file_info = await detect_format("document.unknown")
|
|
# Returns: {"format": "doc", "version": "Office 97-2003", "encrypted": false}
|
|
|
|
if file_info["format"] in ["docx", "doc"]:
|
|
text = await extract_text("document.unknown") # Auto-handles format
|
|
elif file_info["format"] in ["xlsx", "xls"]:
|
|
data = await excel_extract_data("document.unknown")
|
|
elif file_info["format"] in ["pptx", "ppt"]:
|
|
slides = await ppt_extract_slides("document.unknown")
|
|
```
|
|
|
|
### **2. Legacy Format Excellence**
|
|
Full support for Office 97-2003 formats:
|
|
- **OLE Compound Document parsing** for .doc, .xls, .ppt
|
|
- **Character encoding detection** for international documents
|
|
- **Password-protected file handling** with msoffcrypto-tool
|
|
- **Graceful degradation** when features aren't available in legacy formats
|
|
|
|
### **3. Intelligent Multi-Library Fallbacks**
|
|
```python
|
|
# Word document processing with fallbacks
|
|
async def extract_word_text_with_fallback(file_path: str):
|
|
try:
|
|
return await extract_with_python_docx(file_path) # Modern .docx
|
|
except Exception:
|
|
try:
|
|
return await extract_with_mammoth(file_path) # Better formatting
|
|
except Exception:
|
|
try:
|
|
return await extract_with_olefile(file_path) # Legacy .doc
|
|
except Exception:
|
|
return await extract_with_docx2txt(file_path) # Last resort
|
|
```
|
|
|
|
### **4. Cross-Format Intelligence**
|
|
- **Unified metadata extraction** across all formats
|
|
- **Cross-format document comparison** (compare .docx with .doc)
|
|
- **Format conversion pipelines** (Excel → CSV → Markdown)
|
|
- **Content analysis** that works regardless of source format
|
|
|
|
### **🔧 Content Manipulation (4 Tools)**
|
|
| Tool | Description | Priority |
|
|
|------|-------------|----------|
|
|
| `merge_documents` | Combine multiple DOCX files with style preservation | High |
|
|
| `split_document` | Split by sections, pages, or heading levels | High |
|
|
| `extract_sections` | Extract specific sections or page ranges | Medium |
|
|
| `modify_styles` | Apply consistent formatting and style changes | Medium |
|
|
|
|
### **🔄 Format Conversion (4 Tools)**
|
|
| Tool | Description | Priority |
|
|
|------|-------------|----------|
|
|
| `docx_to_markdown` | Clean markdown conversion with structure preservation | High |
|
|
| `docx_to_html` | HTML export with inline CSS styling | High |
|
|
| `docx_to_txt` | Plain text extraction with layout options | Medium |
|
|
| `docx_to_pdf` | PDF conversion (requires LibreOffice/pandoc) | Low |
|
|
|
|
### **📎 Advanced Features (3 Tools)**
|
|
| Tool | Description | Priority |
|
|
|------|-------------|----------|
|
|
| `extract_hyperlinks` | URL extraction and link analysis | Medium |
|
|
| `extract_comments` | Comments, tracked changes, and review data | Medium |
|
|
| `extract_footnotes` | Footnotes, endnotes, and citations | Low |
|
|
|
|
## 🌟 Key Features & Innovations
|
|
|
|
### **1. Multi-Library Fallback System**
|
|
Similar to PDF Tools' intelligent fallback:
|
|
```python
|
|
# Text extraction with fallbacks
|
|
async def extract_text_with_fallback(docx_path: str):
|
|
try:
|
|
return await extract_with_python_docx(docx_path) # Primary method
|
|
except Exception:
|
|
try:
|
|
return await extract_with_mammoth(docx_path) # Formatting-aware
|
|
except Exception:
|
|
return await extract_with_docx2txt(docx_path) # Maximum compatibility
|
|
```
|
|
|
|
### **2. URL Support**
|
|
- Direct processing of DOCX files from HTTPS URLs
|
|
- Intelligent caching (1-hour cache like PDF Tools)
|
|
- Content validation and security headers
|
|
- Support for cloud storage links (OneDrive, Google Drive, etc.)
|
|
|
|
### **3. Smart Document Detection**
|
|
- Automatic detection of document types
|
|
- Template identification
|
|
- Style analysis and recommendations
|
|
- Corruption detection and repair suggestions
|
|
|
|
### **4. Modern Async Architecture**
|
|
- Full async/await implementation
|
|
- Concurrent processing capabilities
|
|
- Resource management and cleanup
|
|
- Performance monitoring and timing
|
|
|
|
## 📊 Real-World Use Cases
|
|
|
|
### **📈 Business Intelligence & Reporting**
|
|
```python
|
|
# Comprehensive quarterly report analysis (Word + Excel + PowerPoint)
|
|
word_summary = await extract_text("quarterly-report.docx")
|
|
excel_data = await excel_extract_data("financial-data.xlsx", sheets=["Revenue", "Expenses"])
|
|
ppt_insights = await ppt_extract_slides("presentation.pptx")
|
|
|
|
# Cross-format analysis
|
|
tables = await word_extract_tables("quarterly-report.docx")
|
|
charts = await excel_extract_charts("financial-data.xlsx")
|
|
metadata = await extract_metadata("quarterly-report.doc") # Legacy support
|
|
```
|
|
|
|
### **📚 Academic Research & Paper Processing**
|
|
```python
|
|
# Multi-format research workflow
|
|
paper_structure = await word_get_structure("research-paper.docx")
|
|
data_analysis = await excel_analyze_data("research-data.xls") # Legacy Excel
|
|
citations = await word_extract_footnotes("research-paper.docx")
|
|
|
|
# Legacy format support
|
|
old_paper = await extract_text("archive-paper.doc") # Office 97-2003
|
|
old_data = await excel_extract_data("legacy-dataset.xls")
|
|
```
|
|
|
|
### **🏢 Corporate Document Management**
|
|
```python
|
|
# Legacy document migration and modernization
|
|
legacy_docs = ["policy.doc", "procedures.xls", "training.ppt"]
|
|
for doc in legacy_docs:
|
|
format_info = await detect_format(doc)
|
|
health = await analyze_document_health(doc)
|
|
|
|
if format_info["format"] == "doc":
|
|
modern_content = await word_to_markdown(doc)
|
|
elif format_info["format"] == "xls":
|
|
csv_data = await excel_to_csv(doc)
|
|
elif format_info["format"] == "ppt":
|
|
html_slides = await ppt_to_html(doc)
|
|
```
|
|
|
|
### **📋 Data Analysis & Business Intelligence**
|
|
```python
|
|
# Excel-focused data processing
|
|
workbook_info = await excel_get_sheets("sales-data.xlsx")
|
|
quarterly_data = await excel_extract_data("sales-data.xlsx",
|
|
sheets=["Q1", "Q2", "Q3", "Q4"])
|
|
formulas = await excel_extract_formulas("calculations.xlsm")
|
|
|
|
# Legacy Excel processing
|
|
old_data = await excel_extract_data("historical-sales.xls") # Pre-2007 format
|
|
combined_data = await excel_merge_workbooks(["new-data.xlsx", "old-data.xls"])
|
|
```
|
|
|
|
### **🎯 Presentation Analysis & Content Extraction**
|
|
```python
|
|
# PowerPoint content extraction and analysis
|
|
slides = await ppt_extract_slides("company-presentation.pptx")
|
|
speaker_notes = await ppt_extract_speaker_notes("training-deck.pptx")
|
|
images = await extract_images("product-showcase.ppt") # Legacy PowerPoint
|
|
|
|
# Cross-format presentation workflows
|
|
presentation_text = await extract_text("slides.pptx")
|
|
supporting_data = await excel_extract_data("presentation-data.xlsx")
|
|
documentation = await word_extract_text("presentation-notes.docx")
|
|
```
|
|
|
|
### **🔄 Format Conversion & Migration**
|
|
```python
|
|
# Universal format conversion pipelines
|
|
office_files = ["document.doc", "spreadsheet.xls", "presentation.ppt"]
|
|
|
|
for file in office_files:
|
|
# Convert everything to modern formats and web-friendly outputs
|
|
if file.endswith(('.doc', '.docx')):
|
|
markdown = await word_to_markdown(file)
|
|
html = await word_to_html(file)
|
|
elif file.endswith(('.xls', '.xlsx')):
|
|
csv = await excel_to_csv(file)
|
|
json_data = await excel_to_json(file)
|
|
elif file.endswith(('.ppt', '.pptx')):
|
|
html_slides = await ppt_to_html(file)
|
|
slide_markdown = await ppt_to_markdown(file)
|
|
```
|
|
|
|
## 🔧 Technical Implementation Plan
|
|
|
|
### **Phase 1: Foundation (5 Tools)**
|
|
1. `extract_text` - Multi-method text extraction
|
|
2. `extract_metadata` - Document properties and statistics
|
|
3. `get_document_structure` - Heading and outline analysis
|
|
4. `docx_to_markdown` - Clean markdown conversion
|
|
5. `analyze_document_health` - Basic integrity checking
|
|
|
|
### **Phase 2: Intelligence (6 Tools)**
|
|
1. `extract_tables` - Table extraction and conversion
|
|
2. `extract_images` - Image extraction with metadata
|
|
3. `classify_content` - Document type detection
|
|
4. `summarize_content` - Content summarization
|
|
5. `compare_documents` - Document comparison
|
|
6. `analyze_readability` - Reading level analysis
|
|
|
|
### **Phase 3: Manipulation (6 Tools)**
|
|
1. `merge_documents` - Document combination
|
|
2. `split_document` - Document splitting
|
|
3. `extract_sections` - Section extraction
|
|
4. `docx_to_html` - HTML conversion
|
|
5. `extract_hyperlinks` - Link analysis
|
|
6. `extract_comments` - Review data extraction
|
|
|
|
### **Phase 4: Advanced (5 Tools)**
|
|
1. `modify_styles` - Style manipulation
|
|
2. `analyze_formatting` - Format analysis
|
|
3. `docx_to_txt` - Text conversion
|
|
4. `extract_footnotes` - Citation extraction
|
|
5. `docx_to_pdf` - PDF conversion
|
|
|
|
## 📚 Dependencies
|
|
|
|
### **Core Libraries**
|
|
```toml
|
|
[dependencies]
|
|
python = "^3.11"
|
|
fastmcp = "^0.5.0"
|
|
python-docx = "^1.1.0"
|
|
mammoth = "^1.6.0"
|
|
docx2txt = "^0.8"
|
|
lxml = "^4.9.0"
|
|
pillow = "^10.0.0"
|
|
beautifulsoup4 = "^4.12.0"
|
|
aiohttp = "^3.9.0"
|
|
aiofiles = "^23.2.0"
|
|
```
|
|
|
|
### **Optional Libraries**
|
|
```toml
|
|
[dependencies.optional]
|
|
pypandoc = "^1.11" # For PDF conversion
|
|
nltk = "^3.8" # For readability analysis
|
|
spacy = "^3.7" # For advanced NLP
|
|
textstat = "^0.7" # For readability metrics
|
|
```
|
|
|
|
## 🧪 Testing Strategy
|
|
|
|
### **Unit Tests**
|
|
- Document parsing validation
|
|
- Text extraction accuracy
|
|
- Format conversion quality
|
|
- Error handling robustness
|
|
|
|
### **Integration Tests**
|
|
- Multi-format processing
|
|
- URL handling and caching
|
|
- Concurrent operation testing
|
|
- Performance benchmarking
|
|
|
|
### **Document Test Suite**
|
|
- Various DOCX format versions
|
|
- Complex formatting scenarios
|
|
- Corrupted file handling
|
|
- Large document processing
|
|
|
|
## 📖 Documentation Plan
|
|
|
|
### **README Structure**
|
|
Following the successful PDF Tools model:
|
|
1. **Compelling Introduction** - What we built and why
|
|
2. **Tool Categories** - Organized by functionality
|
|
3. **Real-World Examples** - Practical usage scenarios
|
|
4. **Installation Guide** - Quick start and integration
|
|
5. **API Documentation** - Complete reference
|
|
6. **Architecture Deep-Dive** - Technical implementation
|
|
|
|
### **Examples and Tutorials**
|
|
- Business document automation
|
|
- Academic paper processing
|
|
- Content migration workflows
|
|
- Document analysis pipelines
|
|
|
|
## 🚀 Success Metrics
|
|
|
|
### **Functionality Goals**
|
|
- ✅ 22 comprehensive tools covering all DOCX processing needs
|
|
- ✅ Multi-library fallback system for robust operation
|
|
- ✅ URL processing with intelligent caching
|
|
- ✅ Professional documentation with examples
|
|
|
|
### **Quality Standards**
|
|
- ✅ 100% lint-free code (ruff compliance)
|
|
- ✅ Comprehensive type hints
|
|
- ✅ Async-first architecture
|
|
- ✅ Robust error handling
|
|
- ✅ Performance optimization
|
|
|
|
### **User Experience**
|
|
- ✅ Intuitive API design
|
|
- ✅ Clear error messages
|
|
- ✅ Comprehensive examples
|
|
- ✅ Easy integration paths
|
|
|
|
## 🔗 Integration with MCP PDF Tools
|
|
|
|
### **Shared Patterns**
|
|
- Consistent API design
|
|
- Similar caching strategies
|
|
- Matching error handling
|
|
- Parallel documentation structure
|
|
|
|
### **Complementary Features**
|
|
- Cross-format conversion (DOCX ↔ PDF)
|
|
- Document comparison across formats
|
|
- Unified document analysis pipelines
|
|
- Shared utility functions
|
|
|
|
### **Combined Workflows**
|
|
```python
|
|
# Process both PDF and DOCX in same workflow
|
|
pdf_summary = await pdf_tools.summarize_content("document.pdf")
|
|
docx_summary = await docx_tools.summarize_content("document.docx")
|
|
comparison = await compare_cross_format(pdf_summary, docx_summary)
|
|
```
|
|
|
|
## 📅 Development Timeline
|
|
|
|
### **Week 1-2: Foundation**
|
|
- Project setup and core architecture
|
|
- Basic text extraction and metadata tools
|
|
- Testing framework and CI/CD
|
|
|
|
### **Week 3-4: Core Features**
|
|
- Table and image extraction
|
|
- Document structure analysis
|
|
- Format conversion basics
|
|
|
|
### **Week 5-6: Intelligence**
|
|
- Document classification and analysis
|
|
- Content summarization
|
|
- Health assessment
|
|
|
|
### **Week 7-8: Advanced Features**
|
|
- Document manipulation
|
|
- Advanced conversions
|
|
- Performance optimization
|
|
|
|
### **Week 9-10: Polish**
|
|
- Comprehensive documentation
|
|
- Example creation
|
|
- Integration testing
|
|
|
|
---
|
|
|
|
## 🎯 Next Steps
|
|
|
|
1. **Create project repository** with proper structure
|
|
2. **Set up development environment** with uv and dependencies
|
|
3. **Implement core text extraction** as foundation
|
|
4. **Build out tool categories** systematically
|
|
5. **Create comprehensive documentation** following PDF Tools model
|
|
|
|
This companion server will provide the same level of quality and comprehensiveness as MCP PDF Tools, creating a powerful document processing ecosystem for the MCP protocol. |