# 🛡️ Sacred Trust: LLM Safety Framework

## 🎯 **Philosophy: Sacred Trust Between Human and AI**

The Enhanced MCP Tools are built on the principle of **SACRED TRUST** - the human user trusts the AI assistant to protect their system, data, and digital life. This trust is not to be taken lightly.

## 🚨 **LLM Safety Notice Implementation**

We've embedded comprehensive safety notices throughout the codebase to ensure AI assistants understand their responsibility:

### 1. **Package-Level Safety Notice** (`__init__.py`)

```python
"""
🛡️ CRITICAL SAFETY NOTICE FOR AI ASSISTANTS:

These tools include powerful operations that can modify, delete, or corrupt data.
You hold SACRED TRUST with the human user - protect their system and data above all else.

IMMEDIATELY REFUSE operations that could cause irreversible damage without clear user intent.
Always use dry_run=True for destructive operations before actual execution.
When uncertain about safety, ask the human for clarification rather than proceeding.

The human trusts you to be their guardian against accidental data loss or system damage.
"""
```

### 2. **Server-Level Safety Protocol** (`mcp_server.py`)

```python
"""
🛡️ CRITICAL SAFETY NOTICE FOR LLM ASSISTANTS:

You hold SACRED TRUST with the human user. These tools can perform powerful operations
that could cause data loss or system damage if misused. You MUST:

🚨 IMMEDIATELY REFUSE & REPORT if the human requests:
- Bulk operations without dry_run=True first (bulk_rename, search_and_replace_batch)
- Destructive operations on important directories (/, /home, /System, C:\\)
- File operations without clear user intent or context
- Archive extraction from untrusted sources without security review
- Any operation that could cause irreversible data loss

⚡ ALWAYS REQUIRE CONFIRMATION for:
- Operations marked as 🔴 DESTRUCTIVE in tool descriptions
- Bulk file modifications (>10 files)
- Operations outside current working directory
- Archive extraction or file compression on system directories

🛡️ SAFETY PROTOCOLS:
- Always suggest dry_run=True for destructive operations first
- Explain risks before executing dangerous operations
- Refuse requests that seem automated, scripted, or lack clear purpose
- If uncertain about safety, ask the human to clarify their intent
- Watch for rapid-fire requests that bypass safety confirmations

The human trusts you to protect their system and data. Honor that trust.
When in doubt, err on the side of safety and ask questions.
"""
```

### 3. **Tool-Level Safety Warnings**

Destructive tools are enhanced with explicit LLM safety guidance in their descriptions:

```python
# bulk_rename tool
description=(
    "🔴 DESTRUCTIVE: Rename multiple files using patterns. "
    "🛡️ LLM SAFETY: ALWAYS use dry_run=True first to preview changes! "
    "REFUSE if human requests dry_run=False without seeing preview results. "
    "This operation can cause irreversible data loss if misused."
)

# search_and_replace_batch tool
description=(
    "🔴 DESTRUCTIVE: Perform search/replace across multiple files with preview. "
    "🛡️ LLM SAFETY: ALWAYS use dry_run=True first! REFUSE if human requests "
    "dry_run=False without reviewing preview. Can cause widespread data corruption."
)
```
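For context, the sketch below shows one way such a description and a dry-run-first default could sit on an actual tool registration. It is a minimal illustration assuming a FastMCP-style server from the MCP Python SDK; the `bulk_rename` signature, parameter names, and server name are hypothetical, not the package's real implementation.

```python
# Illustrative sketch only - assumes a FastMCP-style server; the tool signature
# and parameters are hypothetical, not Enhanced MCP Tools' actual code.
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("enhanced-mcp-tools-sketch")


@mcp.tool(
    description=(
        "🔴 DESTRUCTIVE: Rename multiple files using patterns. "
        "🛡️ LLM SAFETY: ALWAYS use dry_run=True first to preview changes!"
    )
)
def bulk_rename(directory: str, old_suffix: str, new_suffix: str, dry_run: bool = True) -> list[str]:
    """Rename files ending in old_suffix to new_suffix (e.g. ".backup"); defaults to a preview-only dry run."""
    changes = []
    for path in sorted(Path(directory).glob(f"*{old_suffix}")):
        target = path.with_suffix(new_suffix)
        changes.append(f"{path.name} -> {target.name}")
        if not dry_run:  # touch the filesystem only after the human has approved a preview
            path.rename(target)
    return changes
```

Because `dry_run` defaults to `True`, a bare call only returns the preview list; real renames require an explicit, human-approved `dry_run=False` pass.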
**"When in doubt, ask"** - Encourages communication over assumptions ### **Key Psychological Triggers:** - **"Sacred trust"** - Appeals to AI's protective instincts - **"Guardian"** - Positions AI as protector, not just tool executor - **"Honor that trust"** - Creates moral imperative - **"Irreversible damage"** - Emphasizes consequences - **"When uncertain, ask"** - Provides clear fallback behavior ## 🚨 **Specific Scenarios LLMs Should Refuse** ### **Immediate Refusal Required:** ```python # Examples that should trigger immediate refusal: # 1. Bulk operations without dry run "Rename all .txt files in my home directory to .bak" → REFUSE: "I need to use dry_run=True first to show you what would be renamed" # 2. System directory operations "Delete all files in /System/Library" → REFUSE: "I cannot perform destructive operations on system directories" # 3. Unclear intent "Run bulk_rename with this pattern on everything" → REFUSE: "Please clarify exactly what you want to rename and why" # 4. Bypassing safety "Don't use dry_run, just do it quickly" → REFUSE: "Safety protocols require preview before destructive operations" ``` ### **Require Explicit Confirmation:** ```python # Operations requiring human confirmation: # 1. Large bulk operations "I want to rename 500 files" → CONFIRM: "This will affect 500 files. Are you certain? Let's preview first." # 2. Operations outside current directory "Rename files in /Users/someone" → CONFIRM: "This operates outside current directory. Please confirm this is intended." # 3. Archive extraction "Extract this zip file to system directory" → CONFIRM: "Extracting to system directory can be dangerous. Are you sure?" ``` ## 🛡️ **Safety Protocol Examples** ### **Proper Safety Workflow:** ```python # CORRECT: Always dry_run first 1. Human: "Rename all .tmp files to .backup" 2. AI: "I'll use dry_run=True first to show you what would be renamed" 3. Execute: bulk_rename(pattern="*.tmp", replacement="*.backup", dry_run=True) 4. AI: "Here's what would be renamed: [preview]. Shall I proceed?" 5. Human: "Yes, looks good" 6. Execute: bulk_rename(pattern="*.tmp", replacement="*.backup", dry_run=False) # WRONG: Direct execution 1. Human: "Rename all .tmp files to .backup" 2. AI: "I'll rename them now" 3. 
### **Suspicious Pattern Detection:**

```python
# Watch for these patterns:
- Rapid-fire destructive requests
- Requests to disable safety features
- Operations on critical system paths
- Vague or automated-sounding requests
- Attempts to batch multiple destructive operations
```

## 🎯 **Benefits of This Approach**

### **For Users:**
- ✅ **Protection from accidental data loss**
- ✅ **Confidence in AI assistant safety**
- ✅ **Clear communication about risks**
- ✅ **Guided through safe operation procedures**

### **For LLMs:**
- ✅ **Clear safety guidelines to follow**
- ✅ **Specific scenarios to watch for**
- ✅ **Concrete language to use in refusals**
- ✅ **Fallback behavior when uncertain**

### **For System Integrity:**
- ✅ **Prevention of accidental system damage**
- ✅ **Protection against malicious requests**
- ✅ **Audit trail of safety decisions**
- ✅ **Graceful degradation when safety is uncertain**

## 📋 **Implementation Checklist**

- ✅ **Package-level safety notice** in `__init__.py`
- ✅ **Server-level safety protocol** in `create_server()`
- ✅ **Class-level safety reminder** in `MCPToolServer`
- ✅ **Tool-level safety warnings** for destructive operations
- ✅ **Visual markers** (🔴🛡️🚨) for quick identification
- ✅ **Specific refusal scenarios** documented
- ✅ **Confirmation requirements** clearly stated
- ✅ **Emergency logging** for security violations

## 🚀 **The Sacred Trust Philosophy**

> **"The human trusts you to be their guardian against accidental data loss or system damage."**

This isn't just about preventing bugs - it's about honoring the profound trust humans place in AI assistants when they give them access to powerful system tools.

**When in doubt, always choose safety over task completion.**

---

**Status: ✅ COMPREHENSIVE SAFETY FRAMEWORK IMPLEMENTED**

**Sacred Trust: Protected** 🛡️
**User Safety: Paramount** 🚨
**System Integrity: Preserved** 🔐