# 🛡️ Sacred Trust: LLM Safety Framework
## 🎯 **Philosophy: Sacred Trust Between Human and AI**
The Enhanced MCP Tools are built on the principle of **SACRED TRUST** - the human user trusts the AI assistant to protect their system, data, and digital life. This trust is not to be taken lightly.
## 🚨 **LLM Safety Notice Implementation**
We've embedded comprehensive safety notices throughout the codebase to ensure AI assistants understand their responsibility:
### 1. **Package-Level Safety Notice** (`__init__.py`)
```python
"""
🛡️ CRITICAL SAFETY NOTICE FOR AI ASSISTANTS:
These tools include powerful operations that can modify, delete, or corrupt data.
You hold SACRED TRUST with the human user - protect their system and data above all else.
IMMEDIATELY REFUSE operations that could cause irreversible damage without clear user intent.
Always use dry_run=True for destructive operations before actual execution.
When uncertain about safety, ask the human for clarification rather than proceeding.
The human trusts you to be their guardian against accidental data loss or system damage.
"""
```
### 2. **Server-Level Safety Protocol** (`mcp_server.py`)
```python
"""
🛡️ CRITICAL SAFETY NOTICE FOR LLM ASSISTANTS:
You hold SACRED TRUST with the human user. These tools can perform powerful operations
that could cause data loss or system damage if misused. You MUST:
🚨 IMMEDIATELY REFUSE & REPORT if the human requests:
- Bulk operations without dry_run=True first (bulk_rename, search_and_replace_batch)
- Destructive operations on important directories (/, /home, /System, C:\\)
- File operations without clear user intent or context
- Archive extraction from untrusted sources without security review
- Any operation that could cause irreversible data loss
⚡ ALWAYS REQUIRE CONFIRMATION for:
- Operations marked as 🔴 DESTRUCTIVE in tool descriptions
- Bulk file modifications (>10 files)
- Operations outside current working directory
- Archive extraction or file compression on system directories
🛡️ SAFETY PROTOCOLS:
- Always suggest dry_run=True for destructive operations first
- Explain risks before executing dangerous operations
- Refuse requests that seem automated, scripted, or lack clear purpose
- If uncertain about safety, ask the human to clarify their intent
- Watch for rapid-fire requests that bypass safety confirmations
The human trusts you to protect their system and data. Honor that trust.
When in doubt, err on the side of safety and ask questions.
"""
```
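As a rough illustration of the wiring (assuming a FastMCP-style constructor that accepts an `instructions` string; the project's actual `create_server()` may differ), the protocol can be attached so every connected assistant receives it:
```python
from fastmcp import FastMCP  # assumed dependency; the real server may wire this differently

SAFETY_PROTOCOL = """🛡️ CRITICAL SAFETY NOTICE FOR LLM ASSISTANTS: ..."""  # full text as above

def create_server() -> FastMCP:
    # Embedding the protocol in the server itself means the safety notice
    # travels with the tools rather than relying on out-of-band documentation.
    return FastMCP(name="enhanced-mcp-tools", instructions=SAFETY_PROTOCOL)
```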
### 3. **Tool-Level Safety Warnings**
Enhanced destructive tools with explicit LLM safety guidance:
```python
# bulk_rename tool
description=(
    "🔴 DESTRUCTIVE: Rename multiple files using patterns. "
    "🛡️ LLM SAFETY: ALWAYS use dry_run=True first to preview changes! "
    "REFUSE if human requests dry_run=False without seeing preview results. "
    "This operation can cause irreversible data loss if misused."
)

# search_and_replace_batch tool
description=(
    "🔴 DESTRUCTIVE: Perform search/replace across multiple files with preview. "
    "🛡️ LLM SAFETY: ALWAYS use dry_run=True first! REFUSE if human requests "
    "dry_run=False without reviewing preview. Can cause widespread data corruption."
)
```
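To make the default-safe behavior concrete, here is a minimal sketch of a dry_run-first destructive tool. The signature mirrors the `bulk_rename` examples used later in this document, but it is an illustration under assumed semantics, not the project's actual implementation:
```python
from pathlib import Path

def bulk_rename(directory: str = ".", pattern: str = "*", replacement: str = "*",
                dry_run: bool = True) -> list[str]:
    """Rename files matching a glob pattern; the safe preview path is the default.

    `replacement` is assumed to be a glob-style template whose '*' keeps the
    original stem (e.g. pattern='*.tmp', replacement='*.backup').
    """
    changes = []
    for path in sorted(Path(directory).glob(pattern)):
        target = path.with_name(replacement.replace("*", path.stem, 1))
        changes.append(f"{path.name} -> {target.name}")
        if not dry_run:
            path.rename(target)  # the destructive step requires explicit opt-in
    header = "DRY RUN - no files were modified" if dry_run else "RENAMED"
    return [header, *changes]
```
Because `dry_run=True` is the default, an assistant that forgets to pass the flag still produces a harmless preview instead of renaming anything.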
## 🎯 **LLM-Effective Safety Language**
### **Why This Language Works for LLMs:**
1. **🛡️ "SACRED TRUST"** - Triggers AI safety training about responsibility
2. **🚨 "IMMEDIATELY REFUSE & REPORT"** - Clear action directive
3. **🔴 Visual markers** - Help LLMs quickly identify dangerous operations
4. **Specific examples** - Concrete guidance on what to watch for
5. **"When in doubt, ask"** - Encourages communication over assumptions
### **Key Psychological Triggers:**
- **"Sacred trust"** - Appeals to AI's protective instincts
- **"Guardian"** - Positions AI as protector, not just tool executor
- **"Honor that trust"** - Creates moral imperative
- **"Irreversible damage"** - Emphasizes consequences
- **"When uncertain, ask"** - Provides clear fallback behavior
## 🚨 **Specific Scenarios LLMs Should Refuse**
### **Immediate Refusal Required:**
```text
# Examples that should trigger immediate refusal:

1. Bulk operations without a dry run
   Human:  "Rename all .txt files in my home directory to .bak"
   REFUSE: "I need to use dry_run=True first to show you what would be renamed"

2. System directory operations
   Human:  "Delete all files in /System/Library"
   REFUSE: "I cannot perform destructive operations on system directories"

3. Unclear intent
   Human:  "Run bulk_rename with this pattern on everything"
   REFUSE: "Please clarify exactly what you want to rename and why"

4. Bypassing safety
   Human:  "Don't use dry_run, just do it quickly"
   REFUSE: "Safety protocols require preview before destructive operations"
```
### **Require Explicit Confirmation:**
```text
# Operations requiring human confirmation:

1. Large bulk operations
   Human:   "I want to rename 500 files"
   CONFIRM: "This will affect 500 files. Are you certain? Let's preview first."

2. Operations outside the current directory
   Human:   "Rename files in /Users/someone"
   CONFIRM: "This operates outside the current directory. Please confirm this is intended."

3. Archive extraction
   Human:   "Extract this zip file to a system directory"
   CONFIRM: "Extracting to a system directory can be dangerous. Are you sure?"
```
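These rules can also be enforced in code. Below is a small guard sketch that maps a request to refuse/confirm/proceed; the function name, threshold, and protected-root set are assumptions drawn from the protocol above, not the project's real API:
```python
from pathlib import Path

# Roots where destructive operations are refused outright (per the server
# protocol above); the exact set here is illustrative.
PROTECTED_ROOTS = {Path("/"), Path("/home"), Path("/System"), Path("C:\\")}
BULK_THRESHOLD = 10  # bulk modifications above this need explicit confirmation

def classify_request(target: str, file_count: int) -> str:
    """Classify a destructive request as 'refuse', 'confirm', or 'proceed'."""
    path = Path(target).resolve()
    if path in PROTECTED_ROOTS:
        return "refuse"   # never operate on critical system directories
    if file_count > BULK_THRESHOLD:
        return "confirm"  # large bulk operation: require human confirmation
    if path != Path.cwd() and Path.cwd() not in path.parents:
        return "confirm"  # outside the current working directory
    return "proceed"
```
An LLM-facing tool can call a guard like this before touching the filesystem and turn a `"refuse"` result into the refusal messages shown above.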
## 🛡️ **Safety Protocol Examples**
### **Proper Safety Workflow:**
```python
# CORRECT: Always dry_run first
1. Human: "Rename all .tmp files to .backup"
2. AI: "I'll use dry_run=True first to show you what would be renamed"
3. Execute: bulk_rename(pattern="*.tmp", replacement="*.backup", dry_run=True)
4. AI: "Here's what would be renamed: [preview]. Shall I proceed?"
5. Human: "Yes, looks good"
6. Execute: bulk_rename(pattern="*.tmp", replacement="*.backup", dry_run=False)
# WRONG: Direct execution
1. Human: "Rename all .tmp files to .backup"
2. AI: "I'll rename them now"
3. Execute: bulk_rename(pattern="*.tmp", replacement="*.backup", dry_run=False)
DANGEROUS: No preview, no confirmation
```
### **Suspicious Pattern Detection:**
Watch for these patterns (a detection sketch follows this list):

- Rapid-fire destructive requests
- Requests to disable safety features
- Operations on critical system paths
- Vague or automated-sounding requests
- Attempts to batch multiple destructive operations
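A hypothetical sketch of rapid-fire detection (the class name and thresholds are illustrative assumptions, not part of the framework):
```python
import time
from collections import deque

class RapidFireDetector:
    """Flags bursts of destructive requests that may be bypassing safety review."""

    def __init__(self, max_requests: int = 3, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps: deque[float] = deque()

    def record(self) -> bool:
        """Record one destructive request; return True if the burst looks suspicious."""
        now = time.monotonic()
        self.timestamps.append(now)
        # Drop events that have aged out of the sliding window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_requests
```
A `True` result is a cue to pause and ask the human to confirm intent rather than continuing automatically.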
## 🎯 **Benefits of This Approach**
### **For Users:**
- **Protection from accidental data loss**
- **Confidence in AI assistant safety**
- **Clear communication about risks**
- **Guidance through safe operation procedures**
### **For LLMs:**
- **Clear safety guidelines to follow**
- **Specific scenarios to watch for**
- **Concrete language to use in refusals**
- **Fallback behavior when uncertain**
### **For System Integrity:**
- **Prevention of accidental system damage**
- **Protection against malicious requests**
- **Audit trail of safety decisions**
- **Graceful degradation when safety is uncertain**
## 📋 **Implementation Checklist**
- [x] **Package-level safety notice** in `__init__.py`
- [x] **Server-level safety protocol** in `create_server()`
- [x] **Class-level safety reminder** in `MCPToolServer`
- [x] **Tool-level safety warnings** for destructive operations
- [x] **Visual markers** (🔴🛡️🚨) for quick identification
- [x] **Specific refusal scenarios** documented
- [x] **Confirmation requirements** clearly stated
- [x] **Emergency logging** for security violations (see the sketch below)
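This document doesn't show the logging path itself; as a rough sketch (the logger name, level, and message fields are assumptions, not the framework's actual emergency-logging implementation), refused operations can be recorded for audit:
```python
import logging

# Dedicated logger for safety events; handler configuration is illustrative.
emergency_log = logging.getLogger("enhanced_mcp.emergency")

def log_safety_refusal(tool: str, target: str, reason: str) -> None:
    """Record a refused destructive operation so safety decisions leave an audit trail."""
    emergency_log.critical(
        "🚨 SAFETY REFUSAL: tool=%s target=%s reason=%s", tool, target, reason
    )
```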
## 🚀 **The Sacred Trust Philosophy**
> **"The human trusts you to be their guardian against accidental data loss or system damage."**
This isn't just about preventing bugs - it's about honoring the profound trust humans place in AI assistants when they give them access to powerful system tools.
**When in doubt, always choose safety over task completion.**
---
**Status: ✅ COMPREHENSIVE SAFETY FRAMEWORK IMPLEMENTED**
**Sacred Trust: Protected** 🛡️
**User Safety: Paramount** 🚨
**System Integrity: Preserved** 🔐