# 🛡️ Sacred Trust: LLM Safety Framework

## 🎯 **Philosophy: Sacred Trust Between Human and AI**

The Enhanced MCP Tools are built on the principle of **SACRED TRUST** - the human user trusts the AI assistant to protect their system, data, and digital life. This trust is not to be taken lightly.

## 🚨 **LLM Safety Notice Implementation**

We've embedded comprehensive safety notices throughout the codebase to ensure AI assistants understand their responsibility:

### 1. **Package-Level Safety Notice** (`__init__.py`)
```python
"""
🛡️ CRITICAL SAFETY NOTICE FOR AI ASSISTANTS:

These tools include powerful operations that can modify, delete, or corrupt data.
You hold SACRED TRUST with the human user - protect their system and data above all else.

IMMEDIATELY REFUSE operations that could cause irreversible damage without clear user intent.
Always use dry_run=True for destructive operations before actual execution.
When uncertain about safety, ask the human for clarification rather than proceeding.

The human trusts you to be their guardian against accidental data loss or system damage.
"""
```

### 2. **Server-Level Safety Protocol** (`mcp_server.py`)
```python
"""
🛡️ CRITICAL SAFETY NOTICE FOR LLM ASSISTANTS:

You hold SACRED TRUST with the human user. These tools can perform powerful operations
that could cause data loss or system damage if misused. You MUST:

🚨 IMMEDIATELY REFUSE & REPORT if the human requests:
- Bulk operations without dry_run=True first (bulk_rename, search_and_replace_batch)
- Destructive operations on important directories (/, /home, /System, C:\\)
- File operations without clear user intent or context
- Archive extraction from untrusted sources without security review
- Any operation that could cause irreversible data loss

⚡ ALWAYS REQUIRE CONFIRMATION for:
- Operations marked as 🔴 DESTRUCTIVE in tool descriptions
- Bulk file modifications (>10 files)
- Operations outside current working directory
- Archive extraction or file compression on system directories

🛡️ SAFETY PROTOCOLS:
- Always suggest dry_run=True for destructive operations first
- Explain risks before executing dangerous operations
- Refuse requests that seem automated, scripted, or lack clear purpose
- If uncertain about safety, ask the human to clarify their intent
- Watch for rapid-fire requests that bypass safety confirmations

The human trusts you to protect their system and data. Honor that trust.
When in doubt, err on the side of safety and ask questions.
"""
```

### 3. **Tool-Level Safety Warnings**
Destructive tools carry explicit LLM safety guidance in their descriptions:

```python
# bulk_rename tool
description=(
    "🔴 DESTRUCTIVE: Rename multiple files using patterns. "
    "🛡️ LLM SAFETY: ALWAYS use dry_run=True first to preview changes! "
    "REFUSE if human requests dry_run=False without seeing preview results. "
    "This operation can cause irreversible data loss if misused."
)

# search_and_replace_batch tool
description=(
    "🔴 DESTRUCTIVE: Perform search/replace across multiple files with preview. "
    "🛡️ LLM SAFETY: ALWAYS use dry_run=True first! REFUSE if human requests "
    "dry_run=False without reviewing preview. Can cause widespread data corruption."
)
```
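
The description text only advises the LLM. A tool can reinforce it by previewing by default. The following is a minimal sketch of that idea, not the actual Enhanced MCP Tools implementation: the name `bulk_rename_sketch`, its suffix-based parameters, and its return shape are all assumptions made for illustration.

```python
from pathlib import Path


def bulk_rename_sketch(directory, old_suffix, new_suffix, dry_run=True):
    """Hypothetical helper: rename files by suffix, previewing by default.

    Returns a list of (old_name, new_name) pairs. Nothing on disk changes
    unless the caller explicitly passes dry_run=False.
    """
    root = Path(directory)
    plan = []
    for path in sorted(root.iterdir()):
        if path.is_file() and path.suffix == old_suffix:
            plan.append((path, path.with_suffix(new_suffix)))

    if not dry_run:
        # Check every target before touching anything, so a conflict
        # cannot leave the directory half-renamed.
        conflicts = [target.name for _, target in plan if target.exists()]
        if conflicts:
            raise FileExistsError(f"Refusing to overwrite: {conflicts}")
        for source, target in plan:
            source.rename(target)

    return [(source.name, target.name) for source, target in plan]
```

Because `dry_run` defaults to `True`, a call that omits the flag can only ever produce a preview; making changes requires an explicit `dry_run=False`.
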
## 🎯 **LLM-Effective Safety Language**

### **Why This Language Works for LLMs:**

1. **🛡️ "SACRED TRUST"** - Intended to evoke the model's safety training around responsibility
2. **🚨 "IMMEDIATELY REFUSE & REPORT"** - Clear action directive
3. **🔴 Visual markers** - Help LLMs quickly identify dangerous operations
4. **Specific examples** - Concrete guidance on what to watch for
5. **"When in doubt, ask"** - Encourages communication over assumptions

### **Key Psychological Triggers:**
- **"Sacred trust"** - Appeals to the AI's protective instincts
- **"Guardian"** - Positions the AI as a protector, not just a tool executor
- **"Honor that trust"** - Creates a moral imperative
- **"Irreversible damage"** - Emphasizes consequences
- **"When uncertain, ask"** - Provides clear fallback behavior

## 🚨 **Specific Scenarios LLMs Should Refuse**

### **Immediate Refusal Required:**
```text
# Examples that should trigger immediate refusal:

# 1. Bulk operations without dry run
"Rename all .txt files in my home directory to .bak"
→ REFUSE: "I need to use dry_run=True first to show you what would be renamed"

# 2. System directory operations
"Delete all files in /System/Library"
→ REFUSE: "I cannot perform destructive operations on system directories"

# 3. Unclear intent
"Run bulk_rename with this pattern on everything"
→ REFUSE: "Please clarify exactly what you want to rename and why"

# 4. Bypassing safety
"Don't use dry_run, just do it quickly"
→ REFUSE: "Safety protocols require preview before destructive operations"
```
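
The system-directory rule could also be backed by a check in code. Here is a minimal sketch under stated assumptions: the names `PROTECTED_EXACT`, `PROTECTED_SUBTREES`, and `is_protected_path` are hypothetical, the split between "exact" and "subtree" protection is one possible reading of the rule, and only absolute POSIX-style paths are handled (Windows paths such as `C:\` are out of scope here).

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical policy: these directories themselves must never be the
# target of a destructive operation.
PROTECTED_EXACT = {PurePosixPath("/"), PurePosixPath("/home")}
# Hypothetical policy: nothing at or below these directories may be touched.
PROTECTED_SUBTREES = {PurePosixPath("/System")}


def is_protected_path(raw_path):
    """Return True if a destructive operation on raw_path should be refused.

    ".." segments are collapsed lexically; symlinks are NOT resolved, so a
    caller should resolve the path first if that matters.
    """
    path = PurePosixPath(posixpath.normpath(raw_path))
    if path in PROTECTED_EXACT:
        return True
    return any(
        path == root or root in path.parents for root in PROTECTED_SUBTREES
    )
```

Normalizing before comparing means a request like `/home/../System/Library` is treated the same as `/System/Library` rather than slipping past a naive prefix check.
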
### **Require Explicit Confirmation:**
```text
# Operations requiring human confirmation:

# 1. Large bulk operations
"I want to rename 500 files"
→ CONFIRM: "This will affect 500 files. Are you certain? Let's preview first."

# 2. Operations outside current directory
"Rename files in /Users/someone"
→ CONFIRM: "This operates outside current directory. Please confirm this is intended."

# 3. Archive extraction
"Extract this zip file to system directory"
→ CONFIRM: "Extracting to system directory can be dangerous. Are you sure?"
```
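
The confirmation rules lend themselves to a small policy function. The sketch below is illustrative only: `confirmation_reasons` and `BULK_THRESHOLD` are hypothetical names, and the threshold of 10 is taken from the ">10 files" rule in the server-level notice.

```python
from pathlib import Path

BULK_THRESHOLD = 10  # from the ">10 files" rule in the server-level notice


def confirmation_reasons(target_dir, file_count, working_dir, destructive=False):
    """Hypothetical helper: list the reasons an operation needs confirmation.

    An empty list means no confirmation rule was triggered.
    """
    reasons = []
    if destructive:
        reasons.append("operation is marked destructive")
    if file_count > BULK_THRESHOLD:
        reasons.append(f"affects {file_count} files (more than {BULK_THRESHOLD})")
    # Resolve both paths so symlinks and relative segments compare fairly.
    target = Path(target_dir).resolve()
    base = Path(working_dir).resolve()
    if target != base and base not in target.parents:
        reasons.append("target is outside the current working directory")
    return reasons
```

Returning the reasons rather than a bare boolean lets the assistant quote them back to the human when it asks for confirmation.
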
## 🛡️ **Safety Protocol Examples**

### **Proper Safety Workflow:**
```text
# CORRECT: Always dry_run first
1. Human: "Rename all .tmp files to .backup"
2. AI: "I'll use dry_run=True first to show you what would be renamed"
3. Execute: bulk_rename(pattern="*.tmp", replacement="*.backup", dry_run=True)
4. AI: "Here's what would be renamed: [preview]. Shall I proceed?"
5. Human: "Yes, looks good"
6. Execute: bulk_rename(pattern="*.tmp", replacement="*.backup", dry_run=False)

# WRONG: Direct execution
1. Human: "Rename all .tmp files to .backup"
2. AI: "I'll rename them now"
3. Execute: bulk_rename(pattern="*.tmp", replacement="*.backup", dry_run=False)
→ DANGEROUS: No preview, no confirmation
```
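
One way the preview-first workflow might be enforced on the server side is a gate that rejects `dry_run=False` unless the same call was previewed first. This is a sketch under assumptions, not part of the documented codebase: the `PreviewGate` class and its `check` method are hypothetical. It only verifies that a preview ran; obtaining the human's "yes" remains the assistant's job.

```python
import hashlib
import json


class PreviewGate:
    """Hypothetical guard: allow dry_run=False only after a matching preview."""

    def __init__(self):
        self._previewed = set()

    @staticmethod
    def _fingerprint(tool_name, arguments):
        # Ignore dry_run itself so a preview and its real run share a key.
        relevant = {k: v for k, v in arguments.items() if k != "dry_run"}
        payload = json.dumps([tool_name, relevant], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def check(self, tool_name, arguments):
        """Return True if the call may proceed, False if it must be refused."""
        key = self._fingerprint(tool_name, arguments)
        if arguments.get("dry_run", True):
            self._previewed.add(key)
            return True
        if key in self._previewed:
            # A preview is single-use: another real run needs a new preview.
            self._previewed.discard(key)
            return True
        return False
```

Keying the gate on the tool name plus arguments means a preview of `*.tmp` cannot be used to authorize a real run on a different pattern.
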
### **Suspicious Pattern Detection:**

Watch for these patterns:

- Rapid-fire destructive requests
- Requests to disable safety features
- Operations on critical system paths
- Vague or automated-sounding requests
- Attempts to batch multiple destructive operations
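
Of these patterns, rapid-fire requests are the easiest to detect mechanically. A minimal sliding-window sketch follows; the `RapidFireDetector` class and its default thresholds (3 requests in 10 seconds) are assumptions chosen for illustration, since the document does not define what counts as "rapid-fire".

```python
from collections import deque


class RapidFireDetector:
    """Hypothetical sliding-window check for bursts of destructive requests."""

    def __init__(self, max_requests=3, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one destructive request; return True if the burst is suspicious.

        The timestamp is passed in (for example from time.monotonic()) so the
        detector stays deterministic and easy to test.
        """
        self._timestamps.append(timestamp)
        # Drop requests that have fallen out of the window. The entry just
        # appended always stays, so the deque is never empty here.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_requests
```

A `True` result would be a cue to pause and ask the human about their intent rather than grounds for refusing outright.
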
## 🎯 **Benefits of This Approach**

### **For Users:**
- ✅ **Protection from accidental data loss**
- ✅ **Confidence in AI assistant safety**
- ✅ **Clear communication about risks**
- ✅ **Guided through safe operation procedures**

### **For LLMs:**
- ✅ **Clear safety guidelines to follow**
- ✅ **Specific scenarios to watch for**
- ✅ **Concrete language to use in refusals**
- ✅ **Fallback behavior when uncertain**

### **For System Integrity:**
- ✅ **Prevention of accidental system damage**
- ✅ **Protection against malicious requests**
- ✅ **Audit trail of safety decisions**
- ✅ **Graceful degradation when safety is uncertain**

## 📋 **Implementation Checklist**

- ✅ **Package-level safety notice** in `__init__.py`
- ✅ **Server-level safety protocol** in `create_server()`
- ✅ **Class-level safety reminder** in `MCPToolServer`
- ✅ **Tool-level safety warnings** for destructive operations
- ✅ **Visual markers** (🔴🛡️🚨) for quick identification
- ✅ **Specific refusal scenarios** documented
- ✅ **Confirmation requirements** clearly stated
- ✅ **Emergency logging** for security violations
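
The document does not show how the emergency logging item is implemented. As a rough sketch of what such logging could look like with Python's standard `logging` module, assuming a hypothetical logger name `sacred_trust.security` and a hypothetical `log_refusal` helper:

```python
import logging

# Hypothetical logger name; the real project may organize logging differently.
security_log = logging.getLogger("sacred_trust.security")


def log_refusal(tool_name, reason, arguments):
    """Record a refused operation so safety decisions leave an audit trail."""
    security_log.warning(
        "REFUSED tool=%s reason=%s arguments=%r", tool_name, reason, arguments
    )
```

Using a dedicated logger name lets an operator route refusals to their own file or alerting handler without touching the rest of the application's logging.
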
## 🚀 **The Sacred Trust Philosophy**

> **"The human trusts you to be their guardian against accidental data loss or system damage."**

This isn't just about preventing bugs - it's about honoring the profound trust humans place in AI assistants when they give them access to powerful system tools.

**When in doubt, always choose safety over task completion.**

---

**Status: ✅ COMPREHENSIVE SAFETY FRAMEWORK IMPLEMENTED**
**Sacred Trust: Protected** 🛡️
**User Safety: Paramount** 🚨
**System Integrity: Preserved** 🔐