
Checkpointing and Recovery Guide

Ralph Orchestrator provides robust checkpointing mechanisms to ensure work is never lost and tasks can be resumed after interruptions.

Overview

Checkpointing saves the state of your orchestration at regular intervals, enabling:

- Recovery from crashes or interruptions
- Progress tracking across iterations
- State inspection for debugging
- Audit trails for compliance

Checkpoint Types

1. Git Checkpoints

Automatic git commits at specified intervals:

# Enable git checkpointing (default)
python ralph_orchestrator.py --checkpoint-interval 5

# Disable git checkpointing
python ralph_orchestrator.py --no-git

What's saved:

- Current prompt file state
- Any files created/modified by the agent
- Timestamp and iteration number

2. Prompt Archives

Historical versions of the prompt file:

# Enable prompt archiving (default)
python ralph_orchestrator.py

# Disable prompt archiving
python ralph_orchestrator.py --no-archive

Location: .agent/prompts/prompt_YYYYMMDD_HHMMSS.md

3. State Snapshots

JSON files containing orchestrator state:

{
  "iteration": 15,
  "agent": "claude",
  "start_time": "2024-01-10T10:00:00",
  "tokens_used": 50000,
  "cost_incurred": 2.50,
  "status": "running"
}

Location: .agent/metrics/state_*.json
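Snapshots of this shape are plain JSON, so they are easy to produce and inspect with standard tooling. The sketch below writes a snapshot using the fields and the state_*.json naming shown above; it is an illustration only, not the orchestrator's internal writer.

# Minimal sketch of writing a state snapshot in the format above.
# Field names mirror the example JSON; the orchestrator's own writer may differ.
import json
from datetime import datetime
from pathlib import Path

def write_state_snapshot(state, metrics_dir=".agent/metrics"):
    out_dir = Path(metrics_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = out_dir / f"state_{stamp}.json"
    path.write_text(json.dumps(state, indent=2))
    return path

write_state_snapshot({
    "iteration": 15,
    "agent": "claude",
    "start_time": "2024-01-10T10:00:00",
    "tokens_used": 50000,
    "cost_incurred": 2.50,
    "status": "running",
})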

Configuration

Checkpoint Interval

Control how often checkpoints occur:

# Checkpoint every iteration (maximum safety)
python ralph_orchestrator.py --checkpoint-interval 1

# Checkpoint every 10 iterations (balanced)
python ralph_orchestrator.py --checkpoint-interval 10

# Checkpoint every 50 iterations (minimal overhead)
python ralph_orchestrator.py --checkpoint-interval 50
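Interval-based checkpointing boils down to a modulo check on the iteration counter. The sketch below illustrates that logic with a placeholder checkpoint callback; the orchestrator's actual internals may differ.

# Illustrative sketch of the interval check; the checkpoint callback is a
# placeholder, not the orchestrator's real API.
def maybe_checkpoint(iteration, interval, checkpoint):
    # Checkpoint every Nth iteration; a non-positive interval disables it.
    if interval > 0 and iteration % interval == 0:
        checkpoint(iteration)
        return True
    return False

# With --checkpoint-interval 5, iterations 5, 10, 15 trigger a checkpoint.
for i in range(1, 16):
    maybe_checkpoint(i, 5, lambda n: print(f"checkpoint at iteration {n}"))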

Checkpoint Strategies

Aggressive Checkpointing

For critical or experimental tasks:

python ralph_orchestrator.py \
  --checkpoint-interval 1 \
  --metrics-interval 1 \
  --verbose

Balanced Checkpointing

For standard production tasks:

python ralph_orchestrator.py \
  --checkpoint-interval 5 \
  --metrics-interval 10

Minimal Checkpointing

For simple, fast tasks:

python ralph_orchestrator.py \
  --checkpoint-interval 20 \
  --no-archive

Recovery Procedures

Automatic Recovery

When restarted after an interruption, Ralph Orchestrator automatically recovers from the last checkpoint:

  1. Detect interruption
  2. Load last checkpoint
  3. Resume from last known state
  4. Continue iteration
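The sketch below approximates steps 2 and 3: find the newest state_*.json snapshot and resume from the recorded iteration. It is illustrative only; the actual recovery path is internal to the orchestrator.

# Rough sketch of steps 2-3: load the newest state snapshot and resume from
# its iteration count. The orchestrator's real recovery code may differ.
import json
from pathlib import Path

def load_last_checkpoint(metrics_dir=".agent/metrics"):
    snapshots = sorted(Path(metrics_dir).glob("state_*.json"))
    if not snapshots:
        return None  # nothing to recover; start fresh
    return json.loads(snapshots[-1].read_text())

state = load_last_checkpoint()
start_iteration = state["iteration"] + 1 if state else 0
print(f"Resuming at iteration {start_iteration}")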

Manual Recovery

From Git Checkpoint

# View checkpoint history
git log --oneline | grep "Ralph checkpoint"

# Restore specific checkpoint
git checkout <commit-hash>

# Resume orchestration
python ralph_orchestrator.py --prompt PROMPT.md

From Prompt Archive

# List archived prompts
ls -la .agent/prompts/

# Restore archived prompt
cp .agent/prompts/prompt_20240110_100000.md PROMPT.md

# Resume orchestration
python ralph_orchestrator.py

From State Snapshot

# Load state programmatically
import json

with open('.agent/metrics/state_20240110_100000.json') as f:
    state = json.load(f)

print(f"Last iteration: {state['iteration']}")
print(f"Tokens used: {state['tokens_used']}")
print(f"Cost incurred: ${state['cost_incurred']}")

Checkpoint Storage

Directory Structure

.agent/
├── checkpoints/       # Git checkpoint metadata
├── prompts/          # Archived prompt files
│   ├── prompt_20240110_100000.md
│   ├── prompt_20240110_101500.md
│   └── prompt_20240110_103000.md
├── metrics/          # State and metrics
│   ├── state_20240110_100000.json
│   ├── state_20240110_101500.json
│   └── metrics_20240110_103000.json
└── logs/            # Execution logs

Storage Management

Clean Old Checkpoints

# Remove checkpoints older than 7 days
find .agent/prompts -type f -mtime +7 -delete
find .agent/metrics -name "*.json" -mtime +7 -delete

# Keep only last 100 checkpoints
ls -t .agent/prompts/*.md | tail -n +101 | xargs rm -f

Backup Checkpoints

# Create backup archive
tar -czf ralph_checkpoints_$(date +%Y%m%d).tar.gz .agent/

# Backup to remote
rsync -av .agent/ user@backup-server:/backups/ralph/

Advanced Checkpointing

Custom Checkpoint Triggers

Beyond interval-based checkpointing, you can trigger checkpoints in your prompt:

## Progress
- Step 1 complete [CHECKPOINT]
- Step 2 complete [CHECKPOINT]
- Step 3 complete [CHECKPOINT]
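One way this could work is to rescan the prompt file and checkpoint whenever a new [CHECKPOINT] marker appears. The sketch below shows that detection; the marker handling is an assumption, not a documented API.

# Illustrative marker detection: count [CHECKPOINT] markers in the prompt and
# checkpoint when a new one appears. This is an assumption about how markers
# could be handled, not the orchestrator's documented behavior.
from pathlib import Path

def count_markers(prompt_path="PROMPT.md"):
    return Path(prompt_path).read_text(encoding="utf-8").count("[CHECKPOINT]")

markers_seen = 0
current = count_markers()
if current > markers_seen:
    print(f"{current - markers_seen} new marker(s) found; checkpoint would run here")
    markers_seen = current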

Checkpoint Hooks

Use git hooks for custom checkpoint processing:

#!/bin/bash
# .git/hooks/post-commit
# post-commit hooks receive no arguments, so inspect the last commit message
if git log -1 --pretty=%B | grep -q "Ralph checkpoint"; then
    # Custom backup or notification
    cp PROMPT.md /backup/location/
    echo "Checkpoint created" | mail -s "Ralph Progress" admin@example.com
fi

Distributed Checkpointing

For team environments:

# Push checkpoints to shared repository
python ralph_orchestrator.py --checkpoint-interval 5

# In another terminal/machine
git pull  # Get latest checkpoints

# Or use automated sync
watch -n 60 'git pull'

Best Practices

1. Choose Appropriate Intervals

Task Type      Recommended Interval   Rationale
Experimental   1-2                    Maximum recovery points
Development    5-10                   Balance safety/performance
Production     10-20                  Minimize overhead
Simple         20-50                  Low-risk tasks

2. Monitor Checkpoint Size

# Check checkpoint storage usage
du -sh .agent/

# Monitor growth
watch -n 60 'du -sh .agent/*'

3. Test Recovery

Regularly test recovery procedures:

# Simulate interruption
python ralph_orchestrator.py &
PID=$!
sleep 30
kill $PID

# Verify recovery
python ralph_orchestrator.py  # Should resume

4. Clean Up Regularly

Implement checkpoint rotation:

#!/bin/bash
# Keep only the last 50 checkpoints
MAX_CHECKPOINTS=50
COUNT=$(ls .agent/prompts/*.md 2>/dev/null | wc -l)
if [ "$COUNT" -gt "$MAX_CHECKPOINTS" ]; then
    ls -t .agent/prompts/*.md | tail -n +$((MAX_CHECKPOINTS + 1)) | xargs rm -f
fi

Troubleshooting

Common Issues

1. Git Checkpointing Fails

Error: "Not a git repository"

Solution:

# Initialize git repository
git init
git add .
git commit -m "Initial commit"

# Or disable git checkpointing
python ralph_orchestrator.py --no-git

2. Checkpoint Storage Full

Error: "No space left on device"

Solution:

# Clean old checkpoints
find .agent -type f -mtime +30 -delete

# Move to larger storage
mv .agent /larger/disk/
ln -s /larger/disk/.agent .agent

3. Corrupted Checkpoint

Error: "Invalid checkpoint data"

Solution:

# Use previous checkpoint
ls -la .agent/prompts/  # Find earlier version
cp .agent/prompts/prompt_EARLIER.md PROMPT.md

Recovery Validation

Verify checkpoint integrity:

#!/usr/bin/env python3
import json
from pathlib import Path

def validate_checkpoints():
    checkpoint_dir = Path('.agent/metrics')
    for state_file in checkpoint_dir.glob('state_*.json'):
        try:
            with open(state_file) as f:
                data = json.load(f)
                assert 'iteration' in data
                assert 'agent' in data
                print(f"✓ {state_file.name}")
        except Exception as e:
            print(f"✗ {state_file.name}: {e}")

validate_checkpoints()

Performance Impact

Checkpoint Overhead

Interval   Overhead          Use Case
1          High (5-10%)      Critical tasks
5          Moderate (2-5%)   Standard tasks
10         Low (1-2%)        Long tasks
20+        Minimal (<1%)     Simple tasks

Optimization Tips

  1. Use SSDs for checkpoint storage
  2. Disable unnecessary features (e.g., --no-archive if not needed)
  3. Adjust intervals based on task criticality
  4. Clean up regularly to maintain performance

Integration

CI/CD Integration

# .github/workflows/ralph.yml
name: Ralph Orchestration
on:
  push:
    branches: [main]

jobs:
  orchestrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Ralph
        run: |
          python ralph_orchestrator.py \
            --checkpoint-interval 10 \
            --max-iterations 100

      - name: Save Checkpoints
        uses: actions/upload-artifact@v4
        with:
          name: ralph-checkpoints
          path: .agent/

Monitoring Integration

#!/bin/bash
# Send checkpoint events to monitoring
CHECKPOINT_COUNT=$(ls .agent/prompts/*.md 2>/dev/null | wc -l)
curl -X POST https://metrics.example.com/api/v1/metrics \
  -d "ralph.checkpoints.count=$CHECKPOINT_COUNT"

Next Steps