Monitoring and Observability¶
Overview¶
Ralph Orchestrator provides built-in monitoring for tracking execution, performance, and system health. This guide covers the monitoring tools, the metrics they collect, and best practices.
Built-in Monitoring¶
Ralph's metrics collector receives iteration events from the orchestrator and routes them to three destinations: metric files, log files, and the console:
📊 Metrics Collection Flow
                                                                        ┌────────────────────┐
                                                                   ┌──> │  .agent/metrics/   │
                                                                   │    └────────────────────┘
┌─────────────────┐     ┌──────────────────┐     ┌───────────────┐ │    ┌────────────────────┐
│  Orchestrator   │ ──> │ Iteration Events │ ──> │    Metrics    │ ┼──> │    .agent/logs/    │
└─────────────────┘     └──────────────────┘     │   Collector   │ │    └────────────────────┘
                                                 └───────────────┘ │    ┌────────────────────┐
                                                                   └──> │      Console       │
                                                                        └────────────────────┘
State Files¶
Ralph automatically generates state files in .agent/metrics/:
{
  "iteration_count": 15,
  "runtime": 234.5,
  "start_time": "2025-09-07T15:44:35",
  "agent": "claude",
  "prompt_file": "PROMPT.md",
  "status": "running",
  "errors": [],
  "checkpoints": [5, 10, 15],
  "last_output_size": 2048
}
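A state file like the one above can be read from a script. The following is a minimal sketch, assuming the files follow the `state_*.json` naming used elsewhere in this guide; the helper names `load_latest_state` and `summarize` are hypothetical and not part of Ralph.

```python
import glob
import json
import os


def load_latest_state(metrics_dir=".agent/metrics"):
    """Return the most recently modified state file as a dict, or None if there is none."""
    paths = glob.glob(os.path.join(metrics_dir, "state_*.json"))
    if not paths:
        return None
    latest = max(paths, key=os.path.getmtime)
    with open(latest) as f:
        return json.load(f)


def summarize(state):
    """Build a one-line summary from the fields shown in the example state file."""
    errors = state.get("errors", [])
    return (
        f"{state.get('status', 'unknown')}: "
        f"iteration {state.get('iteration_count', 0)}, "
        f"{state.get('runtime', 0.0):.1f}s elapsed, "
        f"{len(errors)} error(s)"
    )
```

Selecting by modification time avoids relying on how the file names sort, which this guide does not specify.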
Real-time Status¶
# Check current status
./ralph status
# Output:
Ralph Orchestrator Status
=========================
Status: RUNNING
Current Iteration: 15
Runtime: 3m 54s
Agent: claude
Last Checkpoint: iteration 15
Errors: 0
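The state file stores runtime in seconds (234.5), while the status output shows it as `3m 54s`. How Ralph formats the value internally is not shown here; the sketch below is one plausible conversion, with the hypothetical helper name `format_runtime`.

```python
def format_runtime(seconds: float) -> str:
    """Render a duration in seconds as e.g. '3m 54s', dropping fractional seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    if minutes:
        return f"{minutes}m {secs}s"
    return f"{secs}s"


print(format_runtime(234.5))  # prints "3m 54s"
```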
Execution Logs¶
Verbose Mode¶
# Enable detailed logging
./ralph run --verbose
# Output includes:
# - Agent commands
# - Execution times
# - Output summaries
# - Error details
Log Levels¶
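import logging
import os

# FileHandler does not create missing directories, so create the log directory first
os.makedirs('.agent/logs', exist_ok=True)

# Configure log level
logging.basicConfig(
    level=logging.DEBUG,  # DEBUG, INFO, WARNING, ERROR
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('.agent/logs/ralph.log'),  # persist to file
        logging.StreamHandler()  # echo to the console (stderr by default)
    ]
)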
Metrics Collection¶
Performance Metrics¶
# Automatically collected metrics
metrics = {
    'iteration_times': [],        # Time per iteration
    'agent_response_times': [],   # Agent execution duration
    'output_sizes': [],           # Response size per iteration
    'error_rate': 0.0,            # Errors per iteration
    'checkpoint_times': [],       # Checkpoint creation duration
    'total_api_calls': 0          # Total agent invocations
}
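To illustrate how a dictionary with these keys might be reduced to a few headline numbers, here is a small sketch. The function name `summarize_metrics` and the choice of statistics are assumptions for illustration, not something Ralph provides.

```python
import statistics


def summarize_metrics(metrics: dict) -> dict:
    """Reduce the raw per-iteration lists to a few summary values."""
    times = metrics.get("iteration_times", [])
    sizes = metrics.get("output_sizes", [])
    return {
        "iterations": len(times),
        "mean_iteration_time": statistics.fmean(times) if times else 0.0,
        "max_iteration_time": max(times, default=0.0),
        "mean_output_size": statistics.fmean(sizes) if sizes else 0.0,
    }
```

Empty lists yield zeros rather than raising, so the function can be called before the first iteration completes.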
Custom Metrics¶
# Add custom metrics collection
import json
import time

class MetricsCollector:
    def __init__(self):
        # Recorded metrics, in insertion order
        self.metrics = []

    def record_metric(self, name: str, value: float):
        """Record custom metric"""
        timestamp = time.time()
        self.metrics.append({
            'name': name,
            'value': value,
            'timestamp': timestamp
        })

    def export_metrics(self):
        """Export metrics to JSON"""
        with open('.agent/metrics/custom.json', 'w') as f:
            json.dump(self.metrics, f, indent=2)
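Custom metrics are often durations. One way to capture them is a context manager that times a block and hands the result to any callable taking `(name, value)`, such as `record_metric` above. The helper `timed` below is a hypothetical sketch, not part of Ralph.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(record, name):
    """Time the enclosed block and pass (name, elapsed_seconds) to `record`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed steps are still measured
        record(name, time.perf_counter() - start)


# Usage: collect into a plain list here; a collector's record_metric would also fit
samples = []
with timed(lambda n, v: samples.append((n, v)), "example_step"):
    sum(range(1000))
```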
Monitoring Tools¶
1. Ralph Monitor (Built-in)¶
# Continuous monitoring
watch -n 5 './ralph status'
# Tail logs
tail -f .agent/logs/ralph.log
# Monitor metrics
watch -n 10 'cat .agent/metrics/state_*.json | jq .'
2. Git History Monitoring¶
# View checkpoint history
git log --oneline | grep "Ralph checkpoint"
# Analyze code changes over time
git diff --stat HEAD~10..HEAD
# Track file modifications
git log --follow -p PROMPT.md
3. System Resource Monitoring¶
# Monitor Ralph process (-d, joins multiple PIDs with commas, as htop expects)
htop -p "$(pgrep -d, -f ralph_orchestrator)"
# Track resource usage
pidstat -p "$(pgrep -d, -f ralph_orchestrator)" 1
# Monitor file system changes
inotifywait -m -r . -e modify,create,delete
Dashboard Setup¶
Terminal Dashboard¶
Create monitor.sh:
#!/bin/bash
# Ralph Monitoring Dashboard
while true; do
    clear
    echo "=== RALPH ORCHESTRATOR MONITOR ==="
    echo ""
    # Status
    ./ralph status
    echo ""
    # Recent errors: filter the whole log first, then keep the five latest matches
    echo "Recent Errors:"
    errors=$(grep ERROR .agent/logs/ralph.log 2>/dev/null | tail -n 5)
    echo "${errors:-No errors}"
    echo ""
    # Resource usage
    echo "Resource Usage:"
    ps aux | grep ralph_orchestrator | grep -v grep
    echo ""
    # Latest checkpoint
    echo "Latest Checkpoint:"
    ls -lt .agent/checkpoints/ | head -2
    sleep 5
done
Web Dashboard (Optional)¶
# Simple Flask dashboard
from flask import Flask, jsonify, render_template_string
import glob
import json
import os

app = Flask(__name__)

@app.route('/metrics')
def metrics():
    # Get the most recently modified state file
    state_files = glob.glob('.agent/metrics/state_*.json')
    if state_files:
        latest = max(state_files, key=os.path.getmtime)
        with open(latest) as f:
            return jsonify(json.load(f))
    return jsonify({'status': 'no data'})

@app.route('/')
def dashboard():
    return render_template_string('''
<html>
<head>
    <title>Ralph Dashboard</title>
    <script>
        function updateMetrics() {
            fetch('/metrics')
                .then(response => response.json())
                .then(data => {
                    // textContent avoids interpreting state values as HTML
                    document.getElementById('metrics').textContent =
                        JSON.stringify(data, null, 2);
                });
        }
        setInterval(updateMetrics, 5000);
    </script>
</head>
<body onload="updateMetrics()">
    <h1>Ralph Orchestrator Dashboard</h1>
    <pre id="metrics"></pre>
</body>
</html>
''')

if __name__ == '__main__':
    # debug=True enables Flask's reloader and debugger; use it for local development only
    app.run(debug=True, port=5000)
Alerting¶
Error Detection¶
# Monitor for errors
import json

def check_errors():
    # send_alert() is a placeholder for whatever notification function you use
    with open('.agent/metrics/state_latest.json') as f:
        state = json.load(f)
    if state.get('errors'):
        send_alert(f"Ralph encountered errors: {state['errors']}")
    if state.get('iteration_count', 0) > 100:
        send_alert("Ralph exceeded 100 iterations")
    if state.get('runtime', 0) > 14400:  # 4 hours
        send_alert("Ralph runtime exceeded 4 hours")
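The check above calls `send_alert`, which this guide does not define. A minimal sketch of one possible implementation follows: it always logs the alert and, if a webhook URL is available, posts a Slack-style JSON payload. The environment variable `RALPH_SLACK_WEBHOOK_URL` and the helper `build_slack_payload` are assumptions made for this example.

```python
import json
import logging
import os
import urllib.request

logger = logging.getLogger("ralph.alerts")


def build_slack_payload(message: str) -> bytes:
    """Encode the message in the {"text": ...} shape used by Slack incoming webhooks."""
    return json.dumps({"text": message}).encode("utf-8")


def send_alert(message: str, webhook_url=None, timeout: float = 10.0) -> bool:
    """Log the alert and optionally POST it to a webhook.

    Returns True only if a webhook was configured and accepted the request.
    """
    logger.warning("ALERT: %s", message)
    # RALPH_SLACK_WEBHOOK_URL is a hypothetical variable name, not a Ralph setting
    webhook_url = webhook_url or os.environ.get("RALPH_SLACK_WEBHOOK_URL")
    if not webhook_url:
        return False
    try:
        request = urllib.request.Request(
            webhook_url,
            data=build_slack_payload(message),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are OSError subclasses; ValueError covers malformed URLs
        logger.error("Alert delivery failed: %s", exc)
        return False
```

Logging first means an alert is never lost silently when the webhook is unreachable or unset.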
Notification Methods¶
# Desktop notification
notify-send "Ralph Alert" "Task completed successfully"
# Email alert
echo "Ralph task failed" | mail -s "Ralph Alert" admin@example.com
# Slack webhook
curl -X POST -H 'Content-type: application/json' \
--data '{"text":"Ralph task completed"}' \
YOUR_SLACK_WEBHOOK_URL
Performance Analysis¶
Iteration Analysis¶
# Analyze iteration performance
import glob
import json

import matplotlib.pyplot as plt
import pandas as pd

def analyze_iterations():
    # Load metrics
    metrics = []
    for file in glob.glob('.agent/metrics/state_*.json'):
        with open(file) as f:
            metrics.append(json.load(f))
    if not metrics:
        print("No state files found")
        return

    # Create DataFrame, ordered by iteration so diff() compares consecutive snapshots
    df = pd.DataFrame(metrics).sort_values('iteration_count')

    # Plot iteration times
    plt.figure(figsize=(10, 6))
    plt.plot(df['iteration_count'], df['runtime'])
    plt.xlabel('Iteration')
    plt.ylabel('Cumulative Runtime (seconds)')
    plt.title('Ralph Execution Performance')
    plt.savefig('.agent/performance.png')

    # Statistics
    print(f"Average iteration time: {df['runtime'].diff().mean():.2f}s")
    print(f"Total iterations: {df['iteration_count'].max()}")
    # Count only snapshots with a non-empty error list (an empty list is not an error)
    has_errors = df['errors'].apply(lambda e: isinstance(e, list) and len(e) > 0)
    print(f"Error rate: {has_errors.mean():.2%}")
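Because `runtime` is cumulative, per-iteration cost has to be derived from the difference between snapshots. The sketch below does this without pandas, assuming each snapshot carries the `iteration_count` and `runtime` fields shown in the state file; the function name `iteration_durations` is hypothetical.

```python
def iteration_durations(snapshots):
    """Return (iteration_count, seconds_per_iteration) for each consecutive snapshot pair.

    `snapshots` is a list of dicts with 'iteration_count' and cumulative 'runtime'.
    """
    ordered = sorted(snapshots, key=lambda s: s["iteration_count"])
    rates = []
    for prev, cur in zip(ordered, ordered[1:]):
        steps = cur["iteration_count"] - prev["iteration_count"]
        if steps > 0:  # skip duplicate snapshots of the same iteration
            elapsed = cur["runtime"] - prev["runtime"]
            rates.append((cur["iteration_count"], elapsed / steps))
    return rates
```

Dividing by the number of iterations between snapshots keeps the result meaningful when state is written only at checkpoints rather than after every iteration.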
Cost Tracking¶
# Estimate API costs
import glob
import json

def calculate_costs():
    # Flat per-call rates used for a rough estimate; adjust them to your actual pricing
    costs = {
        'claude': 0.01,   # $ per call
        'gemini': 0.005,  # $ per call
        'q': 0.0          # Free
    }
    total_cost = 0
    # Each state file is counted as one agent call
    for file in glob.glob('.agent/metrics/state_*.json'):
        with open(file) as f:
            state = json.load(f)
        agent = state.get('agent', 'claude')
        total_cost += costs.get(agent, 0)
    print(f"Estimated cost: ${total_cost:.2f}")
    return total_cost
Log Management¶
Log Rotation¶
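# Configure log rotation
import logging
import logging.handlers

handler = logging.handlers.RotatingFileHandler(
    '.agent/logs/ralph.log',
    maxBytes=10*1024*1024,  # 10MB
    backupCount=5  # keeps ralph.log.1 through ralph.log.5
)
# Attach the handler so log records actually reach the rotating file
logging.getLogger().addHandler(handler)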
Log Aggregation¶
# Combine all logs
cat .agent/logs/*.log > combined.log
# Filter by date
grep "2025-09-07" .agent/logs/*.log
# Extract errors only
grep -E "ERROR|CRITICAL" .agent/logs/*.log > errors.log
Log Analysis¶
# Count errors by message (assumes the " - "-separated format from Log Levels)
grep -h ERROR .agent/logs/*.log | awk -F' - ' '{print $4}' | sort | uniq -c
# Find longest running iterations
grep "Iteration .* completed" .agent/logs/*.log | \
awk '{print $NF}' | sort -rn | head -10
# Agent usage statistics
grep -oh "Using agent: .*" .agent/logs/*.log | sort | uniq -c
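The same kind of analysis can be done in Python when the shell pipeline becomes awkward. This sketch assumes log lines follow the `%(asctime)s - %(name)s - %(levelname)s - %(message)s` format from the Log Levels section; the pattern `LOG_LINE` and the function `count_levels` are illustrative names.

```python
import re
from collections import Counter

# Matches "<time> - <logger name> - <LEVEL> - <message>"
LOG_LINE = re.compile(
    r"^(?P<time>.+?) - (?P<name>.+?) - (?P<level>[A-Z]+) - (?P<message>.*)$"
)


def count_levels(lines):
    """Count log records per level, ignoring lines that do not match the format."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            counts[match.group("level")] += 1
    return counts
```

Continuation lines such as tracebacks do not match the pattern and are skipped, so each record is counted once.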
Health Checks¶
Automated Health Checks¶
import os
import shutil
import subprocess

def health_check():
    """Comprehensive health check"""
    health = {
        'status': 'healthy',
        'checks': []
    }
    # Check prompt file exists
    if not os.path.exists('PROMPT.md'):
        health['status'] = 'unhealthy'
        health['checks'].append('PROMPT.md missing')
    # Check agent availability
    for agent in ['claude', 'q', 'gemini']:
        if shutil.which(agent):
            health['checks'].append(f'{agent}: available')
        else:
            health['checks'].append(f'{agent}: not found')
    # Check disk space (os.statvfs is available on Unix only)
    stat = os.statvfs('.')
    free_space = stat.f_bavail * stat.f_frsize / (1024**3)  # GB
    if free_space < 1:
        # Do not downgrade an existing 'unhealthy' status to 'warning'
        if health['status'] == 'healthy':
            health['status'] = 'warning'
        health['checks'].append(f'Low disk space: {free_space:.2f}GB')
    # Check Git status
    result = subprocess.run(['git', 'status', '--porcelain'],
                            capture_output=True, text=True)
    if result.stdout:
        health['checks'].append('Uncommitted changes present')
    return health
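The disk space check uses `os.statvfs`, which exists only on Unix. Where portability matters, `shutil.disk_usage` from the standard library is one alternative. The sketch below keeps the 1 GB threshold from the health check; the helper names `free_space_gb` and `disk_status` are hypothetical.

```python
import shutil


def free_space_gb(path: str = ".") -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / (1024 ** 3)


def disk_status(free_gb: float, warn_below_gb: float = 1.0) -> str:
    """Map free space to the status strings used by the health check."""
    return "warning" if free_gb < warn_below_gb else "healthy"
```

Separating the measurement from the threshold decision makes the threshold easy to test and to tune.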
Troubleshooting with Monitoring¶
Common Issues¶
| Symptom | Check | Solution |
|---|---|---|
| High iteration count | .agent/metrics/state_*.json | Review prompt clarity |
| Slow performance | Iteration times in logs | Check agent response times |
| Memory issues | System monitor | Increase limits or add swap |
| Repeated errors | Error patterns in logs | Fix underlying issue |
| No progress | Git diff output | Check if agent is making changes |
Debug Mode¶
# Maximum verbosity
RALPH_DEBUG=1 ./ralph run --verbose
# Trace execution
python -m trace -t ralph_orchestrator.py
# Profile performance
python -m cProfile -o profile.stats ralph_orchestrator.py
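A `profile.stats` file written by `cProfile` can be inspected with the standard `pstats` module. The sketch below is self-contained: it profiles a stand-in function (`busy_work`, hypothetical) into a temporary file and prints the ten entries with the highest cumulative time. For a real run, point `pstats.Stats` at the `profile.stats` produced by the command above.

```python
import cProfile
import os
import pstats
import tempfile


def busy_work(n: int = 10_000) -> int:
    """Stand-in workload so the example has something to profile."""
    return sum(i * i for i in range(n))


# Write profiling data to a temporary file, mirroring `-o profile.stats`
stats_path = os.path.join(tempfile.mkdtemp(), "profile.stats")
profiler = cProfile.Profile()
profiler.enable()
busy_work()
profiler.disable()
profiler.dump_stats(stats_path)

# Load the file and show the ten most expensive entries by cumulative time
stats = pstats.Stats(stats_path)
stats.sort_stats("cumulative").print_stats(10)
```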
Best Practices¶
- Regular Monitoring
  - Check status every 10-15 minutes
  - Review logs for anomalies
  - Monitor resource usage
- Metric Retention
  - Archive old metrics weekly
  - Compress logs monthly
  - Maintain 30-day history
- Alert Fatigue
  - Set reasonable thresholds
  - Group related alerts
  - Prioritize critical issues
- Documentation
  - Document custom metrics
  - Track performance baselines
  - Note configuration changes
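The retention practices above can be automated. As a rough sketch, the function below gzips files in a directory that have not been modified for a given number of days and removes the originals. The name `compress_old_files` and its defaults are assumptions for illustration; Ralph does not ship this helper.

```python
import gzip
import os
import shutil
import time


def compress_old_files(directory: str, older_than_days: float = 7.0, suffix: str = ".json"):
    """Gzip files in `directory` older than the cutoff and return the archive paths."""
    cutoff = time.time() - older_than_days * 86400
    archived = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (name.endswith(suffix) and os.path.isfile(path)):
            continue
        if os.path.getmtime(path) >= cutoff:
            continue  # modified recently, leave it in place
        gz_path = path + ".gz"
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)  # remove the original only after the archive is written
        archived.append(gz_path)
    return archived
```

For example, `compress_old_files(".agent/metrics", older_than_days=7)` could run from a weekly scheduled job, and `compress_old_files(".agent/logs", older_than_days=30, suffix=".log")` from a monthly one.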