
Monitoring and Observability

Overview

Ralph Orchestrator provides comprehensive monitoring capabilities to track execution, performance, and system health. This guide covers monitoring tools, metrics, and best practices.

Built-in Monitoring

Ralph's monitoring system collects and routes execution data through multiple channels:

                           📊 Metrics Collection Flow

                                                                        ┌────────────────────┐
                                                                   ┌──> │ .agent/metrics/    │
                                                                   │    └────────────────────┘
┌─────────────────┐     ┌──────────────────┐     ┌───────────────┐ │    ┌────────────────────┐
│   Orchestrator  │ ──> │ Iteration Events │ ──> │    Metrics    │ ┼──> │   .agent/logs/     │
└─────────────────┘     └──────────────────┘     │   Collector   │ │    └────────────────────┘
                                                 └───────────────┘ │    ┌────────────────────┐
                                                                   └──> │     Console        │
                                                                        └────────────────────┘
graph-easy source
graph { label: "📊 Metrics Collection Flow"; flow: east; }
[ Orchestrator ] -> [ Iteration Events ] -> [ Metrics Collector ]
[ Metrics Collector ] -> [ .agent/metrics/ ]
[ Metrics Collector ] -> [ .agent/logs/ ]
[ Metrics Collector ] -> [ Console ]

State Files

Ralph automatically generates state files in .agent/metrics/:

{
  "iteration_count": 15,
  "runtime": 234.5,
  "start_time": "2025-09-07T15:44:35",
  "agent": "claude",
  "prompt_file": "PROMPT.md",
  "status": "running",
  "errors": [],
  "checkpoints": [5, 10, 15],
  "last_output_size": 2048
}
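
To consume these files programmatically, here is a minimal sketch that loads the most recent state file and prints a one-line summary; it assumes the state_*.json naming used elsewhere in this guide:

import glob
import json
import os

def latest_state(metrics_dir='.agent/metrics'):
    """Return the most recently modified state file as a dict, or None."""
    files = glob.glob(os.path.join(metrics_dir, 'state_*.json'))
    if not files:
        return None
    with open(max(files, key=os.path.getmtime)) as f:
        return json.load(f)

state = latest_state()
if state:
    print(f"{state['status']}: iteration {state['iteration_count']}, "
          f"{state['runtime']:.1f}s elapsed, {len(state['errors'])} errors")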

Real-time Status

# Check current status
./ralph status

# Output:
Ralph Orchestrator Status
=========================
Status: RUNNING
Current Iteration: 15
Runtime: 3m 54s
Agent: claude
Last Checkpoint: iteration 15
Errors: 0

Execution Logs

Verbose Mode

# Enable detailed logging
./ralph run --verbose

# Output includes:
# - Agent commands
# - Execution times
# - Output summaries
# - Error details

Log Levels

import logging

# Configure log level
logging.basicConfig(
    level=logging.DEBUG,  # DEBUG, INFO, WARNING, ERROR
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('.agent/logs/ralph.log'),
        logging.StreamHandler()
    ]
)

Metrics Collection

Performance Metrics

# Automatically collected metrics
metrics = {
    'iteration_times': [],      # Time per iteration
    'agent_response_times': [], # Agent execution duration
    'output_sizes': [],         # Response size per iteration
    'error_rate': 0.0,         # Errors per iteration
    'checkpoint_times': [],     # Checkpoint creation duration
    'total_api_calls': 0       # Total agent invocations
}

Custom Metrics

# Add custom metrics collection
import json
import time

class MetricsCollector:
    def __init__(self):
        self.metrics = []

    def record_metric(self, name: str, value: float):
        """Record a custom metric with a timestamp"""
        self.metrics.append({
            'name': name,
            'value': value,
            'timestamp': time.time()
        })

    def export_metrics(self):
        """Export collected metrics to JSON"""
        with open('.agent/metrics/custom.json', 'w') as f:
            json.dump(self.metrics, f, indent=2)
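
A short usage example for the collector above; the metric names and values are arbitrary placeholders:

# Illustrative usage of the collector defined above
collector = MetricsCollector()
collector.record_metric('prompt_tokens', 1250)
collector.record_metric('iteration_wall_time', 42.7)
collector.export_metrics()  # writes .agent/metrics/custom.json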

Monitoring Tools

1. Ralph Monitor (Built-in)

# Continuous monitoring
watch -n 5 './ralph status'

# Tail logs
tail -f .agent/logs/ralph.log

# Monitor metrics
watch -n 10 'cat .agent/metrics/state_*.json | jq .'

2. Git History Monitoring

# View checkpoint history
git log --oneline | grep "Ralph checkpoint"

# Analyze code changes over time
git diff --stat HEAD~10..HEAD

# Track file modifications
git log --follow -p PROMPT.md

3. System Resource Monitoring

# Monitor Ralph process
htop -p $(pgrep -f ralph_orchestrator)

# Track resource usage
pidstat -p $(pgrep -f ralph_orchestrator) 1

# Monitor file system changes
inotifywait -m -r . -e modify,create,delete

Dashboard Setup

Terminal Dashboard

Create monitor.sh:

#!/bin/bash
# Ralph Monitoring Dashboard

while true; do
    clear
    echo "=== RALPH ORCHESTRATOR MONITOR ==="
    echo ""

    # Status
    ./ralph status
    echo ""

    # Recent errors
    echo "Recent Errors:"
    tail -n 5 .agent/logs/ralph.log | grep ERROR || echo "No errors"
    echo ""

    # Resource usage
    echo "Resource Usage:"
    ps aux | grep ralph_orchestrator | grep -v grep
    echo ""

    # Latest checkpoint
    echo "Latest Checkpoint:"
    ls -lt .agent/checkpoints/ | head -2

    sleep 5
done

Web Dashboard (Optional)

# Simple Flask dashboard
from flask import Flask, jsonify, render_template_string
import json
import glob
import os

app = Flask(__name__)

@app.route('/metrics')
def metrics():
    # Get latest state file
    state_files = glob.glob('.agent/metrics/state_*.json')
    if state_files:
        latest = max(state_files, key=os.path.getmtime)  # most recently modified
        with open(latest) as f:
            return jsonify(json.load(f))
    return jsonify({'status': 'no data'})

@app.route('/')
def dashboard():
    return render_template_string('''
    <html>
        <head>
            <title>Ralph Dashboard</title>
            <script>
                function updateMetrics() {
                    fetch('/metrics')
                        .then(response => response.json())
                        .then(data => {
                            document.getElementById('metrics').innerHTML =
                                JSON.stringify(data, null, 2);
                        });
                }
                setInterval(updateMetrics, 5000);
            </script>
        </head>
        <body onload="updateMetrics()">
            <h1>Ralph Orchestrator Dashboard</h1>
            <pre id="metrics"></pre>
        </body>
    </html>
    ''')

if __name__ == '__main__':
    app.run(debug=True, port=5000)

Alerting

Error Detection

# Monitor for errors (send_alert is sketched under Notification Methods below)
import json

def check_errors():
    with open('.agent/metrics/state_latest.json') as f:
        state = json.load(f)

    if state.get('errors'):
        send_alert(f"Ralph encountered errors: {state['errors']}")

    if state.get('iteration_count', 0) > 100:
        send_alert("Ralph exceeded 100 iterations")

    if state.get('runtime', 0) > 14400:  # 4 hours
        send_alert("Ralph runtime exceeded 4 hours")

Notification Methods

# Desktop notification
notify-send "Ralph Alert" "Task completed successfully"

# Email alert
echo "Ralph task failed" | mail -s "Ralph Alert" admin@example.com

# Slack webhook
curl -X POST -H 'Content-type: application/json' \
    --data '{"text":"Ralph task completed"}' \
    YOUR_SLACK_WEBHOOK_URL
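
The check_errors hook above calls send_alert, which Ralph does not define for you. A minimal sketch that forwards alerts to a Slack webhook; the SLACK_WEBHOOK_URL environment variable is an assumption:

import json
import os
import sys
import urllib.request

def send_alert(message: str):
    """Post an alert to a Slack webhook; fall back to stderr if unconfigured."""
    webhook = os.environ.get('SLACK_WEBHOOK_URL')  # assumed env var
    if not webhook:
        print(f"ALERT: {message}", file=sys.stderr)
        return
    payload = json.dumps({'text': message}).encode()
    req = urllib.request.Request(
        webhook,
        data=payload,
        headers={'Content-Type': 'application/json'},
    )
    urllib.request.urlopen(req, timeout=10)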

Performance Analysis

Iteration Analysis

# Analyze iteration performance
import glob
import json

import pandas as pd
import matplotlib.pyplot as plt

def analyze_iterations():
    # Load metrics
    metrics = []
    for file in glob.glob('.agent/metrics/state_*.json'):
        with open(file) as f:
            metrics.append(json.load(f))

    # Create DataFrame
    df = pd.DataFrame(metrics)

    # Plot iteration times
    plt.figure(figsize=(10, 6))
    plt.plot(df['iteration_count'], df['runtime'])
    plt.xlabel('Iteration')
    plt.ylabel('Cumulative Runtime (seconds)')
    plt.title('Ralph Execution Performance')
    plt.savefig('.agent/performance.png')

    # Statistics
    print(f"Average iteration time: {df['runtime'].diff().mean():.2f}s")
    print(f"Total iterations: {df['iteration_count'].max()}")
    print(f"Error rate: {len(df[df['errors'].notna()]) / len(df):.2%}")

Cost Tracking

# Estimate API costs
import glob
import json

def calculate_costs():
    costs = {
        'claude': 0.01,    # $ per call
        'gemini': 0.005,   # $ per call
        'q': 0.0           # Free
    }

    total_cost = 0
    for file in glob.glob('.agent/metrics/state_*.json'):
        with open(file) as f:
            state = json.load(f)
            agent = state.get('agent', 'claude')
            total_cost += costs.get(agent, 0)

    print(f"Estimated cost: ${total_cost:.2f}")
    return total_cost

Log Management

Log Rotation

# Configure log rotation
import logging.handlers

handler = logging.handlers.RotatingFileHandler(
    '.agent/logs/ralph.log',
    maxBytes=10*1024*1024,  # 10MB
    backupCount=5
)
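
The handler still has to be attached to a logger; one way to wire it up (the 'ralph' logger name is an assumption):

formatter = logging.Formatter(
    '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

logger = logging.getLogger('ralph')  # assumed logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)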

Log Aggregation

# Combine all logs
cat .agent/logs/*.log > combined.log

# Filter by date
grep "2025-09-07" .agent/logs/*.log

# Extract errors only
grep -E "ERROR|CRITICAL" .agent/logs/*.log > errors.log

Log Analysis

# Count errors by type
grep ERROR .agent/logs/*.log | cut -d: -f4 | sort | uniq -c

# Find longest running iterations
grep "Iteration .* completed" .agent/logs/*.log | \
    awk '{print $NF}' | sort -rn | head -10

# Agent usage statistics
grep "Using agent:" .agent/logs/*.log | \
    cut -d: -f4 | sort | uniq -c

Health Checks

Automated Health Checks

import os
import shutil
import subprocess

def health_check():
    """Comprehensive health check"""
    health = {
        'status': 'healthy',
        'checks': []
    }

    # Check prompt file exists
    if not os.path.exists('PROMPT.md'):
        health['status'] = 'unhealthy'
        health['checks'].append('PROMPT.md missing')

    # Check agent availability
    for agent in ['claude', 'q', 'gemini']:
        if shutil.which(agent):
            health['checks'].append(f'{agent}: available')
        else:
            health['checks'].append(f'{agent}: not found')

    # Check disk space
    stat = os.statvfs('.')
    free_space = stat.f_bavail * stat.f_frsize / (1024**3)  # GB
    if free_space < 1:
        if health['status'] == 'healthy':  # don't mask an unhealthy status
            health['status'] = 'warning'
        health['checks'].append(f'Low disk space: {free_space:.2f}GB')

    # Check Git status
    result = subprocess.run(['git', 'status', '--porcelain'],
                          capture_output=True, text=True)
    if result.stdout:
        health['checks'].append('Uncommitted changes present')

    return health
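
A small driver, suitable for cron or CI, that runs the check and signals failure through the exit code (a sketch, not part of Ralph itself):

import sys

if __name__ == '__main__':
    report = health_check()
    for check in report['checks']:
        print(f"  - {check}")
    print(f"Overall: {report['status']}")
    sys.exit(0 if report['status'] == 'healthy' else 1)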

Troubleshooting with Monitoring

Common Issues

Symptom                Check                          Solution
High iteration count   .agent/metrics/state_*.json    Review prompt clarity
Slow performance       Iteration times in logs        Check agent response times
Memory issues          System monitor                 Increase limits or add swap
Repeated errors        Error patterns in logs         Fix underlying issue
No progress            Git diff output                Check if agent is making changes

Debug Mode

# Maximum verbosity
RALPH_DEBUG=1 ./ralph run --verbose

# Trace execution
python -m trace -t ralph_orchestrator.py

# Profile performance
python -m cProfile -o profile.stats ralph_orchestrator.py

Best Practices

  1. Regular Monitoring
     - Check status every 10-15 minutes
     - Review logs for anomalies
     - Monitor resource usage

  2. Metric Retention
     - Archive old metrics weekly (see the archiving sketch after this list)
     - Compress logs monthly
     - Maintain a 30-day history

  3. Alert Fatigue
     - Set reasonable thresholds
     - Group related alerts
     - Prioritize critical issues

  4. Documentation
     - Document custom metrics
     - Track performance baselines
     - Note configuration changes
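
One possible way to implement the weekly metric archiving mentioned under Metric Retention; the archive/ subdirectory and the 30-day cutoff are assumptions:

import glob
import gzip
import os
import shutil
import time

def archive_old_metrics(metrics_dir='.agent/metrics', max_age_days=30):
    """Gzip state files older than max_age_days into an archive/ subfolder."""
    archive_dir = os.path.join(metrics_dir, 'archive')  # assumed layout
    os.makedirs(archive_dir, exist_ok=True)
    cutoff = time.time() - max_age_days * 86400
    for path in glob.glob(os.path.join(metrics_dir, 'state_*.json')):
        if os.path.getmtime(path) < cutoff:
            target = os.path.join(archive_dir, os.path.basename(path) + '.gz')
            with open(path, 'rb') as src, gzip.open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)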