Production Deployment Guide¶

This guide covers deployment patterns and operational checklists for running Ralph Orchestrator in production environments. Ralph is alpha-quality and evolving, so treat this as a solid starting point and validate/tune it for your infrastructure, risk profile, and compliance needs.

Pre-Production Checklist¶

Infrastructure Requirements¶

Compute Resources
Minimum: 2 vCPUs, 4GB RAM per instance
Recommended: 4 vCPUs, 8GB RAM per instance
Auto-scaling configured (2-10 instances)
Storage
100GB SSD for application and logs
Separate persistent volumes for data
Backup storage configured
Network
Load balancer configured
SSL/TLS certificates installed
Firewall rules configured
DDoS protection enabled
High Availability
Multi-AZ deployment
Database replication
Redis cluster for caching
Message queue redundancy

Security Requirements¶

Access Control
RBAC configured
Service accounts created
API keys rotated
MFA enforced for admin access
Secrets Management
HashiCorp Vault or AWS Secrets Manager configured
All secrets encrypted at rest
Secrets rotation policy implemented
Compliance
Data encryption in transit and at rest
Audit logging enabled
GDPR/CCPA compliance verified
Security scanning completed

Production Architecture¶

graph TB
    subgraph "Internet"
        U[Users]
        API[API Clients]
    end

    subgraph "Edge Layer"
        CDN[CloudFlare CDN]
        WAF[Web Application Firewall]
    end

    subgraph "Load Balancing"
        ALB[Application Load Balancer]
        NLB[Network Load Balancer]
    end

    subgraph "Application Layer"
        R1[Ralph Instance 1]
        R2[Ralph Instance 2]
        R3[Ralph Instance N]
    end

    subgraph "Data Layer"
        PG[(PostgreSQL Primary)]
        PGR[(PostgreSQL Replica)]
        REDIS[(Redis Cluster)]
        S3[S3 Object Storage]
    end

    subgraph "Monitoring"
        PROM[Prometheus]
        GRAF[Grafana]
        ELK[ELK Stack]
    end

    U --> CDN
    API --> WAF
    CDN --> ALB
    WAF --> NLB
    ALB --> R1
    ALB --> R2
    ALB --> R3
    NLB --> R1
    R1 --> PG
    R2 --> PGR
    R3 --> REDIS
    R1 --> S3
    R1 --> PROM
    PROM --> GRAF
    R1 --> ELK

Deployment Steps¶

1. Infrastructure Setup¶

AWS Production Setup¶

# Create VPC and subnets
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# Create EKS cluster
eksctl create cluster \
  --name ralph-production \
  --version 1.28 \
  --region us-west-2 \
  --nodegroup-name workers \
  --node-type m5.xlarge \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 10 \
  --managed

# Create RDS instance
aws rds create-db-instance \
  --db-instance-identifier ralph-prod-db \
  --db-instance-class db.t3.large \
  --engine postgres \
  --engine-version 15.4 \
  --master-username ralph \
  --master-user-password $DB_PASSWORD \
  --allocated-storage 100 \
  --backup-retention-period 30 \
  --multi-az

# Create ElastiCache Redis cluster
aws elasticache create-cache-cluster \
  --cache-cluster-id ralph-prod-cache \
  --engine redis \
  --cache-node-type cache.r6g.large \
  --num-cache-nodes 3 \
  --cache-subnet-group-name ralph-cache-subnet

GCP Production Setup¶

# Create GKE cluster
gcloud container clusters create ralph-production \
  --zone us-central1-a \
  --machine-type n2-standard-4 \
  --num-nodes 3 \
  --enable-autoscaling \
  --min-nodes 2 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade

# Create Cloud SQL instance
gcloud sql instances create ralph-prod-db \
  --database-version=POSTGRES_15 \
  --cpu=4 \
  --memory=16GB \
  --region=us-central1 \
  --availability-type=REGIONAL \
  --backup \
  --backup-start-time=03:00

# Create Memorystore Redis
gcloud redis instances create ralph-prod-cache \
  --size=5 \
  --region=us-central1 \
  --redis-version=redis_6_x \
  --tier=STANDARD_HA

2. Application Configuration¶

Production Environment Variables¶

# .env.production
# API Configuration
RALPH_AGENT=auto
RALPH_MAX_ITERATIONS=1000
RALPH_MAX_RUNTIME=28800  # 8 hours
RALPH_MAX_TOKENS=5000000
RALPH_MAX_COST=500.0

# Database
DATABASE_URL=postgresql://user:pass@db.example.com:5432/ralph_prod
DATABASE_POOL_SIZE=20
DATABASE_MAX_OVERFLOW=40

# Redis Cache
REDIS_URL=redis://redis.example.com:6379/0
REDIS_POOL_SIZE=50
REDIS_TTL=3600

# Monitoring
ENABLE_METRICS=true
METRICS_PORT=8080
LOG_LEVEL=INFO
SENTRY_DSN=https://xxx@sentry.io/xxx

# Security
SECRET_KEY=$(openssl rand -hex 32)
ENCRYPTION_KEY=$(openssl rand -base64 32)
API_RATE_LIMIT=100/minute
ALLOWED_HOSTS=ralph.example.com
CORS_ORIGINS=https://app.example.com

# AI Service Keys (from secrets manager)
CLAUDE_API_KEY_SECRET=arn:aws:secretsmanager:region:account:secret:claude-key
GEMINI_API_KEY_SECRET=arn:aws:secretsmanager:region:account:secret:gemini-key

Production Settings¶

# config/production.py
import os
from typing import Dict, Any

class ProductionConfig:
    """Production configuration"""

    # Application
    DEBUG = False
    TESTING = False
    ENV = 'production'

    # Security
    SECRET_KEY = os.environ['SECRET_KEY']
    SESSION_COOKIE_SECURE = True
    SESSION_COOKIE_HTTPONLY = True
    SESSION_COOKIE_SAMESITE = 'Strict'
    PERMANENT_SESSION_LIFETIME = 3600

    # Database
    SQLALCHEMY_DATABASE_URI = os.environ['DATABASE_URL']
    SQLALCHEMY_ENGINE_OPTIONS = {
        'pool_size': 20,
        'pool_recycle': 3600,
        'pool_pre_ping': True,
        'max_overflow': 40,
        'connect_args': {
            'connect_timeout': 10,
            'application_name': 'ralph-orchestrator'
        }
    }

    # Caching
    CACHE_TYPE = 'redis'
    CACHE_REDIS_URL = os.environ['REDIS_URL']
    CACHE_DEFAULT_TIMEOUT = 3600

    # Rate Limiting
    RATELIMIT_STORAGE_URL = os.environ['REDIS_URL']
    RATELIMIT_STRATEGY = 'fixed-window'
    RATELIMIT_DEFAULT = "100/hour"

    # Monitoring
    PROMETHEUS_METRICS = True
    LOGGING_CONFIG = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'json': {
                'class': 'pythonjsonlogger.jsonlogger.JsonFormatter',
                'format': '%(asctime)s %(name)s %(levelname)s %(message)s'
            }
        },
        'handlers': {
            'console': {
                'class': 'logging.StreamHandler',
                'formatter': 'json',
                'level': 'INFO'
            },
            'file': {
                'class': 'logging.handlers.RotatingFileHandler',
                'filename': '/var/log/ralph/app.log',
                'maxBytes': 104857600,  # 100MB
                'backupCount': 10,
                'formatter': 'json'
            }
        },
        'root': {
            'level': 'INFO',
            'handlers': ['console', 'file']
        }
    }

3. Deployment Configuration¶

Kubernetes Production Manifests¶

# k8s/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ralph-orchestrator
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: ralph-orchestrator
  template:
    metadata:
      labels:
        app: ralph-orchestrator
        version: v1.0.0
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - ralph-orchestrator
            topologyKey: kubernetes.io/hostname
      containers:
      - name: ralph
        image: ghcr.io/mikeyobrien/ralph-orchestrator:v1.0.0
        ports:
        - containerPort: 8080
          name: metrics
        envFrom:
        - secretRef:
            name: ralph-secrets
        - configMapRef:
            name: ralph-config
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        volumeMounts:
        - name: config
          mountPath: /app/config
          readOnly: true
        - name: secrets
          mountPath: /app/secrets
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: ralph-config
      - name: secrets
        secret:
          secretName: ralph-secrets
          defaultMode: 0400

4. Load Balancing and SSL¶

Ingress Configuration¶

# k8s/production/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ralph-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  tls:
  - hosts:
    - ralph.example.com
    secretName: ralph-tls
  rules:
  - host: ralph.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ralph-service
            port:
              number: 80

5. Database Setup¶

Production Database Initialization¶

-- Create production database
CREATE DATABASE ralph_production;
CREATE USER ralph_app WITH ENCRYPTED PASSWORD 'secure_password';
GRANT ALL PRIVILEGES ON DATABASE ralph_production TO ralph_app;

-- Enable extensions
\c ralph_production
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_stat_statements";

-- Create schemas
CREATE SCHEMA IF NOT EXISTS ralph;
CREATE SCHEMA IF NOT EXISTS audit;

-- Set search path
ALTER USER ralph_app SET search_path TO ralph, public;

-- Create tables
CREATE TABLE ralph.orchestrations (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    task_id VARCHAR(255) UNIQUE NOT NULL,
    agent_type VARCHAR(50) NOT NULL,
    status VARCHAR(50) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP WITH TIME ZONE,
    metadata JSONB,
    metrics JSONB
);

-- Create indexes
CREATE INDEX idx_orchestrations_status ON ralph.orchestrations(status);
CREATE INDEX idx_orchestrations_created ON ralph.orchestrations(created_at);
CREATE INDEX idx_orchestrations_agent ON ralph.orchestrations(agent_type);

-- Audit table
CREATE TABLE audit.activity_log (
    id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    user_id VARCHAR(255),
    action VARCHAR(100),
    resource VARCHAR(255),
    details JSONB,
    ip_address INET
);

6. Monitoring Setup¶

Prometheus Configuration¶

# monitoring/prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'ralph-orchestrator'
    kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
        - production
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      action: keep
      regex: ralph-orchestrator
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: instance

Alert Rules¶

# monitoring/alerts.yml
groups:
- name: ralph_alerts
  interval: 30s
  rules:
  - alert: HighErrorRate
    expr: rate(ralph_errors_total[5m]) > 0.1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: High error rate detected
      description: "Error rate is {{ $value }} errors/sec"

  - alert: HighMemoryUsage
    expr: container_memory_usage_bytes{pod=~"ralph-.*"} / container_spec_memory_limit_bytes > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High memory usage
      description: "Memory usage is above 90%"

  - alert: LongRunningTask
    expr: ralph_task_duration_seconds > 14400
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Task running longer than 4 hours
      description: "Task {{ $labels.task_id }} has been running for {{ $value }} seconds"

7. Backup and Recovery¶

Automated Backup Script¶

#!/bin/bash
# backup.sh

# Configuration
BACKUP_DIR="/backups"
S3_BUCKET="s3://ralph-backups"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30

# Database backup
pg_dump $DATABASE_URL | gzip > $BACKUP_DIR/db_$DATE.sql.gz

# Application data backup
tar -czf $BACKUP_DIR/data_$DATE.tar.gz /app/data

# Upload to S3
aws s3 cp $BACKUP_DIR/db_$DATE.sql.gz $S3_BUCKET/db/
aws s3 cp $BACKUP_DIR/data_$DATE.tar.gz $S3_BUCKET/data/

# Clean old backups
find $BACKUP_DIR -name "*.gz" -mtime +$RETENTION_DAYS -delete
aws s3 ls $S3_BUCKET/db/ | while read -r line; do
  createDate=$(echo $line | awk '{print $1" "$2}')
  createDate=$(date -d "$createDate" +%s)
  olderThan=$(date -d "$RETENTION_DAYS days ago" +%s)
  if [[ $createDate -lt $olderThan ]]; then
    fileName=$(echo $line | awk '{print $4}')
    aws s3 rm $S3_BUCKET/db/$fileName
  fi
done

8. Security Hardening¶

Security Policies¶

# k8s/production/security.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: ralph-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
  readOnlyRootFilesystem: true
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ralph-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: ralph-orchestrator
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 5432  # PostgreSQL
    - protocol: TCP
      port: 6379  # Redis

Performance Optimization¶

Caching Strategy¶

# cache_config.py
from functools import lru_cache
import redis
import json

class CacheManager:
    def __init__(self, redis_url):
        self.redis = redis.from_url(redis_url)

    def cache_result(self, key, value, ttl=3600):
        """Cache result with TTL"""
        self.redis.setex(
            key,
            ttl,
            json.dumps(value)
        )

    @lru_cache(maxsize=1000)
    def get_cached(self, key):
        """Get cached value"""
        value = self.redis.get(key)
        return json.loads(value) if value else None

Database Optimization¶

-- Optimize queries
CREATE INDEX CONCURRENTLY idx_orchestrations_composite 
ON ralph.orchestrations(agent_type, status, created_at);

-- Partition large tables
CREATE TABLE ralph.orchestrations_2024 PARTITION OF ralph.orchestrations
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Vacuum and analyze
VACUUM ANALYZE ralph.orchestrations;

Disaster Recovery¶

Recovery Plan¶

RTO (Recovery Time Objective): 4 hours
RPO (Recovery Point Objective): 1 hour

Failover Procedure¶

#!/bin/bash
# failover.sh

# Check primary health
if ! curl -f https://ralph.example.com/health; then
  echo "Primary unhealthy, initiating failover"

  # Update DNS to point to standby
  aws route53 change-resource-record-sets \
    --hosted-zone-id Z123456 \
    --change-batch file://failover-dns.json

  # Promote standby database
  aws rds promote-read-replica \
    --db-instance-identifier ralph-standby-db

  # Scale up standby region
  kubectl scale deployment ralph-orchestrator \
    --replicas=5 \
    --context=standby-cluster

  # Notify team
  aws sns publish \
    --topic-arn arn:aws:sns:region:account:alerts \
    --message "Failover initiated to standby region"
fi

Maintenance¶

Rolling Updates¶

# Zero-downtime deployment
kubectl set image deployment/ralph-orchestrator \
  ralph=ghcr.io/mikeyobrien/ralph-orchestrator:v1.0.1 \
  --record

# Monitor rollout
kubectl rollout status deployment/ralph-orchestrator

# Rollback if needed
kubectl rollout undo deployment/ralph-orchestrator

Health Checks¶

# health.py
from flask import Flask, jsonify
import psutil
import redis

app = Flask(__name__)

@app.route('/health')
def health():
    """Health check endpoint"""
    checks = {
        'database': check_database(),
        'redis': check_redis(),
        'disk': check_disk_space(),
        'memory': check_memory()
    }

    status = 'healthy' if all(checks.values()) else 'unhealthy'
    return jsonify({
        'status': status,
        'checks': checks
    }), 200 if status == 'healthy' else 503

def check_database():
    try:
        db.session.execute('SELECT 1')
        return True
    except:
        return False

def check_redis():
    try:
        r = redis.from_url(REDIS_URL)
        r.ping()
        return True
    except:
        return False

def check_disk_space():
    usage = psutil.disk_usage('/')
    return usage.percent < 90

def check_memory():
    usage = psutil.virtual_memory()
    return usage.percent < 90

Production Checklist¶

Pre-Deployment¶

Deployment¶

Post-Deployment¶

Support and Maintenance¶

SLA Targets¶

Uptime: 99.9% (43.8 minutes downtime/month)
Response Time: < 500ms p95
Error Rate: < 0.1%
Recovery Time: < 4 hours

On-Call Procedures¶

Alert received via PagerDuty
Check runbook for known issues
Assess severity and impact
Implement fix or workaround
Document incident
Post-mortem if needed

Next Steps¶

Monitoring Guide - Set up comprehensive monitoring
Security Best Practices - Security hardening
Troubleshooting Guide - Common issues and solutions