Production Deployment Guide¶
Complete guide for deploying Ralph Orchestrator in production environments with enterprise-grade reliability, security, and scalability.
Pre-Production Checklist¶
Infrastructure Requirements¶
- Compute Resources
- Minimum: 2 vCPUs, 4GB RAM per instance
- Recommended: 4 vCPUs, 8GB RAM per instance
-
Auto-scaling configured (2-10 instances)
-
Storage
- 100GB SSD for application and logs
- Separate persistent volumes for data
-
Backup storage configured
-
Network
- Load balancer configured
- SSL/TLS certificates installed
- Firewall rules configured
-
DDoS protection enabled
-
High Availability
- Multi-AZ deployment
- Database replication
- Redis cluster for caching
- Message queue redundancy
Security Requirements¶
- Access Control
- RBAC configured
- Service accounts created
- API keys rotated
-
MFA enforced for admin access
-
Secrets Management
- HashiCorp Vault or AWS Secrets Manager configured
- All secrets encrypted at rest
-
Secrets rotation policy implemented
-
Compliance
- Data encryption in transit and at rest
- Audit logging enabled
- GDPR/CCPA compliance verified
- Security scanning completed
Production Architecture¶
graph TB
subgraph "Internet"
U[Users]
API[API Clients]
end
subgraph "Edge Layer"
CDN[CloudFlare CDN]
WAF[Web Application Firewall]
end
subgraph "Load Balancing"
ALB[Application Load Balancer]
NLB[Network Load Balancer]
end
subgraph "Application Layer"
R1[Ralph Instance 1]
R2[Ralph Instance 2]
R3[Ralph Instance N]
end
subgraph "Data Layer"
PG[(PostgreSQL Primary)]
PGR[(PostgreSQL Replica)]
REDIS[(Redis Cluster)]
S3[S3 Object Storage]
end
subgraph "Monitoring"
PROM[Prometheus]
GRAF[Grafana]
ELK[ELK Stack]
end
U --> CDN
API --> WAF
CDN --> ALB
WAF --> NLB
ALB --> R1
ALB --> R2
ALB --> R3
NLB --> R1
R1 --> PG
R2 --> PGR
R3 --> REDIS
R1 --> S3
R1 --> PROM
PROM --> GRAF
R1 --> ELK
Deployment Steps¶
1. Infrastructure Setup¶
AWS Production Setup¶
# Create VPC and subnets
aws ec2 create-vpc --cidr-block 10.0.0.0/16
# Create EKS cluster
eksctl create cluster \
--name ralph-production \
--version 1.28 \
--region us-west-2 \
--nodegroup-name workers \
--node-type m5.xlarge \
--nodes 3 \
--nodes-min 2 \
--nodes-max 10 \
--managed
# Create RDS instance
aws rds create-db-instance \
--db-instance-identifier ralph-prod-db \
--db-instance-class db.t3.large \
--engine postgres \
--engine-version 15.4 \
--master-username ralph \
--master-user-password $DB_PASSWORD \
--allocated-storage 100 \
--backup-retention-period 30 \
--multi-az
# Create ElastiCache Redis cluster
aws elasticache create-cache-cluster \
--cache-cluster-id ralph-prod-cache \
--engine redis \
--cache-node-type cache.r6g.large \
--num-cache-nodes 3 \
--cache-subnet-group-name ralph-cache-subnet
GCP Production Setup¶
# Create GKE cluster
gcloud container clusters create ralph-production \
--zone us-central1-a \
--machine-type n2-standard-4 \
--num-nodes 3 \
--enable-autoscaling \
--min-nodes 2 \
--max-nodes 10 \
--enable-autorepair \
--enable-autoupgrade
# Create Cloud SQL instance
gcloud sql instances create ralph-prod-db \
--database-version=POSTGRES_15 \
--cpu=4 \
--memory=16GB \
--region=us-central1 \
--availability-type=REGIONAL \
--backup \
--backup-start-time=03:00
# Create Memorystore Redis
gcloud redis instances create ralph-prod-cache \
--size=5 \
--region=us-central1 \
--redis-version=redis_6_x \
--tier=STANDARD_HA
2. Application Configuration¶
Production Environment Variables¶
# .env.production
# API Configuration
RALPH_AGENT=auto
RALPH_MAX_ITERATIONS=1000
RALPH_MAX_RUNTIME=28800 # 8 hours
RALPH_MAX_TOKENS=5000000
RALPH_MAX_COST=500.0
# Database
DATABASE_URL=postgresql://user:pass@db.example.com:5432/ralph_prod
DATABASE_POOL_SIZE=20
DATABASE_MAX_OVERFLOW=40
# Redis Cache
REDIS_URL=redis://redis.example.com:6379/0
REDIS_POOL_SIZE=50
REDIS_TTL=3600
# Monitoring
ENABLE_METRICS=true
METRICS_PORT=8080
LOG_LEVEL=INFO
SENTRY_DSN=https://xxx@sentry.io/xxx
# Security
SECRET_KEY=$(openssl rand -hex 32)
ENCRYPTION_KEY=$(openssl rand -base64 32)
API_RATE_LIMIT=100/minute
ALLOWED_HOSTS=ralph.example.com
CORS_ORIGINS=https://app.example.com
# AI Service Keys (from secrets manager)
CLAUDE_API_KEY_SECRET=arn:aws:secretsmanager:region:account:secret:claude-key
GEMINI_API_KEY_SECRET=arn:aws:secretsmanager:region:account:secret:gemini-key
Production Settings¶
# config/production.py
import os
from typing import Dict, Any
class ProductionConfig:
"""Production configuration"""
# Application
DEBUG = False
TESTING = False
ENV = 'production'
# Security
SECRET_KEY = os.environ['SECRET_KEY']
SESSION_COOKIE_SECURE = True
SESSION_COOKIE_HTTPONLY = True
SESSION_COOKIE_SAMESITE = 'Strict'
PERMANENT_SESSION_LIFETIME = 3600
# Database
SQLALCHEMY_DATABASE_URI = os.environ['DATABASE_URL']
SQLALCHEMY_ENGINE_OPTIONS = {
'pool_size': 20,
'pool_recycle': 3600,
'pool_pre_ping': True,
'max_overflow': 40,
'connect_args': {
'connect_timeout': 10,
'application_name': 'ralph-orchestrator'
}
}
# Caching
CACHE_TYPE = 'redis'
CACHE_REDIS_URL = os.environ['REDIS_URL']
CACHE_DEFAULT_TIMEOUT = 3600
# Rate Limiting
RATELIMIT_STORAGE_URL = os.environ['REDIS_URL']
RATELIMIT_STRATEGY = 'fixed-window'
RATELIMIT_DEFAULT = "100/hour"
# Monitoring
PROMETHEUS_METRICS = True
LOGGING_CONFIG = {
'version': 1,
'disable_existing_loggers': False,
'formatters': {
'json': {
'class': 'pythonjsonlogger.jsonlogger.JsonFormatter',
'format': '%(asctime)s %(name)s %(levelname)s %(message)s'
}
},
'handlers': {
'console': {
'class': 'logging.StreamHandler',
'formatter': 'json',
'level': 'INFO'
},
'file': {
'class': 'logging.handlers.RotatingFileHandler',
'filename': '/var/log/ralph/app.log',
'maxBytes': 104857600, # 100MB
'backupCount': 10,
'formatter': 'json'
}
},
'root': {
'level': 'INFO',
'handlers': ['console', 'file']
}
}
3. Deployment Configuration¶
Kubernetes Production Manifests¶
# k8s/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ralph-orchestrator
namespace: production
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: ralph-orchestrator
template:
metadata:
labels:
app: ralph-orchestrator
version: v1.0.0
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- ralph-orchestrator
topologyKey: kubernetes.io/hostname
containers:
- name: ralph
image: ghcr.io/mikeyobrien/ralph-orchestrator:v1.0.0
ports:
- containerPort: 8080
name: metrics
envFrom:
- secretRef:
name: ralph-secrets
- configMapRef:
name: ralph-config
resources:
requests:
memory: "4Gi"
cpu: "2"
limits:
memory: "8Gi"
cpu: "4"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
volumeMounts:
- name: config
mountPath: /app/config
readOnly: true
- name: secrets
mountPath: /app/secrets
readOnly: true
volumes:
- name: config
configMap:
name: ralph-config
- name: secrets
secret:
secretName: ralph-secrets
defaultMode: 0400
4. Load Balancing and SSL¶
Ingress Configuration¶
# k8s/production/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ralph-ingress
namespace: production
annotations:
kubernetes.io/ingress.class: nginx
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/rate-limit: "100"
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
tls:
- hosts:
- ralph.example.com
secretName: ralph-tls
rules:
- host: ralph.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ralph-service
port:
number: 80
5. Database Setup¶
Production Database Initialization¶
-- Create production database
CREATE DATABASE ralph_production;
CREATE USER ralph_app WITH ENCRYPTED PASSWORD 'secure_password';
GRANT ALL PRIVILEGES ON DATABASE ralph_production TO ralph_app;
-- Enable extensions
\c ralph_production
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_stat_statements";
-- Create schemas
CREATE SCHEMA IF NOT EXISTS ralph;
CREATE SCHEMA IF NOT EXISTS audit;
-- Set search path
ALTER USER ralph_app SET search_path TO ralph, public;
-- Create tables
CREATE TABLE ralph.orchestrations (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
task_id VARCHAR(255) UNIQUE NOT NULL,
agent_type VARCHAR(50) NOT NULL,
status VARCHAR(50) NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
completed_at TIMESTAMP WITH TIME ZONE,
metadata JSONB,
metrics JSONB
);
-- Create indexes
CREATE INDEX idx_orchestrations_status ON ralph.orchestrations(status);
CREATE INDEX idx_orchestrations_created ON ralph.orchestrations(created_at);
CREATE INDEX idx_orchestrations_agent ON ralph.orchestrations(agent_type);
-- Audit table
CREATE TABLE audit.activity_log (
id BIGSERIAL PRIMARY KEY,
timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
user_id VARCHAR(255),
action VARCHAR(100),
resource VARCHAR(255),
details JSONB,
ip_address INET
);
6. Monitoring Setup¶
Prometheus Configuration¶
# monitoring/prometheus.yaml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
rule_files:
- "alerts.yml"
scrape_configs:
- job_name: 'ralph-orchestrator'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- production
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: ralph-orchestrator
- source_labels: [__meta_kubernetes_pod_name]
target_label: instance
Alert Rules¶
# monitoring/alerts.yml
groups:
- name: ralph_alerts
interval: 30s
rules:
- alert: HighErrorRate
expr: rate(ralph_errors_total[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: High error rate detected
description: "Error rate is {{ $value }} errors/sec"
- alert: HighMemoryUsage
expr: container_memory_usage_bytes{pod=~"ralph-.*"} / container_spec_memory_limit_bytes > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: High memory usage
description: "Memory usage is above 90%"
- alert: LongRunningTask
expr: ralph_task_duration_seconds > 14400
for: 1m
labels:
severity: warning
annotations:
summary: Task running longer than 4 hours
description: "Task {{ $labels.task_id }} has been running for {{ $value }} seconds"
7. Backup and Recovery¶
Automated Backup Script¶
#!/bin/bash
# backup.sh
# Configuration
BACKUP_DIR="/backups"
S3_BUCKET="s3://ralph-backups"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30
# Database backup
pg_dump $DATABASE_URL | gzip > $BACKUP_DIR/db_$DATE.sql.gz
# Application data backup
tar -czf $BACKUP_DIR/data_$DATE.tar.gz /app/data
# Upload to S3
aws s3 cp $BACKUP_DIR/db_$DATE.sql.gz $S3_BUCKET/db/
aws s3 cp $BACKUP_DIR/data_$DATE.tar.gz $S3_BUCKET/data/
# Clean old backups
find $BACKUP_DIR -name "*.gz" -mtime +$RETENTION_DAYS -delete
aws s3 ls $S3_BUCKET/db/ | while read -r line; do
createDate=$(echo $line | awk '{print $1" "$2}')
createDate=$(date -d "$createDate" +%s)
olderThan=$(date -d "$RETENTION_DAYS days ago" +%s)
if [[ $createDate -lt $olderThan ]]; then
fileName=$(echo $line | awk '{print $4}')
aws s3 rm $S3_BUCKET/db/$fileName
fi
done
8. Security Hardening¶
Security Policies¶
# k8s/production/security.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: ralph-psp
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'persistentVolumeClaim'
hostNetwork: false
hostIPC: false
hostPID: false
runAsUser:
rule: 'MustRunAsNonRoot'
seLinux:
rule: 'RunAsAny'
supplementalGroups:
rule: 'RunAsAny'
fsGroup:
rule: 'RunAsAny'
readOnlyRootFilesystem: true
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: ralph-network-policy
namespace: production
spec:
podSelector:
matchLabels:
app: ralph-orchestrator
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 8080
egress:
- to:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 443
- protocol: TCP
port: 5432 # PostgreSQL
- protocol: TCP
port: 6379 # Redis
Performance Optimization¶
Caching Strategy¶
# cache_config.py
from functools import lru_cache
import redis
import json
class CacheManager:
def __init__(self, redis_url):
self.redis = redis.from_url(redis_url)
def cache_result(self, key, value, ttl=3600):
"""Cache result with TTL"""
self.redis.setex(
key,
ttl,
json.dumps(value)
)
@lru_cache(maxsize=1000)
def get_cached(self, key):
"""Get cached value"""
value = self.redis.get(key)
return json.loads(value) if value else None
Database Optimization¶
-- Optimize queries
CREATE INDEX CONCURRENTLY idx_orchestrations_composite
ON ralph.orchestrations(agent_type, status, created_at);
-- Partition large tables
CREATE TABLE ralph.orchestrations_2024 PARTITION OF ralph.orchestrations
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
-- Vacuum and analyze
VACUUM ANALYZE ralph.orchestrations;
Disaster Recovery¶
Recovery Plan¶
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 1 hour
Failover Procedure¶
#!/bin/bash
# failover.sh
# Check primary health
if ! curl -f https://ralph.example.com/health; then
echo "Primary unhealthy, initiating failover"
# Update DNS to point to standby
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch file://failover-dns.json
# Promote standby database
aws rds promote-read-replica \
--db-instance-identifier ralph-standby-db
# Scale up standby region
kubectl scale deployment ralph-orchestrator \
--replicas=5 \
--context=standby-cluster
# Notify team
aws sns publish \
--topic-arn arn:aws:sns:region:account:alerts \
--message "Failover initiated to standby region"
fi
Maintenance¶
Rolling Updates¶
# Zero-downtime deployment
kubectl set image deployment/ralph-orchestrator \
ralph=ghcr.io/mikeyobrien/ralph-orchestrator:v1.0.1 \
--record
# Monitor rollout
kubectl rollout status deployment/ralph-orchestrator
# Rollback if needed
kubectl rollout undo deployment/ralph-orchestrator
Health Checks¶
# health.py
from flask import Flask, jsonify
import psutil
import redis
app = Flask(__name__)
@app.route('/health')
def health():
"""Health check endpoint"""
checks = {
'database': check_database(),
'redis': check_redis(),
'disk': check_disk_space(),
'memory': check_memory()
}
status = 'healthy' if all(checks.values()) else 'unhealthy'
return jsonify({
'status': status,
'checks': checks
}), 200 if status == 'healthy' else 503
def check_database():
try:
db.session.execute('SELECT 1')
return True
except:
return False
def check_redis():
try:
r = redis.from_url(REDIS_URL)
r.ping()
return True
except:
return False
def check_disk_space():
usage = psutil.disk_usage('/')
return usage.percent < 90
def check_memory():
usage = psutil.virtual_memory()
return usage.percent < 90
Production Checklist¶
Pre-Deployment¶
- All tests passing
- Security scan completed
- Performance testing done
- Documentation updated
- Runbook created
- Rollback plan ready
Deployment¶
- Database migrations run
- Secrets configured
- Monitoring enabled
- Alerts configured
- Health checks passing
- Load balancer configured
Post-Deployment¶
- Smoke tests passed
- Performance metrics normal
- Error rates acceptable
- User acceptance testing
- Documentation published
- Team notified
Support and Maintenance¶
SLA Targets¶
- Uptime: 99.9% (43.8 minutes downtime/month)
- Response Time: < 500ms p95
- Error Rate: < 0.1%
- Recovery Time: < 4 hours
On-Call Procedures¶
- Alert received via PagerDuty
- Check runbook for known issues
- Assess severity and impact
- Implement fix or workaround
- Document incident
- Post-mortem if needed
Next Steps¶
- Monitoring Guide - Set up comprehensive monitoring
- Security Best Practices - Security hardening
- Troubleshooting Guide - Common issues and solutions