### 🧹 Chore Summary

Implement comprehensive chaos engineering tests for fault-tolerance validation: a `make chaos-test` target that systematically injects failures (network partitions, service crashes, resource exhaustion) and validates mcpgateway resilience across Docker Compose, Kubernetes/Helm, and Minikube environments.
### 🧱 Areas Affected

- Chaos testing infrastructure / Make targets (`make chaos-test`, `make chaos-local`, `make chaos-k8s`)
- Docker Compose multi-service setup for local chaos testing
- Kubernetes/Helm chaos testing with Chaos Mesh integration
- Minikube local Kubernetes testing environment
- Service resilience validation and failure recovery testing
- Network partition simulation and service discovery testing
- Database and Redis failure scenario testing
- Load balancing and failover validation
### ⚙️ Context / Rationale
Chaos engineering proactively discovers weaknesses in distributed systems by deliberately introducing controlled failures. Instead of waiting for production outages, chaos tests simulate real-world failure scenarios to validate that mcpgateway can handle service crashes, network partitions, database failures, and resource exhaustion while maintaining availability and data consistency.
**What is Chaos Engineering?**

Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. It involves systematically injecting failures to identify weaknesses before they cause outages.

**Key Chaos Testing Scenarios:**

- **Service Failures**: Container crashes, process kills, graceful/ungraceful shutdowns
- **Network Partitions**: Split-brain scenarios, intermittent connectivity, packet loss
- **Resource Exhaustion**: CPU/memory/disk pressure, connection pool exhaustion
- **Database Failures**: Primary/replica failures, connection timeouts, query failures
- **Dependency Failures**: Redis crashes, external API timeouts, DNS failures
**Simple Docker Compose Chaos Test:**

```yaml
# docker-compose.chaos.yml - Multi-service setup for chaos testing
version: '3.8'
services:
  mcpgateway-1:
    image: mcpgateway:latest
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
      - REDIS_URL=redis://redis:6379/0
    depends_on: [postgres, redis]

  mcpgateway-2:
    image: mcpgateway:latest
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
      - REDIS_URL=redis://redis:6379/0
    depends_on: [postgres, redis]

  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: password

  redis:
    image: redis:7

  load-balancer:
    image: nginx:alpine
    ports: ["4444:80"]
    volumes: ["./chaos/nginx.conf:/etc/nginx/nginx.conf"]
    depends_on: [mcpgateway-1, mcpgateway-2]

  chaos-controller:
    image: chaos-toolkit:latest
    volumes: ["./chaos:/chaos"]
    command: ["chaos", "run", "/chaos/experiments.json"]
```
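Even before wiring up a chaos framework, a first experiment against this stack can be driven by a few lines of Python: kill one gateway container and keep probing the load balancer. This is a sketch only, assuming the Docker SDK for Python, the service/port names above, and that the containers are reachable by the names shown (e.g. via `container_name:` as in the fuller compose file later in this issue):

```python
# Minimal chaos smoke test (sketch): kill mcpgateway-1, watch the LB on :4444.
import time

import docker    # pip install docker
import requests

client = docker.from_env()
target = client.containers.get("mcpgateway-1")   # name assumed from the compose file

target.kill()                                    # inject the failure
failures = 0
for _ in range(30):                              # observe for ~30s through the LB
    try:
        if requests.get("http://localhost:4444/health", timeout=2).status_code != 200:
            failures += 1
    except requests.RequestException:
        failures += 1
    time.sleep(1)

target.start()                                   # restore the instance
print(f"failed probes during chaos window: {failures}/30")
assert failures == 0, "load balancer did not mask the instance failure"
```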
**Advanced Kubernetes Chaos Testing:**

```yaml
# chaos/k8s-network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mcpgateway-network-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces: ["mcpgateway"]
    labelSelectors:
      app: mcpgateway
  direction: both
  duration: "30s"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: mcpgateway-pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["mcpgateway"]
    labelSelectors:
      app: mcpgateway
  duration: "10s"
```
**MCPGateway Specific Chaos Scenarios:**

```yaml
# Chaos test scenarios for mcpgateway distributed setup
chaos_scenarios:
  # Service resilience testing
  - name: "gateway_instance_failure"
    description: "Kill one mcpgateway instance while maintaining service"
    target: "mcpgateway-1"
    failure_type: "container_kill"
    duration: "30s"
    expected_behavior:
      - load_balancer_routes_to_healthy_instance
      - no_request_failures
      - automatic_service_recovery

  # Database resilience testing
  - name: "database_connection_failure"
    description: "Simulate PostgreSQL connection failures"
    target: "postgres"
    failure_type: "network_partition"
    duration: "60s"
    expected_behavior:
      - connection_pool_recovery
      - graceful_error_handling
      - automatic_reconnection

  # Redis caching failure
  - name: "redis_cache_failure"
    description: "Test behavior when Redis becomes unavailable"
    target: "redis"
    failure_type: "container_stop"
    duration: "45s"
    expected_behavior:
      - fallback_to_direct_database_access
      - no_cache_errors_propagated
      - cache_recovery_on_restart

  # Network partition between services
  - name: "service_network_partition"
    description: "Partition gateway instances from database"
    target: "mcpgateway-*"
    failure_type: "network_delay"
    parameters:
      latency: "1000ms"
      jitter: "500ms"
    duration: "120s"
    expected_behavior:
      - request_timeout_handling
      - circuit_breaker_activation
      - health_check_failures

  # Resource exhaustion
  - name: "memory_pressure_test"
    description: "Exhaust memory resources on gateway instances"
    target: "mcpgateway-1"
    failure_type: "memory_stress"
    parameters:
      memory_percentage: 90
    duration: "60s"
    expected_behavior:
      - graceful_degradation
      - oom_killer_protection
      - load_balancer_removes_unhealthy_instance

  # Load balancer failure
  - name: "load_balancer_failure"
    description: "Test direct instance access when LB fails"
    target: "load-balancer"
    failure_type: "container_stop"
    duration: "30s"
    expected_behavior:
      - direct_instance_access_possible
      - service_discovery_fallback
      - health_monitoring_continues
```
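The `expected_behavior` entries above are only labels; the chaos runner has to map each label to a concrete check against the metrics it collects. One possible mapping, shown as a sketch (the metric keys follow the runner's `_collect_metrics()` output later in this issue, and the lambdas are illustrative):

```python
# Sketch: map expected_behavior labels from the scenario YAML to metric checks.
from typing import Callable, Dict, List

BEHAVIOR_CHECKS: Dict[str, Callable[[dict], bool]] = {
    "load_balancer_routes_to_healthy_instance": lambda m: m["load_balancer"].get("accessible", False),
    "no_request_failures": lambda m: all(r.get("status_code") == 200 for r in m["response_times"]),
    "automatic_service_recovery": lambda m: all(s.get("status") == "running" for s in m["services"].values()),
}

def validate_expected_behavior(expected: List[str], metrics: dict) -> List[str]:
    """Return the expected behaviors that were NOT observed in this metrics sample."""
    failed = []
    for behavior in expected:
        check = BEHAVIOR_CHECKS.get(behavior)
        if check is None or not check(metrics):
            failed.append(behavior)
    return failed
```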
### 📦 Related Make Targets

| Target | Purpose |
|---|---|
| **`make chaos-test`** | Run complete chaos engineering test suite across all environments |
| **`make chaos-local`** | Run Docker Compose-based chaos tests locally |
| **`make chaos-k8s`** | Run Kubernetes chaos tests with Chaos Mesh |
| **`make chaos-minikube`** | Run chaos tests in a local Minikube environment |
| `make chaos-network` | Test network partition and connectivity failure scenarios |
| `make chaos-services` | Test service failure and recovery scenarios |
| `make chaos-resources` | Test resource exhaustion and pressure scenarios |
| `make chaos-setup` | Set up chaos testing infrastructure and dependencies |
| `make chaos-report` | Generate chaos test reports and resilience analysis |
| `make chaos-clean` | Clean chaos test environments and artifacts |

Bold targets are mandatory; CI must fail if critical resilience requirements are not met.
### ✅ Acceptance Criteria

- `make chaos-test` validates system resilience across all failure scenarios and environments.
- `make chaos-local` successfully tests multi-container failures using Docker Compose.
- `make chaos-k8s` executes Kubernetes-native chaos tests with Chaos Mesh integration.
- `make chaos-minikube` runs the complete chaos test suite in a local Kubernetes environment.
- Service failure scenarios validate automatic recovery and failover capabilities.
- Network partition tests ensure split-brain scenario handling and service discovery.
- Database failure tests validate connection pool recovery and graceful degradation.
- Resource exhaustion tests confirm system stability under pressure.
- Load balancing tests ensure traffic routing during instance failures.
- All chaos tests include automated validation of expected resilience behaviors.
- Changelog entry under "Testing" or "Reliability".
### 🛠️ Task List (suggested flow)
- **Chaos testing infrastructure setup**

```bash
mkdir -p chaos/{experiments,k8s,reports,data}

# Create chaos testing configuration
cat > chaos/config.yaml << 'EOF'
environments:
  local:
    type: "docker-compose"
    compose_file: "docker-compose.chaos.yml"
    services: ["mcpgateway-1", "mcpgateway-2", "postgres", "redis", "load-balancer"]
  minikube:
    type: "kubernetes"
    namespace: "mcpgateway-chaos"
    chaos_mesh_enabled: true
  k8s:
    type: "kubernetes"
    namespace: "mcpgateway"
    chaos_mesh_enabled: true

chaos_scenarios:
  service_failures:
    - gateway_instance_kill
    - database_connection_failure
    - redis_cache_failure
  network_issues:
    - network_partition
    - network_delay
    - packet_loss
  resource_pressure:
    - memory_exhaustion
    - cpu_pressure
    - disk_pressure

validation_checks:
  - service_availability
  - data_consistency
  - recovery_time
  - error_handling
EOF
```
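A runner can then load this file and pick the scenario groups and validation checks for the chosen environment; a minimal sketch (the helper name is hypothetical):

```python
# Sketch: load chaos/config.yaml and resolve an environment plus its scenarios.
from pathlib import Path

import yaml   # pip install pyyaml

def load_chaos_config(path: str = "chaos/config.yaml") -> dict:
    return yaml.safe_load(Path(path).read_text())

config = load_chaos_config()
env = config["environments"]["local"]     # "local", "minikube" or "k8s"
scenarios = config["chaos_scenarios"]     # service_failures / network_issues / resource_pressure
checks = config["validation_checks"]
print(env["type"], list(scenarios), checks)
```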
- **Docker Compose chaos testing setup**

```yaml
# docker-compose.chaos.yml - Multi-service chaos testing environment
version: '3.8'

services:
  # Multiple gateway instances for failover testing
  mcpgateway-1:
    build: .
    container_name: mcpgateway-1
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
      - REDIS_URL=redis://redis:6379/0
      - PORT=8000
      - INSTANCE_ID=gateway-1
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks: [mcpgateway-net]

  mcpgateway-2:
    build: .
    container_name: mcpgateway-2
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
      - REDIS_URL=redis://redis:6379/0
      - PORT=8000
      - INSTANCE_ID=gateway-2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks: [mcpgateway-net]

  # Database with health checks
  postgres:
    image: postgres:15
    container_name: postgres-chaos
    environment:
      POSTGRES_PASSWORD: password
      POSTGRES_DB: mcpgateway
      POSTGRES_USER: postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 3
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks: [mcpgateway-net]

  # Redis cache
  redis:
    image: redis:7-alpine
    container_name: redis-chaos
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks: [mcpgateway-net]

  # Load balancer for multi-instance testing
  load-balancer:
    image: nginx:alpine
    container_name: nginx-lb
    ports:
      - "4444:80"
    volumes:
      - "./chaos/nginx.conf:/etc/nginx/nginx.conf:ro"
    depends_on:
      - mcpgateway-1
      - mcpgateway-2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks: [mcpgateway-net]

  # Chaos testing controller
  chaos-monkey:
    image: python:3.12-alpine
    container_name: chaos-controller
    volumes:
      - "./chaos:/chaos"
      - "/var/run/docker.sock:/var/run/docker.sock"
    working_dir: /chaos
    command: ["python", "chaos_runner.py"]
    depends_on:
      - mcpgateway-1
      - mcpgateway-2
      - postgres
      - redis
    networks: [mcpgateway-net]

volumes:
  postgres_data:

networks:
  mcpgateway-net:
    driver: bridge
```

```nginx
# chaos/nginx.conf - Load balancer configuration
events {
    worker_connections 1024;
}

http {
    upstream mcpgateway {
        server mcpgateway-1:8000 max_fails=2 fail_timeout=30s;
        server mcpgateway-2:8000 max_fails=2 fail_timeout=30s;
    }

    server {
        listen 80;

        location /health {
            access_log off;
            proxy_pass http://mcpgateway/health;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        location / {
            proxy_pass http://mcpgateway;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_connect_timeout 5s;
            proxy_send_timeout 10s;
            proxy_read_timeout 10s;
        }
    }
}
```
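With `max_fails=2` and `fail_timeout=30s`, nginx should eject a dead backend after two failed attempts; a quick availability probe to confirm that behavior (sketch, assuming the stack above is running and mapped to port 4444):

```python
# Sketch: measure availability through the nginx load balancer.
import requests

def availability(samples: int = 50) -> float:
    ok = 0
    for _ in range(samples):
        try:
            if requests.get("http://localhost:4444/health", timeout=2).status_code == 200:
                ok += 1
        except requests.RequestException:
            pass
    return ok / samples

# Run once with both gateways up, then again after `docker stop mcpgateway-1`;
# the second figure should stay above the 95% availability threshold used below.
print(f"availability: {availability():.2%}")
```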
- **Makefile integration**

```makefile
# Chaos Engineering Testing Targets
.PHONY: chaos-test chaos-local chaos-k8s chaos-minikube chaos-network chaos-services chaos-resources chaos-setup chaos-clean

CHAOS_DIR := chaos
CHAOS_REPORTS := $(CHAOS_DIR)/reports
COMPOSE_CHAOS := docker-compose.chaos.yml
MINIKUBE_PROFILE := mcpgateway-chaos

chaos-test: chaos-local chaos-minikube
	@echo "🔥 Running complete chaos engineering test suite..."
	@python $(CHAOS_DIR)/generate_report.py \
		--local-results $(CHAOS_REPORTS)/local \
		--k8s-results $(CHAOS_REPORTS)/minikube \
		--output $(CHAOS_REPORTS)/chaos-summary.html

chaos-setup:
	@echo "🔧 Setting up chaos testing infrastructure..."
	@pip install docker chaos-toolkit requests pytest
	@docker-compose -f $(COMPOSE_CHAOS) pull
	@minikube profile $(MINIKUBE_PROFILE) || minikube start -p $(MINIKUBE_PROFILE)
	@kubectl config use-context $(MINIKUBE_PROFILE)
	@curl -sSL https://mirrors.chaos-mesh.org/v2.6.0/install.sh | bash

chaos-local:
	@echo "🐳 Running Docker Compose chaos tests..."
	@mkdir -p $(CHAOS_REPORTS)/local
	@docker-compose -f $(COMPOSE_CHAOS) up -d
	@sleep 30  # Wait for services to be ready
	@python $(CHAOS_DIR)/chaos_runner.py \
		--environment local \
		--config $(CHAOS_DIR)/config.yaml \
		--output $(CHAOS_REPORTS)/local
	@docker-compose -f $(COMPOSE_CHAOS) down

chaos-minikube:
	@echo "☸️ Running Minikube chaos tests..."
	@mkdir -p $(CHAOS_REPORTS)/minikube
	@minikube profile $(MINIKUBE_PROFILE)
	@helm upgrade --install mcpgateway charts/mcp-stack \
		--namespace mcpgateway-chaos \
		--create-namespace \
		--set replicaCount=2 \
		--wait
	@kubectl apply -f $(CHAOS_DIR)/k8s/ -n mcpgateway-chaos
	@python $(CHAOS_DIR)/k8s_chaos_runner.py \
		--namespace mcpgateway-chaos \
		--output $(CHAOS_REPORTS)/minikube

chaos-k8s:
	@echo "☸️ Running Kubernetes chaos tests..."
	@mkdir -p $(CHAOS_REPORTS)/k8s
	@kubectl apply -f $(CHAOS_DIR)/k8s/ -n mcpgateway
	@python $(CHAOS_DIR)/k8s_chaos_runner.py \
		--namespace mcpgateway \
		--output $(CHAOS_REPORTS)/k8s

chaos-network:
	@echo "🌐 Testing network partition scenarios..."
	@python $(CHAOS_DIR)/network_chaos.py \
		--scenarios partition,delay,loss \
		--duration 60 \
		--validate-recovery

chaos-services:
	@echo "⚡ Testing service failure scenarios..."
	@python $(CHAOS_DIR)/service_chaos.py \
		--services mcpgateway,postgres,redis \
		--failure-types kill,stop,restart \
		--validate-failover

chaos-resources:
	@echo "💾 Testing resource exhaustion scenarios..."
	@python $(CHAOS_DIR)/resource_chaos.py \
		--resources cpu,memory,disk \
		--pressure-levels 80,90,95 \
		--duration 120

chaos-clean:
	@echo "🧹 Cleaning chaos test environments..."
	@docker-compose -f $(COMPOSE_CHAOS) down -v
	@minikube delete -p $(MINIKUBE_PROFILE) || true
	@rm -rf $(CHAOS_REPORTS)/*
```
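The `chaos-network`, `chaos-services`, and `chaos-resources` targets call helper scripts (`network_chaos.py`, `service_chaos.py`, `resource_chaos.py`) that are not spelled out in this issue. The CLI surface implied by the Makefile would look roughly like this sketch of `chaos/service_chaos.py` (flags taken from the Makefile, body hypothetical):

```python
# Sketch: chaos/service_chaos.py - CLI skeleton matching the Makefile invocation.
import argparse

def main() -> int:
    parser = argparse.ArgumentParser(description="Service failure chaos scenarios")
    parser.add_argument("--services", required=True,
                        help="comma-separated services, e.g. mcpgateway,postgres,redis")
    parser.add_argument("--failure-types", required=True,
                        help="comma-separated failure types, e.g. kill,stop,restart")
    parser.add_argument("--validate-failover", action="store_true",
                        help="assert that traffic keeps flowing during each failure")
    args = parser.parse_args()

    for service in args.services.split(","):
        for failure in args.failure_types.split(","):
            # inject the failure, then validate failover if requested
            print(f"injecting {failure} on {service}")
            # ... docker/kubectl actions and validation would go here ...
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```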
- **Chaos test runner (Docker Compose)**
# chaos/chaos_runner.py #!/usr/bin/env python3 """ Docker Compose based chaos testing runner. """ import time import docker import requests import json import argparse from typing import Dict, List, Any from pathlib import Path class DockerChaosRunner: """Run chaos experiments in Docker Compose environment.""" def __init__(self, config_path: str, output_dir: str): self.config = self._load_config(config_path) self.output_dir = Path(output_dir) self.output_dir.mkdir(parents=True, exist_ok=True) self.docker_client = docker.from_env() self.results = [] def run_all_experiments(self) -> List[Dict[str, Any]]: """Run all configured chaos experiments.""" print("π₯ Starting chaos experiments...") # Validate initial system health if not self._validate_system_health(): raise Exception("System not healthy before chaos testing") # Run service failure experiments for experiment in self.config['chaos_scenarios']['service_failures']: result = self._run_service_failure_experiment(experiment) self.results.append(result) # Run network chaos experiments for experiment in self.config['chaos_scenarios']['network_issues']: result = self._run_network_experiment(experiment) self.results.append(result) # Run resource pressure experiments for experiment in self.config['chaos_scenarios']['resource_pressure']: result = self._run_resource_experiment(experiment) self.results.append(result) # Save results self._save_results() return self.results def _run_service_failure_experiment(self, experiment_name: str) -> Dict[str, Any]: """Run a service failure experiment.""" print(f" π― Running service failure: {experiment_name}") experiment_config = { 'gateway_instance_kill': { 'target_container': 'mcpgateway-1', 'action': 'kill', 'duration': 30, 'expected_behavior': ['load_balancer_failover', 'no_request_failures'] }, 'database_connection_failure': { 'target_container': 'postgres-chaos', 'action': 'pause', 'duration': 60, 'expected_behavior': ['connection_pool_recovery', 'graceful_errors'] }, 'redis_cache_failure': { 'target_container': 'redis-chaos', 'action': 'stop', 'duration': 45, 'expected_behavior': ['cache_fallback', 'no_error_propagation'] } } config = experiment_config[experiment_name] result = { 'experiment': experiment_name, 'start_time': time.time(), 'success': False, 'observations': [], 'metrics': {} } try: # Record baseline metrics baseline_metrics = self._collect_metrics() # Inject failure container = self.docker_client.containers.get(config['target_container']) if config['action'] == 'kill': container.kill() elif config['action'] == 'pause': container.pause() elif config['action'] == 'stop': container.stop() # Monitor system during failure failure_metrics = [] for i in range(config['duration']): time.sleep(1) metrics = self._collect_metrics() failure_metrics.append(metrics) # Check if system is behaving as expected self._validate_expected_behavior(config['expected_behavior'], metrics) # Restore service if config['action'] == 'kill': # Container restart handled by compose restart policy pass elif config['action'] == 'pause': container.unpause() elif config['action'] == 'stop': container.start() # Wait for recovery and validate time.sleep(30) recovery_metrics = self._collect_metrics() # Validate recovery if self._validate_system_health(): result['success'] = True result['observations'].append("System recovered successfully") else: result['observations'].append("System failed to recover properly") result['metrics'] = { 'baseline': baseline_metrics, 'during_failure': failure_metrics, 'after_recovery': 
recovery_metrics } except Exception as e: result['observations'].append(f"Experiment failed: {str(e)}") result['end_time'] = time.time() result['duration'] = result['end_time'] - result['start_time'] return result def _collect_metrics(self) -> Dict[str, Any]: """Collect system metrics during chaos experiments.""" metrics = { 'timestamp': time.time(), 'services': {}, 'load_balancer': {}, 'response_times': [] } # Check service health services = ['mcpgateway-1', 'mcpgateway-2', 'postgres-chaos', 'redis-chaos'] for service in services: try: container = self.docker_client.containers.get(service) metrics['services'][service] = { 'status': container.status, 'health': self._get_container_health(container) } except Exception as e: metrics['services'][service] = {'status': 'not_found', 'error': str(e)} # Test load balancer response try: start_time = time.time() response = requests.get('http://localhost:4444/health', timeout=5) response_time = time.time() - start_time metrics['load_balancer'] = { 'status_code': response.status_code, 'response_time': response_time, 'accessible': response.status_code == 200 } except Exception as e: metrics['load_balancer'] = { 'accessible': False, 'error': str(e) } # Test API endpoints test_endpoints = ['/health', '/tools/', '/servers/'] for endpoint in test_endpoints: try: start_time = time.time() response = requests.get(f'http://localhost:4444{endpoint}', timeout=5) response_time = time.time() - start_time metrics['response_times'].append({ 'endpoint': endpoint, 'response_time': response_time, 'status_code': response.status_code }) except Exception as e: metrics['response_times'].append({ 'endpoint': endpoint, 'error': str(e), 'status_code': 0 }) return metrics
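The runner above calls several helpers (`_load_config`, `_get_container_health`, `_validate_system_health`, `_save_results`) that are not included in the listing. A minimal sketch of what they could look like, shown as standalone functions (they would live on `DockerChaosRunner`); the load-balancer URL and file layout are assumptions taken from `docker-compose.chaos.yml` and `chaos/config.yaml`:

```python
# Sketch: helpers referenced by DockerChaosRunner but not shown above.
import json
import time
from pathlib import Path

import requests
import yaml

def load_config(config_path: str) -> dict:
    """_load_config: parse chaos/config.yaml."""
    return yaml.safe_load(Path(config_path).read_text())

def get_container_health(container) -> str:
    """_get_container_health: read the Docker healthcheck status when present."""
    container.reload()
    return container.attrs.get("State", {}).get("Health", {}).get("Status", "unknown")

def validate_system_health(lb_url: str = "http://localhost:4444") -> bool:
    """_validate_system_health: healthy when the load balancer answers /health."""
    try:
        return requests.get(f"{lb_url}/health", timeout=5).status_code == 200
    except requests.RequestException:
        return False

def save_results(results: list, output_dir: Path) -> Path:
    """_save_results: persist experiment results as JSON for the report step."""
    out = output_dir / f"chaos-results-{int(time.time())}.json"
    out.write_text(json.dumps(results, indent=2, default=str))
    return out
```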
- **Kubernetes chaos testing**

```yaml
# chaos/k8s/network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mcpgateway-network-partition
  namespace: mcpgateway-chaos
spec:
  action: partition
  mode: all
  selector:
    namespaces: ["mcpgateway-chaos"]
    labelSelectors:
      app.kubernetes.io/name: mcpgateway
  direction: both
  duration: "30s"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mcpgateway-network-delay
  namespace: mcpgateway-chaos
spec:
  action: delay
  mode: one
  selector:
    namespaces: ["mcpgateway-chaos"]
    labelSelectors:
      app.kubernetes.io/name: mcpgateway
  delay:
    latency: "1000ms"
    correlation: "100"
    jitter: "500ms"
  duration: "60s"
```

```yaml
# chaos/k8s/pod-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: mcpgateway-pod-kill
  namespace: mcpgateway-chaos
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["mcpgateway-chaos"]
    labelSelectors:
      app.kubernetes.io/name: mcpgateway
  gracePeriod: 0
---
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: postgres-pod-failure
  namespace: mcpgateway-chaos
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces: ["mcpgateway-chaos"]
    labelSelectors:
      app.kubernetes.io/name: postgresql
  duration: "60s"
```

```yaml
# chaos/k8s/stress-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: mcpgateway-memory-stress
  namespace: mcpgateway-chaos
spec:
  mode: one
  selector:
    namespaces: ["mcpgateway-chaos"]
    labelSelectors:
      app.kubernetes.io/name: mcpgateway
  duration: "120s"
  stressors:
    memory:
      workers: 1
      size: "80%"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: mcpgateway-cpu-stress
  namespace: mcpgateway-chaos
spec:
  mode: one
  selector:
    namespaces: ["mcpgateway-chaos"]
    labelSelectors:
      app.kubernetes.io/name: mcpgateway
  duration: "90s"
  stressors:
    cpu:
      workers: 2
      load: 90
```
- **Kubernetes chaos runner**
# chaos/k8s_chaos_runner.py #!/usr/bin/env python3 """ Kubernetes chaos testing with Chaos Mesh integration. """ import time import yaml import subprocess import requests import argparse from typing import Dict, List, Any from pathlib import Path from kubernetes import client, config class KubernetesChaosRunner: """Run chaos experiments in Kubernetes using Chaos Mesh.""" def __init__(self, namespace: str, output_dir: str): self.namespace = namespace self.output_dir = Path(output_dir) self.output_dir.mkdir(parents=True, exist_ok=True) # Load Kubernetes config try: config.load_incluster_config() except: config.load_kube_config() self.k8s_client = client.ApiClient() self.v1 = client.CoreV1Api() self.apps_v1 = client.AppsV1Api() self.results = [] def run_all_chaos_experiments(self) -> List[Dict[str, Any]]: """Run all Kubernetes chaos experiments.""" print(f"βΈοΈ Starting Kubernetes chaos experiments in namespace: {self.namespace}") # Validate initial cluster state if not self._validate_cluster_health(): raise Exception("Cluster not healthy before chaos testing") # Get chaos experiment files chaos_files = list(Path("chaos/k8s").glob("*.yaml")) for chaos_file in chaos_files: result = self._run_chaos_experiment(chaos_file) self.results.append(result) # Wait between experiments time.sleep(30) # Save results self._save_results() return self.results def _run_chaos_experiment(self, chaos_file: Path) -> Dict[str, Any]: """Run a single chaos experiment.""" print(f" π― Running chaos experiment: {chaos_file.name}") result = { 'experiment_file': str(chaos_file), 'start_time': time.time(), 'success': False, 'observations': [], 'metrics': {} } try: # Load chaos experiment with open(chaos_file) as f: chaos_docs = list(yaml.safe_load_all(f)) # Record baseline metrics baseline_metrics = self._collect_k8s_metrics() # Apply chaos experiment for doc in chaos_docs: if doc: self._apply_chaos_manifest(doc) # Monitor during chaos chaos_duration = self._extract_duration(chaos_docs[0]) monitoring_metrics = [] for i in range(chaos_duration + 30): # Duration + recovery time time.sleep(1) metrics = self._collect_k8s_metrics() monitoring_metrics.append(metrics) # Log significant events if i % 10 == 0: print(f" Monitoring... 
{i}s elapsed") # Clean up chaos experiment for doc in chaos_docs: if doc: self._delete_chaos_manifest(doc) # Wait for recovery time.sleep(60) recovery_metrics = self._collect_k8s_metrics() # Validate recovery if self._validate_cluster_health(): result['success'] = True result['observations'].append("Cluster recovered successfully") else: result['observations'].append("Cluster failed to recover properly") result['metrics'] = { 'baseline': baseline_metrics, 'during_chaos': monitoring_metrics, 'after_recovery': recovery_metrics } except Exception as e: result['observations'].append(f"Chaos experiment failed: {str(e)}") result['end_time'] = time.time() result['duration'] = result['end_time'] - result['start_time'] return result def _collect_k8s_metrics(self) -> Dict[str, Any]: """Collect Kubernetes cluster metrics.""" metrics = { 'timestamp': time.time(), 'pods': {}, 'services': {}, 'endpoints': {}, 'nodes': {} } try: # Pod status pods = self.v1.list_namespaced_pod(namespace=self.namespace) for pod in pods.items: metrics['pods'][pod.metadata.name] = { 'phase': pod.status.phase, 'ready': self._is_pod_ready(pod), 'restart_count': sum( container.restart_count or 0 for container in pod.status.container_statuses or [] ) } # Service endpoints services = self.v1.list_namespaced_service(namespace=self.namespace) for service in services.items: endpoints = self.v1.read_namespaced_endpoints( name=service.metadata.name, namespace=self.namespace ) metrics['services'][service.metadata.name] = { 'type': service.spec.type, 'endpoints_ready': sum( len(subset.addresses or []) for subset in endpoints.subsets or [] ) } # Test service connectivity service_url = self._get_service_url() if service_url: try: response = requests.get(f"{service_url}/health", timeout=5) metrics['service_connectivity'] = { 'accessible': True, 'status_code': response.status_code, 'response_time': response.elapsed.total_seconds() } except Exception as e: metrics['service_connectivity'] = { 'accessible': False, 'error': str(e) } except Exception as e: metrics['collection_error'] = str(e) return metrics
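The k8s runner references `_apply_chaos_manifest`, `_delete_chaos_manifest`, and `_extract_duration`, which are not shown. A sketch of those helpers using the official kubernetes Python client; the CRD plural is derived from the kind (NetworkChaos -> networkchaos, PodChaos -> podchaos, StressChaos -> stresschaos), and the functions are shown standalone although they would be methods of `KubernetesChaosRunner`:

```python
# Sketch: Chaos Mesh helpers referenced by KubernetesChaosRunner above.
from kubernetes import client

CHAOS_GROUP = "chaos-mesh.org"
CHAOS_VERSION = "v1alpha1"

def apply_chaos_manifest(manifest: dict, namespace: str) -> None:
    """_apply_chaos_manifest: create the chaos custom resource."""
    client.CustomObjectsApi().create_namespaced_custom_object(
        group=CHAOS_GROUP,
        version=CHAOS_VERSION,
        namespace=namespace,
        plural=manifest["kind"].lower(),
        body=manifest,
    )

def delete_chaos_manifest(manifest: dict, namespace: str) -> None:
    """_delete_chaos_manifest: remove the chaos custom resource after the run."""
    client.CustomObjectsApi().delete_namespaced_custom_object(
        group=CHAOS_GROUP,
        version=CHAOS_VERSION,
        namespace=namespace,
        plural=manifest["kind"].lower(),
        name=manifest["metadata"]["name"],
    )

def extract_duration(manifest: dict) -> int:
    """_extract_duration: parse '30s' / '2m' style durations, defaulting to 60s."""
    raw = str(manifest.get("spec", {}).get("duration", "60s"))
    return int(raw[:-1]) * (60 if raw.endswith("m") else 1)
```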
- **Chaos test validation**
# chaos/validation.py """ Validation logic for chaos engineering tests. """ import time import requests from typing import Dict, List, Any, Optional class ChaosTestValidator: """Validate system behavior during and after chaos experiments.""" def __init__(self, base_url: str = "http://localhost:4444"): self.base_url = base_url self.validation_results = [] def validate_service_resilience(self, experiment_type: str, during_chaos_metrics: List[Dict], recovery_metrics: Dict) -> Dict[str, Any]: """Validate service resilience during chaos experiments.""" validation = { 'experiment_type': experiment_type, 'passed': True, 'failures': [], 'warnings': [], 'recovery_time': None } if experiment_type == 'gateway_instance_kill': validation.update(self._validate_instance_failover( during_chaos_metrics, recovery_metrics )) elif experiment_type == 'database_connection_failure': validation.update(self._validate_database_resilience( during_chaos_metrics, recovery_metrics )) elif experiment_type == 'redis_cache_failure': validation.update(self._validate_cache_resilience( during_chaos_metrics, recovery_metrics )) elif experiment_type == 'network_partition': validation.update(self._validate_network_resilience( during_chaos_metrics, recovery_metrics )) return validation def _validate_instance_failover(self, chaos_metrics: List[Dict], recovery_metrics: Dict) -> Dict[str, Any]: """Validate gateway instance failover behavior.""" validation = {'passed': True, 'failures': [], 'warnings': []} # Check load balancer kept routing requests successful_requests = 0 total_requests = 0 for metrics in chaos_metrics: lb_metrics = metrics.get('load_balancer', {}) if lb_metrics.get('accessible', False): successful_requests += 1 total_requests += 1 success_rate = successful_requests / total_requests if total_requests > 0 else 0 if success_rate < 0.95: # 95% availability threshold validation['failures'].append( f"Service availability dropped to {success_rate:.2%} during instance failure" ) validation['passed'] = False else: validation['warnings'].append( f"Service maintained {success_rate:.2%} availability during failover" ) # Check recovery time recovery_time = self._calculate_recovery_time(chaos_metrics) if recovery_time > 60: # 60 second max recovery time validation['failures'].append( f"Recovery took {recovery_time}s, exceeds 60s threshold" ) validation['passed'] = False validation['recovery_time'] = recovery_time return validation def _validate_database_resilience(self, chaos_metrics: List[Dict], recovery_metrics: Dict) -> Dict[str, Any]: """Validate database connection resilience.""" validation = {'passed': True, 'failures': [], 'warnings': []} # Check for graceful error handling error_responses = [] for metrics in chaos_metrics: for response in metrics.get('response_times', []): if response.get('status_code', 0) >= 500: error_responses.append(response) # Database failures should result in 503 (service unavailable) not 500 internal_errors = [r for r in error_responses if r.get('status_code') == 500] if internal_errors: validation['failures'].append( f"Found {len(internal_errors)} internal server errors during DB failure" ) validation['passed'] = False # Check connection pool recovery if not recovery_metrics.get('services', {}).get('postgres-chaos', {}).get('health'): validation['failures'].append("Database service not healthy after recovery") validation['passed'] = False return validation def _validate_cache_resilience(self, chaos_metrics: List[Dict], recovery_metrics: Dict) -> Dict[str, Any]: """Validate Redis cache failure 
resilience.""" validation = {'passed': True, 'failures': [], 'warnings': []} # Cache failures should not impact service availability for metrics in chaos_metrics: if not metrics.get('load_balancer', {}).get('accessible', False): validation['failures'].append( "Service became unavailable during cache failure" ) validation['passed'] = False break # Performance may degrade but should remain functional response_times = [] for metrics in chaos_metrics: for response in metrics.get('response_times', []): if response.get('status_code') == 200: response_times.append(response.get('response_time', 0)) if response_times: avg_response_time = sum(response_times) / len(response_times) if avg_response_time > 5.0: # 5 second threshold validation['warnings'].append( f"Response time degraded to {avg_response_time:.2f}s during cache failure" ) return validation
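`_calculate_recovery_time` is referenced above but not defined; one way to derive it from the per-second metrics samples, sketched under the assumption that the runner collects one sample per second and records the load balancer's `accessible` flag:

```python
# Sketch: derive recovery time from the per-second metrics collected during chaos.
from typing import Dict, List

def calculate_recovery_time(chaos_metrics: List[Dict]) -> float:
    """Seconds from the first failed probe until probes stay healthy again."""
    first_failure = None
    last_failure = None
    for i, sample in enumerate(chaos_metrics):
        accessible = sample.get("load_balancer", {}).get("accessible", False)
        if not accessible:
            if first_failure is None:
                first_failure = i
            last_failure = i
    if first_failure is None:
        return 0.0                      # the service never became unavailable
    return float(last_failure - first_failure + 1)
```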
- **CI integration**

```yaml
# Add to existing GitHub Actions workflow
chaos-engineering:
  name: 🔥 Chaos Engineering Tests
  runs-on: ubuntu-latest
  needs: [test, migration-testing]
  if: github.ref == 'refs/heads/main'  # Only run on main branch
  strategy:
    matrix:
      environment: ["docker-compose", "minikube"]
  steps:
    - name: ⬇️ Checkout source
      uses: actions/checkout@v4
      with:
        fetch-depth: 1

    - name: 🐍 Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: "3.12"
        cache: pip

    - name: 🔧 Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -e .[dev]
        pip install docker chaos-toolkit kubernetes

    - name: 🐳 Set up Docker Compose environment
      if: matrix.environment == 'docker-compose'
      run: |
        docker-compose -f docker-compose.chaos.yml build
        make chaos-setup

    - name: ☸️ Set up Minikube environment
      if: matrix.environment == 'minikube'
      run: |
        # Install minikube
        curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
        sudo install minikube-linux-amd64 /usr/local/bin/minikube
        # Install helm
        curl https://get.helm.sh/helm-v3.12.0-linux-amd64.tar.gz | tar xz
        sudo mv linux-amd64/helm /usr/local/bin/
        # Start minikube
        minikube start --driver=docker --cpus=2 --memory=4096
        # Install Chaos Mesh
        curl -sSL https://mirrors.chaos-mesh.org/v2.6.0/install.sh | bash

    - name: 🔥 Run chaos tests - ${{ matrix.environment }}
      run: |
        case "${{ matrix.environment }}" in
          "docker-compose") make chaos-local ;;
          "minikube") make chaos-minikube ;;
        esac

    - name: 📊 Generate chaos test report
      run: |
        make chaos-report

    - name: 📤 Upload chaos test results
      uses: actions/upload-artifact@v4
      with:
        name: chaos-test-results-${{ matrix.environment }}
        path: |
          chaos/reports/
        retention-days: 30

    - name: 🚨 Validate resilience requirements
      run: |
        python chaos/validate_resilience.py \
          --results chaos/reports/${{ matrix.environment }} \
          --fail-on-critical

    - name: 🧹 Cleanup
      if: always()
      run: |
        make chaos-clean
        if [ "${{ matrix.environment }}" == "minikube" ]; then
          minikube delete
        fi
```
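The "Validate resilience requirements" step calls `chaos/validate_resilience.py`, which is not spelled out elsewhere in this issue; a minimal sketch of that CI gate (file name and flags taken from the workflow, internals hypothetical, assuming the runners write one JSON list of experiment results per file under `chaos/reports/<environment>/`):

```python
# Sketch: chaos/validate_resilience.py - fail CI when critical resilience checks fail.
import argparse
import json
import sys
from pathlib import Path

def main() -> int:
    parser = argparse.ArgumentParser(description="Gate CI on chaos test results")
    parser.add_argument("--results", required=True, help="directory with chaos result JSON files")
    parser.add_argument("--fail-on-critical", action="store_true")
    args = parser.parse_args()

    failed = []
    for result_file in Path(args.results).glob("**/*.json"):
        for experiment in json.loads(result_file.read_text()):
            if not experiment.get("success", False):
                failed.append(experiment.get("experiment", result_file.name))

    if failed:
        print(f"Resilience requirements not met: {', '.join(failed)}")
        return 1 if args.fail_on_critical else 0
    print("All resilience requirements met")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```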
- **Monitoring and reporting**
# chaos/generate_report.py #!/usr/bin/env python3 """ Generate comprehensive chaos engineering test reports. """ import json import plotly.graph_objects as go import plotly.express as px from plotly.subplots import make_subplots import pandas as pd from pathlib import Path class ChaosTestReporter: """Generate visual reports for chaos engineering tests.""" def generate_html_report(self, results_dir: Path, output_file: Path): """Generate comprehensive HTML chaos test report.""" # Load all test results all_results = self._load_all_results(results_dir) # Create visualizations fig = make_subplots( rows=3, cols=2, subplot_titles=[ 'Service Availability During Chaos', 'Recovery Time by Experiment', 'Response Time Impact', 'Error Rate Analysis', 'Resource Usage During Chaos', 'Resilience Score by Component' ], specs=[ [{"secondary_y": True}, {"type": "bar"}], [{"type": "scatter"}, {"type": "bar"}], [{"type": "heatmap"}, {"type": "indicator"}] ] ) # Service availability timeline self._add_availability_chart(fig, all_results, row=1, col=1) # Recovery time comparison self._add_recovery_time_chart(fig, all_results, row=1, col=2) # Response time impact self._add_response_time_chart(fig, all_results, row=2, col=1) # Generate HTML report html_template = self._get_html_template() html_content = html_template.format( charts=fig.to_html(include_plotlyjs='cdn'), summary_table=self._generate_summary_table(all_results), recommendations=self._generate_recommendations(all_results), timestamp=pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S') ) output_file.write_text(html_content) print(f"π Chaos test report generated: {output_file}")
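`_load_all_results` and the summary-table helpers are referenced above but not shown; loading could look like this sketch, assuming one JSON result file per experiment under `chaos/reports/<environment>/`:

```python
# Sketch: aggregate per-environment chaos results for the HTML report.
import json
from pathlib import Path
from typing import Dict, List

def load_all_results(results_dir: Path) -> Dict[str, List[dict]]:
    """Map environment name -> list of experiment result dicts."""
    results: Dict[str, List[dict]] = {}
    for env_dir in sorted(p for p in results_dir.iterdir() if p.is_dir()):
        results[env_dir.name] = [
            json.loads(f.read_text()) for f in sorted(env_dir.glob("*.json"))
        ]
    return results
```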
- **Documentation**

Add chaos engineering documentation:

````markdown
# Chaos Engineering Testing

## Overview

Chaos engineering tests validate mcpgateway's resilience by deliberately introducing failures
and verifying the system's ability to handle them gracefully.

## Test Environments

### Docker Compose (Local)

```bash
# Run local multi-container chaos tests
make chaos-local
```

### Minikube (Local Kubernetes)

```bash
# Run Kubernetes chaos tests locally
make chaos-minikube
```

### Production Kubernetes

```bash
# Run chaos tests in production cluster
make chaos-k8s
```

## Failure Scenarios

| Category | Scenario | Expected Behavior |
|---|---|---|
| Service Failures | Gateway instance kill | Load balancer failover, <5s recovery |
| Database | PostgreSQL connection loss | Graceful errors, automatic reconnection |
| Cache | Redis failure | Fallback to database, no error propagation |
| Network | Network partition | Circuit breaker activation, timeout handling |
| Resources | Memory/CPU pressure | Graceful degradation, auto-scaling |

## Resilience Requirements

- **Service Availability**: >95% during single instance failures
- **Recovery Time**: <60 seconds for service restoration
- **Error Handling**: Graceful degradation, no 500 errors
- **Data Consistency**: No data loss during failures
- **Performance**: <2x response time degradation under pressure
````
### 📚 References

- Chaos Engineering Principles - Building confidence in system behavior · https://principlesofchaos.org/
- Chaos Mesh - Cloud-native chaos engineering platform · https://chaos-mesh.org/
- Docker Compose - Multi-container application definition · https://docs.docker.com/compose/
- Kubernetes - Container orchestration platform · https://kubernetes.io/
- Minikube - Local Kubernetes development · https://minikube.sigs.k8s.io/
### 🧩 Additional Notes
- **Start simple**: Begin with Docker Compose chaos tests before moving to Kubernetes scenarios.
- **Gradual failure injection**: Test individual failure types before combining multiple failures.
- **Real-world scenarios**: Focus on failures that actually occur in production environments.
- **Automated validation**: Every chaos experiment should include automated validation of expected behaviors.
- **Documentation**: Document all failure scenarios and expected system responses for team knowledge.
- **Safety first**: Always run chaos tests in isolated environments before production testing.
- **Monitoring integration**: Chaos tests should integrate with your monitoring and alerting systems.

**Chaos Engineering Best Practices:**
- Define steady-state behavior before introducing chaos
- Minimize blast radius to avoid widespread impact
- Automate experiments to run consistently and frequently
- Build confidence gradually by starting with small experiments
- Learn from failures and improve system resilience based on findings
- Include business metrics in chaos experiment validation
- Document and share learnings across the team