
[CHORE]: Implement chaos engineering tests for fault tolerance validation (network partitions, service failures) #253

@crivetimihai

🧭 Chore Summary

Implement comprehensive chaos engineering tests for fault tolerance validation: a make chaos-test target that systematically injects failures (network partitions, service crashes, resource exhaustion) and validates mcpgateway resilience across Docker Compose, Kubernetes/Helm, and Minikube environments.


🧱 Areas Affected

  • Chaos testing infrastructure / Make targets (make chaos-test, make chaos-local, make chaos-k8s)
  • Docker Compose multi-service setup for local chaos testing
  • Kubernetes/Helm chaos testing with Chaos Mesh integration
  • Minikube local Kubernetes testing environment
  • Service resilience validation and failure recovery testing
  • Network partition simulation and service discovery testing
  • Database and Redis failure scenario testing
  • Load balancing and failover validation

βš™οΈ Context / Rationale

Chaos engineering proactively discovers weaknesses in distributed systems by deliberately introducing controlled failures. Instead of waiting for production outages, chaos tests simulate real-world failure scenarios to validate that mcpgateway can handle service crashes, network partitions, database failures, and resource exhaustion while maintaining availability and data consistency.

What is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. It involves systematically injecting failures to identify weaknesses before they cause outages.

Key Chaos Testing Scenarios:

  • Service Failures: Container crashes, process kills, graceful/ungraceful shutdowns
  • Network Partitions: Split-brain scenarios, intermittent connectivity, packet loss
  • Resource Exhaustion: CPU/memory/disk pressure, connection pool exhaustion
  • Database Failures: Primary/replica failures, connection timeouts, query failures
  • Dependency Failures: Redis crashes, external API timeouts, DNS failures

Simple Docker Compose Chaos Test:

# docker-compose.chaos.yml - Multi-service setup for chaos testing
version: '3.8'
services:
  mcpgateway-1:
    image: mcpgateway:latest
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
      - REDIS_URL=redis://redis:6379/0
    depends_on: [postgres, redis]
    
  mcpgateway-2:
    image: mcpgateway:latest  
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
      - REDIS_URL=redis://redis:6379/0
    depends_on: [postgres, redis]
    
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: password
      
  redis:
    image: redis:7
    
  load-balancer:
    image: nginx:alpine
    ports: ["4444:80"]
    volumes: ["./chaos/nginx.conf:/etc/nginx/nginx.conf"]
    depends_on: [mcpgateway-1, mcpgateway-2]
    
  chaos-controller:
    image: chaostoolkit/chaostoolkit:latest  # assumed public image; alternatively a custom image bundling the chaostoolkit CLI
    volumes: ["./chaos:/chaos"]
    command: ["chaos", "run", "/chaos/experiments.json"]

Advanced Kubernetes Chaos Testing:

# chaos/k8s-network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mcpgateway-network-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces: ["mcpgateway"]
    labelSelectors:
      app: mcpgateway
  direction: both
  duration: "30s"
  
---
apiVersion: chaos-mesh.org/v1alpha1  
kind: PodChaos
metadata:
  name: mcpgateway-pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["mcpgateway"]
    labelSelectors:
      app: mcpgateway
  duration: "10s"

MCPGateway-Specific Chaos Scenarios:

# Chaos test scenarios for mcpgateway distributed setup
chaos_scenarios:
  # Service resilience testing
  - name: "gateway_instance_failure"
    description: "Kill one mcpgateway instance while maintaining service"
    target: "mcpgateway-1"
    failure_type: "container_kill"
    duration: "30s"
    expected_behavior:
      - load_balancer_routes_to_healthy_instance
      - no_request_failures
      - automatic_service_recovery
      
  # Database resilience testing  
  - name: "database_connection_failure"
    description: "Simulate PostgreSQL connection failures"
    target: "postgres"
    failure_type: "network_partition"
    duration: "60s"
    expected_behavior:
      - connection_pool_recovery
      - graceful_error_handling
      - automatic_reconnection
      
  # Redis caching failure
  - name: "redis_cache_failure"
    description: "Test behavior when Redis becomes unavailable"
    target: "redis"
    failure_type: "container_stop"
    duration: "45s"
    expected_behavior:
      - fallback_to_direct_database_access
      - no_cache_errors_propagated
      - cache_recovery_on_restart
      
  # Network partition between services
  - name: "service_network_partition"
    description: "Partition gateway instances from database"
    target: "mcpgateway-*"
    failure_type: "network_delay"
    parameters:
      latency: "1000ms"
      jitter: "500ms"
    duration: "120s"
    expected_behavior:
      - request_timeout_handling
      - circuit_breaker_activation
      - health_check_failures
      
  # Resource exhaustion
  - name: "memory_pressure_test"
    description: "Exhaust memory resources on gateway instances"
    target: "mcpgateway-1"
    failure_type: "memory_stress"
    parameters:
      memory_percentage: 90
    duration: "60s"
    expected_behavior:
      - graceful_degradation
      - oom_killer_protection
      - load_balancer_removes_unhealthy_instance
      
  # Load balancer failure
  - name: "load_balancer_failure"
    description: "Test direct instance access when LB fails"
    target: "load-balancer"
    failure_type: "container_stop"
    duration: "30s"
    expected_behavior:
      - direct_instance_access_possible
      - service_discovery_fallback
      - health_monitoring_continues
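
These scenario definitions are declarative, so a small dispatcher can translate each failure_type into a concrete injection. The sketch below is an assumption-heavy illustration: it presumes the list above is saved as chaos/scenarios.yaml (file name assumed), that container_name pins the service names as in the Docker Compose file in the task list, and it only implements the container-level failure types.

# chaos/dispatch_scenarios.py - hedged sketch: map declarative scenarios to injections
import time
from typing import Any, Dict

import docker
import yaml


def inject(client: docker.DockerClient, scenario: Dict[str, Any]) -> None:
    """Apply one scenario to the running compose stack; unsupported types are skipped."""
    target = scenario["target"]
    failure = scenario["failure_type"]
    if failure == "container_kill":
        client.containers.get(target).kill()
    elif failure == "container_stop":
        client.containers.get(target).stop()
    else:
        # network_partition / network_delay / memory_stress need tc, pumba or Chaos Mesh
        print(f"skipping {scenario['name']}: no local injector for {failure}")
        return
    time.sleep(int(scenario["duration"].rstrip("s")))  # hold the failure for its duration
    print(f"{scenario['name']}: validate {scenario['expected_behavior']}")
    # restoration is left to compose restart policies or a follow-up `docker start`


if __name__ == "__main__":
    docker_client = docker.from_env()
    with open("chaos/scenarios.yaml") as f:
        for entry in yaml.safe_load(f)["chaos_scenarios"]:
            inject(docker_client, entry)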

📦 Related Make Targets

| Target | Purpose |
|--------|---------|
| **make chaos-test** | Run complete chaos engineering test suite across all environments |
| **make chaos-local** | Run Docker Compose based chaos tests locally |
| **make chaos-k8s** | Run Kubernetes chaos tests with Chaos Mesh |
| **make chaos-minikube** | Run chaos tests in local Minikube environment |
| make chaos-network | Test network partition and connectivity failure scenarios |
| make chaos-services | Test service failure and recovery scenarios |
| make chaos-resources | Test resource exhaustion and pressure scenarios |
| make chaos-setup | Set up chaos testing infrastructure and dependencies |
| make chaos-report | Generate chaos test reports and resilience analysis |
| make chaos-clean | Clean chaos test environments and artifacts |

Bold targets are mandatory; CI must fail if critical resilience requirements are not met.


📋 Acceptance Criteria

  • make chaos-test validates system resilience across all failure scenarios and environments.
  • make chaos-local successfully tests multi-container failures using Docker Compose.
  • make chaos-k8s executes Kubernetes-native chaos tests with Chaos Mesh integration.
  • make chaos-minikube runs complete chaos test suite in local Kubernetes environment.
  • Service failure scenarios validate automatic recovery and failover capabilities.
  • Network partition tests ensure split-brain scenario handling and service discovery.
  • Database failure tests validate connection pool recovery and graceful degradation.
  • Resource exhaustion tests confirm system stability under pressure.
  • Load balancing tests ensure traffic routing during instance failures.
  • All chaos tests include automated validation of expected resilience behaviors.
  • Changelog entry under "Testing" or "Reliability".

πŸ› οΈ Task List (suggested flow)

  1. Chaos testing infrastructure setup

    mkdir -p chaos/{experiments,k8s,reports,data}
    
    # Create chaos testing configuration
    cat > chaos/config.yaml << 'EOF'
    environments:
      local:
        type: "docker-compose"
        compose_file: "docker-compose.chaos.yml"
        services: ["mcpgateway-1", "mcpgateway-2", "postgres", "redis", "load-balancer"]
        
      minikube:
        type: "kubernetes"
        namespace: "mcpgateway-chaos"
        chaos_mesh_enabled: true
        
      k8s:
        type: "kubernetes"  
        namespace: "mcpgateway"
        chaos_mesh_enabled: true
    
    chaos_scenarios:
      service_failures:
        - gateway_instance_kill
        - database_connection_failure
        - redis_cache_failure
        
      network_issues:
        - network_partition
        - network_delay
        - packet_loss
        
      resource_pressure:
        - memory_exhaustion
        - cpu_pressure
        - disk_pressure
    
    validation_checks:
      - service_availability
      - data_consistency
      - recovery_time
      - error_handling
    EOF
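
    The runners below consume this file; a loader sketch (helper name and placement assumed, using PyYAML and the keys defined above) might look like:

    # chaos/config_loader.py - hedged sketch of reading chaos/config.yaml
    from pathlib import Path
    from typing import Any, Dict

    import yaml


    def load_environment(config_path: str, name: str) -> Dict[str, Any]:
        """Return one environment block ('local', 'minikube' or 'k8s') plus the shared scenario and validation lists."""
        config = yaml.safe_load(Path(config_path).read_text())
        env = dict(config["environments"][name])
        env["chaos_scenarios"] = config["chaos_scenarios"]
        env["validation_checks"] = config["validation_checks"]
        return env
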
  2. Docker Compose chaos testing setup

    # docker-compose.chaos.yml - Multi-service chaos testing environment
    version: '3.8'
    
    services:
      # Multiple gateway instances for failover testing
      mcpgateway-1:
        build: .
        container_name: mcpgateway-1
        environment:
          - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
          - REDIS_URL=redis://redis:6379/0
          - PORT=8000
          - INSTANCE_ID=gateway-1
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
          interval: 10s
          timeout: 5s
          retries: 3
        depends_on:
          postgres:
            condition: service_healthy
          redis:
            condition: service_healthy
        networks: [mcpgateway-net]
        
      mcpgateway-2:
        build: .
        container_name: mcpgateway-2
        environment:
          - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
          - REDIS_URL=redis://redis:6379/0
          - PORT=8000
          - INSTANCE_ID=gateway-2
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
          interval: 10s
          timeout: 5s
          retries: 3
        depends_on:
          postgres:
            condition: service_healthy
          redis:
            condition: service_healthy
        networks: [mcpgateway-net]
        
      # Database with health checks
      postgres:
        image: postgres:15
        container_name: postgres-chaos
        environment:
          POSTGRES_PASSWORD: password
          POSTGRES_DB: mcpgateway
          POSTGRES_USER: postgres
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U postgres"]
          interval: 10s
          timeout: 5s
          retries: 3
        volumes:
          - postgres_data:/var/lib/postgresql/data
        networks: [mcpgateway-net]
        
      # Redis cache
      redis:
        image: redis:7-alpine
        container_name: redis-chaos
        healthcheck:
          test: ["CMD", "redis-cli", "ping"]
          interval: 10s
          timeout: 5s
          retries: 3
        networks: [mcpgateway-net]
        
      # Load balancer for multi-instance testing
      load-balancer:
        image: nginx:alpine
        container_name: nginx-lb
        ports:
          - "4444:80"
        volumes:
          - "./chaos/nginx.conf:/etc/nginx/nginx.conf:ro"
        depends_on:
          - mcpgateway-1
          - mcpgateway-2
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost/health"]
          interval: 10s
          timeout: 5s
          retries: 3
        networks: [mcpgateway-net]
        
      # Chaos testing controller
      chaos-monkey:
        image: python:3.12-alpine
        container_name: chaos-controller
        volumes:
          - "./chaos:/chaos"
          - "/var/run/docker.sock:/var/run/docker.sock"
        working_dir: /chaos
        command: ["python", "chaos_runner.py"]
        depends_on:
          - mcpgateway-1
          - mcpgateway-2
          - postgres
          - redis
        networks: [mcpgateway-net]
        
    volumes:
      postgres_data:
      
    networks:
      mcpgateway-net:
        driver: bridge
    # chaos/nginx.conf - Load balancer configuration
    events {
        worker_connections 1024;
    }
    
    http {
        upstream mcpgateway {
            server mcpgateway-1:8000 max_fails=2 fail_timeout=30s;
            server mcpgateway-2:8000 max_fails=2 fail_timeout=30s;
        }
        
        server {
            listen 80;
            
            location /health {
                access_log off;
                proxy_pass http://mcpgateway/health;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
            }
            
            location / {
                proxy_pass http://mcpgateway;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_connect_timeout 5s;
                proxy_send_timeout 10s;
                proxy_read_timeout 10s;
            }
        }
    }
  3. Makefile integration

    # Chaos Engineering Testing Targets
    .PHONY: chaos-test chaos-local chaos-k8s chaos-minikube chaos-setup chaos-clean
    
    CHAOS_DIR := chaos
    CHAOS_REPORTS := $(CHAOS_DIR)/reports
    COMPOSE_CHAOS := docker-compose.chaos.yml
    MINIKUBE_PROFILE := mcpgateway-chaos
    
    chaos-test: chaos-local chaos-minikube
    	@echo "πŸ”₯ Running complete chaos engineering test suite..."
    	@python $(CHAOS_DIR)/generate_report.py \
    		--local-results $(CHAOS_REPORTS)/local \
    		--k8s-results $(CHAOS_REPORTS)/minikube \
    		--output $(CHAOS_REPORTS)/chaos-summary.html
    
    chaos-setup:
    	@echo "πŸ”§ Setting up chaos testing infrastructure..."
    	@pip install docker chaostoolkit requests pytest
    	@docker-compose -f $(COMPOSE_CHAOS) pull
    	@minikube profile $(MINIKUBE_PROFILE) || minikube start -p $(MINIKUBE_PROFILE)
    	@curl -sSL https://mirrors.chaos-mesh.org/v2.6.0/install.sh | bash  # installs Chaos Mesh into the current kubectl context (the minikube profile above)
    
    chaos-local:
    	@echo "πŸ‹ Running Docker Compose chaos tests..."
    	@mkdir -p $(CHAOS_REPORTS)/local
    	@docker-compose -f $(COMPOSE_CHAOS) up -d
    	@sleep 30  # Wait for services to be ready
    	@python $(CHAOS_DIR)/chaos_runner.py \
    		--environment local \
    		--config $(CHAOS_DIR)/config.yaml \
    		--output $(CHAOS_REPORTS)/local
    	@docker-compose -f $(COMPOSE_CHAOS) down
    
    chaos-minikube:
    	@echo "☸️  Running Minikube chaos tests..."
    	@mkdir -p $(CHAOS_REPORTS)/minikube
    	@minikube profile $(MINIKUBE_PROFILE)
    	@helm upgrade --install mcpgateway charts/mcp-stack \
    		--namespace mcpgateway-chaos \
    		--create-namespace \
    		--set replicaCount=2 \
    		--wait
    	@kubectl apply -f $(CHAOS_DIR)/k8s/ -n mcpgateway-chaos
    	@python $(CHAOS_DIR)/k8s_chaos_runner.py \
    		--namespace mcpgateway-chaos \
    		--output $(CHAOS_REPORTS)/minikube
    
    chaos-k8s:
    	@echo "☸️  Running Kubernetes chaos tests..."
    	@mkdir -p $(CHAOS_REPORTS)/k8s
    	@kubectl apply -f $(CHAOS_DIR)/k8s/ -n mcpgateway
    	@python $(CHAOS_DIR)/k8s_chaos_runner.py \
    		--namespace mcpgateway \
    		--output $(CHAOS_REPORTS)/k8s
    
    chaos-network:
    	@echo "🌐 Testing network partition scenarios..."
    	@python $(CHAOS_DIR)/network_chaos.py \
    		--scenarios partition,delay,loss \
    		--duration 60 \
    		--validate-recovery
    
    chaos-services:
    	@echo "⚑ Testing service failure scenarios..."
    	@python $(CHAOS_DIR)/service_chaos.py \
    		--services mcpgateway,postgres,redis \
    		--failure-types kill,stop,restart \
    		--validate-failover
    
    chaos-resources:
    	@echo "πŸ’Ύ Testing resource exhaustion scenarios..."
    	@python $(CHAOS_DIR)/resource_chaos.py \
    		--resources cpu,memory,disk \
    		--pressure-levels 80,90,95 \
    		--duration 120
    
    chaos-clean:
    	@echo "🧹 Cleaning chaos test environments..."
    	@docker-compose -f $(COMPOSE_CHAOS) down -v
    	@minikube delete -p $(MINIKUBE_PROFILE) || true
    	@rm -rf $(CHAOS_REPORTS)/*
  4. Chaos test runner (Docker Compose)

    # chaos/chaos_runner.py
    #!/usr/bin/env python3
    """
    Docker Compose based chaos testing runner.
    """
    
    import time
    import docker
    import requests
    import json
    import argparse
    from typing import Dict, List, Any
    from pathlib import Path
    
    class DockerChaosRunner:
        """Run chaos experiments in Docker Compose environment."""
        
        def __init__(self, config_path: str, output_dir: str):
            self.config = self._load_config(config_path)
            self.output_dir = Path(output_dir)
            self.output_dir.mkdir(parents=True, exist_ok=True)
            self.docker_client = docker.from_env()
            self.results = []
            
        def run_all_experiments(self) -> List[Dict[str, Any]]:
            """Run all configured chaos experiments."""
            
            print("πŸ”₯ Starting chaos experiments...")
            
            # Validate initial system health
            if not self._validate_system_health():
                raise Exception("System not healthy before chaos testing")
                
            # Run service failure experiments
            for experiment in self.config['chaos_scenarios']['service_failures']:
                result = self._run_service_failure_experiment(experiment)
                self.results.append(result)
                
            # Run network chaos experiments  
            for experiment in self.config['chaos_scenarios']['network_issues']:
                result = self._run_network_experiment(experiment)
                self.results.append(result)
                
            # Run resource pressure experiments
            for experiment in self.config['chaos_scenarios']['resource_pressure']:
                result = self._run_resource_experiment(experiment)
                self.results.append(result)
                
            # Save results
            self._save_results()
            return self.results
            
        def _run_service_failure_experiment(self, experiment_name: str) -> Dict[str, Any]:
            """Run a service failure experiment."""
            
            print(f"  🎯 Running service failure: {experiment_name}")
            
            experiment_config = {
                'gateway_instance_kill': {
                    'target_container': 'mcpgateway-1',
                    'action': 'kill',
                    'duration': 30,
                    'expected_behavior': ['load_balancer_failover', 'no_request_failures']
                },
                'database_connection_failure': {
                    'target_container': 'postgres-chaos',
                    'action': 'pause',
                    'duration': 60,
                    'expected_behavior': ['connection_pool_recovery', 'graceful_errors']
                },
                'redis_cache_failure': {
                    'target_container': 'redis-chaos',
                    'action': 'stop',
                    'duration': 45,
                    'expected_behavior': ['cache_fallback', 'no_error_propagation']
                }
            }
            
            config = experiment_config[experiment_name]
            result = {
                'experiment': experiment_name,
                'start_time': time.time(),
                'success': False,
                'observations': [],
                'metrics': {}
            }
            
            try:
                # Record baseline metrics
                baseline_metrics = self._collect_metrics()
                
                # Inject failure
                container = self.docker_client.containers.get(config['target_container'])
                
                if config['action'] == 'kill':
                    container.kill()
                elif config['action'] == 'pause':
                    container.pause()
                elif config['action'] == 'stop':
                    container.stop()
                    
                # Monitor system during failure
                failure_metrics = []
                for i in range(config['duration']):
                    time.sleep(1)
                    metrics = self._collect_metrics()
                    failure_metrics.append(metrics)
                    
                    # Check if system is behaving as expected
                    self._validate_expected_behavior(config['expected_behavior'], metrics)
                    
                # Restore service
                if config['action'] == 'kill':
                    # Container restart handled by compose restart policy
                    pass
                elif config['action'] == 'pause':
                    container.unpause()
                elif config['action'] == 'stop':
                    container.start()
                    
                # Wait for recovery and validate
                time.sleep(30)
                recovery_metrics = self._collect_metrics()
                
                # Validate recovery
                if self._validate_system_health():
                    result['success'] = True
                    result['observations'].append("System recovered successfully")
                else:
                    result['observations'].append("System failed to recover properly")
                    
                result['metrics'] = {
                    'baseline': baseline_metrics,
                    'during_failure': failure_metrics,
                    'after_recovery': recovery_metrics
                }
                
            except Exception as e:
                result['observations'].append(f"Experiment failed: {str(e)}")
                
            result['end_time'] = time.time()
            result['duration'] = result['end_time'] - result['start_time']
            
            return result
            
        def _collect_metrics(self) -> Dict[str, Any]:
            """Collect system metrics during chaos experiments."""
            
            metrics = {
                'timestamp': time.time(),
                'services': {},
                'load_balancer': {},
                'response_times': []
            }
            
            # Check service health
            services = ['mcpgateway-1', 'mcpgateway-2', 'postgres-chaos', 'redis-chaos']
            for service in services:
                try:
                    container = self.docker_client.containers.get(service)
                    metrics['services'][service] = {
                        'status': container.status,
                        'health': self._get_container_health(container)
                    }
                except Exception as e:
                    metrics['services'][service] = {'status': 'not_found', 'error': str(e)}
                    
            # Test load balancer response
            try:
                start_time = time.time()
                response = requests.get('http://localhost:4444/health', timeout=5)
                response_time = time.time() - start_time
                
                metrics['load_balancer'] = {
                    'status_code': response.status_code,
                    'response_time': response_time,
                    'accessible': response.status_code == 200
                }
            except Exception as e:
                metrics['load_balancer'] = {
                    'accessible': False,
                    'error': str(e)
                }
                
            # Test API endpoints
            test_endpoints = ['/health', '/tools/', '/servers/']
            for endpoint in test_endpoints:
                try:
                    start_time = time.time()
                    response = requests.get(f'http://localhost:4444{endpoint}', timeout=5)
                    response_time = time.time() - start_time
                    
                    metrics['response_times'].append({
                        'endpoint': endpoint,
                        'response_time': response_time,
                        'status_code': response.status_code
                    })
                except Exception as e:
                    metrics['response_times'].append({
                        'endpoint': endpoint,
                        'error': str(e),
                        'status_code': 0
                    })
                    
            return metrics
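
    The listing above references helpers that are not shown (_load_config, _validate_system_health, _get_container_health, _validate_expected_behavior, _save_results, plus the network and resource experiment runners). A hedged sketch of the simpler ones, with assumed behaviour:

        # chaos/chaos_runner.py (continued) - sketch of the helpers referenced above
        def _load_config(self, config_path: str) -> Dict[str, Any]:
            """Load chaos/config.yaml (see task 1)."""
            import yaml  # not in the import block above; assumed as an extra dependency
            with open(config_path) as f:
                return yaml.safe_load(f)

        def _validate_system_health(self) -> bool:
            """Healthy when the load balancer answers /health with HTTP 200."""
            try:
                return requests.get('http://localhost:4444/health', timeout=5).status_code == 200
            except requests.RequestException:
                return False

        def _get_container_health(self, container) -> str:
            """Docker healthcheck state ('healthy', 'unhealthy', ...) if the image defines one."""
            return container.attrs.get('State', {}).get('Health', {}).get('Status', 'unknown')

        def _validate_expected_behavior(self, expected: List[str], metrics: Dict[str, Any]) -> None:
            """Record (rather than assert) whether expectations hold for this sample."""
            if 'no_request_failures' in expected and not metrics['load_balancer'].get('accessible'):
                print("    ⚠️ load balancer unreachable during failure window")

        def _save_results(self) -> None:
            """Persist raw results as JSON for the reporting step."""
            (self.output_dir / 'results.json').write_text(json.dumps(self.results, indent=2))
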
  5. Kubernetes chaos testing

    # chaos/k8s/network-partition.yaml
    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: mcpgateway-network-partition
      namespace: mcpgateway-chaos
    spec:
      action: partition
      mode: all
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      direction: both
      duration: "30s"
      
    ---
    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: mcpgateway-network-delay
      namespace: mcpgateway-chaos
    spec:
      action: delay
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      delay:
        latency: "1000ms"
        correlation: "100"
        jitter: "500ms"
      duration: "60s"
    # chaos/k8s/pod-chaos.yaml
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: mcpgateway-pod-kill
      namespace: mcpgateway-chaos
    spec:
      action: pod-kill
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      gracePeriod: 0
      
    ---
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: postgres-pod-failure
      namespace: mcpgateway-chaos
    spec:
      action: pod-failure
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: postgresql
      duration: "60s"
    # chaos/k8s/stress-chaos.yaml
    apiVersion: chaos-mesh.org/v1alpha1
    kind: StressChaos
    metadata:
      name: mcpgateway-memory-stress
      namespace: mcpgateway-chaos
    spec:
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      duration: "120s"
      stressors:
        memory:
          workers: 1
          size: "80%"
          
    ---
    apiVersion: chaos-mesh.org/v1alpha1
    kind: StressChaos
    metadata:
      name: mcpgateway-cpu-stress
      namespace: mcpgateway-chaos
    spec:
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      duration: "90s"
      stressors:
        cpu:
          workers: 2
          load: 90
  6. Kubernetes chaos runner

    # chaos/k8s_chaos_runner.py
    #!/usr/bin/env python3
    """
    Kubernetes chaos testing with Chaos Mesh integration.
    """
    
    import time
    import yaml
    import subprocess
    import requests
    import argparse
    from typing import Dict, List, Any
    from pathlib import Path
    from kubernetes import client, config
    
    class KubernetesChaosRunner:
        """Run chaos experiments in Kubernetes using Chaos Mesh."""
        
        def __init__(self, namespace: str, output_dir: str):
            self.namespace = namespace
            self.output_dir = Path(output_dir)
            self.output_dir.mkdir(parents=True, exist_ok=True)
            
            # Load Kubernetes config
            try:
                config.load_incluster_config()
            except config.ConfigException:  # not running inside a cluster
                config.load_kube_config()
                
            self.k8s_client = client.ApiClient()
            self.v1 = client.CoreV1Api()
            self.apps_v1 = client.AppsV1Api()
            self.results = []
            
        def run_all_chaos_experiments(self) -> List[Dict[str, Any]]:
            """Run all Kubernetes chaos experiments."""
            
            print(f"☸️  Starting Kubernetes chaos experiments in namespace: {self.namespace}")
            
            # Validate initial cluster state
            if not self._validate_cluster_health():
                raise Exception("Cluster not healthy before chaos testing")
                
            # Get chaos experiment files
            chaos_files = list(Path("chaos/k8s").glob("*.yaml"))
            
            for chaos_file in chaos_files:
                result = self._run_chaos_experiment(chaos_file)
                self.results.append(result)
                
                # Wait between experiments
                time.sleep(30)
                
            # Save results
            self._save_results()
            return self.results
            
        def _run_chaos_experiment(self, chaos_file: Path) -> Dict[str, Any]:
            """Run a single chaos experiment."""
            
            print(f"  🎯 Running chaos experiment: {chaos_file.name}")
            
            result = {
                'experiment_file': str(chaos_file),
                'start_time': time.time(),
                'success': False,
                'observations': [],
                'metrics': {}
            }
            
            try:
                # Load chaos experiment
                with open(chaos_file) as f:
                    chaos_docs = list(yaml.safe_load_all(f))
                    
                # Record baseline metrics
                baseline_metrics = self._collect_k8s_metrics()
                
                # Apply chaos experiment
                for doc in chaos_docs:
                    if doc:
                        self._apply_chaos_manifest(doc)
                        
                # Monitor during chaos
                chaos_duration = self._extract_duration(chaos_docs[0])
                monitoring_metrics = []
                
                for i in range(chaos_duration + 30):  # Duration + recovery time
                    time.sleep(1)
                    metrics = self._collect_k8s_metrics()
                    monitoring_metrics.append(metrics)
                    
                    # Log significant events
                    if i % 10 == 0:
                        print(f"    Monitoring... {i}s elapsed")
                        
                # Clean up chaos experiment
                for doc in chaos_docs:
                    if doc:
                        self._delete_chaos_manifest(doc)
                        
                # Wait for recovery
                time.sleep(60)
                recovery_metrics = self._collect_k8s_metrics()
                
                # Validate recovery
                if self._validate_cluster_health():
                    result['success'] = True
                    result['observations'].append("Cluster recovered successfully")
                else:
                    result['observations'].append("Cluster failed to recover properly")
                    
                result['metrics'] = {
                    'baseline': baseline_metrics,
                    'during_chaos': monitoring_metrics,
                    'after_recovery': recovery_metrics
                }
                
            except Exception as e:
                result['observations'].append(f"Chaos experiment failed: {str(e)}")
                
            result['end_time'] = time.time()
            result['duration'] = result['end_time'] - result['start_time']
            
            return result
            
        def _collect_k8s_metrics(self) -> Dict[str, Any]:
            """Collect Kubernetes cluster metrics."""
            
            metrics = {
                'timestamp': time.time(),
                'pods': {},
                'services': {},
                'endpoints': {},
                'nodes': {}
            }
            
            try:
                # Pod status
                pods = self.v1.list_namespaced_pod(namespace=self.namespace)
                for pod in pods.items:
                    metrics['pods'][pod.metadata.name] = {
                        'phase': pod.status.phase,
                        'ready': self._is_pod_ready(pod),
                        'restart_count': sum(
                            container.restart_count or 0 
                            for container in pod.status.container_statuses or []
                        )
                    }
                    
                # Service endpoints
                services = self.v1.list_namespaced_service(namespace=self.namespace)
                for service in services.items:
                    endpoints = self.v1.read_namespaced_endpoints(
                        name=service.metadata.name,
                        namespace=self.namespace
                    )
                    metrics['services'][service.metadata.name] = {
                        'type': service.spec.type,
                        'endpoints_ready': sum(
                            len(subset.addresses or []) 
                            for subset in endpoints.subsets or []
                        )
                    }
                    
                # Test service connectivity
                service_url = self._get_service_url()
                if service_url:
                    try:
                        response = requests.get(f"{service_url}/health", timeout=5)
                        metrics['service_connectivity'] = {
                            'accessible': True,
                            'status_code': response.status_code,
                            'response_time': response.elapsed.total_seconds()
                        }
                    except Exception as e:
                        metrics['service_connectivity'] = {
                            'accessible': False,
                            'error': str(e)
                        }
                        
            except Exception as e:
                metrics['collection_error'] = str(e)
                
            return metrics
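
    Several helpers used above are not shown (_validate_cluster_health, _apply_chaos_manifest, _delete_chaos_manifest, _extract_duration, _is_pod_ready, _get_service_url, _save_results). A hedged sketch of a few of them, assuming the Chaos Mesh resources are managed through the Kubernetes custom-objects API; the remaining helpers can follow the same pattern as the Docker runner:

        # chaos/k8s_chaos_runner.py (continued) - sketch of the helpers referenced above
        def _validate_cluster_health(self) -> bool:
            """All pods in the namespace are Running and Ready."""
            pods = self.v1.list_namespaced_pod(namespace=self.namespace)
            return all(p.status.phase == "Running" and self._is_pod_ready(p) for p in pods.items)

        def _is_pod_ready(self, pod) -> bool:
            conditions = pod.status.conditions or []
            return any(c.type == "Ready" and c.status == "True" for c in conditions)

        def _apply_chaos_manifest(self, manifest: Dict[str, Any]) -> None:
            """Create the Chaos Mesh custom resource (NetworkChaos, PodChaos, StressChaos)."""
            client.CustomObjectsApi().create_namespaced_custom_object(
                group="chaos-mesh.org",
                version="v1alpha1",
                namespace=self.namespace,
                plural=manifest["kind"].lower(),  # e.g. NetworkChaos -> "networkchaos"
                body=manifest,
            )

        def _delete_chaos_manifest(self, manifest: Dict[str, Any]) -> None:
            client.CustomObjectsApi().delete_namespaced_custom_object(
                group="chaos-mesh.org",
                version="v1alpha1",
                namespace=self.namespace,
                plural=manifest["kind"].lower(),
                name=manifest["metadata"]["name"],
            )

        def _extract_duration(self, manifest: Dict[str, Any]) -> int:
            """Parse spec.duration values such as '30s' or '120s' into seconds (default 60)."""
            return int(manifest.get("spec", {}).get("duration", "60s").rstrip("s"))
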
  7. Chaos test validation

    # chaos/validation.py
    """
    Validation logic for chaos engineering tests.
    """
    
    import time
    import requests
    from typing import Dict, List, Any, Optional
    
    class ChaosTestValidator:
        """Validate system behavior during and after chaos experiments."""
        
        def __init__(self, base_url: str = "http://localhost:4444"):
            self.base_url = base_url
            self.validation_results = []
            
        def validate_service_resilience(self, experiment_type: str, 
                                      during_chaos_metrics: List[Dict],
                                      recovery_metrics: Dict) -> Dict[str, Any]:
            """Validate service resilience during chaos experiments."""
            
            validation = {
                'experiment_type': experiment_type,
                'passed': True,
                'failures': [],
                'warnings': [],
                'recovery_time': None
            }
            
            if experiment_type == 'gateway_instance_kill':
                validation.update(self._validate_instance_failover(
                    during_chaos_metrics, recovery_metrics
                ))
            elif experiment_type == 'database_connection_failure':
                validation.update(self._validate_database_resilience(
                    during_chaos_metrics, recovery_metrics
                ))
            elif experiment_type == 'redis_cache_failure':
                validation.update(self._validate_cache_resilience(
                    during_chaos_metrics, recovery_metrics
                ))
            elif experiment_type == 'network_partition':
                validation.update(self._validate_network_resilience(
                    during_chaos_metrics, recovery_metrics
                ))
                
            return validation
            
        def _validate_instance_failover(self, chaos_metrics: List[Dict], 
                                      recovery_metrics: Dict) -> Dict[str, Any]:
            """Validate gateway instance failover behavior."""
            
            validation = {'passed': True, 'failures': [], 'warnings': []}
            
            # Check load balancer kept routing requests
            successful_requests = 0
            total_requests = 0
            
            for metrics in chaos_metrics:
                lb_metrics = metrics.get('load_balancer', {})
                if lb_metrics.get('accessible', False):
                    successful_requests += 1
                total_requests += 1
                
            success_rate = successful_requests / total_requests if total_requests > 0 else 0
            
            if success_rate < 0.95:  # 95% availability threshold
                validation['failures'].append(
                    f"Service availability dropped to {success_rate:.2%} during instance failure"
                )
                validation['passed'] = False
            else:
                validation['warnings'].append(
                    f"Service maintained {success_rate:.2%} availability during failover"
                )
                
            # Check recovery time
            recovery_time = self._calculate_recovery_time(chaos_metrics)
            if recovery_time > 60:  # 60 second max recovery time
                validation['failures'].append(
                    f"Recovery took {recovery_time}s, exceeds 60s threshold"
                )
                validation['passed'] = False
                
            validation['recovery_time'] = recovery_time
            return validation
            
        def _validate_database_resilience(self, chaos_metrics: List[Dict],
                                        recovery_metrics: Dict) -> Dict[str, Any]:
            """Validate database connection resilience."""
            
            validation = {'passed': True, 'failures': [], 'warnings': []}
            
            # Check for graceful error handling
            error_responses = []
            for metrics in chaos_metrics:
                for response in metrics.get('response_times', []):
                    if response.get('status_code', 0) >= 500:
                        error_responses.append(response)
                        
            # Database failures should result in 503 (service unavailable) not 500
            internal_errors = [r for r in error_responses if r.get('status_code') == 500]
            if internal_errors:
                validation['failures'].append(
                    f"Found {len(internal_errors)} internal server errors during DB failure"
                )
                validation['passed'] = False
                
            # Check connection pool recovery
            if not recovery_metrics.get('services', {}).get('postgres-chaos', {}).get('health'):
                validation['failures'].append("Database service not healthy after recovery")
                validation['passed'] = False
                
            return validation
            
        def _validate_cache_resilience(self, chaos_metrics: List[Dict],
                                     recovery_metrics: Dict) -> Dict[str, Any]:
            """Validate Redis cache failure resilience."""
            
            validation = {'passed': True, 'failures': [], 'warnings': []}
            
            # Cache failures should not impact service availability
            for metrics in chaos_metrics:
                if not metrics.get('load_balancer', {}).get('accessible', False):
                    validation['failures'].append(
                        "Service became unavailable during cache failure"
                    )
                    validation['passed'] = False
                    break
                    
            # Performance may degrade but should remain functional
            response_times = []
            for metrics in chaos_metrics:
                for response in metrics.get('response_times', []):
                    if response.get('status_code') == 200:
                        response_times.append(response.get('response_time', 0))
                        
            if response_times:
                avg_response_time = sum(response_times) / len(response_times)
                if avg_response_time > 5.0:  # 5 second threshold
                    validation['warnings'].append(
                        f"Response time degraded to {avg_response_time:.2f}s during cache failure"
                    )
                    
            return validation
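
    The class also references _validate_network_resilience and _calculate_recovery_time, which are not shown above; one possible sketch (assumed thresholds, metric layout taken from the Docker runner's _collect_metrics):

        # chaos/validation.py (continued) - sketch of the remaining helpers
        def _validate_network_resilience(self, chaos_metrics: List[Dict],
                                         recovery_metrics: Dict) -> Dict[str, Any]:
            """During a partition, timeouts are tolerated but HTTP 500 floods are not."""
            validation = {'passed': True, 'failures': [], 'warnings': []}
            server_errors = sum(
                1 for m in chaos_metrics
                for r in m.get('response_times', [])
                if r.get('status_code', 0) == 500
            )
            if server_errors:
                validation['failures'].append(
                    f"{server_errors} HTTP 500 responses observed during network partition"
                )
                validation['passed'] = False
            return validation

        def _calculate_recovery_time(self, chaos_metrics: List[Dict]) -> float:
            """Seconds from the first failed load-balancer probe to the next successful one."""
            first_failure = next(
                (m['timestamp'] for m in chaos_metrics
                 if not m.get('load_balancer', {}).get('accessible', False)),
                None,
            )
            if first_failure is None:
                return 0.0
            recovered = next(
                (m['timestamp'] for m in chaos_metrics
                 if m['timestamp'] > first_failure
                 and m.get('load_balancer', {}).get('accessible', False)),
                None,
            )
            return (recovered - first_failure) if recovered is not None else float('inf')
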
  8. CI integration

    # Add to existing GitHub Actions workflow
    chaos-engineering:
      name: 🔥 Chaos Engineering Tests
      runs-on: ubuntu-latest
      needs: [test, migration-testing]
      if: github.ref == 'refs/heads/main'  # Only run on main branch
      
      strategy:
        matrix:
          environment: ["docker-compose", "minikube"]
          
      steps:
        - name: ⬇️  Checkout source
          uses: actions/checkout@v4
          with:
            fetch-depth: 1
            
        - name: 🐍  Set up Python
          uses: actions/setup-python@v5
          with:
            python-version: "3.12"
            cache: pip
            
        - name: 🔧  Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -e .[dev]
            pip install docker chaostoolkit kubernetes
            
        - name: 🚀  Set up Docker Compose environment
          if: matrix.environment == 'docker-compose'
          run: |
            docker-compose -f docker-compose.chaos.yml build
            make chaos-setup
            
        - name: ☸️  Set up Minikube environment  
          if: matrix.environment == 'minikube'
          run: |
            # Install minikube
            curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
            sudo install minikube-linux-amd64 /usr/local/bin/minikube
            
            # Install helm
            curl https://get.helm.sh/helm-v3.12.0-linux-amd64.tar.gz | tar xz
            sudo mv linux-amd64/helm /usr/local/bin/
            
            # Start minikube
            minikube start --driver=docker --cpus=2 --memory=4096
            
            # Install Chaos Mesh
            curl -sSL https://mirrors.chaos-mesh.org/v2.6.0/install.sh | bash
            
        - name: 🔥  Run chaos tests - ${{ matrix.environment }}
          run: |
            case "${{ matrix.environment }}" in
              "docker-compose")
                make chaos-local
                ;;
              "minikube")
                make chaos-minikube
                ;;
            esac
            
        - name: 📊  Generate chaos test report
          run: |
            make chaos-report
            
        - name: 📎  Upload chaos test results
          uses: actions/upload-artifact@v4
          with:
            name: chaos-test-results-${{ matrix.environment }}
            path: |
              chaos/reports/
            retention-days: 30
            
        - name: 🚨  Validate resilience requirements
          run: |
            python chaos/validate_resilience.py \
              --results chaos/reports/${{ matrix.environment }} \
              --fail-on-critical
            
        - name: 🧹  Cleanup
          if: always()
          run: |
            make chaos-clean
            if [ "${{ matrix.environment }}" == "minikube" ]; then
              minikube delete
            fi
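
    The last validation step calls chaos/validate_resilience.py, which is not shown elsewhere in this issue; a minimal sketch of what it could do, assuming each runner writes a results.json (as in the runner helper sketches above) and that --fail-on-critical turns any failed experiment into a non-zero exit code:

    #!/usr/bin/env python3
    # chaos/validate_resilience.py - hedged sketch of the CI resilience gate
    import argparse
    import json
    import sys
    from pathlib import Path


    def main() -> int:
        parser = argparse.ArgumentParser(description="Gate CI on chaos test results")
        parser.add_argument("--results", required=True, help="directory containing results.json files")
        parser.add_argument("--fail-on-critical", action="store_true")
        args = parser.parse_args()

        failed = []
        for results_file in Path(args.results).rglob("results.json"):
            for experiment in json.loads(results_file.read_text()):
                if not experiment.get("success"):
                    failed.append(experiment.get("experiment") or experiment.get("experiment_file"))

        if failed:
            print(f"🚨 {len(failed)} chaos experiment(s) failed: {failed}")
            return 1 if args.fail_on_critical else 0
        print("✅ all chaos experiments met their resilience expectations")
        return 0


    if __name__ == "__main__":
        sys.exit(main())
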
  9. Monitoring and reporting

    # chaos/generate_report.py
    #!/usr/bin/env python3
    """
    Generate comprehensive chaos engineering test reports.
    """
    
    import json
    import plotly.graph_objects as go
    import plotly.express as px
    from plotly.subplots import make_subplots
    import pandas as pd
    from pathlib import Path
    
    class ChaosTestReporter:
        """Generate visual reports for chaos engineering tests."""
        
        def generate_html_report(self, results_dir: Path, output_file: Path):
            """Generate comprehensive HTML chaos test report."""
            
            # Load all test results
            all_results = self._load_all_results(results_dir)
            
            # Create visualizations
            fig = make_subplots(
                rows=3, cols=2,
                subplot_titles=[
                    'Service Availability During Chaos',
                    'Recovery Time by Experiment',
                    'Response Time Impact',
                    'Error Rate Analysis',
                    'Resource Usage During Chaos',
                    'Resilience Score by Component'
                ],
                specs=[
                    [{"secondary_y": True}, {"type": "bar"}],
                    [{"type": "scatter"}, {"type": "bar"}],
                    [{"type": "heatmap"}, {"type": "indicator"}]
                ]
            )
            
            # Service availability timeline
            self._add_availability_chart(fig, all_results, row=1, col=1)
            
            # Recovery time comparison
            self._add_recovery_time_chart(fig, all_results, row=1, col=2)
            
            # Response time impact
            self._add_response_time_chart(fig, all_results, row=2, col=1)
            
            # Generate HTML report
            html_template = self._get_html_template()
            html_content = html_template.format(
                charts=fig.to_html(include_plotlyjs='cdn'),
                summary_table=self._generate_summary_table(all_results),
                recommendations=self._generate_recommendations(all_results),
                timestamp=pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
            )
            
            output_file.write_text(html_content)
            print(f"πŸ“Š Chaos test report generated: {output_file}")
  10. Documentation

    Add chaos engineering documentation:

    # Chaos Engineering Testing
    
    ## Overview
    
    Chaos engineering tests validate mcpgateway's resilience by deliberately introducing failures and verifying the system's ability to handle them gracefully.
    
    ## Test Environments
    
    ### Docker Compose (Local)
    ```bash
    # Run local multi-container chaos tests
    make chaos-local
    ```

    ### Minikube (Local Kubernetes)
    ```bash
    # Run Kubernetes chaos tests locally
    make chaos-minikube
    ```

    ### Production Kubernetes
    ```bash
    # Run chaos tests in production cluster
    make chaos-k8s
    ```

    ## Failure Scenarios

    | Category | Scenario | Expected Behavior |
    |----------|----------|-------------------|
    | Service Failures | Gateway instance kill | Load balancer failover, <5s recovery |
    | Database | PostgreSQL connection loss | Graceful errors, automatic reconnection |
    | Cache | Redis failure | Fallback to database, no error propagation |
    | Network | Network partition | Circuit breaker activation, timeout handling |
    | Resources | Memory/CPU pressure | Graceful degradation, auto-scaling |

    ## Resilience Requirements

    - **Service Availability**: >95% during single instance failures
    - **Recovery Time**: <60 seconds for service restoration
    - **Error Handling**: Graceful degradation, no 500 errors
    - **Data Consistency**: No data loss during failures
    - **Performance**: <2x response time degradation under pressure

📖 References


🧩 Additional Notes

  • Start simple: Begin with Docker Compose chaos tests before moving to Kubernetes scenarios.
  • Gradual failure injection: Test individual failure types before combining multiple failures.
  • Real-world scenarios: Focus on failures that actually occur in production environments.
  • Automated validation: Every chaos experiment should include automated validation of expected behaviors.
  • Documentation: Document all failure scenarios and expected system responses for team knowledge.
  • Safety first: Always run chaos tests in isolated environments before production testing.
  • Monitoring integration: Chaos tests should integrate with your monitoring and alerting systems.

Chaos Engineering Best Practices:

  • Define steady-state behavior before introducing chaos
  • Minimize blast radius to avoid widespread impact
  • Automate experiments to run consistently and frequently
  • Build confidence gradually by starting with small experiments
  • Learn from failures and improve system resilience based on findings
  • Include business metrics in chaos experiment validation
  • Document and share learnings across the team

Labels

  • chore: Linting, formatting, dependency hygiene, or project maintenance chores
  • cicd: Issue with CI/CD process (GitHub Actions, scaffolding)
  • devops: DevOps activities (containers, automation, deployment, makefiles, etc.)
  • help wanted: Extra attention is needed
  • testing: Testing (unit, e2e, manual, automated, etc.)
  • triage: Issues / Features awaiting triage
