
[CHORE]: Implement chaos engineering tests for fault tolerance validation (network partitions, service failures) #253

@crivetimihai

🧭 Chore Summary

Implement comprehensive chaos engineering tests for fault tolerance validation: a make chaos-test target that systematically injects failures (network partitions, service crashes, resource exhaustion) and validates mcpgateway resilience across Docker Compose, Kubernetes/Helm, and Minikube environments.


🧱 Areas Affected

  • Chaos testing infrastructure / Make targets (make chaos-test, make chaos-local, make chaos-k8s)
  • Docker Compose multi-service setup for local chaos testing
  • Kubernetes/Helm chaos testing with Chaos Mesh integration
  • Minikube local Kubernetes testing environment
  • Service resilience validation and failure recovery testing
  • Network partition simulation and service discovery testing
  • Database and Redis failure scenario testing
  • Load balancing and failover validation

βš™οΈ Context / Rationale

Chaos engineering proactively discovers weaknesses in distributed systems by deliberately introducing controlled failures. Instead of waiting for production outages, chaos tests simulate real-world failure scenarios to validate that mcpgateway can handle service crashes, network partitions, database failures, and resource exhaustion while maintaining availability and data consistency.

What is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. It involves systematically injecting failures to identify weaknesses before they cause outages.

Key Chaos Testing Scenarios:

  • Service Failures: Container crashes, process kills, graceful/ungraceful shutdowns
  • Network Partitions: Split-brain scenarios, intermittent connectivity, packet loss
  • Resource Exhaustion: CPU/memory/disk pressure, connection pool exhaustion
  • Database Failures: Primary/replica failures, connection timeouts, query failures
  • Dependency Failures: Redis crashes, external API timeouts, DNS failures

Simple Docker Compose Chaos Test:

# docker-compose.chaos.yml - Multi-service setup for chaos testing
version: '3.8'
services:
  mcpgateway-1:
    image: mcpgateway:latest
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
      - REDIS_URL=redis://redis:6379/0
    depends_on: [postgres, redis]
    
  mcpgateway-2:
    image: mcpgateway:latest  
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
      - REDIS_URL=redis://redis:6379/0
    depends_on: [postgres, redis]
    
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: password
      
  redis:
    image: redis:7
    
  load-balancer:
    image: nginx:alpine
    ports: ["4444:80"]
    volumes: ["./chaos/nginx.conf:/etc/nginx/nginx.conf"]
    depends_on: [mcpgateway-1, mcpgateway-2]
    
  chaos-controller:
    image: chaostoolkit/chaostoolkit:latest  # assumed public image; alternatively a custom image bundling the chaostoolkit CLI
    volumes: ["./chaos:/chaos"]
    command: ["chaos", "run", "/chaos/experiments.json"]

Advanced Kubernetes Chaos Testing:

# chaos/k8s-network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: mcpgateway-network-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces: ["mcpgateway"]
    labelSelectors:
      app: mcpgateway
  direction: both
  duration: "30s"
  
---
apiVersion: chaos-mesh.org/v1alpha1  
kind: PodChaos
metadata:
  name: mcpgateway-pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["mcpgateway"]
    labelSelectors:
      app: mcpgateway
  duration: "10s"

MCPGateway-Specific Chaos Scenarios:

# Chaos test scenarios for mcpgateway distributed setup
chaos_scenarios:
  # Service resilience testing
  - name: "gateway_instance_failure"
    description: "Kill one mcpgateway instance while maintaining service"
    target: "mcpgateway-1"
    failure_type: "container_kill"
    duration: "30s"
    expected_behavior:
      - load_balancer_routes_to_healthy_instance
      - no_request_failures
      - automatic_service_recovery
      
  # Database resilience testing  
  - name: "database_connection_failure"
    description: "Simulate PostgreSQL connection failures"
    target: "postgres"
    failure_type: "network_partition"
    duration: "60s"
    expected_behavior:
      - connection_pool_recovery
      - graceful_error_handling
      - automatic_reconnection
      
  # Redis caching failure
  - name: "redis_cache_failure"
    description: "Test behavior when Redis becomes unavailable"
    target: "redis"
    failure_type: "container_stop"
    duration: "45s"
    expected_behavior:
      - fallback_to_direct_database_access
      - no_cache_errors_propagated
      - cache_recovery_on_restart
      
  # Network partition between services
  - name: "service_network_partition"
    description: "Partition gateway instances from database"
    target: "mcpgateway-*"
    failure_type: "network_delay"
    parameters:
      latency: "1000ms"
      jitter: "500ms"
    duration: "120s"
    expected_behavior:
      - request_timeout_handling
      - circuit_breaker_activation
      - health_check_failures
      
  # Resource exhaustion
  - name: "memory_pressure_test"
    description: "Exhaust memory resources on gateway instances"
    target: "mcpgateway-1"
    failure_type: "memory_stress"
    parameters:
      memory_percentage: 90
    duration: "60s"
    expected_behavior:
      - graceful_degradation
      - oom_killer_protection
      - load_balancer_removes_unhealthy_instance
      
  # Load balancer failure
  - name: "load_balancer_failure"
    description: "Test direct instance access when LB fails"
    target: "load-balancer"
    failure_type: "container_stop"
    duration: "30s"
    expected_behavior:
      - direct_instance_access_possible
      - service_discovery_fallback
      - health_monitoring_continues
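
These scenario definitions are declarative, so a small dispatcher can translate each failure_type into a concrete injection. The sketch below is an assumption-heavy illustration: it presumes the list above is saved as chaos/scenarios.yaml (file name assumed), that container_name pins the service names as in the Docker Compose file in the task list, and it only implements the container-level failure types.

# chaos/dispatch_scenarios.py - hedged sketch: map declarative scenarios to injections
import time
from typing import Any, Dict

import docker
import yaml


def inject(client: docker.DockerClient, scenario: Dict[str, Any]) -> None:
    """Apply one scenario to the running compose stack; unsupported types are skipped."""
    target = scenario["target"]
    failure = scenario["failure_type"]
    if failure == "container_kill":
        client.containers.get(target).kill()
    elif failure == "container_stop":
        client.containers.get(target).stop()
    else:
        # network_partition / network_delay / memory_stress need tc, pumba or Chaos Mesh
        print(f"skipping {scenario['name']}: no local injector for {failure}")
        return
    time.sleep(int(scenario["duration"].rstrip("s")))  # hold the failure for its duration
    print(f"{scenario['name']}: validate {scenario['expected_behavior']}")
    # restoration is left to compose restart policies or a follow-up `docker start`


if __name__ == "__main__":
    docker_client = docker.from_env()
    with open("chaos/scenarios.yaml") as f:
        for entry in yaml.safe_load(f)["chaos_scenarios"]:
            inject(docker_client, entry)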

📦 Related Make Targets

| Target | Purpose |
|--------|---------|
| **make chaos-test** | Run complete chaos engineering test suite across all environments |
| **make chaos-local** | Run Docker Compose based chaos tests locally |
| **make chaos-k8s** | Run Kubernetes chaos tests with Chaos Mesh |
| **make chaos-minikube** | Run chaos tests in local Minikube environment |
| make chaos-network | Test network partition and connectivity failure scenarios |
| make chaos-services | Test service failure and recovery scenarios |
| make chaos-resources | Test resource exhaustion and pressure scenarios |
| make chaos-setup | Set up chaos testing infrastructure and dependencies |
| make chaos-report | Generate chaos test reports and resilience analysis |
| make chaos-clean | Clean chaos test environments and artifacts |

Bold targets are mandatory; CI must fail if critical resilience requirements are not met.


📋 Acceptance Criteria

  • make chaos-test validates system resilience across all failure scenarios and environments.
  • make chaos-local successfully tests multi-container failures using Docker Compose.
  • make chaos-k8s executes Kubernetes-native chaos tests with Chaos Mesh integration.
  • make chaos-minikube runs complete chaos test suite in local Kubernetes environment.
  • Service failure scenarios validate automatic recovery and failover capabilities.
  • Network partition tests ensure split-brain scenario handling and service discovery.
  • Database failure tests validate connection pool recovery and graceful degradation.
  • Resource exhaustion tests confirm system stability under pressure.
  • Load balancing tests ensure traffic routing during instance failures.
  • All chaos tests include automated validation of expected resilience behaviors.
  • Changelog entry under "Testing" or "Reliability".

πŸ› οΈ Task List (suggested flow)

  1. Chaos testing infrastructure setup

    mkdir -p chaos/{experiments,k8s,reports,data}
    
    # Create chaos testing configuration
    cat > chaos/config.yaml << 'EOF'
    environments:
      local:
        type: "docker-compose"
        compose_file: "docker-compose.chaos.yml"
        services: ["mcpgateway-1", "mcpgateway-2", "postgres", "redis", "load-balancer"]
        
      minikube:
        type: "kubernetes"
        namespace: "mcpgateway-chaos"
        chaos_mesh_enabled: true
        
      k8s:
        type: "kubernetes"  
        namespace: "mcpgateway"
        chaos_mesh_enabled: true
    
    chaos_scenarios:
      service_failures:
        - gateway_instance_kill
        - database_connection_failure
        - redis_cache_failure
        
      network_issues:
        - network_partition
        - network_delay
        - packet_loss
        
      resource_pressure:
        - memory_exhaustion
        - cpu_pressure
        - disk_pressure
    
    validation_checks:
      - service_availability
      - data_consistency
      - recovery_time
      - error_handling
    EOF
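
    The runners below consume this file; a loader sketch (helper name and placement assumed, using PyYAML and the keys defined above) might look like:

    # chaos/config_loader.py - hedged sketch of reading chaos/config.yaml
    from pathlib import Path
    from typing import Any, Dict

    import yaml


    def load_environment(config_path: str, name: str) -> Dict[str, Any]:
        """Return one environment block ('local', 'minikube' or 'k8s') plus the shared scenario and validation lists."""
        config = yaml.safe_load(Path(config_path).read_text())
        env = dict(config["environments"][name])
        env["chaos_scenarios"] = config["chaos_scenarios"]
        env["validation_checks"] = config["validation_checks"]
        return env
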
  2. Docker Compose chaos testing setup

    # docker-compose.chaos.yml - Multi-service chaos testing environment
    version: '3.8'
    
    services:
      # Multiple gateway instances for failover testing
      mcpgateway-1:
        build: .
        container_name: mcpgateway-1
        environment:
          - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
          - REDIS_URL=redis://redis:6379/0
          - PORT=8000
          - INSTANCE_ID=gateway-1
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
          interval: 10s
          timeout: 5s
          retries: 3
        depends_on:
          postgres:
            condition: service_healthy
          redis:
            condition: service_healthy
        networks: [mcpgateway-net]
        
      mcpgateway-2:
        build: .
        container_name: mcpgateway-2
        environment:
          - DATABASE_URL=postgresql://postgres:password@postgres:5432/mcpgateway
          - REDIS_URL=redis://redis:6379/0
          - PORT=8000
          - INSTANCE_ID=gateway-2
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
          interval: 10s
          timeout: 5s
          retries: 3
        depends_on:
          postgres:
            condition: service_healthy
          redis:
            condition: service_healthy
        networks: [mcpgateway-net]
        
      # Database with health checks
      postgres:
        image: postgres:15
        container_name: postgres-chaos
        environment:
          POSTGRES_PASSWORD: password
          POSTGRES_DB: mcpgateway
          POSTGRES_USER: postgres
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U postgres"]
          interval: 10s
          timeout: 5s
          retries: 3
        volumes:
          - postgres_data:/var/lib/postgresql/data
        networks: [mcpgateway-net]
        
      # Redis cache
      redis:
        image: redis:7-alpine
        container_name: redis-chaos
        healthcheck:
          test: ["CMD", "redis-cli", "ping"]
          interval: 10s
          timeout: 5s
          retries: 3
        networks: [mcpgateway-net]
        
      # Load balancer for multi-instance testing
      load-balancer:
        image: nginx:alpine
        container_name: nginx-lb
        ports:
          - "4444:80"
        volumes:
          - "./chaos/nginx.conf:/etc/nginx/nginx.conf:ro"
        depends_on:
          - mcpgateway-1
          - mcpgateway-2
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost/health"]
          interval: 10s
          timeout: 5s
          retries: 3
        networks: [mcpgateway-net]
        
      # Chaos testing controller
      chaos-monkey:
        image: python:3.12-alpine
        container_name: chaos-controller
        volumes:
          - "./chaos:/chaos"
          - "/var/run/docker.sock:/var/run/docker.sock"
        working_dir: /chaos
        command: ["python", "chaos_runner.py"]
        depends_on:
          - mcpgateway-1
          - mcpgateway-2
          - postgres
          - redis
        networks: [mcpgateway-net]
        
    volumes:
      postgres_data:
      
    networks:
      mcpgateway-net:
        driver: bridge
    # chaos/nginx.conf - Load balancer configuration
    events {
        worker_connections 1024;
    }
    
    http {
        upstream mcpgateway {
            server mcpgateway-1:8000 max_fails=2 fail_timeout=30s;
            server mcpgateway-2:8000 max_fails=2 fail_timeout=30s;
        }
        
        server {
            listen 80;
            
            location /health {
                access_log off;
                proxy_pass http://mcpgateway/health;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
            }
            
            location / {
                proxy_pass http://mcpgateway;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_connect_timeout 5s;
                proxy_send_timeout 10s;
                proxy_read_timeout 10s;
            }
        }
    }
  3. Makefile integration

    # Chaos Engineering Testing Targets
    .PHONY: chaos-test chaos-local chaos-k8s chaos-minikube chaos-setup chaos-clean
    
    CHAOS_DIR := chaos
    CHAOS_REPORTS := $(CHAOS_DIR)/reports
    COMPOSE_CHAOS := docker-compose.chaos.yml
    MINIKUBE_PROFILE := mcpgateway-chaos
    
    chaos-test: chaos-local chaos-minikube
    	@echo "πŸ”₯ Running complete chaos engineering test suite..."
    	@python $(CHAOS_DIR)/generate_report.py \
    		--local-results $(CHAOS_REPORTS)/local \
    		--k8s-results $(CHAOS_REPORTS)/minikube \
    		--output $(CHAOS_REPORTS)/chaos-summary.html
    
    chaos-setup:
    	@echo "πŸ”§ Setting up chaos testing infrastructure..."
    	@pip install docker chaostoolkit requests pytest
    	@docker-compose -f $(COMPOSE_CHAOS) pull
    	@minikube profile $(MINIKUBE_PROFILE) || minikube start -p $(MINIKUBE_PROFILE)
    	@curl -sSL https://mirrors.chaos-mesh.org/v2.6.0/install.sh | bash  # installs Chaos Mesh into the current kubectl context (the minikube profile above)
    
    chaos-local:
    	@echo "πŸ‹ Running Docker Compose chaos tests..."
    	@mkdir -p $(CHAOS_REPORTS)/local
    	@docker-compose -f $(COMPOSE_CHAOS) up -d
    	@sleep 30  # Wait for services to be ready
    	@python $(CHAOS_DIR)/chaos_runner.py \
    		--environment local \
    		--config $(CHAOS_DIR)/config.yaml \
    		--output $(CHAOS_REPORTS)/local
    	@docker-compose -f $(COMPOSE_CHAOS) down
    
    chaos-minikube:
    	@echo "☸️  Running Minikube chaos tests..."
    	@mkdir -p $(CHAOS_REPORTS)/minikube
    	@minikube profile $(MINIKUBE_PROFILE)
    	@helm upgrade --install mcpgateway charts/mcp-stack \
    		--namespace mcpgateway-chaos \
    		--create-namespace \
    		--set replicaCount=2 \
    		--wait
    	@kubectl apply -f $(CHAOS_DIR)/k8s/ -n mcpgateway-chaos
    	@python $(CHAOS_DIR)/k8s_chaos_runner.py \
    		--namespace mcpgateway-chaos \
    		--output $(CHAOS_REPORTS)/minikube
    
    chaos-k8s:
    	@echo "☸️  Running Kubernetes chaos tests..."
    	@mkdir -p $(CHAOS_REPORTS)/k8s
    	@kubectl apply -f $(CHAOS_DIR)/k8s/ -n mcpgateway
    	@python $(CHAOS_DIR)/k8s_chaos_runner.py \
    		--namespace mcpgateway \
    		--output $(CHAOS_REPORTS)/k8s
    
    chaos-network:
    	@echo "🌐 Testing network partition scenarios..."
    	@python $(CHAOS_DIR)/network_chaos.py \
    		--scenarios partition,delay,loss \
    		--duration 60 \
    		--validate-recovery
    
    chaos-services:
    	@echo "⚑ Testing service failure scenarios..."
    	@python $(CHAOS_DIR)/service_chaos.py \
    		--services mcpgateway,postgres,redis \
    		--failure-types kill,stop,restart \
    		--validate-failover
    
    chaos-resources:
    	@echo "πŸ’Ύ Testing resource exhaustion scenarios..."
    	@python $(CHAOS_DIR)/resource_chaos.py \
    		--resources cpu,memory,disk \
    		--pressure-levels 80,90,95 \
    		--duration 120
    
    chaos-clean:
    	@echo "🧹 Cleaning chaos test environments..."
    	@docker-compose -f $(COMPOSE_CHAOS) down -v
    	@minikube delete -p $(MINIKUBE_PROFILE) || true
    	@rm -rf $(CHAOS_REPORTS)/*
  4. Chaos test runner (Docker Compose)

    # chaos/chaos_runner.py
    #!/usr/bin/env python3
    """
    Docker Compose based chaos testing runner.
    """
    
    import time
    import docker
    import requests
    import json
    import argparse
    from typing import Dict, List, Any
    from pathlib import Path
    
    class DockerChaosRunner:
        """Run chaos experiments in Docker Compose environment."""
        
        def __init__(self, config_path: str, output_dir: str):
            self.config = self._load_config(config_path)
            self.output_dir = Path(output_dir)
            self.output_dir.mkdir(parents=True, exist_ok=True)
            self.docker_client = docker.from_env()
            self.results = []
            
        def run_all_experiments(self) -> List[Dict[str, Any]]:
            """Run all configured chaos experiments."""
            
            print("πŸ”₯ Starting chaos experiments...")
            
            # Validate initial system health
            if not self._validate_system_health():
                raise Exception("System not healthy before chaos testing")
                
            # Run service failure experiments
            for experiment in self.config['chaos_scenarios']['service_failures']:
                result = self._run_service_failure_experiment(experiment)
                self.results.append(result)
                
            # Run network chaos experiments  
            for experiment in self.config['chaos_scenarios']['network_issues']:
                result = self._run_network_experiment(experiment)
                self.results.append(result)
                
            # Run resource pressure experiments
            for experiment in self.config['chaos_scenarios']['resource_pressure']:
                result = self._run_resource_experiment(experiment)
                self.results.append(result)
                
            # Save results
            self._save_results()
            return self.results
            
        def _run_service_failure_experiment(self, experiment_name: str) -> Dict[str, Any]:
            """Run a service failure experiment."""
            
            print(f"  🎯 Running service failure: {experiment_name}")
            
            experiment_config = {
                'gateway_instance_kill': {
                    'target_container': 'mcpgateway-1',
                    'action': 'kill',
                    'duration': 30,
                    'expected_behavior': ['load_balancer_failover', 'no_request_failures']
                },
                'database_connection_failure': {
                    'target_container': 'postgres-chaos',
                    'action': 'pause',
                    'duration': 60,
                    'expected_behavior': ['connection_pool_recovery', 'graceful_errors']
                },
                'redis_cache_failure': {
                    'target_container': 'redis-chaos',
                    'action': 'stop',
                    'duration': 45,
                    'expected_behavior': ['cache_fallback', 'no_error_propagation']
                }
            }
            
            config = experiment_config[experiment_name]
            result = {
                'experiment': experiment_name,
                'start_time': time.time(),
                'success': False,
                'observations': [],
                'metrics': {}
            }
            
            try:
                # Record baseline metrics
                baseline_metrics = self._collect_metrics()
                
                # Inject failure
                container = self.docker_client.containers.get(config['target_container'])
                
                if config['action'] == 'kill':
                    container.kill()
                elif config['action'] == 'pause':
                    container.pause()
                elif config['action'] == 'stop':
                    container.stop()
                    
                # Monitor system during failure
                failure_metrics = []
                for i in range(config['duration']):
                    time.sleep(1)
                    metrics = self._collect_metrics()
                    failure_metrics.append(metrics)
                    
                    # Check if system is behaving as expected
                    self._validate_expected_behavior(config['expected_behavior'], metrics)
                    
                # Restore service
                if config['action'] == 'kill':
                    # Container restart handled by compose restart policy
                    pass
                elif config['action'] == 'pause':
                    container.unpause()
                elif config['action'] == 'stop':
                    container.start()
                    
                # Wait for recovery and validate
                time.sleep(30)
                recovery_metrics = self._collect_metrics()
                
                # Validate recovery
                if self._validate_system_health():
                    result['success'] = True
                    result['observations'].append("System recovered successfully")
                else:
                    result['observations'].append("System failed to recover properly")
                    
                result['metrics'] = {
                    'baseline': baseline_metrics,
                    'during_failure': failure_metrics,
                    'after_recovery': recovery_metrics
                }
                
            except Exception as e:
                result['observations'].append(f"Experiment failed: {str(e)}")
                
            result['end_time'] = time.time()
            result['duration'] = result['end_time'] - result['start_time']
            
            return result
            
        def _collect_metrics(self) -> Dict[str, Any]:
            """Collect system metrics during chaos experiments."""
            
            metrics = {
                'timestamp': time.time(),
                'services': {},
                'load_balancer': {},
                'response_times': []
            }
            
            # Check service health
            services = ['mcpgateway-1', 'mcpgateway-2', 'postgres-chaos', 'redis-chaos']
            for service in services:
                try:
                    container = self.docker_client.containers.get(service)
                    metrics['services'][service] = {
                        'status': container.status,
                        'health': self._get_container_health(container)
                    }
                except Exception as e:
                    metrics['services'][service] = {'status': 'not_found', 'error': str(e)}
                    
            # Test load balancer response
            try:
                start_time = time.time()
                response = requests.get('http://localhost:4444/health', timeout=5)
                response_time = time.time() - start_time
                
                metrics['load_balancer'] = {
                    'status_code': response.status_code,
                    'response_time': response_time,
                    'accessible': response.status_code == 200
                }
            except Exception as e:
                metrics['load_balancer'] = {
                    'accessible': False,
                    'error': str(e)
                }
                
            # Test API endpoints
            test_endpoints = ['/health', '/tools/', '/servers/']
            for endpoint in test_endpoints:
                try:
                    start_time = time.time()
                    response = requests.get(f'http://localhost:4444{endpoint}', timeout=5)
                    response_time = time.time() - start_time
                    
                    metrics['response_times'].append({
                        'endpoint': endpoint,
                        'response_time': response_time,
                        'status_code': response.status_code
                    })
                except Exception as e:
                    metrics['response_times'].append({
                        'endpoint': endpoint,
                        'error': str(e),
                        'status_code': 0
                    })
                    
            return metrics
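
    The listing above references helpers that are not shown (_load_config, _validate_system_health, _get_container_health, _validate_expected_behavior, _save_results, plus the network and resource experiment runners). A hedged sketch of the simpler ones, with assumed behaviour:

        # chaos/chaos_runner.py (continued) - sketch of the helpers referenced above
        def _load_config(self, config_path: str) -> Dict[str, Any]:
            """Load chaos/config.yaml (see task 1)."""
            import yaml  # not in the import block above; assumed as an extra dependency
            with open(config_path) as f:
                return yaml.safe_load(f)

        def _validate_system_health(self) -> bool:
            """Healthy when the load balancer answers /health with HTTP 200."""
            try:
                return requests.get('http://localhost:4444/health', timeout=5).status_code == 200
            except requests.RequestException:
                return False

        def _get_container_health(self, container) -> str:
            """Docker healthcheck state ('healthy', 'unhealthy', ...) if the image defines one."""
            return container.attrs.get('State', {}).get('Health', {}).get('Status', 'unknown')

        def _validate_expected_behavior(self, expected: List[str], metrics: Dict[str, Any]) -> None:
            """Record (rather than assert) whether expectations hold for this sample."""
            if 'no_request_failures' in expected and not metrics['load_balancer'].get('accessible'):
                print("    ⚠️ load balancer unreachable during failure window")

        def _save_results(self) -> None:
            """Persist raw results as JSON for the reporting step."""
            (self.output_dir / 'results.json').write_text(json.dumps(self.results, indent=2))
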
  5. Kubernetes chaos testing

    # chaos/k8s/network-partition.yaml
    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: mcpgateway-network-partition
      namespace: mcpgateway-chaos
    spec:
      action: partition
      mode: all
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      direction: both
      duration: "30s"
      
    ---
    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: mcpgateway-network-delay
      namespace: mcpgateway-chaos
    spec:
      action: delay
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      delay:
        latency: "1000ms"
        correlation: "100"
        jitter: "500ms"
      duration: "60s"
    # chaos/k8s/pod-chaos.yaml
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: mcpgateway-pod-kill
      namespace: mcpgateway-chaos
    spec:
      action: pod-kill
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      gracePeriod: 0
      
    ---
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: postgres-pod-failure
      namespace: mcpgateway-chaos
    spec:
      action: pod-failure
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: postgresql
      duration: "60s"
    # chaos/k8s/stress-chaos.yaml
    apiVersion: chaos-mesh.org/v1alpha1
    kind: StressChaos
    metadata:
      name: mcpgateway-memory-stress
      namespace: mcpgateway-chaos
    spec:
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      duration: "120s"
      stressors:
        memory:
          workers: 1
          size: "80%"
          
    ---
    apiVersion: chaos-mesh.org/v1alpha1
    kind: StressChaos
    metadata:
      name: mcpgateway-cpu-stress
      namespace: mcpgateway-chaos
    spec:
      mode: one
      selector:
        namespaces: ["mcpgateway-chaos"]
        labelSelectors:
          app.kubernetes.io/name: mcpgateway
      duration: "90s"
      stressors:
        cpu:
          workers: 2
          load: 90
  6. Kubernetes chaos runner

    # chaos/k8s_chaos_runner.py
    #!/usr/bin/env python3
    """
    Kubernetes chaos testing with Chaos Mesh integration.
    """
    
    import time
    import yaml
    import subprocess
    import requests
    import argparse
    from typing import Dict, List, Any
    from pathlib import Path
    from kubernetes import client, config
    
    class KubernetesChaosRunner:
        """Run chaos experiments in Kubernetes using Chaos Mesh."""
        
        def __init__(self, namespace: str, output_dir: str):
            self.namespace = namespace
            self.output_dir = Path(output_dir)
            self.output_dir.mkdir(parents=True, exist_ok=True)
            
            # Load Kubernetes config
            try:
                config.load_incluster_config()
            except config.ConfigException:  # not running inside a cluster
                config.load_kube_config()
                
            self.k8s_client = client.ApiClient()
            self.v1 = client.CoreV1Api()
            self.apps_v1 = client.AppsV1Api()
            self.results = []
            
        def run_all_chaos_experiments(self) -> List[Dict[str, Any]]:
            """Run all Kubernetes chaos experiments."""
            
            print(f"☸️  Starting Kubernetes chaos experiments in namespace: {self.namespace}")
            
            # Validate initial cluster state
            if not self._validate_cluster_health():
                raise Exception("Cluster not healthy before chaos testing")
                
            # Get chaos experiment files
            chaos_files = list(Path("chaos/k8s").glob("*.yaml"))
            
            for chaos_file in chaos_files:
                result = self._run_chaos_experiment(chaos_file)
                self.results.append(result)
                
                # Wait between experiments
                time.sleep(30)
                
            # Save results
            self._save_results()
            return self.results
            
        def _run_chaos_experiment(self, chaos_file: Path) -> Dict[str, Any]:
            """Run a single chaos experiment."""
            
            print(f"  🎯 Running chaos experiment: {chaos_file.name}")
            
            result = {
                'experiment_file': str(chaos_file),
                'start_time': time.time(),
                'success': False,
                'observations': [],
                'metrics': {}
            }
            
            try:
                # Load chaos experiment
                with open(chaos_file) as f:
                    chaos_docs = list(yaml.safe_load_all(f))
                    
                # Record baseline metrics
                baseline_metrics = self._collect_k8s_metrics()
                
                # Apply chaos experiment
                for doc in chaos_docs:
                    if doc:
                        self._apply_chaos_manifest(doc)
                        
                # Monitor during chaos
                chaos_duration = self._extract_duration(chaos_docs[0])
                monitoring_metrics = []
                
                for i in range(chaos_duration + 30):  # Duration + recovery time
                    time.sleep(1)
                    metrics = self._collect_k8s_metrics()
                    monitoring_metrics.append(metrics)
                    
                    # Log significant events
                    if i % 10 == 0:
                        print(f"    Monitoring... {i}s elapsed")
                        
                # Clean up chaos experiment
                for doc in chaos_docs:
                    if doc:
                        self._delete_chaos_manifest(doc)
                        
                # Wait for recovery
                time.sleep(60)
                recovery_metrics = self._collect_k8s_metrics()
                
                # Validate recovery
                if self._validate_cluster_health():
                    result['success'] = True
                    result['observations'].append("Cluster recovered successfully")
                else:
                    result['observations'].append("Cluster failed to recover properly")
                    
                result['metrics'] = {
                    'baseline': baseline_metrics,
                    'during_chaos': monitoring_metrics,
                    'after_recovery': recovery_metrics
                }
                
            except Exception as e:
                result['observations'].append(f"Chaos experiment failed: {str(e)}")
                
            result['end_time'] = time.time()
            result['duration'] = result['end_time'] - result['start_time']
            
            return result
            
        def _collect_k8s_metrics(self) -> Dict[str, Any]:
            """Collect Kubernetes cluster metrics."""
            
            metrics = {
                'timestamp': time.time(),
                'pods': {},
                'services': {},
                'endpoints': {},
                'nodes': {}
            }
            
            try:
                # Pod status
                pods = self.v1.list_namespaced_pod(namespace=self.namespace)
                for pod in pods.items:
                    metrics['pods'][pod.metadata.name] = {
                        'phase': pod.status.phase,
                        'ready': self._is_pod_ready(pod),
                        'restart_count': sum(
                            container.restart_count or 0 
                            for container in pod.status.container_statuses or []
                        )
                    }
                    
                # Service endpoints
                services = self.v1.list_namespaced_service(namespace=self.namespace)
                for service in services.items:
                    endpoints = self.v1.read_namespaced_endpoints(
                        name=service.metadata.name,
                        namespace=self.namespace
                    )
                    metrics['services'][service.metadata.name] = {
                        'type': service.spec.type,
                        'endpoints_ready': sum(
                            len(subset.addresses or []) 
                            for subset in endpoints.subsets or []
                        )
                    }
                    
                # Test service connectivity
                service_url = self._get_service_url()
                if service_url:
                    try:
                        response = requests.get(f"{service_url}/health", timeout=5)
                        metrics['service_connectivity'] = {
                            'accessible': True,
                            'status_code': response.status_code,
                            'response_time': response.elapsed.total_seconds()
                        }
                    except Exception as e:
                        metrics['service_connectivity'] = {
                            'accessible': False,
                            'error': str(e)
                        }
                        
            except Exception as e:
                metrics['collection_error'] = str(e)
                
            return metrics
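
    Several helpers used above are not shown (_validate_cluster_health, _apply_chaos_manifest, _delete_chaos_manifest, _extract_duration, _is_pod_ready, _get_service_url, _save_results). A hedged sketch of a few of them, assuming the Chaos Mesh resources are managed through the Kubernetes custom-objects API; the remaining helpers can follow the same pattern as the Docker runner:

        # chaos/k8s_chaos_runner.py (continued) - sketch of the helpers referenced above
        def _validate_cluster_health(self) -> bool:
            """All pods in the namespace are Running and Ready."""
            pods = self.v1.list_namespaced_pod(namespace=self.namespace)
            return all(p.status.phase == "Running" and self._is_pod_ready(p) for p in pods.items)

        def _is_pod_ready(self, pod) -> bool:
            conditions = pod.status.conditions or []
            return any(c.type == "Ready" and c.status == "True" for c in conditions)

        def _apply_chaos_manifest(self, manifest: Dict[str, Any]) -> None:
            """Create the Chaos Mesh custom resource (NetworkChaos, PodChaos, StressChaos)."""
            client.CustomObjectsApi().create_namespaced_custom_object(
                group="chaos-mesh.org",
                version="v1alpha1",
                namespace=self.namespace,
                plural=manifest["kind"].lower(),  # e.g. NetworkChaos -> "networkchaos"
                body=manifest,
            )

        def _delete_chaos_manifest(self, manifest: Dict[str, Any]) -> None:
            client.CustomObjectsApi().delete_namespaced_custom_object(
                group="chaos-mesh.org",
                version="v1alpha1",
                namespace=self.namespace,
                plural=manifest["kind"].lower(),
                name=manifest["metadata"]["name"],
            )

        def _extract_duration(self, manifest: Dict[str, Any]) -> int:
            """Parse spec.duration values such as '30s' or '120s' into seconds (default 60)."""
            return int(manifest.get("spec", {}).get("duration", "60s").rstrip("s"))
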
  7. Chaos test validation

    # chaos/validation.py
    """
    Validation logic for chaos engineering tests.
    """
    
    import time
    import requests
    from typing import Dict, List, Any, Optional
    
    class ChaosTestValidator:
        """Validate system behavior during and after chaos experiments."""
        
        def __init__(self, base_url: str = "http://localhost:4444"):
            self.base_url = base_url
            self.validation_results = []
            
        def validate_service_resilience(self, experiment_type: str, 
                                      during_chaos_metrics: List[Dict],
                                      recovery_metrics: Dict) -> Dict[str, Any]:
            """Validate service resilience during chaos experiments."""
            
            validation = {
                'experiment_type': experiment_type,
                'passed': True,
                'failures': [],
                'warnings': [],
                'recovery_time': None
            }
            
            if experiment_type == 'gateway_instance_kill':
                validation.update(self._validate_instance_failover(
                    during_chaos_metrics, recovery_metrics
                ))
            elif experiment_type == 'database_connection_failure':
                validation.update(self._validate_database_resilience(
                    during_chaos_metrics, recovery_metrics
                ))
            elif experiment_type == 'redis_cache_failure':
                validation.update(self._validate_cache_resilience(
                    during_chaos_metrics, recovery_metrics
                ))
            elif experiment_type == 'network_partition':
                validation.update(self._validate_network_resilience(
                    during_chaos_metrics, recovery_metrics
                ))
                
            return validation
            
        def _validate_instance_failover(self, chaos_metrics: List[Dict], 
                                      recovery_metrics: Dict) -> Dict[str, Any]:
            """Validate gateway instance failover behavior."""
            
            validation = {'passed': True, 'failures': [], 'warnings': []}
            
            # Check load balancer kept routing requests
            successful_requests = 0
            total_requests = 0
            
            for metrics in chaos_metrics:
                lb_metrics = metrics.get('load_balancer', {})
                if lb_metrics.get('accessible', False):
                    successful_requests += 1
                total_requests += 1
                
            success_rate = successful_requests / total_requests if total_requests > 0 else 0
            
            if success_rate < 0.95:  # 95% availability threshold
                validation['failures'].append(
                    f"Service availability dropped to {success_rate:.2%} during instance failure"
                )
                validation['passed'] = False
            else:
                validation['warnings'].append(
                    f"Service maintained {success_rate:.2%} availability during failover"
                )
                
            # Check recovery time
            recovery_time = self._calculate_recovery_time(chaos_metrics)
            if recovery_time > 60:  # 60 second max recovery time
                validation['failures'].append(
                    f"Recovery took {recovery_time}s, exceeds 60s threshold"
                )
                validation['passed'] = False
                
            validation['recovery_time'] = recovery_time
            return validation
            
        def _validate_database_resilience(self, chaos_metrics: List[Dict],
                                        recovery_metrics: Dict) -> Dict[str, Any]:
            """Validate database connection resilience."""
            
            validation = {'passed': True, 'failures': [], 'warnings': []}
            
            # Check for graceful error handling
            error_responses = []
            for metrics in chaos_metrics:
                for response in metrics.get('response_times', []):
                    if response.get('status_code', 0) >= 500:
                        error_responses.append(response)
                        
            # Database failures should result in 503 (service unavailable) not 500
            internal_errors = [r for r in error_responses if r.get('status_code') == 500]
            if internal_errors:
                validation['failures'].append(
                    f"Found {len(internal_errors)} internal server errors during DB failure"
                )
                validation['passed'] = False
                
            # Check connection pool recovery
            if not recovery_metrics.get('services', {}).get('postgres-chaos', {}).get('health'):
                validation['failures'].append("Database service not healthy after recovery")
                validation['passed'] = False
                
            return validation
            
        def _validate_cache_resilience(self, chaos_metrics: List[Dict],
                                     recovery_metrics: Dict) -> Dict[str, Any]:
            """Validate Redis cache failure resilience."""
            
            validation = {'passed': True, 'failures': [], 'warnings': []}
            
            # Cache failures should not impact service availability
            for metrics in chaos_metrics:
                if not metrics.get('load_balancer', {}).get('accessible', False):
                    validation['failures'].append(
                        "Service became unavailable during cache failure"
                    )
                    validation['passed'] = False
                    break
                    
            # Performance may degrade but should remain functional
            response_times = []
            for metrics in chaos_metrics:
                for response in metrics.get('response_times', []):
                    if response.get('status_code') == 200:
                        response_times.append(response.get('response_time', 0))
                        
            if response_times:
                avg_response_time = sum(response_times) / len(response_times)
                if avg_response_time > 5.0:  # 5 second threshold
                    validation['warnings'].append(
                        f"Response time degraded to {avg_response_time:.2f}s during cache failure"
                    )
                    
            return validation
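
    The class also references _validate_network_resilience and _calculate_recovery_time, which are not shown above; one possible sketch (assumed thresholds, metric layout taken from the Docker runner's _collect_metrics):

        # chaos/validation.py (continued) - sketch of the remaining helpers
        def _validate_network_resilience(self, chaos_metrics: List[Dict],
                                         recovery_metrics: Dict) -> Dict[str, Any]:
            """During a partition, timeouts are tolerated but HTTP 500 floods are not."""
            validation = {'passed': True, 'failures': [], 'warnings': []}
            server_errors = sum(
                1 for m in chaos_metrics
                for r in m.get('response_times', [])
                if r.get('status_code', 0) == 500
            )
            if server_errors:
                validation['failures'].append(
                    f"{server_errors} HTTP 500 responses observed during network partition"
                )
                validation['passed'] = False
            return validation

        def _calculate_recovery_time(self, chaos_metrics: List[Dict]) -> float:
            """Seconds from the first failed load-balancer probe to the next successful one."""
            first_failure = next(
                (m['timestamp'] for m in chaos_metrics
                 if not m.get('load_balancer', {}).get('accessible', False)),
                None,
            )
            if first_failure is None:
                return 0.0
            recovered = next(
                (m['timestamp'] for m in chaos_metrics
                 if m['timestamp'] > first_failure
                 and m.get('load_balancer', {}).get('accessible', False)),
                None,
            )
            return (recovered - first_failure) if recovered is not None else float('inf')
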
  8. CI integration

    # Add to existing GitHub Actions workflow
    chaos-engineering:
      name: 🔥 Chaos Engineering Tests
      runs-on: ubuntu-latest
      needs: [test, migration-testing]
      if: github.ref == 'refs/heads/main'  # Only run on main branch
      
      strategy:
        matrix:
          environment: ["docker-compose", "minikube"]
          
      steps:
        - name: ⬇️  Checkout source
          uses: actions/checkout@v4
          with:
            fetch-depth: 1
            
        - name: 🐍  Set up Python
          uses: actions/setup-python@v5
          with:
            python-version: "3.12"
            cache: pip
            
        - name: 🔧  Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -e .[dev]
            pip install docker chaostoolkit kubernetes
            
        - name: 🚀  Set up Docker Compose environment
          if: matrix.environment == 'docker-compose'
          run: |
            docker-compose -f docker-compose.chaos.yml build
            make chaos-setup
            
        - name: ☸️  Set up Minikube environment  
          if: matrix.environment == 'minikube'
          run: |
            # Install minikube
            curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
            sudo install minikube-linux-amd64 /usr/local/bin/minikube
            
            # Install helm
            curl https://get.helm.sh/helm-v3.12.0-linux-amd64.tar.gz | tar xz
            sudo mv linux-amd64/helm /usr/local/bin/
            
            # Start minikube
            minikube start --driver=docker --cpus=2 --memory=4096
            
            # Install Chaos Mesh
            curl -sSL https://mirrors.chaos-mesh.org/v2.6.0/install.sh | bash
            
        - name: 🔥  Run chaos tests - ${{ matrix.environment }}
          run: |
            case "${{ matrix.environment }}" in
              "docker-compose")
                make chaos-local
                ;;
              "minikube")
                make chaos-minikube
                ;;
            esac
            
        - name: 📊  Generate chaos test report
          run: |
            make chaos-report
            
        - name: 📎  Upload chaos test results
          uses: actions/upload-artifact@v4
          with:
            name: chaos-test-results-${{ matrix.environment }}
            path: |
              chaos/reports/
            retention-days: 30
            
        - name: 🚨  Validate resilience requirements
          run: |
            python chaos/validate_resilience.py \
              --results chaos/reports/${{ matrix.environment }} \
              --fail-on-critical
            
        - name: 🧹  Cleanup
          if: always()
          run: |
            make chaos-clean
            if [ "${{ matrix.environment }}" == "minikube" ]; then
              minikube delete
            fi
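
    The last validation step calls chaos/validate_resilience.py, which is not shown elsewhere in this issue; a minimal sketch of what it could do, assuming each runner writes a results.json (as in the runner helper sketches above) and that --fail-on-critical turns any failed experiment into a non-zero exit code:

    #!/usr/bin/env python3
    # chaos/validate_resilience.py - hedged sketch of the CI resilience gate
    import argparse
    import json
    import sys
    from pathlib import Path


    def main() -> int:
        parser = argparse.ArgumentParser(description="Gate CI on chaos test results")
        parser.add_argument("--results", required=True, help="directory containing results.json files")
        parser.add_argument("--fail-on-critical", action="store_true")
        args = parser.parse_args()

        failed = []
        for results_file in Path(args.results).rglob("results.json"):
            for experiment in json.loads(results_file.read_text()):
                if not experiment.get("success"):
                    failed.append(experiment.get("experiment") or experiment.get("experiment_file"))

        if failed:
            print(f"🚨 {len(failed)} chaos experiment(s) failed: {failed}")
            return 1 if args.fail_on_critical else 0
        print("✅ all chaos experiments met their resilience expectations")
        return 0


    if __name__ == "__main__":
        sys.exit(main())
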
  9. Monitoring and reporting

    # chaos/generate_report.py
    #!/usr/bin/env python3
    """
    Generate comprehensive chaos engineering test reports.
    """
    
    import json
    import plotly.graph_objects as go
    import plotly.express as px
    from plotly.subplots import make_subplots
    import pandas as pd
    from pathlib import Path
    
    class ChaosTestReporter:
        """Generate visual reports for chaos engineering tests."""
        
        def generate_html_report(self, results_dir: Path, output_file: Path):
            """Generate comprehensive HTML chaos test report."""
            
            # Load all test results
            all_results = self._load_all_results(results_dir)
            
            # Create visualizations
            fig = make_subplots(
                rows=3, cols=2,
                subplot_titles=[
                    'Service Availability During Chaos',
                    'Recovery Time by Experiment',
                    'Response Time Impact',
                    'Error Rate Analysis',
                    'Resource Usage During Chaos',
                    'Resilience Score by Component'
                ],
                specs=[
                    [{"secondary_y": True}, {"type": "bar"}],
                    [{"type": "scatter"}, {"type": "bar"}],
                    [{"type": "heatmap"}, {"type": "indicator"}]
                ]
            )
            
            # Service availability timeline
            self._add_availability_chart(fig, all_results, row=1, col=1)
            
            # Recovery time comparison
            self._add_recovery_time_chart(fig, all_results, row=1, col=2)
            
            # Response time impact
            self._add_response_time_chart(fig, all_results, row=2, col=1)
            
            # Generate HTML report
            html_template = self._get_html_template()
            html_content = html_template.format(
                charts=fig.to_html(include_plotlyjs='cdn'),
                summary_table=self._generate_summary_table(all_results),
                recommendations=self._generate_recommendations(all_results),
                timestamp=pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
            )
            
            output_file.write_text(html_content)
            print(f"πŸ“Š Chaos test report generated: {output_file}")
  10. Documentation

    Add chaos engineering documentation:

    # Chaos Engineering Testing
    
    ## Overview
    
    Chaos engineering tests validate mcpgateway's resilience by deliberately introducing failures and verifying the system's ability to handle them gracefully.
    
    ## Test Environments
    
    ### Docker Compose (Local)
    ```bash
    # Run local multi-container chaos tests
    make chaos-local
    ```

    ### Minikube (Local Kubernetes)
    ```bash
    # Run Kubernetes chaos tests locally
    make chaos-minikube
    ```

    ### Production Kubernetes
    ```bash
    # Run chaos tests in production cluster
    make chaos-k8s
    ```

    ## Failure Scenarios

    | Category | Scenario | Expected Behavior |
    |----------|----------|-------------------|
    | Service Failures | Gateway instance kill | Load balancer failover, <5s recovery |
    | Database | PostgreSQL connection loss | Graceful errors, automatic reconnection |
    | Cache | Redis failure | Fallback to database, no error propagation |
    | Network | Network partition | Circuit breaker activation, timeout handling |
    | Resources | Memory/CPU pressure | Graceful degradation, auto-scaling |

    ## Resilience Requirements

    - **Service Availability**: >95% during single instance failures
    - **Recovery Time**: <60 seconds for service restoration
    - **Error Handling**: Graceful degradation, no 500 errors
    - **Data Consistency**: No data loss during failures
    - **Performance**: <2x response time degradation under pressure

📖 References


🧩 Additional Notes

  • Start simple: Begin with Docker Compose chaos tests before moving to Kubernetes scenarios.
  • Gradual failure injection: Test individual failure types before combining multiple failures.
  • Real-world scenarios: Focus on failures that actually occur in production environments.
  • Automated validation: Every chaos experiment should include automated validation of expected behaviors.
  • Documentation: Document all failure scenarios and expected system responses for team knowledge.
  • Safety first: Always run chaos tests in isolated environments before production testing.
  • Monitoring integration: Chaos tests should integrate with your monitoring and alerting systems.

Chaos Engineering Best Practices:

  • Define steady-state behavior before introducing chaos
  • Minimize blast radius to avoid widespread impact
  • Automate experiments to run consistently and frequently
  • Build confidence gradually by starting with small experiments
  • Learn from failures and improve system resilience based on findings
  • Include business metrics in chaos experiment validation
  • Document and share learnings across the team

Labels

  • chore: Linting, formatting, dependency hygiene, or project maintenance chores
  • cicd: Issue with CI/CD process (GitHub Actions, scaffolding)
  • devops: DevOps activities (containers, automation, deployment, makefiles, etc.)
  • help wanted: Extra attention is needed
  • testing: Testing (unit, e2e, manual, automated, etc.)
  • triage: Issues / Features awaiting triage
