@deuszx deuszx commented Oct 8, 2025

Motivation

The Linera client needs to interact with multiple validator nodes efficiently. Previously, the
client would make individual requests to validators without:

  1. Performance tracking: No mechanism to prefer faster, more reliable validators
  2. Request deduplication: Concurrent requests for the same data would all hit the network, wasting
    bandwidth and validator resources
  3. Response caching: Repeated requests for the same data would always go to validators
  4. Load balancing: No rate limiting per validator, risking overload
  5. Resilience: No fallback mechanism when a validator is slow or unresponsive

This led to:

  • Unnecessary network traffic and validator load
  • Poor user experience with redundant waiting
  • No optimization based on validator performance
  • Risk of overwhelming validators with too many concurrent requests
  • No recovery mechanism when validators are slow

Proposal

This PR introduces ValidatorManager, a sophisticated request orchestration layer that provides
intelligent peer selection, request deduplication, caching, and performance-based routing.

Key Features

  1. Performance Tracking with Exponential Moving Averages (EMA)
  • Tracks latency, success rate, and current load for each validator
  • Uses configurable weights to compute a composite performance score
  • Intelligently selects the best available validator for each request
  • Weighted random selection from top performers to avoid hotspots
  2. Request Deduplication
  • Exact matching: Multiple concurrent requests for identical data are deduplicated
  • Subsumption-based matching: Smaller requests are satisfied by larger in-flight requests that
    contain the needed data (e.g., a request for blocks 10-12 can be satisfied by an in-flight request
    for blocks 10-20)
  • Broadcast mechanism ensures all waiting requesters receive the result when the request completes
  • Timeout handling: requests that have been in flight for more than 200ms are no longer used for
    deduplication, so fresh attempts can be made
  3. Response Caching
  • Successfully completed requests are cached with configurable TTL (default: 2 seconds)
  • LRU eviction when cache reaches maximum size (default: 1000 entries)
  • Works with both exact and subsumption matching
  • Only successful results are cached
  4. Slot-Based Rate Limiting
  • Each validator has a maximum concurrent request limit (default: 100)
  • Async await mechanism: requests wait for available slots without polling
  • Prevents overloading individual validators
  • Automatic slot release on request completion
  5. Alternative Peer Handling
  • When multiple callers request the same data, they register as "alternative peers"
  • If the original request times out (>200ms), any alternative peer can complete the request
  • The result is broadcast to all waiting requesters
  • Provides resilience against slow validators
  6. Modular Architecture

Created a new validator_manager module with clear separation of concerns:

  validator_manager/
  ├── mod.rs              - Module exports and constants
  ├── manager.rs          - ValidatorManager orchestration logic
  ├── in_flight_tracker.rs - In-flight request tracking and deduplication
  ├── node_info.rs        - Per-validator performance tracking
  ├── request.rs          - Request types and result extraction
  └── scoring.rs          - Configurable scoring weights
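
To make the subsumption-based deduplication above concrete, here is a minimal sketch of the range-containment check that would let a small request join a larger in-flight one. All type and function names here are illustrative, not the PR's actual implementation:

```rust
use std::ops::RangeInclusive;

/// A hypothetical key for a "download certificates"-style request:
/// a chain identifier plus an inclusive block-height range.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RangeRequest {
    chain_id: u64,
    heights: RangeInclusive<u64>,
}

impl RangeRequest {
    /// Returns true if an in-flight `self` already covers `other`, so `other`
    /// can wait for `self`'s broadcast result instead of hitting the network.
    fn subsumes(&self, other: &RangeRequest) -> bool {
        self.chain_id == other.chain_id
            && self.heights.start() <= other.heights.start()
            && self.heights.end() >= other.heights.end()
    }
}

fn main() {
    let in_flight = RangeRequest { chain_id: 1, heights: 10..=20 };
    let small = RangeRequest { chain_id: 1, heights: 10..=12 };
    let other_chain = RangeRequest { chain_id: 2, heights: 10..=12 };

    assert!(in_flight.subsumes(&small)); // blocks 10-12 are covered by 10-20
    assert!(in_flight.subsumes(&in_flight)); // exact matching is a special case
    assert!(!in_flight.subsumes(&other_chain)); // different chain: no match
    assert!(!small.subsumes(&in_flight)); // a smaller request cannot serve a larger one
    println!("subsumption checks passed");
}
```

Under this scheme, exact matching falls out for free: a request trivially subsumes an identical one.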

API

High-level APIs:

  // Execute with best available validator
  manager.with_best(request_key, |peer| async {
      peer.download_certificates(chain_id, start, limit).await
  }).await

  // Execute with specific validator
  manager.with_peer(request_key, peer, |peer| async {
      peer.download_blob(blob_id).await
  }).await

Configuration:

  let manager = ValidatorManager::with_config(
      validator_nodes,
      max_requests_per_node: 100,
      weights: ScoringWeights { latency: 0.4, success: 0.4, load: 0.2 },
      alpha: 0.1,              // EMA smoothing factor
      max_expected_latency_ms: 5000.0,
      cache_ttl: Duration::from_secs(2),
      max_cache_size: 1000,
  );
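
A minimal sketch of how the configuration parameters above could feed into EMA-based scoring (field and function names are illustrative, not the PR's actual code):

```rust
/// Illustrative per-validator statistics, smoothed with an EMA.
struct NodeStats {
    ema_latency_ms: f64, // exponentially smoothed request latency
    ema_success: f64,    // smoothed success rate in [0, 1]
    in_flight: usize,    // current number of outstanding requests
}

struct ScoringWeights {
    latency: f64,
    success: f64,
    load: f64,
}

impl NodeStats {
    /// Standard EMA update: new = alpha * sample + (1 - alpha) * old.
    fn record(&mut self, latency_ms: f64, success: bool, alpha: f64) {
        self.ema_latency_ms = alpha * latency_ms + (1.0 - alpha) * self.ema_latency_ms;
        let sample = if success { 1.0 } else { 0.0 };
        self.ema_success = alpha * sample + (1.0 - alpha) * self.ema_success;
    }

    /// Composite score in [0, 1]; higher is better. Latency is normalized
    /// against the configured worst case, load against the per-node slot limit.
    fn score(&self, w: &ScoringWeights, max_expected_latency_ms: f64, max_slots: usize) -> f64 {
        let latency_term = 1.0 - (self.ema_latency_ms / max_expected_latency_ms).min(1.0);
        let load_term = 1.0 - (self.in_flight as f64 / max_slots as f64).min(1.0);
        w.latency * latency_term + w.success * self.ema_success + w.load * load_term
    }
}

fn main() {
    let w = ScoringWeights { latency: 0.4, success: 0.4, load: 0.2 };
    let mut fast = NodeStats { ema_latency_ms: 100.0, ema_success: 1.0, in_flight: 0 };
    let slow = NodeStats { ema_latency_ms: 4000.0, ema_success: 0.5, in_flight: 80 };

    assert!(fast.score(&w, 5000.0, 100) > slow.score(&w, 5000.0, 100));

    // With alpha = 0.1, a single slow failure drags the score down only gradually.
    fast.record(5000.0, false, 0.1);
    assert!(fast.score(&w, 5000.0, 100) > slow.score(&w, 5000.0, 100));
    println!("scoring checks passed");
}
```

A small alpha makes the EMA slow to react, so one outlier request does not immediately demote an otherwise healthy validator.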

Benefits

  • Reduced network load: Deduplication and caching eliminate redundant requests
  • Better performance: Intelligent peer selection routes to fastest validators
  • Improved reliability: Alternative peer mechanism provides resilience
  • Protection for validators: Rate limiting prevents overload
  • Efficient resource usage: EMA-based scoring optimizes validator selection
  • Clean architecture: Modular design makes code maintainable and testable

Metrics

In production usage, this should significantly reduce:

  • Network traffic between clients and validators
  • Validator CPU/memory usage from redundant requests
  • Client request latency through caching and smart routing
  • Failed requests through performance tracking and rate limiting

The following Prometheus metrics have been added (available when compiled with --features metrics):

  • validator_manager_response_time_ms - Response time for requests to validators in milliseconds
  • validator_manager_request_total - Total number of requests made to each validator
  • validator_manager_request_success - Number of successful requests to each validator; the error rate can be derived as (validator_manager_request_total - validator_manager_request_success) / validator_manager_request_total
  • validator_manager_request_deduplication_total - Number of requests that were deduplicated by joining an in-flight request
  • validator_manager_request_cache_hit_total - Number of requests that were served from cache
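
For example, a per-validator error rate could be computed from the counters above with a PromQL expression along these lines (assuming the counters carry a per-validator label):

```
1 - (
  rate(validator_manager_request_success[5m])
  / rate(validator_manager_request_total[5m])
)
```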

Test Plan

Existing CI ensures backwards compatibility is maintained. Unit tests have been added for the new modules.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.
