@deuszx deuszx commented Oct 8, 2025

Motivation

The Linera client needs to interact with multiple validator nodes efficiently. Previously, the
client would make individual requests to validators without:

  1. Performance tracking: No mechanism to prefer faster, more reliable validators
  2. Request deduplication: Concurrent requests for the same data would all hit the network, wasting
    bandwidth and validator resources
  3. Response caching: Repeated requests for the same data would always go to validators
  4. Load balancing: No rate limiting per validator, risking overload
  5. Resilience: No fallback mechanism when a validator is slow or unresponsive

This led to:

  • Unnecessary network traffic and validator load
  • Poor user experience with redundant waiting
  • No optimization based on validator performance
  • Risk of overwhelming validators with too many concurrent requests
  • No recovery mechanism when validators are slow

Proposal

This PR introduces ValidatorManager, a sophisticated request orchestration layer that provides
intelligent peer selection, request deduplication, caching, and performance-based routing.

Key Features

  1. Performance Tracking with Exponential Moving Averages (EMA)
  • Tracks latency, success rate, and current load for each validator
  • Uses configurable weights to compute a composite performance score
  • Intelligently selects the best available validator for each request
  • Weighted random selection from top performers to avoid hotspots
  2. Request Deduplication
  • Exact matching: Multiple concurrent requests for identical data are deduplicated
  • Subsumption-based matching: Smaller requests are satisfied by larger in-flight requests that
    contain the needed data (e.g., a request for blocks 10-12 can be satisfied by an in-flight request
    for blocks 10-20)
  • Broadcast mechanism ensures all waiting requesters receive the result when the request completes
  • Timeout handling: requests that have been in flight for more than 200ms are no longer used for
    deduplication, so fresh attempts can be made
  3. Response Caching
  • Successfully completed requests are cached with configurable TTL (default: 2 seconds)
  • LRU eviction when cache reaches maximum size (default: 1000 entries)
  • Works with both exact and subsumption matching
  • Only successful results are cached
  4. Slot-Based Rate Limiting
  • Each validator has a maximum concurrent request limit (default: 100)
  • Async await mechanism: requests wait for available slots without polling
  • Prevents overloading individual validators
  • Automatic slot release on request completion
  5. Alternative Peer Handling
  • When multiple callers request the same data, they register as "alternative peers"
  • If the original request times out (>200ms), any alternative peer can complete the request
  • The result is broadcast to all waiting requesters
  • Provides resilience against slow validators
  6. Modular Architecture

Created a new validator_manager module with clear separation of concerns:

  validator_manager/
  ├── mod.rs              - Module exports and constants
  ├── manager.rs          - ValidatorManager orchestration logic
  ├── in_flight_tracker.rs - In-flight request tracking and deduplication
  ├── node_info.rs        - Per-validator performance tracking
  ├── request.rs          - Request types and result extraction
  └── scoring.rs          - Configurable scoring weights
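
To make the subsumption-based deduplication above concrete, here is a minimal sketch of the range-containment check that would let a small request join a larger in-flight one. All type and function names here are illustrative, not the PR's actual implementation:

```rust
use std::ops::RangeInclusive;

/// A hypothetical key for a "download certificates"-style request:
/// a chain identifier plus an inclusive block-height range.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RangeRequest {
    chain_id: u64,
    heights: RangeInclusive<u64>,
}

impl RangeRequest {
    /// Returns true if an in-flight `self` already covers `other`, so `other`
    /// can wait for `self`'s broadcast result instead of hitting the network.
    fn subsumes(&self, other: &RangeRequest) -> bool {
        self.chain_id == other.chain_id
            && self.heights.start() <= other.heights.start()
            && self.heights.end() >= other.heights.end()
    }
}

fn main() {
    let in_flight = RangeRequest { chain_id: 1, heights: 10..=20 };
    let small = RangeRequest { chain_id: 1, heights: 10..=12 };
    let other_chain = RangeRequest { chain_id: 2, heights: 10..=12 };

    assert!(in_flight.subsumes(&small)); // blocks 10-12 are covered by 10-20
    assert!(in_flight.subsumes(&in_flight)); // exact matching is a special case
    assert!(!in_flight.subsumes(&other_chain)); // different chain: no match
    assert!(!small.subsumes(&in_flight)); // a smaller request cannot serve a larger one
    println!("subsumption checks passed");
}
```

Under this scheme, exact matching falls out for free: a request trivially subsumes an identical one.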

API

High-level APIs:

  // Execute with best available validator
  manager.with_best(request_key, |peer| async {
      peer.download_certificates(chain_id, start, limit).await
  }).await

  // Execute with specific validator
  manager.with_peer(request_key, peer, |peer| async {
      peer.download_blob(blob_id).await
  }).await

Configuration:

  let manager = ValidatorManager::with_config(
      validator_nodes,
      max_requests_per_node: 100,
      weights: ScoringWeights { latency: 0.4, success: 0.4, load: 0.2 },
      alpha: 0.1,              // EMA smoothing factor
      max_expected_latency_ms: 5000.0,
      cache_ttl: Duration::from_secs(2),
      max_cache_size: 1000,
  );
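
A minimal sketch of how the configuration parameters above could feed into EMA-based scoring (field and function names are illustrative, not the PR's actual code):

```rust
/// Illustrative per-validator statistics, smoothed with an EMA.
struct NodeStats {
    ema_latency_ms: f64, // exponentially smoothed request latency
    ema_success: f64,    // smoothed success rate in [0, 1]
    in_flight: usize,    // current number of outstanding requests
}

struct ScoringWeights {
    latency: f64,
    success: f64,
    load: f64,
}

impl NodeStats {
    /// Standard EMA update: new = alpha * sample + (1 - alpha) * old.
    fn record(&mut self, latency_ms: f64, success: bool, alpha: f64) {
        self.ema_latency_ms = alpha * latency_ms + (1.0 - alpha) * self.ema_latency_ms;
        let sample = if success { 1.0 } else { 0.0 };
        self.ema_success = alpha * sample + (1.0 - alpha) * self.ema_success;
    }

    /// Composite score in [0, 1]; higher is better. Latency is normalized
    /// against the configured worst case, load against the per-node slot limit.
    fn score(&self, w: &ScoringWeights, max_expected_latency_ms: f64, max_slots: usize) -> f64 {
        let latency_term = 1.0 - (self.ema_latency_ms / max_expected_latency_ms).min(1.0);
        let load_term = 1.0 - (self.in_flight as f64 / max_slots as f64).min(1.0);
        w.latency * latency_term + w.success * self.ema_success + w.load * load_term
    }
}

fn main() {
    let w = ScoringWeights { latency: 0.4, success: 0.4, load: 0.2 };
    let mut fast = NodeStats { ema_latency_ms: 100.0, ema_success: 1.0, in_flight: 0 };
    let slow = NodeStats { ema_latency_ms: 4000.0, ema_success: 0.5, in_flight: 80 };

    assert!(fast.score(&w, 5000.0, 100) > slow.score(&w, 5000.0, 100));

    // With alpha = 0.1, a single slow failure drags the score down only gradually.
    fast.record(5000.0, false, 0.1);
    assert!(fast.score(&w, 5000.0, 100) > slow.score(&w, 5000.0, 100));
    println!("scoring checks passed");
}
```

A small alpha makes the EMA slow to react, so one outlier request does not immediately demote an otherwise healthy validator.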

Benefits

  • Reduced network load: Deduplication and caching eliminate redundant requests
  • Better performance: Intelligent peer selection routes to fastest validators
  • Improved reliability: Alternative peer mechanism provides resilience
  • Protection for validators: Rate limiting prevents overload
  • Efficient resource usage: EMA-based scoring optimizes validator selection
  • Clean architecture: Modular design makes code maintainable and testable

Metrics

In production usage, this should significantly reduce:

  • Network traffic between clients and validators
  • Validator CPU/memory usage from redundant requests
  • Client request latency through caching and smart routing
  • Failed requests through performance tracking and rate limiting

The following Prometheus metrics have been added (available when compiled with --features metrics):

  • validator_manager_response_time_ms - Response time for requests to validators in milliseconds
  • validator_manager_request_total - Total number of requests made to each validator
  • validator_manager_request_success - Number of successful requests to each validator; the error rate can be derived as (validator_manager_request_total - validator_manager_request_success) / validator_manager_request_total
  • validator_manager_request_deduplication_total - Number of requests that were deduplicated by joining an in-flight request
  • validator_manager_request_cache_hit_total - Number of requests that were served from cache
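
For example, a per-validator error rate could be computed from the counters above with a PromQL expression along these lines (assuming the counters carry a per-validator label):

```
1 - (
  rate(validator_manager_request_success[5m])
  / rate(validator_manager_request_total[5m])
)
```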

Test Plan

Existing CI ensures backwards compatibility is maintained. Unit tests have been added for the new modules.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.
