@atchernych atchernych commented Nov 15, 2025

Overview:

A new EPP image has to be created starting with v0.7.0 due to an upstream change in Dynamo (it now requires glibc 2.38+).

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added dynamic worker selection capability using Dynamo-backed routing for improved request scheduling.
    • Introduced KV-aware scoring for enhanced request distribution across workers.
    • New scheduling plugins to optimize worker selection and request handling.
  • Infrastructure

    • Updated build configuration to support Dynamo-enabled EPP deployment options.
    • Added new deployment image build targets for Dynamo-backed inference gateway.

Signed-off-by: Anna Tchernych <[email protected]>
@atchernych atchernych requested a review from a team as a code owner November 15, 2025 01:41
@github-actions github-actions bot added the feat label Nov 15, 2025
coderabbitai bot commented Nov 15, 2025

Walkthrough

The pull request updates the EPP Dynamo build process with a new v0.7.0 patch that introduces Dynamo-based routing enhancements. These include new pre-request and KV-scorer plugins, YAML configuration, and modifications to BBR server logic to support worker ID tracking and injection.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build Configuration<br>`deploy/inference-gateway/build-epp-dynamo.sh` | Updated the patch file path from v0.5.1-2 to v0.7.0 for Dynamo patch application |
| Dynamo Patch Bundle<br>`deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch` | New comprehensive patch introducing the Dynamo-based EPP routing system with a worker selection pipeline and header-based worker ID tracking |
| Plugin Implementation<br>`pkg/plugins/dynamo_inject_workerid/...`, `pkg/plugins/dynamo_kv_scorer/...` | New pre-request plugin for injecting worker-id/token headers and new KV-scorer plugin with FFI-backed Dynamo LLM runtime integration and OpenAI-compatible request routing |
| Server & Configuration<br>`pkg/bbr/handlers/...`, `cmd/epp/main.go`, `epp-config-dynamo.yaml` | BBR request handler enhancements for per-request hints and header mutations; plugin registry updates; Dynamo scheduling profile configuration |
| Build Infrastructure<br>`Makefile` | New build targets for Dynamo EPP image variants and local build options |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • FFI/CFFI Integration Complexity: New Dynamo runtime initialization, worker selection pipeline management, and inter-language boundary calls require careful verification of memory safety and error handling patterns.
  • Plugin Lifecycle & State Management: Interaction between pre-request plugin and KV-scorer plugin for header injection, token encoding, and state mutation flows across request pipeline.
  • BBR Server Modifications: Request tracking, header pruning logic, and streaming/non-streaming mutation handling introduce new control flow branches needing thorough validation.
  • Configuration Wiring: YAML-based plugin registration and environment variable configuration parsing require validation against documented behavior.

Poem

🐰 Dynamo dust and worker threads dance,
New routing logic gets its chance,
Pre-request hooks inject with care,
KV scorers know what's fair,
BBR headers float through air! ✨

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description provides an overview but leaves critical template sections empty, failing to meet template requirements. | Complete the Details section with the specific changes and add guidance on which files reviewers should focus on (e.g., build-epp-dynamo.sh, epp-config-dynamo.yaml, the dynamo plugins). |
| Title check | ❓ Inconclusive | The title "feat: create new epp patch" is vague and generic; it does not convey the specific change (upgrading the Dynamo EPP patch to v0.7.0 for glibc 2.38+). | Revise the title to be more specific, e.g., "feat: upgrade EPP to v0.7.0 with Dynamo glibc 2.38+ support". |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (7)
deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch (6)

11-50: Dynamo image targets may invoke two builds per call and leak builder on failure

The new dynamo-image-local-build always invokes dynamo-image-build twice (once with PUSH=$(PUSH), once with LOAD=$(LOAD)), which means make dynamo-image-local-load or dynamo-image-local-push will likely perform two builds, and in the push case might push twice depending on how PUSH is propagated. Also, the temporary Buildx builder is only removed if all steps succeed; a failure in one dynamo-image-build run leaves the builder behind.

If this mirrors an existing pattern elsewhere, it’s acceptable, but you may want to:

  • Either call dynamo-image-build once with both PUSH and LOAD as appropriate for the target (e.g., only LOAD=--load for *-local-load, only PUSH=--push for *-local-push).
  • Or gate the second $(MAKE) dynamo-image-build ... call on the corresponding variable being non-empty, to avoid redundant builds and reduce Buildx resource usage.

99-141: Request body mutation and header preservation behavior change is intentional but worth validating

The updated HandleRequestBody now:

  • Injects backend_instance_id into data["nvext"] when s.workerIDHint is present.
  • Decodes base64-encoded JSON token data from s.tokenDataHint, unmarshals into []int64, and injects it as nvext.token_data.
  • For both streaming and non‑streaming cases, always re-serializes and returns the (possibly mutated) body instead of pass‑through, including when model is missing.
  • In the streaming case, keeps the original model header and optionally preserves the x-worker-instance-id header by appending a conditional header option and then pruning nil entries in SetHeaders.

This all looks logically consistent, with good guardrails (empty/invalid base64 or JSON are silently ignored). Two minor suggestions:

  • Consider logging at low verbosity when base64 decoding or JSON unmarshalling fails, so misconfigured callers can be diagnosed without changing behavior (a sketch follows this comment).
  • Consider centralizing the header key constants in the plugin package to avoid future divergence of the string literals.

Also applies to: 185-255
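A minimal sketch of the suggested low-verbosity logging on the body-mutation path, assuming the controller-runtime logger; the helper name injectDynamoHints and the logger wiring are illustrative, not taken from the patch:

```go
package handlers

import (
	"encoding/base64"
	"encoding/json"

	logf "sigs.k8s.io/controller-runtime/pkg/log"
)

// injectDynamoHints mirrors the behavior described above: it writes the
// worker-id hint into nvext.backend_instance_id and, when present, decodes the
// base64-encoded JSON token array into nvext.token_data. Invalid hints are
// still ignored, but now leave a low-verbosity log line behind.
func injectDynamoHints(data map[string]any, workerIDHint, tokenDataHint string) {
	log := logf.Log.WithName("bbr-request-body")

	if workerIDHint == "" {
		return
	}
	nvext, ok := data["nvext"].(map[string]any)
	if !ok {
		nvext = map[string]any{}
		data["nvext"] = nvext
	}
	nvext["backend_instance_id"] = workerIDHint

	if tokenDataHint == "" {
		return
	}
	raw, err := base64.StdEncoding.DecodeString(tokenDataHint)
	if err != nil {
		log.V(2).Info("ignoring token data hint: invalid base64", "error", err)
		return
	}
	var tokens []int64
	if err := json.Unmarshal(raw, &tokens); err != nil {
		log.V(2).Info("ignoring token data hint: invalid JSON", "error", err)
		return
	}
	nvext["token_data"] = tokens
}
```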


312-381: Pre-request plugin logic is straightforward; consider sharing header constants

InjectWorkerIDPreRequest:

  • Normalizes the x-worker-instance-id header, trimming whitespace and dropping empty values.
  • Writes the trimmed value back to WorkerIDHeader and also to the hint header so downstream can inject into nvext.
  • Passes through the token data header if present.

The basic behavior looks correct and defensive against missing headers. One small improvement: the header key strings are duplicated here and in pkg/bbr/handlers/request.go. A shared constants package (or reusing this exported WorkerIDHeader/TokenDataHeader in the BBR handlers) would reduce the risk of future string drift.
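One way to do that, as a sketch: export the header keys once from the plugin package and reuse them from pkg/bbr/handlers instead of re-declaring the literals. Only x-worker-instance-id is taken from the patch; the other values below are placeholders:

```go
package dynamo_inject_workerid

// Exported header keys so pkg/bbr/handlers and the KV scorer reference the
// same strings instead of keeping their own copies.
const (
	// WorkerIDHeader matches the header discussed in the review.
	WorkerIDHeader = "x-worker-instance-id"
	// TokenDataHeader and InjectHintHeader are placeholder values; use the
	// strings defined in the patch.
	TokenDataHeader  = "x-dynamo-token-data"
	InjectHintHeader = "x-dynamo-worker-id-hint"
)
```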


382-408: Config example wires plugins sensibly; check whether the pre-request plugin needs explicit profile wiring

The sample epp-config-dynamo.yaml:

  • Registers dynamo-inject-workerid (dyn-pre) and kv-aware-scorer (dyn-kv) in the plugins: section.
  • Uses dyn-kv and picker in the default scheduling profile.

Assuming the config schema runs PreRequest plugins based on the top-level plugins list and not the schedulingProfiles entries, this is fine. If pre-request plugins must be referenced explicitly in profiles or another section, dyn-pre may currently be defined but unused.

Worth double‑checking against the GAIE EPP config spec to ensure dyn-pre is actually invoked as intended.


413-603: FFI config and initialization are mostly robust; watch the panic pathway and logging

The FFI setup (loadDynamoConfig + initFFI) does:

  • Required DYNAMO_KV_BLOCK_SIZE with range and power‑of‑two validation, panicking on misconfig.
  • Optional env overrides for namespace, component, model, worker ID, and routing knobs.
  • warmupOnce wraps initFFI with a recover so panics during initial factory warmup turn into a clean plugin init error (see the sketch below).

A couple of observations:

  • loadDynamoConfig’s panic is only recovered when called through warmupOnce in KVAwareScorerFactory. If initFFI is ever invoked via another code path before factory warmup, it would panic the process. Today that doesn’t happen, but it’s worth keeping in mind for future refactors.
  • fmt.Printf for configuration messages is fine for early development; in production you may prefer using the controller‑runtime logger for consistency and log routing.

Functionally this looks sound; just be careful if you later reuse initFFI outside the current warmup flow.
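A sketch of the warmup pattern described above, with the panic-to-error conversion kept local to the warmup path. The names warmupOnce and DYNAMO_KV_BLOCK_SIZE follow the review text; the accepted range and the initFFI signature are illustrative assumptions:

```go
package dynamo_kv_scorer

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

var (
	warmupOnce sync.Once
	warmupErr  error
)

// kvBlockSizeFromEnv mirrors the required DYNAMO_KV_BLOCK_SIZE validation:
// the value must parse, fall inside a sane range, and be a power of two.
// It panics on misconfiguration, matching the behavior described above.
func kvBlockSizeFromEnv() int {
	raw := os.Getenv("DYNAMO_KV_BLOCK_SIZE")
	v, err := strconv.Atoi(raw)
	if err != nil || v < 1 || v > 4096 || v&(v-1) != 0 {
		panic(fmt.Sprintf("DYNAMO_KV_BLOCK_SIZE must be a power of two in [1,4096], got %q", raw))
	}
	return v
}

// warmup runs initFFI exactly once and converts any panic (including the one
// raised by kvBlockSizeFromEnv) into a plain error that the plugin factory can
// return, so a misconfiguration fails plugin init instead of the process.
func warmup(initFFI func(blockSize int) error) error {
	warmupOnce.Do(func() {
		defer func() {
			if r := recover(); r != nil {
				warmupErr = fmt.Errorf("dynamo FFI warmup panicked: %v", r)
			}
		}()
		warmupErr = initFFI(kvBlockSizeFromEnv())
	})
	return warmupErr
}
```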


685-726: Scorer behavior: uniform scores but side-effectful headers; consider logging noise

KVAwareScorer.Score:

  • Calls the Dynamo router once, writes the selected worker ID into CycleState and request headers, and attaches base64‑encoded token data.
  • Returns a uniform score of 1.0 for all pods.

This is intentional if the real routing decision happens outside the EPP scheduler (via the worker ID hint), but two small considerations:

  • You log the full tokenData slice at default verbosity. If these arrays are large or sensitive, consider logging only len(tokenData) and maybe the first N entries (see the sketch below).
  • The second if req.Headers == nil before writing TokenDataHeader is redundant since you already guard earlier, but harmless.
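For the logging point, a minimal sketch of trimming the logged payload; the function name and log fields are illustrative:

```go
package dynamo_kv_scorer

import "github.com/go-logr/logr"

// logTokenData logs only the length and a short prefix of the token data
// instead of the full slice, keeping log lines small at default verbosity.
func logTokenData(log logr.Logger, workerID string, tokenData []int64) {
	const previewLen = 8
	preview := tokenData
	if len(preview) > previewLen {
		preview = preview[:previewLen]
	}
	log.Info("dynamo router selection",
		"workerID", workerID,
		"tokenCount", len(tokenData),
		"tokenPreview", preview)
}
```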
deploy/inference-gateway/build-epp-dynamo.sh (1)

86-96: Patch path update is correct; consider aligning Dockerfile echo

Updating:

PATCH_FILE="${DYNAMO_DIR}/deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch"

matches the new patch location introduced in this repo and keeps the control flow unchanged (check existence → git apply --check → git apply). That looks good.

Minor nit (pre-existing): you copy Dockerfile.epp into ${GAIE_DIR}/Dockerfile.dynamo but later echo Docker: ${GAIE_DIR}/Dockerfile.epp. For clarity, you may want the echo to reference Dockerfile.dynamo.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2cf427e and efff31d.

📒 Files selected for processing (2)
  • deploy/inference-gateway/build-epp-dynamo.sh (1 hunks)
  • deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch (4)

55-76: Plugin registration wiring looks consistent

The registration of "dynamo-inject-workerid" and "kv-aware-scorer" against InjectWorkerIDPreRequestFactory and KVAwareScorerFactory matches the type strings used in the respective plugins and in epp-config-dynamo.yaml, so the wiring should resolve correctly at startup.


730-799: Router call error handling and memory management look correct

callDynamoRouter:

  • Ensures initFFI succeeds and runtimeInitialized is true before proceeding.
  • Checks currentPipeline for nil.
  • Marshals an OpenAI-style request body, converts it to a C string, and defers C.free.
  • Copies C token buffer into Go memory with unsafe.Slice and then calls dynamo_free_worker_selection_result to free C allocations.
  • Logs the result and returns worker ID plus []int64 token data.

The memory ownership and error checks all look correct. Nice job keeping the FFI boundary tight.
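The ownership pattern described above, as a self-contained cgo sketch. The C functions in the preamble are stand-ins defined here for illustration only; the real names and signatures come from the Dynamo FFI header:

```go
package main

/*
#include <stdlib.h>

// Stand-in for the Dynamo FFI: fills *out_tokens / *out_len on success.
// The caller owns the returned buffer and must free it.
static int select_worker(const char *request_json, long long **out_tokens, size_t *out_len) {
	*out_len = 3;
	*out_tokens = malloc(*out_len * sizeof(long long));
	for (size_t i = 0; i < *out_len; i++) (*out_tokens)[i] = (long long)i;
	return 0;
}
static void free_selection(long long *tokens) { free(tokens); }
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// routeOnce shows the boundary hygiene the review describes: the C string is
// freed via defer, the C token buffer is copied into Go-managed memory with
// unsafe.Slice, and the C allocation is released afterwards.
func routeOnce(requestJSON string) ([]int64, error) {
	creq := C.CString(requestJSON)
	defer C.free(unsafe.Pointer(creq))

	var ctokens *C.longlong
	var clen C.size_t
	if rc := C.select_worker(creq, &ctokens, &clen); rc != 0 {
		return nil, fmt.Errorf("select_worker failed: %d", rc)
	}
	defer C.free_selection(ctokens)

	// Copy out of C memory before it is freed.
	view := unsafe.Slice((*int64)(unsafe.Pointer(ctokens)), int(clen))
	tokens := make([]int64, len(view))
	copy(tokens, view)
	return tokens, nil
}

func main() {
	tokens, err := routeOnce(`{"model":"example"}`)
	fmt.Println(tokens, err)
}
```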


800-819: OpenAI-style request construction is simple and safe

buildOpenAIRequest falls back to sane defaults:

  • Uses req.Prompt if non-empty, otherwise "default prompt".
  • Uses req.TargetModel when present, else the configured ffiModel.
  • Sets max_tokens, temperature, stream, and a basic nvext.annotations.

This is fine as a starting point. If you later need to reflect caller options (e.g., temperature or max tokens), this is a good centralized place to extend.
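The fallback behavior as a small sketch; the specific default values and the nvext annotation are illustrative, not taken from the patch:

```go
package dynamo_kv_scorer

// buildRequestBody assembles an OpenAI-style body with the fallbacks described
// above: a default prompt when none is supplied, and the configured FFI model
// when the request carries no target model.
func buildRequestBody(prompt, targetModel, ffiModel string) map[string]any {
	if prompt == "" {
		prompt = "default prompt"
	}
	model := targetModel
	if model == "" {
		model = ffiModel
	}
	return map[string]any{
		"model":       model,
		"prompt":      prompt,
		"max_tokens":  1,   // illustrative default
		"temperature": 0.0, // illustrative default
		"stream":      false,
		"nvext": map[string]any{
			"annotations": []string{"kv-aware-routing"}, // illustrative annotation
		},
	}
}
```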


604-681: Shared pipeline pointer and cleanup could race if cleanupDynamo is called while routing

The FFI fields maintain a single *C.struct_WorkerSelectionPipeline plus pipelineMutex. Creation and destruction are guarded by the mutex, but callDynamoRouter does:

```go
pipelineMutex.RLock()
currentPipeline := pipeline
pipelineMutex.RUnlock()
// ...
rc := C.dynamo_query_worker_selection_and_annotate(currentPipeline, ...)
```

If cleanupDynamo is called concurrently, it can destroy and null out pipeline between the RUnlock and the FFI call, leaving currentPipeline pointing at freed memory. Right now cleanupDynamo doesn’t appear to be called from the patch, but if it’s hooked into shutdown while requests are still in flight, that becomes a use‑after‑free.

Safer options:

  • Hold the read lock for the duration of the FFI call (see the sketch below), or
  • Introduce a separate shutdown flag and only call cleanupDynamo after routing has quiesced.

Also, depending on the thread‑safety of WorkerSelectionPipeline on the Rust side, you may want to serialize calls to dynamo_query_worker_selection_and_annotate behind a mutex rather than allowing arbitrary concurrent FFI calls.

Also applies to: 823-841
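A sketch of the first option, using a stand-in handle type in place of the cgo pointer so the pattern is self-contained; the point is that holding the read lock across the call prevents cleanup from freeing the pipeline mid-flight:

```go
package dynamo_kv_scorer

import (
	"errors"
	"sync"
)

// pipelineHandle stands in for *C.struct_WorkerSelectionPipeline in this sketch.
type pipelineHandle struct{}

var (
	pipelineMutex sync.RWMutex
	pipeline      *pipelineHandle
)

var errPipelineUnavailable = errors.New("dynamo pipeline not initialized or already cleaned up")

// queryWithPipeline holds the read lock for the full duration of the FFI call,
// so cleanupDynamo (which takes the write lock before destroying the pipeline)
// can never free the handle while a router call is in flight.
func queryWithPipeline(query func(p *pipelineHandle) error) error {
	pipelineMutex.RLock()
	defer pipelineMutex.RUnlock()

	if pipeline == nil {
		return errPipelineUnavailable
	}
	return query(pipeline)
}

// cleanupDynamo takes the write lock, so it blocks until every in-flight
// queryWithPipeline call has released its read lock.
func cleanupDynamo(destroy func(p *pipelineHandle)) {
	pipelineMutex.Lock()
	defer pipelineMutex.Unlock()
	if pipeline != nil {
		destroy(pipeline)
		pipeline = nil
	}
}
```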

Comment on lines +262 to +285
```diff
+	// reset per-request
+	s.workerIDHint = ""
+	s.tokenDataHint = ""
+
+	if m := headers.GetHeaders(); m != nil {
+		for _, h := range m.GetHeaders() {
+			k := strings.ToLower(h.GetKey())
+
+			switch k {
+			case injectHintHeader, workerIDHeader:
+				if rv := h.GetRawValue(); len(rv) > 0 {
+					s.workerIDHint = strings.TrimSpace(string(rv))
+				} else {
+					s.workerIDHint = strings.TrimSpace(h.GetValue())
+				}
+			case tokenDataHeader:
+				if rv := h.GetRawValue(); len(rv) > 0 {
+					s.tokenDataHint = strings.TrimSpace(string(rv))
+				} else {
+					s.tokenDataHint = strings.TrimSpace(h.GetValue())
+				}
+			}
+		}
+	}
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Per-request hints stored on Server fields are likely racy across concurrent streams

HandleRequestHeaders now mutates s.workerIDHint and s.tokenDataHint, and HandleRequestBody reads those fields later to inject nvext.backend_instance_id and nvext.token_data. These hints are per-request, but Server itself is the gRPC service implementation and is typically shared across multiple concurrent Process streams.

This introduces two issues:

  • Data races on workerIDHint/tokenDataHint when multiple requests are processed concurrently.
  • Cross-request leakage, where headers from one stream can affect body mutation for another stream if their lifetimes overlap.

You likely need to make the hints part of the per-stream/per-request state instead of Server fields, e.g.:

  • Store them in a struct that lives per Process stream and is passed to HandleRequestHeaders/HandleRequestBody, or
  • Thread them via the context.Context or the data/headers payload, avoiding shared mutable state on Server.

Also applies to: 295-303


I'll need to examine the codebase to verify the concurrency concerns raised in the review comment. Let me search for the source files and understand the implementation.


---

🏁 Script executed:

```shell
#!/bin/bash
# Find the source files being patched by looking at the patch file and surrounding Go source
find . -type f -name "*.go" | head -20
```

Length of output: 1207


🏁 Script executed:

```shell
#!/bin/bash
# Search for Server struct definition and HandleRequestHeaders/HandleRequestBody methods
rg -n "type Server struct" --type=go -A 15
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Search for HandleRequestHeaders method
rg -n "HandleRequestHeaders" --type=go -B 2 -A 10
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Search for HandleRequestBody method
rg -n "HandleRequestBody" --type=go -B 2 -A 10
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Search for workerIDHint and tokenDataHint field declarations
rg -n "workerIDHint\|tokenDataHint" --type=go -B 3 -A 3
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Look for the patch file to understand context
cat deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch | head -100
```

Length of output: 3683


🏁 Script executed:

```shell
#!/bin/bash
# Search for Process stream method to understand concurrency model
rg -n "func.*Process.*stream" --type=go -B 2 -A 15
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Search for any sync.Mutex or other synchronization on Server
rg -n "sync\\.Mutex\|sync\\.RWMutex\|atomic" --type=go | grep -i server
```

Length of output: 0


I need to examine the patch file more thoroughly and locate the source files it's modifying. Let me search for the actual Go source being patched.


---

🏁 Script executed:

```shell
#!/bin/bash
# Find and examine the request.go file being patched
find . -name "request.go" -type f | xargs ls -la
```

Length of output: 3094


🏁 Script executed:

```shell
#!/bin/bash
# Read the full patch to identify the Server type being modified
cat deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch | grep -A 50 -B 5 "workerIDHint\|tokenDataHint" | head -200
```

Length of output: 6461


🏁 Script executed:

```shell
#!/bin/bash
# Search for gRPC Process service or stream handling
rg -n "Process.*context" --type=go -B 3 -A 10 | head -50
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Look for the source file at the path indicated in the patch
cat pkg/bbr/handlers/request.go 2>/dev/null | head -100
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Search for "ExtProcService" or other gRPC service implementations
rg -n "ExtProc\|Process.*Recv\|Process.*Send" --type=go -B 2 -A 5 | head -80
```

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Get more of the patch to understand the full context
tail -n +250 deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch | head -150

Length of output: 4804


Concurrency issue confirmed: per-request state must not be stored on shared Server struct

The patch adds workerIDHint and tokenDataHint string fields to the Server struct (in pkg/bbr/handlers/server.go). These fields are mutated in HandleRequestHeaders and read in HandleRequestBody without any synchronization.

Since the Server struct implements the gRPC ExternalProcessor service (its Process method is invoked once per client stream), multiple concurrent streams share the same Server instance. This creates a definitive data race:

  • Race condition: Thread handling stream A's HandleRequestHeaders overwrites s.workerIDHint while thread handling stream B's HandleRequestBody reads it.
  • Cross-request leakage: Headers from one request can bleed into another's body mutation if their call sequences overlap.

These hints must be moved to per-stream or per-request state—either stored in a context value, passed as method parameters, or managed in a per-stream struct separate from Server. If they remain on Server, they must be protected by synchronization (e.g., sync.Mutex).
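A sketch of the per-stream approach; the message and stream types below are simplified placeholders rather than the real Envoy ext_proc protobufs, and the handler signatures are illustrative:

```go
package handlers

import "io"

// streamState holds hints whose lifetime is a single ext_proc stream, so
// nothing request-scoped lives on the shared Server value.
type streamState struct {
	workerIDHint  string
	tokenDataHint string
}

// Placeholder message and stream types standing in for the Envoy ext_proc
// ProcessingRequest variants and the gRPC stream.
type message interface{ isMessage() }

type headersMessage struct{ headers map[string]string }
type bodyMessage struct{ body []byte }

func (headersMessage) isMessage() {}
func (bodyMessage) isMessage()    {}

type processStream interface {
	Recv() (message, error)
}

// Server no longer carries per-request fields.
type Server struct{}

// Process allocates a fresh streamState per stream and threads it through the
// handlers instead of mutating fields on the shared *Server.
func (s *Server) Process(stream processStream) error {
	state := &streamState{}
	for {
		msg, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		switch m := msg.(type) {
		case headersMessage:
			s.handleRequestHeaders(state, m) // writes state.workerIDHint / state.tokenDataHint
		case bodyMessage:
			s.handleRequestBody(state, m) // reads the hints from state, never from s
		}
	}
}

func (s *Server) handleRequestHeaders(state *streamState, m headersMessage) {
	state.workerIDHint = m.headers["x-worker-instance-id"]
}

func (s *Server) handleRequestBody(state *streamState, m bodyMessage) {
	_ = state.workerIDHint // inject into nvext here, scoped to this stream only
	_ = m.body
}
```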
