feat: create new epp patch #4374
base: main
Conversation
Signed-off-by: Anna Tchernych <[email protected]>
Walkthrough
The pull request updates the EPP Dynamo build process with a new v0.7.0 patch that introduces Dynamo-based routing enhancements. These include new pre-request and KV-scorer plugins, YAML configuration, and modifications to BBR server logic to support worker ID tracking and injection.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (7)
deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch (6)
11-50: Dynamo image targets may invoke two builds per call and leak builder on failure
The new `dynamo-image-local-build` always invokes `dynamo-image-build` twice (once with `PUSH=$(PUSH)`, once with `LOAD=$(LOAD)`), which means `make dynamo-image-local-load` or `dynamo-image-local-push` will likely perform two builds, and in the push case might push twice depending on how `PUSH` is propagated. Also, the temporary Buildx builder is only removed if all steps succeed; a failure in one `dynamo-image-build` run leaves the builder behind.
If this mirrors an existing pattern elsewhere, it's acceptable, but you may want to:
- Either call `dynamo-image-build` once with both `PUSH` and `LOAD` as appropriate for the target (e.g., only `LOAD=--load` for `*-local-load`, only `PUSH=--push` for `*-local-push`).
- Or gate the second `$(MAKE) dynamo-image-build ...` call on the corresponding variable being non-empty, to avoid redundant builds and reduce Buildx resource usage.
99-141: Request body mutation and header preservation behavior change is intentional but worth validating
The updated `HandleRequestBody` now:
- Injects `backend_instance_id` into `data["nvext"]` when `s.workerIDHint` is present.
- Decodes base64-encoded JSON token data from `s.tokenDataHint`, unmarshals it into `[]int64`, and injects it as `nvext.token_data`.
- For both streaming and non-streaming cases, always re-serializes and returns the (possibly mutated) body instead of passing it through, including when `model` is missing.
- In the streaming case, keeps the original model header and optionally preserves the `x-worker-instance-id` header by appending a conditional header option and then pruning `nil` entries in `SetHeaders`.
This all looks logically consistent, with good guardrails (empty or invalid base64/JSON is silently ignored). Two minor suggestions (a sketch of the logging path appears after this comment):
- Consider logging at a low verbosity when base64 decoding or JSON unmarshalling fails, so misconfigured callers can be diagnosed without changing behavior.
- Consider centralizing the header key constants with the plugin package to avoid future divergence of string literals.
Also applies to: 185-255
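If you adopt the low-verbosity logging suggestion, the decode path could look roughly like this; the function name, logger wiring, and field names are assumptions for illustration, not the patch's exact code:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"

	"github.com/go-logr/logr"
)

// injectTokenData is a hedged sketch: decode the base64/JSON hint, log (at low
// verbosity) and ignore it on failure, otherwise inject it into nvext.token_data.
func injectTokenData(data map[string]any, tokenDataHint string, logger logr.Logger) {
	if tokenDataHint == "" {
		return // no hint: leave the body untouched
	}
	raw, err := base64.StdEncoding.DecodeString(tokenDataHint)
	if err != nil {
		logger.V(2).Info("ignoring token data hint: invalid base64", "error", err.Error())
		return
	}
	var tokens []int64
	if err := json.Unmarshal(raw, &tokens); err != nil {
		logger.V(2).Info("ignoring token data hint: invalid JSON", "error", err.Error())
		return
	}
	nvext, _ := data["nvext"].(map[string]any)
	if nvext == nil {
		nvext = map[string]any{}
		data["nvext"] = nvext
	}
	nvext["token_data"] = tokens
}

func main() {
	data := map[string]any{"model": "llama-3"}
	hint := base64.StdEncoding.EncodeToString([]byte("[1,2,3]"))
	injectTokenData(data, hint, logr.Discard())
	fmt.Println(data)
}
```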
312-381: Pre-request plugin logic is straightforward; consider sharing header constants
`InjectWorkerIDPreRequest`:
- Normalizes the `x-worker-instance-id` header, trimming whitespace and dropping empty values.
- Writes the trimmed value back to `WorkerIDHeader` and also to the hint header so downstream can inject into `nvext`.
- Passes through the token data header if present.
The basic behavior looks correct and defensive against missing headers. One small improvement: the header key strings are duplicated here and in `pkg/bbr/handlers/request.go`. A shared constants package (or reusing the exported `WorkerIDHeader`/`TokenDataHeader` from this plugin in the BBR handlers) would reduce the risk of future string drift; a sketch of such a package follows.
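A minimal sketch of what a shared constants package might look like; the package name and the values of the hint and token-data headers are assumptions (only `x-worker-instance-id` appears verbatim in the patch excerpt above):

```go
// Package dynheaders (hypothetical name) would be imported by both the
// pre-request plugin and pkg/bbr/handlers, so each header name lives in
// exactly one place.
package dynheaders

const (
	// WorkerIDHeader carries the worker instance selected by the KV-aware scorer.
	WorkerIDHeader = "x-worker-instance-id"
	// InjectHintHeader tells the BBR body handler to inject backend_instance_id
	// into nvext. Placeholder value; the real header name is defined in the patch.
	InjectHintHeader = "x-dynamo-inject-workerid"
	// TokenDataHeader carries base64-encoded JSON token data for nvext.token_data.
	// Placeholder value; the real header name is defined in the patch.
	TokenDataHeader = "x-dynamo-token-data"
)
```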
382-408: Config example wires plugins sensibly; check whether the pre-request plugin needs explicit profile wiring
The sample `epp-config-dynamo.yaml`:
- Registers `dynamo-inject-workerid` (`dyn-pre`) and `kv-aware-scorer` (`dyn-kv`) in the `plugins:` section.
- Uses `dyn-kv` and `picker` in the `default` scheduling profile.
Assuming the config schema runs `PreRequest` plugins based on the top-level `plugins` list and not the `schedulingProfiles` entries, this is fine. If pre-request plugins must be referenced explicitly in profiles or another section, `dyn-pre` may currently be defined but unused.
Worth double-checking against the GAIE EPP config spec to ensure `dyn-pre` is actually invoked as intended.
413-603: FFI config and initialization are mostly robust; watch the panic pathway and logging
The FFI setup (`loadDynamoConfig` + `initFFI`) does:
- Requires `DYNAMO_KV_BLOCK_SIZE`, with range and power-of-two validation, panicking on misconfiguration.
- Optional env overrides for namespace, component, model, worker ID, and routing knobs.
- `warmupOnce` wraps `initFFI` with a `recover` so panics during initial factory warmup turn into a clean plugin init error.
A couple of observations (a validation sketch follows below):
- `loadDynamoConfig`'s `panic` is only recovered when called through `warmupOnce` in `KVAwareScorerFactory`. If `initFFI` is ever invoked via another code path before factory warmup, it would panic the process. Today that doesn't happen, but it's worth keeping in mind for future refactors.
- `fmt.Printf` for configuration messages is fine for early development; in production you may prefer using the controller-runtime logger for consistency and log routing.
Functionally this looks sound; just be careful if you later reuse `initFFI` outside the current warmup flow.
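For illustration, here is a hedged sketch of how the `DYNAMO_KV_BLOCK_SIZE` validation could return an error instead of panicking, keeping `initFFI` safe to call from code paths that are not wrapped in `warmupOnce`'s recover; the bounds and function names are assumptions, not the patch's actual values:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// loadKVBlockSize validates the env var and reports misconfiguration as an
// error rather than panicking.
func loadKVBlockSize() (int, error) {
	raw := os.Getenv("DYNAMO_KV_BLOCK_SIZE")
	if raw == "" {
		return 0, fmt.Errorf("DYNAMO_KV_BLOCK_SIZE must be set")
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("DYNAMO_KV_BLOCK_SIZE %q is not an integer: %w", raw, err)
	}
	// Assumed bounds for illustration; the patch enforces its own range.
	if n < 1 || n > 4096 {
		return 0, fmt.Errorf("DYNAMO_KV_BLOCK_SIZE %d out of range [1, 4096]", n)
	}
	if n&(n-1) != 0 {
		return 0, fmt.Errorf("DYNAMO_KV_BLOCK_SIZE %d is not a power of two", n)
	}
	return n, nil
}

func main() {
	if size, err := loadKVBlockSize(); err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
	} else {
		fmt.Println("kv block size:", size)
	}
}
```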
685-726: Scorer behavior: uniform scores but side-effectful headers; consider logging noise
`KVAwareScorer.Score`:
- Calls the Dynamo router once, writes the selected worker ID into `CycleState` and request headers, and attaches base64-encoded token data.
- Returns a uniform score of `1.0` for all pods.
This is intentional if the real routing decision happens outside the EPP scheduler (via the worker ID hint), but two small considerations (a short logging sketch follows this comment):
- You log the full `tokenData` slice at default verbosity. If these arrays are large or sensitive, consider logging only `len(tokenData)` and maybe the first N entries.
- The second `if req.Headers == nil` check before writing `TokenDataHeader` is redundant since you already guard earlier, but harmless.
deploy/inference-gateway/build-epp-dynamo.sh (1)
86-96: Patch path update is correct; consider aligning Dockerfile echo
Updating `PATCH_FILE="${DYNAMO_DIR}/deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch"` matches the new patch location introduced in this repo and keeps the control flow unchanged (check existence → `git apply --check` → `git apply`). That looks good.
Minor nit (pre-existing): you copy `Dockerfile.epp` into `${GAIE_DIR}/Dockerfile.dynamo` but later echo `Docker: ${GAIE_DIR}/Dockerfile.epp`. For clarity, you may want the echo to reference `Dockerfile.dynamo`.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deploy/inference-gateway/build-epp-dynamo.sh (1 hunks)
- deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: operator (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch (4)
55-76: Plugin registration wiring looks consistent
The registration of `"dynamo-inject-workerid"` and `"kv-aware-scorer"` against `InjectWorkerIDPreRequestFactory` and `KVAwareScorerFactory` matches the type strings used in the respective plugins and in `epp-config-dynamo.yaml`, so the wiring should resolve correctly at startup.
730-799: Router call error handling and memory management look correct
`callDynamoRouter`:
- Ensures `initFFI` succeeds and `runtimeInitialized` is true before proceeding.
- Checks `currentPipeline` for nil.
- Marshals an OpenAI-style request body, converts it to a C string, and defers `C.free`.
- Copies the C token buffer into Go memory with `unsafe.Slice` and then calls `dynamo_free_worker_selection_result` to free the C allocations.
- Logs the result and returns the worker ID plus `[]int64` token data.
The memory ownership and error checks all look correct. Nice job keeping the FFI boundary tight. (A self-contained illustration of the copy-then-free pattern is sketched below.)
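For readers unfamiliar with the copy-then-free pattern at the cgo boundary, here is a self-contained sketch using a stand-in C allocator rather than the real Dynamo FFI symbols:

```go
package main

/*
#include <stdlib.h>

// Hypothetical stand-in for the Dynamo FFI call: allocates a token buffer the
// caller must free, mirroring the role of dynamo_free_worker_selection_result.
long long *make_tokens(int n) {
    long long *buf = malloc(n * sizeof(long long));
    for (int i = 0; i < n; i++) buf[i] = (long long)(i * 10);
    return buf;
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	const n = 4
	cbuf := C.make_tokens(C.int(n))
	// Copy the C-owned buffer into Go-managed memory before releasing it, the
	// same ownership pattern used around unsafe.Slice in callDynamoRouter.
	tokens := make([]int64, n)
	copy(tokens, unsafe.Slice((*int64)(unsafe.Pointer(cbuf)), n))
	C.free(unsafe.Pointer(cbuf)) // Go now owns `tokens`; the C buffer can go away.
	fmt.Println(tokens)
}
```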
800-819: OpenAI-style request construction is simple and safe
`buildOpenAIRequest` falls back to sane defaults:
- Uses `req.Prompt` if non-empty, otherwise `"default prompt"`.
- Uses `req.TargetModel` when present, else the configured `ffiModel`.
- Sets `max_tokens`, `temperature`, `stream`, and a basic `nvext.annotations`.
This is fine as a starting point. If you later need to reflect caller options (e.g., temperature or max tokens), this is a good centralized place to extend; a rough sketch of the fallback logic follows.
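A rough sketch of the fallback construction described above; the field names, default values, and request shape are assumptions rather than the patch's exact code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// routingRequest is a simplified stand-in for the scorer's request type.
type routingRequest struct {
	Prompt      string
	TargetModel string
}

// buildOpenAIRequest applies the prompt/model fallbacks and fills in the
// routing-only defaults before the body is marshalled for the FFI call.
func buildOpenAIRequest(req routingRequest, ffiModel string) map[string]any {
	prompt := req.Prompt
	if prompt == "" {
		prompt = "default prompt"
	}
	model := req.TargetModel
	if model == "" {
		model = ffiModel
	}
	return map[string]any{
		"model":       model,
		"prompt":      prompt,
		"max_tokens":  1,     // placeholder default
		"temperature": 0.0,   // placeholder default
		"stream":      false, // routing-only call, no streaming needed
		"nvext":       map[string]any{"annotations": []string{}},
	}
}

func main() {
	body, _ := json.Marshal(buildOpenAIRequest(routingRequest{TargetModel: "llama-3"}, "fallback-model"))
	fmt.Println(string(body))
}
```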
604-681: Shared pipeline pointer and cleanup could race if `cleanupDynamo` is called while routing
The FFI fields maintain a single `*C.struct_WorkerSelectionPipeline` plus `pipelineMutex`. Creation and destruction are guarded by the mutex, but `callDynamoRouter` does:
```go
pipelineMutex.RLock()
currentPipeline := pipeline
pipelineMutex.RUnlock()
// ...
rc := C.dynamo_query_worker_selection_and_annotate(currentPipeline, ...)
```
If `cleanupDynamo` is called concurrently, it can destroy and null out `pipeline` between the `RUnlock` and the FFI call, leaving `currentPipeline` pointing at freed memory. Right now `cleanupDynamo` doesn't appear to be called from the patch, but if it's hooked into shutdown while requests are still in flight, that becomes a use-after-free.
Safer options (a sketch of the first appears after this comment):
- Hold the read lock for the duration of the FFI call, or
- Introduce a separate `shutdown` flag and only call `cleanupDynamo` after routing has quiesced.
Also, depending on the thread-safety of `WorkerSelectionPipeline` on the Rust side, you may want to serialize calls to `dynamo_query_worker_selection_and_annotate` behind a mutex rather than allowing arbitrary concurrent FFI calls.
Also applies to: 823-841
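A minimal sketch of the first option (holding the read lock across the call), using a placeholder pipeline type and query function in place of the real cgo symbols:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type pipelineHandle struct{ id int } // stand-in for *C.struct_WorkerSelectionPipeline

var (
	pipelineMu sync.RWMutex
	pipeline   *pipelineHandle = &pipelineHandle{id: 1}
)

// queryWorkerSelection is a placeholder for the FFI routing call.
func queryWorkerSelection(p *pipelineHandle) (string, error) {
	return fmt.Sprintf("worker-from-pipeline-%d", p.id), nil
}

func callDynamoRouter() (string, error) {
	// Holding the read lock for the whole call means cleanupDynamo (which takes
	// the write lock) cannot free the pipeline while it is still in use.
	pipelineMu.RLock()
	defer pipelineMu.RUnlock()
	if pipeline == nil {
		return "", errors.New("pipeline not initialized")
	}
	return queryWorkerSelection(pipeline)
}

func cleanupDynamo() {
	pipelineMu.Lock()
	defer pipelineMu.Unlock()
	pipeline = nil // real code would also free the C-side pipeline here
}

func main() {
	worker, err := callDynamoRouter()
	fmt.Println(worker, err)
	cleanupDynamo()
	_, err = callDynamoRouter()
	fmt.Println(err)
}
```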
```diff
+	// reset per-request
+	s.workerIDHint = ""
+	s.tokenDataHint = ""
+
+	if m := headers.GetHeaders(); m != nil {
+		for _, h := range m.GetHeaders() {
+			k := strings.ToLower(h.GetKey())
+
+			switch k {
+			case injectHintHeader, workerIDHeader:
+				if rv := h.GetRawValue(); len(rv) > 0 {
+					s.workerIDHint = strings.TrimSpace(string(rv))
+				} else {
+					s.workerIDHint = strings.TrimSpace(h.GetValue())
+				}
+			case tokenDataHeader:
+				if rv := h.GetRawValue(); len(rv) > 0 {
+					s.tokenDataHint = strings.TrimSpace(string(rv))
+				} else {
+					s.tokenDataHint = strings.TrimSpace(h.GetValue())
+				}
+			}
+		}
+	}
```
🧩 Analysis chain
Per-request hints stored on Server fields are likely racy across concurrent streams
HandleRequestHeaders now mutates s.workerIDHint and s.tokenDataHint, and HandleRequestBody reads those fields later to inject nvext.backend_instance_id and nvext.token_data. These hints are per-request, but Server itself is the gRPC service implementation and is typically shared across multiple concurrent Process streams.
This introduces two issues (see the context-passing sketch after this comment):
- Data races on `workerIDHint`/`tokenDataHint` when multiple requests are processed concurrently.
- Cross-request leakage, where headers from one stream can affect body mutation for another stream if their lifetimes overlap.
You likely need to make the hints part of the per-stream/per-request state instead of Server fields, e.g.:
- Store them in a struct that lives per `Process` stream and is passed to `HandleRequestHeaders`/`HandleRequestBody`, or
- Thread them via the `context.Context` or the `data`/headers payload, avoiding shared mutable state on `Server`.
Also applies to: 295-303
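A hedged sketch of the `context.Context` option; the type names and context key are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
)

// requestHints holds the per-request values that currently live on Server.
type requestHints struct {
	WorkerID  string
	TokenData string
}

type hintsKey struct{}

// withHints attaches per-request hints to the context (e.g. in HandleRequestHeaders).
func withHints(ctx context.Context, h requestHints) context.Context {
	return context.WithValue(ctx, hintsKey{}, h)
}

// hintsFrom retrieves them later (e.g. in HandleRequestBody); the zero value
// means no hints were set for this request.
func hintsFrom(ctx context.Context) requestHints {
	h, _ := ctx.Value(hintsKey{}).(requestHints)
	return h
}

func main() {
	ctx := withHints(context.Background(), requestHints{WorkerID: "worker-42"})
	fmt.Printf("%+v\n", hintsFrom(ctx))
}
```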
I'll need to examine the codebase to verify the concurrency concerns raised in the review comment. Let me search for the source files and understand the implementation.
---
🏁 Script executed:
```shell
#!/bin/bash
# Find the source files being patched by looking at the patch file and surrounding Go source
find . -type f -name "*.go" | head -20
```
Length of output: 1207

🏁 Script executed:
```shell
#!/bin/bash
# Search for Server struct definition and HandleRequestHeaders/HandleRequestBody methods
rg -n "type Server struct" --type=go -A 15
```
Length of output: 0

🏁 Script executed:
```shell
#!/bin/bash
# Search for HandleRequestHeaders method
rg -n "HandleRequestHeaders" --type=go -B 2 -A 10
```
Length of output: 0

🏁 Script executed:
```shell
#!/bin/bash
# Search for HandleRequestBody method
rg -n "HandleRequestBody" --type=go -B 2 -A 10
```
Length of output: 0

🏁 Script executed:
```shell
#!/bin/bash
# Search for workerIDHint and tokenDataHint field declarations
rg -n "workerIDHint\|tokenDataHint" --type=go -B 3 -A 3
```
Length of output: 0

🏁 Script executed:
```shell
#!/bin/bash
# Look for the patch file to understand context
cat deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch | head -100
```
Length of output: 3683

🏁 Script executed:
```shell
#!/bin/bash
# Search for Process stream method to understand concurrency model
rg -n "func.*Process.*stream" --type=go -B 2 -A 15
```
Length of output: 0

🏁 Script executed:
```shell
#!/bin/bash
# Search for any sync.Mutex or other synchronization on Server
rg -n "sync\\.Mutex\|sync\\.RWMutex\|atomic" --type=go | grep -i server
```
Length of output: 0
I need to examine the patch file more thoroughly and locate the source files it's modifying. Let me search for the actual Go source being patched.
---
🏁 Script executed:
```shell
#!/bin/bash
# Find and examine the request.go file being patched
find . -name "request.go" -type f | xargs ls -la
```
Length of output: 3094

🏁 Script executed:
```shell
#!/bin/bash
# Read the full patch to identify the Server type being modified
cat deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch | grep -A 50 -B 5 "workerIDHint\|tokenDataHint" | head -200
```
Length of output: 6461

🏁 Script executed:
```shell
#!/bin/bash
# Search for gRPC Process service or stream handling
rg -n "Process.*context" --type=go -B 3 -A 10 | head -50
```
Length of output: 0

🏁 Script executed:
```shell
#!/bin/bash
# Look for the source file at the path indicated in the patch
cat pkg/bbr/handlers/request.go 2>/dev/null | head -100
```
Length of output: 0

🏁 Script executed:
```shell
#!/bin/bash
# Search for "ExtProcService" or other gRPC service implementations
rg -n "ExtProc\|Process.*Recv\|Process.*Send" --type=go -B 2 -A 5 | head -80
```
Length of output: 0

🏁 Script executed:
```shell
#!/bin/bash
# Get more of the patch to understand the full context
tail -n +250 deploy/inference-gateway/epp-patches/v0.7.0/epp-v0.7.0-dyn.patch | head -150
```
Length of output: 4804
Concurrency issue confirmed: per-request state must not be stored on the shared Server struct
The patch adds workerIDHint and tokenDataHint string fields to the Server struct (in pkg/bbr/handlers/server.go). These fields are mutated in HandleRequestHeaders and read in HandleRequestBody without any synchronization.
Since the Server struct implements the gRPC ExternalProcessor service (its Process method is called once per client stream), multiple concurrent streams share the same Server instance. This creates a definitive data race:
- Race condition: the thread handling stream A's HandleRequestHeaders overwrites s.workerIDHint while the thread handling stream B's HandleRequestBody reads it.
- Cross-request leakage: headers from one request can bleed into another's body mutation if their call sequences overlap.
These hints must be moved to per-stream or per-request state: either stored in a context value, passed as method parameters, or managed in a per-stream struct separate from Server. If they remain on Server, they must be protected by synchronization (e.g., sync.Mutex). A per-stream sketch follows.
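A hedged sketch of the per-stream option; the types and handler signatures are simplified stand-ins for the real ext_proc server code:

```go
package main

import "fmt"

// streamState holds hints for exactly one Process stream, so values from one
// request can never leak into another.
type streamState struct {
	workerIDHint  string
	tokenDataHint string
}

type server struct{} // shared across streams; holds no per-request fields

func (s *server) handleRequestHeaders(st *streamState, headers map[string]string) {
	st.workerIDHint = headers["x-worker-instance-id"]
}

func (s *server) handleRequestBody(st *streamState, body map[string]any) {
	if st.workerIDHint != "" {
		body["nvext"] = map[string]any{"backend_instance_id": st.workerIDHint}
	}
}

func main() {
	srv := &server{}
	// Each Process stream allocates its own state; no shared mutation on srv.
	st := &streamState{}
	srv.handleRequestHeaders(st, map[string]string{"x-worker-instance-id": "worker-42"})
	body := map[string]any{"model": "llama-3"}
	srv.handleRequestBody(st, body)
	fmt.Println(body)
}
```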
Overview:
We have to create a new EPP image starting with 0.7.0 due to an upstream change in Dynamo (new glibc, 2.38+).
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Infrastructure