@natehofmann

Builds off liangjason87:liangjason/spec_decoding, but with a rough Rust layer. Haven't tried it yet; things are still all over the place. Keeping to the simple, non-streaming, non-LoRA case.

@mmoskal
Member

mmoskal commented Mar 28, 2025

I think the spec-decoding logic should sit in async_exec.rs and not be visible outside much.

  • disable logits_processor on the draft model for now
  • spin up two threads for responses from trtllm (you already do that)
  • in ReqData keep the current draft and real req id
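
A minimal sketch of the bookkeeping described in the list above, using only the standard library. Every name here (`Source`, `find_owner`, `collect_responses`, the field names in `ReqData`) is hypothetical and chosen for illustration; the actual `ReqData` in the PR and the trtllm responder API may look quite different. Vectors of ids stand in for the two executors' response streams.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Hypothetical id type; the real executor request id type is not shown in this thread.
type ReqId = u64;

/// Which executor a response came from.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    Draft,
    Real,
}

/// Per-request bookkeeping: the executor-side ids currently in flight for one user request.
#[derive(Debug, Default)]
struct ReqData {
    draft_req_id: Option<ReqId>,
    real_req_id: Option<ReqId>,
}

/// Map an executor-side id back to the user request (keyed by a client id) that owns it.
fn find_owner(reqs: &HashMap<u64, ReqData>, src: Source, id: ReqId) -> Option<u64> {
    reqs.iter()
        .find(|(_, d)| match src {
            Source::Draft => d.draft_req_id == Some(id),
            Source::Real => d.real_req_id == Some(id),
        })
        .map(|(client_id, _)| *client_id)
}

/// Spawn one responder thread per executor; both forward into a single channel
/// so one consumer can handle draft and real responses in arrival order.
fn collect_responses(draft_ids: Vec<ReqId>, real_ids: Vec<ReqId>) -> Vec<(Source, ReqId)> {
    let (tx, rx) = mpsc::channel();
    let tx_real = tx.clone();
    let draft = thread::spawn(move || {
        for id in draft_ids {
            tx.send((Source::Draft, id)).unwrap();
        }
    });
    let real = thread::spawn(move || {
        for id in real_ids {
            tx_real.send((Source::Real, id)).unwrap();
        }
    });
    draft.join().unwrap();
    real.join().unwrap();
    // Both senders were dropped when their threads finished, so this iterator terminates.
    rx.iter().collect()
}
```

Tagging each response with its `Source` is one way to let a single handler decide whether the next step is "enqueue in the real executor" or "send results to the user"; keeping two separate handlers would work as well.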

When a request from the user is added (assume n_draft_tokens=5):

  • enqueue request in draft executor, with max_tokens=5
  • when the draft responder thread gets the response (all 5 tokens), validate tokens against the constraint (may ignore for now), and enqueue new request in the real executor (max_tokens=1, though maybe that's implicit)
  • when the real responder thread gets the response (I guess it's going to be up to 6 tokens), send StepResults back to the user and enqueue new request in draft executor (if request not done)

This way you don't have to mess with completions.rs much.
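
The draft/verify round trip outlined above can be sketched as a synchronous loop. This is an illustration under stated assumptions, not the PR's implementation: closures stand in for the two executors, verification is simple greedy prefix matching (in the real setup the target executor does the verification itself), the constraint check on draft tokens is skipped, and all names (`verify`, `generate`, `N_DRAFT_TOKENS`) are hypothetical.

```rust
// Hypothetical stand-in for n_draft_tokens=5 from the comment above.
const N_DRAFT_TOKENS: usize = 5;

type Token = u32;

/// Real-model step: accept the longest draft prefix that matches what the real
/// model would produce, then emit one token of its own. Each call therefore
/// returns between 1 and draft.len() + 1 tokens.
fn verify(ctx: &[Token], draft: &[Token], real_next: impl Fn(&[Token]) -> Token) -> Vec<Token> {
    let mut ctx = ctx.to_vec();
    let mut out = Vec::new();
    for &t in draft {
        let want = real_next(ctx.as_slice());
        out.push(want);
        ctx.push(want);
        if want != t {
            // Mismatch: the real model's token replaces the draft token and the round ends.
            return out;
        }
    }
    // Every draft token was accepted; the real model adds one more.
    out.push(real_next(ctx.as_slice()));
    out
}

/// Alternate draft and real steps until `max_new` tokens exist.
/// Returns the generated tokens and the number of draft/verify rounds used.
fn generate(
    prompt: &[Token],
    max_new: usize,
    draft_next: impl Fn(&[Token]) -> Token,
    real_next: impl Fn(&[Token]) -> Token,
) -> (Vec<Token>, usize) {
    let mut ctx = prompt.to_vec();
    let mut rounds = 0;
    while ctx.len() - prompt.len() < max_new {
        // 1. "Draft executor": propose N_DRAFT_TOKENS tokens.
        let mut dctx = ctx.clone();
        let mut draft = Vec::new();
        for _ in 0..N_DRAFT_TOKENS {
            let t = draft_next(dctx.as_slice());
            draft.push(t);
            dctx.push(t);
        }
        // 2. "Real executor": verify the proposal.
        let accepted = verify(&ctx, &draft, &real_next);
        // 3. Hand the accepted tokens to the user, then start the next round.
        ctx.extend(accepted);
        rounds += 1;
    }
    ctx.truncate(prompt.len() + max_new);
    (ctx[prompt.len()..].to_vec(), rounds)
}
```

With a draft model that always agrees with the real one, each round yields 6 tokens; with one that never agrees, each round yields exactly 1, which is the worst case for this scheme.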

@natehofmann natehofmann force-pushed the nathof/spec-decode-support branch from 265cdd4 to 4dc658a on April 11, 2025 00:08