
Conversation

chenzhuofu
Collaborator

Description of changes:

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

zikun-li and others added 30 commits September 7, 2024 17:48
chenzhuofu and others added 28 commits November 29, 2024 09:36
@chenzhuofu
Collaborator Author

Main potential conflicts between two branches:

  • Inference application [inference.h]
    • input data format
  • Request Manager [request_manager.h]
  • Execution pipeline: split the prefill and decode stages
  • Batch Config [batch_config.h]
    • Use one class for all cases: incremental decoding, speculative decoding
    • Changed args: RequestInfo, TokenInfo, request_available, etc.
  • Ops
    • Attention
      • Changed args: rotary_embedding_meta, etc.
      • kv cache layout: NHD
      • Use AttentionMetaData to store flashinfer-related params
      • PageManager manages paged memory allocation for the KV cache and controls the metadata the attention kernel uses
    • Fused
      • Enable cudaGraph
  • Python interfaces haven't been checked yet
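
To illustrate the PageManager idea above, here is a minimal, self-contained C++ sketch of fixed-size page allocation for a paged KV cache. The class and member names are hypothetical (not taken from this PR's code); the real PageManager additionally ties page tables into the flashinfer attention metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: hands out fixed-size KV-cache pages per request.
// The attention kernel's metadata (e.g. a flashinfer-style page table)
// would be built from the per-request page lists this class maintains.
class PageManager {
public:
  PageManager(size_t num_pages, size_t tokens_per_page)
      : tokens_per_page_(tokens_per_page) {
    for (size_t i = 0; i < num_pages; ++i) {
      free_pages_.push_back(i);
    }
  }

  // Ensure `num_tokens` tokens of request `req` have backing pages;
  // returns false if the page pool is exhausted.
  bool reserve(int req, size_t num_tokens) {
    auto &pages = request_pages_[req];
    size_t needed = (num_tokens + tokens_per_page_ - 1) / tokens_per_page_;
    while (pages.size() < needed) {
      if (free_pages_.empty()) {
        return false;
      }
      pages.push_back(free_pages_.back());
      free_pages_.pop_back();
    }
    return true;
  }

  // Physical page indices for a request, in logical (token) order.
  const std::vector<size_t> &pages_of(int req) const {
    return request_pages_.at(req);
  }

  // Return all pages of a finished request to the free pool.
  void release(int req) {
    auto it = request_pages_.find(req);
    if (it == request_pages_.end()) {
      return;
    }
    for (size_t p : it->second) {
      free_pages_.push_back(p);
    }
    request_pages_.erase(it);
  }

  size_t num_free() const { return free_pages_.size(); }

private:
  size_t tokens_per_page_;
  std::vector<size_t> free_pages_;
  std::unordered_map<int, std::vector<size_t>> request_pages_;
};
```

Growing a request by whole pages (rather than reallocating a contiguous buffer) is what lets the incremental-decode and speculative-decode paths share one batch-config class: a request's KV cache extends page by page regardless of how many tokens each step appends.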



7 participants