
Conversation


@kozistr commented Jun 13, 2025

Note

I'm working on this in my repository, candle-moe. Currently, only the topk_softmax kernel is verified; the other kernels (moe_sum, moe_align_block_size) still need proper testing, and moe_wna16_gemm is under heavy development.
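For context, verification here presumably boils down to comparing the kernel output against a naive CPU reference: a per-row softmax over the expert logits followed by top-k selection. The sketch below assumes the kernel takes router logits of shape [num_tokens, num_experts]; it is illustrative Rust only, not the actual candle-moe API.

```rust
/// Naive CPU reference for topk_softmax: a numerically stable softmax per
/// row of expert logits, then the k largest (expert index, probability)
/// pairs. A test could compare the CUDA kernel's output against this on
/// random inputs. (Sketch only; not taken from candle-moe.)
fn topk_softmax_reference(
    logits: &[f32],    // flattened [num_tokens, num_experts]
    num_experts: usize,
    top_k: usize,
) -> Vec<Vec<(usize, f32)>> {
    logits
        .chunks(num_experts)
        .map(|row| {
            // numerically stable softmax: subtract the row max before exp
            let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
            let sum: f32 = exps.iter().sum();

            // pair each probability with its expert index and keep the k largest
            let mut probs: Vec<(usize, f32)> =
                exps.iter().map(|&e| e / sum).enumerate().collect();
            probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
            probs.truncate(top_k);
            probs
        })
        .collect()
}
```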

Actually, it's my first time dealing with CUDA and Rust FFI, so there may be some rough edges.

I pinned the candle version to 0.8.0 because of cudarc. IIRC, the latest candle release (0.9.0) depends on cudarc 0.16, which introduces some API changes (e.g. device_ptr).

Please feel free to leave any comments or feedback :)

Performance

I haven't profiled its performance with GPU event timers, only wall-clock time. It looks like a large improvement (roughly 4x to 10x faster, depending on the environment) over the naive candle implementation.
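For reference, wall-clock timing here means something like the sketch below: plain std::time::Instant around repeated calls, with the kernel invocation left as a placeholder closure. A proper benchmark would instead use CUDA event timers and synchronize the device around the measured window, since wall-clock time also includes launch and host-side overhead.

```rust
use std::time::Instant;

/// Rough wall-clock timing of a GPU op. This also measures launch overhead
/// and any host-side work, which is why GPU event timers would give a more
/// precise picture. (Sketch only; device synchronization is not shown.)
fn time_it<F: FnMut()>(label: &str, iters: u32, mut f: F) {
    // warm-up to exclude one-time kernel loading / allocation costs
    f();
    // NOTE: a real benchmark would synchronize the device here and again
    // after the loop so pending kernels don't leak across the window.
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    let elapsed = start.elapsed();
    println!("{label}: {:?} / iter", elapsed / iters);
}
```

The warm-up call matters because the first invocation typically includes kernel loading or JIT compilation, which would otherwise dominate the measurement.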

Reviewer(s)

@Narsil

@kozistr (Author) commented Sep 9, 2025

huggingface/text-embeddings-inference#717

I will open another PR to add the fully working MoE kernel.

@kozistr closed this on Sep 9, 2025
@kozistr deleted the feature/topk-softmax-kernel branch on September 10, 2025