Introduce Fused MoE to Nomic MoE #717
Conversation
Hey @kozistr thanks for this PR! I'll try to run some benchmarks on my end to compare both solutions! Also, do you think it would make sense to add this within https://github.com/huggingface/candle-extensions instead of in a separate repository? Asking because it might be easier to maintain that way in the long term; cc @ivarflakstad on that point too! Thanks again, I'll come back to you once I've tested + reviewed @kozistr 🤗
Hi @alvarobartt! I also think it'd be much better to add the MoE kernel to the Actually, I had opened an incomplete version of the MoE kernel PR before, which is partially working. This would be a good time to revive that PR with the new implementation 🤗 I'll get back to you when ready to reopen the PR to
Hey @kozistr thanks for flagging, I missed that! Thanks for the work, and please do let us know if there's anything other than reviews that we can do to help 🤗
thanks for your help :) I'll surely get back to you if I need any help 🤗 btw, I've just opened a PR at
What does this PR do?
related to #596
I've finished testing the fused MoE kernel, which was originally implemented here by @EricLBuehler (thanks!).
Here's the fused MoE implementation repository: https://github.com/kozistr/candle-moe. (I adapted Eric's baseline to work with the Nomic MoE version.)
Main Changes
topk_softmax
fused MoE
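For reviewers who want a CPU reference for the routing step, here is a minimal sketch of what a `topk_softmax` computes for a single token. It assumes the router applies a softmax over all expert logits and then keeps the `k` largest probabilities without renormalizing them; the actual kernel in candle-moe may differ in these details. The function names and signatures below are illustrative only and are not the crate's API.

```rust
/// Numerically stable softmax over one token's router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the row maximum before exponentiating to avoid overflow.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Hypothetical reference routing: returns the weights and indices of the
/// `k` most probable experts, sorted by descending probability.
/// Assumption: weights are NOT renormalized after the top-k selection.
fn topk_softmax(logits: &[f32], k: usize) -> (Vec<f32>, Vec<usize>) {
    let probs = softmax(logits);
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    // Stable sort, so ties keep the lower expert index first.
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    idx.truncate(k);
    let weights: Vec<f32> = idx.iter().map(|&i| probs[i]).collect();
    (weights, idx)
}
```

Calling `topk_softmax(&[1.0, 3.0, 2.0, 0.0], 2)` selects experts 1 and 2, in that order. A fused MoE kernel then combines the selected experts' outputs with these weights in a single pass instead of materializing each intermediate tensor.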
Of course, I've also verified that it produces (almost) identical results to the naive implementation.
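To make "(almost) identical" concrete, a check along the following lines can compare the fused output against the naive one. This is only a sketch: the helper names and the tolerance values are assumptions for illustration, not the thresholds used in this PR's tests.

```rust
/// Largest element-wise absolute difference between two flattened outputs.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "outputs must have the same length");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// True when every element satisfies |a - b| <= atol + rtol * |b|.
/// Any NaN in either input makes the comparison fail.
fn allclose(a: &[f32], b: &[f32], atol: f32, rtol: f32) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(x, y)| (x - y).abs() <= atol + rtol * y.abs())
}
```

Here `a` would hold the fused kernel's output and `b` the naive implementation's output, both copied back to the host as `f32`. Half-precision runs usually need looser tolerances than `f32` runs.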
Honestly, I haven't yet run extensive benchmarks across multiple settings due to time and resource constraints :(, but I've observed a latency improvement in wall-clock time. (This still needs further verification, including benchmarks of precise kernel timing.)
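For anyone who wants to reproduce the wall-clock comparison, a small harness like the one below is one way to do it. It is a sketch with a hypothetical `bench` helper, not the measurement code used for this PR. Note that CUDA kernel launches are asynchronous, so the closure being timed should end with a device synchronization (or a copy of the result back to the host); otherwise the timer mostly measures launch overhead.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` `warmup` times untimed, then `iters` times
/// timed, and returns the median wall-clock duration of the timed runs.
fn bench<F: FnMut()>(mut f: F, warmup: usize, iters: usize) -> Duration {
    assert!(iters > 0, "need at least one timed iteration");
    // Warm-up runs absorb one-time costs such as lazy initialization.
    for _ in 0..warmup {
        f();
    }
    let mut samples: Vec<Duration> = (0..iters)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    samples.sort();
    // The median is less sensitive to outliers than the mean.
    samples[samples.len() / 2]
}
```

Running the same harness once with the naive forward pass and once with the fused one gives comparable medians. Precise per-kernel timing would still need a GPU profiler or CUDA events.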
Also, I'm very new to CUDA programming, so any feedback or suggestions would be greatly appreciated :)
Before submitting
insta snapshots?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@Narsil OR @alvarobartt