convert : handle compressed-tensors quant method #17069
Conversation
Successfully converted with numactl -N 1 -m 1 \
python \
convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF \
/mnt/data/models/moonshotai/Kimi-K2-Thinking/
Shard (46/46): 100%|██████████| 22.5G/22.5G [01:15<00:00, 299Mbyte/s]
Writing: 100%|██████████| 2.05T/2.05T [1:59:32<00:00, 286Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/

Currently inferencing on a q8_0 (q4_0 routed experts) successfully!
Well, thanks to the llama.cpp team there is a GGUF available for testing: https://www.reddit.com/r/LocalLLaMA/comments/1oqo57j/ubergarmkimik2thinkinggguf_hugging_face/
Great job ngxson, compilade, DevQuasar, Bartowski, AesSedai, and the other folks who pulled together hacking on this one today! 🫶
ngxson left a comment:
Nice, thanks!
Is this the PR that is doing the round trip to F32? It would be better to find this out now than have many TB of bastard quants getting uploaded to HF!
IIUC this will round trip to F32, then quantize to target types like q8_0. The dequant only happens internally. We cannot yet directly map Kimi-K2's int4 to q4_0 because q4_0 only supports f16 as the scale. Adding a q4_0 variant with a bf16 scale is possible, but it will require quite a lot of work. I think the requant solution introduced by this PR is a good temporary solution; we will improve it if the model (or this quant scheme) gets a lot of usage.
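As a rough numpy sketch of that round trip (not the code in this PR; the block size, helper names, and rounding details are illustrative assumptions), this shows why q4_0's f16 scale is the sticking point:

```python
import numpy as np

BLOCK = 32  # q4_0 also uses 32-value blocks with a single f16 scale

def dequant_int4_blocks(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Dequantize signed int4 values (already unpacked to int8) to float32."""
    # q: (n_blocks, BLOCK) in [-8, 7]; scales: (n_blocks,), stored as bf16 in the checkpoint
    return q.astype(np.float32) * scales.astype(np.float32)[:, None]

def requant_q4_0_style(x: np.ndarray):
    """Requantize float32 blocks into a q4_0-like form: one f16 scale plus int4 values."""
    idx = np.argmax(np.abs(x), axis=1)
    d = x[np.arange(x.shape[0]), idx] / -8.0   # per-block scale, q4_0-style
    d16 = d.astype(np.float16)                 # q4_0 stores the scale as f16, not bf16
    inv = np.zeros_like(d, dtype=np.float32)
    nz = d16 != 0
    inv[nz] = 1.0 / d16[nz].astype(np.float32)
    q = np.clip(np.round(x * inv[:, None]), -8, 7).astype(np.int8)
    return d16, q

# The f16 cast of the scale (plus the re-rounding) is where this round trip can drift
# from the original int4 + bf16-scale data, hence the interest in a bf16-scale q4_0.
```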
(deleted so that nobody tries doing this!)
Actually this won't be lossless, as I forgot about sub-normals... but it shouldn't be far enough off to make much of a difference.
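For what it's worth, the sub-normal concern is about BF16 values whose magnitude falls below F16's normal range; a quick check, using float32 powers of two as stand-ins for exactly-representable BF16 scales:

```python
import numpy as np

# F16's smallest normal value is 2**-14 and its smallest subnormal is 2**-24; anything
# smaller that BF16 can represent (its exponent range matches F32) underflows to zero.
for e in (-10, -20, -30):
    s = np.float32(2.0) ** e
    print(e, np.float16(s))  # -10: normal, -20: subnormal, -30: 0.0
```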
This discussion seems better placed in my WIP branch for mapping int4 -> q4_0; here it is a bit off-topic. My version is almost working, I just need to correct the nibble layout. Will open a PR for discussion.
Just FYI: I think I've made it work in an alternative way. Both the Q3 and Q2 GGUFs seem to be working. Experimental quants are uploading (please allow some more time for the upload). Feel free to test the quants and the converter. Regardless of this^, it would be nice if this native quant handling worked!
Yeah, sorry - I'll continue the discussion there. I also tested if using uniform weights with |
To be clear, I agree with @jukofyork; requantization isn't lossless when a naïve quantization function is used (like the default one). The plan is to implement generalized repacking to allow fast lossless conversion (apart from BF16 -> F16 for very small scales). But as described in #14810 (comment), this requires handling permutes, reshapes, stacking (and maybe splitting), and so it's not exactly simple. I'll still proceed with merging this because dequantization works (but note the caveat with requantization), and repacking will use part of that functionality (to get the quantized values in the correct order).
TBH, it shouldn't be expected to be either.
Agreed, repacking is an additional feature that deserves its own PR.
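For the repacking direction, a minimal sketch (assuming q4_0's usual layout of 32-value blocks where byte j holds value j in the low nibble and value j+16 in the high nibble; this is not code from this PR) of placing already-quantized int4 values straight into q4_0 byte order with no float round trip:

```python
import numpy as np

QK4_0 = 32  # q4_0 block size

def repack_int4_to_q4_0_bytes(q: np.ndarray) -> np.ndarray:
    """Pack signed int4 values (int8 in [-8, 7]) into q4_0 nibble order, no dequant/requant."""
    # q: (n_blocks, QK4_0); q4_0 stores values offset by +8 as unsigned nibbles,
    # with value j in the low nibble of byte j and value j + 16 in the high nibble.
    u = (q + 8).astype(np.uint8)                      # now in [0, 15]
    lo, hi = u[:, : QK4_0 // 2], u[:, QK4_0 // 2 :]
    return (lo | (hi << 4)).astype(np.uint8)          # (n_blocks, 16) bytes per block

# The per-block scales would be written alongside these bytes; as noted above, the only
# lossy step left would be the BF16 -> F16 cast of very small scales.
```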



(alternative to #17064, cc @ngxson)
This adds support for a few formats in the compressed-tensors quant method:
- pack-quantized
  - symmetric = true (without zero point)
  - symmetric = false (with zero point)
- int-quantized
- float-quantized
- strategy = "channel"
- strategy = "block"
- naive-quantized

I've also re-tested plain fp8 with https://huggingface.co/Qwen/Qwen3-4B-FP8 to make sure I didn't break it.

I found a problem in the lazy tensors related to skipping metadata changes for binary operators, which I've fixed. Otherwise the broadcast shift (when unpacking) didn't have the correct final shape.
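For the broadcast shift mentioned above, a rough numpy sketch (not the actual convert_hf_to_gguf.py code; it assumes pack-quantized stores 8 int4 values per int32 word, first value in the least-significant bits) of the unpacking step whose final shape has to be tracked correctly by the lazy tensors:

```python
import numpy as np

def unpack_int4(packed: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Unpack int4 values from int32 words with a broadcast shift, then sign-extend."""
    pack_factor = 32 // num_bits
    shifts = np.arange(0, 32, num_bits, dtype=np.int32)            # (pack_factor,)
    # Broadcast shift: a trailing axis is added, so the final reshape must be correct
    vals = (packed[..., None] >> shifts) & ((1 << num_bits) - 1)   # (..., n_words, pack_factor)
    vals = vals.reshape(*packed.shape[:-1], -1).astype(np.int8)    # (..., n_words * pack_factor)
    return np.where(vals >= (1 << (num_bits - 1)), vals - (1 << num_bits), vals)

packed = np.array([[0x76543210]], dtype=np.int32)  # one word: nibbles 0..7, LSB first
print(unpack_int4(packed))                         # [[0 1 2 3 4 5 6 7]]
```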