Replies: 1 comment
The issue stemmed from using a quantized KV cache (-ctk and -ctv set to q8_0). After removing these parameters, inference worked correctly. I verified in the Ollama source code that GPT-OSS explicitly disallows quantized cache types due to its use of attention sinks: the code includes a check that disables the quantized KV cache when the model architecture is gpt-oss.
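For reference, a minimal sketch of the adjusted launch command, assuming everything else from the original invocation stays the same and only the quantized-cache flags are dropped from --runtime-args:
ramalama serve gpt-oss -c 64000 --ngl 99 --cache-reuse 256 --runtime-args="--keep -1 --jinja"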
I’m running the llama.cpp service using the following command:
ramalama serve gpt-oss -c 64000 --ngl 99 --cache-reuse 256 --runtime-args="--keep -1 -ctk q8_0 -ctv q8_0 --jinja"
Model: gpt-oss, pulled from hf://ggml-org/gpt-oss-20b-GGUF
I interact with the model via the web interface launched by the service (http://0.0.0.0:8080). As the conversation context grows over multiple rounds, I notice a significant drop in inference throughput, from ~80 tokens/sec initially to around 20–40 tokens/sec after a few exchanges. I’ve experimented with various parameters, including --cache-reuse, --keep, and --swa-checkpoint, but none of these adjustments seem to improve performance. Is this slowdown expected behavior with long context windows? Or is there something I might be missing in terms of configuration or optimization?
The service logs: "cache_reuse is not supported by this context, it will be disabled". Does this mean the model architecture or runtime configuration doesn’t support KV reuse in this setup?
Full log: