
Conversation

krisbiradar


Introduces the LLamaFlashAttentionType enum and integrates flash attention configuration into LLamaContextParams. Adds support for diffusion-based models in SafeLlamaModelHandle. Updates NativeApi and SafeLLamaContextHandle with new adapter metadata and sequence-state methods. Syncs the llama.cpp submodule.
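
For context, a minimal sketch of what the new flash-attention surface might look like on the C# side. The type names LLamaFlashAttentionType and LLamaContextParams come from the description above; the enum values and the field name are assumptions, mirroring llama.cpp's `llama_flash_attn_type`, and are not copied from this PR:

```csharp
// Sketch only: enum values assumed to mirror llama.cpp's
// llama_flash_attn_type, not taken from this PR.
public enum LLamaFlashAttentionType
{
    /// <summary>Let llama.cpp decide whether to use flash attention.</summary>
    Auto = -1,

    /// <summary>Flash attention disabled.</summary>
    Disabled = 0,

    /// <summary>Flash attention enabled.</summary>
    Enabled = 1,
}

// Hypothetical view of how the enum might be exposed on the native
// context parameters struct (field name assumed, other fields elided).
public struct LLamaContextParams
{
    /// <summary>Requested flash attention behaviour for this context.</summary>
    public LLamaFlashAttentionType flash_attn_type;
}
```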
@martindevans
Member

Thanks for putting this together. I've started a build here to produce new binaries, once those are ready we can update the csproj and run the tests.

@krisbiradar
Author

Hi @martindevans, I guess I need to hardcode the binary file URL for now; once the changes are tested and ready to merge, we can restore the original URL, right?

@krisbiradar
Author

I've changed it, assuming I had to hardcode it. If anything else is required, do let me know.

@martindevans
Member

Sorry for the delay, I'll create a new release on https://github.com/SciSharp/LLamaSharpBinaries shortly, then you can just put the ID of that release into the csproj file.

@martindevans
Member

Ok, I've created https://github.com/SciSharp/LLamaSharpBinaries/releases/tag/86587da. You can put that release ID into the csproj here and it should auto-download on build (probably best to do a clean rebuild to be sure).
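
For anyone following along, the requested csproj change would look roughly like the following. The property name `BinaryReleaseId` is a guess at the convention, so check LLamaSharp's actual csproj before copying; only the release tag `86587da` comes from the comment above:

```xml
<!-- Hypothetical sketch: the real property name in LLamaSharp's csproj may differ. -->
<PropertyGroup>
  <!-- ID of the SciSharp/LLamaSharpBinaries release to fetch on build -->
  <BinaryReleaseId>86587da</BinaryReleaseId>
</PropertyGroup>
```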

@krisbiradar
Author

Will do it in some time.

@krisbiradar
Author


Hi @martindevans, a few tests are failing, but I checked out the last commit before my contribution and the tests fail there too... am I missing something?
