
Conversation

@tamarPal (Contributor) commented Nov 10, 2025

Replaces the closed PR #17052 (force-push issue). Contains only the Megrez-MoE changes; fully tested.

Architecture

  • 64 routed experts (top-6) + 4 shared experts
  • Sigmoid + bias gating
  • 30 MoE layers, 2048-dim, 163K context
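
For readers unfamiliar with this gating style, here is a minimal NumPy sketch of sigmoid + bias routing over 64 experts with top-6 selection. It assumes the common convention that the bias (exp_probs_b) only influences which experts are selected while the mixing weights come from the plain sigmoid scores; Megrez-MoE's exact normalization may differ.

import numpy as np

n_expert, n_expert_used, n_embd = 64, 6, 2048
rng = np.random.default_rng(0)

x      = rng.standard_normal(n_embd)                 # one token's hidden state
w_gate = rng.standard_normal((n_expert, n_embd))     # router weights
e_bias = rng.standard_normal(n_expert)               # exp_probs_b / e_score_correction_bias

scores   = 1.0 / (1.0 + np.exp(-(w_gate @ x)))            # sigmoid gating
selected = np.argsort(scores + e_bias)[-n_expert_used:]   # top-6 after adding the bias
weights  = scores[selected] / scores[selected].sum()      # normalized mixing weights

# The routed output would be the weighted sum of the 6 selected experts' FFNs,
# added to the output of the 4 shared experts, which run for every token.
print(selected, weights)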

Changes

  1. Conversion (convert_hf_to_gguf.py): MegrezMoEModel class

    • Shared expert FFN: hidden_size × 2.75 = 5632
    • Gate bias mapping: e_score_correction_bias → exp_probs_b
    • Expert tensor merging (64 experts/layer); see the sketch after this list
  2. Architecture (constants.py): MODEL_TENSORS.MEGREZ_MOE

  3. Inference (llama.cpp): MoE FFN + graph memory fix
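
As a rough illustration of item 1 (not the actual convert_hf_to_gguf.py code), the sketch below shows the shared-expert FFN width computation and the expert-merging step that stacks a layer's 64 per-expert weights into a single 3D tensor; the per-expert FFN width of 1408 is an assumed value for illustration.

import torch

hidden_size = 2048
shared_ffn  = int(hidden_size * 2.75)   # 5632, the shared-expert FFN width
n_expert    = 64
expert_ffn  = 1408                      # per-expert FFN width (assumed, for illustration)

# Pretend per-expert tensors loaded from the HF checkpoint for one layer.
per_expert = [torch.randn(expert_ffn, hidden_size) for _ in range(n_expert)]

# Merge into a single [n_expert, expert_ffn, hidden_size] tensor, the merged
# layout that llama.cpp expects for expert weights in GGUF.
merged = torch.stack(per_expert, dim=0)
print(shared_ffn, merged.shape)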

Testing

- Real model: Infinigence/Megrez2-3x7B-A3B (13.92 GB → 14 GB GGUF)
- Conversion: 372 tensors, all parameters correct
- Inference: coherent output, 5.77 tok/s
- All 40 tests pass

Example

python3 convert_hf_to_gguf.py models/Megrez2-3x7B-A3B/
./build/bin/llama-cli -m megrez2-3x7b-a3b-f16.gguf -p "Hello" -n 50

tamarPal added 11 commits November 9, 2025 19:55
Implements complete support for Megrez-MoE (Mixture of Experts) models:

- Add LLM_ARCH_MEGREZ_MOE architecture enum and mappings
- Implement build_mergez_moe_ffn() with sigmoid+bias gating
- Add llm_build_megrez_moe class for full model graph construction
- Support 31-layer architecture (layer 0: dense FFN, layers 1-30: MoE); see the sketch after this list
- Implement expert sharing pattern with 64 experts, 6 used per token, 4 shared
- Load all model hyperparameters and 372 tensors correctly
- Configure NEOX RoPE type for proper positional encoding
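
As referenced above, a tiny sketch of the layer layout: the first n_layer_dense_lead layers use a dense FFN and the rest use the MoE FFN. This mirrors the hparams.n_layer_dense_lead condition that comes up later in the review; the helper below is a placeholder, not llama.cpp code.

n_layer            = 31
n_layer_dense_lead = 1   # layer 0 is dense in Megrez2-3x7B-A3B

def describe_layer(il: int) -> str:
    # Mirrors the "il < n_layer_dense_lead" check: dense lead layers first, MoE afterwards.
    if il < n_layer_dense_lead:
        return f"layer {il}: dense FFN"
    return f"layer {il}: MoE FFN (64 experts, top-6 routed + 4 shared)"

for il in (0, 1, 30):
    print(describe_layer(il))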

Tested with Megrez2-3x7B-A3B_Q4_K_M.gguf model.
All 39 llama.cpp tests pass successfully.
Output verified to match infinigence/llama.cpp reference implementation.

Note: Use --no-warmup flag to avoid warmup memory allocation issue.
Megrez-MoE creates many intermediate tensors during MoE FFN construction:
- sigmoid, add, reshape (3x), get_rows, sum_rows, div, view_2d, mul_mat operations
- ggml_top_k internally calls ggml_argsort + ggml_view_4d (2 more tensors per layer)
- Each of 30 MoE layers creates ~35 intermediate tensors during graph construction

During warmup, the graph is built 3 times with different batch sizes, requiring
sufficient memory pool space for all intermediate tensors.

Add 4096 node overhead for LLM_ARCH_MEGREZ_MOE to accommodate these intermediate
tensors (30 layers × 35 tensors/layer ≈ 1050 nodes, doubled for safety margin).

This fixes the 'not enough space in the context's memory pool' error during warmup,
allowing Megrez-MoE to work without the --no-warmup flag.
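
For reference, a quick back-of-the-envelope check of the numbers above (4096 is presumably just a comfortable round figure above the doubled estimate):

n_moe_layers    = 30
nodes_per_layer = 35                               # rough count of intermediate tensors per MoE layer
estimate        = n_moe_layers * nodes_per_layer   # 1050
with_margin     = 2 * estimate                     # 2100, doubled for a safety margin
overhead        = 4096                             # the extra node budget added for MEGREZ_MOE
assert overhead >= with_margin
print(estimate, with_margin, overhead)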

Tested:
- All 39 tests pass
- Megrez-MoE works with warmup enabled (no crashes)
- Other models (e.g., Gemma-2) are unaffected
- Verified with outputs up to 100 tokens
- Move llm_build_megrez_moe from llama-model.cpp to src/models/megrez-moe.cpp
- Add declaration to src/models/models.h
- Update CMakeLists.txt to include megrez-moe.cpp in build
- Resolve merge conflicts in llama-arch.cpp and llama-model.cpp
- Fix PANGU_EMBED case statement closing braces

The model loads successfully, all tests pass (40/40), and inference works correctly.
…oe_ffn

- Remove custom build_mergez_moe_ffn implementation (100+ lines)
- Use existing build_moe_ffn with LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID
- Pre-compute gate logits from pre_gate_hidden (Megrez-MoE's unique gating)
- Pass pre-computed logits via probs_in parameter
- Maintain exact same behavior and output quality

This addresses review feedback to reuse existing MoE infrastructure
instead of duplicating code. The sigmoid gating + bias after activation
is already supported by build_moe_ffn.
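
A minimal sketch of the idea (plain NumPy, not the actual build_moe_ffn/probs_in interface): the router logits are computed from the separate pre_gate_hidden state and handed to the generic MoE path, while the experts themselves still operate on the current hidden state.

import numpy as np

n_embd, n_expert, n_expert_used = 2048, 64, 6
rng = np.random.default_rng(1)

cur             = rng.standard_normal(n_embd)        # input to the MoE FFN
pre_gate_hidden = rng.standard_normal(n_embd)        # Megrez-MoE routes on this instead
w_gate          = rng.standard_normal((n_expert, n_embd))

logits   = w_gate @ pre_gate_hidden                  # pre-computed gate logits ("probs_in")
scores   = 1.0 / (1.0 + np.exp(-logits))             # sigmoid gating, bias handling as before
selected = np.argsort(scores)[-n_expert_used:]       # selection driven by pre_gate_hidden
# ...the selected experts are then applied to `cur`, exactly as in the generic MoE path.
print(selected)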
- Restore PANGU_EMBED and COGVLM tensor mappings in llama-arch.cpp
- Remove extra blank line in llama-context.cpp
Restore HunYuanMoE code to upstream version - no modifications needed
The github-actions bot added the "model (Model specific)" and "python (python script changes)" labels on Nov 10, 2025.
tamarPal changed the title from "Add complete **Megrez-MoE** support: GGUF conversion + inference." to "Add complete Megrez-MoE support: GGUF conversion + inference." on Nov 10, 2025.

uint32_t llama_context::graph_max_nodes() const {
-    return std::max<uint32_t>(1024u, 8u*model.n_tensors());
+    uint32_t base_nodes = std::max<uint32_t>(1024u, 8u*model.n_tensors());
Collaborator:
Suggested change:

-    uint32_t base_nodes = std::max<uint32_t>(1024u, 8u*model.n_tensors());
+    uint32_t factor = model.arch == LLM_ARCH_MEGREZ_MOE ? 9u : 8u;
+    uint32_t base_nodes = std::max<uint32_t>(1024u, factor * model.n_tensors());

increase the 9u if needed

Contributor Author:

Applied your suggestion. Thanks!

Contributor Author:

I fixed it.

Collaborator:

Replace your code with the one I suggested.

Comment on lines +24 to +25
// Layer 0
{
Collaborator:

Prevent duplicating this code block if possible; merge it into the for loop.

Contributor Author (tamarPal, Nov 10, 2025):

I kept the Layer 0 code separate for now. While merging would reduce duplication slightly, the current structure is clearer and separates the dense layer (layer 0) from the MoE layers (1-30). The duplication is minimal and the readability benefit outweighs the consolidation.

Collaborator:

I don't see how this can't be merged with the loop below.

Your code below has ((uint32_t) il < hparams.n_layer_dense_lead), which literally translates to "the first n_layer_dense_lead layers are dense, non-MoE layers".

Unless you think otherwise, you should refactor your code to make it more clear.

- Use explicit expert_layer_stride variable instead of hard-coded expression
- Apply clang-format to ensure consistent code style
- Fix trailing whitespace issues
Collaborator @ngxson left a comment:

Honestly, I think this PR is taking a lot of time for maintainers to review.

Many parts of the code are not clean. I suggest you take a deeper look into how other models structure their code and follow the existing pattern.

Collaborator @pwilkin left a comment:

Please run editorconfig-checker (https://github.com/editorconfig-checker/editorconfig-checker) and flake8 on your PR and fix the whitespace / indentation errors.

MODEL_TENSOR.FFN_GATE_INP_SHEXP,
MODEL_TENSOR.FFN_GATE_SHEXP,
MODEL_TENSOR.FFN_DOWN_SHEXP,
MODEL_TENSOR.FFN_UP_SHEXP,
Collaborator:

MODEL_TENSOR.FFN_EXP_PROBS_B is missing from constants.
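
For context, the fix amounts to adding that entry next to the shared-expert tensors in the architecture's tensor list; a schematic sketch (placeholder list, not the actual constants.py diff):

MEGREZ_MOE_TENSORS = [
    "FFN_GATE_INP_SHEXP",
    "FFN_GATE_SHEXP",
    "FFN_DOWN_SHEXP",
    "FFN_UP_SHEXP",
    "FFN_EXP_PROBS_B",   # the missing expert-bias tensor (exp_probs_b)
]
print(MEGREZ_MOE_TENSORS)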

