Fix Gemma 4 KV-shared layers creating unused projections #1158
Merged
angeloskath merged 4 commits on Apr 21, 2026
Gemma 4 E4B/E2B models share KV projections across later layers (controlled by num_kv_shared_layers). The Attention class was creating k_proj, v_proj, k_norm, and v_norm for all layers, but shared layers never use them: the forward pass routes KV from earlier layers via shared_kv.

This caused strict weight loading to fail for any Gemma 4 model saved through transformers (fine-tunes, merges, abliterations), since transformers correctly omits these weights for shared layers.

- Skip creating k_proj/v_proj/k_norm/v_norm for KV-shared layers
- Add defensive ValueError if a shared layer receives no shared_kv
- Add test verifying shared layers omit KV projections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
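The change described above can be illustrated with a minimal, dependency-free sketch. It is not the mlx-lm implementation: `Placeholder`, `keys_values`, and the `num_hidden_layers` parameter are hypothetical names, and the rule that the last `num_kv_shared_layers` layers are the shared ones is an assumption based on "later layers" in the description.

```python
from typing import Optional, Tuple


class Placeholder:
    """Hypothetical stand-in for a real projection or norm module."""

    def __init__(self, name: str):
        self.name = name


class Attention:
    def __init__(self, layer_idx: int, num_hidden_layers: int, num_kv_shared_layers: int):
        # Assumption: the last `num_kv_shared_layers` layers reuse KV from earlier layers.
        first_shared = num_hidden_layers - num_kv_shared_layers
        self.is_kv_shared_layer = num_kv_shared_layers > 0 and layer_idx >= first_shared

        self.q_proj = Placeholder("q_proj")
        self.o_proj = Placeholder("o_proj")
        if not self.is_kv_shared_layer:
            # Only layers that compute their own keys/values own these modules.
            self.k_proj = Placeholder("k_proj")
            self.v_proj = Placeholder("v_proj")
            self.k_norm = Placeholder("k_norm")
            self.v_norm = Placeholder("v_norm")

    def keys_values(
        self, x: str, shared_kv: Optional[Tuple[str, str]] = None
    ) -> Tuple[str, str]:
        if self.is_kv_shared_layer:
            # Defensive check: a shared layer cannot compute KV on its own.
            if shared_kv is None:
                raise ValueError("KV-shared layer received no shared_kv")
            return shared_kv
        # Strings stand in for the projected key/value tensors.
        return (f"k({x})", f"v({x})")


layers = [Attention(i, num_hidden_layers=4, num_kv_shared_layers=2) for i in range(4)]
kv = layers[1].keys_values("h")
reused = layers[3].keys_values("h", shared_kv=kv)
```

In this sketch, layers 0 and 1 own `k_proj`/`v_proj`/`k_norm`/`v_norm`, while layers 2 and 3 have no such attributes and must be handed `shared_kv` by the caller.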
ericcurtin pushed a commit to vllm-project/vllm-metal that referenced this pull request on Apr 27, 2026
fix #299

* Passed the Qwen3 deterministic test with no regression.
* Passed the Gemma4-E4B e2e smoke test: the model loaded successfully and produced normal tokens (not gibberish).

Future plan: keep watching ml-explore/mlx-lm#1158. There may be more upstream bug fixes in the future.

---------

Signed-off-by: Ranran Haoran Zhang <haorzhang@ebay.com>
Signed-off-by: ran <hzz5361@psu.edu>
Summary
- Gemma 4 E4B/E2B models share KV projections across later layers (controlled by `num_kv_shared_layers`). The `Attention` class was creating `k_proj`, `v_proj`, `k_norm`, and `v_norm` for all layers, but shared layers never use them: the forward pass routes KV from earlier layers via `shared_kv`.
- This caused `load_weights(strict=True)` to fail for any Gemma 4 model saved through transformers (`save_pretrained`), since transformers correctly omits these weights for shared layers. This affects all derivative models: fine-tunes, merges, abliterations, etc.
- Skip creating `k_proj`/`v_proj`/`k_norm`/`v_norm` for KV-shared layers, matching the transformers implementation.
- Add a defensive `ValueError` if a shared layer somehow receives no `shared_kv` at runtime.

Test plan
- `gemma4` tests pass
- `test_gemma4_kv_shared_layers_omit_kv_projections` verifies shared layers don't create KV modules
- `OBLITERATUS/gemma-4-E4B-it-OBLITERATED`
- `black`

🤖 Generated with Claude Code
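To make the strict-loading failure concrete, here is a small sketch of the name mismatch, written without MLX or transformers. The parameter-name pattern `layers.{i}.self_attn.{module}.weight`, the helper names, and the assumption that the last `num_kv_shared_layers` layers are the shared ones are all illustrative; the real checkpoints may use different keys.

```python
from typing import Set


def expected_param_names(
    num_hidden_layers: int, num_kv_shared_layers: int, skip_shared_kv: bool
) -> Set[str]:
    """Names a model would expect; `skip_shared_kv=True` models the fixed behavior."""
    first_shared = num_hidden_layers - num_kv_shared_layers
    names = set()
    for i in range(num_hidden_layers):
        is_shared = num_kv_shared_layers > 0 and i >= first_shared
        modules = ["q_proj", "o_proj"]
        if not (is_shared and skip_shared_kv):
            modules += ["k_proj", "v_proj", "k_norm", "v_norm"]
        for module in modules:
            # Hypothetical key layout, for illustration only.
            names.add(f"layers.{i}.self_attn.{module}.weight")
    return names


def strict_load(model_names: Set[str], checkpoint_names: Set[str]) -> None:
    """Mimic a strict load: any missing or unexpected name is an error."""
    missing = sorted(model_names - checkpoint_names)
    extra = sorted(checkpoint_names - model_names)
    if missing or extra:
        raise ValueError(f"missing={missing} unexpected={extra}")


# A checkpoint that omits KV weights for shared layers, as the summary says
# transformers does.
checkpoint = expected_param_names(4, 2, skip_shared_kv=True)
model_before_fix = expected_param_names(4, 2, skip_shared_kv=False)
model_after_fix = expected_param_names(4, 2, skip_shared_kv=True)

try:
    strict_load(model_before_fix, checkpoint)
    before_fix_loads = True
except ValueError:
    before_fix_loads = False

strict_load(model_after_fix, checkpoint)  # no error once the extra modules are gone
```

With four layers and two shared layers, the pre-fix model expects eight parameters that the checkpoint does not contain, which is the kind of mismatch a strict load rejects.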