Add Step 3.5 Flash #836
You rock @kernelpool 🔥
    def __call__(self, x: mx.array) -> mx.array:
        return mx.fast.rms_norm(x, self.weight + 1, self.eps)
Minor optimization: we can preprocess the weight in sanitize instead of adding one on every call.
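A minimal sketch of that suggestion, written in plain Python rather than MLX so it stays self-contained. The `sanitize` helper shown here, the `norm.weight` key suffix, and the list-based weights are assumptions for illustration only; the PR's actual `sanitize` and parameter names may differ.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    # Reference RMSNorm over a 1-D list of floats.
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def sanitize(weights):
    # Hypothetical helper: fold the "+ 1" into the stored norm weights once,
    # at load time, so the forward pass can use them directly.
    return {
        name: [w + 1.0 for w in value] if name.endswith("norm.weight") else value
        for name, value in weights.items()
    }


raw = {
    "layers.0.input_norm.weight": [0.5, -0.25, 0.0],  # assumed key name
    "layers.0.proj.weight": [1.0, 2.0, 3.0],          # non-norm weight, left alone
}
clean = sanitize(raw)

x = [1.0, 2.0, 3.0]
# Adding one on every call (what the diff above does) ...
per_call = rms_norm(x, [w + 1.0 for w in raw["layers.0.input_norm.weight"]])
# ... versus using weights that were shifted once in sanitize.
preprocessed = rms_norm(x, clean["layers.0.input_norm.weight"])
```

Both paths give the same result. One thing to watch with this approach: the shift must be applied exactly once per checkpoint, otherwise already-converted weights would have one added twice.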
        self.router_bias = mx.zeros((self.n_routed_experts,))

    def __call__(self, x: mx.array):
        gates = self.gate(x.astype(mx.float32))
I would do the cast inside moe_gate_select. It will get fused during compilation.
            self.routed_scaling_factor,
            self.norm_topk_prob,
        )
        return inds, weights.astype(x.dtype)
This looks really nice! Just left a few minor comments. After that, let's merge it!
Note: I've used regular KVCache with masking for the sliding window attention to support trimming
That's pretty interesting. I think we need a better story there in mlx-lm in general, but I'm not entirely sure what it should look like. The problem with that is it will increase memory usage, potentially substantially. On the other hand it does break prompt caching to some extent to use a fixed size KV cache.
Yeah, it was running a lot slower with the RotatingKVCache in practical use (OpenCode, etc.) since it frequently has to reprocess everything. We could perhaps keep some "snapshots" of the RotatingKVCache to revert to, so that we don't have to reprocess the whole prompt, but that would probably complicate things a fair bit.
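To make the "snapshots" idea concrete, here is a toy sketch in plain Python. It is not mlx-lm's RotatingKVCache and does not use its API; the class name `SnapshotWindowCache` and its methods are hypothetical, and integer tokens stand in for key/value tensors.

```python
from collections import deque


class SnapshotWindowCache:
    """Toy fixed-size cache that keeps the last `window` tokens, with snapshots."""

    def __init__(self, window):
        self.window = window
        self.items = deque(maxlen=window)  # oldest entries fall off automatically
        self.offset = 0                    # total number of tokens processed
        self.snapshots = {}                # offset -> copy of the window contents

    def append(self, token):
        self.items.append(token)
        self.offset += 1

    def snapshot(self):
        # Save the current window so we can return to this offset later.
        self.snapshots[self.offset] = list(self.items)

    def restore(self, max_offset):
        # Revert to the latest snapshot at or before `max_offset`.
        # Returns the restored offset, or None if no snapshot qualifies.
        candidates = [o for o in self.snapshots if o <= max_offset]
        if not candidates:
            return None
        best = max(candidates)
        self.items = deque(self.snapshots[best], maxlen=self.window)
        self.offset = best
        # Snapshots taken after the restored point are no longer valid.
        self.snapshots = {o: s for o, s in self.snapshots.items() if o <= best}
        return best


cache = SnapshotWindowCache(window=4)
for t in range(6):
    cache.append(t)
cache.snapshot()                        # taken at offset 6
for t in range(6, 10):
    cache.append(t)

# Suppose the new prompt only shares its first 8 tokens with the old one.
restored = cache.restore(max_offset=8)  # falls back to the snapshot at offset 6
```

After the restore, only the tokens from the restored offset onward need reprocessing rather than the whole prompt. The cost is the extra memory for each saved window, which is part of the complication mentioned above.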
2195808 to 6c7c40b
No, it's generating nonsense with my mlx 8bit. I need to find out why... By the way, I generated the model using an updated version of mlx-my-repo that uses the newest mlx-lm code, which was at this commit: I verified the sha256 of the model files to make sure my download was correct. So I think either the generated mlx 8bit got messed up somehow in the cloud, or something is wrong in the commit. I'll just quantize a 4bit using "mlx-my-repo" and see what happens. I'll also download kernelpool's 4bit to test. By the way, the nonsense looks like this:
Edit: I can confirm kernelpool's 4bit model is good. My model is somehow problematic. I guess a certain commit generates the wrong model?
Edit: I generated a 4bit quant, and it still produces output like that. Something is wrong with the quantization process of mlx-my-repo. But the code just calls
No problem. And I just confirmed that commit b8c4549 generates the correct model.
Is the 4bit quant from yesterday working? Just double-checking before I start downloading on my slow-ish connection. Thanks for the quick support and awesome work!
Yes, it's working. It was converted prior to the buggy commit.
* Add Step 3.5 Flash
* Shard model
* Feedback
Update mlx-lm to v0.30.6 which includes Step 3.5 Flash support (ml-explore/mlx-lm#836). Add model cards for the 4bit, 6bit, and 8bit quantizations from mlx-community.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Model: https://huggingface.co/stepfun-ai/Step-3.5-Flash
Note: I've used regular KVCache with masking for the sliding window attention to support trimming
Tool calling uses same format as qwen3 coder.
Issue: #835
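The note about using a regular KVCache with masking for sliding window attention can be illustrated with a small sketch. This is plain Python, not the PR's code: the function name `sliding_window_mask`, its parameters, and the convention that the window includes the current token are all assumptions made for illustration.

```python
def sliding_window_mask(num_queries, offset, window):
    # Queries sit at absolute positions offset .. offset + num_queries - 1.
    # Keys cover positions 0 .. offset + num_queries - 1, i.e. the full,
    # untrimmed cache. A key is visible when it is not in the future
    # (k <= q) and lies within the last `window` positions (k > q - window).
    total = offset + num_queries
    mask = []
    for q in range(offset, total):
        mask.append([(k <= q) and (k > q - window) for k in range(total)])
    return mask


# Two new tokens arriving after three cached ones, with a window of 3.
mask = sliding_window_mask(num_queries=2, offset=3, window=3)
```

Because every key stays in the cache and old positions are only masked out, the cache can still be trimmed back to an earlier offset. The trade-off is the one raised in the review: memory grows with the full sequence length instead of being capped at the window size.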