Improve reasoning and tool call parsing in server#711
Conversation
|
Also fixed a bug where we were keeping eos text in some cases at the end of the generation in the server. |
f627cf7 to
920269a
Compare
38f8179 to
05d377e
Compare
3f7b2cd to
5b99956
Compare
|
Thanks for the great initiative. I was working on the similar idea of creating different tool parsers for each model. I'm glad I'm on the right path, can't wait to see this get merged and tested on the MiniMax M2.1 with open code on local stack :) |
5b99956 to
05ba3a0
Compare
05ba3a0 to
96e2906
Compare
|
I just added a parser for Minimax M2.1 as well. The main thing to be aware of is that right now you have to manually modify the I would like to make that easier to automate.. but it's a bit tricky to auto-detect it since it's the tool parser is different than the model family in some cases (e.g. qwen 3 coder vs qwen 3). Might also be good to add a way to specify the tool call parser as an option. |
Thanks for the Minimax parser, I'll try it out, when I was working on my fork to integrate with opencode, I noticed that the tokenizer sometimes generates weird detokenized words, such as "-[space]word" instead of "-word." I'll try the PR branch to see if this issue occurs there as well. |
f4fc6da to
be57c94
Compare
|
Just a minor observation. I've noticed |
|
Yes it's confusing. The input The ouput response sets the I wouldn't have designed it that way 😅 but that's what the downstream models and upstream tools expect. For persistent reasoning with GLM4.7 we have to rely on the upstream tool to pass back the reasoning content which I don't think is standard yet.. but that's something to look into. |
@awni what if this logic is handled internally, starting from model family and managing exceptions from there? Because adding a line in tokenizer_config.json breaks other tools using mlx-lm with their own version of models like lm-studio or exo. |
|
This PR is a game changer! Experimental as everything around tool calling, but opens doors to new scenarios. |
|
I made a hybrid approach which tries to infer the tool parser but still adds the |
|
No no, I meant that they will have to add this additional field in their version of the models or tool parser won’t work out of the box if they want to leverage same mlx tool parser. |
|
Gotcha! That should be taken care of now! |
|
Just sharing some findings, may not be directly related to the PR. Somehow, Minimax 2.1 at a temperature of 1.0 picks up a lot of incorrect tokens in the file path, for example, leading spaces with a slash. However, when the temperature is set to 0, it produces a lot of repetitive tokens. In the end, I added file path normalization to remove extra spaces and used temperature 0 with a higher repetition penalty. It seems with these settings, it works fine on my local setup up to a 40k context. I'm not sure if there are some differences in the sampler between MLX and other inference frameworks, given that the model card suggests a temperature of 1.0. @ivanfioravanti |
|
I don't think it's related to this PR. If you have a prompt or somethign that produces low quality outputs that would be useful to share in an issue. And also the model you are using (e.g. quantization). |
done #725 :) |
This is not technically part of the OpenAI chat API spec. But seems to be the defacto standard (used by vllm, open router, etc).
And it makes the server work better with upstream tools which use openAI compatible APIs. For example with OpenCode you get the reasoning traces nicely formatted:
I also added tool call parsing and redesigned how custom chat templates and tool call parsing works.
Closes #607