Skip to content

Improve reasoning and tool call parsing in server#711

Merged
awni merged 10 commits into
mainfrom
server_parse_reasoning
Jan 5, 2026
Merged

Improve reasoning and tool call parsing in server#711
awni merged 10 commits into
mainfrom
server_parse_reasoning

Conversation

@awni
Copy link
Copy Markdown
Member

@awni awni commented Dec 30, 2025

This is not technically part of the OpenAI chat API spec. But seems to be the defacto standard (used by vllm, open router, etc).

And it makes the server work better with upstream tools which use openAI compatible APIs. For example with OpenCode you get the reasoning traces nicely formatted:

Screenshot 2025-12-30 at 12 47 58 PM

I also added tool call parsing and redesigned how custom chat templates and tool call parsing works.

Closes #607

@awni
Copy link
Copy Markdown
Member Author

awni commented Dec 30, 2025

Also fixed a bug where we were keeping eos text in some cases at the end of the generation in the server.

@awni awni requested a review from angeloskath December 30, 2025 20:52
@awni awni changed the title Parse reasoning in server Improve reasoning and tool call parsing in server Dec 31, 2025
@awni awni force-pushed the server_parse_reasoning branch from f627cf7 to 920269a Compare January 1, 2026 00:02
@awni awni force-pushed the server_parse_reasoning branch from 38f8179 to 05d377e Compare January 1, 2026 04:03
@mzbac
Copy link
Copy Markdown
Contributor

mzbac commented Jan 2, 2026

Thanks for the great initiative. I was working on the similar idea of creating different tool parsers for each model. I'm glad I'm on the right path, can't wait to see this get merged and tested on the MiniMax M2.1 with open code on local stack :)

@awni awni force-pushed the server_parse_reasoning branch from 5b99956 to 05ba3a0 Compare January 2, 2026 14:08
@awni awni force-pushed the server_parse_reasoning branch from 05ba3a0 to 96e2906 Compare January 2, 2026 15:30
@awni
Copy link
Copy Markdown
Member Author

awni commented Jan 2, 2026

I just added a parser for Minimax M2.1 as well. The main thing to be aware of is that right now you have to manually modify the tokenizer_config.json to add the tool parser type - as in here.

I would like to make that easier to automate.. but it's a bit tricky to auto-detect it since it's the tool parser is different than the model family in some cases (e.g. qwen 3 coder vs qwen 3). Might also be good to add a way to specify the tool call parser as an option.

@mzbac
Copy link
Copy Markdown
Contributor

mzbac commented Jan 2, 2026

I just added a parser for Minimax M2.1 as well. The main thing to be aware of is that right now you have to manually modify the tokenizer_config.json to add the tool parser type - as in here.

I would like to make that easier to automate.. but it's a bit tricky to auto-detect it since it's the tool parser is different than the model family in some cases (e.g. qwen 3 coder vs qwen 3). Might also be good to add a way to specify the tool call parser as an option.

Thanks for the Minimax parser, I'll try it out, when I was working on my fork to integrate with opencode, I noticed that the tokenizer sometimes generates weird detokenized words, such as "-[space]word" instead of "-word." I'll try the PR branch to see if this issue occurs there as well.

@awni awni force-pushed the server_parse_reasoning branch from f4fc6da to be57c94 Compare January 2, 2026 18:32
@kernelpool
Copy link
Copy Markdown
Contributor

Just a minor observation. I've noticed reasoning is being used instead of reasoning_content which seems to be the direction some projects are going (e.g. vLLM). However, a lot of chat templates still rely on reasoning_content, including GLM-4.7. Considering things like interleaved/persistent reasoning (where a client passes reasoning back to the server), would it be expected that the client still uses reasoning_content for the chat templates to function correctly?

@awni
Copy link
Copy Markdown
Member Author

awni commented Jan 3, 2026

Yes it's confusing. The input messages accept reasoning_content in the assistant role (which is passed directly to the chat template. We could remap that if needed.

The ouput response sets the reasoning field in the choices.

I wouldn't have designed it that way 😅 but that's what the downstream models and upstream tools expect.

For persistent reasoning with GLM4.7 we have to rely on the upstream tool to pass back the reasoning content which I don't think is standard yet.. but that's something to look into.

@ivanfioravanti
Copy link
Copy Markdown
Contributor

I just added a parser for Minimax M2.1 as well. The main thing to be aware of is that right now you have to manually modify the tokenizer_config.json to add the tool parser type - as in here.

I would like to make that easier to automate.. but it's a bit tricky to auto-detect it since it's the tool parser is different than the model family in some cases (e.g. qwen 3 coder vs qwen 3). Might also be good to add a way to specify the tool call parser as an option.

@awni what if this logic is handled internally, starting from model family and managing exceptions from there? Because adding a line in tokenizer_config.json breaks other tools using mlx-lm with their own version of models like lm-studio or exo.
Not the most elegant solution, but maybe less impactful on the ecosystem.

@ivanfioravanti
Copy link
Copy Markdown
Contributor

This PR is a game changer! Experimental as everything around tool calling, but opens doors to new scenarios.
Thanks @awni 🙏🏻

@awni awni mentioned this pull request Jan 3, 2026
4 tasks
@awni
Copy link
Copy Markdown
Member Author

awni commented Jan 3, 2026

I made a hybrid approach which tries to infer the tool parser but still adds the tool_parser_type to the tokenizer_config.json. Are you saying adding tool_parser_type as a field is breaking downstream tools? That would be somewhat surprising.. usually with these kinds of configs they should ignore fields that they don't use.

@ivanfioravanti
Copy link
Copy Markdown
Contributor

No no, I meant that they will have to add this additional field in their version of the models or tool parser won’t work out of the box if they want to leverage same mlx tool parser.

@awni
Copy link
Copy Markdown
Member Author

awni commented Jan 3, 2026

Gotcha! That should be taken care of now!

@mzbac
Copy link
Copy Markdown
Contributor

mzbac commented Jan 4, 2026

Just sharing some findings, may not be directly related to the PR. Somehow, Minimax 2.1 at a temperature of 1.0 picks up a lot of incorrect tokens in the file path, for example, leading spaces with a slash. However, when the temperature is set to 0, it produces a lot of repetitive tokens. In the end, I added file path normalization to remove extra spaces and used temperature 0 with a higher repetition penalty. It seems with these settings, it works fine on my local setup up to a 40k context. I'm not sure if there are some differences in the sampler between MLX and other inference frameworks, given that the model card suggests a temperature of 1.0. @ivanfioravanti

@awni
Copy link
Copy Markdown
Member Author

awni commented Jan 4, 2026

I don't think it's related to this PR. If you have a prompt or somethign that produces low quality outputs that would be useful to share in an issue. And also the model you are using (e.g. quantization).

@mzbac
Copy link
Copy Markdown
Contributor

mzbac commented Jan 5, 2026

I don't think it's related to this PR. If you have a prompt or somethign that produces low quality outputs that would be useful to share in an issue. And also the model you are using (e.g. quantization).

done #725 :)

Copy link
Copy Markdown
Member

@angeloskath angeloskath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@awni awni merged commit ac8ae2c into main Jan 5, 2026
2 checks passed
@awni awni deleted the server_parse_reasoning branch January 5, 2026 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mlx_lm.server crashes when tool_calls aren’t JSON (Codex/OpenCode glob payloads)

5 participants