Improve reasoning and tool call parsing in server by awni · Pull Request #711 · ml-explore/mlx-lm · GitHub

awni · 2025-12-30T20:51:05Z

This is not technically part of the OpenAI chat API spec. But seems to be the defacto standard (used by vllm, open router, etc).

And it makes the server work better with upstream tools which use openAI compatible APIs. For example with OpenCode you get the reasoning traces nicely formatted:

I also added tool call parsing and redesigned how custom chat templates and tool call parsing works.

Closes #607

awni · 2025-12-30T20:51:57Z

Also fixed a bug where we were keeping eos text in some cases at the end of the generation in the server.

mzbac · 2026-01-02T03:40:15Z

Thanks for the great initiative. I was working on the similar idea of creating different tool parsers for each model. I'm glad I'm on the right path, can't wait to see this get merged and tested on the MiniMax M2.1 with open code on local stack :)

awni · 2026-01-02T17:18:15Z

I just added a parser for Minimax M2.1 as well. The main thing to be aware of is that right now you have to manually modify the tokenizer_config.json to add the tool parser type - as in here.

I would like to make that easier to automate.. but it's a bit tricky to auto-detect it since it's the tool parser is different than the model family in some cases (e.g. qwen 3 coder vs qwen 3). Might also be good to add a way to specify the tool call parser as an option.

mzbac · 2026-01-02T18:00:42Z

I just added a parser for Minimax M2.1 as well. The main thing to be aware of is that right now you have to manually modify the tokenizer_config.json to add the tool parser type - as in here.

I would like to make that easier to automate.. but it's a bit tricky to auto-detect it since it's the tool parser is different than the model family in some cases (e.g. qwen 3 coder vs qwen 3). Might also be good to add a way to specify the tool call parser as an option.

Thanks for the Minimax parser, I'll try it out, when I was working on my fork to integrate with opencode, I noticed that the tokenizer sometimes generates weird detokenized words, such as "-[space]word" instead of "-word." I'll try the PR branch to see if this issue occurs there as well.

kernelpool · 2026-01-03T03:31:14Z

Just a minor observation. I've noticed reasoning is being used instead of reasoning_content which seems to be the direction some projects are going (e.g. vLLM). However, a lot of chat templates still rely on reasoning_content, including GLM-4.7. Considering things like interleaved/persistent reasoning (where a client passes reasoning back to the server), would it be expected that the client still uses reasoning_content for the chat templates to function correctly?

awni · 2026-01-03T03:47:32Z

Yes it's confusing. The input messages accept reasoning_content in the assistant role (which is passed directly to the chat template. We could remap that if needed.

The ouput response sets the reasoning field in the choices.

I wouldn't have designed it that way 😅 but that's what the downstream models and upstream tools expect.

For persistent reasoning with GLM4.7 we have to rely on the upstream tool to pass back the reasoning content which I don't think is standard yet.. but that's something to look into.

ivanfioravanti · 2026-01-03T09:21:10Z

I just added a parser for Minimax M2.1 as well. The main thing to be aware of is that right now you have to manually modify the tokenizer_config.json to add the tool parser type - as in here.

I would like to make that easier to automate.. but it's a bit tricky to auto-detect it since it's the tool parser is different than the model family in some cases (e.g. qwen 3 coder vs qwen 3). Might also be good to add a way to specify the tool call parser as an option.

@awni what if this logic is handled internally, starting from model family and managing exceptions from there? Because adding a line in tokenizer_config.json breaks other tools using mlx-lm with their own version of models like lm-studio or exo.
Not the most elegant solution, but maybe less impactful on the ecosystem.

ivanfioravanti · 2026-01-03T10:29:50Z

This PR is a game changer! Experimental as everything around tool calling, but opens doors to new scenarios.
Thanks @awni 🙏🏻

awni · 2026-01-03T15:46:46Z

I made a hybrid approach which tries to infer the tool parser but still adds the tool_parser_type to the tokenizer_config.json. Are you saying adding tool_parser_type as a field is breaking downstream tools? That would be somewhat surprising.. usually with these kinds of configs they should ignore fields that they don't use.

ivanfioravanti · 2026-01-03T16:36:19Z

No no, I meant that they will have to add this additional field in their version of the models or tool parser won’t work out of the box if they want to leverage same mlx tool parser.

awni · 2026-01-03T16:58:57Z

Gotcha! That should be taken care of now!

mzbac · 2026-01-04T03:30:34Z

Just sharing some findings, may not be directly related to the PR. Somehow, Minimax 2.1 at a temperature of 1.0 picks up a lot of incorrect tokens in the file path, for example, leading spaces with a slash. However, when the temperature is set to 0, it produces a lot of repetitive tokens. In the end, I added file path normalization to remove extra spaces and used temperature 0 with a higher repetition penalty. It seems with these settings, it works fine on my local setup up to a 40k context. I'm not sure if there are some differences in the sampler between MLX and other inference frameworks, given that the model card suggests a temperature of 1.0. @ivanfioravanti

awni · 2026-01-04T04:16:19Z

I don't think it's related to this PR. If you have a prompt or somethign that produces low quality outputs that would be useful to share in an issue. And also the model you are using (e.g. quantization).

mzbac · 2026-01-05T02:48:47Z

I don't think it's related to this PR. If you have a prompt or somethign that produces low quality outputs that would be useful to share in an issue. And also the model you are using (e.g. quantization).

done #725 :)

angeloskath

Awesome!

awni requested a review from angeloskath December 30, 2025 20:52

awni changed the title ~~Parse reasoning in server~~ Improve reasoning and tool call parsing in server Dec 31, 2025

awni mentioned this pull request Dec 31, 2025

Fix tool calling with openai server #598

Closed

awni added 3 commits December 31, 2025 14:24

Parse reasoning in server

ac7d7fe

redesign and start to fix tool parsing

3c52635

add function gemma

920269a

awni force-pushed the server_parse_reasoning branch from f627cf7 to 920269a Compare January 1, 2026 00:02

fix

05d377e

awni force-pushed the server_parse_reasoning branch from 38f8179 to 05d377e Compare January 1, 2026 04:03

fix

50b87c3

awni force-pushed the server_parse_reasoning branch from 3f7b2cd to 5b99956 Compare January 1, 2026 17:28

ivanfioravanti mentioned this pull request Jan 1, 2026

Normalize tool schemas and tool calls #715

Closed

awni force-pushed the server_parse_reasoning branch from 5b99956 to 05ba3a0 Compare January 2, 2026 14:08

glm47 tools

96e2906

awni force-pushed the server_parse_reasoning branch from 05ba3a0 to 96e2906 Compare January 2, 2026 15:30

add minimax m2 tool parser

be57c94

awni force-pushed the server_parse_reasoning branch from f4fc6da to be57c94 Compare January 2, 2026 18:32

Awni Hannun added 2 commits January 2, 2026 11:32

tool_call finish reason

c0ae67f

Keep model wired in the server to reduce ttft

3e1be01

awni mentioned this pull request Jan 3, 2026

Add MiniMax M2 tool call parser support #665

Closed

4 tasks

infer tool parser

13f4497

angeloskath approved these changes Jan 5, 2026

View reviewed changes

awni merged commit ac8ae2c into main Jan 5, 2026
2 checks passed

awni deleted the server_parse_reasoning branch January 5, 2026 22:18

TomLucidor mentioned this pull request Jan 5, 2026

Guides on using this with Claude Code cubist38/mlx-openai-server#122

Closed

TinyWorkshopDesign mentioned this pull request May 21, 2026

feat: add tool parser support for Qwen 3.5/3.6 (non-Coder) models #1293

Open

Conversation

awni commented Dec 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

awni commented Dec 30, 2025

Uh oh!

mzbac commented Jan 2, 2026

Uh oh!

awni commented Jan 2, 2026

Uh oh!

mzbac commented Jan 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kernelpool commented Jan 3, 2026

Uh oh!

awni commented Jan 3, 2026

Uh oh!

ivanfioravanti commented Jan 3, 2026

Uh oh!

ivanfioravanti commented Jan 3, 2026

Uh oh!

awni commented Jan 3, 2026

Uh oh!

ivanfioravanti commented Jan 3, 2026

Uh oh!

awni commented Jan 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mzbac commented Jan 4, 2026

Uh oh!

awni commented Jan 4, 2026

Uh oh!

mzbac commented Jan 5, 2026

Uh oh!

angeloskath left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

awni commented Dec 30, 2025 •

edited

Loading

mzbac commented Jan 2, 2026 •

edited

Loading

awni commented Jan 3, 2026 •

edited

Loading