Add 'mx.clear_cache()' to piecewise prompt processing in server. by N8python · Pull Request #917 · ml-explore/mlx-lm · GitHub

N8python · 2026-02-22T02:31:01Z

Prompts newly added to the server were processed piecewise without cache clearing enabled. This led to a massive buildup of memory.

For prompts of length ~16K on GLM 4.7 Flash 6bit, this reduced peak memory from 50+ GB to a more reasonable 28 GB.
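The pattern behind the fix can be sketched as chunked prompt processing that releases cached memory after every chunk. MLX evaluates lazily and keeps freed buffers in an allocator cache for reuse rather than returning them to the system right away; `mx.clear_cache()` releases that cache. This is a minimal, framework-agnostic sketch under stated assumptions, not the mlx-lm server code: `prefill_in_chunks`, its callback parameters, the 2048 default chunk size, and the choice to leave the final prompt token for the first decoding step are all hypothetical and chosen for illustration.

```python
from typing import Callable, Sequence


def prefill_in_chunks(
    tokens: Sequence[int],
    step: Callable[[Sequence[int]], None],
    sync: Callable[[], None],
    clear_cache: Callable[[], None],
    chunk_size: int = 2048,  # illustrative default, not taken from the PR
) -> Sequence[int]:
    """Feed a prompt to the model chunk by chunk, clearing the cache after each.

    Hypothetical helper: returns the tokens that were not prefilled (here, the
    final token, which the caller is assumed to use for the first decode step).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    processed = 0
    # Stop one token early so the last token is left for generation.
    while len(tokens) - processed > 1:
        n = min(chunk_size, len(tokens) - processed - 1)
        step(tokens[processed:processed + n])  # extend the KV cache with this chunk
        sync()         # force the lazily built computation for this chunk to run
        clear_cache()  # release freed buffers instead of letting them accumulate
        processed += n
    return tokens[processed:]
```

With MLX, `step` could plausibly call the model on `mx.array(chunk)[None]` with the prompt cache, `sync` could call `mx.eval` on the cache state, and `clear_cache` would be `mx.clear_cache`; that wiring is an assumption, not a copy of the server implementation. Injecting the three callbacks keeps the sketch testable without a model and makes the ordering explicit: evaluate the chunk first, then clear.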

ivanfioravanti · 2026-02-23T13:54:43Z

It works! A single line of code with a mega impact.

awni

Good catch! Not sure how we missed that.

N8python · 2026-02-23T19:17:21Z

Can we merge?

clear cache on prompt ingestion in server

3de6218

awni approved these changes Feb 23, 2026

awni merged commit d4701ba into ml-explore:main Feb 23, 2026
2 checks passed

awni merged 1 commit into ml-explore:main from N8python:main
3 participants