Supporting delay in mlx_lm benchmark#1010

Merged: angeloskath merged 2 commits into ml-explore:main from AndreasPlt:benchmark-with-delay on Mar 17, 2026

@AndreasPlt (Contributor) commented Mar 16, 2026

fixes #1009

Added an optional argument named delay to mlx_lm.benchmark.py, which controls the cooldown delay (in seconds) between consecutive benchmark runs. This is useful when running benchmarks on passively cooled MacBook Airs, which otherwise throttle performance significantly and thus degrade benchmark results.
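
As a rough illustration of the idea (not the code merged in this PR), a cooldown between trials could be sketched as follows. The names `run_trials`, `run_once`, and the injectable `sleep` parameter are hypothetical and chosen only for this example; the sketch assumes the pause happens between consecutive trials, not before the first or after the last.

```python
import time


def run_trials(run_once, num_trials, delay=0.0, sleep=time.sleep):
    """Call run_once() num_trials times, pausing `delay` seconds between trials.

    All names here are hypothetical; this is a sketch, not mlx_lm's implementation.
    """
    results = []
    for i in range(num_trials):
        # Cool down only between consecutive trials: skip the pause before
        # the first trial, and never pause after the last one.
        if i > 0 and delay > 0:
            sleep(delay)
        results.append(run_once())
    return results


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would time prompt processing
    # and generation here.
    timings = run_trials(lambda: time.perf_counter(), num_trials=3, delay=0.1)
    print(len(timings), "trials completed")
```

Passing the sleep function as a parameter keeps the loop easy to test without actually waiting.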


Edit: I also tried out running the same benchmark with and without delay, and I think the results show that cool-down can make quite a difference (running Qwen2.5-7b-Instruct-4bit, with 2048 prompt length and 2048 generation length, on a MacBook Air M4 with 24GB of RAM):

Without delay (command: mlx_lm benchmark --model mlx-community/Qwen2.5-7b-Instruct-4bit -n 5 -p 2048 -g 2048 --delay 0):

Timing with prompt_tokens=2048, generation_tokens=2048, batch_size=1.
Trial 1:  prompt_tps=204.832, generation_tps=21.086, peak_memory=5.114
Trial 2:  prompt_tps=137.867, generation_tps=18.929, peak_memory=5.114
Trial 3:  prompt_tps=153.996, generation_tps=18.775, peak_memory=5.114
Trial 4:  prompt_tps=162.618, generation_tps=17.456, peak_memory=5.115
Trial 5:  prompt_tps=149.155, generation_tps=18.051, peak_memory=5.115
Averages: prompt_tps=161.694, generation_tps=18.859, peak_memory=5.114

With delay (120s) (command: mlx_lm benchmark --model mlx-community/Qwen2.5-7b-Instruct-4bit -n 5 -p 2048 -g 2048 --delay 120):

Running warmup..
Timing with prompt_tokens=2048, generation_tokens=2048, batch_size=1.
Trial 1:  prompt_tps=200.729, generation_tps=21.146, peak_memory=5.114
Trial 2:  prompt_tps=203.119, generation_tps=21.440, peak_memory=5.114
Trial 3:  prompt_tps=193.286, generation_tps=21.239, peak_memory=5.114
Trial 4:  prompt_tps=201.892, generation_tps=21.806, peak_memory=5.115
Trial 5:  prompt_tps=207.449, generation_tps=20.508, peak_memory=5.115
Averages: prompt_tps=201.295, generation_tps=21.228, peak_memory=5.114

The run without delay clearly shows a significant drop in both prompt and generation TPS after the first trial. This drop does not occur when running with a delay, where prompt and generation TPS stay steady across trials.
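
To put a number on the gap, the per-trial values from the two logs above can be averaged and compared. The snippet below is only a sketch: the trial values are copied from the logs, while the helper name `relative_drop` is made up for this example. It should report that the averages without delay are roughly 20% lower for prompt TPS and roughly 11% lower for generation TPS.

```python
from statistics import mean

# Per-trial values copied from the two benchmark logs above.
no_delay = {
    "prompt_tps": [204.832, 137.867, 153.996, 162.618, 149.155],
    "generation_tps": [21.086, 18.929, 18.775, 17.456, 18.051],
}
with_delay = {
    "prompt_tps": [200.729, 203.119, 193.286, 201.892, 207.449],
    "generation_tps": [21.146, 21.440, 21.239, 21.806, 20.508],
}


def relative_drop(baseline, other):
    """Percentage by which mean(other) falls below mean(baseline)."""
    base = mean(baseline)
    return 100.0 * (base - mean(other)) / base


for key in no_delay:
    drop = relative_drop(with_delay[key], no_delay[key])
    print(f"{key}: {drop:.1f}% lower without delay")
```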

@angeloskath (Member) commented

Hm, interesting. That is definitely useful, but the Macs will also go into low power mode if the program sleeps for, say, 10 seconds, which means we'll still have some cold start effects 🤷‍♂️

@AndreasPlt (Contributor, Author) commented

> Hm, interesting. That is definitely useful, but the Macs will also go into low power mode if the program sleeps for, say, 10 seconds, which means we'll still have some cold start effects 🤷‍♂️

Appreciate the input! Feel free to correct me, but afaik calling time.sleep in a Python thread does not cause a system sleep (i.e. low power mode) immediately, as system sleep is governed by system idle time (if set in the settings). In any case, system idling can (and probably should, for most benchmarks) be deactivated when running with AC power in macOS settings ("Battery" > "Options" > Turn on "Prevent automatic sleeping on power adapter when the display is off").

@angeloskath (Member) left a comment

Yeah, whether the machine goes into low power mode depends on many factors. I was simply mentioning that there will still likely be a discrepancy. Either way, I think this looks great, and I especially like the numbers you added.

@angeloskath angeloskath merged commit 564281f into ml-explore:main Mar 17, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

Support cooldown delays in benchmark

2 participants