Tags: bstnxbt/dflash-mlx

v0.1.7

dflash-mlx v0.1.7

Big runtime + benchmark update focused on Qwen3.6 27B 4-bit and real long-context usage.

Highlights:
- Retuned adaptive verify policy for long-context and agentic decode.
- Added richer adaptive metrics: tokens/cycle, block reductions, reduced-block cycles, min block size, per-mode/per-block speed, CopySpec counters.
- Improved /metrics with decode-only average tok/s and split prefill rates: logical, real, restored.
- Added AIME25 benchmark suite with exact integer answer scoring.
- Added --only-dflash for expensive benchmark runs.
- Fixed benchmark cooldown so it applies between baseline and DFlash legs.
- Preserved prefix-cache chunking and generation snapshots for tool/agentic turns.
- Fixed Qwen thinking default to follow tokenizer/request behavior instead of forcing thinking off.
- Fixed GDN recurrent exactness around state dtype in gated-delta tape/tree kernels.
- Added public README benchmark artifacts for Qwen3.6 27B 4-bit at 1k, 2k, 4k, 8k, and 16k.
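Exact integer answer scoring, as mentioned for the AIME25 suite above, can be sketched roughly as follows. This is a hypothetical illustration, not the suite's actual scorer: the names `extract_final_integer`, `score_exact`, and `accuracy`, and the rule of taking the last integer that appears in the output, are assumptions. AIME answers are integers from 0 to 999, so no numeric tolerance or partial credit is involved.

```python
import re
from typing import List, Optional


def extract_final_integer(text: str) -> Optional[int]:
    """Return the last run of digits in the model output as an int, or None.

    Assumption: the final answer is the last integer mentioned in the text.
    """
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None


def score_exact(output: str, reference: int) -> bool:
    """True only if the extracted integer equals the reference exactly."""
    predicted = extract_final_integer(output)
    return predicted is not None and predicted == reference


def accuracy(outputs: List[str], references: List[int]) -> float:
    """Fraction of outputs whose extracted integer matches its reference."""
    correct = sum(score_exact(o, r) for o, r in zip(outputs, references))
    return correct / len(references)
```

Leading zeros such as `073` compare equal to `73` under this rule, and an output with no digits at all scores as incorrect.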

Measured with the README prompt on Qwen3.6 27B 4-bit (repeat=3, cooldown=120s, no EOS):
- 1024: baseline 33.26 tok/s, DFlash 98.05 tok/s, 2.95x
- 2048: baseline 32.34 tok/s, DFlash 90.67 tok/s, 2.81x
- 4096: baseline 30.58 tok/s, DFlash 93.55 tok/s, 3.06x
- 8192: baseline 26.03 tok/s, DFlash 79.12 tok/s, 3.04x
- 16384: baseline 21.50 tok/s, DFlash 60.77 tok/s, 2.78x
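The speedup figures above relate the two throughput columns. A small sketch for recomputing the ratio from the listed averages is shown here. The `RESULTS` table and the `speedup` helper are illustrative names, not part of dflash-mlx. Note the assumption involved: this divides the averaged rates, whereas the published speedups may have been aggregated differently (for example per repeat), so the recomputed values need not match the listed ones to the last digit.

```python
# (size label, baseline tok/s, DFlash tok/s), copied from the list above.
RESULTS = [
    (1024, 33.26, 98.05),
    (2048, 32.34, 90.67),
    (4096, 30.58, 93.55),
    (8192, 26.03, 79.12),
    (16384, 21.50, 60.77),
]


def speedup(baseline_tps: float, dflash_tps: float) -> float:
    """Ratio of DFlash throughput to baseline throughput."""
    return dflash_tps / baseline_tps


for size, base, fast in RESULTS:
    print(f"{size:>6}: {speedup(base, fast):.2f}x")
```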

This release is mostly about making DFlash more usable and observable in real runs, especially long-context coding/agentic workloads.

v0.1.6

Large runtime, server, and agentic-workflow release since v0.1.5, including the v0.1.5.1 fixes.

- Reworked runtime ownership around typed runtime config, `RuntimeBundle`, `ServerRuntime`, target adapters, draft loading, cache management, and observability.
- Default verify policy is now `adaptive`; fixed DFlash verification is available as `--verify-mode dflash`.
- Added explicit verify modes: `adaptive`, `dflash`, `ddtree`, and `off`.
- Added DDTree branch verification mode for Qwen target paths.
- Added internal CopySpec candidate reuse for repeated-token continuation from prompt/generated history.
- Added target-owned Qwen and Gemma4 backend routing, with unknown model families failing closed instead of falling into generic logic.
- Added Gemma4 adapter support for cache construction, logits, hidden capture, GQA routing, and guarded prefix snapshots.
- Added minimal Qwen3-Next fused-GDN projection support in Qwen target verification paths. This is source-level support, not a fully optimized public target claim.
- Moved long-context attention routing behind target adapters; public split-SDPA switches are gone.
- Productized `verify_qmm` through runtime config and target capabilities, with stock MLX fallback for unsupported shapes.
- Large registered DFlash drafts now default to in-memory `w4`; use `--draft-quant none` for bf16 (non-quantized) A/B comparisons.
- Added handling for older Apple chips: quantized DFlash drafts use fp16 floating-point tensors on chips that emulate BF16.
- Prefix cache is now a managed L1+L2 snapshot service with stable-prefix lookup, L2 promotion, validation, budgets, and server metrics.
- Added explicit target-only fallback when DFlash context limits are exceeded, with fallback state and physical prefill accounting.
- Hardened the OpenAI-compatible server for OpenCode, aider, Continue, Open WebUI, LM Studio through its OpenAI-compatible adapter, and other OpenAI-compatible clients.
- Added stricter Chat Completions tool-call handling, including streamed `delta.tool_calls`, Qwen XML spans, Gemma4 spans, JSON fallback, and fail-fast validation for malformed or undeclared tool calls.
- Added minimal non-streaming `/v1/responses` compatibility for text input and function-call tools.
- Added live `/metrics`, structured diagnostics, memory reporting, request summaries, and prefix-cache observability.
- Added agentic trace/replay lab tooling for real OpenAI-compatible client sessions such as OpenCode/pi.
- Short-output target-only AR fast path is now opt-in with `--fastpath-max-tokens N`; default serving keeps requests on the DFlash path.
- Switched license to Apache-2.0.
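The CopySpec item above describes reusing repeated-token continuations from prompt and generated history. A minimal sketch of that general idea, assuming a plain suffix match over token IDs: find the most recent earlier occurrence of the last few tokens, then propose the tokens that followed it as draft candidates for the target model to verify. The function name `propose_copy_candidates` and the parameters `match_len` and `max_candidates` are hypothetical, and the internal implementation in dflash-mlx may differ.

```python
from typing import List


def propose_copy_candidates(
    history: List[int],
    match_len: int = 3,
    max_candidates: int = 8,
) -> List[int]:
    """Propose draft tokens by copying what followed an earlier match.

    `history` is the prompt plus generated token IDs so far. Returns an
    empty list when the trailing `match_len` tokens never occurred earlier.
    """
    if len(history) <= match_len:
        return []
    suffix = history[-match_len:]
    # Scan backwards so the most recent earlier occurrence wins. The start
    # index stops one short of the suffix's own position.
    for start in range(len(history) - match_len - 1, -1, -1):
        if history[start:start + match_len] == suffix:
            follow = start + match_len
            return history[follow:follow + max_candidates]
    return []
```

The candidates are only proposals: in a speculative setup the target model still verifies them, so a wrong copy costs time but does not change the output.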

Breaking changes and limitations:
- `--verify-mode auto` was removed. Use `--verify-mode dflash` for fixed DFlash verification.
- Public `--split-sdpa` and `--no-split-sdpa` controls were removed; attention routing is now target-owned.
- `dflash profiles` and old profile/env resolution behavior were removed.
- Old top-level generation invocation is rejected; use `dflash generate`.
- Removed legacy benchmark modes and old diagnostic aliases; use documented benchmark flags and `--diagnostics`.
- Runtime internals moved under `dflash_mlx/runtime/`; old runtime import paths are gone.
- `/v1/responses` is intentionally limited: no streaming, multimodal input, Responses-native reasoning/text/truncation controls, `tool_choice`, `parallel_tool_calls`, `previous_response_id`, or persistent `store`.
- Function-specific Chat Completions `tool_choice` and `parallel_tool_calls: false` are rejected.
- `target_fa_window > 0` disables prefix cache/L2 by design.
- Bumping to `0.1.6` invalidates older L2 prefix snapshots through runtime-version validation, so they are rebuilt.
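
Because `/v1/responses` is intentionally limited, a client can check a request body before sending it. The sketch below is a hypothetical client-side helper, not part of dflash-mlx. The function name `unsupported_responses_fields` is an assumption, the field names follow the OpenAI Responses API fields listed above, and treating `stream` and `store` as a problem only when they are set to a true value is also an assumption. How the server itself reacts to these fields is not stated in these notes.

```python
from typing import Any, Dict, List

# Fields flagged whenever they appear in the request body.
PRESENCE_FIELDS = (
    "reasoning",
    "text",
    "truncation",
    "tool_choice",
    "parallel_tool_calls",
    "previous_response_id",
)

# Fields flagged only when set to a truthy value (assumption).
TRUTHY_FIELDS = ("stream", "store")


def unsupported_responses_fields(body: Dict[str, Any]) -> List[str]:
    """Return the request fields that fall outside the limited surface."""
    found = [name for name in PRESENCE_FIELDS if name in body]
    found += [name for name in TRUTHY_FIELDS if body.get(name)]
    return found
```

An empty result means the body only uses fields outside the listed limitations. It does not check for multimodal input, which would need inspection of the `input` value itself.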

Migration notes:
- Use explicit runtime flags instead of old profiles.
- Use `--fastpath-max-tokens N` only when you intentionally want target-only AR for very short server responses.
- Treat `tools/benchmarks/agentic_trace` as diagnostic/lab tooling, not as the public benchmark surface.
- Public benchmark claims should continue to come from `dflash benchmark`.

v0.1.5

v0.1.5 — runtime serving surface