Tags: bstnxbt/dflash-mlx
dflash-mlx v0.1.7

Big runtime + benchmark update focused on Qwen3.6 27B 4-bit and real long-context usage.

Highlights:

- Retuned adaptive verify policy for long-context and agentic decode.
- Added richer adaptive metrics: tokens/cycle, block reductions, reduced-block cycles, min block size, per-mode/per-block speed, CopySpec counters.
- Improved /metrics with decode-only average tok/s and split prefill rates: logical, real, restored.
- Added AIME25 benchmark suite with exact integer answer scoring.
- Added --only-dflash for expensive benchmark runs.
- Fixed benchmark cooldown so it applies between baseline and DFlash legs.
- Preserved prefix-cache chunking and generation snapshots for tool/agentic turns.
- Fixed Qwen thinking default to follow tokenizer/request behavior instead of forcing thinking off.
- Fixed GDN recurrent exactness around state dtype in gated-delta tape/tree kernels.
- Added public README benchmark artifacts for Qwen3.6 27B 4-bit at 1k, 2k, 4k, 8k, and 16k.

Measured README prompt, Qwen3.6 27B 4-bit, repeat=3, cooldown=120s, no EOS:

- 1024: baseline 33.26 tok/s, DFlash 98.05 tok/s, 2.95x
- 2048: baseline 32.34 tok/s, DFlash 90.67 tok/s, 2.81x
- 4096: baseline 30.58 tok/s, DFlash 93.55 tok/s, 3.06x
- 8192: baseline 26.03 tok/s, DFlash 79.12 tok/s, 3.04x
- 16384: baseline 21.50 tok/s, DFlash 60.77 tok/s, 2.78x

This release is mostly about making DFlash more usable and observable in real runs, especially long-context coding/agentic workloads.
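The speedup column above can be sanity-checked as the ratio of DFlash to baseline throughput. The following is a minimal sketch, not part of the project: the names `MEASURED`, `speedup`, and `report` are hypothetical, and the figures are copied from the list above. Ratios recomputed from the rounded tok/s values can differ from the listed speedups by a few hundredths; one plausible reason (an assumption, not confirmed by the notes) is that the published speedups were aggregated per run rather than derived from the averaged throughputs.

```python
# Figures copied from the v0.1.7 release notes.
# context length -> (baseline tok/s, DFlash tok/s, listed speedup)
MEASURED = {
    1024: (33.26, 98.05, 2.95),
    2048: (32.34, 90.67, 2.81),
    4096: (30.58, 93.55, 3.06),
    8192: (26.03, 79.12, 3.04),
    16384: (21.50, 60.77, 2.78),
}


def speedup(baseline_tps: float, dflash_tps: float) -> float:
    """Return DFlash throughput divided by baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return dflash_tps / baseline_tps


def report(measured: dict) -> list:
    """Return (context, recomputed speedup rounded to 2 places, listed speedup) rows."""
    rows = []
    for ctx, (base, dflash, listed) in sorted(measured.items()):
        rows.append((ctx, round(speedup(base, dflash), 2), listed))
    return rows


if __name__ == "__main__":
    for ctx, computed, listed in report(MEASURED):
        print(f"{ctx:>6}: computed {computed:.2f}x, listed {listed:.2f}x")
```

The same helper can be pointed at your own `dflash benchmark` results to compare against these figures on different hardware.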
Large runtime, server, and agentic-workflow release since v0.1.5, including the v0.1.5.1 fixes.

- Reworked runtime ownership around typed runtime config, `RuntimeBundle`, `ServerRuntime`, target adapters, draft loading, cache management, and observability.
- Default verify policy is now `adaptive`; fixed DFlash verification is available as `--verify-mode dflash`.
- Added explicit verify modes: `adaptive`, `dflash`, `ddtree`, and `off`.
- Added DDTree branch verification mode for Qwen target paths.
- Added internal CopySpec candidate reuse for repeated-token continuation from prompt/generated history.
- Added target-owned Qwen and Gemma4 backend routing, with unknown model families failing closed instead of falling into generic logic.
- Added Gemma4 adapter support for cache construction, logits, hidden capture, GQA routing, and guarded prefix snapshots.
- Added minimal Qwen3-Next fused-GDN projection support in Qwen target verification paths. This is source-level support, not a fully optimized public target claim.
- Moved long-context attention routing behind target adapters; public split-SDPA switches are gone.
- Productized `verify_qmm` through runtime config and target capabilities, with stock MLX fallback for unsupported shapes.
- Large registered DFlash drafts now default to in-memory `w4`; use `--draft-quant none` for bf16/non-quant A/B.
- Added old Apple chip handling so quantized DFlash drafts use fp16 floating tensors on BF16-emulated chips.
- Prefix cache is now a managed L1+L2 snapshot service with stable-prefix lookup, L2 promotion, validation, budgets, and server metrics.
- Added explicit target-only fallback when DFlash context limits are exceeded, with fallback state and physical prefill accounting.
- Hardened the OpenAI-compatible server for OpenCode, aider, Continue, Open WebUI, LM Studio through its OpenAI-compatible adapter, and other OpenAI-compatible clients.
- Added stricter Chat Completions tool-call handling, including streamed `delta.tool_calls`, Qwen XML spans, Gemma4 spans, JSON fallback, and fail-fast validation for malformed or undeclared tool calls.
- Added minimal non-streaming `/v1/responses` compatibility for text input and function-call tools.
- Added live `/metrics`, structured diagnostics, memory reporting, request summaries, and prefix-cache observability.
- Added agentic trace/replay lab tooling for real OpenAI-compatible client sessions such as OpenCode/pi.
- Short-output target-only AR fast path is now opt-in with `--fastpath-max-tokens N`; default serving keeps requests on the DFlash path.
- Switched license to Apache-2.0.
- `--verify-mode auto` was removed. Use `--verify-mode dflash` for fixed DFlash verification.
- Public `--split-sdpa` and `--no-split-sdpa` controls were removed; attention routing is now target-owned.
- `dflash profiles` and old profile/env resolution behavior were removed.
- Old top-level generation invocation is rejected; use `dflash generate`.
- Removed legacy benchmark modes and old diagnostic aliases; use documented benchmark flags and `--diagnostics`.
- Runtime internals moved under `dflash_mlx/runtime/`; old runtime import paths are gone.
- `/v1/responses` is intentionally limited: no streaming, multimodal input, Responses-native reasoning/text/truncation controls, `tool_choice`, `parallel_tool_calls`, `previous_response_id`, or persistent `store`.
- Function-specific Chat Completions `tool_choice` and `parallel_tool_calls: false` are rejected.
- `target_fa_window > 0` disables prefix cache/L2 by design.
- Bumping to `0.1.6` invalidates older L2 prefix snapshots through runtime-version validation, so they rebuild.
- Use explicit runtime flags instead of old profiles.
- Use `--fastpath-max-tokens N` only when you intentionally want target-only AR for very short server responses.
- Treat `tools/benchmarks/agentic_trace` as diagnostic/lab tooling, not as the public benchmark surface.
- Public benchmark claims should continue to come from `dflash benchmark`.
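For scripts that still pass the removed flags, the flag changes listed above can be applied mechanically. The sketch below is illustrative only and not part of dflash-mlx: `migrate_args` and `REMOVED_FLAGS` are hypothetical names, and it assumes that `--split-sdpa` and `--no-split-sdpa` were value-less toggles and that a `--verify-mode=auto` spelling may appear in existing scripts. It covers only the flag renames; removed profiles and moved import paths still need manual changes.

```python
# Flags the notes list as removed; assumed here to take no value.
REMOVED_FLAGS = {"--split-sdpa", "--no-split-sdpa"}


def migrate_args(argv: list) -> list:
    """Rewrite an argument list written for older releases.

    - drops the removed split-SDPA toggles
    - maps `--verify-mode auto` to `--verify-mode dflash`
    All other arguments pass through unchanged.
    """
    out = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in REMOVED_FLAGS:
            i += 1  # skip the removed toggle
            continue
        if arg == "--verify-mode" and i + 1 < len(argv) and argv[i + 1] == "auto":
            out.extend(["--verify-mode", "dflash"])
            i += 2  # consume both the flag and its value
            continue
        if arg == "--verify-mode=auto":  # assumed alternative spelling
            out.append("--verify-mode=dflash")
            i += 1
            continue
        out.append(arg)
        i += 1
    return out


if __name__ == "__main__":
    old = ["--verify-mode", "auto", "--no-split-sdpa", "--draft-quant", "none"]
    print(migrate_args(old))
```

Mapping `auto` to `dflash` keeps the fixed DFlash verification behavior described above; omit `--verify-mode` entirely to pick up the new `adaptive` default instead.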