Tags · bstnxbt/dflash-mlx · GitHub

v0.1.6

Large runtime, server, and agentic-workflow release since v0.1.5, including the v0.1.5.1 fixes.

- Reworked runtime ownership around typed runtime config, `RuntimeBundle`, `ServerRuntime`, target adapters, draft loading, cache management, and observability.
- Default verify policy is now `adaptive`; fixed DFlash verification is available as `--verify-mode dflash`.
- Added explicit verify modes: `adaptive`, `dflash`, `ddtree`, and `off`.
- Added DDTree branch verification mode for Qwen target paths.
- Added internal CopySpec candidate reuse for repeated-token continuation from prompt/generated history.
- Added target-owned Qwen and Gemma4 backend routing, with unknown model families failing closed instead of falling into generic logic.
- Added Gemma4 adapter support for cache construction, logits, hidden capture, GQA routing, and guarded prefix snapshots.
- Added minimal Qwen3-Next fused-GDN projection support in Qwen target verification paths. This is source-level support, not a fully optimized public target claim.
- Moved long-context attention routing behind target adapters; public split-SDPA switches are gone.
- Productized `verify_qmm` through runtime config and target capabilities, with stock MLX fallback for unsupported shapes.
- Large registered DFlash drafts now default to in-memory `w4` quantization; use `--draft-quant none` to run a bf16 (non-quantized) A/B comparison.
- Added handling for older Apple chips: on chips that emulate BF16, quantized DFlash drafts use fp16 for their floating-point tensors.
- Prefix cache is now a managed L1+L2 snapshot service with stable-prefix lookup, L2 promotion, validation, budgets, and server metrics.
- Added explicit target-only fallback when DFlash context limits are exceeded, with fallback state and physical prefill accounting.
- Hardened the OpenAI-compatible server for OpenCode, aider, Continue, Open WebUI, LM Studio through its OpenAI-compatible adapter, and other OpenAI-compatible clients.
- Added stricter Chat Completions tool-call handling, including streamed `delta.tool_calls`, Qwen XML spans, Gemma4 spans, JSON fallback, and fail-fast validation for malformed or undeclared tool calls.
- Added minimal non-streaming `/v1/responses` compatibility for text input and function-call tools.
- Added live `/metrics`, structured diagnostics, memory reporting, request summaries, and prefix-cache observability.
- Added agentic trace/replay lab tooling for real OpenAI-compatible client sessions such as OpenCode/pi.
- Short-output target-only AR fast path is now opt-in with `--fastpath-max-tokens N`; default serving keeps requests on the DFlash path.
- Switched license to Apache-2.0.
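
For clients consuming streamed `delta.tool_calls`, the following is a minimal client-side sketch of how the fragments might be reassembled and checked. It assumes the chunk shape of the OpenAI Chat Completions streaming format; `accumulate_tool_calls` and `validate_tool_calls` are hypothetical helper names, not part of dflash-mlx, and the validation only mirrors the fail-fast idea described above rather than the server's actual checks.

```python
import json


def accumulate_tool_calls(chunks):
    """Merge streamed Chat Completions chunks into complete tool calls.

    Hypothetical client-side helper; assumes OpenAI-style chunk dicts.
    """
    calls = {}  # tool-call index -> partially assembled call
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            for part in delta.get("tool_calls") or []:
                slot = calls.setdefault(
                    part["index"], {"id": None, "name": "", "arguments": ""}
                )
                if part.get("id"):
                    slot["id"] = part["id"]
                fn = part.get("function") or {}
                if fn.get("name"):
                    slot["name"] = fn["name"]
                # Argument JSON arrives as string fragments; concatenate them.
                slot["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


def validate_tool_calls(calls, declared_names):
    """Raise ValueError on undeclared tools or non-object JSON arguments."""
    parsed = []
    for call in calls:
        if call["name"] not in declared_names:
            raise ValueError(f"undeclared tool: {call['name']!r}")
        try:
            args = json.loads(call["arguments"])
        except json.JSONDecodeError as exc:
            raise ValueError(f"malformed arguments for {call['name']!r}") from exc
        if not isinstance(args, dict):
            raise ValueError(f"arguments for {call['name']!r} are not an object")
        parsed.append((call["name"], args))
    return parsed
```

Keying the partial calls by `index` keeps interleaved fragments of several tool calls apart, and parsing the arguments only after the stream ends avoids failing on JSON that is merely incomplete.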

- `--verify-mode auto` was removed. Use `--verify-mode dflash` for fixed DFlash verification.
- Public `--split-sdpa` and `--no-split-sdpa` controls were removed; attention routing is now target-owned.
- `dflash profiles` and old profile/env resolution behavior were removed.
- Old top-level generation invocation is rejected; use `dflash generate`.
- Removed legacy benchmark modes and old diagnostic aliases; use documented benchmark flags and `--diagnostics`.
- Runtime internals moved under `dflash_mlx/runtime/`; old runtime import paths are gone.
- `/v1/responses` is intentionally limited: no streaming, multimodal input, Responses-native reasoning/text/truncation controls, `tool_choice`, `parallel_tool_calls`, `previous_response_id`, or persistent `store`.
- Function-specific Chat Completions `tool_choice` and `parallel_tool_calls: false` are rejected.
- `target_fa_window > 0` disables prefix cache/L2 by design.
- Bumping to `0.1.6` invalidates older L2 prefix snapshots through runtime-version validation, so they rebuild.
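
Because `/v1/responses` is intentionally limited, a client may want to check a request body before sending it. The sketch below is a hypothetical pre-flight helper (`find_unsupported` is not a dflash-mlx API). It assumes the field names of the OpenAI Responses API (`reasoning`, `text`, `truncation`, `input_text` content parts) and treats any non-`function` tool type as unsupported, which is an inference from the "function-call tools" wording above. How the server itself reacts to these fields is not stated in these notes.

```python
# Request keys the notes list as out of scope for /v1/responses.
UNSUPPORTED_KEYS = (
    "reasoning",
    "text",
    "truncation",
    "tool_choice",
    "parallel_tool_calls",
    "previous_response_id",
)

# Content part types treated as plain text (assumption based on the
# OpenAI Responses API naming).
TEXT_PART_TYPES = ("input_text", "output_text")


def find_unsupported(body):
    """Return a list of features in `body` that fall outside the limited surface."""
    problems = []
    if body.get("stream"):
        problems.append("stream")
    if body.get("store"):
        problems.append("store")
    problems.extend(key for key in UNSUPPORTED_KEYS if key in body)

    # Only function-call tools are described as supported.
    for tool in body.get("tools", []):
        if tool.get("type") != "function":
            problems.append(f"tool type {tool.get('type')!r}")

    # A plain string input is text; for item lists, flag non-text parts.
    items = body.get("input")
    if isinstance(items, list):
        for item in items:
            content = item.get("content")
            if isinstance(content, list):
                for part in content:
                    if part.get("type") not in TEXT_PART_TYPES:
                        problems.append(f"content part {part.get('type')!r}")
    return problems
```

An empty list means the request stays within the text-plus-function-tools subset; anything else is a candidate for routing through Chat Completions instead.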

- Use explicit runtime flags instead of old profiles.
- Use `--fastpath-max-tokens N` only when you intentionally want target-only AR for very short server responses.
- Treat `tools/benchmarks/agentic_trace` as diagnostic/lab tooling, not as the public benchmark surface.
- Public benchmark claims should continue to come from `dflash benchmark`.
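
Scripts that still pass removed flags can be updated mechanically. The following is a minimal sketch of such a rewrite, assuming space-separated flag values; `migrate_argv` is a hypothetical helper, not something shipped with dflash-mlx. It maps `--verify-mode auto` to `--verify-mode dflash` as the notes above suggest and drops the removed split-SDPA switches. Old profile usage is not handled, since its replacement flags depend on the profile.

```python
# Flags removed in v0.1.6 that take no value and have no replacement.
REMOVED_FLAGS = {"--split-sdpa", "--no-split-sdpa"}


def migrate_argv(argv):
    """Rewrite a pre-0.1.6 argument list; return (new_argv, notes)."""
    out = []
    notes = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in REMOVED_FLAGS:
            # Attention routing is now target-owned, so the switch is dropped.
            notes.append(f"dropped {arg}")
            i += 1
            continue
        if arg == "--verify-mode" and i + 1 < len(argv) and argv[i + 1] == "auto":
            # `auto` was removed; the notes point to `dflash` for fixed verification.
            out.extend(["--verify-mode", "dflash"])
            notes.append("replaced --verify-mode auto with --verify-mode dflash")
            i += 2
            continue
        out.append(arg)
        i += 1
    return out, notes
```

Returning the notes alongside the rewritten list lets a wrapper script log what changed rather than silently altering behavior.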