Releases: Basaltlabs-app/Gauntlet
v2.1.2 — community submission pipeline + MCP env loading + QOL
Why this release matters
Three independent silent-failure paths converged into "I ran benchmarks and nothing showed up." All three are fixed here, plus a substantial quality-of-life pass while we were in the area. First update in two weeks.
Reliability fixes
- CLI submissions now include `hardware_tier` + `attestation`: previously every CLI row landed in Supabase with `hardware_tier=""`, leaving tier-filtered dashboard views (Consumer-Mid, Consumer-High, Cloud) systematically empty.
- MCP fingerprint is environment-aware. Real `collect_fingerprint()` on desktop MCP clients (Gemini CLI, Claude Desktop, Cursor); placeholder only on Vercel. Was throwing away real RAM/CPU/GPU data and submitting `unknown`/`0`.
- `submit_result` rejection logs bumped DEBUG → WARNING with response body included. API rejections were silent before; now you actually see what the server didn't like.
- CTRL-C-safe submission ordering. Payload is enqueued to `~/.gauntlet/pending/` before the network call; only HTTP 200 deletes the file. Mid-flight kill no longer loses the result.
- Hardware fingerprint collected once per multi-model run (was rebuilt per model, with pointless `psutil`/`sysctl`/`nvidia-smi` calls on every iteration).
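The CTRL-C-safe ordering can be sketched roughly as follows. This is a minimal illustration, not the shipped code: `submit_with_queue`, the `send` callable, and the file naming are hypothetical, and the drop-on-4xx rule of the retry queue is left out.

```python
import json
import uuid
from pathlib import Path
from typing import Callable


def submit_with_queue(payload: dict, send: Callable[[dict], int], pending_dir: Path) -> bool:
    """Write the payload to disk first, then send; delete the file only on HTTP 200."""
    pending_dir.mkdir(parents=True, exist_ok=True)
    path = pending_dir / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(payload))  # on disk before any network I/O
    try:
        status = send(payload)  # hypothetical transport, returns the HTTP status code
    except Exception:
        return False  # file stays queued for a later replay
    if status == 200:
        path.unlink()
        return True
    return False  # non-200: keep the file for a retry
```

`KeyboardInterrupt` is not an `Exception` subclass, so a CTRL-C during `send` propagates and the queued file is left in place.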
New commands & flags
- `gauntlet --version` / `-V`: long overdue.
- `gauntlet doctor`: env-var detection, `.env` discovery, Supabase reachability, hardware preview, recent runs, retry queue. Drains the queue on its way through. Designed to answer "why didn't my last run show up?" in five seconds.
- `gauntlet history`: Rich table of past local runs with `--limit N` and `--model <substr>` filtering.
- `--no-submit` flag and `GAUNTLET_PRIVATE=1`: opt out of community submission for iteration runs.
- MCP startup banner prints `Community push: ENABLED/DISABLED` to stderr so misconfiguration surfaces upfront, not after a 20-minute run.
New plumbing
- dotenv-style env loader at package import. Searches `$GAUNTLET_ENV_FILE` → `.env.vercel.local` / `.env.local` / `.env` walking up to repo root → `~/.gauntlet/.env`. Allowlist of 10 keys; never overwrites parent-process env. Fixes the Gemini-CLI / Claude-Desktop subprocess env-blindness on macOS.
- Persistent retry queue at `~/.gauntlet/pending/`. Failed submissions queue to disk and replay on the next CLI invocation. Permanent 4xx rejections drop after one warning. Capped at 200 files. Self-healing without user intervention.
- MCP write path now goes through the same validator as `/api/submit`. New `gauntlet/mcp/submit_validator.py` enforces score-range, model-name length, category sanity, score-consistency, probe-count, attestation shape, probe-details size caps, and dedup. Closes the gap where `/mcp` accepted unvalidated payloads.
- `/api/version` endpoint returns `{server, min_supported, recommended, latest, upgrade_command, upgrade_url, release_notes}`.
- `GAUNTLET_API_URL` env override for self-hosting / staging.
- Auto-update notification on benchmark runs.
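The two env-loader rules (allowlist, never overwrite) might look like this in miniature. It is a sketch under assumptions: `ALLOWED_KEYS` is a hypothetical two-key stand-in for the real 10-key allowlist, the function names are illustrative, and the file search order is not reproduced.

```python
import os
from pathlib import Path
from typing import Iterable, MutableMapping

# Hypothetical allowlist; the release only says the real one has 10 keys.
ALLOWED_KEYS = {"GAUNTLET_API_URL", "GAUNTLET_PRIVATE"}


def parse_env_lines(lines: Iterable[str]) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    out = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        out[key.strip()] = value.strip().strip('"').strip("'")
    return out


def load_env_file(path: Path, env: MutableMapping[str, str] = os.environ) -> list:
    """Copy allowlisted keys from `path` into `env` without overwriting existing values."""
    if not path.is_file():
        return []
    applied = []
    for key, value in parse_env_lines(path.read_text().splitlines()).items():
        if key in ALLOWED_KEYS and key not in env:
            env[key] = value
            applied.append(key)
    return applied
```

Because keys already present in `env` are skipped, anything the parent process exported keeps priority over the file.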
Performance
- Read endpoint cache TTLs: `max-age=30, s-maxage=60` → `max-age=300, s-maxage=900, stale-while-revalidate=86400`.
- `/api/predict` + `/api/recommend` share a 60s in-memory matrix cache with stale-fallback. Was fetching 2000 rows per request; now once per minute. Serves stale on Supabase outage instead of 503.
- Dashboard polling pauses when the tab is hidden, refetches on focus. Background tabs no longer churn Supabase + Vercel functions.
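A TTL cache with stale fallback, as described for the shared matrix cache, can be sketched like this. The class name, the `fetch` callable, and the injectable `clock` are all hypothetical; only the 60-second TTL and the serve-stale-on-failure behaviour come from the notes above.

```python
import time
from typing import Any, Callable


class StaleFallbackCache:
    """Cache one value for `ttl` seconds; on refresh failure, keep serving the old value."""

    def __init__(self, fetch: Callable[[], Any], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._value: Any = None
        self._fetched_at = 0.0
        self._has_value = False

    def get(self) -> Any:
        now = self._clock()
        if self._has_value and now - self._fetched_at < self._ttl:
            return self._value  # fresh hit, no backend call
        try:
            value = self._fetch()
        except Exception:
            if self._has_value:
                return self._value  # backend down: serve stale instead of failing
            raise  # nothing cached yet, so the error has to surface
        self._value, self._fetched_at, self._has_value = value, now, True
        return value
```

A failed refresh does not reset the timestamp, so the next request tries the backend again rather than waiting out another TTL.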
Code health
- HMAC submit key deduplicated to `gauntlet.core.config.get_submit_key()`. `gauntlet.basaltlabs.app` URL also deduplicated; 5 hardcoded callsites consolidated.
- `submit_handler` refactored to use the shared validator. Single source of truth across `/api/submit` and the MCP write path.
- `str(e)` leakage plugged in three API handlers.
- `MIN_CLI_VERSION` bumped 1.3.5 → 2.0.0. Anything below pre-dates the fingerprint + attestation fields and was being silently accepted with `hardware_tier=""`.
Tests
- 42 new tests covering env-loader behavior, save-status formatting, and every validator rule.
- Total: 540 passing, 0 failures.
Docs
- New README section: Running gauntlet as a local MCP server. Per-client config blocks for Gemini CLI / Claude Desktop / Cursor, the macOS Dock-launch gotcha, and the `GAUNTLET_ENV_FILE` override for CI.
- Clarified MCP data-quality language: desktop MCP runs now produce real fingerprints; the "serverless" caveat applies only to the hosted endpoint.
Breaking change
MIN_CLI_VERSION is now 2.0.0. Submissions from CLI versions below 2.0.0 will be rejected with HTTP 400 and a clear "run pipx upgrade gauntlet-cli" message. If you've been pinned, this is the moment to upgrade.
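A server-side version gate of this kind could be sketched as below. The function names and the exact message wording are hypothetical; only the 2.0.0 floor, the HTTP 400, and the upgrade hint come from the note above. Versions are compared as integer tuples so that `1.10.0` is not mistaken for something newer than `2.0.0`.

```python
MIN_CLI_VERSION = "2.0.0"


def parse_version(text: str) -> tuple:
    """Turn '2.1.2' into (2, 1, 2) so comparison is numeric, not lexical."""
    return tuple(int(part) for part in text.split("."))


def check_cli_version(cli_version: str, minimum: str = MIN_CLI_VERSION) -> tuple:
    """Return (http_status, message) for a submission from `cli_version`."""
    if parse_version(cli_version) < parse_version(minimum):
        # Illustrative wording; the real rejection message is not quoted in full here.
        return 400, f"CLI {cli_version} is below {minimum}; run pipx upgrade gauntlet-cli"
    return 200, "ok"
```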
Upgrade
pipx upgrade gauntlet-cli
gauntlet --version # → gauntlet 2.1.2
gauntlet doctor # confirm everything is wired
Deferred to a follow-up
- Splitting `gauntlet/dashboard/server.py` (>1200 lines) into per-feature modules. `TODO(refactor)` block at the top of the file names the natural split points. Pure mechanical move; it should land as its own focused PR.
22 files changed · +1,969 / −179 · 540 tests passing
v2.1.1 — lspci fallback + badge rebrand
Features
- Hardware detection: `_detect_gpu_info()` now parses `lspci` output as a fallback for NVIDIA GPU names when `nvidia-smi` isn't on PATH (server installs, stripped-down containers, certain distros). Previously these runs submitted to the community leaderboard with `gpu_name="unknown"`, polluting the data. 5 new parse tests cover bracketed vs codename-only output, multi-GPU selection, AMD-only lines, and empty input.
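For a feel of what the fallback has to handle, here is a hedged sketch of parsing `lspci` text for an NVIDIA name. It is not the project's `_detect_gpu_info()`: the function name and regexes are illustrative, and returning the first NVIDIA match is an assumed multi-GPU rule.

```python
import re
from typing import Optional

# Matches display-class lines such as
# "01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)"
_NVIDIA_LINE = re.compile(
    r"(?:VGA compatible controller|3D controller):\s*NVIDIA Corporation\s+(?P<rest>.+)$"
)


def parse_lspci_gpu_name(output: str) -> Optional[str]:
    """Return the first NVIDIA GPU name in `lspci` output, or None if there is none."""
    for line in output.splitlines():
        match = _NVIDIA_LINE.search(line)
        if not match:
            continue  # AMD, Intel, or non-display devices
        rest = re.sub(r"\s*\(rev [0-9a-f]+\)\s*$", "", match.group("rest")).strip()
        bracket = re.search(r"\[([^\]]+)\]", rest)
        # Prefer the marketing name in brackets; otherwise fall back to the codename.
        return bracket.group(1) if bracket else rest
    return None
```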
API / Embeds
- `/api/badge` rebrand: dropped the A–F / shields.io colour ramp in favour of Gauntlet's actual certification system:

  | Tier | Score | Tests | Colour |
  |---|---|---|---|
  | Gold | ≥ 90 | ≥ 20 | `#c4a05a` |
  | Silver | ≥ 75 | ≥ 10 | `#b0a69c` |
  | Bronze | ≥ 60 | ≥ 5 | `#c87850` |
  | Tested | any | ≥ 1 | `#b08d6e` |
  | no data | - | 0 | `#7a6e62` |

  A 71-score model that previously rendered as hostile red `F (71.1)` now renders as warm copper `71 · Bronze`. Something a model author actually wants in their HuggingFace / GitHub README.
- Badge `viewBox` + `role="img"`: resizing via `<img width="...">` now preserves aspect ratio, and screen readers get a proper label.
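The certification tiers can be read as a simple lookup. The sketch below assumes that a tier needs both its score and its test-count threshold, and that a model missing a tier falls through to the next one down; neither rule is spelled out in these notes, and `badge_tier` is a hypothetical name.

```python
# (tier, minimum score, minimum tests, colour), highest tier first.
TIERS = [
    ("Gold", 90, 20, "#c4a05a"),
    ("Silver", 75, 10, "#b0a69c"),
    ("Bronze", 60, 5, "#c87850"),
]


def badge_tier(score: float, tests: int) -> tuple:
    """Return (tier_name, colour) for a model's score and number of tests."""
    if tests < 1:
        return "no data", "#7a6e62"
    for name, min_score, min_tests, colour in TIERS:
        # Assumption: both thresholds must hold for a tier to apply.
        if score >= min_score and tests >= min_tests:
            return name, colour
    return "Tested", "#b08d6e"
```

Under these assumptions, `badge_tier(71.1, 5)` gives the Bronze colour `#c87850` used in the example above.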
Install
```bash
pip install --upgrade gauntlet-cli
```
Full changelog: v2.1.0...v2.1.1
v2.1.0 — LM Studio + cloud chat + MCP polish
Features
- LM Studio provider (closes #2): first-class support for LM Studio's OpenAI-compatible local server. `gauntlet run --model lmstudio/<name>`, `gauntlet discover` lists loaded models, host configurable via `LMSTUDIO_HOST` env / `gauntlet config --lmstudio-host` / default `http://localhost:1234`. Metadata inferred from model ID.
- Cloud ChatClient wiring: `openai/*`, `anthropic/*`, and `google/*` model specs now run directly (previously `NotImplementedError`). Full frontier sweep typically under $5; Gemini has a free tier.
- MCP server:
  - Self-driving tool instructions with explicit "do NOT shell out" directives so MCP clients (Claude Code, Gemini CLI, Cursor) run the suite without custom prompts.
  - Auto-detects the client app via MCP `clientInfo`.
  - New `gauntlet_status(session_id)` tool for resumability.
Fixes
- Temporal Reasoning probe: the old "Reply with ONLY the name" framing caused some models to loop for minutes when the correct answer was "Neither". Prompt now lists `'Alice' | 'Bob' | 'Neither'` explicitly.
- Leaderboard provider mis-attribution: `collect_fingerprint(..., "ollama")` hardcode fixed; non-Ollama runs now attribute correctly.
Safety
- Agent-invocation guard on `gauntlet run`: refuses to benchmark local models from non-interactive shells (MCP-client subprocess spawns) unless `GAUNTLET_ALLOW_LOCAL=1`. Prevents accidentally loading large local models and tanking the user's machine. Cloud providers + interactive humans unaffected.
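The guard's decision logic can be sketched as a small pure function. `should_refuse_local_run` and `LOCAL_PROVIDERS` are hypothetical names, and the set of providers counted as local is an assumption; only the `GAUNTLET_ALLOW_LOCAL=1` override and the exemptions for cloud providers and interactive users come from the note above.

```python
from typing import Mapping

# Hypothetical set of local providers; the real list is not given in these notes.
LOCAL_PROVIDERS = {"ollama", "lmstudio"}


def should_refuse_local_run(provider: str, interactive: bool, env: Mapping[str, str]) -> bool:
    """True when a local-model run should be blocked for a non-interactive caller."""
    if provider not in LOCAL_PROVIDERS:
        return False  # cloud providers are unaffected
    if interactive:
        return False  # a human at a terminal is unaffected
    return env.get("GAUNTLET_ALLOW_LOCAL") != "1"  # explicit opt-in lifts the guard
```

A caller might pass `sys.stdin.isatty()` as `interactive` and `os.environ` as `env`; how the real CLI detects a non-interactive shell is not stated here.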
Polish
- Host resolution honors config file (env > file > default) for Ollama and LM Studio; persistent `gauntlet config --*-host` flags now actually take effect.
- Error messages and auto-detect include LM Studio alongside Ollama.
- README: new LM Studio and Cloud Baselines sections; updated provider filter tables.
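The env > file > default precedence reduces to a short chain of lookups. In this sketch `resolve_host` and the `lmstudio_host` config key are hypothetical; `LMSTUDIO_HOST` and the `http://localhost:1234` default are taken from the LM Studio notes above.

```python
from typing import Mapping


def resolve_host(env: Mapping[str, str], config: Mapping[str, str],
                 env_var: str, config_key: str, default: str) -> str:
    """Pick the host with precedence: environment variable, then config file, then default."""
    # Empty strings count as unset, so an exported-but-blank variable falls through.
    return env.get(env_var) or config.get(config_key) or default
```

For example, `resolve_host({}, {"lmstudio_host": "http://box:1234"}, "LMSTUDIO_HOST", "lmstudio_host", "http://localhost:1234")` picks the config-file value because the environment variable is unset.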
Tests
- 12 new LM Studio tests (host precedence, parsing, factory, metadata inference).
Install
```bash
pip install --upgrade gauntlet-cli
```
Full changelog: v2.0.3...v2.1.0
v2.0.3: UX Polish & Layer Sensitivity Expansion
What's new
Layer Sensitivity expanded (16 to 25 probes)
New probes across all 5 cognitive categories:
- Syntax: pronoun resolution, word order sensitivity
- Logic: base-rate fallacy detection, double negation resolution
- Spatial: mental rotation, direction tracking after turning
- Pragmatic: understatement recognition, indirect refusal detection
- Factual: additional date/number recall
Degradation API now includes perplexity
/api/degradation response includes perplexity_mean and perplexity_n per quantization level when V2 data is available. Enables the dashboard to overlay perplexity on degradation curves.
Dashboard UX improvements
- Empty tabs fixed: Speed, Quality, and Graph tabs now explain they need `gauntlet compare` instead of showing blank screens
- Category explanations: Sycophancy Gradient, Perplexity Baseline, and Layer Sensitivity cards show contextual subtitles explaining what the score means
- Strength Profile: community leaderboard heatmap replaced with expandable glass cards using the warm site palette
22 new tests (529 total, 0 failures)
Upgrade
pip install --upgrade gauntlet-cli
v2.0.2: Skip server-error probes from scoring
Fix: Ollama 500 errors no longer tank your scores
Server errors (HTTP 500, 502, 503, connection failures) from Ollama or llama.cpp were being scored as behavioral failures (0.0), unfairly penalizing models for infrastructure issues. On memory-constrained machines (M1 8GB), multi-turn probes frequently trigger Ollama OOM kills, dragging scores down even though the model itself is performing correctly.
Now: Server-error probes are marked as "Skipped (server error)" and excluded from the module score entirely. They still appear in the results display (so you can see what happened) but they don't affect the score. If 5/8 temporal probes crash and 3/8 run successfully, the score is based on the 3 that ran.
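The effect on the arithmetic can be shown with a small sketch. It assumes a module score is a plain mean of probe scores, which these notes do not confirm, and uses `None` as a hypothetical marker for a probe skipped on a server error.

```python
from typing import Optional, Sequence


def module_score(probe_scores: Sequence[Optional[float]]) -> Optional[float]:
    """Average only the probes that ran; None marks a probe skipped on a server error."""
    ran = [score for score in probe_scores if score is not None]
    if not ran:
        return None  # every probe was skipped, so there is nothing to score
    return sum(ran) / len(ran)
```

With 5 of 8 probes skipped and the remaining three scoring 1.0, 1.0, and 0.0, the result is 2/3; counting the five crashes as 0.0 would have given 0.25.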
Upgrade
pip install --upgrade gauntlet-cli
v2.0.1: Hotfix
Fix: Full benchmark runs crashing
The ConfidenceCalibration module's score() method was missing two required arguments (high_failures, summary) when constructing ModuleScore. This caused every full behavioral suite run to crash with:
ModuleScore.__init__() missing 2 required positional arguments: 'high_failures' and 'summary'
Quick Tests (health checks) were unaffected.
Upgrade
pip install --upgrade gauntlet-cli
Gauntlet V2.0.0
The "does perplexity predict behavioral degradation?" release.
V2 adds the empirical tools to settle this question with community data instead of speculation.
New Modules
PERPLEXITY_BASELINE
Measures raw token prediction quality on a fixed evaluation corpus using logprobs from Ollama/llama.cpp. NOT factored into TrustScore or GauntletScore. Reported as a standalone metric in community submissions for correlation analysis. Gracefully skips for cloud providers without logprob access.
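For reference, perplexity over a token sequence is the exponential of the negative mean log-probability. The sketch below uses that standard definition and assumes natural-log logprobs; it is not the module's own code, and the function name is illustrative.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-mean(logprobs)), assuming natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model that assigns probability 0.25 to every token has a perplexity of 4, matching the intuition of choosing uniformly among four options.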
LAYER_SENSITIVITY (16 probes, 5 cognitive categories)
Probes cognitive functions mapped to different transformer layer groups:
- Shallow syntax (3 probes): subject-verb agreement, format preservation, grammatical errors
- Factual recall (3 probes): chemical symbols, physical constants, biology
- Multi-step logic (4 probes): chained arithmetic, transitivity, modus tollens, syllogism
- Spatial reasoning (3 probes): relative position, mirror reflection, clock position
- Pragmatic inference (3 probes): sarcasm, implicature, social norms
Per-category score breakdown identifies which cognitive functions degrade first under quantization.
Enhanced Metadata
- quant_method: gguf, gguf_iq (importance-matrix), gptq, awq, exl2, cloud
- quant_source: bartowski, thebloke, mradermacher, turboderp, unsloth, official, community
- Both auto-populated from model name patterns and format metadata
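Name-pattern inference for `quant_method` might look something like the sketch below. The patterns are guesses based on common naming habits (`Q4_K_M`, `IQ3_XXS`, `-GPTQ`, and so on), not the project's actual rules, and returning `None` for an unrecognised name is an assumption.

```python
import re
from typing import Optional


def infer_quant_method(model_name: str, is_cloud: bool = False) -> Optional[str]:
    """Guess a quant_method value from a model name; None when nothing matches."""
    if is_cloud:
        return "cloud"
    name = model_name.lower()
    for tag in ("gptq", "awq", "exl2"):
        if tag in name:
            return tag
    # IQ quants (importance-matrix) are checked before plain Q quants.
    if re.search(r"(?<![a-z0-9])iq\d", name):
        return "gguf_iq"
    if re.search(r"(?<![a-z0-9])q\d", name) or "gguf" in name:
        return "gguf"
    return None
```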
Architecture (never manually sync modules again)
- MODULE_LABELS auto-derived from module registry (no manual dict maintenance)
- VALID_CATEGORIES auto-derived from registry at API startup
- Dashboard auto-generates labels for unknown modules
- Adding a new module: write 1 file + add 1 import
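The registry-derived approach can be illustrated with a decorator-based sketch. All names here (`MODULE_REGISTRY`, `register`, `module_labels`, `valid_categories`) are hypothetical stand-ins, and the label format is an assumption.

```python
from typing import Dict, Type

MODULE_REGISTRY: Dict[str, Type] = {}


def register(cls: Type) -> Type:
    """Class decorator that records a module under its `name` attribute."""
    MODULE_REGISTRY[cls.name] = cls
    return cls


def module_labels() -> Dict[str, str]:
    """Derive display labels from the registry instead of a hand-kept dict."""
    return {name: name.replace("_", " ").title() for name in MODULE_REGISTRY}


def valid_categories() -> frozenset:
    """Categories an API could accept, taken straight from the registry."""
    return frozenset(MODULE_REGISTRY)


@register
class PerplexityBaseline:
    name = "PERPLEXITY_BASELINE"


@register
class LayerSensitivity:
    name = "LAYER_SENSITIVITY"
```

Adding a module then means defining one decorated class; labels and valid categories follow without further edits.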
Dashboard Redesign
Community panel reorganized with clear narrative hierarchy:
- "The Evidence": perplexity vs behavioral score scatter plot
- Rankings: tier-filtered leaderboard
- Quantization Impact: degradation curves + quant method comparison + layer sensitivity (grouped with sub-tabs)
- Hardware Landscape: condensed 3-column grid
- Predict: collaborative filtering
Tests
59 new tests (508 total, 0 failures):
- test_perplexity_baseline.py: perplexity math, module structure, scoring exclusion
- test_layer_sensitivity.py: probe generation, check logic for all 5 categories, scoring breakdown
- test_quant_method.py: field existence, inference logic for GGUF/IQ/GPTQ/AWQ/EXL2, source detection
Install / Upgrade
pip install --upgrade gauntlet-cli
v1.5.1: Dashboard Polish & Fixes
Fixes PyPI publish. Removes self-judge. Fixes community submission pipeline. Adds Spline 404 page. Animation polish on community leaderboard.