Releases: Basaltlabs-app/Gauntlet

v2.1.2 — community submission pipeline + MCP env loading + QOL

04 May 08:00
3c803c1

Why this release matters

Three independent silent-failure paths converged into "I ran benchmarks and nothing showed up." All three are fixed here, plus a substantial quality-of-life pass while we were in the area. First update in two weeks.

Reliability fixes

  • CLI submissions now include hardware_tier + attestation — previously every CLI row landed in Supabase with hardware_tier="", leaving tier-filtered dashboard views (Consumer-Mid, Consumer-High, Cloud) systematically empty.
  • MCP fingerprint is environment-aware. Real collect_fingerprint() on desktop MCP clients (Gemini CLI, Claude Desktop, Cursor); placeholder only on Vercel. Was throwing away real RAM/CPU/GPU data and submitting unknown/0.
  • submit_result rejection logs bumped DEBUG → WARNING with response body included. API rejections were silent before; now you actually see what the server didn't like.
  • CTRL-C-safe submission ordering. Payload is enqueued to ~/.gauntlet/pending/ before the network call; only HTTP 200 deletes the file. Mid-flight kill no longer loses the result.
  • Hardware fingerprint collected once per multi-model run (was rebuilt per model — pointless psutil / sysctl / nvidia-smi calls on every iteration).
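
The CTRL-C-safe ordering described above can be sketched roughly as follows. This is an illustration under assumptions, not Gauntlet's actual code: `submit_with_queue`, the injected `send` callable, and the file naming are hypothetical. Only the `~/.gauntlet/pending/` location and the "delete only on HTTP 200" rule come from the notes.

```python
import json
import uuid
from pathlib import Path
from typing import Callable

# Queue location taken from the release notes; everything else is illustrative.
PENDING_DIR = Path.home() / ".gauntlet" / "pending"


def submit_with_queue(
    payload: dict,
    send: Callable[[dict], int],
    pending_dir: Path = PENDING_DIR,
) -> bool:
    """Persist the payload first, then attempt the network call.

    `send` is a hypothetical callable returning an HTTP status code.
    The queued file is removed only when the server answers 200, so a
    kill (or exception) between the write and the response leaves the
    payload on disk for a later retry.
    """
    pending_dir.mkdir(parents=True, exist_ok=True)
    queued = pending_dir / f"{uuid.uuid4().hex}.json"
    queued.write_text(json.dumps(payload))  # 1. enqueue before touching the network

    status = send(payload)                  # 2. network call (may be interrupted)
    if status == 200:
        queued.unlink()                     # 3. only a confirmed success clears the file
        return True
    return False
```

The point of the ordering is that every failure mode (non-200 response, exception, SIGINT) falls through without reaching `unlink()`, so the default outcome is "still queued".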

New commands & flags

  • gauntlet --version / -V — long overdue.
  • gauntlet doctor — env-var detection, .env discovery, Supabase reachability, hardware preview, recent runs, retry queue. Drains the queue on its way through. Designed to answer "why didn't my last run show up?" in five seconds.
  • gauntlet history — Rich table of past local runs with --limit N and --model <substr> filtering.
  • --no-submit flag and GAUNTLET_PRIVATE=1 — opt out of community submission for iteration runs.
  • MCP startup banner prints Community push: ENABLED/DISABLED to stderr so misconfiguration surfaces upfront, not after a 20-minute run.

New plumbing

  • dotenv-style env loader at package import. Search order: $GAUNTLET_ENV_FILE → .env.vercel.local / .env.local / .env (walking up to the repo root) → ~/.gauntlet/.env. Allowlist of 10 keys; never overwrites parent-process env. Fixes the Gemini-CLI / Claude-Desktop subprocess env-blindness on macOS.
  • Persistent retry queue at ~/.gauntlet/pending/. Failed submissions queue to disk and replay on the next CLI invocation. Permanent 4xx rejections drop after one warning. Capped at 200 files. Self-healing without user intervention.
  • MCP write path now goes through the same validator as /api/submit. New gauntlet/mcp/submit_validator.py enforces score-range, model-name length, category sanity, score-consistency, probe-count, attestation shape, probe-details size caps, and dedup. Closes the gap where /mcp accepted unvalidated payloads.
  • /api/version endpoint returns {server, min_supported, recommended, latest, upgrade_command, upgrade_url, release_notes}.
  • GAUNTLET_API_URL env override for self-hosting / staging.
  • Auto-update notification on benchmark runs.
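
The two loader guarantees mentioned above (an allowlist, and never overwriting the parent-process environment) could look something like the sketch below. The function names are hypothetical, and the allowlist here is a two-key stand-in: the notes say there are 10 keys but do not list them.

```python
import os
from pathlib import Path

# Hypothetical allowlist: the release notes mention 10 keys without naming them.
ALLOWED_KEYS = frozenset({"GAUNTLET_API_URL", "GAUNTLET_PRIVATE"})


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env_files(paths, environ=os.environ, allowed=ALLOWED_KEYS) -> list:
    """Apply files in priority order and return the keys that were set.

    A key is written only if it is allowlisted and not already present,
    so values inherited from the parent process (or set by an earlier,
    higher-priority file) always win.
    """
    applied = []
    for path in paths:
        path = Path(path)
        if not path.is_file():
            continue
        for key, value in parse_env_text(path.read_text()).items():
            if key in allowed and key not in environ:
                environ[key] = value
                applied.append(key)
    return applied
```

Passing the candidate files in search order makes precedence fall out of the "not already present" check, with no separate merge step.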

Performance

  • Read endpoint cache TTLs: max-age=30, s-maxage=60 → max-age=300, s-maxage=900, stale-while-revalidate=86400.
  • /api/predict + /api/recommend share a 60s in-memory matrix cache with stale-fallback. Was fetching 2000 rows per request; now once per minute. Serves stale on Supabase outage instead of 503.
  • Dashboard polling pauses when the tab is hidden, refetches on focus. Background tabs no longer churn Supabase + Vercel functions.
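
A 60-second in-memory cache with stale fallback, as described for /api/predict and /api/recommend, might be structured like this. It is a minimal single-threaded sketch: the class name, the injected `fetch` callable, and the injectable clock are assumptions made for illustration.

```python
import time


class StaleFallbackCache:
    """Cache one value for `ttl_seconds`; serve the old value if a refresh fails."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # hypothetical loader, e.g. a Supabase query
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the behaviour is testable
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        fresh = self._fetched_at is not None and now - self._fetched_at < self._ttl
        if fresh:
            return self._value
        try:
            self._value = self._fetch()
            self._fetched_at = now
        except Exception:
            if self._fetched_at is None:
                raise  # nothing cached yet, so there is no stale copy to serve
            # Refresh failed: fall through and return the last good value.
        return self._value
```

Because `_fetched_at` is not advanced on failure, the next request retries the fetch rather than treating the stale value as fresh for another minute.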

Code health

  • HMAC submit key deduplicated to gauntlet.core.config.get_submit_key(). gauntlet.basaltlabs.app URL also deduplicated; 5 hardcoded callsites consolidated.
  • submit_handler refactored to use the shared validator. Single source of truth across /api/submit and the MCP write path.
  • str(e) leakage plugged in three API handlers.
  • MIN_CLI_VERSION bumped 1.3.5 → 2.0.0. Anything below pre-dates the fingerprint + attestation fields and was being silently accepted with hardware_tier="".

Tests

  • 42 new tests covering env-loader behavior, save-status formatting, and every validator rule.
  • Total: 540 passing, 0 failures.

Docs

  • New README section: Running gauntlet as a local MCP server. Per-client config blocks for Gemini CLI / Claude Desktop / Cursor, the macOS Dock-launch gotcha, and the GAUNTLET_ENV_FILE override for CI.
  • Clarified MCP data-quality language: desktop MCP runs now produce real fingerprints; the "serverless" caveat applies only to the hosted endpoint.

Breaking change

MIN_CLI_VERSION is now 2.0.0. Submissions from CLI versions below 2.0.0 will be rejected with HTTP 400 and a clear "run pipx upgrade gauntlet-cli" message. If you've been pinned, this is the moment to upgrade.
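
For readers self-hosting the API, a minimum-version gate of this kind can be sketched as below. `parse_version` and `check_cli_version` are hypothetical names and the message text is only modelled on the notes; the sketch handles plain `X.Y.Z` versions and ignores pre-release suffixes.

```python
MIN_CLI_VERSION = "2.0.0"  # value stated in the release notes


def parse_version(version: str) -> tuple:
    """Turn '2.1.2' into (2, 1, 2) so comparison is numeric, not lexical."""
    return tuple(int(part) for part in version.strip().split("."))


def check_cli_version(cli_version: str, minimum: str = MIN_CLI_VERSION) -> tuple:
    """Return (http_status, message) for a submitting client version."""
    if parse_version(cli_version) < parse_version(minimum):
        return 400, (
            f"gauntlet-cli {cli_version} is below the minimum {minimum}: "
            "run pipx upgrade gauntlet-cli"
        )
    return 200, "ok"
```

Comparing integer tuples matters here: as strings, "1.10.0" would sort before "1.9.0".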

Upgrade

pipx upgrade gauntlet-cli
gauntlet --version   # → gauntlet 2.1.2
gauntlet doctor      # confirm everything is wired

Deferred to a follow-up

  • Splitting gauntlet/dashboard/server.py (>1200 lines) into per-feature modules. TODO(refactor) block at the top of the file names the natural split points. Pure mechanical move — should land as its own focused PR.

22 files changed · +1,969 / −179 · 540 tests passing

v2.1.1 — lspci fallback + badge rebrand

21 Apr 06:36

Features

  • Hardware detection: _detect_gpu_info() now parses lspci output as a fallback for NVIDIA GPU names when nvidia-smi isn't on PATH (server installs, stripped-down containers, certain distros). Previously these runs submitted to the community leaderboard with gpu_name="unknown", polluting the data. 5 new parse tests cover bracketed vs codename-only output, multi-GPU selection, AMD-only lines, and empty input.
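
To make the bracketed vs codename-only distinction concrete, here is a rough sketch of parsing lspci text for an NVIDIA GPU name. It is not the project's `_detect_gpu_info()`: the function name is hypothetical, and returning the first NVIDIA match is an assumption, since the notes do not say how multi-GPU selection works.

```python
import re
from typing import Optional

# PCI device classes under which lspci usually lists GPUs.
_GPU_CLASS = re.compile(
    r"VGA compatible controller|3D controller|Display controller", re.I
)
_VENDOR = "NVIDIA Corporation"


def parse_lspci_nvidia(output: str) -> Optional[str]:
    """Return a GPU name from lspci text, or None if no NVIDIA GPU is listed."""
    for line in output.splitlines():
        if not _GPU_CLASS.search(line) or _VENDOR not in line:
            continue
        desc = line.split(_VENDOR, 1)[1]
        desc = re.sub(r"\(rev [0-9a-fA-F]+\)\s*$", "", desc).strip()
        bracketed = re.search(r"\[([^\]]+)\]", desc)
        if bracketed:
            return bracketed.group(1)   # e.g. "GeForce RTX 3090"
        if desc:
            return desc                 # codename-only, e.g. "GA100"
    return None
```

AMD-only and empty inputs fall through the loop and yield `None`, which a caller could map to the existing "unknown" value.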

API / Embeds

  • /api/badge rebrand — dropped the A–F / shields.io colour ramp in favour of Gauntlet's actual certification system:

    | Tier | Score | Tests | Colour |
    |------|-------|-------|--------|
    | Gold | ≥ 90 | ≥ 20 | #c4a05a |
    | Silver | ≥ 75 | ≥ 10 | #b0a69c |
    | Bronze | ≥ 60 | ≥ 5 | #c87850 |
    | Tested | any | ≥ 1 | #b08d6e |
    | no data | | 0 | #7a6e62 |

    A 71-score model that previously rendered as hostile red F (71.1) now renders as warm copper 71 · Bronze. Something a model author actually wants in their HuggingFace / GitHub README.

  • Badge viewBox + role="img": resizing via <img width="..."> now preserves aspect ratio, and screen readers get a proper label.
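
The certification tiers in the table above translate into a small lookup. This sketch assumes that both the score and the test-count threshold must be met for a tier, which the table implies but does not state; `badge_tier` is a hypothetical name.

```python
# (tier, minimum score, minimum tests, colour), highest tier first.
TIERS = [
    ("Gold", 90, 20, "#c4a05a"),
    ("Silver", 75, 10, "#b0a69c"),
    ("Bronze", 60, 5, "#c87850"),
]


def badge_tier(score: float, tests: int) -> tuple:
    """Return (tier, colour) for a model's score and number of tests."""
    if tests < 1:
        return ("no data", "#7a6e62")
    for name, min_score, min_tests, colour in TIERS:
        if score >= min_score and tests >= min_tests:
            return (name, colour)
    return ("Tested", "#b08d6e")
```

Under that assumption, a model scoring 71.1 across at least 5 tests resolves to Bronze, which matches the example in the notes.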

Install

```bash
pip install --upgrade gauntlet-cli
```

Full changelog: v2.1.0...v2.1.1

v2.1.0 — LM Studio + cloud chat + MCP polish

20 Apr 10:57

Features

  • LM Studio provider (closes #2) — first-class support for LM Studio's OpenAI-compatible local server. gauntlet run --model lmstudio/<name>, gauntlet discover lists loaded models, host configurable via LMSTUDIO_HOST env / gauntlet config --lmstudio-host / default http://localhost:1234. Metadata inferred from model ID.
  • Cloud ChatClient wiring: openai/*, anthropic/*, and google/* model specs now run directly (previously raised NotImplementedError). Full frontier sweep typically under $5; Gemini has a free tier.
  • MCP server:
    • Self-driving tool instructions with explicit "do NOT shell out" directives so MCP clients (Claude Code, Gemini CLI, Cursor) run the suite without custom prompts.
    • Auto-detects the client app via MCP clientInfo.
    • New gauntlet_status(session_id) tool for resumability.

Fixes

  • Temporal Reasoning probe — the old "Reply with ONLY the name" framing caused some models to loop for minutes when the correct answer was "Neither". Prompt now lists 'Alice' | 'Bob' | 'Neither' explicitly.
  • Leaderboard provider mis-attribution: the hardcoded collect_fingerprint(..., "ollama") call is fixed; non-Ollama runs now attribute correctly.

Safety

  • Agent-invocation guard on gauntlet run — refuses to benchmark local models from non-interactive shells (MCP-client subprocess spawns) unless GAUNTLET_ALLOW_LOCAL=1. Prevents accidentally loading large local models and tanking the user's machine. Cloud providers + interactive humans unaffected.
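
The decision behind the agent-invocation guard can be sketched as a pure function. How Gauntlet actually detects a non-interactive shell is not stated in the notes; using `sys.stdin.isatty()` is an assumption, as are the function name and the prefix list (taken from the cloud specs listed under Features).

```python
import os
import sys

# Cloud model-spec prefixes named in this release; treated as exempt from the guard.
CLOUD_PREFIXES = ("openai/", "anthropic/", "google/")


def local_run_allowed(model_spec: str, stdin_is_tty: bool, environ) -> bool:
    """Decide whether `gauntlet run` may benchmark this model.

    Cloud providers and interactive sessions always pass. A local model
    in a non-interactive shell needs the explicit GAUNTLET_ALLOW_LOCAL=1
    opt-in.
    """
    if model_spec.startswith(CLOUD_PREFIXES):
        return True
    if stdin_is_tty:
        return True
    return environ.get("GAUNTLET_ALLOW_LOCAL") == "1"


if __name__ == "__main__":
    # Example call with real process state ("lmstudio/example" is a placeholder).
    print(local_run_allowed("lmstudio/example", sys.stdin.isatty(), os.environ))
```

Taking the TTY flag and environment as parameters keeps the rule testable without spawning subprocesses.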

Polish

  • Host resolution honors config file (env > file > default) for Ollama and LM Studio; persistent gauntlet config --*-host flags now actually take effect.
  • Error messages and auto-detect include LM Studio alongside Ollama.
  • README: new LM Studio and Cloud Baselines sections; updated provider filter tables.
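
The env > file > default precedence for host resolution is easy to get subtly wrong, so here is a minimal sketch of it. `resolve_host` and the `lmstudio_host` config key are hypothetical; `LMSTUDIO_HOST` and the default URL come from the notes above.

```python
import os

DEFAULT_LMSTUDIO_HOST = "http://localhost:1234"  # default stated in the notes


def resolve_host(env_var: str, config: dict, config_key: str, default: str,
                 environ=os.environ) -> str:
    """Return the host using env > config file > default precedence."""
    from_env = environ.get(env_var)
    if from_env:
        return from_env
    from_file = config.get(config_key)
    if from_file:
        return from_file
    return default
```

For example, `resolve_host("LMSTUDIO_HOST", loaded_config, "lmstudio_host", DEFAULT_LMSTUDIO_HOST)` would return the persisted value only when the environment variable is unset or empty.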

Tests

  • 12 new LM Studio tests (host precedence, parsing, factory, metadata inference).

Install

```bash
pip install --upgrade gauntlet-cli
```

Full changelog: v2.0.3...v2.1.0

v2.0.3: UX Polish & Layer Sensitivity Expansion

17 Apr 17:22

What's new

Layer Sensitivity expanded (16 to 25 probes)

New probes across all 5 cognitive categories:

  • Syntax: pronoun resolution, word order sensitivity
  • Logic: base-rate fallacy detection, double negation resolution
  • Spatial: mental rotation, direction tracking after turning
  • Pragmatic: understatement recognition, indirect refusal detection
  • Factual: additional date/number recall

Degradation API now includes perplexity

/api/degradation response includes perplexity_mean and perplexity_n per quantization level when V2 data is available. Enables the dashboard to overlay perplexity on degradation curves.

Dashboard UX improvements

  • Empty tabs fixed: Speed, Quality, and Graph tabs now explain they need gauntlet compare instead of showing blank screens
  • Category explanations: Sycophancy Gradient, Perplexity Baseline, and Layer Sensitivity cards show contextual subtitles explaining what the score means
  • Strength Profile: community leaderboard heatmap replaced with expandable glass cards using the warm site palette

22 new tests (529 total, 0 failures)

Upgrade

pip install --upgrade gauntlet-cli

v2.0.2: Skip server-error probes from scoring

17 Apr 15:41

Fix: Ollama 500 errors no longer tank your scores

Server errors (HTTP 500, 502, 503, connection failures) from Ollama or llama.cpp were being scored as behavioral failures (0.0), unfairly penalizing models for infrastructure issues. On memory-constrained machines (M1 8GB), multi-turn probes frequently trigger Ollama OOM kills, dragging scores down even though the model itself is performing correctly.

Now: Server-error probes are marked as "Skipped (server error)" and excluded from the module score entirely. They still appear in the results display (so you can see what happened) but they don't affect the score. If 5/8 temporal probes crash and 3/8 run successfully, the score is based on the 3 that ran.
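
The scoring change can be illustrated with a small sketch. It assumes the module score is a plain mean of probe scores, which may not match Gauntlet's real weighting; `ProbeResult` and `module_score` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    name: str
    score: float = 0.0
    server_error: bool = False  # True for HTTP 5xx / connection failures


def module_score(results: List[ProbeResult]) -> Optional[float]:
    """Average only the probes that actually ran.

    Server-error probes stay in `results` for display but are left out
    of the denominator. Returns None when nothing ran.
    """
    scored = [r.score for r in results if not r.server_error]
    if not scored:
        return None
    return sum(scored) / len(scored)
```

With 5 of 8 probes crashed and the remaining three scoring 1.0, 1.0 and 0.5, this gives 2.5 / 3 rather than the 2.5 / 8 the old behaviour would have produced.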

Upgrade

pip install --upgrade gauntlet-cli

v2.0.1: Hotfix

17 Apr 09:32

Fix: Full benchmark runs crashing

The ConfidenceCalibration module's score() method was missing two required arguments (high_failures, summary) when constructing ModuleScore. This caused every full behavioral suite run to crash with:

ModuleScore.__init__() missing 2 required positional arguments: 'high_failures' and 'summary'

Quick Tests (health checks) were unaffected.

Upgrade

pip install --upgrade gauntlet-cli

Gauntlet V2.0.0

17 Apr 04:28

The "does perplexity predict behavioral degradation?" release.

V2 adds the empirical tools to settle this question with community data instead of speculation.

New Modules

PERPLEXITY_BASELINE
Measures raw token prediction quality on a fixed evaluation corpus using logprobs from Ollama/llama.cpp. NOT factored into TrustScore or GauntletScore. Reported as a standalone metric in community submissions for correlation analysis. Gracefully skips for cloud providers without logprob access.
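
For reference, perplexity over a token sequence is the exponential of the negative mean log-probability. The sketch below uses that standard definition and assumes natural-log logprobs; it does not reproduce the module's corpus handling or provider calls.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Compute exp(-mean(logprob)) over per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)
```

A model that assigns probability 0.25 to every token has a perplexity of about 4: it is as uncertain as a uniform choice among four options.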

LAYER_SENSITIVITY (16 probes, 5 cognitive categories)
Probes cognitive functions mapped to different transformer layer groups:

  • Shallow syntax (3 probes): subject-verb agreement, format preservation, grammatical errors
  • Factual recall (3 probes): chemical symbols, physical constants, biology
  • Multi-step logic (4 probes): chained arithmetic, transitivity, modus tollens, syllogism
  • Spatial reasoning (3 probes): relative position, mirror reflection, clock position
  • Pragmatic inference (3 probes): sarcasm, implicature, social norms

Per-category score breakdown identifies which cognitive functions degrade first under quantization.

Enhanced Metadata

  • quant_method: gguf, gguf_iq (importance-matrix), gptq, awq, exl2, cloud
  • quant_source: bartowski, thebloke, mradermacher, turboderp, unsloth, official, community
  • Both auto-populated from model name patterns and format metadata
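
Name-pattern inference of quant_method might look roughly like this. The patterns are guesses based on common community naming (for example Q4_K_M, IQ4_XS, a GPTQ or AWQ suffix), not Gauntlet's actual rules, and the "unknown" fallback is a hypothetical value that is not in the list above.

```python
import re

# Cloud model-spec prefixes; assumed to map to the "cloud" method.
CLOUD_PREFIXES = ("openai/", "anthropic/", "google/")


def infer_quant_method(model_name: str) -> str:
    """Guess the quantization method from a model name (illustrative patterns only)."""
    name = model_name.lower()
    if name.startswith(CLOUD_PREFIXES):
        return "cloud"
    for marker in ("gptq", "awq", "exl2"):
        if marker in name:
            return marker
    # Importance-matrix GGUF quants are conventionally tagged IQ<n>_...
    if re.search(r"(?<![a-z0-9])iq\d", name):
        return "gguf_iq"
    # Regular GGUF quants are tagged Q<n>_..., or the file ends in .gguf
    if re.search(r"(?<![a-z0-9])q\d", name) or "gguf" in name:
        return "gguf"
    return "unknown"  # hypothetical fallback, not one of the documented values
```

The IQ check has to come before the plain Q check, otherwise names that also end in .gguf would be classified as ordinary GGUF.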

Architecture (never manually sync modules again)

  • MODULE_LABELS auto-derived from module registry (no manual dict maintenance)
  • VALID_CATEGORIES auto-derived from registry at API startup
  • Dashboard auto-generates labels for unknown modules
  • Adding a new module: write 1 file + add 1 import
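
The "write 1 file + add 1 import" workflow suggests a registry populated at import time. A minimal sketch of that pattern follows; the decorator, class names, and label rule are assumptions, while the MODULE_LABELS and VALID_CATEGORIES names and the two module identifiers come from the notes.

```python
MODULE_REGISTRY = {}


def register(cls):
    """Class decorator: importing a module file is enough to register it."""
    MODULE_REGISTRY[cls.name] = cls
    return cls


def default_label(name: str) -> str:
    """Derive a display label, e.g. 'LAYER_SENSITIVITY' -> 'Layer Sensitivity'."""
    return name.replace("_", " ").title()


@register
class PerplexityBaseline:
    name = "PERPLEXITY_BASELINE"


@register
class LayerSensitivity:
    name = "LAYER_SENSITIVITY"


# Derived views: nothing here needs editing when a module is added.
MODULE_LABELS = {
    name: getattr(cls, "label", default_label(name))
    for name, cls in MODULE_REGISTRY.items()
}
VALID_CATEGORIES = frozenset(MODULE_REGISTRY)
```

The same `default_label` rule could serve as the dashboard's fallback for modules it has never seen, so labels stay consistent on both sides.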

Dashboard Redesign

Community panel reorganized with clear narrative hierarchy:

  1. "The Evidence": perplexity vs behavioral score scatter plot
  2. Rankings: tier-filtered leaderboard
  3. Quantization Impact: degradation curves + quant method comparison + layer sensitivity (grouped with sub-tabs)
  4. Hardware Landscape: condensed 3-column grid
  5. Predict: collaborative filtering

Tests

59 new tests (508 total, 0 failures):

  • test_perplexity_baseline.py: perplexity math, module structure, scoring exclusion
  • test_layer_sensitivity.py: probe generation, check logic for all 5 categories, scoring breakdown
  • test_quant_method.py: field existence, inference logic for GGUF/IQ/GPTQ/AWQ/EXL2, source detection

Install / Upgrade

pip install --upgrade gauntlet-cli

v1.5.1 — Dashboard Polish & Fixes

13 Apr 22:18

Choose a tag to compare

Fixes PyPI publish. Removes self-judge. Fixes community submission pipeline. Adds Spline 404 page. Animation polish on community leaderboard.