Releases · Basaltlabs-app/Gauntlet · GitHub

04 May 08:00

v2.1.2

3c803c1

v2.1.2 — community submission pipeline + MCP env loading + QOL

Why this release matters

Three independent silent-failure paths converged into "I ran benchmarks and nothing showed up." All three are fixed here, plus a substantial quality-of-life pass while we were in the area. First update in two weeks.

Reliability fixes

CLI submissions now include hardware_tier + attestation — previously every CLI row landed in Supabase with hardware_tier="", leaving tier-filtered dashboard views (Consumer-Mid, Consumer-High, Cloud) systematically empty.
MCP fingerprint is environment-aware. Desktop MCP clients (Gemini CLI, Claude Desktop, Cursor) now run the real collect_fingerprint(); the placeholder is used only on Vercel. Previously, desktop runs discarded real RAM/CPU/GPU data and submitted unknown/0.
submit_result rejection logs bumped DEBUG → WARNING with response body included. API rejections were silent before; now you actually see what the server didn't like.
CTRL-C-safe submission ordering. Payload is enqueued to ~/.gauntlet/pending/ before the network call; only HTTP 200 deletes the file. Mid-flight kill no longer loses the result.
Hardware fingerprint collected once per multi-model run (was rebuilt per model — pointless psutil / sysctl / nvidia-smi calls on every iteration).
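
The enqueue-before-send ordering described above can be sketched roughly as follows. This is a minimal illustration, not Gauntlet's actual code: `submit_with_queue`, the `send` callable, and the file naming are all hypothetical, and the sketch leaves out the 4xx-drop rule and the 200-file cap.

```python
import json
import uuid
from pathlib import Path
from typing import Callable


def submit_with_queue(payload: dict, send: Callable[[dict], int], pending_dir: Path) -> bool:
    """Persist the payload first, then attempt delivery.

    `send` is a hypothetical callable that performs the HTTP request and
    returns the status code.
    """
    pending_dir.mkdir(parents=True, exist_ok=True)
    queued = pending_dir / f"{uuid.uuid4().hex}.json"
    queued.write_text(json.dumps(payload))  # on disk before any network I/O

    try:
        status = send(payload)
    except OSError:
        return False  # network failure: the file stays queued for a later retry

    if status == 200:
        queued.unlink()  # only a confirmed 200 removes the file
        return True
    return False
```

Because a KeyboardInterrupt raised inside `send` is not caught here, the queued file is still on disk when the process dies, which is the property the release note describes.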

New commands & flags

gauntlet --version / -V — long overdue.
gauntlet doctor — env-var detection, .env discovery, Supabase reachability, hardware preview, recent runs, retry queue. Drains the queue on its way through. Designed to answer "why didn't my last run show up?" in five seconds.
gauntlet history — Rich table of past local runs with --limit N and --model <substr> filtering.
--no-submit flag and GAUNTLET_PRIVATE=1 — opt out of community submission for iteration runs.
MCP startup banner prints Community push: ENABLED/DISABLED to stderr so misconfiguration surfaces upfront, not after a 20-minute run.
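
As a rough sketch of how the opt-out and the startup banner might fit together: the function names below are hypothetical, and treating only the literal value "1" as an opt-out is an assumption based on the `GAUNTLET_PRIVATE=1` wording above.

```python
import os
import sys
from typing import Mapping, Optional


def community_push_enabled(no_submit: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False when either opt-out is active (hypothetical helper)."""
    env = os.environ if env is None else env
    if no_submit:  # corresponds to the --no-submit flag
        return False
    return env.get("GAUNTLET_PRIVATE") != "1"


def print_banner(enabled: bool) -> None:
    """Write the status line to stderr so it never mixes with stdout output."""
    state = "ENABLED" if enabled else "DISABLED"
    print(f"Community push: {state}", file=sys.stderr)
```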

New plumbing

dotenv-style env loader at package import. Searches $GAUNTLET_ENV_FILE → .env.vercel.local/.env.local/.env walking up to repo root → ~/.gauntlet/.env. Allowlist of 10 keys; never overwrites parent-process env. Fixes the Gemini-CLI / Claude-Desktop subprocess env-blindness on macOS.
Persistent retry queue at ~/.gauntlet/pending/. Failed submissions queue to disk and replay on the next CLI invocation. Permanent 4xx rejections drop after one warning. Capped at 200 files. Self-healing without user intervention.
MCP write path now goes through the same validator as /api/submit. New gauntlet/mcp/submit_validator.py enforces score-range, model-name length, category sanity, score-consistency, probe-count, attestation shape, probe-details size caps, and dedup. Closes the gap where /mcp accepted unvalidated payloads.
/api/version endpoint returns {server, min_supported, recommended, latest, upgrade_command, upgrade_url, release_notes}.
GAUNTLET_API_URL env override for self-hosting / staging.
Auto-update notification on benchmark runs.
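
The allowlist and never-overwrite rules of the env loader could look something like the sketch below. It is illustrative only: the three keys in `ALLOWED_KEYS` are placeholders (the release says the allowlist has 10 keys but does not enumerate them), the parser handles only simple KEY=VALUE lines, and the file search order is not reproduced.

```python
from pathlib import Path
from typing import Dict, Iterable, MutableMapping

# Hypothetical allowlist; the real list of 10 keys is not given in the notes.
ALLOWED_KEYS = frozenset({"GAUNTLET_API_URL", "GAUNTLET_PRIVATE", "LMSTUDIO_HOST"})


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def apply_env_file(
    path: Path,
    env: MutableMapping[str, str],
    allowed: Iterable[str] = ALLOWED_KEYS,
) -> None:
    """Copy allowlisted keys into env without overwriting existing values."""
    allowed_set = set(allowed)
    for key, value in parse_env_text(path.read_text()).items():
        if key in allowed_set and key not in env:
            env[key] = value
```

Passing `os.environ` as `env` would give the behaviour described above: values already set by the parent process win over anything found in a file.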

Performance

Read endpoint cache TTLs: max-age=30, s-maxage=60 → max-age=300, s-maxage=900, stale-while-revalidate=86400.
/api/predict + /api/recommend share a 60s in-memory matrix cache with stale-fallback. Previously each request fetched 2000 rows; now the matrix is fetched at most once per minute. On a Supabase outage the endpoints serve the stale copy instead of returning 503.
Dashboard polling pauses when the tab is hidden, refetches on focus. Background tabs no longer churn Supabase + Vercel functions.
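
A cache with stale-fallback, as described for /api/predict and /api/recommend, might be sketched like this. The class name and the injectable `clock` are assumptions made for illustration; the real server code is not shown in these notes.

```python
import time
from typing import Any, Callable, Optional


class StaleFallbackCache:
    """Cache one value for ttl_seconds; serve the old value if a refresh fails."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return self._value  # still fresh, no upstream call
        try:
            self._value = self._fetch()
            self._fetched_at = now
        except Exception:
            if self._fetched_at is None:
                raise  # nothing cached yet, so the error has to surface
            # otherwise fall through and serve the stale copy
        return self._value
```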

Code health

HMAC submit key deduplicated to gauntlet.core.config.get_submit_key(). gauntlet.basaltlabs.app URL also deduplicated; 5 hardcoded callsites consolidated.
submit_handler refactored to use the shared validator. Single source of truth across /api/submit and the MCP write path.
str(e) leakage plugged in three API handlers.
MIN_CLI_VERSION bumped 1.3.5 → 2.0.0. Anything below pre-dates the fingerprint + attestation fields and was being silently accepted with hardware_tier="".

Tests

42 new tests covering env-loader behavior, save-status formatting, and every validator rule.
Total: 540 passing, 0 failures.

Docs

New README section: Running gauntlet as a local MCP server. Per-client config blocks for Gemini CLI / Claude Desktop / Cursor, the macOS Dock-launch gotcha, and the GAUNTLET_ENV_FILE override for CI.
Clarified MCP data-quality language: desktop MCP runs now produce real fingerprints; the "serverless" caveat applies only to the hosted endpoint.

Breaking change

MIN_CLI_VERSION is now 2.0.0. Submissions from CLI versions below 2.0.0 will be rejected with HTTP 400 and a clear "run pipx upgrade gauntlet-cli" message. If you've been pinned, this is the moment to upgrade.
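
A minimal sketch of the version gate, assuming plain dotted numeric versions. The function names and the exact rejection message are hypothetical; only the 2.0.0 threshold, the HTTP 400 status, and the upgrade command come from the notes above.

```python
from typing import Tuple

MIN_CLI_VERSION = "2.0.0"


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.1.2' into (2, 1, 2) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def check_cli_version(cli_version: str, minimum: str = MIN_CLI_VERSION) -> Tuple[int, str]:
    """Return an (HTTP status, message) pair for a submitting CLI version."""
    if parse_version(cli_version) < parse_version(minimum):
        return 400, "CLI too old: run pipx upgrade gauntlet-cli"
    return 200, "ok"
```

Comparing integer tuples avoids the string-comparison trap where "1.10.0" would sort before "1.9.0".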

Upgrade

pipx upgrade gauntlet-cli
gauntlet --version   # → gauntlet 2.1.2
gauntlet doctor      # confirm everything is wired

Deferred to a follow-up

Splitting gauntlet/dashboard/server.py (>1200 lines) into per-feature modules. TODO(refactor) block at the top of the file names the natural split points. Pure mechanical move — should land as its own focused PR.

22 files changed · +1,969 / −179 · 540 tests passing

21 Apr 06:36

Basaltlabs-app

v2.1.1

c960337

v2.1.1 — lspci fallback + badge rebrand

Features

Hardware detection — _detect_gpu_info() now parses lspci output as a fallback for NVIDIA GPU names when nvidia-smi isn't on PATH (server installs, stripped-down containers, certain distros). Previously these runs submitted to the community leaderboard with gpu_name="unknown", polluting the data. 5 new parse tests cover bracketed vs codename-only output, multi-GPU selection, AMD-only lines, and empty input.
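
To illustrate the kind of parsing involved, here is a small sketch that extracts an NVIDIA GPU name from lspci-style lines. It is not the code in `_detect_gpu_info()`: the function name, the regular expressions, and the "first NVIDIA GPU wins" rule for multi-GPU hosts are all assumptions.

```python
import re
from typing import Optional

# Device classes lspci uses for GPUs; NVIDIA audio functions are skipped.
_GPU_CLASSES = ("VGA compatible controller", "3D controller", "Display controller")
# Marketing name in brackets, e.g. "GA102 [GeForce RTX 3090]".
_BRACKETED = re.compile(r"NVIDIA Corporation\s+[^\[]*\[([^\]]+)\]")
# Codename only, e.g. "GA100 (rev a1)"; the revision suffix is optional.
_CODENAME = re.compile(r"NVIDIA Corporation\s+(.+?)(?:\s+\(rev [0-9a-f]+\))?$")


def parse_nvidia_gpu_name(lspci_output: str) -> Optional[str]:
    """Return the first NVIDIA GPU name found, or None if there is none."""
    for raw in lspci_output.splitlines():
        line = raw.strip()
        if "NVIDIA" not in line or not any(cls in line for cls in _GPU_CLASSES):
            continue
        match = _BRACKETED.search(line) or _CODENAME.search(line)
        if match:
            return match.group(1).strip()
    return None
```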

API / Embeds

/api/badge rebrand — dropped the A–F / shields.io colour ramp in favour of Gauntlet's actual certification system:

| Tier | Score | Tests | Colour |
| --- | --- | --- | --- |
| Gold | ≥ 90 | ≥ 20 | `#c4a05a` |
| Silver | ≥ 75 | ≥ 10 | `#b0a69c` |
| Bronze | ≥ 60 | ≥ 5 | `#c87850` |
| Tested | any | ≥ 1 | `#b08d6e` |
| no data | n/a | 0 | `#7a6e62` |

A 71-score model that previously rendered as hostile red F (71.1) now renders as warm copper 71 · Bronze. Something a model author actually wants in their HuggingFace / GitHub README.
Badge viewBox + role="img": resizing via <img width="..."> now preserves aspect ratio, and screen readers get a proper label.
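
The tier thresholds can be read as a simple top-down lookup. The sketch below is one plausible reading, not the endpoint's actual code: `badge_for` is a hypothetical name, and the fall-through behaviour (a high score with too few tests drops to the next tier it qualifies for) is an assumption.

```python
from typing import NamedTuple, Optional


class Badge(NamedTuple):
    tier: str
    colour: str


# (tier, minimum score, minimum test count, colour), checked from best to worst.
_TIERS = (
    ("Gold", 90.0, 20, "#c4a05a"),
    ("Silver", 75.0, 10, "#b0a69c"),
    ("Bronze", 60.0, 5, "#c87850"),
)


def badge_for(score: Optional[float], tests: int) -> Badge:
    """Map a score and test count to a certification tier and colour."""
    if score is None or tests < 1:
        return Badge("no data", "#7a6e62")
    for tier, min_score, min_tests, colour in _TIERS:
        if score >= min_score and tests >= min_tests:
            return Badge(tier, colour)
    return Badge("Tested", "#b08d6e")
```

With these rules, `badge_for(71.1, 5)` yields the Bronze badge from the example above.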

Install

```bash
pip install --upgrade gauntlet-cli
```

Full changelog: v2.1.0...v2.1.1

20 Apr 10:57

Basaltlabs-app

v2.1.0

bdedeea

v2.1.0 — LM Studio + cloud chat + MCP polish

Features

LM Studio provider (closes #2) — first-class support for LM Studio's OpenAI-compatible local server. gauntlet run --model lmstudio/<name>, gauntlet discover lists loaded models, host configurable via LMSTUDIO_HOST env / gauntlet config --lmstudio-host / default http://localhost:1234. Metadata inferred from model ID.
Cloud ChatClient wiring — openai/*, anthropic/*, and google/* model specs now run directly (previously NotImplementedError). Full frontier sweep typically under $5; Gemini has a free tier.
MCP server:
- Self-driving tool instructions with explicit "do NOT shell out" directives so MCP clients (Claude Code, Gemini CLI, Cursor) run the suite without custom prompts.
- Auto-detects the client app via MCP clientInfo.
- New gauntlet_status(session_id) tool for resumability.

Fixes

Temporal Reasoning probe — the old "Reply with ONLY the name" framing caused some models to loop for minutes when the correct answer was "Neither". Prompt now lists 'Alice' | 'Bob' | 'Neither' explicitly.
Leaderboard provider mis-attribution — collect_fingerprint(..., "ollama") hardcode fixed; non-Ollama runs now attribute correctly.

Safety

Agent-invocation guard on gauntlet run — refuses to benchmark local models from non-interactive shells (MCP-client subprocess spawns) unless GAUNTLET_ALLOW_LOCAL=1. Prevents accidentally loading large local models and tanking the user's machine. Cloud providers + interactive humans unaffected.
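
A guard of this shape could be sketched as below. How Gauntlet actually detects a non-interactive shell is not stated, so the TTY check here is an assumption, as are the function name and the `LOCAL_PROVIDERS` set.

```python
import os
import sys
from typing import Mapping, Optional

# Assumed set of providers that load models on the user's own machine.
LOCAL_PROVIDERS = frozenset({"ollama", "lmstudio"})


def local_run_allowed(
    provider: str,
    interactive: Optional[bool] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Decide whether a benchmark run may start (hypothetical helper)."""
    env = os.environ if env is None else env
    if interactive is None:
        # Assumption: a TTY on stdin stands in for "a human is at the keyboard".
        interactive = sys.stdin is not None and sys.stdin.isatty()
    if provider not in LOCAL_PROVIDERS:
        return True  # cloud providers are unaffected
    if interactive:
        return True  # interactive humans are unaffected
    return env.get("GAUNTLET_ALLOW_LOCAL") == "1"
```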

Polish

Host resolution honors config file (env > file > default) for Ollama and LM Studio; persistent gauntlet config --*-host flags now actually take effect.
Error messages and auto-detect include LM Studio alongside Ollama.
README: new LM Studio and Cloud Baselines sections; updated provider filter tables.
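
The env > file > default precedence for host resolution amounts to a first-non-empty lookup. A minimal sketch, with `resolve_host` as a hypothetical name and trailing-slash trimming added as an assumption:

```python
import os
from typing import Optional

DEFAULT_LMSTUDIO_HOST = "http://localhost:1234"


def resolve_host(env_value: Optional[str], config_value: Optional[str], default: str) -> str:
    """Pick the first non-empty value in the order env > config file > default."""
    for candidate in (env_value, config_value):
        if candidate:
            return candidate.rstrip("/")
    return default


def lmstudio_host(config_value: Optional[str] = None) -> str:
    """Resolve the LM Studio host using the LMSTUDIO_HOST environment variable."""
    return resolve_host(os.environ.get("LMSTUDIO_HOST"), config_value, DEFAULT_LMSTUDIO_HOST)
```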

Tests

12 new LM Studio tests (host precedence, parsing, factory, metadata inference).

Install

```bash
pip install --upgrade gauntlet-cli
```

Full changelog: v2.0.3...v2.1.0

17 Apr 17:22

Basaltlabs-app

v2.0.3

001634f

v2.0.3: UX Polish & Layer Sensitivity Expansion

What's new

Layer Sensitivity expanded (16 to 25 probes)

New probes across all 5 cognitive categories:

Syntax: pronoun resolution, word order sensitivity
Logic: base-rate fallacy detection, double negation resolution
Spatial: mental rotation, direction tracking after turning
Pragmatic: understatement recognition, indirect refusal detection
Factual: additional date/number recall

Degradation API now includes perplexity

/api/degradation response includes perplexity_mean and perplexity_n per quantization level when V2 data is available. Enables the dashboard to overlay perplexity on degradation curves.
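
As an illustration of the per-level aggregation, the sketch below computes `perplexity_mean` and `perplexity_n` for each quantization level. The input row shape (`quant` and `perplexity` keys) is an assumption; only the two output field names come from the notes.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Iterable, List, Mapping


def perplexity_by_quant(rows: Iterable[Mapping[str, object]]) -> Dict[str, Dict[str, float]]:
    """Group rows by quantization level and summarise their perplexity values."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        value = row.get("perplexity")
        if value is not None:  # rows without V2 data carry no perplexity
            buckets[str(row["quant"])].append(float(value))
    return {
        quant: {"perplexity_mean": fmean(values), "perplexity_n": len(values)}
        for quant, values in buckets.items()
    }
```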

Dashboard UX improvements

Empty tabs fixed: Speed, Quality, and Graph tabs now explain they need gauntlet compare instead of showing blank screens
Category explanations: Sycophancy Gradient, Perplexity Baseline, and Layer Sensitivity cards show contextual subtitles explaining what the score means
Strength Profile: community leaderboard heatmap replaced with expandable glass cards using the warm site palette

22 new tests (529 total, 0 failures)

Upgrade

pip install --upgrade gauntlet-cli

17 Apr 15:41

Basaltlabs-app

v2.0.2

cf034ad

v2.0.2: Skip server-error probes from scoring

Fix: Ollama 500 errors no longer tank your scores

Server errors (HTTP 500, 502, 503, connection failures) from Ollama or llama.cpp were being scored as behavioral failures (0.0), unfairly penalizing models for infrastructure issues. On memory-constrained machines (M1 8GB), multi-turn probes frequently trigger Ollama OOM kills, dragging scores down even though the model itself is performing correctly.

Now: Server-error probes are marked as "Skipped (server error)" and excluded from the module score entirely. They still appear in the results display (so you can see what happened) but they don't affect the score. If 5/8 temporal probes crash and 3/8 run successfully, the score is based on the 3 that ran.
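
The scoring change can be sketched as an average over only the probes that actually ran. Treating the module score as a plain mean is an assumption made for illustration, and `module_score` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def module_score(results: List[Tuple[float, bool]]) -> Optional[float]:
    """Average probe scores, ignoring probes that hit a server error.

    Each result is (score, server_error). Returns None when every probe
    was skipped, since there is nothing to score.
    """
    scored = [score for score, server_error in results if not server_error]
    if not scored:
        return None
    return sum(scored) / len(scored)
```

For the 5/8 example above with the three surviving probes scoring 1.0, 1.0, and 0.5, this gives about 0.83, where counting the five crashes as 0.0 would have given about 0.31.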

Upgrade

pip install --upgrade gauntlet-cli

17 Apr 09:32

Basaltlabs-app

v2.0.1

94e2a39

v2.0.1: Hotfix

Fix: Full benchmark runs crashing

The ConfidenceCalibration module's score() method was missing two required arguments (high_failures, summary) when constructing ModuleScore. This caused every full behavioral suite run to crash with:

ModuleScore.__init__() missing 2 required positional arguments: 'high_failures' and 'summary'

Quick Tests (health checks) were unaffected.
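
The failure mode is easy to reproduce in isolation. In the sketch below only the `high_failures` and `summary` names come from the error message; modelling ModuleScore as a dataclass, the other two fields, and the field types are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModuleScore:
    # Hypothetical fields, apart from high_failures and summary.
    module_name: str
    score: float
    high_failures: List[str]
    summary: str


def broken_score() -> ModuleScore:
    # Omits the two required arguments, so this raises TypeError at call time.
    return ModuleScore("CONFIDENCE_CALIBRATION", 0.8)  # type: ignore[call-arg]


def fixed_score() -> ModuleScore:
    return ModuleScore(
        "CONFIDENCE_CALIBRATION", 0.8, high_failures=[], summary="calibration ok"
    )
```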

Upgrade

pip install --upgrade gauntlet-cli

17 Apr 04:28

Basaltlabs-app

v2.0.0

f9c2640

Gauntlet V2.0.0

The "does perplexity predict behavioral degradation?" release.

V2 adds the empirical tools to settle this question with community data instead of speculation.

New Modules

PERPLEXITY_BASELINE
Measures raw token prediction quality on a fixed evaluation corpus using logprobs from Ollama/llama.cpp. NOT factored into TrustScore or GauntletScore. Reported as a standalone metric in community submissions for correlation analysis. Gracefully skips for cloud providers without logprob access.
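
For reference, perplexity over a token sequence is the exponential of the negative mean log-probability. A minimal sketch, assuming the provider returns natural-log token probabilities; how Gauntlet batches the evaluation corpus is not covered here.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Compute exp(-mean(logprobs)) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model that assigns probability 0.25 to every token has a perplexity of 4: it is as uncertain as a uniform choice among four options.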

LAYER_SENSITIVITY (16 probes, 5 cognitive categories)
Probes cognitive functions mapped to different transformer layer groups:

Shallow syntax (3 probes): subject-verb agreement, format preservation, grammatical errors
Factual recall (3 probes): chemical symbols, physical constants, biology
Multi-step logic (4 probes): chained arithmetic, transitivity, modus tollens, syllogism
Spatial reasoning (3 probes): relative position, mirror reflection, clock position
Pragmatic inference (3 probes): sarcasm, implicature, social norms

Per-category score breakdown identifies which cognitive functions degrade first under quantization.

Enhanced Metadata

quant_method: gguf, gguf_iq (importance-matrix), gptq, awq, exl2, cloud
quant_source: bartowski, thebloke, mradermacher, turboderp, unsloth, official, community
Both auto-populated from model name patterns and format metadata
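
Name-based inference might look roughly like the sketch below. The matching patterns are guesses at common naming conventions (for example `Q4_K_M` and `IQ4_XS` tags), not Gauntlet's actual rules, and returning None for no match is an assumption.

```python
import re
from typing import Optional

KNOWN_SOURCES = ("bartowski", "thebloke", "mradermacher", "turboderp", "unsloth")


def infer_quant_method(model_name: str) -> Optional[str]:
    """Guess the quantization method from substrings in the model name."""
    name = model_name.lower()
    for marker in ("gptq", "awq", "exl2"):
        if marker in name:
            return marker
    # Importance-matrix GGUF quants use tags such as IQ4_XS.
    if re.search(r"(?<![a-z0-9])iq\d", name):
        return "gguf_iq"
    # Regular GGUF quants use tags such as Q4_K_M or Q8_0.
    if "gguf" in name or re.search(r"(?<![a-z0-9])q\d", name):
        return "gguf"
    return None


def infer_quant_source(model_name: str) -> Optional[str]:
    """Return the first known uploader name that appears in the model name."""
    name = model_name.lower()
    for source in KNOWN_SOURCES:
        if source in name:
            return source
    return None
```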

Architecture (never manually sync modules again)

MODULE_LABELS auto-derived from module registry (no manual dict maintenance)
VALID_CATEGORIES auto-derived from registry at API startup
Dashboard auto-generates labels for unknown modules
Adding a new module: write 1 file + add 1 import
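
One common way to get this kind of auto-derivation is a registration decorator. The sketch below shows the general pattern only; the decorator, the `name` attribute, and the label formatting are assumptions rather than Gauntlet's real registry API.

```python
from typing import Dict, Set


MODULE_REGISTRY: Dict[str, type] = {}


def register(cls: type) -> type:
    """Class decorator: record the module under its declared name."""
    MODULE_REGISTRY[cls.name] = cls
    return cls


def module_labels() -> Dict[str, str]:
    """Derive display labels from registry keys instead of a hand-kept dict."""
    return {name: name.replace("_", " ").title() for name in MODULE_REGISTRY}


def valid_categories() -> Set[str]:
    """Derive the accepted category names from the same registry."""
    return set(MODULE_REGISTRY)


@register
class PerplexityBaseline:
    name = "PERPLEXITY_BASELINE"


@register
class LayerSensitivity:
    name = "LAYER_SENSITIVITY"
```

With this shape, adding a module means writing one decorated class and importing its file so the decorator runs.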

Dashboard Redesign

Community panel reorganized with clear narrative hierarchy:

"The Evidence": perplexity vs behavioral score scatter plot
Rankings: tier-filtered leaderboard
Quantization Impact: degradation curves + quant method comparison + layer sensitivity (grouped with sub-tabs)
Hardware Landscape: condensed 3-column grid
Predict: collaborative filtering

Tests

59 new tests (508 total, 0 failures):

test_perplexity_baseline.py: perplexity math, module structure, scoring exclusion
test_layer_sensitivity.py: probe generation, check logic for all 5 categories, scoring breakdown
test_quant_method.py: field existence, inference logic for GGUF/IQ/GPTQ/AWQ/EXL2, source detection

Install / Upgrade

pip install --upgrade gauntlet-cli

13 Apr 22:18

Basaltlabs-app

v1.5.1

64ffaca

v1.5.1 — Dashboard Polish & Fixes

Fixes PyPI publish. Removes self-judge. Fixes community submission pipeline. Adds Spline 404 page. Animation polish on community leaderboard.
