[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Updated Oct 30, 2025 - Python
Community benchmark database for running LLMs on Apple Silicon Macs
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
An agent evaluation framework with native multi-turn feedback iteration.
[ICML 2026] CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming
Open benchmark of 44 LLMs with 5,000+ real tests. Alternatives to Claude, GPT-5, and Gemini for N8N agents, OpenClaw, and entrepreneurs. Interactive calculator + LLM-as-Judge Phi-4.
Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Testing how well LLMs can solve jigsaw puzzles
Self-hosted LLM API benchmark, monitoring, and playground. Compare latency, time to first token (TTFT), and throughput across OpenAI, Anthropic, Gemini, and any OpenAI-compatible endpoint. Deploy with one command via Docker.
Open-source multi-agent AI debate arena: pit Claude, GPT, Gemini, Ollama & HuggingFace models against each other with frozen-context fairness, evidence-first judging, 20+ personas, code review, and PDF/Markdown reports. CLI + Web UI.
Research question: can the "office intelligence" of an LLM be measured? An attempt is here. 100 scenarios, 10 criteria, Russian corporate context.
Comprehensive benchmark of OpenRouter free-tier LLMs for practical applications. Evaluates models for coding, Thai language, and general use.
🔍 JailBench: benchmark the jailbreak resilience of LLMs for clear insights and improved model defenses.
A reproducible, deterministic CLI to measure political bias and positioning of LLMs on economic and social axes.
RetardBench is an open, no-censorship benchmark that ranks large language models purely on how retarded they are.
Local LLM benchmarking
Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).