This directory contains benchmarking scripts for measuring Bergson's performance across different models and configurations.
- `benchmark_bergson.py` - Programmatic benchmarks for Bergson
  - `run` - In-memory benchmark using `InMemoryCollector` (fast, single GPU)
  - `run-disk` - Disk-based benchmark using the real `build()`, `reduce()`, `score_dataset()` (single GPU)
- `benchmark_bergson_cli.py` - CLI-based benchmark using subprocess
  - Tests the actual CLI commands (`bergson build`, `bergson reduce`, `bergson score`)
  - Supports multi-GPU via `--num-gpus`
- `benchmark_dattri.py` - Dattri influence function benchmark
- `kronfluence_benchmark.py` - Kronfluence influence function benchmark
- `benchmark_utils.py` - Shared utilities for all benchmarks
  - Model specifications
  - Token parsing
  - Path generation
  - Timestamp utilities
  - `load_benchmark_dataset()` - Load the on-disk tokenized dataset with filtering
- `save_to_disk.py` - Utility for preprocessing and saving tokenized datasets to disk
- `plot_cli_benchmark.py` - Plot benchmark results
  - Automatically separates plots by `num_gpus` and hardware
  - Generates `cli_benchmark_1gpu.png`, `cli_benchmark_8gpu.png`, etc.
  - Each PNG only contains results from the same GPU/hardware configuration
- `run_full_benchmark.py` - Orchestrate the full benchmark suite
All benchmarks should use the pre-tokenized on-disk dataset for consistency:

```python
from benchmarks.benchmark_utils import load_benchmark_dataset

# Load and filter to sequences >= 1024 tokens
ds = load_benchmark_dataset()
```

Or test it directly:

```bash
python -m benchmarks.test_load_dataset
```

This will:

- Load the tokenized dataset from `data/EleutherAI/SmolLM2-135M-10B-tokenized`
- Filter out sequences shorter than 1024 tokens (for even batching)
- Print statistics about the total tokens available
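The length filter above can be sketched in plain Python. This is a hypothetical illustration of the filtering logic only: the real `load_benchmark_dataset()` operates on the on-disk Hugging Face dataset, and the `filter_for_even_batching` helper and `input_ids` field here are assumptions for the sketch.

```python
# Minimal sketch of the >= 1024-token filter used for even batching.
# Rows are plain dicts standing in for tokenized dataset examples.
MIN_TOKENS = 1024

def filter_for_even_batching(rows):
    """Keep only rows whose token sequence is at least MIN_TOKENS long."""
    return [row for row in rows if len(row["input_ids"]) >= MIN_TOKENS]

rows = [
    {"input_ids": list(range(2048))},  # long enough -> kept
    {"input_ids": list(range(512))},   # too short  -> dropped
]
kept = filter_for_even_batching(rows)
total_tokens = sum(len(r["input_ids"]) for r in kept)
print(len(kept), total_tokens)  # 1 row kept, 2048 tokens total
```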
In-memory benchmark:

```bash
python -m benchmarks.benchmark_bergson run pythia-14m 1M 100K
```

Disk-based benchmark:

```bash
python -m benchmarks.benchmark_bergson run-disk pythia-14m 1M 100K
```

Single GPU (default):

```bash
python -m benchmarks.benchmark_bergson_cli pythia-70m 10M
```

Multi-GPU (8 GPUs):

```bash
python -m benchmarks.benchmark_bergson_cli pythia-70m 10M --num_gpus 8
```

Small models (1 GPU):

```bash
./benchmarks/run_small_models_cli_benchmark.sh
```

Small models (8 GPUs):

```bash
./benchmarks/run_small_models_8gpu.sh
```

Large models (1 GPU):

```bash
./benchmarks/run_large_models_cli_benchmark.sh
```

Large models (8 GPUs):

```bash
./benchmarks/run_large_models_8gpu.sh
```

The plotting script automatically separates results by GPU count and hardware:
```bash
python -m benchmarks.plot_cli_benchmark
```

This will:

- Load all benchmark results from `runs/bergson_cli_benchmark/`
- Group them by (num_gpus, hardware) combination
- Generate separate plots and CSVs for each configuration:
  - `figures/cli_benchmark_1gpu.png` - Single GPU results
  - `figures/cli_benchmark_8gpu.png` - 8 GPU results
  - `runs/benchmarks/cli_benchmark_1gpu.csv` - Single GPU data
  - `runs/benchmarks/cli_benchmark_8gpu.csv` - 8 GPU data

Each plot only contains results from the same GPU/hardware configuration, making comparisons fair and meaningful.
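The grouping step can be sketched as follows. This is a hedged illustration, not the plotting script's actual code: the record fields beyond `num_gpus` and `hardware`, and the sample values, are made up for the example.

```python
# Sketch: group benchmark records by (num_gpus, hardware) so each plot
# only ever mixes results from one configuration.
from collections import defaultdict

records = [
    {"model": "pythia-70m",  "num_gpus": 1, "hardware": "node1/A100x1", "seconds": 12.5},
    {"model": "pythia-160m", "num_gpus": 1, "hardware": "node1/A100x1", "seconds": 30.1},
    {"model": "pythia-70m",  "num_gpus": 8, "hardware": "node2/A100x8", "seconds": 4.2},
]

groups = defaultdict(list)
for rec in records:
    groups[(rec["num_gpus"], rec["hardware"])].append(rec)

# One output artifact per configuration, e.g. cli_benchmark_1gpu.png
for (num_gpus, hardware), recs in sorted(groups.items()):
    print(f"cli_benchmark_{num_gpus}gpu.png: {len(recs)} results on {hardware}")
```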
| Benchmark | Method | Multi-GPU | Disk I/O | Use Case |
|---|---|---|---|---|
| `run` | In-memory collector | No (FSDP only) | None | Quick memory scaling tests |
| `run-disk` | Real build/reduce/score | No | Yes | Test production code paths |
| CLI (1 GPU) | Subprocess CLI commands | No | Yes | Single GPU baseline |
| CLI (8 GPU) | Subprocess CLI commands | Yes | Yes | Full multi-GPU distributed |
All benchmarks now include:

- `num_gpus`: number of GPUs used for the run
- `hardware`: hardware information (node name + GPU type/count)

This allows proper comparison between single-GPU and multi-GPU runs.
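One way such a record could be assembled is sketched below. Only the `num_gpus` and `hardware` fields come from this README; the `make_record` helper, the other field names, and the exact hardware string format are hypothetical.

```python
# Sketch of a benchmark result record carrying num_gpus and hardware metadata.
import json
import platform

def make_record(model: str, seconds: float, num_gpus: int, gpu_type: str) -> dict:
    return {
        "model": model,
        "seconds": seconds,
        "num_gpus": num_gpus,
        # "node name + GPU type/count", per the metadata described above
        "hardware": f"{platform.node()} ({num_gpus}x {gpu_type})",
    }

rec = make_record("pythia-70m", 42.0, 8, "A100")
line = json.dumps(rec)  # one JSON record per run
```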
- Add your benchmark script to this directory
- Import from `benchmarks.benchmark_utils` for shared functionality
- Follow the existing pattern for saving results (JSON records)
- Update this README with your benchmark's purpose and usage
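A new benchmark following the JSON-record pattern might look like the skeleton below. This is a sketch under stated assumptions: the function name, field values, and output path are illustrative, and a real script would pull model specs, paths, and timestamps from `benchmarks.benchmark_utils` rather than hard-coding them.

```python
# Hypothetical skeleton for a new benchmark script: time a workload and
# append one JSON record per run, as the existing benchmarks do.
import json
import tempfile
import time
from pathlib import Path

def run_benchmark(model: str, out_path: Path) -> dict:
    start = time.perf_counter()
    # ... run the workload being benchmarked here ...
    elapsed = time.perf_counter() - start
    record = {
        "model": model,
        "seconds": elapsed,
        "num_gpus": 1,               # metadata fields described above
        "hardware": "example-node",  # placeholder hardware string
    }
    with out_path.open("a") as f:
        f.write(json.dumps(record) + "\n")  # one JSON record per line
    return record

out = Path(tempfile.gettempdir()) / "example_results.jsonl"
rec = run_benchmark("pythia-14m", out)
```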