BLXBench 1.0.0 – General Availability
Today we are proud to announce BLXBench 1.0.0, our first GA release, following the 0.7 series. It is more than a version bump: it is a consolidated platform that brings together a year of feedback, new adapters, and a stronger focus on comparability and fairness.
What’s new in 1.0.0
1. First‑class adapters
- LM Studio (lms) – Choose a locally running OpenAI‑compatible server via /provider lms or the headless flag --provider lms.
  The adapter behaves like the cloud providers: model IDs are taken directly from LM Studio, API keys are optional, and runs receive the same metadata (badges, manifest hashes) as cloud runs.
- Roblox OpenGameEval – An optional roblox category that ships with the official Game‑Eval fixtures.
  Runs in this category are reported separately, so the core suite stays comparable across versions while you still get game‑specific scores.
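Because the LM Studio adapter talks to an OpenAI‑compatible server, you can check which model IDs that server exposes independently of BLXBench. The following is a minimal sketch, not BLXBench's adapter code. It assumes LM Studio's local server is reachable at its default address (http://localhost:1234) and that the response follows the OpenAI-style model list shape; the function names are illustrative.

```python
import json
import urllib.request

# Assumption: LM Studio's local server listens on its default address.
# Change this if you configured a different host or port.
LMS_BASE_URL = "http://localhost:1234/v1"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_local_models(base_url: str = LMS_BASE_URL, timeout: float = 5.0) -> list[str]:
    """Ask the OpenAI-compatible server which models it currently serves."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling list_local_models() while the LM Studio server is running should return the IDs you can then pass as the model name.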
2. Fairer usage model
- Per‑model weekly quotas (Scout, Bencher, Founder tiers) replace a shared pool.
  The CLI now tells you which model has hit its limit via /usage.
- Scout tier – A lightweight, paid lane for low‑volume publishers with a small per‑model budget.
- Parallel execution – Each selected model runs in its own sub‑process by default; you can fine‑tune parallelism with --parallel.
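To illustrate why per‑model quotas are fairer than a shared pool, here is a small sketch of the bookkeeping involved. It is an illustration only: the class name, the weekly_limit value, and the idea of counting whole runs are assumptions, since this post does not state the actual tier budgets or how usage is metered.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PerModelQuota:
    """Tracks usage per model instead of drawing from one shared pool.

    weekly_limit is a hypothetical number of runs per model per week.
    """
    weekly_limit: int
    used: dict[str, int] = field(default_factory=lambda: defaultdict(int))

    def try_consume(self, model: str) -> bool:
        """Record one run for `model`; return False if its quota is spent."""
        if self.used[model] >= self.weekly_limit:
            return False
        self.used[model] += 1
        return True

    def exhausted(self) -> list[str]:
        """Models that have hit their limit (what a usage view could report)."""
        return sorted(m for m, n in self.used.items() if n >= self.weekly_limit)
```

With a shared pool, one heavily used model could consume the budget for all others; here, exhausting one model leaves the rest untouched.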
3. Stronger evidence chain
- Suite labels & manifest seals – Every run bundles a suite version and a manifest hash, making it possible to prove that two executions used the exact same evaluation set.
- Improved cost visibility – The status line shows · run $… estimates while a job runs; skipped attempts do not incur phantom costs.
- Deep links from web to CLI – After publishing a run on the website, clicking the model name opens the CLI with the provider + model context pre‑filled.
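The idea behind a manifest seal can be sketched in a few lines: hash a canonical encoding of the evaluation manifest, then compare suite version and hash across runs. This is a sketch under stated assumptions, not BLXBench's actual format. The choice of SHA‑256, the canonical JSON encoding, and the field names suite_version and manifest_hash are all hypothetical.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the manifest.

    Sorting keys and fixing separators makes the hash independent of
    key order and whitespace, so equal content yields an equal hash.
    """
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_evaluation_set(run_a: dict, run_b: dict) -> bool:
    """Two runs are comparable when suite version and manifest hash match.

    The keys used here are hypothetical names for illustration.
    """
    return (run_a["suite_version"], run_a["manifest_hash"]) == (
        run_b["suite_version"],
        run_b["manifest_hash"],
    )
```

Any change to the evaluation set, however small, produces a different hash, which is what makes the comparison meaningful.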
4. Trustworthy runner
- Phase labeling (Streaming → Artifact‑Save → Browser‑Validation → Judge → Scoring) replaces silent hanging.
- Automatic retries on timeouts, HTTP 429, and 5xx responses before a graceful SKIP.
- Per‑model rate limits (/ratelimit or --ratelimit) ensure that parallel jobs do not share a global bottleneck.
- Better Playground stability – Renderer caches and the option to off‑load Playwright workers to Node reduce flakiness across operating systems.
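The retry behavior described above (retry on timeouts, HTTP 429, and 5xx, then SKIP) follows a common pattern that can be sketched as below. This is not the runner's implementation: the attempt count, the exponential backoff, and the "OK" / "FAIL" / "SKIP" return values are assumptions chosen for illustration.

```python
import time
from typing import Callable


def is_retryable(status: int) -> bool:
    """HTTP 429 (rate limited) and 5xx (server error) are worth retrying."""
    return status == 429 or 500 <= status <= 599


def run_with_retries(
    call: Callable[[], int],
    max_attempts: int = 3,          # hypothetical default
    base_delay: float = 1.0,        # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Invoke `call` (returns an HTTP status or raises TimeoutError).

    Returns "OK" on a 2xx status, "FAIL" on a non-retryable error, and
    "SKIP" once all attempts ended in a timeout, 429, or 5xx.
    """
    for attempt in range(max_attempts):
        try:
            status = call()
        except TimeoutError:
            status = None  # a timeout is treated as retryable
        if status is not None and not is_retryable(status):
            return "OK" if 200 <= status < 300 else "FAIL"
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return "SKIP"
```

Passing sleep as a parameter keeps the function easy to test without real delays.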
5. Polished TUI & Shell
- /save and /load under ~/.blxbench/saves/ store configs and preferences (no API secrets).
- Command recall (↑/↓), meta‑arrows for long logs, PgUp/PgDn, and /clear, so existing shell muscle memory carries over.
- /notify plus headless --notify (OS‑dependent) give a reliable “run finished” signal.
- Leaderboard entries are bound to the strongest verified public run, protecting the integrity of public scores.
- Web‑side UI now renders correctly on phones and ultrawides, and includes report‑replay explanations when browser previews are skipped.
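The leaderboard rule in the list above can be read as a simple selection: among a model's runs, consider only those that are both verified and public, and keep the strongest one. The sketch below assumes that "strongest" means highest score and uses hypothetical field names; the post does not define either.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical run record; field names are for illustration only."""
    model: str
    score: float
    verified: bool
    public: bool


def leaderboard_entry(runs: list[Run], model: str) -> Optional[Run]:
    """Pick the highest-scoring verified public run for `model`, if any."""
    eligible = [r for r in runs if r.model == model and r.verified and r.public]
    return max(eligible, key=lambda r: r.score, default=None)
```

Under this reading, an unverified or private run can never displace a leaderboard entry, no matter how high its score.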
Getting started
# Install the latest CLI globally
npm install -g @bitslix/blxbench@latest
# Verify the installation
blxbench --version # → 1.0.0
# Login (if you want to use cloud providers)
blxbench auth login
# Try the new LM Studio adapter (make sure LM Studio is running)
blxbench provider lms
blxbench run --model "lfm-40b-chat" --prompt "Hello, world!"
For a full list of changes, see the changelog.
Looking ahead
With 1.0.0 we have laid a solid foundation for future work:
- More first‑class adapters (e.g., Ollama, vLLM).
- Deeper integration of usage‑based billing and team workspaces.
- Continued investment in the evidence pipeline to make cross‑model, cross‑suite comparisons as trustworthy as possible.
We thank the community for the feedback that shaped this release. If you have questions, ideas, or want to share your own benchmark results, drop us a line at [email protected] or join the discussion on our Discord.
Happy benchmarking!
— Klara, Business Relations & Partnerships, Bitslix
