GEARBX

Entropy-Routed Dynamic Quantization for Local LLMs

Monitor output entropy in real time. Route each token through the right precision tier. Low-entropy filler gets 4-bit. Mid-entropy tokens get 8-bit. High-entropy reasoning gets fp16. No wasted compute.

Live Demo

Watch GEARBX Think

Each token is generated at the precision tier its entropy selects. Watch the gear shifts happen in real time.

[Live demo: a `gearbx generate` terminal with live readouts for gear, entropy, precision, memory, and savings]
| Gear | Precision | Memory saved vs fp16 | Weight memory | Entropy range |
|---|---|---|---|---|
| LOW GEAR | 4-bit | 75% | 0.50 GB per billion params | < 1.8 bits |
| MID GEAR | 8-bit | 50% | 1.00 GB per billion params | 1.8 to 3.5 bits |
| HIGH GEAR | fp16 | 0% | 2.00 GB per billion params | > 3.5 bits |
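
The per-gear memory figures follow from bits per weight. The sketch below reproduces that arithmetic; it assumes 1 GB = 10^9 bytes and ignores quantization metadata such as scales and zero-points, and the function names are illustrative rather than part of the gearbx API.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: quantization metadata (scales, zero-points) is ignored.
    """
    return num_params * bits_per_weight / 8 / 1e9


def savings_vs_fp16(bits_per_weight: int) -> float:
    """Fraction of memory saved relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


for gear, bits in [("LOW", 4), ("MID", 8), ("HIGH", 16)]:
    gb = weight_memory_gb(1e9, bits)
    saved = savings_vs_fp16(bits)
    print(f"{gear}: {gb:.2f} GB per billion params, {saved:.0%} saved")
```

In practice, group-wise scales add a small overhead on top of these numbers.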
[Chart: entropy over time, with gear thresholds marked at 1.8 and 3.5 bits]

What is GEARBX

A Transmission for Your LLM

Every LLM generates tokens one at a time. Some tokens are trivial ("the", "is", "and"): the model is 99% certain of the answer. Others are critical: mathematical reasoning, rare vocabulary, creative leaps. Those are the tokens where the model genuinely deliberates.

Static quantization treats every token the same. GEARBX doesn't. It reads the Shannon entropy of the output distribution at each step and shifts precision on the fly, like a driver working a manual transmission by engine RPM.

Low entropy? Downshift to 4-bit. Weights get packed into a quarter of the memory; inference flies. Medium entropy? Cruise in 8-bit: 2× compression with minimal quality loss. High entropy? Upshift to fp16. Original weights restored, full floating-point fidelity where it actually matters.

75% memory savings on low-entropy tokens vs fp16 baseline
1.8× faster throughput on predictable sequences
<1% quality loss on standard benchmarks vs full precision
Why GEARBX

Dynamic Beats Static

| | Static Quantization | GEARBX |
|---|---|---|
| Precision per token | Fixed: same bits everywhere | Adaptive: 4-bit, 8-bit, or fp16 per token |
| Memory usage | Constant (lowest tier) | Dynamic: low gear frees 75% VRAM |
| Quality on hard tokens | Degraded; quantization noise compounds | Preserved: upshifts to fp16 where it matters |
| Speed on easy tokens | Same as hard tokens | Up to 1.8× faster (4-bit path) |
| Gear shifting | None: one size fits all | Per-token, entropy-driven, with hysteresis |
| Setup | Choose one precision upfront | Auto-calibrates from first prompt |
Mechanism

How Gear Shifting Works

01

Measure Entropy

Shannon entropy (bits) computed from logits each generation step. Rolling-window average smooths jitter. Auto-calibrates thresholds from prefill distribution.
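
As a rough illustration of this step, the sketch below computes Shannon entropy in bits from a list of raw logits and smooths it with a rolling-window mean. The class name `RollingEntropy` and the window size are assumptions made for the example, not gearbx internals.

```python
import math
from collections import deque


def shannon_entropy_bits(logits):
    """Entropy (in bits) of softmax(logits), using a max-shift for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Skip zero-probability entries: p * log2(p) tends to 0 as p tends to 0.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


class RollingEntropy:
    """Mean entropy over the last `window` steps (window size is an assumption)."""

    def __init__(self, window=8):
        self.values = deque(maxlen=window)

    def update(self, logits):
        self.values.append(shannon_entropy_bits(logits))
        return sum(self.values) / len(self.values)


print(shannon_entropy_bits([0.0] * 8))             # uniform over 8 tokens
print(shannon_entropy_bits([20.0, 0.0, 0.0, 0.0]))  # one dominant token
```

A uniform distribution over 8 tokens carries 3 bits of entropy (mid gear at the default thresholds), while a sharply peaked one sits near 0 bits (low gear).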

02

Route Gear

Thresholds map entropy to gear tier. Low entropy = predictable = downshift. High entropy = uncertain = upshift. Hysteresis prevents oscillation near boundaries.
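
One way to picture threshold routing with hysteresis is the sketch below. The 1.8 and 3.5 bit boundaries come from the gear table; the 0.2 bit margin, the class name `GearRouter`, and starting in high gear are assumptions for illustration only.

```python
LOW, MID, HIGH = 0, 1, 2


class GearRouter:
    """Maps smoothed entropy to a gear, with a dead band around each boundary."""

    def __init__(self, low_threshold=1.8, high_threshold=3.5, margin=0.2):
        # bounds[g] is the boundary between gear g and gear g + 1.
        self.bounds = (low_threshold, high_threshold)
        self.margin = margin  # hypothetical hysteresis width, in bits
        self.gear = HIGH      # assumption: start at full precision

    def route(self, entropy_bits):
        # Upshift only when entropy clears the upper boundary by the margin.
        while self.gear < HIGH and entropy_bits > self.bounds[self.gear] + self.margin:
            self.gear += 1
        # Downshift only when entropy drops below the lower boundary by the margin.
        while self.gear > LOW and entropy_bits < self.bounds[self.gear - 1] - self.margin:
            self.gear -= 1
        return self.gear


router = GearRouter()
for h in [1.0, 1.9, 2.1, 1.7, 4.0, 3.4, 3.2]:
    print(h, router.route(h))
```

With the margin in place, a reading of 1.9 bits does not pull the router out of low gear, and 1.7 bits does not pull it out of mid gear, so values hovering around 1.8 do not cause a shift on every token.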

03

Swap Precision

Low gear: attention layers hot-swapped to 4-bit packed weights. Mid gear: 8-bit. High gear: original fp16 restored. Unused weights offloaded to CPU for real memory savings.
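
To make "4-bit packed weights" concrete, here is a minimal sketch of affine quantization to 4-bit codes with two codes stored per byte. It assumes a single scale and offset per tensor; production schemes (and possibly GEARBX itself) typically use group-wise scales, and these function names are hypothetical.

```python
def quantize_4bit(weights):
    """Quantize floats to 4-bit codes (0..15) and pack two codes per byte."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant input
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    if len(codes) % 2:
        codes.append(0)  # pad so codes pair up evenly
    # Low nibble holds the even-indexed code, high nibble the odd-indexed one.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_4bit(packed, scale, lo, n):
    """Unpack nibbles and map codes back to approximate float values."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return [lo + c * scale for c in codes[:n]]


weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
packed, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(packed, scale, lo, len(weights))
print(len(packed), restored)
```

Five weights fit in three bytes, versus ten bytes in fp16; the reconstruction error per weight is bounded by half the quantization step.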

04

Generate Token

Forward pass through current-gear weights. On CUDA, fused Triton kernels multiply directly on packed data with no dequantization. Sample. Repeat.
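
The four steps can be tied together in a loop like the sketch below. It is a simplified illustration, not the gearbx implementation: `forward` is a hypothetical stand-in for the model, routing uses plain thresholds without smoothing or hysteresis, and the entropy measured at one step is assumed to select the gear for the next step.

```python
import math
import random


def entropy_bits(logits):
    """Shannon entropy (bits) of softmax(logits)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log2(e / z) for e in exps if e > 0.0)


def pick_gear(h, low=1.8, high=3.5):
    """Plain threshold routing (no hysteresis in this sketch)."""
    if h < low:
        return "low"
    return "mid" if h <= high else "high"


def sample(logits, rng):
    """Sample a token id from softmax(logits)."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(forward, prompt, max_new_tokens, seed=0):
    rng = random.Random(seed)
    tokens, trace = list(prompt), []
    gear = "high"  # assumption: no entropy reading yet, so start at full precision
    for _ in range(max_new_tokens):
        logits = forward(tokens, gear)           # forward pass at the current gear
        tokens.append(sample(logits, rng))       # sample
        trace.append(gear)
        gear = pick_gear(entropy_bits(logits))   # gear for the next step
    return tokens, trace


def toy_forward(tokens, gear):
    """Hypothetical model: peaked logits at even lengths, flat logits at odd."""
    if len(tokens) % 2 == 0:
        return [10.0] + [0.0] * 15
    return [0.0] * 16


tokens, trace = generate(toy_forward, [1, 2], max_new_tokens=4)
print(trace)
```

With this toy model the trace alternates between high and low gear, because each peaked step is followed by a flat (4-bit entropy) step.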

Get Started

Install in Seconds

TUI (Terminal App)
npm
npm install -g gearbx
curl
curl -fsSL gearbx.jpdz.app/install | sh

Then run: gearbx

Python Library
Apple Silicon (MLX)
pip install "gearbx[mlx]"
CUDA
pip install "gearbx[cuda]"
Base (CPU / MPS)
pip install gearbx
Backends: MLX (Apple Silicon) · CUDA (Triton fused kernels) · MPS (PyTorch) · CPU