GEARBX

Entropy-Routed Dynamic Quantization for Local LLMs

Monitor output entropy in real time. Route each token through the right precision tier. Low-entropy filler gets 4-bit. Mid-entropy tokens get 8-bit. High-entropy reasoning gets fp16. No wasted compute.


gearbx

$ gearbx

Loading llama3.2:1b...

Calibrating entropy thresholds...

⚙ MID 2.4 bits

Live Demo

Watch GEARBX Think

Each token is generated at a different precision tier based on its entropy. Watch the gear shifts happen in real time.

gearbx generate

| Gear | Precision | Entropy range | Memory saved vs fp16 | Weight memory |
| --- | --- | --- | --- | --- |
| LOW GEAR | 4-bit | < 1.8 bits | 75% | 0.50 GB per billion params |
| MID GEAR | 8-bit | 1.8 to 3.5 bits | 50% | 1.00 GB per billion params |
| HIGH GEAR | fp16 | > 3.5 bits | 0% | 2.00 GB per billion params |
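The per-gear memory figures above follow from bits per weight alone. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and ignoring quantization scales and other metadata; the function and constant names are illustrative and not part of the gearbx API:

```python
# Bits used to store one weight in each gear (from the tier figures above).
BITS_PER_WEIGHT = {"low": 4, "mid": 8, "high": 16}


def weight_memory_gb(n_params: float, gear: str) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring scales and metadata."""
    return n_params * BITS_PER_WEIGHT[gear] / 8 / 1e9


def savings_vs_fp16(gear: str) -> float:
    """Fraction of fp16 weight memory saved when running in `gear`."""
    return 1 - BITS_PER_WEIGHT[gear] / BITS_PER_WEIGHT["high"]


for gear in ("low", "mid", "high"):
    # One billion parameters, to match the "GB per billion params" figures.
    print(gear, weight_memory_gb(1e9, gear), savings_vs_fp16(gear))
```

Real packed formats usually store a scale per row or per group as well, so actual footprints would be slightly higher than these idealized numbers.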

Entropy over time (chart, with gear thresholds marked at 1.8 and 3.5 bits)

What is GEARBX

A Transmission for Your LLM

Every LLM generates tokens one at a time. Some tokens are trivial ("the", "is", "and"): the model is 99% certain of the answer. Others are critical: mathematical reasoning, rare vocabulary, creative leaps. Those are the tokens where the model genuinely deliberates.

Static quantization treats every token the same. GEARBX doesn't. It reads the Shannon entropy of the output distribution at each step and shifts precision on the fly, like a manual transmission shifting gears based on engine RPM.

Low entropy? Downshift to 4-bit. Weights get packed into a quarter of the memory; inference flies. Medium entropy? Cruise in 8-bit: 2× compression with minimal quality loss. High entropy? Upshift to fp16. Original weights restored, full floating-point fidelity where it actually matters.

75% memory savings on low-entropy tokens vs fp16 baseline

1.8× faster throughput on predictable sequences

<1% quality loss on standard benchmarks vs full precision

Why GEARBX

Dynamic Beats Static

| | Static Quantization | GEARBX |
| --- | --- | --- |
| Precision per token | Fixed: same bits everywhere | Adaptive: 4-bit, 8-bit, or fp16 per token |
| Memory usage | Constant (lowest tier) | Dynamic: low gear frees 75% VRAM |
| Quality on hard tokens | Degraded; quantization noise compounds | Preserved: upshifts to fp16 where it matters |
| Speed on easy tokens | Same as hard tokens | Up to 1.8× faster (4-bit path) |
| Gear shifting | None: one size fits all | Per-token, entropy-driven, with hysteresis |
| Setup | Choose one precision upfront | Auto-calibrates from first prompt |

Mechanism

How Gear Shifting Works

Measure Entropy

Shannon entropy (in bits) is computed from the logits at each generation step. A rolling-window average smooths jitter. Thresholds are auto-calibrated from the prefill distribution.
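The measurement step can be sketched in plain Python. This is an illustration under stated assumptions, not the gearbx implementation: the window size of 8 and the quantile-based calibration rule (33rd and 66th percentiles of prefill entropies) are guesses made for the example, and all names are hypothetical.

```python
import math
from collections import deque


def entropy_bits(logits):
    """Shannon entropy, in bits, of softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


class RollingEntropy:
    """Mean entropy over the last `window` generation steps (window is assumed)."""

    def __init__(self, window=8):
        self.values = deque(maxlen=window)

    def update(self, logits):
        self.values.append(entropy_bits(logits))
        return sum(self.values) / len(self.values)


def calibrate_thresholds(prefill_entropies, low_q=0.33, high_q=0.66):
    """Pick (low, high) thresholds as quantiles of prefill entropies.

    The quantile rule is an assumption for illustration only.
    """
    if not prefill_entropies:
        raise ValueError("need at least one prefill entropy value")
    ordered = sorted(prefill_entropies)

    def pick(q):
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

    return pick(low_q), pick(high_q)
```

A uniform distribution over four tokens gives exactly 2 bits, which would land in the mid gear under the 1.8 and 3.5 bit thresholds shown earlier.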

Route Gear

Thresholds map entropy to gear tier. Low entropy = predictable = downshift. High entropy = uncertain = upshift. Hysteresis prevents oscillation near boundaries.
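One way to picture threshold routing with hysteresis is a small state machine that only changes gear once entropy clears a boundary by some margin. The 1.8 and 3.5 bit thresholds come from the tier definitions; the 0.2 bit margin, the starting gear, and the class name are assumptions made for this sketch.

```python
class GearRouter:
    """Map smoothed entropy to a gear, with a dead band around each threshold."""

    def __init__(self, low=1.8, high=3.5, margin=0.2):
        self.low = low        # boundary between low and mid gear (bits)
        self.high = high      # boundary between mid and high gear (bits)
        self.margin = margin  # assumed hysteresis width (bits)
        self.gear = "mid"     # assumed starting gear

    def route(self, entropy):
        gear = self.gear
        if gear == "low":
            # Upshift only once entropy is clearly above a boundary.
            if entropy > self.high + self.margin:
                gear = "high"
            elif entropy > self.low + self.margin:
                gear = "mid"
        elif gear == "mid":
            if entropy < self.low - self.margin:
                gear = "low"
            elif entropy > self.high + self.margin:
                gear = "high"
        else:  # currently in high gear
            # Downshift only once entropy is clearly below a boundary.
            if entropy < self.low - self.margin:
                gear = "low"
            elif entropy < self.high - self.margin:
                gear = "mid"
        self.gear = gear
        return gear
```

With these values, an entropy that wobbles between 1.7 and 1.9 bits never triggers a shift, which is the oscillation the hysteresis is meant to prevent.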

Swap Precision

Low gear: attention layers hot-swapped to 4-bit packed weights. Mid gear: 8-bit. High gear: original fp16 restored. Unused weights offloaded to CPU for real memory savings.
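To make "4-bit packed weights" concrete, here is a minimal sketch of symmetric 4-bit quantization that stores two weights per byte. The page does not say which quantization scheme GEARBX uses, so the per-row scale, the [-7, 7] code range, and both function names are assumptions for illustration.

```python
def quantize_int4(weights):
    """Quantize one row of floats to signed 4-bit codes, two codes per byte.

    Returns (packed_bytes, scale). Scheme is assumed: symmetric, one scale per row.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # map the row onto [-7, 7]
    # Round, clamp to the int4 range, and keep the two's-complement low nibble.
    codes = [max(-8, min(7, round(w / scale))) & 0xF for w in weights]
    if len(codes) % 2:
        codes.append(0)  # pad so codes pair up evenly
    packed = bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))
    return packed, scale


def dequantize_int4(packed, scale, n):
    """Unpack `n` weights from bytes produced by quantize_int4."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble  # undo two's complement
            out.append(signed * scale)
    return out[:n]
```

Each weight takes half a byte here versus two bytes in fp16, which is where the quarter-size figure for low gear comes from; the reconstruction error per weight is at most half the scale.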

Generate Token

Forward pass through current-gear weights. On CUDA, fused Triton kernels multiply directly on packed data with no dequantization. Sample. Repeat.

Get Started

Install in Seconds

TUI (Terminal App)

npm

npm install -g gearbx

curl

curl -fsSL gearbx.jpdz.app/install | sh

Then run: gearbx

Python Library

Apple Silicon (MLX)

pip install "gearbx[mlx]"

CUDA

pip install "gearbx[cuda]"

Base (CPU / MPS)

pip install gearbx

Backends: MLX (Apple Silicon) · CUDA (Triton fused kernels) · MPS (PyTorch) · CPU