Entropy-Routed Dynamic Quantization for Local LLMs
Monitor output entropy in real time. Route each token through the right precision tier. Low-entropy filler gets 4-bit. Medium-entropy tokens get 8-bit. High-entropy reasoning gets full fp16. No wasted compute.
Each token is generated at the precision tier its entropy calls for. Watch the gear shifts happen in real time.
Every LLM generates tokens one at a time. Some tokens are trivial ("the", "is", "and") where the model is 99% certain of the answer. Others are critical: mathematical reasoning, rare vocabulary, creative leaps. Those are the tokens where the model genuinely deliberates.
Static quantization treats every token the same. GEARBX doesn't. It reads the Shannon entropy of the output distribution at each step and shifts precision on the fly, like a driver working a manual transmission based on engine RPM.
Low entropy? Downshift to 4-bit. Weights get packed into a quarter of the fp16 memory; inference flies. Medium entropy? Cruise in 8-bit, 2x compression relative to fp16 with minimal quality loss. High entropy? Upshift to fp16. Original weights restored, full floating-point fidelity where it actually matters.
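To make the compression ratios concrete, here is a small back-of-the-envelope sketch. The layer shape (a 4096 x 4096 matrix) and the helper name `weight_bytes` are assumptions chosen for illustration, and the per-group scale overhead that real quantized formats carry is ignored.

```python
# Hypothetical layer shape, chosen only to make the arithmetic concrete.
NUM_WEIGHTS = 4096 * 4096

def weight_bytes(num_weights: int, bits_per_weight: int) -> int:
    """Bytes to store the weights alone (quantization scales are ignored)."""
    return num_weights * bits_per_weight // 8

sizes = {
    "fp16": weight_bytes(NUM_WEIGHTS, 16),
    "8-bit": weight_bytes(NUM_WEIGHTS, 8),
    "4-bit": weight_bytes(NUM_WEIGHTS, 4),
}

for tier, size in sizes.items():
    ratio = sizes["fp16"] / size
    print(f"{tier:>5}: {size / 2**20:5.1f} MiB  ({ratio:.0f}x compression vs fp16)")
```

Under these assumptions the same matrix takes 32 MiB in fp16, 16 MiB in 8-bit, and 8 MiB in 4-bit.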
Shannon entropy (bits) computed from logits each generation step. Rolling-window average smooths jitter. Auto-calibrates thresholds from prefill distribution.
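The measurement step can be sketched in plain Python. This is a minimal illustration, not GEARBX's actual code: the names `shannon_entropy_bits`, `RollingEntropy`, and `calibrate_thresholds` are hypothetical, the window size is arbitrary, and using quantiles of the prefill entropies as thresholds is an assumption about how auto-calibration might work.

```python
import math
from collections import deque

def shannon_entropy_bits(logits):
    """Entropy in bits of softmax(logits), using the max-subtraction trick for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    h = 0.0
    for e in exps:
        p = e / z
        if p > 0.0:  # skip terms that underflowed to zero
            h -= p * math.log2(p)
    return h

class RollingEntropy:
    """Rolling-window mean of per-step entropy (window size is an assumption)."""
    def __init__(self, window=8):
        self.values = deque(maxlen=window)  # old entries drop off automatically

    def update(self, logits):
        self.values.append(shannon_entropy_bits(logits))
        return sum(self.values) / len(self.values)

def calibrate_thresholds(prefill_entropies, low_q=0.25, high_q=0.75):
    """Assumed calibration: take two quantiles of the entropies seen during prefill."""
    ordered = sorted(prefill_entropies)
    def pick(frac):
        return ordered[min(len(ordered) - 1, int(frac * len(ordered)))]
    return pick(low_q), pick(high_q)
```

A uniform distribution over four tokens yields 2 bits, while a sharply peaked distribution yields close to 0 bits, which is the signal the gear selection relies on.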
Thresholds map entropy to gear tier. Low entropy = predictable = downshift. High entropy = uncertain = upshift. Hysteresis prevents oscillation near boundaries.
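One way to picture the hysteresis is a selector that only changes gear once the entropy clears a boundary by some margin. The class name `GearSelector`, the threshold values, and the margin below are all hypothetical; the sketch shows the general technique, not GEARBX's tuned behavior.

```python
class GearSelector:
    """Maps smoothed entropy (bits) to a gear, with a dead band around each boundary."""
    GEARS = ("4-bit", "8-bit", "fp16")

    def __init__(self, low_thresh=1.0, high_thresh=3.0, margin=0.25):
        # bounds[i] separates gear i from gear i + 1 (values are illustrative).
        self.bounds = (low_thresh, high_thresh)
        self.margin = margin
        self.gear = 1  # start in the middle gear

    def update(self, entropy_bits):
        # Upshift only when entropy exceeds the boundary above by the margin.
        while self.gear < 2 and entropy_bits > self.bounds[self.gear] + self.margin:
            self.gear += 1
        # Downshift only when entropy falls below the boundary beneath by the margin.
        while self.gear > 0 and entropy_bits < self.bounds[self.gear - 1] - self.margin:
            self.gear -= 1
        return self.GEARS[self.gear]

selector = GearSelector()
trace = [selector.update(h) for h in (0.5, 1.1, 1.3, 0.9, 4.0, 2.9, 0.2)]
print(trace)
```

In this trace, 1.1 bits is above the 1.0 threshold but inside the margin, so the selector stays in 4-bit; it takes 1.3 bits to trigger the upshift. That dead band is what keeps the gear from flapping when entropy hovers near a boundary.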
Low gear: attention layers hot-swapped to 4-bit packed weights. Mid gear: 8-bit. High gear: original fp16 restored. Unused weights offloaded to CPU for real memory savings.
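What "4-bit packed weights" means can be shown with a toy example that stores two 4-bit values per byte. This is a sketch under simplifying assumptions: symmetric per-tensor quantization, plain Python lists instead of tensors, and hypothetical function names. It does not reflect GEARBX's on-disk or in-memory format, and it leaves out the layer hot-swapping and CPU offloading entirely.

```python
def quantize_4bit(weights):
    """Symmetric quantization to signed integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def pack_nibbles(q):
    """Pack pairs of 4-bit values into bytes, low nibble first."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_nibbles(packed, count):
    """Recover signed 4-bit values from packed bytes."""
    vals = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)  # sign-extend
    return vals[:count]

def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]

weights = [7.0, -3.0, 1.0, -7.0, 0.0]
q, scale = quantize_4bit(weights)
packed = pack_nibbles(q)
restored = dequantize(unpack_nibbles(packed, len(weights)), scale)
print(len(packed), "bytes for", len(weights), "weights ->", restored)
```

Five weights fit in three bytes here, versus ten bytes in fp16. Real schemes typically use per-group scales to limit the rounding error.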
Forward pass through current-gear weights. On CUDA, fused Triton kernels multiply directly on packed data with no dequantization. Sample. Repeat.
npm install -g gearbx
curl -fsSL gearbx.jpdz.app/install | sh
Then run: gearbx
pip install "gearbx[mlx]"
pip install "gearbx[cuda]"
pip install gearbx