BitNet Inference Pipeline: From Model to Ternary Kernel
How bitnet.cpp converts and runs 1.58-bit models using lookup-table kernels
The pipeline has four stages: Model Source → Conversion → Optimized Runtime → Output. The CPU and GPU paths share the first stage, then diverge.
Model Source
Both paths start from the same HuggingFace model with ternary {-1, 0, 1} weights, shipped as safetensors. The shared checkpoint is converted to GGUF for the CPU path and packed into a GPU format for the GPU path; either runtime then streams tokens.
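For intuition about where the {-1, 0, 1} weights come from: BitNet b1.58 models are trained with an absmean quantizer that scales each weight matrix by its mean absolute value, then rounds and clips to the nearest ternary value. A minimal sketch (illustrative only, not bitnet.cpp code):

```python
def absmean_quantize(weights):
    """Map full-precision weights to {-1, 0, 1} via the absmean rule:
    scale by the mean absolute value, then round and clip to [-1, 1]."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    eps = 1e-8  # guard against an all-zero row
    return [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
```

bitnet.cpp consumes checkpoints that already carry these ternary weights; the quantizer runs during training, not at inference time.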
Conversion
- CPU path: GGUF Export (setup_env.py) quantizes the safetensors checkpoint to i2_s or tl1 GGUF.
- GPU path: Weight Packing (pack_weight.py) packs the ternary weights into the GPU kernel format.
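The real pack_weight.py layout is GPU-specific and more involved; as a hypothetical illustration of the core idea, a ternary value fits in 2 bits, so four weights pack into one byte:

```python
def pack_ternary(ws):
    """Pack ternary weights {-1, 0, 1} at 2 bits each, four per byte.
    (Illustrative layout only, not the actual pack_weight.py format.)"""
    assert len(ws) % 4 == 0
    out = bytearray()
    for i in range(0, len(ws), 4):
        b = 0
        for j, w in enumerate(ws[i:i + 4]):
            b |= (w + 1) << (2 * j)  # map {-1, 0, 1} -> {0, 1, 2}
        out.append(b)
    return bytes(out)

def unpack_ternary(data, n):
    """Inverse of pack_ternary: recover the first n ternary weights."""
    ws = []
    for b in data:
        for j in range(4):
            ws.append(((b >> (2 * j)) & 0b11) - 1)
    return ws[:n]
```

This 2-bits-per-weight packing is why the on-disk size lands near the 1.58-bit information-theoretic floor rather than exactly on it.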
Optimized Runtime: CPU LUT Kernels
The I2_S, TL1, and TL2 kernels turn each multiply into an add, a subtract, or a skip (for weights of 1, -1, and 0), use ARM NEON and x86 AVX2 intrinsics, and tile the matmul for parallelism.
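To make the lookup-table idea concrete, here is a minimal Python sketch (not the real TL1/TL2 data layout): for each pair of activations, precompute the nine possible partial sums over ternary weight pairs once, then reuse that table for every output row, so the inner loop is a table index instead of multiplies.

```python
def lut_matvec(W, x):
    """Compute W @ x where W holds rows of ternary {-1, 0, 1} weights.

    For each activation pair (x[i], x[i+1]), build a 9-entry table of all
    partial sums w0*x0 + w1*x1 over ternary (w0, w1). Each output row then
    indexes the table by its weight pair rather than multiplying.
    """
    n = len(x)
    assert n % 2 == 0
    tables = []
    for i in range(0, n, 2):
        a0, a1 = x[i], x[i + 1]
        tables.append([w0 * a0 + w1 * a1
                       for w0 in (-1, 0, 1) for w1 in (-1, 0, 1)])
    out = []
    for row in W:
        acc = 0
        for k, i in enumerate(range(0, n, 2)):
            # map the ternary pair to a table index in 0..8
            acc += tables[k][(row[i] + 1) * 3 + (row[i + 1] + 1)]
        out.append(acc)
    return out
```

The real kernels index packed low-bit weights directly and vectorize the lookups with NEON/AVX2 shuffle instructions, which is where the speedup over naive add/sub comes from.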
Optimized Runtime: GPU CUDA Kernels
Ternary matmul kernels with batch-parallel ops, implemented in bitnet_kernels/.
Output
- CPU Inference: 5-7 tok/s on a 100B-parameter model at 55-82% less energy; it runs on a laptop.
- GPU Inference: batch serving on NVIDIA CUDA GPUs.
FP16: 16 bits/weight = ~200 GB for 100B params
BitNet: 1.58 bits/weight = ~20 GB for 100B params
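These figures follow from simple arithmetic (note that 1.58 bits/weight at 100B parameters works out to just under 20 GB, while the 2-bits-per-weight packed format is about 25 GB):

```python
# Back-of-envelope memory math for a 100B-parameter model.
params = 100e9
fp16_gb = params * 16 / 8 / 1e9       # 16 bits per weight -> 200 GB
bitnet_gb = params * 1.58 / 8 / 1e9   # 1.58 bits per weight -> ~19.75 GB
packed_gb = params * 2 / 8 / 1e9      # 2-bit packed storage -> 25 GB
```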