BitNet Inference Pipeline: From Model to Ternary Kernel

How bitnet.cpp converts and runs 1.58-bit models using lookup-table kernels

[Diagram: the BitNet inference pipeline, reconstructed as text]

- HuggingFace model: ternary {-1, 0, 1} weights, shipped as safetensors (shared weights)
- GGUF export (setup_env.py): converts the safetensors checkpoint to GGUF with i2_s or tl1 quantization for the CPU path
- Weight packing (pack_weight.py): repacks the ternary weights into the GPU format
- CPU LUT kernels (I2_S / TL1 / TL2): multiplication becomes add/sub/skip, implemented with ARM NEON and x86 AVX2 using parallel tiling
- GPU CUDA kernels (bitnet_kernels/): ternary matmul and batch-parallel ops
- CPU inference: 5-7 tok/s on a 100B model, 55-82% less energy, runs on a laptop
- GPU inference: batch serving on NVIDIA CUDA

Memory footprint: FP16 at 16 bits/weight needs ~200 GB for 100B parameters; BitNet at 1.58 bits/weight needs ~20 GB (~25 GB when stored at 2 bits/weight).
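The "1.58 bits" figure is log2(3), the information content of a ternary digit; in practice weights are stored at 2 bits each, four per byte. The sketch below (Python/NumPy, with illustrative function names — the real i2_s/tl1 layouts in bitnet.cpp use kernel-specific tiling and orderings) shows the basic 2-bit packing and the 8x size reduction versus int8 storage.

```python
import numpy as np

def pack_ternary(weights: np.ndarray) -> np.ndarray:
    """Pack ternary weights {-1, 0, 1} into 2-bit codes, 4 per byte.

    Illustrative only: bitnet.cpp's i2_s/tl1/tl2 formats use
    kernel-specific tiling and index encodings on top of this idea.
    """
    codes = (weights.astype(np.int8) + 1).astype(np.uint8)  # -1,0,1 -> 0,1,2
    assert codes.size % 4 == 0, "pad to a multiple of 4 weights first"
    c = codes.reshape(-1, 4)
    return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Invert pack_ternary: expand each byte back into 4 ternary weights."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.astype(np.int8).reshape(-1) - 1

w = np.random.choice([-1, 0, 1], size=1024).astype(np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(w)), w)
print(f"{w.nbytes} bytes -> {pack_ternary(w).nbytes} bytes")  # prints: 1024 bytes -> 256 bytes
```

The same arithmetic explains the footprint above: 100B weights at 2 bits each is 25 GB on disk, versus 200 GB at FP16.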
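Because every weight is -1, 0, or 1, a matmul needs no multiplications: each weight either adds, subtracts, or skips an activation. The lookup-table kernels push this further by precomputing, for each small group of activations, the partial sum for every possible ternary weight pattern, so inference becomes table gathers plus accumulation. Here is a minimal NumPy sketch of that idea with group size 2 (9 patterns per group); the real TL1/TL2 kernels differ in group size, use quantized int8 tables, and vectorize the gathers with NEON/AVX2.

```python
import numpy as np
from itertools import product

def build_lut(x: np.ndarray, g: int = 2) -> np.ndarray:
    """Per activation group of size g, tabulate the partial sum for
    every possible ternary weight pattern (3**g entries per group)."""
    patterns = np.array(list(product([-1, 0, 1], repeat=g)), dtype=np.int8)  # (3**g, g)
    xg = x.reshape(-1, g)          # (K/g, g) activation groups
    return xg @ patterns.T.astype(x.dtype)  # (K/g, 3**g) precomputed sums

def lut_matvec(w: np.ndarray, lut: np.ndarray, g: int = 2) -> np.ndarray:
    """Mat-vec as pure table lookups: encode each weight group as a
    base-3 index, gather its precomputed partial sum, accumulate."""
    wg = w.reshape(w.shape[0], -1, g) + 1       # ternary codes 0..2
    idx = np.zeros(wg.shape[:2], dtype=np.int64)
    for k in range(g):                          # base-3 index per group
        idx = idx * 3 + wg[:, :, k]
    # lut[j, idx[m, j]] is row m's partial sum over activation group j
    return lut[np.arange(lut.shape[0]), idx].sum(axis=1)

rng = np.random.default_rng(0)
w = rng.integers(-1, 2, size=(4, 8)).astype(np.int8)
x = rng.standard_normal(8).astype(np.float32)
assert np.allclose(lut_matvec(w, build_lut(x)), w.astype(np.float32) @ x)
```

The payoff is that the table is built once per activation vector and then reused across every output row, amortizing all the add/sub work into O(3**g) precomputation per group.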