Abstract
The biggest bottleneck in autoregressive Large Language Model decoding at batch size 1 isn’t compute, but memory bandwidth and the thermal cost of heavy floating-point math. In this post, we present a data-free quantization architecture that we developed to compress modern models down to a 3.125-bit footprint while preserving complex reasoning and coding capabilities. By leveraging mathematical smoothing and vector quantization, we replace a large portion of standard matrix multiplications with LUT operations and bitwise additions. This near-multiplication-free approach suggests that future AI inference chips may not need massive Tensor Cores; they may instead benefit more from optimized memory routing and simple ALUs, enabling state-of-the-art intelligence on ultra-low-power edge devices.
The Bottleneck: The Memory Wall, Tensor Cores, and the 3-Bit Death Zone
If you want to understand why local Large Language Models are so difficult to deploy, you have to look at the physical realities of silicon.
During autoregressive decoding at batch size 1, running an LLM doesn’t require massive amounts of computational power. Each weight in a dense layer is used for a single multiply-add per token, so compute is cheap compared with moving weights, and the true bottleneck is memory bandwidth.
To generate a single token, the device must load essentially the entire neural network from memory into the compute cores. The processors finish the math for one layer quickly, then sit idle while the next layer of weights travels across the memory bus.
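To make the memory-bound claim concrete, here is a back-of-the-envelope sketch. It assumes every generated token streams each weight exactly once and ignores caches, KV-cache traffic, and compute time. The 16 GB model size and the 50 GB/s bandwidth are hypothetical numbers chosen only for illustration.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when every token must stream all weights once."""
    return bandwidth_bytes_per_s / model_bytes


GB = 1e9
# Hypothetical numbers, chosen only for illustration.
fp16_model = 16 * GB        # an 8B-parameter model at 2 bytes per weight
edge_bandwidth = 50 * GB    # assumed memory bandwidth in bytes per second

fp16_rate = max_tokens_per_second(fp16_model, edge_bandwidth)
# The same weights stored at 3.125 bits instead of 16 bits per weight.
compressed_rate = max_tokens_per_second(fp16_model * 3.125 / 16, edge_bandwidth)

print(f"FP16 bound: {fp16_rate:.2f} tok/s, 3.125 bpw bound: {compressed_rate:.2f} tok/s")
```

Under these assumptions the bound scales inversely with the bytes per weight, which is why shrinking the weights is the first lever to pull.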
When you try to move this architecture to edge devices, this bottleneck turns into a multi-front war:
- The Hardware Mismatch: You cannot put an NVIDIA datacenter GPU in a robot, a smartphone, or a local network rack. More importantly, edge devices are not Tensor Core dense. They lack the massive, dedicated arrays of hardware multipliers that datacenter GPUs use to brute-force dense matrix math.
- The Energy Drain: The physical act of continuously moving 15 to 20 gigabytes of weights across a circuit board drains a battery in minutes. But the math itself is also heavily taxing. Standard FP16 matrix multiplications (MatMuls) require complex, power-hungry floating-point logic circuits. On an edge device, sustaining heavy FP16 math quickly hits a thermal wall, throttling the chip to prevent it from overheating.
The Hidden Tax of Standard Quantization
To solve the memory wall and make edge inference viable, the obvious solution is compression. If we shrink the weights, we move less data, and power consumption drops.
However, widely used weight-only quantization schemes (such as AWQ, GPTQ, or the quantized formats stored in GGUF files) come with a hidden penalty. When these frameworks run a 4-bit model, they typically aren’t doing the matrix multiplication in 4-bit integer math. The hardware loads the compressed weight, dequantizes it back to FP16 on the fly in the registers, and then executes the FP16 MatMul.
This continuous on-the-fly dequantization adds a heavy layer of latency and energy consumption to every single token generated. You save on memory bandwidth, but you are still paying the FP16 energy toll.
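A minimal sketch of what that dequantize-then-multiply path looks like for one group of weights. It assumes a generic scale and zero-point scheme; it is not the exact kernel of AWQ, GPTQ, or any GGUF backend, and the function names and values are illustrative.

```python
def dequantize_group(codes, scale, zero_point):
    """Map 4-bit integer codes back to floats: w = scale * (code - zero_point)."""
    return [scale * (c - zero_point) for c in codes]


def dequant_dot(codes, scale, zero_point, activations):
    """Weight-only quantized dot product: dequantize first, then multiply in float."""
    weights = dequantize_group(codes, scale, zero_point)      # extra work on every token
    return sum(w * a for w, a in zip(weights, activations))   # still one float multiply per weight


codes = [0, 15, 8, 7]                  # hypothetical 4-bit codes for one small group
activations = [1.0, 0.5, -2.0, 4.0]
result = dequant_dot(codes, scale=0.1, zero_point=8, activations=activations)
print(result)
```

The memory traffic is 4 bits per weight, but the arithmetic is unchanged: one floating-point multiply per weight, plus the dequantization itself.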
The 3-Bit “Death Zone”
To fit a modern 8B parameter model like Qwen 3 into a roughly 3 GB footprint, you are forced to push past 4-bit formats and use hard quantization, compressing weights down to 3 bits or less.
Herein lies the final problem: at 3 bits, standard scalar rounding completely breaks down. You enter the “Death Zone.” The quantization grid becomes too coarse, and the highly sensitive outlier weights, which act as the load-bearing pillars for the model’s logic, get arbitrarily clipped. The model’s perplexity skyrockets, its coding skills degrade, and its reasoning becomes unreliable.
To successfully run these models on the edge, we cannot just round the weights off, and we cannot afford the energy tax of FP16 dequantization. We need a fundamentally different mathematical approach that protects the model’s intelligence while retiring most of the hardware multiplier workload.
How to reach a data-free 3-bit quantization
To solve the outlier problem and compress the weights down to 3 bits without losing accuracy, we developed a multi-stage quantization pipeline. Based on our implementation, here is how a single transformer layer (such as the MLP projections) is compressed:
Algorithm 1: Data-Free, 3-Bit Quantization Pipeline
Let’s break down the intuition behind these steps:
- Mathematical Smoothing (Hadamard Rotation): Similar in spirit to QuaRot, we first apply a randomized Hadamard transform to the weight matrices before quantization. This rotates the weight space, effectively spreading the outlier energy across all weights. The result is a much smoother distribution that is far easier to quantize without clipping crucial information.
- Vector Quantization (Product Quantization): With the weights smoothed, we apply Product Quantization. This places the method in the same broad codebook-based compression family as approaches such as AQLM, but here we use a single sensitivity-aware codebook plus a binary residual path tailored to our inference design. The weight matrix is reshaped into small blocks (vectors), and we use sensitivity-based K-Means clustering to build a representative codebook of centroids. Rather than treating all weights equally, this sensitivity-aware approach ensures that the most critical weights are preserved with higher fidelity. Each block of weights is then replaced by a simple pointer (index) to the closest centroid in the codebook, drastically reducing the memory footprint.
- 1-Bit Binary Residuals: To capture the remaining quantization error, we compute the residual difference between our quantized approximation and the rotated weights. We store this error as a highly compressed 1-bit sign along with a scaling factor ($\alpha$) for a specific group size.
Conceptually, our pipeline sits at the intersection of QuaRot-style smoothing and AQLM-style codebook compression, but extends that combination with a binary residual path and a LUT-oriented inference formulation.
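The smoothing effect can be illustrated on a toy example. The sketch below applies a fast Walsh-Hadamard transform with random sign flips to a single synthetic weight row containing one injected outlier. The row length, noise level, outlier value, and function names are assumptions for illustration, not values or code from our pipeline.

```python
import math
import random


def fwht(vec):
    """Fast Walsh-Hadamard transform, normalized so the transform is orthogonal."""
    out = list(vec)
    n = len(out)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def randomized_hadamard_rotate(vec, signs):
    """Flip signs with a fixed random +/-1 vector, then apply the Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, vec)])


random.seed(0)
n = 256
row = [random.gauss(0.0, 0.02) for _ in range(n)]
row[17] = 3.0  # one artificial outlier weight
signs = [random.choice((-1.0, 1.0)) for _ in range(n)]
rotated = randomized_hadamard_rotate(row, signs)

norm_before = math.sqrt(sum(v * v for v in row))
norm_after = math.sqrt(sum(v * v for v in rotated))
peak_before = max(abs(v) for v in row)
peak_after = max(abs(v) for v in rotated)
print(f"norm {norm_before:.4f} -> {norm_after:.4f}, peak {peak_before:.4f} -> {peak_after:.4f}")
```

The norm is preserved because the transform is orthogonal, while the single large entry is spread across all 256 positions, which is the property the quantizer benefits from.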
By combining these techniques, we achieve a highly compressed format consisting of codebooks, pointers, packed residual signs, and scales. But how much are we actually compressing? Let’s break down the math to see how we reach exactly 3.125 bits per weight.
The 3.125-Bit Math Breakdown
Let’s assume a standard weight matrix of size $4096 \times 4096$, and we apply the following compression parameters:
- Block Dimension ($D$): 4 weights per block
- Codebook Size ($K$): 256 centroids
- Group Size ($G$): 128 weights per scale
Here is a simplified pseudo-code replicating the core function from our pipeline:
    import torch

    # Helper functions (randomized_hadamard, run_sensitivity_kmeans, ...) are
    # placeholders for pipeline internals and are not defined in this snippet.
    def quantize(W, D=4, K=256, G=128):
        """Compress one weight matrix W of shape (out_features, in_features)."""
        # 1. Apply Randomized Hadamard Transform to smooth outliers
        Q = randomized_hadamard(dim=W.shape[1])
        W_rot = W @ Q
        # 2. Reshape weights into blocks (e.g., vectors of size 4)
        blocks = reshape_to_blocks(W_rot, block_dim=D)
        # 3. Run sensitivity-aware K-Means to build the codebook and assign pointers
        codebook, pointers = run_sensitivity_kmeans(blocks, k=K)
        # 4. Calculate the residual error from the K-Means approximation
        approx = reconstruct_from_codebook(codebook, pointers)
        residual = W_rot - approx
        # 5. Extract 1-bit signs from the residual. torch.sign maps 0 to 0, which
        #    does not fit in one bit, so exact zeros are treated as +1.
        ones = torch.ones_like(residual)
        signs = torch.where(residual >= 0, ones, -ones)
        # 6. Compute the optimal FP16 scaling factor (alpha) per group
        alpha = compute_optimal_scales(residual, signs, group_size=G)
        # 7. Pack the binary signs densely into uint8
        packed_signs = pack_sign_bits(signs)
        return codebook, pointers, packed_signs, alpha
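The helpers behind steps 5 to 7 are not spelled out above, so here is one plausible way to fill them in, as a sketch. For a residual group $r$ with signs $s$, the scale that minimizes $\lVert r - \alpha s \rVert^2$ is the mean absolute value of $r$. The function names, the choice to map zero to +1, the least-significant-bit-first packing order, and the requirement that the length is a multiple of the group size are all assumptions, not necessarily what the actual pipeline does.

```python
def sign_bits(residual):
    """Return 1 for non-negative entries and 0 for negative ones (zero maps to +1)."""
    return [1 if r >= 0 else 0 for r in residual]


def group_scales(residual, group_size):
    """Least-squares scale per group: alpha = mean(|r|) minimizes ||r - alpha * sign(r)||^2."""
    return [
        sum(abs(r) for r in residual[g:g + group_size]) / group_size
        for g in range(0, len(residual), group_size)
    ]


def pack_bits(bits):
    """Pack eight sign bits into each byte, least significant bit first."""
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for offset, bit in enumerate(bits[start:start + 8]):
            byte |= bit << offset
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed, count):
    """Inverse of pack_bits: recover the first `count` bits."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(count)]


residual = [0.3, -0.1, 0.2, -0.4, 0.05, -0.05, 0.0, 0.1]   # toy residual, two groups of 4
bits = sign_bits(residual)
alpha = group_scales(residual, group_size=4)
packed = pack_bits(bits)
# Reconstruct the residual approximation: +alpha or -alpha per weight.
approx = [alpha[i // 4] * (1 if b else -1) for i, b in enumerate(unpack_bits(packed, len(bits)))]
```

Eight residuals collapse into a single byte of signs plus two scales, which is where the 1.0 and 0.125 bits per weight in the breakdown come from once the group size is 128.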
Assuming our original weight matrix W is $4096 \times 4096$, let’s calculate the exact bit-rate per weight (excluding the codebook, whose size is negligible when amortized over millions of parameters):
- Pointers: 8 bits per pointer / 4 weights per block = 2.0 bits/weight
- Residual Signs: 1 bit per weight = 1.0 bits/weight
- Group Scales: 16 bits per scale / 128 weights per group = 0.125 bits/weight
- Total Footprint: 2.0 + 1.0 + 0.125 = 3.125 bits/weight
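The same arithmetic as a small function, which also shows how little the shared codebook adds once it is amortized over a 4096 x 4096 layer (the layer size used in the multiplication count later in this post). The function name and the assumption of 16-bit centroids and scales are illustrative.

```python
def bits_per_weight(rows, cols, block_dim=4, codebook_size=256, group_size=128,
                    scale_bits=16, centroid_bits=16):
    """Return (bits/weight without the codebook, bits/weight including it)."""
    n_weights = rows * cols
    pointer_bits = (codebook_size - 1).bit_length()   # 8 bits address 256 centroids
    pointers = pointer_bits / block_dim               # one pointer shared by a block
    signs = 1.0                                       # one residual sign per weight
    scales = scale_bits / group_size                  # one scale shared by a group
    codebook = codebook_size * block_dim * centroid_bits / n_weights
    core = pointers + signs + scales
    return core, core + codebook


core_bpw, total_bpw = bits_per_weight(4096, 4096)
print(core_bpw, total_bpw)
```

For this layer size the codebook contributes under one thousandth of a bit per weight, which is why it is left out of the headline figure.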
How to run inference on 3.125 bits
With the weights highly compressed, the next challenge is executing the forward pass efficiently. We need to compute the matrix multiplication without ever dequantizing the weights back to FP16.
The Anatomy of a Compressed Layer
In our new architecture, a single linear layer no longer contains a massive FP16 weight matrix $W$. Instead, it is composed of:
- $C$: The shared codebook (e.g., 256 FP16 centroids).
- $P$: The PQ Pointers (`uint8` indices).
- $B$: The 1-bit Binary Residuals (packed into `uint8`).
- $\alpha$: The FP16 scaling factors for the residuals.
The Naive Approach: On-the-Fly Dequantization
The most straightforward way to run inference would be to take the incoming FP16 activation vector $x$, load the compressed components ($C$, $P$, $B$, $\alpha$) from memory, reconstruct the FP16 weights on the fly, and then execute a standard dense matrix multiplication.
While this solves the memory bandwidth problem by moving less data, it entirely defeats the purpose of removing the FP16 computational tax. The hardware is still forced to execute massive floating-point MatMuls, burning through the power budget of edge devices.
The Inference Equation: Pre-computing the Math
Instead of dequantizing the weights, we can refactor the math to operate directly on the compressed structures. The fundamental realization is this: for the shared-codebook term, no matter what the incoming activations are, there are only $K$ (e.g., 256) possible centroid vectors any given block will ever be multiplied against.
Let’s trace the mathematical transition. In a standard layer (ignoring rotation for a moment), the output is computed as:
$$y = W x$$
In our quantized model, the weight matrix is approximated by the codebook ($C$) indexed by pointers ($P$), plus the scaled binary residuals ($\alpha \odot B$). Substituting this into the standard equation:
$$y \approx \left( C[P] + \alpha \odot B \right) x$$
Here $C[P]$ denotes the matrix assembled by placing centroid $C[P_{i,j}]$ at output row $i$, block $j$, and $\alpha \odot B$ applies each group scale to its $\pm 1$ signs.
To see where the LUT trick comes from, focus on a single activation block $x_j \in \mathbb{R}^{D}$. The codebook is shared across the entire layer. In a classical PQ implementation, the pointer $P_{i,j}$ tells us which centroid vector from that layer-wide codebook is used for output row $i$ and block $j$:
$$c_{i,j} = C[P_{i,j}]$$
The contribution of that block to the dot product is then:
$$c_{i,j}^\top x_j = C[P_{i,j}]^\top x_j$$
At this point, the pointer is still only acting as an address into a table of centroid vectors. We fetch a centroid, and then we still have to multiply by it.
But for a fixed activation block $x_j$, there are only $K$ possible centroids it can ever interact with from the shared codebook. So instead of repeatedly fetching centroid vectors and multiplying them one by one, we pre-compute all possible dot products once:
$$\mathrm{LUT}_j[k] = C[k]^\top x_j, \qquad k = 0, \dots, K-1$$
Now the same pointer no longer needs to retrieve a centroid vector. It can retrieve the final scalar result directly from the activation LUT:
$$C[P_{i,j}]^\top x_j = \mathrm{LUT}_j[P_{i,j}]$$
So we have effectively moved from a lookup table of centroid vectors to a temporary activation-conditioned lookup table of already-computed dot-product results. For a full output row, with $n$ the input dimension, Engine 1 becomes:
$$y_i^{(1)} = \sum_{j=0}^{n/D-1} \mathrm{LUT}_j[P_{i,j}]$$
Because matrix multiplication is distributive, we can split this single massive operation into two independent, highly efficient blocks. Factoring in the orthogonal rotations ($Q_{\text{in}}$ and $Q_{\text{out}}$), we arrive at the following algebraic form of the Inference Equation:
$$y \approx Q_{\text{out}} \Big( \underbrace{C[P]\,\tilde{x}}_{\text{Engine 1}} + \underbrace{(\alpha \odot B)\,\tilde{x}}_{\text{Engine 2}} \Big), \qquad \tilde{x} = Q_{\text{in}}^\top x$$
In hardware, we do not execute Engine 1 as a dense multiply against reconstructed centroid blocks. We lower it into the LUT form derived above, where the pointer path becomes $\mathrm{LUT}_j[P_{i,j}]$ instead of $C[P_{i,j}]$ followed by a dot product.
If we break this down, we can execute the math in a fundamentally different way:
- Rotate the Input: First, we apply the input rotation: $\tilde{x} = Q_{\text{in}}^\top x$.
- Build the Activation Lookup Table (Engine 1): Instead of multiplying the activations with the reconstructed weights, we split $\tilde{x}$ into small blocks of size $D$ and calculate the dot product of each block against the shared Codebook ($C$) directly. Since the codebook has only 256 centroids for the entire layer, we build a small 256-entry LUT for each activation block and store the results in fast SRAM.
- The Dual-Engine Compute: Now, to generate the outputs, we don’t do any more FP16 multiplication for the main weights. We simply read the 8-bit pointers ($P$) from memory and fetch the pre-computed scalar results directly from our SRAM LUT. Simultaneously (Engine 2), we read the 1-bit residual signs ($B$) and use integer ALUs to simply add or subtract the corresponding activation values, scaling each residual group by $\alpha$.
- Rotate the Output: Finally, we apply the output rotation: $y = Q_{\text{out}}\,\tilde{y}$, where $\tilde{y}$ is the sum of the two engine outputs.
By calculating all 256 possible multiplications for each activation block against the shared codebook upfront, we bypass the massive Tensor Core MatMul over the full weight matrix. The heavy lifting shifts to rapid SRAM gathers of pre-computed activation-LUT results (Engine 1) and bitwise additions (Engine 2).
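The dual-engine path can be checked numerically against the naive dequantize-then-multiply path. The sketch below does this on a toy layer in plain Python, leaving the rotations out and using random values for the codebook, pointers, signs, and scales. The tiny sizes, the variable names, and the use of floating-point adds for Engine 2 are assumptions for illustration; it is not our inference kernel.

```python
import random

random.seed(0)
D, K, G = 4, 8, 4            # block size, codebook size, residual group size (toy values)
ROWS, COLS = 6, 16
BLOCKS = COLS // D

codebook = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
pointers = [[random.randrange(K) for _ in range(BLOCKS)] for _ in range(ROWS)]
signs = [[random.choice((-1, 1)) for _ in range(COLS)] for _ in range(ROWS)]
alpha = [[random.uniform(0.01, 0.1) for _ in range(COLS // G)] for _ in range(ROWS)]
x = [random.gauss(0, 1) for _ in range(COLS)]


def dense_reference():
    """Naive path: rebuild every weight, then do a full dot product per row."""
    out = []
    for i in range(ROWS):
        w = [codebook[pointers[i][c // D]][c % D] + alpha[i][c // G] * signs[i][c]
             for c in range(COLS)]
        out.append(sum(wc * xc for wc, xc in zip(w, x)))
    return out


def lut_inference():
    """LUT path: multiply only while building the table and applying group scales."""
    # Engine 1 setup: one K-entry table of dot products per activation block.
    lut = [[sum(codebook[k][d] * x[j * D + d] for d in range(D)) for k in range(K)]
           for j in range(BLOCKS)]
    out = []
    for i in range(ROWS):
        # Engine 1: gather precomputed dot products through the pointers.
        acc = sum(lut[j][pointers[i][j]] for j in range(BLOCKS))
        # Engine 2: add or subtract activations, then scale once per group.
        for g in range(COLS // G):
            group_sum = sum(x[c] if signs[i][c] > 0 else -x[c]
                            for c in range(g * G, (g + 1) * G))
            acc += alpha[i][g] * group_sum
        out.append(acc)
    return out


dense = dense_reference()
fast = lut_inference()
max_diff = max(abs(a - b) for a, b in zip(dense, fast))
print(f"max abs difference: {max_diff:.2e}")
```

Both paths compute the same outputs up to floating-point rounding. The difference is where the multiplications happen: the LUT build costs BLOCKS x K x D multiplies regardless of the number of output rows, and each row then needs only gathers, adds, and one multiply per residual group.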
Why We Do Not Need Explicit Rotate/Unrotate Ops
There is an important implementation detail hiding inside this construction: at inference time, we do not need separate kernels that rotate activations into a new basis and then rotate them back out.
The activations stay in FP16, while the linear layers are executed through the shared codebook plus an activation LUT. Because the weights were rotated offline before quantization, the basis change is already baked into the centroids and pointer tables. In other words, the codebooks carry the rotation for us.
What gets rotated offline
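Let $Q$ be an orthogonal matrix:
$$Q^\top Q = Q Q^\top = I$$
Before quantization, we transform the dense weights once. Writing $W_{\text{in}}$ for projections that read from the residual stream and $W_{\text{out}}$ for projections that write back to it:
$$\tilde{W}_{\text{in}} = W_{\text{in}}\,Q, \qquad \tilde{W}_{\text{out}} = Q^\top W_{\text{out}}$$
Only after this transformation do we run K-Means, build the shared codebook, and store the pointer tables. So the quantized representation itself is already aligned with the rotated basis.
What happens during inference
The residual stream lives in the rotated space:
$$\tilde{h} = Q^\top h$$
Now consider an entry projection, for example the query matrix. The linear layer implicitly cancels the rotation:
$$\tilde{W}_q\,\tilde{h} = W_q\,Q\,Q^\top h = W_q\,h$$
So the tensor that reaches the non-linear part of the transformer is exactly the same tensor the original model would have produced.
At the exit of the sub-layer, the reverse happens automatically. For example, with $a$ the attention output in the standard basis, the attention output projection restores the rotated residual basis:
$$\tilde{W}_o\,a = Q^\top (W_o\,a)$$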
So there is never a need for an explicit runtime operation of the form “rotate activations now” or “unrotate activations now.” The shared-codebook-plus-activation-LUT linear layers already perform the required basis changes algebraically.
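A small numerical check of this cancellation, as a sketch. It assumes a column-vector convention in which entry weights are rotated as $W Q$ and exit weights as $Q^\top W$, and it uses a sign-flipped 4x4 Hadamard matrix as the orthogonal $Q$. The matrix sizes, helper names, and random values are illustrative only.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, v):
    """Matrix times column vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


# A small orthogonal matrix: normalized 4x4 Hadamard with random column sign flips.
random.seed(0)
H = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
flips = [random.choice((-1, 1)) for _ in range(4)]
Q = [[0.5 * H[i][j] * flips[j] for j in range(4)] for i in range(4)]
Qt = transpose(Q)

W_in = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]    # entry projection (3 x 4)
W_out = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]   # exit projection (4 x 3)
h = [random.gauss(0, 1) for _ in range(4)]                           # standard-basis residual

# Offline: rotate the weights once.
W_in_rot = matmul(W_in, Q)       # entry projections absorb Q on the input side
W_out_rot = matmul(Qt, W_out)    # exit projections absorb Q^T on the output side

# Runtime: the residual stream is kept in the rotated basis.
h_rot = matvec(Qt, h)

entry_original = matvec(W_in, h)
entry_rotated = matvec(W_in_rot, h_rot)            # W_in Q Q^T h, which equals W_in h

exit_rotated = matvec(W_out_rot, entry_rotated)    # Q^T (W_out a)
expected_exit = matvec(Qt, matvec(W_out, entry_original))
```

The entry projection returns the same values as the unrotated model, and the exit projection lands directly in the rotated basis, without any separate rotate or unrotate step on the activations.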
Proof Of Equivalence
We can state the result more formally: a transformer block executing on rotated activations with offline-rotated weights is functionally identical to the original block, up to a persistent global rotation of the residual stream.
1. The RMSNorm Prerequisite
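A standard RMSNorm operation consists of length normalization followed by an element-wise scaling by a learnable affine gain $\gamma$:
$$\mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$
Because the element-wise affine gain does not commute with an orthogonal rotation matrix (i.e., $\mathrm{diag}(\gamma)\,Q \neq Q\,\mathrm{diag}(\gamma)$ in general), we must first fold $\gamma$ into the successor linear layers offline. This leaves a parameter-free normalization operator at runtime, which does commute with orthogonal rotations since it relies solely on the norm:
$$\mathrm{Norm}(x) = \frac{x}{\sqrt{\tfrac{1}{d}\lVert x \rVert_2^2 + \epsilon}}, \qquad \mathrm{Norm}(Q^\top x) = Q^\top\,\mathrm{Norm}(x)$$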
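Both halves of this prerequisite can be verified on a tiny example. The sketch below assumes a column-vector convention, folds the gain into a successor weight matrix as $W\,\mathrm{diag}(\gamma)$, and uses a normalized 4x4 Hadamard matrix as the rotation. The helper names and all numeric values are illustrative.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Parameter-free normalization: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matvec(matrix, vec):
    """Matrix times column vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


# Normalized 4x4 Hadamard matrix: symmetric and orthogonal, so Q^T equals Q here.
Q_T = [[0.5 * s for s in row] for row in
       ([1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1])]

gamma = [0.5, 1.5, 2.0, 0.25]                            # learnable gain (toy values)
x = [1.0, -2.0, 0.5, 3.0]                                # one activation vector
W = [[0.2, -0.1, 0.4, 0.3], [-0.5, 0.6, 0.1, -0.2]]      # successor linear layer (2 x 4)

# 1. Folding: W diag(gamma) applied after plain normalization equals W applied after RMSNorm.
W_folded = [[w * g for w, g in zip(row, gamma)] for row in W]
with_gain = matvec(W, [g * v for g, v in zip(gamma, rms_normalize(x))])
folded = matvec(W_folded, rms_normalize(x))

# 2. The parameter-free normalization commutes with the rotation.
norm_then_rotate = matvec(Q_T, rms_normalize(x))
rotate_then_norm = rms_normalize(matvec(Q_T, x))

# 3. The element-wise gain does not commute with the rotation.
gain_then_rotate = matvec(Q_T, [g * v for g, v in zip(gamma, x)])
rotate_then_gain = [g * v for g, v in zip(gamma, matvec(Q_T, x))]
gain_mismatch = max(abs(a - b) for a, b in zip(gain_then_rotate, rotate_then_gain))
```

Steps 1 and 2 agree to floating-point precision, while step 3 produces a clearly different vector, which is why the gain has to be folded into the weights before the rotation is applied.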
2. Offline Setup
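Assume the incoming residual state is already in the rotated basis:
$$\tilde{h} = Q^\top h$$
Offline, we first fold the affine gains into the entry projection weights:
$$W'_{\{q,k,v\}} = W_{\{q,k,v\}}\,\mathrm{diag}(\gamma_{\text{attn}}), \qquad W'_{\{\text{gate},\text{up}\}} = W_{\{\text{gate},\text{up}\}}\,\mathrm{diag}(\gamma_{\text{mlp}})$$
Next, we apply the orthogonal rotation to all weights:
$$\tilde{W}_{\{q,k,v\}} = W'_{\{q,k,v\}}\,Q, \qquad \tilde{W}_{\{\text{gate},\text{up}\}} = W'_{\{\text{gate},\text{up}\}}\,Q, \qquad \tilde{W}_{o} = Q^\top W_{o}, \qquad \tilde{W}_{\text{down}} = Q^\top W_{\text{down}}$$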
Phase 1: Attention
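Because the affine gain is folded into the weights, the normalization step preserves the orthogonal rotation:
$$\mathrm{Norm}(\tilde{h}) = \mathrm{Norm}(Q^\top h) = Q^\top\,\mathrm{Norm}(h)$$
The entry projections now implicitly unrotate the state, returning it to the standard basis:
$$q = \tilde{W}_q\,\mathrm{Norm}(\tilde{h}) = W_q\,\mathrm{diag}(\gamma_{\text{attn}})\,Q\,Q^\top\,\mathrm{Norm}(h) = W_q\,\mathrm{RMSNorm}(h)$$
By the exact same cancellation mechanism:
$$k = \tilde{W}_k\,\mathrm{Norm}(\tilde{h}) = W_k\,\mathrm{RMSNorm}(h), \qquad v = \tilde{W}_v\,\mathrm{Norm}(\tilde{h}) = W_v\,\mathrm{RMSNorm}(h)$$
Because the query, key, and value tensors ($q$, $k$, $v$) have returned to the standard space, the core attention operations (RoPE, Softmax, and value mixing) compute exactly the same tensors as the original block:
$$a = \mathrm{Attention}(q, k, v)$$
The output projection then re-rotates the result back into the global residual basis:
$$\tilde{W}_o\,a = Q^\top W_o\,a$$
Finally, the residual addition preserves the rotated-state invariant:
$$\tilde{h}_{\text{mid}} = \tilde{h} + \tilde{W}_o\,a = Q^\top (h + W_o\,a) = Q^\top h_{\text{mid}}$$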
Phase 2: MLP
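The same mechanism applies to the MLP. The normalization step commutes with the rotation:
$$\mathrm{Norm}(\tilde{h}_{\text{mid}}) = Q^\top\,\mathrm{Norm}(h_{\text{mid}})$$
The gate and up projections implicitly unrotate the activation before the non-linearity:
$$g = \tilde{W}_{\text{gate}}\,\mathrm{Norm}(\tilde{h}_{\text{mid}}) = W_{\text{gate}}\,\mathrm{RMSNorm}(h_{\text{mid}}), \qquad u = \tilde{W}_{\text{up}}\,\mathrm{Norm}(\tilde{h}_{\text{mid}}) = W_{\text{up}}\,\mathrm{RMSNorm}(h_{\text{mid}})$$
SwiGLU therefore acts exactly on the standard-space activations:
$$m = \mathrm{SiLU}(g) \odot u$$
The down projection re-rotates the MLP output:
$$\tilde{W}_{\text{down}}\,m = Q^\top W_{\text{down}}\,m$$
The final residual addition preserves the invariant through the end of the block:
$$\tilde{h}_{\text{out}} = \tilde{h}_{\text{mid}} + \tilde{W}_{\text{down}}\,m = Q^\top (h_{\text{mid}} + W_{\text{down}}\,m) = Q^\top h_{\text{out}}$$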
Conclusion
By induction through the layer, the full block satisfies:
$$\mathrm{Block}_{\text{rot}}(Q^\top h) = Q^\top\,\mathrm{Block}(h)$$
The non-linear pieces of the transformer always see the same standard-space tensors as the original model, while the residual stream remains consistently in the rotated basis from one end of the block to the other. In a deployed implementation, these basis changes and affine gains are permanently absorbed into the rotated, quantized shared codebook, eliminating the need for standalone rotate/unrotate activation kernels at runtime.
The Math: How Many MatMuls Did We Just Delete?
To truly appreciate the hardware impact of this equation, let’s calculate exactly how many FP16 multiplications we eliminated from the inference pass.
Assume we are generating a single token (batch size 1) through our $4096 \times 4096$ layer, with $D = 4$, $K = 256$, and $G = 128$.
1. The Standard FP16 Baseline: A standard dense layer performs one multiplication for every single weight.
- Total Multiplications: $4096 \times 4096 = 16{,}777{,}216$.
2. The 3.125-Bit Approach (Our Equation): Instead of doing the full MatMul, our hardware executes the Inference Equation in its LUT-based form. Let’s count the multiplications:
- Engine 1 (The LUT): The input vector has length 4096, so with $D = 4$ it is split into $4096 / 4 = 1024$ activation blocks. For each block, we multiply against all 256 centroids in the shared layer-wide codebook, and each centroid is a vector of length 4. Building the full LUT therefore takes: $1024 \times 256 \times 4 = 1{,}048{,}576$.
- Engine 2 (The Residuals): The 1-bit residuals require zero multiplications (they use integer ALUs for addition/subtraction). At the end of each residual group, we multiply the accumulated sum by the group scale $\alpha$. With $G = 128$, there are $4096 / 128 = 32$ scales per row. For all 4096 rows: $32 \times 4096 = 131{,}072$.
- Total Multiplications: $1{,}048{,}576 + 131{,}072 = 1{,}179{,}648$.
The Reduction: By pre-computing the LUT and relying on SRAM lookups and integer additions, we dropped the heavy FP16 multiplication workload from 16.8 million down to about 1.18 million.
That is nearly a 93.0% reduction in FP16 multiplications for the core layer weights. This removes the vast majority of the power-hungry hardware multiplier workload and suggests a viable path toward running state-of-the-art models on ultra-low-power edge devices and FPGAs.
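The count above can be reproduced with a few lines of arithmetic, which also makes it easy to try other layer shapes. The function name and dictionary keys are illustrative; the defaults match the parameters used in this post.

```python
def multiplication_counts(rows=4096, cols=4096, block_dim=4, codebook_size=256, group_size=128):
    """Count multiplications for one token through a dense layer versus the LUT form."""
    dense = rows * cols                                        # one multiply per weight
    lut_build = (cols // block_dim) * codebook_size * block_dim
    residual_scales = rows * (cols // group_size)              # one multiply per residual group
    compressed = lut_build + residual_scales
    return {
        "dense": dense,
        "lut_build": lut_build,
        "residual_scales": residual_scales,
        "compressed": compressed,
        "reduction": 1 - compressed / dense,
    }


counts = multiplication_counts()
print(counts)
```

Note that the LUT build cost does not depend on the number of output rows, so under this counting the relative savings grow as the layer gets taller.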
Benchmarking the approach
To validate this architecture, we applied the uniform 3.125 bpw pipeline to Gemma 2 9B.
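Before looking at downstream task performance, we also measured the raw quantization error using Normalized Mean Squared Error (NMSE), where $\hat{W}$ is the reconstructed weight matrix:
$$\mathrm{NMSE} = \frac{\lVert W - \hat{W} \rVert_F^2}{\lVert W \rVert_F^2}$$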
Across all layers, our quantized reconstruction reached an average NMSE of 3.6%.
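For reference, a minimal sketch of the metric on nested lists, assuming the common Frobenius-norm definition of NMSE (squared reconstruction error divided by the squared norm of the original weights). The toy matrices are made up for illustration and have nothing to do with the measured 3.6%.

```python
def nmse(original, reconstructed):
    """Squared Frobenius error divided by the squared Frobenius norm of the original."""
    error = sum((w - r) ** 2
                for row_w, row_r in zip(original, reconstructed)
                for w, r in zip(row_w, row_r))
    energy = sum(w * w for row in original for w in row)
    return error / energy


W = [[1.0, -2.0], [0.5, 4.0]]          # toy "original" weights
W_hat = [[1.1, -1.9], [0.4, 4.2]]      # toy "reconstructed" weights
value = nmse(W, W_hat)
print(f"NMSE = {value:.4%}")
```

Because the Frobenius norm is unchanged by orthogonal rotations, this value is the same whether it is computed in the rotated or the original weight basis.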
The Footprint Reduction
Before looking at the intelligence, look at the physical memory footprint. By enforcing a strict 3.125 bpw average, we transformed a model that requires a datacenter GPU into one that fits in the RAM of a smartphone.
- Original FP16 VRAM: ~18.2 GB
- 3.125 bpw VRAM: ~3.4 GB
Benchmark Results
We used lm-eval-harness to benchmark the model on a variety of tasks, replicating the benchmarks and metrics reported for the original Gemma 2 model.
| Task Category | Benchmark | Original (FP16) | 3.125 bits (Our Build) | % Change |
|---|---|---|---|---|
| Logic / Binary | BoolQ | 0.842 | 0.870 | +3.33% |
| Logic / Binary | WinoGrande | 0.806 | 0.740 | -8.19% |
| Math & Reasoning | GSM8K (Flex) | 0.686 | 0.735 | +7.14% |
| Compliance | IFEval (Strict) | 0.780 | 0.730 | -6.41% |
| Coding | HumanEval | 0.402 | 0.370 | -7.96% |
| Coding | MBPP (3-shot) | 0.524 | 0.520 | -0.76% |
| Knowledge | MMLU (Avg) | 0.713 | 0.690 | -3.23% |
| Knowledge | ARC-Easy (Norm) | 0.880 | 0.820 | -6.82% |
| Total | Average Change | — | — | -2.85% |
Overall, the results show that the 3.125 bpw build delivers a roughly 5x reduction in memory footprint (16 / 3.125 = 5.12x per quantized weight) while maintaining strong performance across most domains, with an average relative change of approximately -2.85% and the largest drops on WinoGrande and HumanEval.
What this means for hardware
This inference path is theoretically well aligned with edge deployment, but it does not automatically mean it will run faster on today’s GPUs. Modern GPUs are multiplication monsters: they are built around massive Tensor Core throughput, highly optimized dense kernels, and memory pipelines tuned for standard MatMuls. As a result, the LUT-heavy part of this method may underperform, or at best only match, conventional inference on GPUs unless extremely well-optimized custom kernels are written.
That is precisely why this result is interesting. The point is not that we should force every existing GPU to behave like a lookup engine. The point is that state-of-the-art LLM inference may not fundamentally require giant Tensor Cores. If most of the useful work can be reduced to SRAM gathers, pointer chasing, bit unpacking, and cheap integer accumulation, then the right inference hardware for the edge may look very different from a datacenter GPU. It may be a chip optimized for local SRAM bandwidth, low-latency lookups, simple ALUs, and tightly scheduled data movement rather than raw FP16 multiplication throughput.
Next steps
The next step is to apply this compression pipeline to more model families and evaluate how well the approach transfers beyond this first Gemma 2 9B result. We want to understand which architectures are the most robust to 3.125 bpw compression and where the failure modes appear.
We will also share the compression code, along with the naive inference script used for evaluation.