Aniss Djellal

Trading MatMuls for SRAM Lookups: A 3-Bit Edge Architecture

Abstract

The biggest bottleneck in autoregressive Large Language Model decoding at batch size 1 isn’t compute, but memory bandwidth and the thermal cost of heavy floating-point math. In this post, we present a data-free quantization architecture that we developed to compress modern models down to a 3.125-bit footprint while preserving complex reasoning and coding capabilities. By leveraging mathematical smoothing and vector quantization, we replace a large portion of standard matrix multiplications with LUT operations and bitwise additions. This near-multiplication-free approach suggests that future AI inference chips may not need massive Tensor Cores; they may instead benefit more from optimized memory routing and simple ALUs, enabling state-of-the-art intelligence on ultra-low-power edge devices.


The Bottleneck: The Memory Wall, Tensor Cores, and the 3-Bit Death Zone

If you want to understand why local Large Language Models are so difficult to deploy, you have to look at the physical realities of silicon.

During autoregressive decoding at batch size 1, running an LLM doesn’t require massive amounts of computational power. In that regime, compute is relatively cheap compared with moving weights, and the true bottleneck is Memory Bandwidth.

This is why decoding is largely memory-bound: to generate a single token, the device must stream the entire set of model weights from memory into the compute cores. The processors finish the math quickly and then sit idle, waiting for the next layer of weights to travel across the memory bus.
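As a back-of-the-envelope sketch of the memory-bound regime: if every weight has to cross the memory bus once per generated token, then bandwidth divided by model size caps the token rate. The 8-billion parameter count and the 50 GB/s bandwidth below are illustrative assumptions, not measurements from this post.

```python
def max_tokens_per_second(n_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when each token streams all weights once."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

N_PARAMS = 8e9          # hypothetical 8B-parameter model
BANDWIDTH_GB_S = 50.0   # hypothetical edge-device memory bandwidth

fp16_rate = max_tokens_per_second(N_PARAMS, 16, BANDWIDTH_GB_S)
q3_rate = max_tokens_per_second(N_PARAMS, 3.125, BANDWIDTH_GB_S)
print(f"FP16: {fp16_rate:.3f} tok/s, 3.125-bit: {q3_rate:.3f} tok/s")
```

Under these assumptions the bound scales inversely with bits per weight, which is why shrinking the weights is the first lever to pull.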

When you try to move this architecture to edge devices, this bottleneck turns into a multi-front war:

The Hidden Tax of Standard Quantization

To solve the memory wall and make edge inference viable, the obvious solution is compression. If we shrink the weights, we move less data, and power consumption drops.

However, industry-standard quantization algorithms (like AWQ, GPTQ, or standard GGUF) come with a hidden penalty. When these frameworks run a 4-bit model, they aren’t actually doing the matrix multiplication in 4-bit integer math. The hardware loads the compressed weight, dequantizes it back to FP16 on the fly in the registers, and then executes the FP16 MatMul.

This continuous on-the-fly dequantization adds a heavy layer of latency and energy consumption to every single token generated. You save on memory bandwidth, but you are still paying the FP16 energy toll.

The 3-Bit “Death Zone”

To fit a modern 8B parameter model like Qwen 3 into a roughly 3 GB footprint, you are forced to push past 4-bit formats and use hard quantization, compressing weights down to 3 bits or fewer.

Herein lies the final problem: at 3 bits, standard scalar rounding completely breaks down. You enter the “Death Zone.” The quantization grid becomes too wide, and the highly sensitive outlier weights—which act as the load-bearing pillars for the model’s logic—get arbitrarily clipped. The model’s perplexity skyrockets, its coding skills degrade, and it starts hallucinating logic.
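The "grid becomes too wide" failure can be illustrated with a toy example. This is a minimal sketch of plain symmetric round-to-nearest quantization with a single scale per group, which is an assumption chosen for illustration and not the exact scheme of any framework named above; the weight values are made up.

```python
def quantize_uniform(values, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole group."""
    levels = 2 ** (bits - 1) - 1          # 3 positive levels at 3 bits
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

small = [0.02, -0.03, 0.01, 0.04, -0.02, 0.03, -0.01, 0.02]
with_outlier = small + [1.5]              # one large outlier weight

q_small = quantize_uniform(small, bits=3)
q_outlier = quantize_uniform(with_outlier, bits=3)

# Error on the small weights, without and with the outlier in the same group.
err_without = mse(small, q_small)
err_with = mse(small, q_outlier[:len(small)])
print(err_without, err_with)
```

With the outlier present, the grid step is set by the outlier and every small weight collapses to zero, so the error on the ordinary weights grows sharply.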

To successfully run these models on the edge, we cannot just round the weights off, and we cannot afford the energy tax of FP16 dequantization. We need a fundamentally different mathematical approach that protects the model’s intelligence while permanently retiring the hardware multipliers.

How to reach a data-free 3-bit quantization

To solve the outlier problem and compress the weights down to 3 bits without losing accuracy, we developed a multi-stage quantization pipeline. Based on our implementation, here is how a single transformer layer (such as the MLP projections) is compressed:

Algorithm 1: Data-Free, 3-Bit Quantization Pipeline

$$
\begin{aligned}
&\textbf{Input: } W \in \mathbb{R}^{d_{out} \times d_{in}},\ D,\ K,\ G \\
&\textbf{Output: } C,\ I,\ B,\ \alpha \\[4pt]
&\textbf{1) Hadamard Rotation: } W_{\mathrm{rot}} \leftarrow WQ \\[2pt]
&\textbf{2) Codebook Initialization: } \{W_{\mathrm{rot}}^{(j)}\}_{j=1}^{N} \leftarrow \mathrm{Blockify}(W_{\mathrm{rot}}, D), \\
&\qquad C \leftarrow \mathrm{SensitivityKMeans}(\{W_{\mathrm{rot}}^{(j)}\}_{j=1}^{N}, K) \\[2pt]
&\textbf{3) Pointer Assignment: } I_j \leftarrow \arg\min_{k \in \{1,\dots,K\}} \left\| W_{\mathrm{rot}}^{(j)} - C_k \right\|_2^2,\ \forall j \\[2pt]
&\textbf{4) Binary Residual Encoding: } E \leftarrow W_{\mathrm{rot}} - C[I],\ B \leftarrow \mathrm{sign}(E), \\
&\qquad \alpha_g \leftarrow \frac{1}{|g|} \sum_{i \in g} |E_i|,\ \forall g \in G
\end{aligned}
$$

Let’s break down the intuition behind these steps:

Conceptually, our pipeline sits at the intersection of QuaRot-style smoothing and AQLM-style codebook compression, but extends that combination with a binary residual path and a LUT-oriented inference formulation.
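To make the smoothing intuition concrete, here is a minimal sketch of how an orthogonal Hadamard rotation spreads a single outlier across all coordinates while preserving the vector's norm. It uses a plain (non-randomized) Sylvester Hadamard matrix as a simplifying assumption; the pipeline described in this post uses a randomized variant, and the weight values below are made up.

```python
import math

def hadamard(n):
    """Sylvester n x n Hadamard matrix (n a power of two), scaled to be orthogonal."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in H]

def rotate(x, M):
    """Row vector times matrix: x @ M."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

w = [0.1, -0.2, 0.1, 8.0, 0.0, 0.2, -0.1, 0.1]   # one outlier at index 3
Q = hadamard(8)
w_rot = rotate(w, Q)

peak_before = max(abs(v) for v in w)
peak_after = max(abs(v) for v in w_rot)
norm_before = math.sqrt(sum(v * v for v in w))
norm_after = math.sqrt(sum(v * v for v in w_rot))
print(peak_before, peak_after, norm_before, norm_after)
```

The largest magnitude drops while the norm is unchanged, so the rotated values are far friendlier to a small codebook.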

By combining these techniques, we achieve a highly compressed format consisting of codebooks, pointers, packed residual signs, and scales. But how much are we actually compressing? Let’s break down the math to see how we reach exactly 3.125 bits per weight.

The 3.125-Bit Math Breakdown

Let’s assume a standard weight matrix $W$ of size $4096 \times 4096$, and we apply the following compression parameters:

Here is a simplified pseudo-code replicating the core function from our pipeline:

import torch

def quantize(W, D=4, K=256, G=128):
    # The helper functions called below are placeholders for pipeline internals.
    # 1. Apply Randomized Hadamard Transform to smooth outliers
    Q = randomized_hadamard(dim=W.shape[1])
    W_rot = W @ Q
    # 2. Reshape weights into blocks (e.g., vectors of size 4)
    blocks = reshape_to_blocks(W_rot, block_dim=D)
    # 3. Run sensitivity-aware K-Means to build the codebook and assign pointers
    codebook, pointers = run_sensitivity_kmeans(blocks, k=K)
    # 4. Calculate the residual error from the K-Means approximation
    approx = reconstruct_from_codebook(codebook, pointers)
    residual = W_rot - approx
    # 5. Extract 1-bit signs from the residual. torch.sign would return 0 for
    #    exact zeros, which a single bit cannot encode, so zeros map to +1.
    ones = torch.ones_like(residual)
    signs = torch.where(residual >= 0, ones, -ones)
    # 6. Compute the optimal FP16 scaling factor (alpha) per group
    alpha = compute_optimal_scales(residual, signs, group_size=G)
    # 7. Pack the binary signs densely into uint8
    packed_signs = pack_sign_bits(signs)

    return codebook, pointers, packed_signs, alpha
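Steps 5 to 7 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the pipeline's actual helpers: the function names `encode_residual`, `pack_bits`, and `decode_residual` are hypothetical, the per-group scale follows the mean-absolute-value rule from Algorithm 1, and the least-significant-bit-first packing order is an arbitrary choice.

```python
def encode_residual(residual, group_size):
    """Return sign bits (1 for >= 0, else 0) and per-group scale alpha = mean |e|."""
    bits = [1 if e >= 0 else 0 for e in residual]
    alphas = []
    for start in range(0, len(residual), group_size):
        group = residual[start:start + group_size]
        alphas.append(sum(abs(e) for e in group) / len(group))
    return bits, alphas

def pack_bits(bits):
    """Pack 0/1 values into bytes, 8 per byte, least-significant bit first."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        packed[i // 8] |= b << (i % 8)
    return bytes(packed)

def decode_residual(packed, alphas, group_size, n):
    """Rebuild the approximate residual: +alpha or -alpha for each weight."""
    out = []
    for i in range(n):
        bit = (packed[i // 8] >> (i % 8)) & 1
        out.append(alphas[i // group_size] * (1.0 if bit else -1.0))
    return out

residual = [0.3, -0.1, 0.2, -0.4, 0.05, -0.05, 0.1, -0.2]   # toy values
bits, alphas = encode_residual(residual, group_size=4)
packed = pack_bits(bits)
approx = decode_residual(packed, alphas, group_size=4, n=len(residual))
print(packed.hex(), alphas, approx)
```

Eight residual values end up in a single byte of signs plus two scales, which is where the 1 bit per weight for the residual path comes from.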

Assuming our original weight matrix W is $4096 \times 4096$, let’s calculate the exact bit-rate per weight (excluding the codebook, whose size is negligible when amortized over millions of parameters):
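One accounting that lands on exactly 3.125 bits, sketched under the assumption that the parameters are the defaults from the pseudo-code above ($D=4$, $K=256$, $G=128$) and that each scale is stored in FP16:

```python
import math

D, K, G = 4, 256, 128          # block size, codebook size, residual group size
ROWS, COLS = 4096, 4096

pointer_bits = math.log2(K) / D   # one 8-bit pointer shared by D weights
sign_bits = 1.0                   # one residual sign per weight
scale_bits = 16 / G               # one FP16 alpha shared by G weights
total_bits = pointer_bits + sign_bits + scale_bits

# The shared codebook (K centroids of D FP16 values) amortized over the matrix.
codebook_bits_per_weight = K * D * 16 / (ROWS * COLS)

print(pointer_bits, sign_bits, scale_bits, total_bits, codebook_bits_per_weight)
```

Under these assumptions the pointers contribute 2 bits, the signs 1 bit, and the scales 0.125 bits per weight, while the codebook adds roughly a thousandth of a bit.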


How to run inference on 3.125 bits

With the weights highly compressed, the next challenge is executing the forward pass efficiently. We need to compute the matrix multiplication $Y = X \cdot W^T$ without ever dequantizing the weights back to FP16.

The Anatomy of a Compressed Layer

In our new architecture, a single linear layer no longer contains a massive FP16 weight matrix $W$. Instead, it is composed of:

The Naive Approach: On-the-Fly Dequantization

The most straightforward way to run inference would be to take the incoming FP16 activation vector $X$, load the compressed components ($C, I, B, \alpha$) from memory, reconstruct the FP16 weights on the fly, and then execute a standard dense matrix multiplication.

While this solves the memory bandwidth problem by moving less data, it entirely defeats the purpose of removing the FP16 computational tax. The hardware is still forced to execute massive floating-point MatMuls, burning through the power budget of edge devices.

The Inference Equation: Pre-computing the Math

Instead of dequantizing the weights, we can refactor the math to operate directly on the compressed structures. The fundamental realization is this: for the shared-codebook term, no matter what the incoming activations are, there are only $K$ (e.g., 256) possible centroid vectors any given block will ever be multiplied against.

Let’s trace the mathematical transition. In a standard layer (ignoring rotation for a moment), the output is computed as:

$$Y = X \cdot W^T$$

In our quantized model, the weight matrix $W$ is approximated by the codebook ($C$) indexed by pointers ($I$), plus the scaled binary residuals ($\alpha \odot B$). Substituting this into the standard equation:

$$Y = X \cdot \left( \text{Lookup}(C, I) + \alpha \odot B \right)^T$$

To see where the LUT trick comes from, focus on a single activation block $x_b \in \mathbb{R}^D$. The codebook $C$ is shared across the entire layer. In a classical PQ implementation, the pointer $i_{r,b}$ tells us which centroid vector from that layer-wide codebook is used for output row $r$ and block $b$:

$$w_{r,b} = C_{i_{r,b}}$$

The contribution of that block to the dot product is then:

$$x_b^\top w_{r,b} = x_b^\top C_{i_{r,b}}$$

At this point, the pointer is still only acting as an address into a table of centroid vectors. We fetch a centroid, and then we still have to multiply by it.

But for a fixed activation block $x_b$, there are only $K$ possible centroids it can ever interact with from the shared codebook. So instead of repeatedly fetching centroid vectors and multiplying them one by one, we pre-compute all possible dot products once:

$$L_b[k] = x_b^\top C_k,\qquad k \in \{1,\dots,K\}$$

Now the same pointer no longer needs to retrieve a centroid vector. It can retrieve the final scalar result directly from the activation LUT:

$$x_b^\top C_{i_{r,b}} = L_b[i_{r,b}]$$

So we have effectively moved from a lookup table of centroid vectors to a temporary activation-conditioned lookup table of already-computed dot-product results. For a full output row, Engine 1 becomes:

$$y_r^{(\mathrm{LUT})} = \sum_b L_b[i_{r,b}]$$
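The derivation above can be checked numerically on a toy layer. The sketch below builds one lookup table per activation block and compares the gather-and-add result against a reference that reconstructs the weights and multiplies. All sizes and values are made up for illustration ($D=2$, $K=4$), and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_luts(x, codebook, D):
    """One K-entry table per activation block: L_b[k] = x_b . C_k."""
    blocks = [x[i:i + D] for i in range(0, len(x), D)]
    return [[dot(xb, c) for c in codebook] for xb in blocks]

def lut_matvec(luts, pointers):
    """y_r = sum_b L_b[i_{r,b}]: table gathers and additions only."""
    return [sum(luts[b][k] for b, k in enumerate(row)) for row in pointers]

def dense_matvec(x, codebook, pointers):
    """Reference path: rebuild each weight row from centroids, then multiply."""
    out = []
    for row in pointers:
        w = [v for k in row for v in codebook[k]]
        out.append(dot(x, w))
    return out

D = 2
codebook = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5], [-1.0, 2.0]]   # K = 4 toy centroids
pointers = [[0, 3], [2, 2], [1, 0]]     # 3 output rows, 2 blocks per row
x = [2.0, 4.0, -1.0, 3.0]

luts = build_luts(x, codebook, D)
y_lut = lut_matvec(luts, pointers)
y_ref = dense_matvec(x, codebook, pointers)
print(y_lut, y_ref)
```

All multiplications happen inside `build_luts`, whose cost depends on the number of blocks and $K$ but not on the number of output rows.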

Because matrix multiplication is distributive, we can split this single massive operation into two independent, highly efficient blocks. Factoring in the orthogonal rotations ($Q_{in}$ and $Q_{out}$), we arrive at the following algebraic form of the Inference Equation:

$$Y = \left[ \underbrace{(X \cdot Q_{in}) \cdot \text{Lookup}(C, I)^T}_{\text{Engine 1: Codebook Term}} + \underbrace{(X \cdot Q_{in}) \cdot (\alpha \odot B)^T}_{\text{Engine 2: The Residuals}} \right] Q_{out}^T$$

In hardware, we do not execute Engine 1 as a dense multiply against reconstructed centroid blocks. We lower it into the LUT form derived above, where the pointer path becomes $i_{r,b} \mapsto L_b[i_{r,b}]$ instead of $i_{r,b} \mapsto C_{i_{r,b}}$ followed by a dot product.

If we break this down, we can execute the math in a fundamentally different way:

  1. Rotate the Input: First, we apply the input rotation: $X_{\text{rot}} = X \cdot Q_{in}$.
  2. Build the Activation Lookup Table (Engine 1): Instead of multiplying the activations with the reconstructed weights, we split $X_{\text{rot}}$ into small blocks of size $D$ and calculate the dot product of each block against the shared Codebook ($C$) directly. Since the codebook has only 256 centroids for the entire layer, we build a small 256-entry LUT for each activation block and store the results in fast SRAM.
  3. The Dual-Engine Compute: Now, to generate the outputs, we don’t do any more FP16 multiplication for the main weights. We simply read the 8-bit pointers ($I$) from memory and fetch the pre-computed scalar results $L_b[i_{r,b}]$ directly from our SRAM LUT. Simultaneously (Engine 2), we read the 1-bit residual signs ($B$) and use integer ALUs to simply add or subtract the $X_{\text{rot}}$ values, scaling each residual group by $\alpha$.
  4. Rotate the Output: Finally, we apply the output rotation: $Y = Y_{\text{rot}} \cdot Q_{out}^T$.

By calculating all 256 possible multiplications for each activation block against the shared codebook upfront, we bypass the massive Tensor Core MatMul over the full weight matrix. The heavy lifting shifts to rapid SRAM gathers of pre-computed activation-LUT results (Engine 1) and bitwise additions (Engine 2).
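Engine 2 can be sketched the same way. The example below assumes one scale per output row and per group of consecutive input positions, which is one plausible layout and not necessarily the one used in the actual build; the function name and all values are hypothetical.

```python
def residual_matvec(x_rot, sign_rows, alpha_rows, group_size):
    """y_r = sum_g alpha[r][g] * sum_{i in g} (+x_i or -x_i).

    Inside a group only additions and subtractions are performed;
    a single multiplication by alpha is applied per group.
    """
    out = []
    for signs, alphas in zip(sign_rows, alpha_rows):
        y = 0.0
        for g, alpha in enumerate(alphas):
            acc = 0.0
            for i in range(g * group_size, (g + 1) * group_size):
                acc = acc + x_rot[i] if signs[i] else acc - x_rot[i]
            y += alpha * acc
        out.append(y)
    return out

x_rot = [1.0, 2.0, 3.0, 4.0]
sign_rows = [[1, 0, 1, 1], [0, 0, 1, 0]]   # 1 means add, 0 means subtract
alpha_rows = [[0.5, 0.25], [0.1, 0.2]]     # one scale per group of 2 inputs
y_res = residual_matvec(x_rot, sign_rows, alpha_rows, group_size=2)
print(y_res)
```

The multiplication count for this path is one per group, so a larger group size trades residual precision for fewer multiplies.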

Why We Do Not Need Explicit Rotate/Unrotate Ops

There is an important implementation detail hiding inside this construction: at inference time, we do not need separate kernels that rotate activations into a new basis and then rotate them back out.

The activations stay in FP16, while the linear layers are executed through the shared codebook plus an activation LUT. Because the weights were rotated offline before quantization, the basis change is already baked into the centroids and pointer tables. In other words, the codebooks carry the rotation for us.

What gets rotated offline

Let $Q$ be an orthogonal matrix:

$$Q Q^\top = Q^\top Q = I$$

Before quantization, we transform the dense weights once:

$$
\begin{aligned}
W_q' &= Q^\top W_q, & W_k' &= Q^\top W_k, & W_v' &= Q^\top W_v, \\
W_o' &= W_o Q, \\
W_{\mathrm{gate}}' &= Q^\top W_{\mathrm{gate}}, & W_{\mathrm{up}}' &= Q^\top W_{\mathrm{up}}, & W_{\mathrm{down}}' &= W_{\mathrm{down}} Q .
\end{aligned}
$$

Only after this transformation do we run K-Means, build the shared codebook, and store the pointer tables. So the quantized representation itself is already aligned with the rotated basis.

What happens during inference

The residual stream lives in the rotated space:

$$\tilde{X}_{\mathrm{in}} = X_{\mathrm{in}} Q .$$

Now consider an entry projection, for example the query matrix. The linear layer implicitly cancels the rotation:

$$
\begin{aligned}
\tilde{X}_{\mathrm{in}} W_q' &= (X_{\mathrm{in}} Q)(Q^\top W_q) \\
&= X_{\mathrm{in}} (Q Q^\top) W_q \\
&= X_{\mathrm{in}} W_q .
\end{aligned}
$$

So the tensor that reaches the non-linear part of the transformer is exactly the same tensor the original model would have produced.

At the exit of the sub-layer, the reverse happens automatically. For example, the attention output projection restores the rotated residual basis:

$$
\begin{aligned}
A_{\mathrm{out}} W_o' &= A_{\mathrm{out}} (W_o Q) \\
&= (A_{\mathrm{out}} W_o) Q .
\end{aligned}
$$

So there is never a need for an explicit runtime operation of the form “rotate activations now” or “unrotate activations now.” The shared-codebook-plus-activation-LUT linear layers already perform the required basis changes algebraically.
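A small numerical check of both identities, using a $2 \times 2$ normalized Hadamard matrix as the orthogonal $Q$. The matrices are toy values chosen for illustration, and quantization is left out so that only the basis-change algebra is tested.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

s = 1.0 / math.sqrt(2.0)
Q = [[s, s], [s, -s]]               # 2x2 normalized Hadamard, orthogonal
X_in = [[1.0, -2.0], [0.5, 3.0]]    # toy residual-stream activations
W_q = [[0.2, -0.1], [0.4, 0.3]]     # toy entry projection
A_out = [[0.7, 0.1], [-0.3, 0.9]]   # toy attention output
W_o = [[1.0, 0.5], [-0.5, 2.0]]     # toy exit projection

# Entry: rotated input times offline-rotated weight equals the original product.
entry_rot = matmul(matmul(X_in, Q), matmul(transpose(Q), W_q))
entry_ref = matmul(X_in, W_q)

# Exit: the offline-rotated weight leaves the output in the rotated basis.
exit_rot = matmul(A_out, matmul(W_o, Q))
exit_ref = matmul(matmul(A_out, W_o), Q)

print(max_abs_diff(entry_rot, entry_ref), max_abs_diff(exit_rot, exit_ref))
```

Both differences are at floating-point noise level, which is the numerical counterpart of the cancellation shown above.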

Proof Of Equivalence

We can state the result more formally: a transformer block executing on rotated activations with offline-rotated weights is functionally identical to the original block, up to a persistent global rotation of the residual stream.

1. The RMSNorm Prerequisite

A standard RMSNorm operation consists of length normalization followed by an element-wise scaling by a learnable affine gain $\gamma$:

$$\text{RMSNorm}(X) = \frac{X}{\text{RMS}(X)} \odot \gamma = \frac{X}{\text{RMS}(X)} \text{diag}(\gamma)$$

Because the element-wise affine gain does not commute with an orthogonal rotation matrix $Q$ (i.e., $\text{RMSNorm}(XQ) \neq \text{RMSNorm}(X)Q$), we must first fold $\gamma$ into the successor linear layers offline. This leaves a parameter-free normalization operator at runtime, which does commute with orthogonal rotations since it relies solely on the $L_2$ norm:

$$\text{NormOnly}(X) := \frac{X}{\text{RMS}(X)}$$

$$\text{NormOnly}(XQ) = \text{NormOnly}(X)Q$$
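The difference between the two operators is easy to see on a two-dimensional example. This sketch uses toy values for $x$ and $\gamma$ and a $2 \times 2$ normalized Hadamard matrix for $Q$; none of these come from a real model.

```python
import math

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def norm_only(x):
    """Parameter-free normalization: x / RMS(x)."""
    r = rms(x)
    return [v / r for v in x]

def rms_norm(x, gamma):
    """Normalization followed by the element-wise gain."""
    return [v * g for v, g in zip(norm_only(x), gamma)]

def rotate(x, Q):
    """Row vector times matrix: x @ Q."""
    return [sum(x[i] * Q[i][j] for i in range(len(x))) for j in range(len(Q[0]))]

s = 1.0 / math.sqrt(2.0)
Q = [[s, s], [s, -s]]
x = [3.0, -1.0]
gamma = [2.0, 0.5]

lhs_plain = norm_only(rotate(x, Q))       # NormOnly(xQ)
rhs_plain = rotate(norm_only(x), Q)       # NormOnly(x)Q
lhs_gain = rms_norm(rotate(x, Q), gamma)  # RMSNorm(xQ)
rhs_gain = rotate(rms_norm(x, gamma), Q)  # RMSNorm(x)Q

plain_gap = max(abs(a - b) for a, b in zip(lhs_plain, rhs_plain))
gain_gap = max(abs(a - b) for a, b in zip(lhs_gain, rhs_gain))
print(plain_gap, gain_gap)
```

The gap is at rounding level without the gain and clearly non-zero with it, which is why the gain is folded into the following projections before rotating.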

2. Offline Setup

Assume the incoming residual state is already in the rotated basis:

$$\tilde{X}_{in} = X_{in}Q$$

Offline, we first fold the affine gains into the entry projection weights:

$$\hat{W}_q = \text{diag}(\gamma_{attn}) W_q, \quad \hat{W}_k = \text{diag}(\gamma_{attn}) W_k, \quad \hat{W}_v = \text{diag}(\gamma_{attn}) W_v$$

$$\hat{W}_{gate} = \text{diag}(\gamma_{mlp}) W_{gate}, \quad \hat{W}_{up} = \text{diag}(\gamma_{mlp}) W_{up}$$

Next, we apply the orthogonal rotation $Q$ to all weights:

$$W'_q = Q^\top \hat{W}_q, \quad W'_k = Q^\top \hat{W}_k, \quad W'_v = Q^\top \hat{W}_v, \quad W'_o = W_o Q$$

$$W'_{gate} = Q^\top \hat{W}_{gate}, \quad W'_{up} = Q^\top \hat{W}_{up}, \quad W'_{down} = W_{down} Q$$

Phase 1: Attention

Because the affine gain is folded into the weights, the normalization step preserves the orthogonal rotation:

$$\tilde{X}_{norm} = \text{NormOnly}(\tilde{X}_{in}) = \text{NormOnly}(X_{in}Q) = \text{NormOnly}(X_{in})Q$$

The entry projections now implicitly unrotate the state, returning it to the standard basis:

$$\tilde{Q} = \tilde{X}_{norm}W'_q = (\text{NormOnly}(X_{in})Q)(Q^\top \hat{W}_q) = \text{NormOnly}(X_{in})\hat{W}_q = Q_{standard}$$

By the exact same cancellation mechanism:

$$\tilde{K} = K_{standard}, \quad \tilde{V} = V_{standard}$$

Because the Q, K, and V matrices have returned to the standard space, the core attention operations (RoPE, Softmax, and value mixing) compute exactly the same tensors as the original block:

$$A_{out} = \text{Softmax}\left(\frac{\text{RoPE}(\tilde{Q})\text{RoPE}(\tilde{K})^\top}{\sqrt{d}}\right)\tilde{V} = A_{standard}$$

The output projection then re-rotates the result back into the global residual basis:

$$\tilde{O} = A_{out}W'_o = A_{standard}(W_o Q) = (A_{standard}W_o)Q$$

Finally, the residual addition preserves the rotated-state invariant:

$$\tilde{X}_{mid} = \tilde{X}_{in} + \tilde{O} = X_{in}Q + (A_{standard}W_o)Q = (X_{in} + A_{standard}W_o)Q = X_{mid}Q$$

Phase 2: MLP

The same mechanism applies to the MLP. The normalization step commutes with the rotation:

$$\tilde{X}_{mid,norm} = \text{NormOnly}(\tilde{X}_{mid}) = \text{NormOnly}(X_{mid})Q$$

The gate and up projections implicitly unrotate the activation before the non-linearity:

$$\tilde{G} = \tilde{X}_{mid,norm}W'_{gate} = (\text{NormOnly}(X_{mid})Q)(Q^\top \hat{W}_{gate}) = \text{NormOnly}(X_{mid})\hat{W}_{gate} = G_{standard}$$

$$\tilde{U} = \tilde{X}_{mid,norm}W'_{up} = (\text{NormOnly}(X_{mid})Q)(Q^\top \hat{W}_{up}) = \text{NormOnly}(X_{mid})\hat{W}_{up} = U_{standard}$$

SwiGLU therefore acts exactly on the standard-space activations:

$$F_{out} = \text{Swish}(\tilde{G}) \odot \tilde{U} = F_{standard}$$

The down projection re-rotates the MLP output:

$$\tilde{D} = F_{out}W'_{down} = F_{standard}(W_{down}Q) = (F_{standard}W_{down})Q$$

The final residual addition preserves the invariant through the end of the block:

$$\tilde{X}_{out} = \tilde{X}_{mid} + \tilde{D} = X_{mid}Q + (F_{standard}W_{down})Q = (X_{mid} + F_{standard}W_{down})Q = X_{out}Q$$

Conclusion

By induction through the layer, the full block satisfies:

$$\text{Block}_{\text{QuaRot}}(X_{in}Q) \equiv \text{Block}_{\text{Standard}}(X_{in})Q$$

The non-linear pieces of the transformer always see the same standard-space tensors as the original model, while the residual stream remains consistently in the rotated basis from one end of the block to the other. In a deployed implementation, these basis changes and affine gains are permanently absorbed into the rotated, quantized shared codebook, eliminating the need for standalone rotate/unrotate activation kernels at runtime.

The Math: How Many MatMuls Did We Just Delete?

To truly appreciate the hardware impact of this equation, let’s calculate exactly how many FP16 multiplications we eliminated from the inference pass.

Assume we are generating a single token (batch size $1$) through our $4096 \times 4096$ matrix layer, with $D=4$ and $K=256$.

1. The Standard FP16 Baseline A standard dense layer performs a full dot product for every single weight.

2. The 3.125-Bit Approach (Our Equation) Instead of doing the full MatMul, our hardware executes the Equation in its LUT-based form. Let’s count the multiplications:

The Reduction: By pre-computing the LUT and relying on SRAM lookups and integer additions, we dropped the heavy FP16 multiplication workload from about $16.8$ million ($4096^2 = 16{,}777{,}216$) down to about $1.18$ million.

That is nearly a 93.0% reduction in FP16 multiplications for the core layer weights. This removes the vast majority of the power-hungry hardware multiplier workload and suggests a viable path toward running state-of-the-art models on ultra-low-power edge devices and FPGAs.
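One accounting that reproduces these figures is sketched below. The split into a LUT-building term and a residual-scaling term is an assumption made here for illustration, as is the group size $G=128$, which is taken from the pseudo-code defaults; runtime rotations are not counted.

```python
d_out, d_in = 4096, 4096
D, K, G = 4, 256, 128     # G = 128 is assumed from the pseudo-code defaults

# Dense FP16 baseline: one multiplication per weight.
dense_mults = d_out * d_in

# Engine 1: each of the d_in / D activation blocks is multiplied against
# all K centroids, each centroid having D components.
n_blocks = d_in // D
lut_build_mults = n_blocks * K * D

# Engine 2: one multiplication by alpha per output row and per group.
scale_mults = d_out * (d_in // G)

lut_total_mults = lut_build_mults + scale_mults
reduction = 1.0 - lut_total_mults / dense_mults
print(dense_mults, lut_build_mults, scale_mults, lut_total_mults, reduction)
```

Note that the LUT-building cost does not depend on the number of output rows, so the saving grows with the height of the weight matrix.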


Benchmarking the approach

To validate this architecture, we applied the uniform 3.125 bpw pipeline to Gemma 2 9B.

Before looking at downstream task performance, we also measured the raw quantization error using Normalized Mean Squared Error (NMSE):

$$\mathrm{NMSE}(W, \hat{W}) = \frac{\|W - \hat{W}\|_2^2}{\|W\|_2^2}$$

Across all layers, our quantized reconstruction reached an average NMSE of 3.6%.
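For reference, the metric is straightforward to compute. This is a minimal sketch on nested lists with made-up values, not the evaluation script used for the number above.

```python
def nmse(W, W_hat):
    """Squared reconstruction error divided by the squared norm of W."""
    num = sum((w - h) ** 2 for rw, rh in zip(W, W_hat) for w, h in zip(rw, rh))
    den = sum(w ** 2 for rw in W for w in rw)
    return num / den

W = [[1.0, -2.0], [3.0, 4.0]]        # toy original weights
W_hat = [[1.0, -2.0], [3.0, 3.0]]    # toy reconstruction, one entry off by 1
error = nmse(W, W_hat)
print(f"NMSE = {error:.4f}")
```

Because the error is normalized by the energy of the original matrix, values are comparable across layers of different scale.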

The Footprint Reduction

Before looking at the intelligence, look at the physical memory footprint. By enforcing a strict 3.125 bpw average, we transformed a model that requires a datacenter GPU into one that fits in the RAM of a smartphone.
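As a rough worked example, assume a nominal 9 billion parameters and that every weight is stored at the stated bit rate (both simplifications, since embeddings and other tensors may be handled differently):

$$
\text{FP16: } \frac{9 \times 10^{9} \times 16}{8} \text{ bytes} = 18\ \text{GB},
\qquad
\text{3.125 bpw: } \frac{9 \times 10^{9} \times 3.125}{8} \text{ bytes} \approx 3.52\ \text{GB},
\qquad
\frac{16}{3.125} = 5.12
$$

The ratio depends only on the bit widths, not on the parameter count.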

Benchmark Results

We used lm-eval-harness to benchmark the model on a variety of tasks, replicating the benchmarks and metrics reported for the original Gemma 2 model.

| Task Category | Benchmark | Original (FP16) | 3.125 bits (Our Build) | % Change |
| --- | --- | --- | --- | --- |
| Logic / Binary | BoolQ | 0.842 | 0.870 | +3.33% |
| Logic / Binary | WinoGrande | 0.806 | 0.740 | -8.19% |
| Math & Reasoning | GSM8K (Flex) | 0.686 | 0.735 | +7.14% |
| Compliance | IFEval (Strict) | 0.780 | 0.730 | -6.41% |
| Coding | HumanEval | 0.402 | 0.370 | -7.96% |
| Coding | MBPP (3-shot) | 0.524 | 0.520 | -0.76% |
| Knowledge | MMLU (Avg) | 0.713 | 0.690 | -3.23% |
| Knowledge | ARC-Easy (Norm) | 0.880 | 0.820 | -6.82% |
| Total | Average Change | | | -2.85% |

Overall, the results demonstrate that while the 3.125 bpw build delivers a massive 5.12x reduction in memory footprint, it maintains strong performance across most domains with a manageable average degradation of approximately 2.85%.


What this means for hardware

This inference path is theoretically well aligned with edge deployment, but it does not automatically mean it will run faster on today’s GPUs. Modern GPUs are multiplication monsters: they are built around massive Tensor Core throughput, highly optimized dense kernels, and memory pipelines tuned for standard MatMuls. As a result, the LUT-heavy part of this method may underperform, or at best only match, conventional inference on GPUs unless extremely well-optimized custom kernels are written.

That is precisely why this result is interesting. The point is not that we should force every existing GPU to behave like a lookup engine. The point is that state-of-the-art LLM inference may not fundamentally require giant Tensor Cores. If most of the useful work can be reduced to SRAM gathers, pointer chasing, bit unpacking, and cheap integer accumulation, then the right inference hardware for the edge may look very different from a datacenter GPU. It may be a chip optimized for local SRAM bandwidth, low-latency lookups, simple ALUs, and tightly scheduled data movement rather than raw FP16 multiplication throughput.


Next steps

The next step is to apply this compression pipeline to more model families and evaluate how well the approach transfers beyond this first Gemma 2 9B and Qwen 3 8B result. We want to understand which architectures are the most robust to 3.125 bpw compression and where the failure modes appear.

We will also share the compression code, along with the naive inference script used for evaluation.

