4. Quantization TurboQuant

📅 2026.03.26 17:10 EDT from Gemini 3 Flash 👉 #LLM-Optimization #TurboQuant #Quantization #PolarQuant #Polar-Coordinates 👉 #Data-Modeling #Vector-Computation #Linear-Algebra 📎 Google Research Blog | OpenReview ICLR 2026 📎 PolarQuant Technical Specification 📎 Fast Walsh-Hadamard Transform — Wikipedia 📎 Understanding Polar Quantization

1. TurboQuant

1.1. Overview

TurboQuant is an extreme-compression algorithm for LLMs and vector search engines, introduced by Google Research at the end of 2025 and published at ICLR 2026.
Its primary goal is to solve VRAM out-of-memory (OOM) problems caused by the KV-Cache (Key-Value Cache) when LLMs handle long contexts.
Reported claim: with no measurable accuracy loss, KV-Cache memory drops to about 1/6 of the original, and the reduced memory-bandwidth pressure raises inference throughput by up to 8×.
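
To see where the KV-Cache pressure comes from, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128) is an illustrative assumption rather than a figure from the TurboQuant material, and the factor of 6 is simply the claimed ratio applied to an FP16 baseline.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2.0):
    """KV-Cache size for one sequence: keys + values (factor 2) for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

GIB = 1024 ** 3

# Assumed model shape (illustrative): 32 layers, 8 KV heads, head_dim 128, 128K tokens
fp16_bytes = kv_cache_bytes(32, 8, 128, seq_len=131072)
compressed_bytes = fp16_bytes / 6  # the claimed 6x reduction, applied naively

print(f"FP16 KV-Cache:      {fp16_bytes / GIB:.2f} GiB")
print(f"At 1/6 of the size: {compressed_bytes / GIB:.2f} GiB")
```

The cache grows linearly with sequence length, which is why long contexts rather than model weights tend to trigger OOM on small cards.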

1.2. Technical Principles

TurboQuant breaks the limits of Cartesian-coordinate quantization by adopting a data-oblivious pipeline composed of two stages:

(1) PolarQuant — Polar-Coordinate Quantization

Traditional INT8/FP8 quantization works in Cartesian coordinates, which requires storing scale factors per data block — a non-trivial bit overhead.
Random Rotation: first apply a random rotation to the input vector so that its coordinates follow a well-spread, predictable distribution regardless of the input.
Polar Transformation: convert the vector into a radius and angles. After rotation, the angles concentrate around predictable values, which allows high compression without storing extra normalization constants.
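
The "bit overhead" of per-block scale factors can be made concrete with a little arithmetic. This is a minimal sketch; the block size of 32 and the 16-bit scale are assumed example values, not parameters taken from TurboQuant.

```python
def effective_bits_per_value(value_bits, block_size, scale_bits=16):
    """Average storage per value when every block also stores one scale factor."""
    return value_bits + scale_bits / block_size

# Assumed example: 4-bit codes, one FP16 scale per block of 32 values
print(effective_bits_per_value(4, 32))  # 4.5
```

At low bit widths the scale is a noticeable fraction of the budget, which is the overhead a scheme without stored normalization constants avoids.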

(2) QJL (Quantized Johnson-Lindenstrauss) Residual Correction

PolarQuant has a high compression ratio but leaves a small residual.
Error correction: project the residual to a lower dimension using QJL, recording only 1 bit (sign).
Bias elimination: this step removes the systematic bias that the first stage leaves in inner-product estimates, which the authors report preserves accuracy on long-context tasks such as Needle-in-a-Haystack stress tests.
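
The idea of a 1-bit sign sketch can be illustrated as follows. This is a minimal sketch under stated assumptions: a Gaussian projection matrix, hypothetical function names (`qjl_encode`, `qjl_inner_product`), and a deliberately large number of projections `m` so that the averaging effect is visible. A real deployment would project to far fewer dimensions and apply the sketch to the residual of the first stage.

```python
import numpy as np

def qjl_encode(k, S):
    """Keep only the sign of each projected coordinate, plus the norm of k."""
    return np.sign(S @ k).astype(np.int8), float(np.linalg.norm(k))

def qjl_inner_product(q, signs, k_norm, S):
    """Estimate <q, k> from the 1-bit sketch of k and a full-precision query q.

    For Gaussian rows s, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / ||k||,
    so rescaling by sqrt(pi/2) * ||k|| gives an unbiased estimate.
    """
    m = S.shape[0]
    return float(np.sqrt(np.pi / 2) / m * k_norm * np.dot(S @ q, signs))

rng = np.random.default_rng(0)
d, m = 64, 20000                      # m is large here purely for demonstration
q = rng.standard_normal(d)
q /= np.linalg.norm(q)
k = rng.standard_normal(d)
k /= np.linalg.norm(k)
S = rng.standard_normal((m, d))       # random projection shared by encoder and decoder

signs, k_norm = qjl_encode(k, S)
estimate = qjl_inner_product(q, signs, k_norm, S)
exact = float(q @ k)
print(f"exact <q, k> = {exact:.4f}, 1-bit estimate = {estimate:.4f}")
```

Only the signs and one norm are stored per vector; the query side stays in full precision, which is what keeps the estimator unbiased.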

1.3. Guide for Individual Users

For local-LLM enthusiasts, TurboQuant means very long context (100K+ tokens) becomes feasible on consumer 12-16 GB VRAM cards (RTX 3060 / 4070).

(1) Deployment & Tools

TurboQuant is in the early stages of moving from paper to industry. Individual users typically engage through these paths:
- llama.cpp / ExLlamaV2: the fastest entry point. The community is integrating TurboQuant into llama.cpp's KV-Cache quantization options.
- AutoGPTQ / AutoAWQ: watch these mainstream conversion tools for updates. Future model conversions may expose flags along the lines of --kv-quant turboquant.
- vLLM / PagedAttention: if you run a local inference server, vLLM is likely to be among the first to support this algorithm natively for concurrency and throughput optimization.

(2) Configuration (hypothetical)

Based on community integration trends, future config (YAML or launch command) may look like:

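# Example: launch llama.cpp with TurboQuant (hypothetical flags, not a released interface)
#   --cache-type tq     enable the TurboQuant cache (hypothetical value)
#   --ctx-size 128000   very large context even with limited VRAM
# Comments are kept above the command: a comment after a trailing backslash
# would break the shell line continuation.
./main -m llama-3-70b-q4_k_m.gguf \
  --cache-type tq \
  --ctx-size 128000 \
  --n-gpu-layers 81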

1.4. Application Scenarios

For data-engineering professionals, focus on these landing scenarios:

Semantic Search
Vector DB: when building RAG systems, use TurboQuant to compress vector indices.
Effect: at the claimed 6× to 8× compression, an index that previously needed 64 GB of memory shrinks to roughly 8 to 11 GB, with little reported loss in recall.
Personal AI Agents
Long-term memory: even after weeks of conversation, no need to clear history to maintain context.
Local RAG: deploy a household-server knowledge base over your full personal documents; response time goes from "seconds" to "milliseconds".
Code Intelligence
Repo-level understanding: in VS Code, the AI can "swallow" all Python scripts of a project in one go without breaking on VRAM.

Performance summary:

Metric	Traditional (FP16/INT8)	TurboQuant
KV-Cache compression	1× - 2×	6× - 8×
Accuracy loss	Noticeable (grows with length)	Near zero
Pre-processing cost	High (needs calibration data)	Very low (data-oblivious)
Hardware	High VRAM dependence	Very friendly to consumer GPUs

2. PolarQuant

2.1. Overview

PolarQuant is the core quantization engine within TurboQuant.
Design philosophy: rather than struggle to compress wildly non-uniform values in Cartesian coordinates, rotate them and project into polar coordinates.
LLM KV-Cache activations have strong outliers, forcing traditional linear quantization to reserve a huge dynamic range and lose precision. PolarQuant smooths this away through a mathematical transformation.
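
The effect of a single outlier on linear quantization is easy to reproduce. The sketch below uses plain symmetric absmax quantization at 4 bits as an assumed stand-in for "traditional linear quantization"; the vector is made up for illustration.

```python
import numpy as np

def absmax_quantize(x, bits=4):
    """Symmetric linear quantization with one scale derived from max |x|."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4 bits
    scale = np.abs(x).max() / qmax
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale

# One large outlier next to many small values
x = np.array([120.5, 0.2, -0.5, 0.4, 1.1, -0.3, 0.1, 0.5])
codes, scale = absmax_quantize(x)

print(codes)           # [7 0 0 0 0 0 0 0]
print(codes * scale)   # every small value is reconstructed as 0
```

The outlier dictates the scale, so all remaining entries fall into the zero bin and their information is lost.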

2.2. Key Mechanisms

PolarQuant operates in three steps:

Random Orthogonal Transformation
Theory: rotate the vector with a Hadamard matrix or a random orthogonal matrix.
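Effect: because the matrix is orthogonal, the rotation preserves Euclidean norms and distances exactly, while spreading energy concentrated in a few dimensions across all dimensions, which suppresses outliers. (The Johnson-Lindenstrauss Lemma concerns dimension-reducing projections such as the one in QJL, not this step.)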
Angular Discretization
Coordinate switch: convert an n-dimensional point from $(x_1, \dots, x_n)$ to polar form $(\rho, \theta_1, \dots, \theta_{n-1})$.
Quantization: keep $\rho$ (magnitude) at high precision; quantize all $\theta$ angles with very few bits (3-4), which works because after the rotation their distribution is tightly concentrated and known in advance.
Data-Oblivious Property
Unlike AWQ or GPTQ, PolarQuant doesn't need a calibration dataset to compute scales — meaning it can run on-the-fly and dramatically lowers compute load in data-engineering pipelines.
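
The coordinate switch and angle quantization can be sketched with textbook hyperspherical coordinates. This is a minimal illustration, assuming uniform rounding over the full angle range; the function names are hypothetical, and the actual PolarQuant layout of angles and codebooks may differ.

```python
import numpy as np

def to_polar(x):
    """Cartesian (x_1..x_n) -> (rho, theta_1..theta_{n-1})."""
    x = np.asarray(x, dtype=np.float64)
    rho = np.linalg.norm(x)
    # tail[i] = sqrt(x_i^2 + ... + x_n^2), the partial norm of the remaining components
    tail = np.sqrt(np.cumsum(x[::-1] ** 2)[::-1])
    theta = np.zeros(len(x) - 1)
    safe = np.where(tail[:-2] > 0, tail[:-2], 1.0)
    theta[:-1] = np.arccos(np.clip(x[:-2] / safe, -1.0, 1.0))  # each in [0, pi]
    theta[-1] = np.arctan2(x[-1], x[-2])                        # in [-pi, pi]
    return rho, theta

def from_polar(rho, theta):
    """Inverse of to_polar."""
    n = len(theta) + 1
    x = np.empty(n)
    sin_prod = 1.0
    for i in range(n - 1):
        x[i] = rho * sin_prod * np.cos(theta[i])
        sin_prod *= np.sin(theta[i])
    x[n - 1] = rho * sin_prod
    return x

def _angle_ranges(count):
    lo = np.zeros(count)
    lo[-1] = -np.pi                     # only the last angle spans a full turn
    hi = np.full(count, np.pi)
    return lo, hi

def quantize_angles(theta, bits=4):
    """Uniformly round each angle to one of 2**bits levels (bits <= 8)."""
    lo, hi = _angle_ranges(len(theta))
    return np.round((theta - lo) / (hi - lo) * (2 ** bits - 1)).astype(np.uint8)

def dequantize_angles(codes, bits=4):
    lo, hi = _angle_ranges(len(codes))
    return lo + codes / (2 ** bits - 1) * (hi - lo)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
rho, theta = to_polar(x)
x_hat = from_polar(rho, dequantize_angles(quantize_angles(theta)))
cos_sim = float(x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat)))
print(f"cosine similarity with 4-bit angles: {cos_sim:.4f}")
```

No scale or zero-point is derived from the data: the quantization grid is fixed ahead of time, which is the data-oblivious property in miniature.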

2.3. Python Demo: Vector Rotation & Projection

The following Python demo simulates PolarQuant's first step — making a non-uniform vector smooth through orthogonal rotation:

import numpy as np
from scipy.stats import ortho_group

def polar_quant_demo(seed=0):
    # 1. Simulate a raw vector with strong outliers (e.g., LLM hidden output)
    original_vector = np.array([120.5, 0.2, -0.5, 88.4, 1.1, -0.3, 0.1, 0.5])
    print(f"Original Vector:\n{original_vector}")
    print(f"Max |value|: {np.abs(original_vector).max():.2f}, Std Dev: {original_vector.std():.2f}\n")

    # 2. Generate a random orthogonal matrix (seeded so the run is reproducible)
    # In real PolarQuant, FWHT (Fast Walsh-Hadamard Transform) is used for speed
    dimension = len(original_vector)
    rotation_matrix = ortho_group.rvs(dim=dimension, random_state=seed)

    # 3. Apply random rotation
    rotated_vector = rotation_matrix @ original_vector
    print(f"Rotated Vector (Energy Redistributed):\n{rotated_vector}")
    # Compare magnitudes: the largest entry may be negative after rotation
    print(f"Max |value|: {np.abs(rotated_vector).max():.2f}, Std Dev: {rotated_vector.std():.2f}")

    # 4. Energy conservation check (L2 norm preserved)
    original_norm = np.linalg.norm(original_vector)
    rotated_norm = np.linalg.norm(rotated_vector)
    print(f"\nL2 Norm Check: Original={original_norm:.4f}, Rotated={rotated_norm:.4f}")
    # An orthogonal rotation preserves distance and norm up to floating-point error

if __name__ == "__main__":
    polar_quant_demo()

2.4. Adoption & Practical Tips

Implementation latency: a dense rotation is an $O(n^2)$ matrix-vector multiplication per vector, which adds latency. In practice, use FWHT (Fast Walsh-Hadamard Transform) with $O(n \log n)$ complexity.
Precision trade-off: with ample VRAM, keep $\rho$ in FP16 and only quantize the angles.
Best fit: high-dimensional embeddings. For large-scale user-profile vectors, PolarQuant preserves Top-K retrieval accuracy better than simple linear quantization.

3. Algorithm Example

For data-engineering professionals, the choice of data structure determines the algorithm's underlying efficiency. In Python, we don't use the native list for TurboQuant / PolarQuant — we use numpy.ndarray for contiguous-memory layout.

3.1. FWHT (Fast Walsh-Hadamard Transform)

This section compares the cost of a dense $O(n^2)$ rotation with the $O(n \log n)$ FWHT.

(1) Data Structure: Defining Vectors

The idiomatic way to define a vector in Python is a NumPy array: at the C level it is a contiguous memory buffer, which lets NumPy's compiled loops take advantage of SIMD instructions.

import numpy as np
# 1. Define a vector (rank-1 tensor)
# Use dtype=np.float32 to mimic single-precision floats in inference engines
v = np.array([1.2, 3.4, 5.6, 7.8], dtype=np.float32)

# 2. Element-wise operations
v_scaled = v * 2.0   # scale
v_sum = v + 1.0      # offset (broadcasting)
v_norm = np.linalg.norm(v)  # L2 norm

(2) Vector Rotation: The Geometry Step

PolarQuant's rotation is essentially a linear transformation:
- Math: $y = Wx$ where $W$ is orthogonal, satisfying $W^T W = I$
- Data structure: $W$ is a 2D matrix np.ndarray (shape=(n, n))
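
The orthogonality property is straightforward to check numerically. In this small sketch, $W$ is drawn via a QR decomposition of a Gaussian matrix, which is one common way to obtain a random orthogonal matrix (an assumption for illustration, not necessarily what PolarQuant does).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# QR of a Gaussian matrix yields an orthogonal factor W
W, _ = np.linalg.qr(rng.standard_normal((n, n)))
x = rng.standard_normal(n)

y = W @ x          # rotate
x_back = W.T @ y   # undo: for an orthogonal matrix the inverse is the transpose

print(np.allclose(W.T @ W, np.eye(n)))                       # W^T W = I
print(np.isclose(np.linalg.norm(x), np.linalg.norm(y)))      # norm preserved
print(np.allclose(x_back, x))                                # rotation is invertible
```

Because the inverse is just the transpose, de-quantization never needs an explicit matrix inversion.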

(3) Optimized Implementation: FWHT ($O(n \log n)$)

PolarQuant uses the Fast Walsh-Hadamard Transform (FWHT) as a fast stand-in for a dense random rotation:
- Direct matrix multiplication is $O(n^2)$.
- FWHT never materializes the $n \times n$ matrix; it applies butterfly operations in $\log_2 n$ stages.
- The Python implementation resembles an FFT and needs far fewer operations.

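def fwht(a):
    """Recursive Fast Walsh-Hadamard Transform, normalized to be orthogonal.
    a: 1-D NumPy array; length must be a power of 2.
    """
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("fwht requires a length that is a power of 2")
    if n == 1:
        return a.copy()

    # Transform the front and back halves independently
    a_left = fwht(a[0 : n // 2])
    a_right = fwht(a[n // 2 : n])

    # Butterfly: (x + y), (x - y)
    res = np.empty(n, dtype=a.dtype)
    res[0 : n // 2] = a_left + a_right
    res[n // 2 : n] = a_left - a_right

    # Dividing by sqrt(2) at each of the log2(n) levels gives 1/sqrt(n) overall,
    # which keeps the transform orthogonal (L2 norm preserved)
    return res / np.sqrt(2)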


# (1) Run and verify
# Simulate one block in KV-Cache (dim=8)
kv_block = np.array([10.0, 1.0, 0.5, -2.0, 5.0, 0.0, 1.1, 0.2], dtype=np.float32)

# Run fast rotation
rotated_kv = fwht(kv_block)

print(f"Original KV: {kv_block}")
print(f"Rotated  KV: {rotated_kv}")
print(f"Norm Check: {np.linalg.norm(kv_block):.4f} == {np.linalg.norm(rotated_kv):.4f}")
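
A plain Hadamard transform is a fixed, deterministic matrix. A common way to make it behave like a random rotation is to flip the sign of each coordinate at random before transforming. The sketch below shows that combination with an iterative FWHT; the function names and the sign-flip construction are illustrative assumptions, not code from the TurboQuant authors.

```python
import numpy as np

def fwht_iterative(a):
    """Iterative FWHT with 1/sqrt(n) normalization; length must be a power of 2."""
    a = np.array(a, dtype=np.float64)   # work on a copy
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    h = 1
    while h < n:
        # View the data as blocks of size 2h and butterfly the two halves of each block
        blocks = a.reshape(-1, 2, h)
        a = np.stack((blocks[:, 0, :] + blocks[:, 1, :],
                      blocks[:, 0, :] - blocks[:, 1, :]), axis=1)
        h *= 2
    return a.reshape(n) / np.sqrt(n)

def randomized_hadamard(x, signs):
    """y = H D x, where D is a diagonal matrix of random +/-1 signs."""
    return fwht_iterative(signs * x)

def inverse_randomized_hadamard(y, signs):
    """x = D H y, since the normalized H and D are each their own inverse."""
    return signs * fwht_iterative(y)

rng = np.random.default_rng(0)
kv_block = np.array([10.0, 1.0, 0.5, -2.0, 5.0, 0.0, 1.1, 0.2])
signs = rng.choice([-1.0, 1.0], size=len(kv_block))

rotated = randomized_hadamard(kv_block, signs)
restored = inverse_randomized_hadamard(rotated, signs)

print(f"Rotated:  {rotated}")
print(f"Restored: {restored}")
```

Only the sign vector (or the seed that generates it) has to be kept to invert the transform, so the storage cost of the "rotation matrix" is negligible.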

(4) Adoption Guide

Setup environment: pip install numpy scipy
Workflow integration:
Input: a tensor from an LLM intermediate layer
Transform: call fwht() to rotate
Quantize: apply PolarQuant to the rotated vector (extract magnitude, low-bit quantize angles)
Reverse (de-quantize): at inference read-back, de-quantize to floats, then call fwht() again (the Hadamard transform is its own inverse up to a scalar; with the $1/\sqrt{n}$ normalization used in fwht() above it is exactly its own inverse)
Complexity comparison:

Data scale ($n$)	Matrix multiply ($n^2$)	FWHT ($n \log n$)	Speedup
1024 (LLM dim)	1,048,576	10,240	~100×
4096 (Llama-3)	16,777,216	49,152	~340×
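
The figures in the table follow directly from the two complexity formulas. The sketch below reproduces them; note that these are operation counts by order of growth, not measured wall-clock speedups.

```python
import math

def rotation_cost(n):
    """Return (dense n^2, fast n*log2(n), ratio) for a power-of-2 dimension n."""
    dense = n * n
    fast = n * int(math.log2(n))
    return dense, fast, dense / fast

for n in (1024, 4096):
    dense, fast, ratio = rotation_cost(n)
    print(f"n={n}: dense={dense:,} fwht={fast:,} ratio={ratio:.1f}x")
```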

3.2. Converting Rotated Vectors to Polar Coordinates

The most important factor for adoption is memory layout. The example shows how to take the rotated vector to polar coordinates and apply 3-bit quantization (a common simplification of polar-coordinate quantization).

(1) Data Structure: Polar Representation

Two independent ndarrays store the polar-coordinate data, saving roughly 75% space versus storing the original float32 vector:
- Radius ($\rho$): a scalar in float16 or float32; the magnitude in n-dim space.
- Angles ($\theta$): an int8 array. Although each value needs only 3 bits (values 0-7), NumPy's smallest integer dtype is 8 bits (1 byte), so without bit-packing each angle still occupies a full byte.

(2) Core Operations: Transformation & Quantization

Decompose into: Cartesian → Polar → Quantize → Dequantize.

Vector to angles (encode): for an n-dim vector $V$, polar form has 1 radius and $n-1$ angles.
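$\rho$ (magnitude): $\rho = \sqrt{\sum_{i=1}^{n} x_i^2}$
$\theta$ (angles): $\theta_i = \arccos\left(x_i / \sqrt{\sum_{j=i}^{n} x_j^2}\right)$ for $i = 1, \dots, n-2$, where the denominator is the partial norm of the remaining components; the last angle $\theta_{n-1}$ additionally carries the sign of $x_n$ and ranges over a full $2\pi$.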
Bit-packing: since 3 bits don't divide 8 evenly, in industrial pipelines use np.packbits or bit-shifts to pack multiple 3-bit values into a uint8 array — that's where the real compression comes from.
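
The bit-packing step can be sketched with `np.packbits` and `np.unpackbits`. This is a minimal illustration with hypothetical helper names (`pack_3bit`, `unpack_3bit`); a production kernel would more likely use hand-written bit shifts in native code.

```python
import numpy as np

def pack_3bit(values):
    """Pack integers in [0, 7] into a uint8 array, 3 bits per value."""
    values = np.asarray(values, dtype=np.uint8)
    shifts = np.array([2, 1, 0], dtype=np.uint8)
    bits = (values[:, None] >> shifts) & 1          # shape (n, 3), most significant bit first
    return np.packbits(bits.ravel())                # zero-padded to a whole number of bytes

def unpack_3bit(packed, count):
    """Recover `count` 3-bit values from the packed bytes."""
    bits = np.unpackbits(packed)[: count * 3].reshape(count, 3)
    return ((bits[:, 0] << 2) | (bits[:, 1] << 1) | bits[:, 2]).astype(np.uint8)

codes = np.array([0, 7, 3, 5, 1, 6, 2, 4], dtype=np.uint8)
packed = pack_3bit(codes)
restored = unpack_3bit(packed, len(codes))

print(f"{codes.nbytes} bytes unpacked -> {packed.nbytes} bytes packed")  # 8 bytes -> 3 bytes
print(restored)
```

The number of values has to be stored alongside the packed bytes, because the final byte may contain padding bits.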

(3) Complete Python Pipeline

This code shows the full path from raw data to compression:

import numpy as np

def polar_quantize_pipeline(vector, bits=3):
    """PolarQuant full pipeline: rotate → polarize → quantize."""
    dim = len(vector)
    # Use a fixed seed for reproducibility
    rng = np.random.default_rng(42)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    rotated_v = q @ vector

    # Magnitude
    radius = np.linalg.norm(rotated_v)

    # Angles: simplified — compute normalized components and map to [0, 1]
    normalized_v = rotated_v / (radius + 1e-9)

    # 3-bit quantization
    levels = 2 ** bits
    quantized_angles = np.round((normalized_v + 1) / 2 * (levels - 1)).astype(np.int8)
    quantized_angles = np.clip(quantized_angles, 0, levels - 1)

    return radius, quantized_angles, q


def dequantize(radius, quantized_angles, rotation_matrix, bits=3):
    """Dequantize: restore to the original space."""
    levels = 2 ** bits
    # Recover from int to [-1, 1]
    reconstructed_v = (quantized_angles / (levels - 1)) * 2 - 1

    # Renormalize and apply radius
    reconstructed_v = reconstructed_v / np.linalg.norm(reconstructed_v) * radius

    # Inverse rotation (orthogonal matrix's inverse equals its transpose)
    original_space_v = rotation_matrix.T @ reconstructed_v
    return original_space_v


# (1) Example
# Define an 8-dim feature vector
raw_data = np.array([55.2, -10.5, 0.1, 120.8, -3.3, 45.0, 7.7, -1.2], dtype=np.float32)

# Compress
r, q_angles, rot_m = polar_quantize_pipeline(raw_data)

# Decompress
recovered_data = dequantize(r, q_angles, rot_m)

print(f"Original: {raw_data[0:3]} ...")
print(f"Recovered: {recovered_data[0:3]} ...")
print(f"Compression Ratio (Theoretical): ~{32/3:.1f}x for angles")
print(f"Cosine Similarity: {np.dot(raw_data, recovered_data)/(np.linalg.norm(raw_data)*np.linalg.norm(recovered_data)):.4f}")

(4) Execution & Validation

How to run:
Environment: ensure numpy is installed.
Inspect data structure: in debug mode, observe q_angles — every element is in [0, 7], representable in 3 bits, while raw data takes 32 bits per value.
Cosine similarity: closer to 1.0 means the rotation + quantization preserved precision well.
Adoption tips:
Vector DB: in production, a transformation like this would typically run as native code (for example C++) inside or alongside a vector store such as PostgreSQL or Milvus, rather than in Python.
Data modeling: this algorithm can be part of feature engineering, useful for compressing storage of cold data.