4. Quantization TurboQuant
📅 2026.03.26 17:10 EDT from Gemini 3 Flash 👉 #LLM-Optimization #TurboQuant #Quantization #PolarQuant #Polar-Coordinates 👉 #Data-Modeling #Vector-Computation #Linear-Algebra 📎 Google Research Blog | OpenReview ICLR 2026 📎 PolarQuant Technical Specification 📎 Fast Walsh-Hadamard Transform — Wikipedia 📎 Understanding Polar Quantization
1. TurboQuant
1.1. Overview
- TurboQuant is an extreme-compression algorithm for LLMs and vector search engines, introduced by Google Research at the end of 2025 and published at ICLR 2026.
- Its primary goal is to solve VRAM out-of-memory (OOM) problems caused by KV-Cache (Key-Value Cache) when LLMs handle long text.
- Claim: with near-zero accuracy loss, KV-Cache memory drops to roughly 1/6 of the original, and the reduced memory-bandwidth pressure boosts inference throughput by up to 8×.
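To put the 1/6 claim in perspective, the KV-Cache footprint can be estimated from the model shape. The sketch below is back-of-the-envelope arithmetic only. The Llama-3-70B-style shape (80 layers, 8 KV heads, head dimension 128) and the helper name `kv_cache_bytes` are assumptions for illustration, not figures taken from the TurboQuant material.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    """Approximate KV-Cache size: one K and one V vector per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed model shape (Llama-3-70B-like, grouped-query attention), FP16 cache
fp16_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                            seq_len=128_000, bytes_per_value=2)
compressed_bytes = fp16_bytes / 6  # the claimed 1/6 footprint

print(f"FP16 KV-Cache:    {fp16_bytes / 2**30:.1f} GiB")
print(f"At 1/6 (claimed): {compressed_bytes / 2**30:.1f} GiB")
```

Under these assumptions the cache alone is about 39 GiB in FP16 at 128K tokens, and about 6.5 GiB at the claimed ratio, which is the difference between not fitting and fitting on a consumer card.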
1.2. Technical Principles
TurboQuant breaks the limits of Cartesian-coordinate quantization by adopting a data-oblivious pipeline composed of two stages:
(1) PolarQuant — Polar-Coordinate Quantization
- Traditional INT8/FP8 quantization works in Cartesian coordinates, which requires storing scale factors per data block — a non-trivial bit overhead.
- Random Rotation: first apply a random rotation to the input vector so that its coordinates follow a predictable distribution that does not depend on the input.
- Polar Transformation: convert the vector into radius and angles. After rotation, angles concentrate, allowing high compression without storing extra normalization constants.
(2) QJL (Quantized Johnson-Lindenstrauss) Residual Correction
- PolarQuant has a high compression ratio but leaves a small residual.
- Error correction: project the residual to a lower dimension using QJL, recording only 1 bit (sign).
- Bias elimination: this step removes the bias that quantization introduces into inner-product estimates, which is claimed to yield perfect scores on long-context tasks such as Needle-in-a-Haystack stress tests.
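The 1-bit correction can be sketched as follows. This is a minimal illustration of a sign-only Johnson-Lindenstrauss sketch with the standard $\sqrt{\pi/2}$ rescaling, assuming a Gaussian projection matrix. The function names `qjl_encode` and `qjl_inner_product` are hypothetical, and the real TurboQuant kernel may differ in how it projects and stores the residual.

```python
import numpy as np

def qjl_encode(residual, projection):
    """Keep only the sign of each projected coordinate (1 bit each) plus the norm."""
    signs = np.sign(projection @ residual)
    signs[signs == 0] = 1.0  # break ties so every entry is +1 or -1
    return signs.astype(np.int8), float(np.linalg.norm(residual))

def qjl_inner_product(query, signs, residual_norm, projection):
    """Unbiased estimate of <query, residual> from the 1-bit sketch."""
    m = projection.shape[0]
    return np.sqrt(np.pi / 2) / m * residual_norm * float((projection @ query) @ signs)

rng = np.random.default_rng(0)
dim, m = 64, 20_000  # m is large here only to make the estimate visibly tight
projection = rng.standard_normal((m, dim))  # Gaussian JL matrix (assumption)
residual = rng.standard_normal(dim)         # stand-in for the PolarQuant residual
query = rng.standard_normal(dim)

signs, norm = qjl_encode(residual, projection)
estimate = qjl_inner_product(query, signs, norm, projection)
print(f"exact: {query @ residual:.3f}, 1-bit estimate: {estimate:.3f}")
```

The estimator is unbiased for any number of projections; using fewer rows only increases its variance, which is the trade-off a real implementation tunes.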
1.3. Guide for Individual Users
For local-LLM enthusiasts, TurboQuant means very long context (100K+ tokens) becomes feasible on consumer 12-16 GB VRAM cards (RTX 3060 / 4070).
(1) Deployment & Tools
TurboQuant is in the early stages of moving from paper to industry. Individual users typically engage through these paths:
- llama.cpp / ExLlamaV2: the fastest entry point. The community is integrating TurboQuant into llama.cpp's KV-Cache quantization options.
- AutoGPTQ / AutoAWQ: watch these mainstream conversion tools for updates. Future model conversions may expose a flag along the lines of `--kv-quant turboquant` (hypothetical).
- vLLM / PagedAttention: if you run a local inference server, vLLM is a likely early adopter, since the algorithm directly targets concurrency and throughput.
(2) Configuration (hypothetical)
Based on community integration trends, future config (YAML or launch command) may look like:
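# Hypothetical example: launch llama.cpp with TurboQuant
#   --cache-type tq     enable TurboQuant cache (hypothetical value)
#   --ctx-size 128000   very large context even with limited VRAM
# Comments are kept above the command because a comment after a trailing
# backslash breaks shell line continuation.
./main -m llama-3-70b-q4_k_m.gguf \
  --cache-type tq \
  --ctx-size 128000 \
  --n-gpu-layers 81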
1.4. Application Scenarios
For data-engineering professionals, focus on these landing scenarios:
- Semantic Search
  - Vector DB: when building RAG systems, use TurboQuant to compress vector indices.
  - Effect: at the claimed 6×-8× compression, an index that previously needed 64 GB of memory shrinks to roughly 8-11 GB with little loss of recall.
- Personal AI Agents
  - Long-term memory: even after weeks of conversation, the history can stay in context instead of being cleared.
  - Local RAG: deploy a knowledge base over your full personal documents on a home server; response time goes from "seconds" to "milliseconds".
- Code Intelligence
  - Repo-level understanding: in VS Code, the AI can load all Python scripts of a project into context in one go without running out of VRAM.
Performance summary:
| Metric | Traditional (FP16/INT8) | TurboQuant |
|---|---|---|
| KV-Cache compression | 1× - 2× | 6× - 8× |
| Accuracy loss | Noticeable (grows with length) | Near zero |
| Pre-processing cost | High (needs calibration data) | Very low (data-oblivious) |
| Hardware | High VRAM dependence | Very friendly to consumer GPUs |
2. PolarQuant
2.1. Overview
- PolarQuant is the core quantization engine within TurboQuant.
- Design philosophy: rather than struggle to compress wildly non-uniform values in Cartesian coordinates, rotate them and project into polar coordinates.
- LLM KV-Cache activations have strong outliers, forcing traditional linear quantization to reserve a huge dynamic range and lose precision. PolarQuant smooths this away through a mathematical transformation.
2.2. Key Mechanisms
PolarQuant operates in three steps:
- Random Orthogonal Transformation
  - Theory: rotate the vector with a Hadamard matrix or a random orthogonal matrix.
  - Effect: because the matrix is orthogonal, the rotation preserves Euclidean norms and distances exactly, while spreading energy concentrated in a few dimensions across all dimensions, which suppresses outliers.
- Angular Discretization
  - Coordinate switch: convert an n-dimensional point from $(x_1, \dots, x_n)$ to polar form $(\rho, \theta_1, \dots, \theta_{n-1})$.
  - Quantization: keep $\rho$ (magnitude) at high precision; quantize all $\theta$ angles uniformly with very few bits (3-4), since after rotation their distribution is tightly concentrated and known in advance.
- Data-Oblivious Property
  - Unlike AWQ or GPTQ, PolarQuant doesn't need a calibration dataset to compute scales, so it can run on-the-fly, which dramatically lowers compute load in data-engineering pipelines.
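The coordinate switch and angle quantization can be sketched as below. This uses the textbook hyperspherical convention (the first $n-2$ angles in $[0, \pi]$, the last in $[-\pi, \pi]$) and plain uniform quantization. All function names are hypothetical, and the parameterization in the PolarQuant paper may differ, so treat it as an illustration of the idea rather than the published algorithm.

```python
import numpy as np

def to_polar(x):
    """Cartesian -> (rho, theta) with n-1 hyperspherical angles (n >= 2)."""
    x = np.asarray(x, dtype=np.float64)
    rho = np.linalg.norm(x)
    # tail[i] = sqrt(x_i^2 + ... + x_{n-1}^2)
    tail = np.sqrt(np.cumsum(x[::-1] ** 2)[::-1])
    safe_tail = np.where(tail > 0, tail, 1.0)  # avoid 0/0 when a tail is all zeros
    theta = np.empty(len(x) - 1)
    theta[:-1] = np.arccos(np.clip(x[:-2] / safe_tail[:-2], -1.0, 1.0))  # in [0, pi]
    theta[-1] = np.arctan2(x[-1], x[-2])  # in [-pi, pi], keeps the sign of x_n
    return rho, theta

def from_polar(rho, theta):
    """Inverse of to_polar."""
    sin_prod = np.concatenate(([1.0], np.cumprod(np.sin(theta))))
    x = np.empty(len(theta) + 1)
    x[:-1] = rho * sin_prod[:-1] * np.cos(theta)
    x[-1] = rho * sin_prod[-1]
    return x

def quantize_angles(theta, bits=4):
    """Uniformly quantize each angle over its own range."""
    lo = np.zeros_like(theta)
    span = np.full_like(theta, np.pi)
    lo[-1], span[-1] = -np.pi, 2 * np.pi  # last angle covers the full circle
    codes = np.round((theta - lo) / span * (2 ** bits - 1)).astype(np.uint8)
    return codes, lo, span

def dequantize_angles(codes, lo, span, bits=4):
    return lo + codes / (2 ** bits - 1) * span

x = np.array([0.9, -0.4, 1.3, 0.2, -1.1, 0.7, 0.3, -0.6])
rho, theta = to_polar(x)
codes, lo, span = quantize_angles(theta, bits=4)
x_hat = from_polar(rho, dequantize_angles(codes, lo, span, bits=4))
cosine = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
print(f"codes: {codes}, cosine similarity: {cosine:.4f}")
```

Because $\rho$ is stored separately, the reconstructed vector always has exactly the original norm; all quantization error ends up in the direction.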
2.3. Python Demo: Vector Rotation & Projection
The following Python demo simulates PolarQuant's first step — making a non-uniform vector smooth through orthogonal rotation:
import numpy as np
from scipy.stats import ortho_group

def polar_quant_demo():
    # 1. Simulate a raw vector with strong outliers (e.g., LLM hidden output)
    original_vector = np.array([120.5, 0.2, -0.5, 88.4, 1.1, -0.3, 0.1, 0.5])
    print(f"Original Vector:\n{original_vector}")
    print(f"Max value: {original_vector.max()}, Std Dev: {original_vector.std():.2f}\n")

    # 2. Generate a random orthogonal matrix
    # In real PolarQuant, FWHT (Fast Walsh-Hadamard Transform) is used for speed
    dimension = len(original_vector)
    rotation_matrix = ortho_group.rvs(dim=dimension)

    # 3. Apply random rotation
    rotated_vector = np.dot(rotation_matrix, original_vector)
    print(f"Rotated Vector (Energy Redistributed):\n{rotated_vector}")
    print(f"Max value: {rotated_vector.max():.2f}, Std Dev: {rotated_vector.std():.2f}")

    # 4. Energy conservation check (L2 norm preserved)
    original_norm = np.linalg.norm(original_vector)
    rotated_norm = np.linalg.norm(rotated_vector)
    print(f"\nL2 Norm Check: Original={original_norm:.4f}, Rotated={rotated_norm:.4f}")
    # Orthogonal rotation preserves distance and norm, so the two values match

if __name__ == "__main__":
    polar_quant_demo()
2.4. Adoption & Practical Tips
- Implementation latency: the rotation itself adds work. A dense $O(n^2)$ matrix multiplication per vector adds noticeable latency, so in practice use FWHT (Fast Walsh-Hadamard Transform) with $O(n \log n)$ complexity.
- Precision trade-off: with ample VRAM, keep $\rho$ in FP16 and only quantize the angles.
- Best fit: high-dimensional embeddings. For large-scale user-profile vectors, PolarQuant preserves Top-K retrieval accuracy better than simple linear quantization.
3. Algorithm Example
For data-engineering professionals, the choice of data structure determines the algorithm's underlying efficiency. In Python, we don't use the native list for TurboQuant / PolarQuant — we use numpy.ndarray for contiguous-memory layout.
3.1. FWHT (Fast Walsh-Hadamard Transform)
This section compares a dense $O(n^2)$ rotation with the $O(n \log n)$ FWHT, starting from the underlying data structure.
(1) Data Structure: Defining Vectors
A scientific way to define a vector in Python is via a NumPy array — at the C level it is a contiguous memory buffer that supports SIMD instruction-set optimization.
import numpy as np
# 1. Define a vector (rank-1 tensor)
# Use dtype=np.float32 to mimic single-precision floats in inference engines
v = np.array([1.2, 3.4, 5.6, 7.8], dtype=np.float32)
# 2. Element-wise operations
v_scaled = v * 2.0 # scale
v_sum = v + 1.0 # offset (broadcasting)
v_norm = np.linalg.norm(v) # L2 norm
(2) Vector Rotation: The Geometry Step
PolarQuant's rotation is essentially a linear transformation:
- Math: $y = Wx$ where $W$ is orthogonal, satisfying $W^T W = I$
- Data structure: $W$ is a 2D matrix np.ndarray (shape=(n, n))
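The orthogonality condition is what makes the rotation lossless. A short derivation, using only $W^T W = I$ from above:

$$
\|Wx\|_2^2 = (Wx)^T (Wx) = x^T W^T W x = x^T x = \|x\|_2^2,
\qquad
\langle Wx, Wy \rangle = x^T W^T W y = \langle x, y \rangle .
$$

Norms and inner products are therefore unchanged when both vectors are rotated by the same $W$, and the original vector is recovered with $x = W^T y$.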
(3) Optimized Implementation: FWHT ($O(n \log n)$)
PolarQuant uses Fast Walsh-Hadamard Transform (FWHT) to simulate random rotation.
- Direct matrix multiplication is $O(n^2)$.
- FWHT avoids the huge matrix and uses recursive butterfly operations.
- Python implementation looks similar to FFT and dramatically reduces compute.
def fwht(a):
    """Recursive Fast Walsh-Hadamard Transform (orthonormal scaling).
    a: input vector (np.ndarray); length must be a power of 2.
    """
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of 2")
    if n == 1:
        return a
    # Split into front/back halves and transform each recursively
    a_left = fwht(a[0 : n // 2])
    a_right = fwht(a[n // 2 : n])
    # Butterfly: (x + y), (x - y)
    res = np.zeros(n, dtype=a.dtype)
    res[0 : n // 2] = a_left + a_right
    res[n // 2 : n] = a_left - a_right
    # Scaling by 1/sqrt(2) at every level keeps the transform orthogonal (L2 norm preserved)
    return res / np.sqrt(2)
# (1) Run and verify
# Simulate one block in KV-Cache (dim=8)
kv_block = np.array([10.0, 1.0, 0.5, -2.0, 5.0, 0.0, 1.1, 0.2], dtype=np.float32)
# Run fast rotation
rotated_kv = fwht(kv_block)
print(f"Original KV: {kv_block}")
print(f"Rotated KV: {rotated_kv}")
print(f"Norm Check: {np.linalg.norm(kv_block):.4f} == {np.linalg.norm(rotated_kv):.4f}")
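A plain Hadamard transform is deterministic, so a common way to get a random rotation out of it is to flip the sign of each coordinate at random before transforming. The sketch below shows that construction with an iterative (non-recursive) FWHT. It is one standard way to randomize the transform, not necessarily what TurboQuant ships, and the names `fwht_iterative`, `random_rotate` and `random_unrotate` are hypothetical.

```python
import numpy as np

def fwht_iterative(x):
    """Orthonormal Walsh-Hadamard transform in O(n log n), without recursion."""
    a = np.array(x, dtype=np.float64)  # work on a copy
    n = a.shape[0]
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    h = 1
    while h < n:
        blocks = a.reshape(-1, 2, h)  # each block of 2h values = two halves of length h
        a = np.stack((blocks[:, 0] + blocks[:, 1],
                      blocks[:, 0] - blocks[:, 1]), axis=1).reshape(n)
        h *= 2
    return a / np.sqrt(n)

def random_rotate(x, signs):
    """y = H D x, where D = diag(signs) holds random +1/-1 entries."""
    return fwht_iterative(signs * x)

def random_unrotate(y, signs):
    """x = D H y, since both D and the orthonormal H are their own inverses."""
    return signs * fwht_iterative(y)

rng = np.random.default_rng(0)
block = np.array([10.0, 1.0, 0.5, -2.0, 5.0, 0.0, 1.1, 0.2])
signs = rng.choice([-1.0, 1.0], size=block.shape[0])

rotated = random_rotate(block, signs)
restored = random_unrotate(rotated, signs)
print(f"Rotated:  {rotated}")
print(f"Restored: {restored}")
```

Only the sign vector has to be stored (or regenerated from a seed) to undo the rotation, which is far cheaper than keeping an $n \times n$ matrix.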
(4) Adoption Guide
- Setup environment: `pip install numpy scipy`
- Workflow integration:
  - Input: a tensor from an LLM intermediate layer
  - Transform: call `fwht()` to rotate
  - Quantize: apply PolarQuant to the rotated vector (extract magnitude, low-bit quantize angles)
  - Reverse (de-quantize): at inference read-back, de-quantize to floats; call `fwht()` again (the inverse of the Hadamard transform is the transform itself, up to a scalar; with the per-level $1/\sqrt{2}$ scaling used in `fwht()` above, that scalar is 1)
- Complexity comparison:
| Data scale ($n$) | Matrix multiply ($n^2$) | FWHT ($n \log n$) | Speedup |
|---|---|---|---|
| 1024 (LLM dim) | 1,048,576 | 10,240 | ~100× |
| 4096 (Llama-3) | 16,777,216 | 49,152 | ~340× |
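The table entries can be reproduced with a few lines of arithmetic. The helper name `rotation_op_counts` is hypothetical, and the counts ignore constant factors and memory traffic, so the speedup is an order-of-magnitude estimate rather than a measured benchmark.

```python
import math

def rotation_op_counts(n):
    """Rough operation counts for a dense rotation vs. FWHT on a length-n vector."""
    dense = n * n                    # one multiply-add per matrix entry
    fast = n * int(math.log2(n))     # log2(n) butterfly stages of n operations each
    return dense, fast, dense / fast

for n in (1024, 4096):
    dense, fast, speedup = rotation_op_counts(n)
    print(f"n={n}: dense={dense:,}  fwht={fast:,}  speedup~{speedup:.0f}x")
```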
3.2. Converting Rotated Vectors to Polar Coordinates
The most important factor for adoption is memory layout. The example shows how to take the rotated vector to polar coordinates and apply 3-bit quantization (a common simplification of polar-coordinate quantization).
(1) Data Structure: Polar Representation
Two independent ndarrays store the polar-coordinate data, saving roughly 75% space versus storing the original float vector:
- Radius ($\rho$): a scalar in float16 or float32; the magnitude in n-dim space.
- Angles ($\theta$): an int8 array. Although we use only 3 bits (values 0-7), NumPy's smallest integer dtype is 8 bits wide (int8/uint8, 1 byte), so each code occupies a full byte until it is bit-packed.
(2) Core Operations: Transformation & Quantization
Decompose into: Cartesian → Polar → Quantize → Dequantize.
- Vector to angles (encode): for an n-dim vector $V$, polar form has 1 radius and $n-1$ angles.
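- $\rho$ (magnitude): $\rho = \sqrt{\sum_{i=1}^{n} x_i^2}$
- $\theta$ (angles): $\theta_i = \arccos\left(x_i / \sqrt{\sum_{j \ge i} x_j^2}\right)$ for $i < n-1$, where the denominator is the norm of the remaining components; the last angle must also carry the sign of $x_n$, e.g. $\theta_{n-1} = \operatorname{atan2}(x_n, x_{n-1})$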
- Bit-packing: since 3 bits don't divide 8 evenly, in industrial pipelines use `np.packbits` or bit-shifts to pack multiple 3-bit values into a `uint8` array; that's where the real compression comes from.
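The bit-packing step can be sketched with `np.unpackbits` and `np.packbits`. The helper names `pack_codes` and `unpack_codes` are hypothetical, and a production kernel would more likely use bit-shifts in C++ or CUDA; this version is meant only to show the storage layout.

```python
import numpy as np

def pack_codes(codes, bits=3):
    """Pack small integer codes (each < 2**bits) into a dense uint8 byte stream."""
    codes = np.asarray(codes, dtype=np.uint8)
    # Expand every code to 8 bits (most significant first), keep only the low `bits` bits
    bit_matrix = np.unpackbits(codes[:, None], axis=1)[:, -bits:]
    return np.packbits(bit_matrix.ravel())  # zero-pads the final byte if needed

def unpack_codes(packed, count, bits=3):
    """Recover `count` codes from the packed byte stream."""
    flat = np.unpackbits(packed)[: count * bits]
    weights = 1 << np.arange(bits - 1, -1, -1)  # e.g. [4, 2, 1] for 3 bits
    return (flat.reshape(count, bits) * weights).sum(axis=1).astype(np.uint8)

codes = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=np.uint8)
packed = pack_codes(codes, bits=3)
restored = unpack_codes(packed, count=len(codes), bits=3)

print(f"unpacked storage: {codes.nbytes} bytes, packed storage: {packed.nbytes} bytes")
print(f"round trip ok: {np.array_equal(codes, restored)}")
```

Eight 3-bit codes occupy 24 bits, so they fit in 3 bytes instead of 8; the element count has to be stored alongside the bytes because the last byte may contain padding.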
(3) Complete Python Pipeline
This code shows the full path from raw data to compression:
import numpy as np

def polar_quantize_pipeline(vector, bits=3, seed=0):
    """Simplified PolarQuant pipeline: rotate → polarize → quantize."""
    dim = len(vector)
    # Use a fixed seed for reproducibility
    rng = np.random.default_rng(seed)
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    rotated_v = q @ vector
    # Magnitude
    radius = np.linalg.norm(rotated_v)
    # Angles: simplified, using the normalized components (each in [-1, 1])
    normalized_v = rotated_v / (radius + 1e-9)
    # 3-bit quantization: map [-1, 1] to integer levels 0 .. levels - 1
    levels = 2 ** bits
    quantized_angles = np.round((normalized_v + 1) / 2 * (levels - 1)).astype(np.int8)
    quantized_angles = np.clip(quantized_angles, 0, levels - 1)
    return radius, quantized_angles, q

def dequantize(radius, quantized_angles, rotation_matrix, bits=3):
    """Dequantize: restore to the original space."""
    levels = 2 ** bits
    # Recover from int to [-1, 1]
    reconstructed_v = (quantized_angles / (levels - 1)) * 2 - 1
    # Renormalize and apply radius
    reconstructed_v = reconstructed_v / np.linalg.norm(reconstructed_v) * radius
    # Inverse rotation (orthogonal matrix's inverse equals its transpose)
    original_space_v = rotation_matrix.T @ reconstructed_v
    return original_space_v
# (1) Example
# Define an 8-dim feature vector
raw_data = np.array([55.2, -10.5, 0.1, 120.8, -3.3, 45.0, 7.7, -1.2], dtype=np.float32)
# Compress
r, q_angles, rot_m = polar_quantize_pipeline(raw_data)
# Decompress
recovered_data = dequantize(r, q_angles, rot_m)
print(f"Original: {raw_data[0:3]} ...")
print(f"Recovered: {recovered_data[0:3]} ...")
print(f"Compression Ratio (Theoretical): ~{32/3:.1f}x for angles")
print(f"Cosine Similarity: {np.dot(raw_data, recovered_data)/(np.linalg.norm(raw_data)*np.linalg.norm(recovered_data)):.4f}")
(4) Execution & Validation
- How to run:
  - Environment: ensure `numpy` is installed.
  - Inspect data structure: in debug mode, observe `q_angles`: every element is in `[0, 7]`, representable in 3 bits, while raw data takes 32 bits per value.
  - Cosine similarity: closer to 1.0 means the rotation + quantization preserved precision well.
- Adoption tips:
  - Vector DB: in production, this transformation is typically a C++ plugin attached to PostgreSQL or Milvus.
  - Data modeling: this algorithm can be part of feature engineering, useful for compressing storage of cold data.