TurboQuant

Why LucidPal uses a custom llama.cpp fork and what TurboQuant brings.


Background

TurboQuant (Zandieh et al., ICLR 2026) is a quantization algorithm published by Google. It compresses neural network tensors — both model weights and the KV cache used during inference — down to 1–2 bits per element using a two-step pipeline:

  1. Randomized Hadamard Transform — spreads energy uniformly across all coordinates, making the distribution analytically predictable.
  2. Lloyd-Max scalar quantization — computes the theoretically optimal quantization buckets for that distribution. No calibration data or fine-tuning required.
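The two steps can be sketched in isolation. Below is a minimal, self-contained Swift illustration, not the fork's implementation: the block handling is simplified, and the 2-bit Lloyd-Max boundaries and levels are the standard published values for a unit Gaussian (Max, 1960).

```swift
import Foundation

// Step 1: fast Walsh–Hadamard transform, in place.
// `x.count` must be a power of two; orthonormal scaling preserves the L2 norm.
func fwht(_ x: inout [Double]) {
    var h = 1
    while h < x.count {
        for i in stride(from: 0, to: x.count, by: h * 2) {
            for j in i..<(i + h) {
                let (a, b) = (x[j], x[j + h])
                x[j] = a + b
                x[j + h] = a - b
            }
        }
        h *= 2
    }
    let scale = 1.0 / Double(x.count).squareRoot()
    for i in x.indices { x[i] *= scale }
}

// Random sign flips before the transform spread energy across coordinates,
// pushing the output distribution toward a Gaussian regardless of the input.
func randomizedHadamard(_ x: [Double], signs: [Double]) -> [Double] {
    var y = zip(x, signs).map(*)
    fwht(&y)
    return y
}

// Step 2: Lloyd–Max scalar quantization to 2 bits.
// Optimal 4-level quantizer for a unit Gaussian: decision boundaries
// 0 and ±0.9816, reconstruction levels ±0.4528 and ±1.5104.
let lloydMaxLevels = [-1.5104, -0.4528, 0.4528, 1.5104]

func quantize2bit(_ y: [Double], scale: Double) -> [Int] {
    y.map { v in
        let z = v / scale
        if z < -0.9816 { return 0 }
        if z < 0       { return 1 }
        if z < 0.9816  { return 2 }
        return 3
    }
}

func dequantize2bit(_ q: [Int], scale: Double) -> [Double] {
    q.map { lloydMaxLevels[$0] * scale }
}
```

Because the Gaussian shape of the transformed coordinates is known in advance, the bucket boundaries are fixed analytically, which is why no calibration data is needed.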

The fork: TheTom/llama-cpp-turboquant

The official llama.cpp doesn't include TurboQuant yet (an upstream PR is open). LucidPal is built against TheTom's fork, which implements TurboQuant with Metal GPU kernels for Apple Silicon.

The fork introduces new GGML quantization types:

GGML type            Bits per element   Notes
GGML_TYPE_TQ1_0      1-bit              Maximum weight compression — available in xcframework, not used in LucidPal
GGML_TYPE_TQ2_0      2-bit              Near-lossless weight compression — available in xcframework, not used in LucidPal
GGML_TYPE_TURBO4_0   4-bit              Active in LucidPal — used for KV cache (type_k and type_v)

All three types are compiled into LucidPal's llama.xcframework — the quantize, dequantize, and dot-product kernels, including the Metal GPU kernels for TQ1_0, TQ2_0, and TURBO4_0, are present in the binary.


Current status: KV cache active

GGML_TYPE_TURBO4_0 KV cache compression is enabled in LucidPal as of the feature/turboquant-kv-cache branch. Both type_k and type_v are set to TURBO4_0 at model load in LlamaActor.loadSingleModel:

cp.type_k = GGML_TYPE_TURBO4_0
cp.type_v = GGML_TYPE_TURBO4_0

The dedicated Metal attention kernels for TURBO4_0 landed in TheTom's fork and are compiled into LucidPal's llama.xcframework. On every model load, LucidPal logs:

KV cache types: type_k=turbo4_0 type_v=turbo4_0

The active type is visible in Settings → Advanced → KV Cache (shows turbo4_0). Model weights continue to ship as Q4_K_M GGUF files — the TQ1/TQ2 weight-quantized formats remain available for future use.


Context windows with TURBO4_0 KV cache compression

LlamaActor selects context size at model load time based on device RAM (ProcessInfo.processInfo.physicalMemory):

Device RAM   Context window (n_ctx)   Batch capacity
< 6 GB       4096 tokens              8192 (shared batch)
≥ 6 GB       8192 tokens              8192 (shared batch)
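A back-of-the-envelope estimate shows why 4-bit KV compression matters at these context sizes. The model dimensions below are illustrative only, not LucidPal's actual models, and the formula ignores grouped-query attention, which stores fewer K/V channels and shrinks real sizes further.

```swift
// Estimated KV cache size: two tensors (K and V) per layer,
// each n_ctx * n_embd elements, at the given bits per element.
func kvCacheBytes(nCtx: Int, nLayer: Int, nEmbd: Int, bitsPerElement: Int) -> Int {
    2 * nCtx * nLayer * nEmbd * bitsPerElement / 8
}

// Illustrative 3B-class dimensions (assumed, not from the source).
let nLayer = 28
let nEmbd = 3072

let f16Bytes = kvCacheBytes(nCtx: 8192, nLayer: nLayer, nEmbd: nEmbd, bitsPerElement: 16)
let tq4Bytes = kvCacheBytes(nCtx: 8192, nLayer: nLayer, nEmbd: nEmbd, bitsPerElement: 4)
// For these assumed dimensions: 2.625 GiB at f16 vs 0.656 GiB at 4-bit,
// a 4x reduction that leaves headroom for the 8192-token window on 6 GB devices.
```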

The constants are defined in LLMConstants:

static let smallContextSize: UInt32 = 4096   // < 6 GB RAM
static let largeContextSize: UInt32 = 8192   // ≥ 6 GB RAM
static let largeContextRAMThresholdGB = 6
static let batchCapacity: Int32 = 8192       // shared by both model slots
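A sketch of the selection logic those constants imply. The helper function name is hypothetical; LlamaActor's actual code may be structured differently.

```swift
import Foundation

enum LLMConstants {
    static let smallContextSize: UInt32 = 4096   // < 6 GB RAM
    static let largeContextSize: UInt32 = 8192   // >= 6 GB RAM
    static let largeContextRAMThresholdGB = 6
}

// Hypothetical helper mirroring the RAM-based selection described above.
// In production this would be called with ProcessInfo.processInfo.physicalMemory.
func contextSize(forPhysicalMemory bytes: UInt64) -> UInt32 {
    let thresholdBytes = UInt64(LLMConstants.largeContextRAMThresholdGB) * 1_073_741_824
    return bytes >= thresholdBytes
        ? LLMConstants.largeContextSize
        : LLMConstants.smallContextSize
}
```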

Sources