An Interactive Reading of

Make Attention
Sub-Quadratic Again

Learned Search Projections for Attention Candidate Retrieval
The paper, in plain English

Every time a large language model generates a new word, it looks back at every previous word it has written and asks: "which of these should I pay attention to?" That look-back is called self-attention, and its cost grows with the square of the sequence length. At 4,000 tokens the scoring matrix has 16 million entries. At 1 million tokens it has a trillion. The model pays the full scoring cost for every one of those entries, yet a surprisingly large fraction of the attention probability mass lands on just a handful of keys.

The paper's insight is simple but subtle: you cannot just throw a nearest-neighbor search index at the model's internal vectors and hope it works. Those vectors were never trained to be good neighbors to each other. Instead, the paper attaches a tiny trainable search layer to selected attention layers. This layer learns to project hidden states into a shared low-dimensional space where "nearest neighbor" actually means "the keys the teacher model would have attended to most strongly." Two training losses — a contrastive objective that rewards pulling teacher-preferred keys closer, and a KL divergence that matches the full attention distribution — teach these projections to retrieve the right candidates.

The first headline result, from a clean 6-layer pilot in Qwen3-4B, was that learned search projections preserve full-attention perplexity within +0.01% at K=256 while scoring only 256 keys instead of all prior tokens. Learned retrieval captures more teacher-attention mass than Quest (a strong page-based baseline) at equal token budgets, and the learned vectors are compatible with off-the-shelf FAISS/HNSW approximate nearest-neighbor search.

The new broad-layer experiments push the question further: how much of the model can actually run this way? Substituting all 36 layers is feasible but costs quality (+3.23% relative perplexity gap). Per-layer diagnostics identify layers 0–2 as the weakest contributors, so reserving those — along with the final layer — yields a 32-of-36-layer reserved-edge configuration with only +1.746% PPL gap, Recall@K 0.825, and 20.97M trainable parameters. A post-hoc K-sweep on this checkpoint actually matches full attention at K=256 (−0.062% gap on a 2-batch slice). The picture that emerges is that coverage is a Pareto knob, not a binary choice: trading a small amount of model quality for a much larger reduction in candidate-scoring cost is something you can dial in.

The honest framing stays the same: this shows ANN-compatible retrieval can work — now at near-full-model scale, not just a pilot — but it does not yet beat existing methods in wall-clock speed on today's hardware. Decode-mode KV-cache integration, multi-seed confidence intervals, and long-context task validation are explicitly listed as next experiments.

I
Learned Search Projections
Small trainable projections map hidden states into a shared $d_{\text{search}} = 128$-dimensional retrieval space where nearest-neighbor search actually recovers attention-relevant keys.
II
Dual Distillation
A contrastive InfoNCE loss pulls teacher-topK keys closer, while a KL distillation loss aligns the full search distribution with the teacher's attention pattern.
III
ANN Compatibility
The learned vectors plug directly into off-the-shelf FAISS/HNSW, tracking exact retrieval quality within +0.03% PPL on the clean evaluation slice.
Chapter 1

The Quadratic Wall

Self-attention is the engine that gives transformer language models their power. For every new token a model generates, it computes a score against every previous token in the sequence, passes those scores through a softmax, and uses the resulting weights to blend value vectors into an output. The scoring step is where the cost hides.

Scaled Dot-Product Attention
$$A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_h}} + M\right), \quad O = AV$$

Here $Q, K, V$ are the query, key, and value tensors, $d_h$ is the head dimension, and $M$ is a causal or block-causal mask. The matrix $QK^\top$ contains $\mathcal{O}(N^2)$ pairwise scores — one for every (query, key) pair in the eligible set.
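
For concreteness, here is a minimal single-head PyTorch sketch of this computation (the shapes and the plain causal mask are illustrative; this is not the paper's code):

```python
import math
import torch

def full_attention(Q, K, V):
    """Dense causal attention: scores every (query, key) pair, quadratic in sequence length."""
    N, d_h = Q.shape
    scores = Q @ K.T / math.sqrt(d_h)                     # (N, N) pairwise scores
    future = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))    # M: forbid attending to future keys
    A = torch.softmax(scores, dim=-1)                     # rows sum to 1
    return A @ V                                          # blend value vectors

Q, K, V = (torch.randn(32, 128) for _ in range(3))
O = full_attention(Q, K, V)   # the (N, N) score matrix is where the quadratic cost lives
```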

Interactive heatmap (N = 32; the chart updates as you drag the slider) showing relative attention weight intensity. Readouts at N = 32: scoring entries 1,024; top-5% attention mass 87%; FLOPs per query 4,096.

The attention matrix is almost always sparse in probability mass — a few keys receive most of the weight. The entire paper rests on one question: can we find those few keys without scoring all of them?

Chapter 2

Why Native Vectors Fail

The obvious idea: build an approximate-nearest-neighbor (ANN) index over the transformer's own key vectors. For each query, retrieve the nearest keys in Euclidean space, and attend only to those. The problem is that native query and key vectors are not trained to be mutual nearest neighbors.

Per-query scoring cost
$$s_{tj} = \frac{q_t^\top k_j}{\sqrt{d_h}}, \qquad C_{\text{full}}(N) = N \cdot d_h$$

RetrievalAttention handles the mismatch by making the index attention-aware. This paper takes the opposite route: instead of fixing the index, it fixes the vectors.

Interactive demo (slider settings 60% and 30°). Left: native Q/K distributions, whose offset means ANN retrieval misses. Right: the learned space, where aligned distributions enable ANN. Readouts: native ANN recall 0.41; learned ANN recall 0.89.

RetrievalAttention adapts the index to native Q/K vectors. This paper adapts the vectors to the index. The trade-off: RetrievalAttention is training-free; this method requires a small training phase. The benefit: standard ANN machinery works out of the box.
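
To make the mismatch concrete, here is a toy numpy experiment (random vectors stand in for untrained native Q/K, and every number is illustrative): when key norms vary, the keys closest in Euclidean distance are often not the keys with the largest dot-product scores, which is exactly what an off-the-shelf ANN index would get wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, N, topk = 128, 1024, 16

# Stand-ins for native vectors: nothing ties query and key geometry together,
# and key norms vary, so max-inner-product and nearest-Euclidean disagree.
q = rng.normal(size=d_h) + 2.0
keys = rng.normal(size=(N, d_h)) * rng.uniform(0.5, 2.0, size=(N, 1))

dot_top = np.argsort(-(keys @ q))[:topk]                     # what attention actually scores highly
l2_top = np.argsort(((keys - q) ** 2).sum(axis=1))[:topk]    # what a Euclidean ANN index returns

overlap = len(set(dot_top) & set(l2_top)) / topk
print(f"top-{topk} overlap between dot-product and L2 retrieval: {overlap:.2f}")
```

L2-normalizing both sides, as the learned search space does, removes the norm term and makes nearest-neighbor and highest-similarity agree.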

Chapter 3

Learning a Shared Search Space

The core idea: attach lightweight trainable projections to selected attention layers. Each projection maps the hidden state into a low-dimensional "search space" where queries and keys are trained to be mutually retrievable.

Search Projections
$$Q_i^s = h_i \, W_i^{Q_s}, \qquad K_i^s = h_i \, W_i^{K_s}$$

where $W_i^{Q_s}, W_i^{K_s} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{search}}}$ are per-layer trainable projections, and $h_i$ is the hidden state entering layer $i$'s self-attention module.
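
A sketch of what one such projection pair could look like in PyTorch (the class name, the bias-free choice, and the L2 normalization before retrieval are my assumptions; only these small matrices would be trained, and the base model stays frozen):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchProjection(nn.Module):
    """Per-layer trainable maps W_i^{Q_s}, W_i^{K_s} into the shared search space."""
    def __init__(self, d_model: int = 2560, d_search: int = 128):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_search, bias=False)   # W_i^{Q_s}
        self.k_proj = nn.Linear(d_model, d_search, bias=False)   # W_i^{K_s}

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, d_model) hidden state entering layer i's self-attention module.
        q_s = F.normalize(self.q_proj(h), dim=-1)   # L2-normalize, matching the contrastive loss
        k_s = F.normalize(self.k_proj(h), dim=-1)
        return q_s, k_s

layer = SearchProjection()
print(sum(p.numel() for p in layer.parameters()))   # 655,360 per layer; six layers ≈ 3.93M
```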

Interactive calculator (d_search = 128, 6 substituted layers): drag to see how d_search and the number of substituted layers affect parameter count and retrieval capacity. At these settings: 3.93M trainable parameters, 0.098% of the base model.

The base model weights are never modified. The approximation is in candidate selection only — once K candidates are found, the model's own native Q, K, and V are used for the actual attention computation.

Chapter 4

Dual Distillation

The search projections are trained by two complementary losses. The first is a contrastive InfoNCE loss that teaches the projections to rank teacher-preferred keys above distractors. The second is a KL-divergence loss that aligns the full search distribution with the teacher's attention pattern.

Contrastive Teacher-TopK Objective
$$\mathcal{L}_{\text{NCE}}(t) = -\log \frac{\sum_{j \in P_t} \exp(z_{tj})}{\sum_{j \in \mathcal{C}_t} \exp(z_{tj})}$$

where $P_t$ is the set of teacher top-$K_{\text{pos}}$ keys, $\mathcal{C}_t$ is the valid causal key set, and $z_{tj} = \frac{(\tilde{q}_t^s)^\top \tilde{k}_j^s}{\tau}$ is the search similarity between L2-normalized vectors scaled by temperature $\tau$.

Distribution-Level KL Distillation
$$\mathcal{L}_{\text{KL}}(t) = D_{\text{KL}}\!\left(A_i^T[t, \cdot] \;\|\; A_i^S[t, \cdot]\right) = \sum_{j \in \mathcal{C}_t} A_i^T[t,j] \log \frac{A_i^T[t,j]}{A_i^S[t,j]}$$

The total layer-averaged objective combines both with $\alpha = \beta = 1$:

$$\mathcal{L} = \frac{\alpha}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \mathcal{L}^i_{\text{NCE}} + \frac{\beta}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \mathcal{L}^i_{\text{KL}}$$
Interactive demo (temperature 0.50, positive set size 16, weight 1.0): adjust temperature and positive set size to see how the contrastive loss landscape changes.

The teacher distribution is reconstructed outside the model forward pass — by capturing native post-RoPE Q and K tensors and recomputing softmax. This avoids forcing the model onto slower eager attention paths during training.
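
Putting the chapter together, here is a hedged single-layer, single-head sketch of the training objective, assuming the post-RoPE native q and k for that layer have already been captured and that the search vectors are L2-normalized (function and variable names are mine, not the paper's):

```python
import math
import torch

def dual_distillation_loss(q_s, k_s, q_native, k_native, K_pos=16, tau=0.5):
    """q_s, k_s: (N, d_search) L2-normalized search vectors (trainable path).
    q_native, k_native: (N, d_h) captured post-RoPE teacher vectors (frozen)."""
    N, d_h = q_native.shape
    causal = torch.triu(torch.full((N, N), float("-inf")), diagonal=1)

    # Reconstruct the teacher attention rows A_i^T[t, :] outside the model forward pass.
    with torch.no_grad():
        A_T = torch.softmax(q_native @ k_native.T / math.sqrt(d_h) + causal, dim=-1)
        pos = A_T.topk(K_pos, dim=-1).indices   # teacher top-K_pos keys P_t (very short
                                                # prefixes pick up zero-probability slots)

    # Student search logits z_{tj} over the valid causal key set C_t.
    z = (q_s @ k_s.T) / tau + causal
    log_A_S = torch.log_softmax(z, dim=-1)

    # InfoNCE: probability mass the student places on the teacher-preferred keys.
    nce = torch.logsumexp(z, dim=-1) - torch.logsumexp(z.gather(-1, pos), dim=-1)

    # KL(A^T || A^S), summed over valid keys; masked slots contribute exactly zero.
    log_A_S_valid = log_A_S.masked_fill(torch.isinf(log_A_S), 0.0)
    kl = (torch.xlogy(A_T, A_T) - A_T * log_A_S_valid).sum(dim=-1)

    return nce.mean() + kl.mean()   # alpha = beta = 1; layer averaging happens outside
```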

Chapter 5

Retrieve Then Attend

At inference time, selected attention layers are replaced with a retrieve-then-attend pipeline. The model's native Q, K, V are computed as usual. Then the search projections find the top-K candidates. Attention is computed over only those candidates.

Substituted Sparse Attention
$$\hat{o}_t = \sum_{j \in S_t} \frac{\exp(q_t^\top k_j / \sqrt{d_h})}{\sum_{\ell \in S_t} \exp(q_t^\top k_\ell / \sqrt{d_h})} \, v_j$$

where $S_t$ is the retrieved candidate set for query $t$. The native $q_t$, $k_j$, $v_j$ are the frozen model's own vectors — the approximation is in the candidate set, not in the attention values.
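
A sketch of the substituted attention for a single query position, assuming the candidate set S_t has already been produced by the search projections (exactly or via an ANN index); names are illustrative:

```python
import math
import torch

def sparse_attention_row(q_t, K_native, V_native, S_t):
    """q_t: (d_h,) native query; K_native, V_native: (N, d_h); S_t: retrieved key indices."""
    k_sel, v_sel = K_native[S_t], V_native[S_t]        # gather only the retrieved keys/values
    scores = k_sel @ q_t / math.sqrt(q_t.shape[-1])    # native QK scores, only |S_t| of them
    weights = torch.softmax(scores, dim=-1)            # softmax renormalized over S_t
    return weights @ v_sel                             # \hat{o}_t

# Exact candidate selection in the learned search space for query position t:
#   S_t = torch.topk(q_s[t] @ k_s[: t + 1].T, k=min(K, t + 1)).indices
```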

Block-Causal Mask
$$M_{tj} = 0 \;\;\text{iff}\;\; \text{segment}(t) = \text{segment}(j) \;\text{and}\; j \le t$$

and $M_{tj} = -\infty$ otherwise. This prevents cross-document attention leakage in packed sequences.
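
Given a per-token segment id for a packed sequence, the mask is a few lines (a sketch; `segment_ids` is my name for the packing metadata):

```python
import torch

def block_causal_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """segment_ids: (N,) document id per token. Returns an additive mask:
    0 where segment(t) == segment(j) and j <= t, -inf everywhere else."""
    pos = torch.arange(len(segment_ids))
    allowed = (segment_ids[:, None] == segment_ids[None, :]) & (pos[None, :] <= pos[:, None])
    mask = torch.full(allowed.shape, float("-inf"))
    mask[allowed] = 0.0
    return mask

print(block_causal_mask(torch.tensor([0, 0, 0, 1, 1])))   # tokens 3-4 never see tokens 0-2
```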

Pipeline diagram (click any stage for details): the blue path shows the native attention flow; the orange path shows the search projection layer.

The sparse attention is not an approximation in value space. The model's own Q, K, V, RoPE, and output projection are all used exactly as in full attention. The only approximation is in which keys participate.

Chapter 6

The Complexity Payoff

The method replaces a linear scan over all N keys with sub-linear HNSW retrieval. The per-query scoring-cost proxy tells a clear story: full attention grows as $\mathcal{O}(N)$, Quest-style page selection also grows as $\mathcal{O}(N)$ (with a smaller constant), and learned HNSW retrieval grows as $\mathcal{O}(\log N)$.

Candidate-Scoring Proxies
$$C_{\text{full}}(N) = N \cdot d_h = 128N$$ $$C_{\text{Quest}}(N) = \frac{N}{P} \cdot 2d_h = 16N$$ $$C_{\text{HNSW}}(N) = M \cdot \text{ef}_{\text{search}} \cdot \log_2(N) \cdot d_{\text{search}} = 262{,}144 \cdot \log_2(N)$$
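
These proxies are plain arithmetic, so the crossover can be reproduced in a few lines under the constants used here (d_h = 128, Quest page size P = 16, HNSW M = 32, ef_search = 64, d_search = 128); the printed numbers line up with the readouts below.

```python
import math

d_h, P = 128, 16                      # head dim, Quest page size
M, ef_search, d_search = 32, 64, 128  # HNSW graph degree, search width, search dim

def c_full(n):  return n * d_h                                    # score every prior key
def c_quest(n): return (n / P) * 2 * d_h                          # min/max metadata per page
def c_hnsw(n):  return M * ef_search * math.log2(n) * d_search    # graph-traversal proxy

n = 1024
while c_hnsw(n) >= c_quest(n):        # first length where the HNSW proxy undercuts Quest
    n += 1024
print(f"Quest / HNSW crossover near N = {n:,}")                                # ~300K tokens
print(f"HNSW advantage at 1M tokens: {c_quest(1_000_000) / c_hnsw(1_000_000):.2f}x")  # roughly 3x
```
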
Interactive cost plot (N = 2²² ≈ 4.19M tokens, ef_search = 64): the crossover point shifts with ef_search and context length; zoom into the long-context regime to see where HNSW wins. Readouts: Quest / HNSW crossover ≈ 300K tokens; HNSW advantage at 1M tokens ≈ 3.0×.

This is a candidate-scoring proxy, not measured GPU runtime. The paper is honest: it does not yet prove wall-clock speedup. But the asymptotic shape — $\mathcal{O}(\log N)$ vs $\mathcal{O}(N)$ — means the advantage must materialize at sufficiently long contexts.

Chapter 7

Near-Parity Perplexity

The clean block-causal experiment substitutes 6 layers of Qwen3-4B-Instruct-2507 with learned sparse attention. On WikiText-103 with 4096-token sequences, the method preserves full-attention perplexity within a razor-thin margin.

Interactive K slider (K = 128): drag K to see the trade-off between retrieval budget and quality metrics. At K = 128: Recall@K 0.744, Mass@K 0.787, PPL gap +0.07%.

The learned projection matches or slightly exceeds raw-QK oracle retrieval at every tested layer. This means the search space is not just adequate — it is genuinely capturing attention-relevant geometry that raw native vectors miss.
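
Recall@K and Mass@K are cheap to compute once a teacher attention row and the retrieved candidate set are in hand. The definitions below are the usual ones for these names and match how the numbers are used here, but the evaluation code itself is mine, not the paper's:

```python
import torch

def recall_and_mass_at_k(teacher_row: torch.Tensor, retrieved: torch.Tensor):
    """teacher_row: (N,) teacher attention probabilities for one query.
    retrieved: (K,) key indices returned by the learned search (or an ANN index)."""
    k = min(len(retrieved), int((teacher_row > 0).sum()))     # at most the valid causal keys
    teacher_topk = teacher_row.topk(k).indices
    recall_at_k = torch.isin(retrieved, teacher_topk).sum().item() / k
    mass_at_k = teacher_row[retrieved].sum().item()           # teacher attention mass covered
    return recall_at_k, mass_at_k
```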

Chapter 8

Quest vs. Learned

Quest is a strong baseline: it selects KV pages using query-aware min/max metadata, is training-free, and directly targets the KV-cache memory bottleneck. How does learned search compare?

FAISS/HNSW Compatibility

A CPU FAISS/HNSW prototype tracks exact learned retrieval on the clean evaluation slice:

Method               K     PPL     Rel. PPL gap   Filler rate
Learned exact        128   30.47   +0.07%         n/a
Learned FAISS/HNSW   128   30.47   +0.09%         0.447
Learned exact        256   30.45   +0.01%         n/a
Learned FAISS/HNSW   256   30.46   +0.04%         0.683

The filler rate is expected for short same-segment prefixes where fewer than K valid causal keys exist. Filler slots are masked out of the sparse-attention softmax.
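
A minimal FAISS/HNSW sketch of this kind of retrieval step, assuming the learned, L2-normalized search keys for one layer and head are available as a float32 array (index parameters mirror the proxy constants above; this is an illustration, not the paper's prototype code):

```python
import faiss
import numpy as np

d_search, M, ef_search, K = 128, 32, 64, 256

# Learned search keys for one layer/head, L2-normalized float32.
keys_s = np.random.randn(4096, d_search).astype("float32")
keys_s /= np.linalg.norm(keys_s, axis=1, keepdims=True)

index = faiss.IndexHNSWFlat(d_search, M)   # L2 metric; equivalent to cosine on unit vectors
index.hnsw.efConstruction = 200
index.hnsw.efSearch = ef_search
index.add(keys_s)

q_s = keys_s[:8].copy()                    # a few (normalized) search queries
_, candidates = index.search(q_s, K)       # candidate ids feed the sparse attention step
```

Causally invalid or filler candidates are then masked out of the sparse-attention softmax, as described above.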

Learned search captures more teacher mass at equal K, but perplexity does not currently show a clean advantage over Quest. The contribution is retrieval fidelity and ANN compatibility, not a PPL win. The paper earns points for honesty.

Chapter 9

Broad-Layer Substitution

The 6-layer pilot shows near-parity. But what happens when we substitute nearly every layer? The paper now reports two broader experiments: an all-36-layer substitution and a 32-layer "reserved-edge" configuration that holds back the weakest layers.

All-32 Reserved-Edge Configuration
$$\mathcal{I}_{\text{sub}} = \{3, 4, \ldots, 34\}, \quad \text{reserved} = \{0, 1, 2, 35\}$$ $$\text{Params} = 32 \times 2 \times d_{\text{model}} \times d_{\text{search}} = 32 \times 2 \times 2560 \times 128 = 20.97\text{M}$$
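
The parameter cost of any coverage choice follows directly from this formula; the 36-layer figure below is the same arithmetic extended to every layer (a quick check, not a number quoted from the paper):

```python
d_model, d_search = 2560, 128

def search_params_millions(n_layers: int) -> float:
    """Trainable search-projection parameters for n substituted layers, two matrices each."""
    return n_layers * 2 * d_model * d_search / 1e6

for n_layers in (6, 32, 36):
    print(f"{n_layers:>2} layers: {search_params_millions(n_layers):.2f}M trainable params")
# -> 6 layers: 3.93M, 32 layers: 20.97M, 36 layers: 23.59M
```
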
All-32 best PPL gap: +1.746%
All-36 PPL gap: +3.227%
Trainable params (All-32): 20.97M

Left: All-32 training trajectory over 1000 steps (Table 3). Recall@K plateaus at ~0.825; PPL gap stabilizes near +1.75%. Right: Coverage vs quality showing the three tested configurations (Table 5).

Per-Layer Diagnostics

Layer-wise retrieval analysis on the all-36 experiment reveals that layers 0, 1, and 2 have substantially lower Mass@K than the interior layers. The reserved-edge strategy directly addresses this: hold back the weakest layers, substitute the strong ones. Layer 35 (the final layer) is also reserved as a conservative choice.

Coverage is not a binary switch — it is a Pareto knob. Six layers gives near-parity (+0.07%); 32 layers gives 89% coverage at +1.75%; all 36 pushes to +3.23%. The practitioner chooses where on this frontier to operate.

Chapter 10

K-Sweep Diagnostics

After training the All-32 reserved-edge model for 1000 steps, the paper performs a post-hoc exact K-sweep on a 2-batch clean block-causal evaluation slice. The sweep varies K from 16 to 256, revealing the retrieval budget vs quality trade-off for the broad-layer configuration.

All-32 K-Sweep Summary (Table 4)
$$\text{K=256: Mass@K} = 0.902, \;\text{PPL gap} = -0.062\% \quad\text{(near-parity across 32 layers)}$$
Interactive K slider for the All-32 configuration (K = 128): drag K to see how retrieval quality and perplexity change. At K = 128: Mass@K 0.807, Recall@K 0.746, PPL 20.66, relative PPL gap +0.590%.

At K=256, the All-32 configuration achieves exact parity or better on the 2-batch evaluation slice (−0.062%). This is measured on a small slice and should be interpreted cautiously, but it demonstrates that broad-layer substitution is feasible when the retrieval budget is sufficient.

Chapter 11

The Road Ahead

The pilot result is encouraging but narrow. The paper is explicit about what it does not prove — and equally explicit about what must happen next for the claim to grow from "promising prototype" to "practical system."

What the result proves

Learned search projections can select attention candidates through standard ANN machinery while preserving perplexity on the evaluated slices, from the 6-layer pilot at near-parity to the 32-layer reserved-edge configuration within roughly +1.7%.

What the result does not prove

Wall-clock speedup on current hardware, quality on long-context tasks, decode-mode behavior with an incrementally updated index, and generality beyond Qwen3-4B.

Required next steps

  1. Multi-seed confidence intervals for all reported results.
  2. Full 36-layer substitution with layer-wise training strategies to improve weak edge layers.
  3. Coverage Pareto sweep: 12-layer, 18-layer, 20-layer configurations to map the full frontier.
  4. Long-context evaluation on LongBench, RULER, passkey retrieval, and needle-in-haystack.
  5. Decode-mode KV-cache integration with incremental index updates.
  6. GPU-resident retrieval and fused sparse gather-attention kernels.
  7. Measured wall-clock latency and memory footprint.
  8. Cross-model validation beyond Qwen3-4B.

A radar chart summarizing the current state of evidence across six dimensions. The broad-layer experiments have pushed full-layer coverage to 89% (32/36 layers), but wall-clock latency, long-context quality, and cross-model generality remain unproven.

The correct framing is aspirational but increasingly supported: learned search projections make attention-relevant key selection compatible with standard ANN retrieval while largely preserving model quality, from near-parity in the 6-layer pilot to a small gap under 32-layer broad substitution. Coverage is a quality knob, not a binary switch. The path to a stronger claim is clear, and the paper names every missing piece.