Note on Attention, Transformers, and the New Linear Models
Overview
I came across a YouTube talk on Kimi Linear and found the ideas compelling enough to jot down a quick, structured note. With GPU demand surging and memory getting pricier, architectures that reduce KV-cache and speed up decoding without hurting quality are increasingly important.
Self-Attention & the Transformer (Quick Primer)
From sequential RNNs to attention.
Before 2017, NLP models relied on RNNs/LSTMs that processed text one token at a time and struggled with long-range dependencies.
The attention mechanism.
Instead of strict left-to-right processing, attention lets each token "look at" all others:
- Query, Key, Value (Q/K/V):
  - Query – what a token is looking for
  - Key – what a token represents
  - Value – the information carried by the token
- Weighted relationships. The model scores Q·Kᵀ to decide which tokens to focus on, then aggregates the corresponding Values.
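The Q/K/V recipe above fits in a few lines of NumPy. This is a toy single-head sketch for intuition; shapes and names are illustrative, not taken from any particular implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head (toy sketch)."""
    d_k = K.shape[-1]
    # score each Query against every Key; scale stabilizes the softmax
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over keys: each row becomes a set of attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # aggregate the corresponding Values, weighted by relevance
    return weights @ V
```

Each output row is a weighted mix of all Values, with weights derived from the Q·Kᵀ scores; multi-head attention simply runs several of these in parallel with different projections.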
Transformers (2017, "Attention Is All You Need").
- Parallel processing of entire sequences (faster training than RNNs).
- Multi-Head Attention learns multiple relationship patterns (syntax, semantics, long-range links).
- Foundation of modern LLMs (e.g., GPT, BERT).
What Kimi Linear Proposes (Doc dated 2025-11-01)
Kimi Linear introduces a hybrid architecture that interleaves linear attention with occasional full attention, aiming to match or beat full attention under fair training conditions while being cheaper at long context lengths.
Key Ideas
- Kimi Delta Attention (KDA). A linear-attention module extending Gated DeltaNet. It moves from head-wise forgetting to channel-wise gating, so each feature dimension can "forget" at its own rate, making better use of the limited (finite-state) recurrent memory common to linear attention.
- Hardware-efficient chunkwise algorithm. Based on a specialized Diagonal-Plus-Low-Rank (DPLR) transition structure, tailored to reduce compute vs. more general DPLR variants while staying consistent with the delta-rule view.
- Hybrid stacking (3:1). The recipe emphasized is 3 KDA layers : 1 full-attention layer, retaining global information flow while keeping most layers fast and KV-cache-light.
- Design choices.
  - No positional embeddings (NoPE) to better extend to very long contexts; compared against a RoPE variant.
  - MoE setup: an example configuration mentions 48B total / 3B activated (Mixture-of-Experts) when comparing to matched baselines.
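For intuition on the channel-wise gating idea, here is a toy recurrence in the spirit of a gated delta rule. This is my own simplified sketch, not the paper's algorithm or kernel; `alpha` (per-channel decay) and `beta` (write strength) are hypothetical names:

```python
import numpy as np

def kda_like_step(S, k, v, alpha, beta):
    """One toy recurrent step: channel-wise forget, then a delta-rule write.

    S: (d_v, d_k) finite-state memory; k: (d_k,) key; v: (d_v,) value;
    alpha: (d_v,) per-channel decay in (0, 1); beta: scalar write strength.
    """
    S = alpha[:, None] * S                 # each value channel forgets at its own rate
    pred = S @ k                           # what the memory currently returns for this key
    S = S + beta * np.outer(v - pred, k)   # delta rule: nudge the readout toward v
    return S
```

Head-wise forgetting would collapse `alpha` to a single scalar per head; the channel-wise version gives each of the `d_v` feature dimensions its own decay rate, which is the distinction the KDA description emphasizes.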
Claimed Empirical Results
- Quality across regimes. Reported as the first hybrid linear-attention approach to outperform full attention under matched training (short-context, long-context, and RL post-training).
- Long-context leaderboards (e.g., 128k). Strong results on suites like RULER and RepoQA, with the best average among compared models at very long context.
- RL behavior (math-focused). Under the same RL setup, training/test curves show faster improvement and higher final scores than a full-attention baseline.
- Scaling & training setup. "Fair" comparisons reportedly use 4,096 pretrain context and 1.4T tokens (shared across models). A longer-trained checkpoint (5.7T tokens) is mentioned with support up to 1M context.
Note: These are their reported results in the referenced document; verify against the latest numbers as the ecosystem moves quickly.
Efficiency Wins
- KV-cache reduction. With most layers using linear attention, reported up to ~75% less KV-cache usage in long-sequence generation.
- Decoding throughput at long context. At 1M tokens, they report up to ~6.3× faster time-per-output-token (TPOT), by leveraging the memory savings to run larger batches, and sustaining low TPOT as decode length grows.
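The ~75% figure is consistent with simple bookkeeping: in a 3:1 stack, only one layer in four keeps a KV-cache. A back-of-envelope calculation with made-up model dimensions (a hypothetical 32-layer, fp16-cache model):

```python
def kv_cache_bytes(caching_layers, heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V tensors are each (heads, seq_len, head_dim) per caching layer
    return 2 * caching_layers * heads * seq_len * head_dim * bytes_per_elem

# hypothetical dimensions: 32 layers, 8 KV heads, head_dim 128, 1M-token context
full_attn = kv_cache_bytes(32, 8, 128, 1_000_000)
hybrid_3to1 = kv_cache_bytes(8, 8, 128, 1_000_000)  # 3:1 stack → only 8 layers cache KV
print(1 - hybrid_3to1 / full_attn)  # → 0.75 (75% of the cache eliminated)
```

The saving is linear in the fraction of layers converted, which is why the headline number tracks the 3:1 ratio directly.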
Practical Takeaway
If your workloads are long-context and decoding-heavy (agentic tool use, repo-level code understanding, long RL trajectories), the pitch is:
- Keep some global full attention for true long-range interactions.
- Make most layers linear + gated-delta (KDA) to slash KV-cache and speed up decoding, while targeting equal or better quality than a full-attention stack.
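A 3:1 interleave like the one described could be laid out as follows; this is an illustrative helper, not code from the paper:

```python
def hybrid_layout(n_layers, linear_per_full=3):
    """Layer pattern: three linear (KDA-style) layers, then one full-attention layer."""
    period = linear_per_full + 1
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(n_layers)]

print(hybrid_layout(8))
# ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```

Only the `"full"` layers accumulate a KV-cache at inference time; the `"linear"` layers carry a fixed-size recurrent state regardless of context length.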
Reference
Kimi Linear: An Expressive, Efficient Attention Architecture
For readers who want the full technical details, the complete paper is available at the link above.