Subho's research at your service 🫡

Understanding Lightning Attention: A Breakthrough in Linear Attention Efficiency

We all need attention in our lives, don't we? But it's surprisingly expensive to compute the exponential terms that help us pick the best plausible next token. It goes like this,

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The main issue here is the $QK^T$ multiplication, which has a complexity of $O(n^2 d)$. What are n and d? The sequence length and the inner (head) dimension. So when the sequence length gets 2x, your attention computation gets 4x, hugee!!
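To make that concrete, here's a minimal PyTorch sketch of vanilla softmax attention (the function name and shapes are my own, purely for illustration); the $(n \times n)$ score matrix it materialises is exactly where the $O(n^2 d)$ comes from.

```python
import torch

def softmax_attention(Q, K, V):
    """Vanilla attention on (n, d) tensors; builds an (n, n) score matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5             # (n, n): quadratic in the sequence length
    return torch.softmax(scores, dim=-1) @ V  # (n, d)

n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = softmax_attention(Q, K, V)              # double n -> 4x the work in the score matmul
```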

Man, that's okay, but what does linear attention do to counter this, and how does Lightning Attention exploit it to squeeze the best performance out of it?

In its basic form, linear attention decomposes the attention mechanism into the inner product of hidden representations, allowing for more efficient computation like this,

$$\text{LinearAttention}(Q, K, V) = \phi(Q)\left(\phi(K)^T V\right)$$

Don't panic, I will give you the easiest example. You might have heard of the "kernel trick" in SVMs, right? If not, you might even get the intuition here! So think of $\phi$ as a function that helps generalize softmax, such that,

$$\text{softmax}(QK^T)V \approx \phi(Q)\left(\phi(K)^T V\right)$$

This is the kernel trick: approximating the softmax computation with a function $\phi$. Similarly, in SVMs the kernel trick lets you map a point from a lower-dimensional space to a higher-dimensional one to make it separable.

Theoretically, the kernel trick reduces the complexity to $O(nd^2)$. How so? By computing $\phi(K)^T V$ first, reducing over the sequence dimension n, and then taking the product with $\phi(Q)$ along dimension d, resulting in a complexity of $O(nd^2)$.
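Here's the same idea as a hedged sketch: I'm using $\text{elu}(x) + 1$ as a stand-in feature map $\phi$ (a common choice in linear-attention papers, not necessarily the one used here), and I'm skipping the usual normalisation term to stay close to the formula above.

```python
import torch
import torch.nn.functional as F

def phi(x):
    return F.elu(x) + 1.0      # illustrative feature map, keeps values positive

def linear_attention(Q, K, V):
    KV = phi(K).T @ V          # (d, d): reduce over the sequence first -> O(n d^2)
    return phi(Q) @ KV         # (n, d): no (n, n) matrix is ever materialised

n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)
```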

But that's theoretical. It faces a significant challenge in causal (autoregressive) settings, where masking comes into the picture. To compute this efficiently using the linear attention "kernel trick", we need to maintain a running sum of key-value pairs,

$$kv_t = \sum_{s \le t} k_s^T v_s$$

The sequential nature of cumsum prevents parallel computation, negating the theoretical O(nd²) efficiency advantage of linear attention. Each position must wait for all previous positions' computations to complete.
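In code, the causal recurrent form looks something like this sketch (feature map omitted for brevity); note the Python loop over t, which is exactly the part the GPU cannot parallelise.

```python
import torch

def causal_linear_attention_recurrent(Q, K, V):
    n, d = Q.shape
    kv = torch.zeros(d, d)             # running sum of k_s^T v_s for s <= t
    out = torch.zeros_like(V)
    for t in range(n):                 # strictly sequential: step t needs kv from step t-1
        kv = kv + K[t, :, None] @ V[t, None, :]   # rank-1 update of the (d, d) state
        out[t] = Q[t] @ kv             # each position only sees keys/values with s <= t
    return out
```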

Get to lightning attention already, I'm hyped! Here you go.

But let me give you an intuition for lightning attention. Think of the cumsum problem like trying to maintain a running sum while reading a book - you need to keep updating your "memory" after each word. This is inherently sequential, just like RNNs need to process tokens one by one. Here's how Lightning Attention cleverly works around this,

$$[x_1, \dots, x_B],\ [x_{B+1}, \dots, x_{2B}],\ \dots,\ [x_{n-B+1}, \dots, x_n]$$

Lightning Attention divides the input sequence into blocks. Let's say we have a sequence of length n and divide it into blocks of size B.

For each block t, the output is computed as,

$$O_t = \underbrace{\left[(Q_t K_t^T) \odot M\right] V_t}_{\text{intra-block}} + \underbrace{\Lambda\, Q_t (KV)}_{\text{inter-block}}$$

We had a two-part computation in RNNs, right? Let's apply the same intuition here,

  1. Short-term memory (intra-block): Within each block, we use regular attention - like how you can easily remember and relate words within the same paragraph.
  2. Long-term memory (inter-block): Between blocks, we use a modified linear attention - like maintaining a summary of previous paragraphs without needing every detail.

The key insight is in how the inter-block computation works. Instead of maintaining a running sum for each position, Lightning Attention updates a single KV matrix for each block,

$$KV_t = \lambda^B\, KV_{t-1} + \left(\lambda^B \Lambda^{-1} K_t\right)^T V_t$$

Here, $\lambda$ is a decay factor and $\Lambda$ is a diagonal matrix for position-aware scaling; I might tell you more as we progress further ;)
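Here's a minimal PyTorch sketch (my own rendering, not the paper's fused kernel) that strings the two equations above together. $M$ and $\Lambda$ are built inline exactly as they'll be defined in the Lightning Attention-2 section below, and the block size B and decay λ are assumed hyperparameters.

```python
import torch

def lightning_attention_forward(Q, K, V, B, lam):
    """Block-wise causal linear attention with decay; Q, K, V are (n, d), n % B == 0."""
    n, d = Q.shape
    idx = torch.arange(B)
    diff = (idx[:, None] - idx[None, :]).float()
    M = torch.where(diff >= 0, lam ** diff, torch.zeros(B, B))   # intra-block decay mask
    Lambda = torch.diag(lam ** (idx + 1).float())                # diag(lam, ..., lam^B)
    Lambda_inv = torch.diag(lam ** -(idx + 1).float())
    KV = torch.zeros(d, d)                                       # inter-block memory
    O = torch.zeros_like(V)
    for s in range(0, n, B):
        Qt, Kt, Vt = Q[s:s+B], K[s:s+B], V[s:s+B]
        O_intra = ((Qt @ Kt.T) * M) @ Vt                         # regular attention inside the block
        O_inter = Lambda @ Qt @ KV                               # decayed summary of earlier blocks
        O[s:s+B] = O_intra + O_inter
        KV = (lam ** B) * KV + ((lam ** B) * Lambda_inv @ Kt).T @ Vt   # one state update per block
    return O
```

That's the whole trick: the intra part stays a small B × B attention, while the inter part only ever touches a d × d state that gets updated once per block instead of once per token.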

And the authors have also highlighted that even though we can calculate $kv_t$ with a complexity of $O(nd^2)$, it's actually not GPU-optimal. Why? Because the sequential cumsum keeps us from exploiting the GPU's massively parallel matrix units (when it comes to the theoretical, recurrent form of linear attention).

But these guys have got us! They use tiling to compute linear attention in a causal setting: they divide Q, K and V into T blocks $X_1, X_2, \dots, X_T$, each of size B × d, where B is a chunk of the sequence length.

This tiling helps them reach GPU-optimal efficiency for Lightning Attention. Here is how the algorithm operates.

[Figure: the Lightning Attention forward-pass algorithm]

Now let's get to the complexity POV. Is it optimal? Let's find out! For the forward pass, as mentioned above, the intra-block computation is similar to regular attention, so its complexity is along the B dimension: $O(B^2 d)$, B again being a chunk of the sequence length. For the inter part, we compute by updating KV (yup, it's similar to the KV-cache intuition, I get you my friend :p), so it's $O(Bd^2)$. The computation inside the loop is therefore $O(B^2 d + Bd^2)$.

Since we loop for T = n/B times, the total time complexity becomes,

$$O\!\left((B^2 d + B d^2)\cdot \frac{n}{B}\right) = O(nd^2 + nBd)$$
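A quick sanity check with made-up sizes (the n, B, d below are my assumptions, not values from the paper) shows why this matters:

```python
# Rough operation counts, ignoring constant factors
n, B, d = 8192, 256, 128
vanilla   = n * n * d               # O(n^2 d)      ~ 8.6e9
lightning = n * d * d + n * B * d   # O(nd^2 + nBd) ~ 4.0e8
print(f"{vanilla / lightning:.1f}x fewer operations")   # ~21x in this setting
```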

Wow, that was really nice! Though I am yet to figure out how the authors derive the backward pass to have the same complexity; maybe that's my next blog post. Let's see the structural intuition from the paper.

[Figure: structural framework of Lightning Attention]

I'm not done yet! Who's gonna discuss Lightning Attention-2? Obviously me!

What if, during the forward pass, we could model how attention should diminish over distance in the sequence? Think of it like how your attention naturally fades when trying to connect words that are far apart in a sentence.

Initialize $\mathbf{M} \in \mathbb{R}^{B \times B}$, where $\mathbf{M}_{ij} = \lambda^{i-j}$ if $i \ge j$, else $0$.

Initialize $\Lambda = \mathrm{diag}\{\lambda, \lambda^2, \dots, \lambda^B\} \in \mathbb{R}^{B \times B}$.

The above steps initialise the decay mask $\mathbf{M}$ and the diagonal matrix $\Lambda$ that contains position-specific scaling factors. The diagonal elements of $\Lambda$ control how much weight each position gets in the final attention output.
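As a hedged sketch, here's how those two initialisations could look in PyTorch (B and λ are assumed hyperparameters; it's the same construction the earlier forward-pass sketch used inline):

```python
import torch

def build_decay(B, lam):
    idx = torch.arange(B)
    diff = (idx[:, None] - idx[None, :]).float()                 # i - j
    M = torch.where(diff >= 0, lam ** diff, torch.zeros(B, B))   # M_ij = lam^(i-j) if i >= j else 0
    Lambda = torch.diag(lam ** (idx + 1).float())                # diag(lam, lam^2, ..., lam^B)
    return M, Lambda

M, Lambda = build_decay(B=4, lam=0.9)   # tiny example: lower-triangular decay mask + diagonal scaling
```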

On chip, compute $\mathbf{O}_{\text{inter}} = \Lambda\, \mathbf{Q}_i (KV)$.

On chip, compute $KV = \lambda^B\, KV + \left(\lambda^B \Lambda^{-1} \mathbf{K}_i\right)^T \mathbf{V}_i$.

The $\lambda^B$ term defines the decay factor at the block level, and the diagonal matrix $\Lambda$ scales this interaction based on position. This scaling ensures that attention decays appropriately with distance.

Here is the structural representation of Lightning Attention-2,

[Figure: structural representation of Lightning Attention-2]

Since it's getting a bit late, I will publish this blog first and share the implementation as I go, so you'd better keep yourself updated by following me on X.

Here you can find the detailed implementation of the lightning attn kernel - enjoy 🥂

See ya!! :p