Where Am I? Positional Encoding
The attention mechanism has no idea what order your words are in. Let's fix that.
By now we have built up quite a machine. Logistic regression. Stacked into deep networks. Rolled through time as an RNN. Given a second brain as seq2seq. Taught to focus with attention. And finally, in the last chapter, we stripped out the recurrence entirely and let every word talk to every other word at once — the Transformer.
But the Transformer has a dirty secret. One thing it completely forgot to include.
Position.
The model that forgot to count
Go back to the attention formula. For each pair of tokens, we compute a score:
```
# attention score between token i and token j
score(i, j) = dot(W_Q · x_i, W_K · x_j) / sqrt(d)

# notice what is NOT here: no position information.
# i and j only pick out which vectors to compare —
# the score itself sees the CONTENT of the tokens, never their location
```
The dot product sees only the content of each token. Swap all the tokens around — the same dot products come out, just in a different order. The model is, mathematically, a bag of words.
"The cat ate the mouse" and "The mouse ate the cat" — same words, same vectors, same attention scores. The two sentences mean completely different things, but the attention formula produces identical scores for both. The model cannot tell them apart.
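This permutation blindness is easy to check numerically. A minimal numpy sketch with random toy vectors and projection matrices (all illustrative, not a real model): shuffling the tokens just reindexes the score matrix, it never produces new scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))      # 5 token vectors, content only
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

def scores(X):
    # raw attention scores for every (i, j) pair
    return (X @ W_Q.T) @ (X @ W_K.T).T / np.sqrt(d)

perm = rng.permutation(5)        # shuffle the "sentence"
S = scores(X)
S_shuffled = scores(X[perm])

# shuffled input gives the same scores, just reindexed — a bag of words
assert np.allclose(S_shuffled, S[np.ix_(perm, perm)])
```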
Add a position fingerprint to every word
The fix is simple. Before attention runs, we give each word a second vector that encodes only its position — and we add it to the word vector. Now word number 1 has a different total vector than word number 4, even if they are the same word.
```
# The complete idea in one line
model_input[i] = word_vector[i] + position_vector[i]

# word_vector:     encodes MEANING  (what the word is)
# position_vector: encodes LOCATION (where the word sits)
# together: the model sees both
```
Now the question becomes: what exactly do we put in the position vector? This is where things get interesting. There are four main answers, and each one is a different architectural choice made by a real system you use every day.
Method 1 — Sinusoidal encoding
used in: the original Transformer (Vaswani et al., 2017)
The original Transformer's answer: use a formula based on sine and cosine waves. Pair each dimension with a wave running at a different frequency — some fast, some very slow. Together they give every position a unique fingerprint, like hour and minute hands on a clock.
```
# The formula (d_model = 4, so 2 sin/cos pairs)
PE(pos, dim 0) = sin(pos / 1)    # fast clock
PE(pos, dim 1) = cos(pos / 1)    # fast clock, shifted 90°
PE(pos, dim 2) = sin(pos / 100)  # slow clock
PE(pos, dim 3) = cos(pos / 100)  # slow clock, shifted 90°

# Always between -1 and +1. Works for any position. No training needed.
```
As the position slides forward, the four numbers change at different rates: the fast dimensions (0, 1) cycle quickly, while the slow dimensions (2, 3) barely move.
Notice: every position produces a unique combination. No two positions ever have identical 4-number fingerprints.
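The formula above is short enough to implement and test directly. A small sketch using the same d_model = 4 and the same two clock speeds (1 and 100) as the example: every one of the first 50 positions gets a bounded, unique fingerprint.

```python
import numpy as np

def sinusoidal_pe(pos, d_model=4):
    # one sin/cos "clock" pair per two dimensions, speeds 1 and 100
    speeds = np.array([1.0, 100.0])
    angles = pos / speeds
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)   # dims 0, 2
    pe[1::2] = np.cos(angles)   # dims 1, 3
    return pe

table = np.stack([sinusoidal_pe(p) for p in range(50)])

assert np.all(np.abs(table) <= 1.0)                     # always in [-1, +1]
assert len(np.unique(table.round(6), axis=0)) == 50     # all fingerprints unique
```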
Method 2 — Learned encoding
used in: BERT, GPT-2
BERT and GPT-2 said: forget the formula. Just create a lookup table — one row per position — and let backpropagation figure out the best values during training. Same mechanism as a word embedding table, but indexed by position instead of word.
```python
# In PyTorch — this is literally all it is
position_embedding = nn.Embedding(max_positions, d_model)

# At forward pass:
pos_vector = position_embedding.weight[3]  # look up row 3 for position 3

# These values start random and get updated by backprop,
# just like any other weight in the network
```
After training, the table holds one learned vector per position; the row for position p is exactly the vector that gets added to whatever word sits at position p.
Method 3 — Relative encoding
used in: T5, DeBERTa
Sinusoidal and learned PE both tell each word "you are at position 5." Relative PE changes the question entirely: instead of absolute location, it tells each pair of words "you are 4 steps apart."
Grammatically this makes more sense. An adjective modifying a noun next to it is the same pattern whether it is at position 3 or position 412. What matters is adjacency, not coordinates.
```
# Instead of adding PE to the word vector...
# ...add a bias directly to the attention score
score(i, j) = dot(q_i, k_j) + Bias[j - i]

# Bias is a learned table indexed by relative distance
# Bias[+1] = adjacent right, Bias[-1] = adjacent left
# Bias[+4] = four steps to the right, etc.
```
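Here is a minimal numpy sketch of that bias table (random toy vectors and bias values, standing in for learned weights). The key point: the bias depends only on j − i, so tokens 0 and 3 share the same bias entry as tokens 2 and 5.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
max_dist = n - 1
bias = rng.normal(size=2 * max_dist + 1)   # learned table, indexed by (j - i) + max_dist

rel = np.arange(n)[None, :] - np.arange(n)[:, None]   # rel[i, j] = j - i
scores = Q @ K.T / np.sqrt(d) + bias[rel + max_dist]

# the bias added to pair (0, 3) is exactly the bias for distance +3
assert np.isclose(scores[0, 3] - Q[0] @ K[3] / np.sqrt(d), bias[3 + max_dist])
```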
Pick any word as the query: its distance to every other word indexes the bias table, and the score for each pair is the dot product plus the bias for that distance.
Method 4 — RoPE
used in: LLaMA, Mistral, Gemma, Qwen — the current standard
RoPE is the elegant one. Instead of adding position information, it rotates each word's query and key vectors by an angle proportional to position. Then something beautiful happens in the math of the dot product.
```
# Rotate query and key before computing attention score
q_m = rotate(q, angle = m × speed)
k_n = rotate(k, angle = n × speed)

# Dot product of two rotated vectors:
dot(q_m, k_n) = dot(q, k) × cos((n - m) × speed)
              + cross(q, k) × sin((n - m) × speed)
#                       ↑ only the GAP (n - m) survives!
# absolute positions m and n cancel out mathematically
```
Take a query position and a key position: the score depends only on the gap between them, not on their absolute values. Shift both positions by the same amount and the score does not change.
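The cancellation is worth verifying. A toy sketch with a single 2-D query/key pair and an arbitrary rotation speed of 0.1 (both choices illustrative): positions (3, 7) and (100, 104) share a gap of 4 and get the identical score.

```python
import numpy as np

def rotate(v, angle):
    # standard 2-D rotation
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

q = np.array([0.8, -0.3])
k = np.array([0.5, 0.9])
speed = 0.1

def rope_score(m, n):
    # rotate query by m×speed, key by n×speed, then dot product
    return rotate(q, m * speed) @ rotate(k, n * speed)

# same gap, different absolute positions -> identical score
assert np.isclose(rope_score(3, 7), rope_score(100, 104))
# different gap -> different score
assert not np.isclose(rope_score(3, 7), rope_score(3, 8))
```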
All four methods do the same thing: give each position a unique vector and inject it into the token representation, so attention scores can become position-aware. The difference is only in where the position is injected and whether the dot product sees absolute positions or relative distances. RoPE wins because it guarantees relative distances fall out of the math, not as an approximation, but exactly.
Still just logistic regression computing attention scores.
Still just a weighted average combining information.
Pick how you want to end this chapter
Every reader has a different way of internalizing something new. Below are four different endings to this chapter — each one frames the same insight from a different angle. Pick whichever one clicks for you.
The clock metaphor
Think of a clock. It has three hands: hour, minute, second. Each hand moves at a different speed. Together, their combined positions uniquely identify every moment in a 12-hour period.
Sinusoidal PE is exactly this. Dimension 0 is the second hand — it cycles fast, completing a full loop in a few positions. Dimension 2 is the hour hand — it barely moves across a typical sentence. Together they give every position a unique combination of angles.
RoPE takes the metaphor further: instead of reading the clock hands and writing the time on an envelope, it rotates the letter itself by the time. When two rotated letters meet to compare notes, the rotation cancels — and only the time between them is visible. Not when each was sent. Just the gap.
That is why RoPE generalizes better. A sinusoidal model that sees "position 5" during training must generalize to "position 500" at test time. A RoPE model only ever sees gaps — and a gap of 4 is the same whether it appears at positions (1,5) or (100,104).
As the position advances, the fast hands (pair 0) complete many loops while the slow hands (pair 1) barely move. Every position is a unique combination of clock states.
What survives the dot product
The cleanest way to understand all four methods is to ask: after the dot product, which information survives?
```
── Sinusoidal PE ──────────────────────────────────────────
q = word_q + PE(m)              # position m baked into query
k = word_k + PE(n)              # position n baked into key

dot(q, k) = dot(word_q, word_k)
          + dot(word_q, PE(n))  # cross term!
          + dot(PE(m), word_k)  # cross term!
          + dot(PE(m), PE(n))   # both absolute positions

# Result: messy mix of m, n, and word meanings

── RoPE ───────────────────────────────────────────────────
q_m = rotate(word_q, m × speed)
k_n = rotate(word_k, n × speed)

dot(q_m, k_n) = dot(word_q, word_k)   × cos((n-m) × speed)
              + cross(word_q, word_k) × sin((n-m) × speed)

# Result: word meaning × f(n-m) — only the gap!
# Absolute positions m and n are mathematically gone.
```
Fix a gap between query and key, then slide both positions forward together, keeping the gap constant: the RoPE score stays locked while the sinusoidal score drifts.
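That contrast can be checked numerically. A toy sketch with d_model = 4, random word vectors, the sinusoidal fingerprint from earlier in the chapter, and a single shared rotation speed of 0.1 for RoPE (real RoPE uses a different speed per pair, but the gap property holds either way):

```python
import numpy as np

def pe(pos):
    # sinusoidal fingerprint, d_model = 4, speeds 1 and 100
    return np.array([np.sin(pos), np.cos(pos),
                     np.sin(pos / 100), np.cos(pos / 100)])

def rot(v, a):
    # rotate each (even, odd) pair of dimensions by angle a
    c, s = np.cos(a), np.sin(a)
    x, y = v[0::2], v[1::2]
    out = np.empty_like(v)
    out[0::2] = c * x - s * y
    out[1::2] = s * x + c * y
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=4), rng.normal(size=4)

def sin_score(m, n):
    return (q + pe(m)) @ (k + pe(n))      # additive PE: cross terms survive

def rope_score(m, n):
    return rot(q, 0.1 * m) @ rot(k, 0.1 * n)  # rotation: only the gap survives

# same gap of 4, two different absolute placements
assert np.isclose(rope_score(1, 5), rope_score(100, 104))      # RoPE: locked
assert not np.isclose(sin_score(1, 5), sin_score(100, 104))    # sinusoidal: drifts
```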
The post office problem
Imagine a postal sorting room. Letters arrive on a conveyor belt. A sorter must decide which letters are "related" — perhaps they are part of the same correspondence. The sorter reads the content of each letter to judge relevance. This is the attention mechanism.
But here is the problem: if someone shuffles the letters before they arrive, the sorter has no way to tell what order they came in. "Cat ate mouse" and "Mouse ate cat" look like the same pile of letters.
Sinusoidal PE is like stamping each letter with a timestamp before it enters the room. The sorter now sees the content plus the timestamp together. But the timestamp and the content get mixed together in a messy way when the sorter tries to judge relevance.
Learned PE is like letting the postal service design its own stamp system from scratch by learning from millions of letters. Works well, but cannot stamp letters with dates the system has never seen.
Relative PE does not stamp the letters at all. Instead, when the sorter picks up two letters to compare, an assistant whispers "these arrived 4 letters apart." The sorter incorporates this directly into their judgment.
RoPE is the most elegant solution: rotate each letter by its arrival time before handing it to the sorter. When the sorter holds two letters to compare them, the rotations interact — and the only thing left visible is the angle between them. The arrival times themselves cancel out. The sorter sees only the gap, automatically, with no assistant required.
Where does logistic regression fit?
Every chapter in this book has shown the same two moves. Let's make sure positional encoding fits the pattern too.
Positional encoding is not itself a logistic regression. It is a preprocessing step — it modifies the input before the logistic regression runs. But look at what happens after:
```
# After positional encoding, attention still runs the same way:

# Move 1 — Logistic regression learns importance weights
score(i, j) = softmax( dot(W_Q·x_i, W_K·x_j) / sqrt(d) )

# Move 2 — Weighted average combines information
output_i = sum( score(i, j) × W_V·x_j for all j )

# The x_i now carry BOTH meaning AND position,
# so the logistic regression in move 1 naturally learns
# position-aware importance weights — without any change to itself
```
That is the beauty of the design. Positional encoding does not add a new type of operation. It simply enriches the input so that the same two logistic-regression moves automatically become position-aware. The architecture does not need to know about position — the data carries it.
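The two moves fit in a few lines of numpy. A minimal sketch with random toy weights and a sinusoidal position table (dimensions, base frequency 100, and sequence length are all illustrative): note that nothing in the attention code mentions position, it only sees the enriched inputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pe(pos, d_model):
    # sinusoidal fingerprint: sin/cos pairs at geometric speeds
    i = np.arange(d_model // 2)
    angles = pos / (100.0 ** (2 * i / d_model))
    out = np.empty(d_model)
    out[0::2] = np.sin(angles)
    out[1::2] = np.cos(angles)
    return out

rng = np.random.default_rng(0)
n, d = 5, 8
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
words = rng.normal(size=(n, d))

# enrich the input: meaning + location
x = words + np.stack([pe(p, d) for p in range(n)])

# Move 1: softmax over dot-product scores (the logistic-regression step)
A = softmax((x @ W_Q.T) @ (x @ W_K.T).T / np.sqrt(d))
# Move 2: weighted average of value vectors
out = A @ (x @ W_V.T)

assert A.shape == (n, n) and np.allclose(A.sum(axis=1), 1.0)
assert out.shape == (n, d)
```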
Chapter 1: one logistic regression. Chapter 2: stacked. Chapter 3: with memory gates. Chapter 4: two networks. Chapter 5: three LRs — one per role. Chapter 6: all words in parallel. Chapter 7: same formula, every word to every word. Chapter 8: inject position into the input — the LR does the rest.
Still just logistic regression.
Now it knows where it is.
"Every word knows where it sits."
From the series: Everything is Logistic Regression (And Some Extra)
Next: Chapter 09 — Add & Norm · The tricks that make deep transformers trainable