Where Am I? Positional Encoding
The attention mechanism has no idea what order your words are in. Let's fix that.
By now we have built up quite a machine. Logistic regression. Stacked into deep networks. Rolled through time as an RNN. Given a second brain as seq2seq. Taught to focus with attention. And finally, in the last chapter, we stripped out the recurrence entirely and let every word talk to every other word at once — the Transformer.
But the Transformer has a dirty secret. One thing it completely forgot to include.
Position.
The model that forgot to count
Go back to the attention formula. For each pair of tokens, we compute a score:
```
# attention score between token i and token j
score(i, j) = dot(W_Q · x_i, W_K · x_j) / sqrt(d)

# notice what is NOT here: no position information.
# i and j only pick out which vectors to compare —
# the score itself sees the CONTENT of the tokens, never their location
```
The dot product sees only the content of each token. Swap all the tokens around — the same dot products come out, just in a different order. The model is, mathematically, a bag of words.
"The cat ate the mouse" and "The mouse ate the cat" — same words, same vectors, same attention scores. The two sentences mean completely different things, but the attention formula produces identical scores for both. The model cannot tell them apart.
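This permutation blindness is easy to check numerically. A minimal numpy sketch with random toy vectors and projection matrices (all illustrative, not a real model): shuffling the tokens just reindexes the score matrix, it never produces new scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))      # 5 token vectors, content only
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

def scores(X):
    # raw attention scores for every (i, j) pair
    return (X @ W_Q.T) @ (X @ W_K.T).T / np.sqrt(d)

perm = rng.permutation(5)        # shuffle the "sentence"
S = scores(X)
S_shuffled = scores(X[perm])

# shuffled input gives the same scores, just reindexed — a bag of words
assert np.allclose(S_shuffled, S[np.ix_(perm, perm)])
```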
Add a position fingerprint to every word
The fix is simple. Before attention runs, we give each word a second vector that encodes only its position — and we add it to the word vector. Now word number 1 has a different total vector than word number 4, even if they are the same word.
```
# The complete idea in one line
model_input[i] = word_vector[i] + position_vector[i]

# word_vector:     encodes MEANING  (what the word is)
# position_vector: encodes LOCATION (where the word sits)
# together: the model sees both
```
Now the question becomes: what exactly do we put in the position vector? This is where things get interesting. There are four main answers, and each one is a different architectural choice made by a real system you use every day.
Method 1 — Sinusoidal encoding
used in: the original Transformer (Vaswani et al., 2017)
The original Transformer's answer: use a formula based on sine and cosine waves. Pair each dimension with a wave running at a different frequency — some fast, some very slow. Together they give every position a unique fingerprint, like hour and minute hands on a clock.
```
# The formula (d_model = 4, so 2 sin/cos pairs)
PE(pos, dim 0) = sin(pos / 1)    # fast clock
PE(pos, dim 1) = cos(pos / 1)    # fast clock, shifted 90°
PE(pos, dim 2) = sin(pos / 100)  # slow clock
PE(pos, dim 3) = cos(pos / 100)  # slow clock, shifted 90°

# Always between -1 and +1. Works for any position. No training needed.
```
As the position slides forward, the four numbers change at different rates: the fast dimensions (0, 1) cycle quickly, while the slow dimensions (2, 3) barely move.
Notice: every position produces a unique combination. No two positions ever have identical 4-number fingerprints.
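The formula above is short enough to implement and test directly. A small sketch using the same d_model = 4 and the same two clock speeds (1 and 100) as the example: every one of the first 50 positions gets a bounded, unique fingerprint.

```python
import numpy as np

def sinusoidal_pe(pos, d_model=4):
    # one sin/cos "clock" pair per two dimensions, speeds 1 and 100
    speeds = np.array([1.0, 100.0])
    angles = pos / speeds
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)   # dims 0, 2
    pe[1::2] = np.cos(angles)   # dims 1, 3
    return pe

table = np.stack([sinusoidal_pe(p) for p in range(50)])

assert np.all(np.abs(table) <= 1.0)                     # always in [-1, +1]
assert len(np.unique(table.round(6), axis=0)) == 50     # all fingerprints unique
```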
Method 2 — Learned encoding
used in: BERT, GPT-2
BERT and GPT-2 said: forget the formula. Just create a lookup table — one row per position — and let backpropagation figure out the best values during training. Same mechanism as a word embedding table, but indexed by position instead of word.
```python
# In PyTorch — this is literally all it is
position_embedding = nn.Embedding(max_positions, d_model)

# At forward pass:
pos_vector = position_embedding.weight[3]  # look up row 3 for position 3

# These values start random and get updated by backprop,
# just like any other weight in the network
```
After training, the table holds one learned vector per position; the row for position p is exactly the vector that gets added to whatever word sits at position p.
Method 3 — Relative encoding
used in: T5, DeBERTa
Sinusoidal and learned PE both tell each word "you are at position 5." Relative PE changes the question entirely: instead of absolute location, it tells each pair of words "you are 4 steps apart."
Grammatically this makes more sense. An adjective modifying a noun next to it is the same pattern whether it is at position 3 or position 412. What matters is adjacency, not coordinates.
```
# Instead of adding PE to the word vector...
# ...add a bias directly to the attention score
score(i, j) = dot(q_i, k_j) + Bias[j - i]

# Bias is a learned table indexed by relative distance
# Bias[+1] = adjacent right, Bias[-1] = adjacent left
# Bias[+4] = four steps to the right, etc.
```
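Here is a minimal numpy sketch of that bias table (random toy vectors and bias values, standing in for learned weights). The key point: the bias depends only on j − i, so tokens 0 and 3 share the same bias entry as tokens 2 and 5.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
max_dist = n - 1
bias = rng.normal(size=2 * max_dist + 1)   # learned table, indexed by (j - i) + max_dist

rel = np.arange(n)[None, :] - np.arange(n)[:, None]   # rel[i, j] = j - i
scores = Q @ K.T / np.sqrt(d) + bias[rel + max_dist]

# the bias added to pair (0, 3) is exactly the bias for distance +3
assert np.isclose(scores[0, 3] - Q[0] @ K[3] / np.sqrt(d), bias[3 + max_dist])
```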
Pick any word as the query: its distance to every other word indexes the bias table, and the score for each pair is the dot product plus the bias for that distance.
Method 4 — RoPE
used in: LLaMA, Mistral, Gemma, Qwen — the current standard
RoPE is the elegant one. Instead of adding position information, it rotates each word's query and key vectors by an angle proportional to position. Then something beautiful happens in the math of the dot product.
```
# Rotate query and key before computing attention score
q_m = rotate(q, angle = m × speed)
k_n = rotate(k, angle = n × speed)

# Dot product of two rotated vectors:
dot(q_m, k_n) = dot(q, k) × cos((n - m) × speed)
              + cross(q, k) × sin((n - m) × speed)
#                       ↑ only the GAP (n - m) survives!
# absolute positions m and n cancel out mathematically
```
Take a query position and a key position: the score depends only on the gap between them, not on their absolute values. Shift both positions by the same amount and the score does not change.
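The cancellation is worth verifying. A toy sketch with a single 2-D query/key pair and an arbitrary rotation speed of 0.1 (both choices illustrative): positions (3, 7) and (100, 104) share a gap of 4 and get the identical score.

```python
import numpy as np

def rotate(v, angle):
    # standard 2-D rotation
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

q = np.array([0.8, -0.3])
k = np.array([0.5, 0.9])
speed = 0.1

def rope_score(m, n):
    # rotate query by m×speed, key by n×speed, then dot product
    return rotate(q, m * speed) @ rotate(k, n * speed)

# same gap, different absolute positions -> identical score
assert np.isclose(rope_score(3, 7), rope_score(100, 104))
# different gap -> different score
assert not np.isclose(rope_score(3, 7), rope_score(3, 8))
```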
All four methods do the same thing: give each position a unique vector and inject it into the token representation, so attention scores can become position-aware. The difference is only in where the position is injected and whether the dot product sees absolute positions or relative distances. RoPE wins because it guarantees relative distances fall out of the math, not as an approximation, but exactly.
Still just logistic regression computing attention scores.
Still just a weighted average combining information.
Pick how you want to end this chapter
Every reader has a different way of internalizing something new. Below are four different endings to this chapter — each one frames the same insight from a different angle. Pick whichever one clicks for you.
The clock metaphor
Think of a clock. It has three hands: hour, minute, second. Each hand moves at a different speed. Together, their combined positions uniquely identify every moment in a 12-hour period.
Sinusoidal PE is exactly this. Dimension 0 is the second hand — it cycles fast, completing a full loop in a few positions. Dimension 2 is the hour hand — it barely moves across a typical sentence. Together they give every position a unique combination of angles.
RoPE takes the metaphor further: instead of reading the clock hands and writing the time on an envelope, it rotates the letter itself by the time. When two rotated letters meet to compare notes, the rotation cancels — and only the time between them is visible. Not when each was sent. Just the gap.
That is why RoPE generalizes better. A sinusoidal model that sees "position 5" during training must generalize to "position 500" at test time. A RoPE model only ever sees gaps — and a gap of 4 is the same whether it appears at positions (1,5) or (100,104).
As the position advances, the fast hands (pair 0) complete many loops while the slow hands (pair 1) barely move. Every position is a unique combination of clock states.
What survives the dot product
The cleanest way to understand all four methods is to ask: after the dot product, which information survives?
```
── Sinusoidal PE ──────────────────────────────────────────
q = word_q + PE(m)              # position m baked into query
k = word_k + PE(n)              # position n baked into key

dot(q, k) = dot(word_q, word_k)
          + dot(word_q, PE(n))  # cross term!
          + dot(PE(m), word_k)  # cross term!
          + dot(PE(m), PE(n))   # both absolute positions

# Result: messy mix of m, n, and word meanings

── RoPE ───────────────────────────────────────────────────
q_m = rotate(word_q, m × speed)
k_n = rotate(word_k, n × speed)

dot(q_m, k_n) = dot(word_q, word_k)   × cos((n-m) × speed)
              + cross(word_q, word_k) × sin((n-m) × speed)

# Result: word meaning × f(n-m) — only the gap!
# Absolute positions m and n are mathematically gone.
```
Fix a gap between query and key, then slide both positions forward together, keeping the gap constant: the RoPE score stays locked while the sinusoidal score drifts.
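That contrast can be checked numerically. A toy sketch with d_model = 4, random word vectors, the sinusoidal fingerprint from earlier in the chapter, and a single shared rotation speed of 0.1 for RoPE (real RoPE uses a different speed per pair, but the gap property holds either way):

```python
import numpy as np

def pe(pos):
    # sinusoidal fingerprint, d_model = 4, speeds 1 and 100
    return np.array([np.sin(pos), np.cos(pos),
                     np.sin(pos / 100), np.cos(pos / 100)])

def rot(v, a):
    # rotate each (even, odd) pair of dimensions by angle a
    c, s = np.cos(a), np.sin(a)
    x, y = v[0::2], v[1::2]
    out = np.empty_like(v)
    out[0::2] = c * x - s * y
    out[1::2] = s * x + c * y
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=4), rng.normal(size=4)

def sin_score(m, n):
    return (q + pe(m)) @ (k + pe(n))      # additive PE: cross terms survive

def rope_score(m, n):
    return rot(q, 0.1 * m) @ rot(k, 0.1 * n)  # rotation: only the gap survives

# same gap of 4, two different absolute placements
assert np.isclose(rope_score(1, 5), rope_score(100, 104))      # RoPE: locked
assert not np.isclose(sin_score(1, 5), sin_score(100, 104))    # sinusoidal: drifts
```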
The post office problem
Imagine a postal sorting room. Letters arrive on a conveyor belt. A sorter must decide which letters are "related" — perhaps they are part of the same correspondence. The sorter reads the content of each letter to judge relevance. This is the attention mechanism.
But here is the problem: if someone shuffles the letters before they arrive, the sorter has no way to tell what order they came in. "Cat ate mouse" and "Mouse ate cat" look like the same pile of letters.
Sinusoidal PE is like stamping each letter with a timestamp before it enters the room. The sorter now sees the content plus the timestamp together. But the timestamp and the content get mixed together in a messy way when the sorter tries to judge relevance.
Learned PE is like letting the postal service design its own stamp system from scratch by learning from millions of letters. Works well, but cannot stamp letters with dates the system has never seen.
Relative PE does not stamp the letters at all. Instead, when the sorter picks up two letters to compare, an assistant whispers "these arrived 4 letters apart." The sorter incorporates this directly into their judgment.
RoPE is the most elegant solution: rotate each letter by its arrival time before handing it to the sorter. When the sorter holds two letters to compare them, the rotations interact — and the only thing left visible is the angle between them. The arrival times themselves cancel out. The sorter sees only the gap, automatically, with no assistant required.
Where does logistic regression fit?
Every chapter in this book has shown the same two moves. Let's make sure positional encoding fits the pattern too.
Positional encoding is not itself a logistic regression. It is a preprocessing step — it modifies the input before the logistic regression runs. But look at what happens after:
```
# After positional encoding, attention still runs the same way:

# Move 1 — Logistic regression learns importance weights
score(i, j) = softmax( dot(W_Q·x_i, W_K·x_j) / sqrt(d) )

# Move 2 — Weighted average combines information
output_i = sum( score(i, j) × W_V·x_j for all j )

# The x_i now carry BOTH meaning AND position,
# so the logistic regression in move 1 naturally learns
# position-aware importance weights — without any change to itself
```
That is the beauty of the design. Positional encoding does not add a new type of operation. It simply enriches the input so that the same two logistic-regression moves automatically become position-aware. The architecture does not need to know about position — the data carries it.
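The two moves fit in a few lines of numpy. A minimal sketch with random toy weights and a sinusoidal position table (dimensions, base frequency 100, and sequence length are all illustrative): note that nothing in the attention code mentions position, it only sees the enriched inputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pe(pos, d_model):
    # sinusoidal fingerprint: sin/cos pairs at geometric speeds
    i = np.arange(d_model // 2)
    angles = pos / (100.0 ** (2 * i / d_model))
    out = np.empty(d_model)
    out[0::2] = np.sin(angles)
    out[1::2] = np.cos(angles)
    return out

rng = np.random.default_rng(0)
n, d = 5, 8
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
words = rng.normal(size=(n, d))

# enrich the input: meaning + location
x = words + np.stack([pe(p, d) for p in range(n)])

# Move 1: softmax over dot-product scores (the logistic-regression step)
A = softmax((x @ W_Q.T) @ (x @ W_K.T).T / np.sqrt(d))
# Move 2: weighted average of value vectors
out = A @ (x @ W_V.T)

assert A.shape == (n, n) and np.allclose(A.sum(axis=1), 1.0)
assert out.shape == (n, d)
```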
Chapter 1: one logistic regression. Chapter 2: stacked. Chapter 3: with memory gates. Chapter 4: two networks. Chapter 5: three LRs — one per role. Chapter 6: all words in parallel. Chapter 7: same formula, every word to every word. Chapter 8: inject position into the input — the LR does the rest.
Still just logistic regression.
Now it knows where it is.
"Every word knows where it sits."
From the series: Everything is Logistic Regression (And Some Extra)
Next: Chapter 09 — Add & Norm · The tricks that make deep transformers trainable