Everything is Logistic Regression (And Some Extra)

A Bottom-Up Guide to Deep Learning

Everything is Logistic Regression & Some Extra

From one equation to the architecture behind ChatGPT — without skipping the fundamentals

y = sigmoid(W·x + b)

↓ scroll ↓

When I first encountered the LSTM diagram — the one with the forget gate, the input gate, the candidate gate, the output gate, the cell state threading through like a conveyor belt — I felt something I can only describe as pre-emptive defeat. Not confusion. Defeat. The kind you feel before you have even started.

I remember thinking: why does learning feel like being handed a circuit board and told to intuit electricity?

LSTM diagram with forget gate, input gate, candidate memory, output gate

the diagram that made me want to close the tab

Over time, I noticed the reason it felt so hard — and it is not the complexity. It is a structural problem in how the field communicates.

Science moves forward by showing what is new. That is the incentive. To publish, to get cited, to be taken seriously, you need to demonstrate novelty. And so every paper presents its contribution as a clean break — a departure, not a continuation. The connections to prior work are mentioned briefly in a related work section, and then deliberately downplayed. Because if you connect too strongly to something that already exists, a reviewer will say: this is not sufficiently different from X. This has been done before.

I respect this deeply — every paper, every small increment, is genuinely moving science forward. But the side effect for learners is severe. When you try to understand attention, nobody tells you it is just a small network that sits between the encoder and decoder and asks "which encoder state matters right now?" Instead, it arrives in the literature as attention — a new word, a new diagram, a new framework — severed from the RNN it grew out of.

what a reviewer says if you connect your work too clearly to what came before

The result is a field that looks, from the outside, like an endless forest of unrelated trees. Each model has its own diagram, its own terminology, its own paper. The shared roots are underground, invisible. Nobody drew the map.

The field is like a Mandelbrot set. Each zoom reveals more complexity. Each paper adds a new layer. Progress is real and extraordinary — but nobody is required to make the whole thing learnable. The stack grows taller. The entry point gets higher. Eventually the foundations disappear into the cloud, and newcomers are handed a ladder that starts thirty floors up.

Someone has to look back. Someone has to dig through the long horizon of architectures and ask: what was the common building block all along? What is the root every branch grew from?

When I really tried to understand — not just memorize diagrams, but actually understand — I found it. The common skeleton is logistic regression. Put any model in an X-ray machine and you will find it everywhere: in every gate, every attention score, every feedforward layer. Each architecture is just logistic regression applied in a more complex configuration, used in a more clever way, stacked a little deeper. The variation is real — but the foundation is always the same.

Is this a camera? Is everything a camera?

me, after understanding what logistic regression actually is

This post builds from the ground up. We start from one equation — the simplest possible decision maker — and add one clever trick at a time. Each chapter is the previous chapter plus a single insight. By the end, the architecture behind GPT-4 will not look like magic. It will look like the logical next step.

The Core Thesis

Every model in this post — RNN, LSTM, GRU, Seq2Seq, Attention, Transformer — is built from two moves: a logistic regression that learns importance, and a weighted average that combines information. That is all.

Chapter 01

The One Equation That Runs The World

Logistic Regression — the simplest decision maker

Before transformers, before attention, before RNNs — there is one equation. You have probably seen it. You may have dismissed it as too simple. That would be a mistake.

Logistic regression takes a bunch of input numbers, multiplies them by learned weights, adds a bias, and squashes the result through a sigmoid function to get a number between 0 and 1.

The Foundation

y = sigmoid(W·x + b)

where:
x        = input vector  (your data)
W        = weight matrix (what the model learns)
b        = bias          (learned offset)
sigmoid  = 1 / (1 + e^-z) (squashes to 0..1)
y        = output        (a decision)

The sigmoid function does something beautiful — it turns any real number into a probability. Feed it a very large number: you get close to 1. Feed it a very negative number: you get close to 0. It is a smooth, differentiable on/off switch.

Training is equally simple. You make a prediction, compare it to the correct answer with a loss function, then use backpropagation to figure out which weights caused the mistake and adjust them a little. Repeat thousands of times.

Training Loop

1. Make a prediction     y = sigmoid(Wx + b)
2. Measure the mistake   Loss = -log(y)  if correct=1
3. Find who caused it    dLoss/dW  (backprop)
4. Fix weights a little  W = W - lr × dLoss/dW
5. Repeat until good

That is it. That is the whole foundation. Everything that follows — every gate in an LSTM, every attention head in a transformer — is this same operation, applied cleverly in different contexts.

"Just logistic regression.
No extra tricks yet."

Chapter 02

Stack It Deep

Deep Learning — logistic regression, layer by layer

One logistic regression makes one decision. It takes inputs, multiplies by weights, squashes through sigmoid. Done. But one decision is rarely enough. Real problems have layers of abstraction — raw pixels become edges, edges become shapes, shapes become faces. How do you build that?

You stack logistic regressions on top of each other. The output of one becomes the input of the next. That is deep learning. Nothing more.

From One Layer To Many

Single logistic regression:
y = sigmoid(W · x + b)
input → one decision

─────────────────────────────────────────────────

Two layers stacked:
h = sigmoid(W1 · x  + b1)    ← layer 1 output
y = sigmoid(W2 · h  + b2)    ← layer 2 takes layer 1 output

─────────────────────────────────────────────────

Three layers — a "deep" network:
h1 = sigmoid(W1 · x  + b1)   ← raw features
h2 = sigmoid(W2 · h1 + b2)   ← combinations of features
y  = sigmoid(W3 · h2 + b3)   ← final decision

Each layer is just: sigmoid(W · previous_output + b)
Same equation. New inputs. That is all.

This is the entire secret of deep learning. Each layer transforms its input into a new representation — a new way of looking at the data. Early layers learn simple features. Later layers combine those features into increasingly abstract concepts. The model figures all of this out automatically through backpropagation.

"Deep learning is not a new idea. It is logistic regression, applied to itself, repeatedly."

Example: recognizing whether a sentence is positive or negative Input: "I love cats" ↓ word embeddings [0.2, 0.8, 0.1, ...] ───────────────────────────────────── Layer 1 ← learns basic word features sigmoid(W1 · embeddings + b1) h1 = [0.7, 0.2, 0.9, ...] "detects: positive word present" ───────────────────────────────────── Layer 2 ← combines features sigmoid(W2 · h1 + b2) h2 = [0.1, 0.8, 0.3, ...] "detects: positive + subject present" ───────────────────────────────────── Layer 3 ← final decision sigmoid(W3 · h2 + b3) y = 0.92 ← 92% positive! ✓

But wait — there is a problem. If every layer is just a linear transformation followed by sigmoid, what stops the whole stack from collapsing into one single layer? Mathematically, two linear transformations composed together are still just one linear transformation.

The sigmoid is what saves us. It introduces nonlinearity — a bend in the function. After each linear step, sigmoid bends the output. The next layer then works with bent data. Stack enough bends and you can approximate any function in the universe. This is the Universal Approximation Theorem.

Why Nonlinearity Matters

WITHOUT nonlinearity:
h = W2 · (W1 · x + b1) + b2
  = (W2·W1) · x + (W2·b1 + b2)
  = W_combined · x + b_combined
Two layers = one layer. Stacking does nothing!

─────────────────────────────────────────────────

WITH nonlinearity (sigmoid or ReLU):
h = sigmoid(W1 · x + b1)
y = sigmoid(W2 · h + b2)
Now you CANNOT collapse this into one layer.
The bend introduced by sigmoid breaks the linearity.
Each layer genuinely adds new representational power.

─────────────────────────────────────────────────

In practice, ReLU is used more than sigmoid in hidden layers:
ReLU(z) = max(0, z)   ← same idea, faster to compute
sigmoid still used at output for probabilities

Training a deep network works the same way as training a single logistic regression — just with more bookkeeping. You make a prediction with a forward pass through all layers, compute the loss, then backpropagate the error signal backwards through every layer, adjusting every weight matrix along the way.

The chain rule of calculus handles the bookkeeping automatically. Each layer's gradient is computed from the layer above it. Frameworks like PyTorch do all of this for you.

Backprop Through Layers — Chain Rule

Forward pass:
x → Layer1 → h1 → Layer2 → h2 → Layer3 → y → Loss

Backward pass (chain rule):
dLoss/dW3 = dLoss/dy  · dy/dW3               ← easy, close to loss
dLoss/dW2 = dLoss/dy  · dy/dh2  · dh2/dW2   ← one step further
dLoss/dW1 = dLoss/dy  · dy/dh2  · dh2/dh1 · dh1/dW1  ← deepest

Each weight matrix gets its own gradient.
Each gradient tells us: how much did this layer's
weights contribute to the final mistake?

Update each weight:
W = W - learning_rate × dLoss/dW

What Each Layer Actually Learns

Nobody tells the layers what to learn. Layer 1 might learn to detect individual word sentiment. Layer 2 might learn to detect negation ("not good"). Layer 3 might learn to combine everything into a final judgment. The structure emerges from training alone. You set up the architecture. Backprop figures out what each layer should do.

This is why the field is called deep learning — depth refers to the number of layers. A shallow network has one or two. A deep network has many. GPT-3 has 96 layers. Each one is the same pattern: linear transformation, nonlinearity, pass to next layer.

The naming changes. The architectures look different. But underneath every layer of every deep learning model, you will always find the same thing: a weight matrix, a bias, an activation function. Logistic regression in a new coat.

"Still just logistic regression.
Extra trick: apply it to itself, layer by layer."

Chapter 03

Add Memory

The Recurrent Neural Network — logistic regression that remembers

Logistic regression has no memory. Feed it the word "bank" and it makes a decision. Feed it "river bank" versus "bank account" and it sees the same word each time — it has no idea what came before.

Language is sequential. Context matters. The fix is elegant: feed the output back in as input.

RNN — The Core Equation

Regular logistic regression:
y = tanh(W_x · x  +  b)

RNN — add previous hidden state:
h_t = tanh(W_x · x_t  +  W_h · h_{t-1}  +  b)
                              ↑
                    memory of what came before!

The RNN processes one word at a time. At each step, it takes the current word embedding and the previous hidden state — the memory — and produces a new hidden state. That hidden state carries information about everything seen so far.

"I" → embedding → RNN → h1 ← knows "I" "love" → embedding → RNN → h2 ← knows "I love" "cats" → embedding → RNN → h3 ← knows "I love cats" At each step: h_t = tanh(W_x · x_t + W_h · h_{t-1} + b) ↑ current word ↑ memory

"The hidden state is a fingerprint of meaning. Not the word itself — a compressed summary of everything seen so far."

The RNN solves the memory problem but introduces a new one: vanishing gradients. When you backpropagate through many time steps, the gradient gets multiplied by small numbers over and over. By the time the signal reaches early time steps, it has nearly disappeared. Early words cannot learn properly.

Think of it like a game of telephone — information degrades over distance.

The Vanishing Gradient Problem

Backprop through time multiplies gradients:

dL/dW_early = dL/dh_n × dh_n/dh_{n-1} × ... × dh_2/dh_1

Each term is a sigmoid derivative: between 0 and 0.25

0.2 × 0.2 × 0.2 × 0.2 × 0.2 = 0.00032   ← nearly zero!

Early words get almost NO learning signal.
Long-distance dependencies are hard to learn.

"Still just logistic regression.
Extra trick: feed the output back as input."

Chapter 04

Fix The Memory

LSTM & GRU — four logistic regressions with different jobs

The RNN leaks memory. Information from early words fades before it can influence late decisions. The solution is not to make one big memory — it is to build a managed memory with explicit controls for what to keep, what to add, and what to output.

This is the LSTM. Forget the intimidating diagrams with arrows going everywhere. It is just four logistic regressions, each with a specific job.

LSTM — Four Logistic Regressions

Input at each step:
x_t      = current word
h_{t-1}  = short-term memory
C_{t-1}  = long-term memory  ← new! the "conveyor belt"

─────────────────────────────────────────────────

LR 1 — Forget Gate   "what % of old memory to erase?"
f = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
output: 0 = forget everything, 1 = keep everything

LR 2 — Input Gate    "how much new info to write?"
i = sigmoid(W_i · [h_{t-1}, x_t] + b_i)

LR 3 — Candidate     "what is the new info?"
g = tanh(W_g · [h_{t-1}, x_t] + b_g)

LR 4 — Output Gate   "what to output right now?"
o = sigmoid(W_o · [h_{t-1}, x_t] + b_o)

─────────────────────────────────────────────────

Then just arithmetic — NOT logistic regression:
C_t = f × C_{t-1}  +  i × g    ← update long-term memory
h_t = o × tanh(C_t)             ← update short-term memory

The Key Insight

The long-term memory update uses addition, not multiplication. Gradients flow through addition with no shrinkage — the vanishing gradient problem is solved. The highway for information is now paved.

Notice what the LSTM is doing at its core: it is computing weighted averages. The forget gate produces a weight for old memory. The input gate produces a weight for new information. Then it takes a weighted average of the two. The weights are themselves produced by logistic regressions.

This is the pattern that will repeat through every architecture we study: logistic regression learns the weights, weighted average combines the information.

GRU — The Simpler Version

GRU asks: do we really need two separate memories and four gates? It merges them into two gates and one memory. Faster to train, slightly less expressive.

GRU Gates

r = sigmoid(W_r·[h,x])
z = sigmoid(W_z·[h,x])

h_t = (1-z)·h_{t-1} + z·g

"Still just logistic regression.
Extra trick: four LRs managing two memories."

Chapter 05

Two Networks Talking

Seq2Seq — variable-length input, variable-length output

A single RNN can process a sequence. But translating "Hi" into "Bonjour mon ami" requires something different — the input and output have different lengths, and one must be fully understood before the other begins.

The solution: two RNNs. One reads. One writes.

ENCODER — reads everything, says nothing "I" → RNN → h1 "love" → RNN → h2 "cats" → RNN → h3 = c ← the context vector. entire meaning, one vector. ───────────────────────────────────────────────────── DECODER — starts from c, generates until done c → h0_decoder h0_d + <START> → RNN → h1_d → "J'aime" h1_d + "J'aime" → RNN → h2_d → "les" h2_d + "les" → RNN → h3_d → "chats" h3_d + "chats" → RNN → h4_d → <END> ← model decides to stop

The encoder throws away all its outputs and keeps only the final hidden state. That one vector — called the context vector — must carry the meaning of the entire input sentence into the decoder.

This works. For short sentences, it works well. For long sentences, it starts to fail. Squeezing "The quick brown fox jumped over the lazy dog and then sat down by the river" into a single fixed-size vector means something will be lost.

"The context vector is a bottleneck. One vector, responsible for carrying everything. This is the problem that attention was born to solve."

"Still just logistic regression.
Extra trick: two RNNs with an intermediate vector."

Chapter 06

Learn What Matters

Attention — a third logistic regression that learns importance

Bahdanau et al. asked a simple question in 2014: why are we passing only the last hidden state to the decoder? We have all the encoder hidden states — h1, h2, h3 — sitting right there. Why not use all of them?

The answer is attention. At each decoder step, instead of using a fixed context vector, the decoder dynamically computes a fresh one by looking at all encoder states and deciding which ones are most relevant right now.

Bahdanau Attention — Step By Step

Setup:
Encoder states: h1_e, h2_e, h3_e   ("I", "love", "cats")
Decoder query:  h0_d               (what decoder needs now)

─────────────────────────────────────────────────
STEP 1 — Score each encoder state (small LR network)

e1 = v^T · tanh(W1·h0_d  +  W2·h1_e)  = 0.31
e2 = v^T · tanh(W1·h0_d  +  W2·h2_e)  = 0.85  ← "love" is relevant!
e3 = v^T · tanh(W1·h0_d  +  W2·h3_e)  = 0.24

W1 brings decoder into shared space
W2 brings encoder into shared space
v collapses to a single relevance score

─────────────────────────────────────────────────
STEP 2 — Softmax → attention weights

softmax([0.31, 0.85, 0.24])
→ α1=0.27,  α2=0.47,  α3=0.26   ← sum = 1.0

─────────────────────────────────────────────────
STEP 3 — Weighted average → context vector

c1 = 0.27·h1_e + 0.47·h2_e + 0.26·h3_e
     ↑ "I"       ↑ "love"     ↑ "cats"
     dominated by "love" — exactly right for "J'aime"!

─────────────────────────────────────────────────
STEP 4 — Decoder generates next word

h1_d = RNN(h0_d, <START>, c1) → "J'aime"  ✓

Repeat steps 1-4 for every decoder step.
Each step gets its own fresh context vector!

Why The Score Network?

The encoder and decoder hidden states come from different networks — they live in different spaces. You cannot just take their dot product. W1 and W2 are a learned bridge that projects both into a common space where comparison is meaningful. This bridge is trained by backpropagation alongside everything else.

Notice something beautiful here. The attention weights are just a weighted average — and the weights are learned by a small logistic regression. Same two moves as before. LR learns importance. Weighted average combines information. The only difference is that now we are computing this dynamically at every decoder step, over all encoder hidden states instead of just one.

Training pushes the network so that similar meanings produce similar hidden states. Nobody explicitly tells the model that "love" and "aime" are related. It figures this out from thousands of examples of English-French pairs. The similarity is entirely learned.

WITHOUT ATTENTION: WITH ATTENTION: ──────────────────── ──────────────────── Decoder sees only Decoder sees ALL encoder c = h3_e (bottleneck!) states at every step Context never changes. Context computed fresh at each step. Long sentences fail. Long sentences work. Information lost. Nothing forgotten.

"Still just logistic regression.
Extra trick: a third LR that learns which input matters."

Chapter 07

Every Word Talks To Every Word

The Transformer — throw away the RNN, keep the attention

Vaswani et al. asked an even more radical question in 2017: do we even need the RNN? Attention already lets every decoder step look at every encoder state. What if we extended that idea — what if every word in a sentence looked at every other word directly, with no recurrence at all?

This is self-attention. And it is the transformer.

Self-Attention — Every Word Attends To Every Word

For each word, create three vectors:
q_i = W_Q · x_i   "what am I looking for?"
k_i = W_K · x_i   "what do I contain?"
v_i = W_V · x_i   "what do I give?"

Honest note: Q and K are just two weight matrices
applied to the same input. The names are borrowed
from database search. The real reason to have
separate matrices: it allows asymmetric attention.
"eat→bread" can be stronger than "bread→eat".

─────────────────────────────────────────────────
Score every word against every other word:

score("eat", "I")     = q_eat · k_I     / √d = 0.1
score("eat", "eat")   = q_eat · k_eat   / √d = 0.8
score("eat", "bread") = q_eat · k_bread / √d = 0.6

softmax → [0.10,  0.50,  0.40]

new_eat = 0.10·v_I + 0.50·v_eat + 0.40·v_bread
          "eat" now contains info about bread (object)
          and I (subject) — without any RNN!

Every word does this simultaneously. "I", "eat", and "bread" all compute their attention scores in parallel, all update their representations at the same time. No sequential processing. No vanishing gradients from long chains of RNN steps. No bottleneck.

One Full Transformer Block

INPUT: sequence of word embeddings + positional encoding
  ↓
MULTI-HEAD SELF-ATTENTION
Run self-attention H times in parallel.
Each head asks a different question:
  Head 1: grammatical relationships?
  Head 2: semantic relationships?
  Head 3: positional relationships?
Concatenate all outputs. One final LR to combine.
  ↓
ADD & NORMALIZE
output = LayerNorm(input + attention_output)
The "input +" is a residual connection.
Gradient flows through addition without shrinking.
  ↓
FEEDFORWARD NETWORK
Each word transformed independently.
Two logistic regressions with nonlinearity.
Prevents multiple attention layers from collapsing.
  ↓
ADD & NORMALIZE
  ↓
OUTPUT: richer word representations
Stack this block 6, 12, 24, 96 times.

The transformer adds a few engineering necessities. Positional encoding because self-attention has no sense of order — "I eat bread" and "bread eat I" look identical to it without position signals. Residual connections because adding the input to the output of each layer creates a gradient highway that allows very deep networks to train. Layer normalization to keep numbers stable. Future masking to prevent the decoder from peeking at words it is supposed to predict.

Future Masking — No Cheating

During training, we have the full answer.
Without masking, "J'aime" can see "les" and "chats".
That is cheating — the model learns to copy, not predict.

Solution: set future attention scores to -infinity

              J'aime   les    chats
J'aime  [     0.6     -inf   -inf ]
les     [     0.4      0.4    -inf ]
chats   [     0.3      0.3     0.4  ]

After softmax: e^(-inf) = 0 → exact zero attention
Future is invisible. Model must actually predict.

The Full Story In One Place

The transformer has three types of attention. Encoder self-attention: every input word understands every other input word. Decoder masked self-attention: every output word sees only previous output words. Cross-attention: output words look at input words — this is Bahdanau's 2014 idea, preserved inside the transformer.

"Still just logistic regression.
Extra trick: throw away the RNN. Let every word talk to every word directly. Stack it deep."

The Full Journey

Two Moves, Six Architectures

Everything we built, everything we used

THE FAMILY TREE Logistic Regression y = sigmoid(Wx + b) 1 decision maker ↓ + memory RNN h_t = tanh(W_x·x_t + W_h·h_{t-1} + b) 1 LR, feeds itself back ↓ + better memory LSTM 4 LRs managing 2 memories forget / input / candidate / output gates ↓ + simplify GRU 2 LRs managing 1 memory reset / update gates ↓ + two networks Seq2Seq Encoder RNN + Decoder RNN variable length in → variable length out bottleneck: c = h_last ↓ + focus Attention (Bahdanau 2014) Encoder + Decoder + Score Network c computed fresh at every decoder step small LR learns which encoder state matters ↓ + remove RNN entirely Transformer (Vaswani 2017) Pure attention, stacked deep every word attends to every word parallel, scalable, no vanishing gradient

The Two Moves — In Every Architecture

MOVE 1: Logistic Regression learns importance
weight = sigmoid(W · input + b)

MOVE 2: Weighted average combines information
output = Σ weight_i × value_i

─────────────────────────────────────────────────
Where these appear:

RNN hidden state    = weighted mix of past + present
LSTM forget gate    = LR weighting old memory
LSTM input gate     = LR weighting new information
Attention scores    = LR weighting encoder states
Self-attention      = LR weighting all words
Multi-head          = multiple LRs, multiple averages
Feedforward         = LR transforming individual words

Every single component. Two moves. Always.