Transformer Attention Math: Q/K/V, Softmax Weights, Masks, and KV Cache
Transformer attention can be read as a weighted lookup. Each token uses a Query to score every Key, then uses the resulting weights to combine Values. The formula is compact, but the engineering details matter.
This article hand-calculates scaled dot-product attention for three toy tokens and explains Q/K/V, softmax, masks, multi-head attention, and KV cache.
1. The Core Formula
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
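QK^T produces a token-to-token score matrix. If the components of a query and a key are roughly independent with zero mean and unit variance, their dot product has variance d_k, so raw scores grow with the key dimension. Dividing by sqrt(d_k) brings the variance back to about 1, which keeps softmax from saturating into near one-hot weights with vanishing gradients.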
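The effect of the sqrt(d_k) divisor can be checked numerically. The sketch below is not part of the lab: it assumes random queries and keys with independent standard-normal components, and the helper name `dot_product_std` is made up for illustration.

```python
import numpy as np

def dot_product_std(d_k, n_samples=2000, seed=0):
    """Return (std of raw q.k, std of q.k / sqrt(d_k)) for random vectors.

    Assumption: components are independent standard normals, which is a
    simplification of what trained projections actually produce.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=1)      # one dot product per sampled (q, k) pair
    scaled = raw / np.sqrt(d_k)      # the scaling used in the attention formula
    return raw.std(), scaled.std()

for d_k in (4, 64, 1024):
    raw_std, scaled_std = dot_product_std(d_k)
    print(f"d_k={d_k:5d}  raw std={raw_std:7.2f}  scaled std={scaled_std:5.2f}")
```

Under these assumptions the raw standard deviation should track sqrt(d_k) while the scaled one stays close to 1, regardless of the key dimension.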
2. Hand Calculate One Softmax Row
The lab uses three tokens: AI, needs, and math. For the Query token AI, the scaled scores are:
[0.579828, 0.353553, 0.820244]
After softmax, the weights are:
[0.325810, 0.259833, 0.414358]
In this toy embedding space, AI attends most strongly to math. Attention is not a complete explanation of model behavior; it is one internal weighted read.
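The softmax row above can be reproduced in a few lines of NumPy. This is a minimal sketch: the `softmax` helper is written here for illustration and may differ from the one in the lab scripts. The scores are copied from the text.

```python
import numpy as np

def softmax(row):
    """Numerically stable softmax over a 1-D array of scores."""
    row = np.asarray(row, dtype=float)
    shifted = row - row.max()        # subtracting the max does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum()

scores_ai = [0.579828, 0.353553, 0.820244]   # scaled scores for Query "AI"
weights_ai = softmax(scores_ai)

print(np.round(weights_ai, 6))   # should agree with the weights listed above
print(weights_ai.sum())          # a softmax row always sums to 1
```

The largest score (math) gets the largest weight, but the other two tokens still contribute: softmax redistributes weight rather than picking a single winner.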
3. Where Masking Enters
An autoregressive language model must not let the current position see future tokens. The usual implementation changes future scores before softmax:
scores = QK^T / sqrt(d_k)
scores[future_positions] = -infinity
weights = softmax(scores)
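With -infinity, the future positions receive exactly zero weight after softmax; implementations that use a large negative constant instead get weights that are nearly zero. A mask does not delete tokens; it changes the visible set before probability normalization.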
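The three pseudocode lines can be turned into a runnable causal mask as follows. This is a sketch under assumptions: only the first row of `scores` comes from the article, the other two rows are hypothetical values chosen for illustration, and the function name `causal_softmax` is not from the lab.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)           # mask BEFORE softmax
    masked = masked - masked.max(axis=1, keepdims=True)  # diagonal is finite, so max is finite
    exp = np.exp(masked)                                 # exp(-inf) evaluates to 0.0
    return exp / exp.sum(axis=1, keepdims=True)

scores = np.array([
    [0.579828, 0.353553, 0.820244],   # AI row from the article
    [0.40, 0.70, 0.20],               # hypothetical row for "needs"
    [0.10, 0.30, 0.90],               # hypothetical row for "math"
])

weights = causal_softmax(scores)
print(np.round(weights, 6))
```

With the mask applied, the first token can only attend to itself, so its row becomes [1, 0, 0]; the last row is unaffected because nothing lies in its future.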
4. What Multi-Head Attention Adds
A single attention head observes token relationships through one set of Q/K/V projections. Multi-head attention runs several such projections in parallel, each in a smaller subspace, then concatenates the results. Every head performs the same scaled dot-product computation; for one head it is:
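import numpy as np
# Q, K: (tokens, d_k) arrays; V: (tokens, d_v) array
scores = (Q @ K.T) / np.sqrt(Q.shape[1])                          # (tokens, tokens)
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))   # stable exponentials
weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)      # row-wise softmax
context = weights @ V                                             # (tokens, d_v)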
For token AI, the lab reports the context vector [0.683279, 0.399593], which is the weighted sum of the three Value vectors using AI's softmax weights.
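To make the "several projections in smaller subspaces, then concatenate" description concrete, here is a minimal NumPy sketch. Everything in it is an assumption for illustration: the weights are random rather than the lab's values, the sizes (3 tokens, d_model = 4, 2 heads) are arbitrary, and the final output projection `W_o` is the customary last step that the text above does not mention.

```python
import numpy as np

def softmax_rows(x):
    """Softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads      # each head works in a smaller subspace

    def split(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n_tokens, n_tokens)
    weights = softmax_rows(scores)
    heads = weights @ V                                   # (n_heads, n_tokens, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)   # concatenate heads
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 3, 4, 2
X = rng.standard_normal((n_tokens, d_model))              # hypothetical token embeddings
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)   # (3, 4) (2, 3, 3)
```

Each head yields its own tokens x tokens weight matrix, so two heads can attend to different tokens for the same query.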
5. Why KV Cache Speeds Up Generation
When generating token t, the Key and Value vectors for previous tokens do not change. KV cache stores those historical K,V tensors. The next step only computes Q/K/V for the new token and appends the new K/V to the cache.
This avoids repeated projections over the full prefix, but the cache grows linearly with context length (and with the number of layers and heads), so the saved compute is paid for in memory.
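The caching idea can be sketched for a single head as follows. This is an illustration under assumptions, not the lab's code: the weights and token embeddings are random, and the cache is grown with `np.vstack` for readability, whereas real implementations typically preallocate the buffers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Attention output for one query vector against all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_model = 4
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
tokens = rng.standard_normal((5, d_model))   # hypothetical embeddings, one per step

# Incremental decoding: project only the new token, append its K/V to the cache.
K_cache = np.empty((0, d_model))
V_cache = np.empty((0, d_model))
cached_outputs = []
for x in tokens:
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    cached_outputs.append(attend(q, K_cache, V_cache))

# Reference: re-project the whole prefix at every step (no cache).
full_outputs = []
for t in range(len(tokens)):
    prefix = tokens[: t + 1]
    full_outputs.append(attend(tokens[t] @ W_q, prefix @ W_k, prefix @ W_v))

print(np.allclose(cached_outputs, full_outputs))   # both paths should agree
print(K_cache.shape)                               # cache has one row per generated token
```

The two paths give the same outputs, but the uncached path redoes the K/V projections for the whole prefix at every step, while the cached path does one projection per step and keeps one extra row of K and V per token.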
6. What The Animation Shows
7. Practical Notes
- Check the attention matrix shape first; it is usually tokens x tokens.
- Apply masks before softmax, not after.
- Long context is not free; KV cache consumes memory.
- Attention heatmaps help debugging but are not automatic causal explanations.
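The first two notes can be verified directly. The sketch below uses random Q and K (an assumption, not lab data) to check the score matrix shape and to show what goes wrong when the mask is applied after softmax instead of before.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_k = 3, 2
Q = rng.standard_normal((n_tokens, d_k))   # hypothetical queries
K = rng.standard_normal((n_tokens, d_k))   # hypothetical keys

scores = Q @ K.T / np.sqrt(d_k)
assert scores.shape == (n_tokens, n_tokens), "attention matrix should be tokens x tokens"

future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)

# Correct: mask first, then normalize over the visible positions only.
mask_before = softmax_rows(np.where(future, -np.inf, scores))

# Incorrect: normalize over all positions, then zero out the future ones.
mask_after = np.where(future, 0.0, softmax_rows(scores))

print(mask_before.sum(axis=1))   # every row sums to 1
print(mask_after.sum(axis=1))    # rows with masked entries sum to less than 1
```

Masking after softmax leaves rows that are no longer probability distributions, because the weight that was assigned to future tokens is thrown away instead of being redistributed.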
With attention covered, the deep learning math series now connects matrix calculus, backpropagation, optimization, convolution, and attention.
