Transformer Attention Math Tutorial: Q/K/V, Masks, Multi-Head Attention, and KV Cache

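Q: Who is this article for?

It is for readers who want an advanced-level understanding of "Transformer Attention Math: Q/K/V, Softmax Weights, Masks, and KV Cache". Estimated reading time is about 14 minutes, and the main topics are Transformer, Attention, QKV, and KV Cache.

Reading info

Difficulty: Advanced
Reading time: 14 minutes

Transformer
Attention
QKV
KV Cache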


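Transformer Attention Math: Q/K/V, Softmax Weights, Masks, and KV Cache

Self-attention in a Transformer can be read informally like this: every token in the sequence uses its own query vector (Query) to score the key vectors (Key) of all tokens, itself included, and the scores decide how its attention is spread over the sequence. Those weights are then used to take a weighted sum of the tokens' value vectors (Value). The math fits on one line, but implementing and training it involves many details that are easy to get wrong.

This article starts from first principles: it works through Scaled Dot-Product Attention by hand for 3 tokens, then covers Q/K/V projections, softmax saturation, causal masking, multi-head attention, and the KV Cache, the key optimization on the inference side.

1. The Core Mathematical Formula

This is the foundational formula of the large language model era: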

Attention(Q, K, V) = softmax((Q @ K^T) / sqrt(d_k)) @ V

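Where:

Q @ K^T produces an attention score matrix of size `[seq_len, seq_len]`. Each entry is a dot product, so it measures how similar or related one query token is to one key token.
Why divide by sqrt(d_k)? Suppose the elements of Q and K are independent with mean 0 and variance 1. Each dot product is then a sum of d_k such products, so its variance is d_k. Taking `d_k = 4096` purely as an illustration (per-head sizes such as 64 or 128 are more typical), the variance is `4096` and the standard deviation is 64, so scores like 100 and -100 are common. Softmax over such scores puts almost all weight on one position and its gradients are close to zero (vanishing gradients); this is "softmax saturation". Dividing by sqrt(d_k) brings the variance back to about 1.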
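The variance argument can be checked numerically. The following is a minimal sketch, assuming standard-normal entries (one distribution that satisfies the mean 0, variance 1 condition); the seed, the sample count, and the `results` dictionary are illustrative choices and not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_pairs = 5_000                  # number of random (q, k) pairs per setting
results = {}

for d_k in (4, 64, 1024):
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    raw = np.sum(q * k, axis=1)        # unscaled dot products, one per pair
    scaled = raw / np.sqrt(d_k)        # scaled as in the attention formula
    results[d_k] = (raw.var(), scaled.var())
    print(d_k, round(raw.var(), 1), round(scaled.var(), 3))
```

Up to sampling noise, the unscaled variance should come out close to d_k and the scaled variance close to 1.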

2. Architecture Diagram: Data Flow and Shape Changes


graph TD
    Input[Input Sequence: B, L, d_model] --> WQ(W_q Linear)
    Input --> WK(W_k Linear)
    Input --> WV(W_v Linear)
    
    WQ --> Q[Q: B, h, L, d_k]
    WK --> K[K: B, h, L, d_k]
    WV --> V[V: B, h, L, d_v]
    
    Q --> Dot[Dot Product: Q @ K^T]
    K --> Dot
    Dot --> Scale["Scale by 1/sqrt(d_k)"]
    Scale --> Mask[Apply Causal Mask]
    Mask --> Softmax["Softmax over keys (last axis)"]
    Softmax --> AttentionWeights[Attention Weights: B, h, L, L]
    
    AttentionWeights --> MatMulV[MatMul with V]
    V --> MatMulV
    
    MatMulV --> Context[Context Output: B, h, L, d_v]
    Context --> Concat[Concat Heads: B, L, d_model]
    Concat --> Out[W_o Linear]
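The diagram goes straight from the linear projections to per-head tensors. Here is a minimal sketch of the reshape that sits in between, assuming d_v = d_k = d_model / h. `split_heads` and `merge_heads` are hypothetical helper names, and the sizes are toy values chosen only for illustration.

```python
import numpy as np

B, L, d_model, h = 2, 5, 8, 2     # toy sizes
x = np.arange(B * L * d_model, dtype=float).reshape(B, L, d_model)

def split_heads(t, h):
    """(B, L, d_model) -> (B, h, L, d_k): cut the feature axis into h chunks."""
    B, L, d_model = t.shape
    return t.reshape(B, L, h, d_model // h).transpose(0, 2, 1, 3)

def merge_heads(t):
    """(B, h, L, d_k) -> (B, L, h * d_k): the 'Concat Heads' step."""
    B, h, L, d_k = t.shape
    return t.transpose(0, 2, 1, 3).reshape(B, L, h * d_k)

heads = split_heads(x, h)
restored = merge_heads(heads)
print(heads.shape)                   # (2, 2, 5, 4)
print(np.array_equal(restored, x))   # merging undoes the split
```

Head 1 of a token holds that token's features 4 to 7, so the split is a pure relabeling of axes with no arithmetic involved.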

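3. Hands-On: Self-Attention in NumPy

Formulas alone are abstract, so here is a runnable NumPy implementation written by hand. Assume the input is a sequence of only 3 tokens (for example "AI", "needs", "math"), with d_k = 4 for the queries and keys and 2 for the values: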

import numpy as np

# 1. Toy Q, K, V matrices (seq_len=3, d_k=4, d_v=2)
# Rows stand for the tokens "AI", "needs", "math"
Q = np.array([
    [ 1.0,  0.5, -0.2,  0.1],  # AI
    [-0.5,  1.2,  0.8, -0.4],  # needs
    [ 0.2, -0.1,  1.5,  0.9]   # math
])
K = np.array([
    [ 0.8,  0.4, -0.3,  0.0],
    [-0.2,  1.0,  0.5, -0.1],
    [ 0.1, -0.2,  1.1,  0.7]
])
V = np.array([
    [ 1.0,  0.0],
    [ 0.0,  1.0],
    [-1.0, -1.0]
])

d_k = Q.shape[1]

# 2. Compute the scores and scale them
scores = (Q @ K.T) / np.sqrt(d_k)
print("Scaled Scores:\n", scores)

# 3. Causal mask
# Hide future positions so the model cannot cheat by looking ahead
mask = np.triu(np.ones((3, 3)), k=1)
scores[mask == 1] = -np.inf

# 4. Softmax normalization
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

weights = softmax(scores)
print("Attention Weights:\n", np.round(weights, 3))

# 5. Weighted sum of the values (context)
context = weights @ V
print("Context Output:\n", context)

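Running this code shows that the first row, for "AI", puts all of its attention weight on itself. The second row, "needs", splits its weight between "AI" and itself (about 0.28 and 0.72), and the third row, "math", spreads its weight over all three tokens, itself included (about 0.16, 0.24, and 0.60). That is the essence of an autoregressive model: each position may use only itself and earlier positions, never later ones.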

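4. What Does the Causal Mask Actually Change?

As the code above shows, in autoregressive generation the position for the 3rd token must never "see" the 4th or 5th token. Before the softmax, we overwrite the attention scores strictly above the diagonal with negative infinity (-inf). Since exp(-inf) is 0, those positions get a weight of exactly 0 after the softmax. So the mask does not delete tokens; it cuts them off at the probability level, so that a disallowed attention assignment simply cannot occur.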
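The "exactly 0" claim is easy to confirm. The sketch below uses made-up scores and builds the masked matrix with `np.where`, which, unlike in-place assignment, leaves the original `scores` array untouched; the variable names are illustrative.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

scores = np.array([[0.5, 0.1, -0.2],
                   [0.3, 0.9,  0.4],
                   [0.0, 0.2,  1.1]])
future = np.triu(np.ones((3, 3), dtype=bool), k=1)   # True strictly above the diagonal

masked = np.where(future, -np.inf, scores)   # -inf only at future positions
weights = softmax(masked)

print(weights[future])         # the three masked entries
print(weights.sum(axis=-1))    # each row is still a full distribution
```

One caveat with this softmax: a row in which every position is masked would have a maximum of -inf and produce NaN, so every query needs at least one visible key. The causal mask guarantees that, because the diagonal is always kept.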

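5. Lessons from Engineering: The Memory Killer and the KV Cache

From practice: textbooks show an elegant matrix formula, but when deploying LLMs in production what you often see is one OOM (Out of Memory) error after another.

At inference time a large model generates token by token. When it generates token $t+1$, the K and V of the previous $t$ tokens do not change at all. Recomputing them from the full L x d_model input at every step wastes a great deal of compute.

The essence of the KV Cache is trading space for time.

We reserve a region of GPU memory (a single contiguous buffer in the simplest implementations) and keep the K and V of all earlier tokens there.
For each new token we compute only $Q_{new}, K_{new}, V_{new}$ for that 1 token, then append both $K_{new}$ and $V_{new}$ to the cache.
The cost is high: the cache grows linearly with context length, batch size, layer count, and the number of K/V heads. For a context of just 10K tokens it can reach several GB per sequence on a multi-billion-parameter model, and with a large enough batch it can exceed the memory taken by the model weights. This is why the industry developed PagedAttention (the core of vLLM), which cuts the memory wasted by fragmentation and over-allocation, and MQA (Multi-Query Attention) and GQA (Grouped-Query Attention), which shrink the cache itself by sharing K/V heads across query heads.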
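To put numbers on the cache size, here is a back-of-the-envelope sketch. `kv_cache_bytes` is a hypothetical helper, and the configuration below (32 layers, head size 128, 16-bit values) is an assumed example rather than the specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer, head, and position."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch_size * bytes_per_elem

# Assumed example: 10K-token context, one sequence
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=10_000, batch_size=1)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=10_000, batch_size=1)

print(f"32 K/V heads: {mha / 2**30:.2f} GiB")
print(f" 8 K/V heads: {gqa / 2**30:.2f} GiB")
```

Under these assumptions, cutting the K/V heads from 32 to 8, as a GQA-style layout would, reduces the cache by exactly a factor of 4; the batch size multiplies both figures.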

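6. The Three Shapes That Are Easiest to Get Wrong

The first is the batch dimension. Tutorial code often writes Q @ K.T, which only works for a single sequence; real models usually use batch x heads x tokens x dim. There you should transpose only the last two dimensions and keep the batch and head dimensions out of it. When a shape is wrong the program does not always raise an error; sometimes broadcasting silently produces a completely wrong attention matrix.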
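For the batched case, a minimal NumPy sketch of "transpose only the last two dimensions" looks like this. The sizes are arbitrary toy values, and `np.swapaxes` is one of several equivalent ways to write it.

```python
import numpy as np

rng = np.random.default_rng(0)
B, h, L, d_k = 2, 3, 4, 8
Q = rng.standard_normal((B, h, L, d_k))
K = rng.standard_normal((B, h, L, d_k))

# Swap only the token and feature axes of K; batch and head stay in place.
scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
print(scores.shape)   # (2, 3, 4, 4)

# By contrast, .T on a 4-D array reverses every axis.
print(K.T.shape)      # (8, 4, 3, 2)
```

Each `scores[b, i]` equals the single-sequence result `Q[b, i] @ K[b, i].T / np.sqrt(d_k)`, which is a convenient way to test a batched implementation against the toy one.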

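The second is the mask dimension. A causal mask must cover the two-dimensional query-token to key-token relation and be broadcastable over batch and head. A padding mask instead marks which tokens are padding. The two masks mean different things, so they cannot simply be added together without thought.

The third is the softmax axis: it must normalize along the key dimension. If it normalizes along the query dimension, each column becomes a probability distribution and the meaning of attention is reversed.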
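On the second point, one way to combine the two masks is to express both as boolean "this key is allowed" arrays and take their logical AND. This is a sketch under that assumption; the `lengths` array describes a hypothetical batch in which the second sequence has two padding tokens at the end.

```python
import numpy as np

L = 4
lengths = np.array([4, 2])   # real (non-padding) tokens per sequence

causal_ok = np.tril(np.ones((L, L), dtype=bool))         # (L, L): key j <= query i
key_is_real = np.arange(L)[None, :] < lengths[:, None]   # (B, L): key j is not padding

# Broadcast to (B, 1, L, L): the size-1 axis broadcasts over heads.
allowed = causal_ok[None, None, :, :] & key_is_real[:, None, None, :]

print(allowed.shape)
print(allowed[1, 0].astype(int))   # visibility pattern for the padded sequence
```

The result would then be applied as `np.where(allowed, scores, -np.inf)` before the softmax. Because key 0 is real in both sequences, every query row keeps at least one visible key.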

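7. How to Check the Results of an Attention Experiment

The most basic check is that every row of attention weights sums to approximately 1. Next, check that the masked future positions are close to 0 after the softmax (exactly 0 when the mask value is -inf). Then check that the last dimension of context matches the last dimension of Value. For the three-token toy example in this article you can also compute the first row of the softmax by hand, to confirm that the weights are not the result of a row-ordering bug or a mask written in the wrong direction.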
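These checks translate directly into assertions. The sketch below recomputes the article's three-token example and asserts each property; the variable names follow the example, and the `future` mask name is an illustrative choice.

```python
import numpy as np

# Same toy matrices as in the article's example
Q = np.array([[ 1.0,  0.5, -0.2,  0.1],
              [-0.5,  1.2,  0.8, -0.4],
              [ 0.2, -0.1,  1.5,  0.9]])
K = np.array([[ 0.8,  0.4, -0.3,  0.0],
              [-0.2,  1.0,  0.5, -0.1],
              [ 0.1, -0.2,  1.1,  0.7]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

L, d_k = Q.shape
future = np.triu(np.ones((L, L), dtype=bool), k=1)
scores = np.where(future, -np.inf, Q @ K.T / np.sqrt(d_k))
weights = softmax(scores)
context = weights @ V

assert np.allclose(weights.sum(axis=-1), 1.0)     # rows are distributions over keys
assert np.all(weights[future] == 0.0)             # no weight on future positions
assert context.shape == (L, V.shape[-1])          # (tokens, d_v)
assert np.allclose(weights[0], [1.0, 0.0, 0.0])   # first row by hand: only itself is visible
print("all checks passed")
```

If the mask direction were flipped (`np.tril` with `k=-1` instead of `np.triu` with `k=1`), the second and fourth assertions would fail, which is the kind of silent bug these checks are meant to catch.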

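An attention heatmap is good for debugging but is not a full explanation. A high weight on a token only means that this weighted read drew more on that token's Value; it does not directly prove why the model made its final prediction. Treating the heatmap as a troubleshooting tool rather than causal evidence avoids many misreadings.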

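8. Attention Verification Matrix

Self-attention implementations often have shapes that run but semantics that are wrong. The matrix below pins the checkpoints down; use it to re-check the NumPy toy example in this article, or carry it over to batched, multi-head, or cached-inference implementations.

Check	Correct evidence	Common mistake
Score shape	`Q @ K.T` yields a two-dimensional query-token by key-token matrix.	Transposing the wrong dimensions and mixing batch/head into the attention matrix.
Scaling and softmax	After dividing by `sqrt(d_k)`, normalize along the key dimension; each row sums to about 1.	Softmax along the query dimension, or no scaling so the weights saturate too early.
Causal mask	Future positions are close to 0 after softmax; earlier positions can still receive weight.	Mask direction reversed, so the current token sees only the future and not the past.
KV Cache	A new token only appends `K_new` and `V_new`; the cached history is not recomputed.	Recomputing all K/V at every step, or cache length out of sync with the position encoding.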

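9. Figure and Data Flow Summary

Figure: heatmap of the scaled dot-product attention weights for the three tokens. Each row is the attention distribution of one Query token over itself and the earlier Key tokens. This is the usual attention heatmap; it shows where each token reads from, which makes it a useful view into the model's behavior, though not a complete explanation of it.

The mechanism looks like nothing more than matrix multiplication, yet it underpins many of today's frontier AI systems. The next time a Transformer throws an error, the first reaction should be: print the shape of every tensor, then draw the matrix multiplications out on paper.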


Transformer Attention Math: Q/K/V, Softmax Weights, Masks, and KV Cache


The core idea of a Transformer's self-attention mechanism can be understood intuitively as follows: every token in a sequence uses its own Query vector to score the Key vectors of all tokens, itself included. The scores determine how "attention" is distributed across the sequence, and that distribution is then used to compute a weighted sum of the information vectors (Values). The mathematical expression is remarkably concise, just one line, but it hides a great deal of engineering and training detail.

In this article, we start from first principles. We manually calculate Scaled Dot-Product Attention for a 3-token sequence and then look at Q/K/V projections, softmax saturation, causal masking, multi-head attention, and the KV Cache, the optimization that dominates memory use at inference time.

1. The Core Mathematical Formula

This is the foundational equation of the Large Language Model era:

Attention(Q, K, V) = softmax((Q @ K^T) / sqrt(d_k)) @ V

Breaking it down:

Q @ K^T produces an attention score matrix of size `[seq_len, seq_len]`. Each entry is a dot product, so it measures the "similarity" or "affinity" between one query token and one key token.
Why divide by sqrt(d_k)? Suppose the elements of Q and K are independent with a mean of 0 and a variance of 1. Each dot product is a sum of d_k such products, so its variance is d_k. Taking `d_k = 4096` purely as an illustration (per-head sizes such as 64 or 128 are more typical), the variance is `4096` and the standard deviation is 64, so scores like 100 and -100 are common. Softmax over such scores puts almost all weight on one position and pushes its gradients toward zero (vanishing gradients), a problem known as "softmax saturation." Dividing by sqrt(d_k) brings the variance back to about 1.
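The variance claim follows in a few lines. This is a sketch of the standard argument, assuming the components $q_i$ and $k_i$ are mutually independent with mean 0 and variance 1, so that the products $q_i k_i$ are uncorrelated across $i$:

```latex
\begin{aligned}
\mathbb{E}[q \cdot k] &= \sum_{i=1}^{d_k} \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
  = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = d_k, \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1.
\end{aligned}
```

Trained Q and K are not exactly independent or unit-variance, so in practice the scaling keeps the scores in a reasonable range rather than giving a variance of exactly 1.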

2. Architectural Diagram: Data Flow and Dimensions


graph TD
    Input[Input Sequence: B, L, d_model] --> WQ(W_q Linear)
    Input --> WK(W_k Linear)
    Input --> WV(W_v Linear)
    
    WQ --> Q[Q: B, h, L, d_k]
    WK --> K[K: B, h, L, d_k]
    WV --> V[V: B, h, L, d_v]
    
    Q --> Dot[Dot Product: Q @ K^T]
    K --> Dot
    Dot --> Scale["Scale by 1/sqrt(d_k)"]
    Scale --> Mask[Apply Causal Mask]
    Mask --> Softmax["Softmax over keys (last axis)"]
    Softmax --> AttentionWeights[Attention Weights: B, h, L, L]
    
    AttentionWeights --> MatMulV[MatMul with V]
    V --> MatMulV
    
    MatMulV --> Context[Context Output: B, h, L, d_v]
    Context --> Concat[Concat Heads: B, L, d_model]
    Concat --> Out[W_o Linear]

3. Practical Demonstration: Self-Attention in NumPy

Formulas can be abstract, so let's run a small NumPy implementation written by hand. Imagine our input sequence consists of just 3 tokens (e.g., "AI", "needs", "math"), with d_k = 4 for the queries and keys and 2 for the values:

import numpy as np

# 1. Simulate Q, K, V Matrices (Seq_len=3, d_k=4)
# Representing tokens: "AI", "needs", "math"
Q = np.array([
    [ 1.0,  0.5, -0.2,  0.1],  # AI
    [-0.5,  1.2,  0.8, -0.4],  # needs
    [ 0.2, -0.1,  1.5,  0.9]   # math
])
K = np.array([
    [ 0.8,  0.4, -0.3,  0.0],
    [-0.2,  1.0,  0.5, -0.1],
    [ 0.1, -0.2,  1.1,  0.7]
])
V = np.array([
    [ 1.0,  0.0],
    [ 0.0,  1.0],
    [-1.0, -1.0]
])

d_k = Q.shape[1]

# 2. Calculate Dot-Product Scores and Scale
scores = (Q @ K.T) / np.sqrt(d_k)
print("Scaled Scores:\n", scores)

# 3. Causal Mask
# Mask out future positions to prevent information leakage (cheating)
mask = np.triu(np.ones((3, 3)), k=1)
scores[mask == 1] = -np.inf

# 4. Softmax Normalization
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

weights = softmax(scores)
print("Attention Weights:\n", np.round(weights, 3))

# 5. Value Weighting (Context Output)
context = weights @ V
print("Context Output:\n", context)

If you run this code, you'll observe that the first row (the word "AI") assigns its attention weight exclusively to itself. The second row ("needs") splits its weight between "AI" and itself (about 0.28 and 0.72), and the third row ("math") spreads its weight over all three tokens, itself included (about 0.16, 0.24, and 0.60). This demonstrates the essence of autoregressive models: each position can draw on itself and earlier positions without peeking into the future.

4. What Does the Causal Mask Actually Change?

As demonstrated in the code above, in autoregressive generation the position for the 3rd token must not "see" tokens 4 or 5. Right before the softmax, we overwrite the attention scores strictly above the diagonal with negative infinity (-inf). Because exp(-inf) is 0, these weights are exactly 0 after the softmax. Therefore, the mask does not delete tokens; rather, it performs a probabilistic cutoff, ensuring that a disallowed attention allocation is impossible.

5. An Engineer's Perspective: The VRAM Killer and KV Cache

Real-World Insight: In textbooks, you see elegant matrix multiplication. But in industrial LLM deployment, what you see is a relentless stream of OOM (Out of Memory) exceptions.

During inference, large language models generate token by token. When generating token $t+1$, the K and V rows for the previous $t$ tokens do not change. Naively recomputing them from the full L x d_model input at every step wastes a large amount of compute.

The KV Cache is a classic space-for-time tradeoff.

We allocate a region of GPU VRAM (one contiguous block in the simplest implementations) to cache the K and V tensors of all earlier tokens.
For every new token generated, we compute $Q_{new}, K_{new}, V_{new}$ for that single token only, and append both $K_{new}$ and $V_{new}$ to the cache.
**The Cost is High:** The cache grows linearly with context length, batch size, layer count, and the number of K/V heads. For a context of just 10K tokens it can reach several GB per sequence on a multi-billion-parameter model, and with a large enough batch it can exceed the memory required for the model weights themselves. This is why the industry developed PagedAttention (the core of vLLM), which cuts the memory wasted by fragmentation and over-allocation, and MQA (Multi-Query Attention) and GQA (Grouped-Query Attention), which shrink the cache itself by sharing K/V heads across query heads.
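The steps above can be sketched as a single-head decoding loop. This is a toy illustration under stated assumptions: random projection matrices `W_q`, `W_k`, `W_v`, no batch or head axes, and a cache grown with `np.vstack` for readability (real implementations typically preallocate). It checks that the cached, step-by-step result matches full causal attention.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d_model, d_k = 5, 8, 4
X = rng.standard_normal((L, d_model))                          # toy token embeddings
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

# Reference: full causal attention over the whole sequence at once
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)
scores[np.triu(np.ones((L, L), dtype=bool), k=1)] = -np.inf
full = softmax(scores) @ V

# Incremental decoding: one token per step, K and V kept in a cache
K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
outputs = []
for t in range(L):
    x_t = X[t:t + 1]                              # only the newest token, shape (1, d_model)
    q_t = x_t @ W_q
    K_cache = np.vstack([K_cache, x_t @ W_k])     # append K_new
    V_cache = np.vstack([V_cache, x_t @ W_v])     # append V_new
    w_t = softmax(q_t @ K_cache.T / np.sqrt(d_k)) # attends to positions 0..t only
    outputs.append(w_t @ V_cache)
cached = np.vstack(outputs)

print(np.allclose(full, cached))
```

No explicit mask is needed in the loop: at step t the cache contains only positions 0 to t, so future keys do not exist yet.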

6. Shape and Mask Checks

The easiest attention bugs are the ones that still produce a tensor. First, verify that batched attention uses batch x heads x tokens x dim and transposes only the final two dimensions for Q @ K^T. Then verify that the causal mask describes query-token to key-token visibility and broadcasts over batch and head. Finally, confirm that softmax is applied over the key dimension, not over the query dimension.
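The softmax-axis mistake is a good example of a bug that still produces a tensor. A small sketch with made-up scores, where `softmax` takes the axis as an explicit argument for illustration:

```python
import numpy as np

def softmax(x, axis):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

# Rows are queries, columns are keys (values are arbitrary)
scores = np.array([[ 2.0,  1.0, 0.0],
                   [ 0.0,  0.0, 0.0],
                   [-2.0, -1.0, 0.0]])

over_keys = softmax(scores, axis=-1)      # correct: each query's row is a distribution
over_queries = softmax(scores, axis=-2)   # wrong: each key's column is a distribution

print(over_keys.sum(axis=-1))      # all ones
print(over_queries.sum(axis=-2))   # all ones, but along the wrong axis
print(over_queries.sum(axis=-1))   # rows no longer sum to 1
```

Both results have the same shape, so only a row-sum check distinguishes them.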

For the three-token toy example, every attention row should sum to approximately 1, future positions should be zero after masking, and the context output should have the same final dimension as V. A heatmap is useful for debugging, but it is not causal proof of why a model predicted a token; it only shows how the current weighted read used Value vectors.

7. Attention Verification Matrix

Self-attention often fails with code that runs but has the wrong semantics. Use the matrix below to audit the NumPy toy example here and to transfer the checks to batched, multi-head, or cached inference implementations.

Check	Correct evidence	Common mistake
Score shape	`Q @ K.T` produces a query-token by key-token matrix.	Transposing the wrong dimensions and mixing batch or head into attention.
Scaling and softmax	Scores are divided by `sqrt(d_k)`; each row sums to about 1 over keys.	Normalizing over queries, or skipping scaling and saturating attention early.
Causal mask	Future positions are near 0 after softmax while historical positions remain visible.	Reversing the mask so the current token sees the future but not the past.
KV cache	Each new token appends only `K_new` and `V_new`; history is not recomputed.	Recomputing all K/V every step, or letting cache length drift from position encoding.

8. Visualizations and Data Flow Summary

Figure: scaled dot-product attention weight heatmap for three tokens. Each row is the attention distribution of one Query token over itself and the earlier Key tokens. This is the classic attention heatmap. It shows where each token reads from, which makes it a useful window into the model's behavior, but not a complete explanation of it.

This mechanism may look like simple linear algebra, but it underpins many of today's frontier AI systems. The next time your Transformer script crashes, your first instinct should be: print the shape of every tensor, and trace the matrix multiplication on a piece of scrap paper.

Running the Code

Environment: Python 3 + NumPy + Matplotlib

Install

cd deep-learning-math-lab
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

python src/attention_math.py

Input: fixed Q/K/V matrices for three tokens
Expected output: scores, softmax weights, the context vectors, and an attention heatmap.

