Backpropagation Computation Graph Tutorial: ReLU, Softmax Cross-Entropy, and a Two-Layer MLP by Hand

Difficulty: Advanced | Reading time: 14 minutes
Topics: Backpropagation, Computation Graph, Softmax

Backpropagation as a Computation Graph: A Two-Layer MLP by Hand

Backpropagation is often introduced as the mysterious engine powering deep learning, but at its core, it is simply reverse-mode automatic differentiation applied to a computation graph. It is not a neural-network-specific trick; rather, it is a disciplined, highly optimized way to move local gradients backward through program operations using the chain rule of calculus.

While forward propagation computes the model's prediction by traversing the graph from inputs to the loss, backpropagation traverses the same graph in reverse. In this article, we work through a two-layer Multi-Layer Perceptron (MLP) step by step: x -> x @ W1 + b1 -> ReLU -> h @ W2 + b2 -> softmax cross-entropy. Inputs are stored as rows, so the weights multiply on the right, matching the graph and the NumPy code. We cover both the mathematical theory and the engineering realities of implementing it from scratch.

1. The Computation Graph Unveiled

To compute derivatives systematically, we decompose the complex neural network into a Directed Acyclic Graph (DAG) of primitive operations. Each node represents a simple mathematical operation (like matrix multiplication or ReLU), and edges represent the flow of tensors. Crucially, each node must be capable of doing two things: calculating its forward output, and calculating its local Vector-Jacobian Product (VJP) during the backward pass.

graph TD
    x["Input x"] --> z1["z1 = x @ W1 + b1"]
    W1["Weights W1"] --> z1
    b1["Bias b1"] --> z1
    z1 --> h["h = ReLU(z1)"]
    h --> logits["logits = h @ W2 + b2"]
    W2["Weights W2"] --> logits
    b2["Bias b2"] --> logits
    logits --> p["p = Softmax(logits)"]
    p --> L["Loss = CrossEntropy(p, target)"]
    target["Target y"] --> L

    classDef fwd fill:#e1f5fe,stroke:#039be5,stroke-width:2px;
    classDef param fill:#fce4ec,stroke:#d81b60,stroke-width:2px;
    class z1,h,logits,p,L fwd;
    class W1,b1,W2,b2 param;

The forward pass must cache intermediate values (like x, z1, and h) because the backward pass needs them to compute local gradients: x for dW1, z1 for the ReLU mask, and h for dW2. This is why training needs much more memory than inference, which can discard each activation as soon as the next layer has consumed it.

Figure: Two-layer MLP computation graph and backward path. A computation graph turns a network into small differentiable steps, so gradients are computed analytically (exact up to floating-point rounding) rather than approximated with finite differences.
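The two capabilities each node needs can be sketched as a pair of methods. This is a minimal sketch, not the accompanying script: the class names `Linear` and `ReLU` and the `cache` attribute are assumptions made for illustration. The point is only that `forward` stores whatever `backward` will need later.

```python
import numpy as np

class Linear:
    """Hypothetical node for out = x @ W + b (inputs stored as rows)."""
    def __init__(self, W, b):
        self.W, self.b = W, b
        self.cache = None

    def forward(self, x):
        self.cache = x                        # needed later to form dW
        return x @ self.W + self.b

    def backward(self, dout):
        x = self.cache
        self.dW = x.T @ dout                  # VJP with respect to W
        self.db = dout.sum(axis=0, keepdims=True)  # VJP with respect to b
        return dout @ self.W.T                # VJP with respect to the input

class ReLU:
    """Hypothetical node for out = max(0, z)."""
    def __init__(self):
        self.cache = None

    def forward(self, z):
        self.cache = z                        # needed later for the mask
        return np.maximum(0, z)

    def backward(self, dout):
        return dout * (self.cache > 0)        # gradient passes only where z > 0

rng = np.random.default_rng(0)
layers = [Linear(rng.standard_normal((3, 4)), np.zeros((1, 4))), ReLU()]

out = rng.standard_normal((2, 3))             # batch of 2 examples, 3 features
for layer in layers:                          # forward: inputs to output
    out = layer.forward(out)

grad = np.ones_like(out)                      # stand-in for an upstream gradient
for layer in reversed(layers):                # backward: output to inputs
    grad = layer.backward(grad)

print(grad.shape, layers[0].dW.shape)         # (2, 3) (3, 4)
```

Running the nodes in reverse order is all that "traversing the graph backward" means for a simple chain like this one.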

2. The Softmax Cross-Entropy Shortcut

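In theory, you could calculate the Jacobian of the Cross-Entropy loss with respect to the Softmax probabilities, and then multiply it by the Jacobian of the Softmax with respect to the logits. In practice, doing this explicitly is wasteful and numerically fragile: the Softmax Jacobian is a C x C matrix per example for C classes, and the loss gradient contains a 1/p term that blows up when the predicted probability of the target class is close to zero.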

When you combine Softmax and Cross-Entropy, the mathematical terms elegantly cancel out, resulting in a beautifully simple gradient with respect to the logits:

dL/dlogits = p - one_hot(target)
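The cancellation can be checked in a few lines. The derivation below assumes a single example; the symbols z (logits), y (one-hot target), and the Kronecker delta are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
p_i &= \frac{e^{z_i}}{\sum_k e^{z_k}}, \qquad
L = -\sum_k y_k \log p_k, \\
\frac{\partial L}{\partial p_k} &= -\frac{y_k}{p_k}, \qquad
\frac{\partial p_k}{\partial z_i} = p_k\,(\delta_{ki} - p_i), \\
\frac{\partial L}{\partial z_i}
  &= \sum_k \left(-\frac{y_k}{p_k}\right) p_k\,(\delta_{ki} - p_i)
   = -y_i + p_i \sum_k y_k
   = p_i - y_i .
\end{aligned}
```

The last step uses the fact that a one-hot target sums to 1. The troublesome 1/p factor cancels against the p in the Softmax Jacobian, which is why the fused form is better behaved.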

For example, if the target is class 1 (zero-based, i.e. the second entry) and the model predicts probabilities [0.1, 0.7, 0.2], the gradient is simply [0.1, 0.7 - 1.0, 0.2] = [0.1, -0.3, 0.2]. Gradient descent steps against the gradient, so the negative entry raises the correct logit while the positive entries lower the incorrect ones. This simplification, together with the numerical stability of working in log space, is one reason frameworks like PyTorch combine the two operations in CrossEntropyLoss.
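As a quick check of this example, the sketch below computes the gradient both ways: the long way through the explicit Softmax Jacobian, and the shortcut. The variable names are illustrative assumptions.

```python
import numpy as np

p = np.array([0.1, 0.7, 0.2])        # predicted probabilities from the example
y = np.array([0.0, 1.0, 0.0])        # one-hot target for class index 1

# Long way: dL/dp first, then multiply by the softmax Jacobian dp/dz
dL_dp = -y / p                        # only the target entry is non-zero
J = np.diag(p) - np.outer(p, p)       # J[k, i] = p_k * (delta_ki - p_i)
long_way = dL_dp @ J

# Shortcut: probabilities minus the one-hot target
shortcut = p - y

print(long_way)                       # approximately [0.1, -0.3, 0.2]
print(shortcut)                       # approximately [0.1, -0.3, 0.2]
```

Both routes agree up to rounding, but the long way builds a 3 x 3 matrix and divides by p, while the shortcut is a single subtraction.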

3. Mathematical Derivations of the MLP

Once we have the gradient of the loss with respect to the logits (dlogits), we propagate it backward using the chain rule. Notice that we never instantiate full Jacobian matrices; instead, we compute Vector-Jacobian Products efficiently using matrix transposes.

# Layer 2 gradients
dW2 = h^T dlogits
db2 = sum(dlogits, axis=0)
dh  = dlogits W2^T

# Layer 1 gradients
dz1 = dh * ReLU'(z1)  # Element-wise multiplication
dW1 = x^T dz1
db1 = sum(dz1, axis=0)
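Where the transposes come from is easiest to see in index notation. The derivation below is a sketch under assumed notation: H and Z_1 are the batched h and z1, S is the matrix of logits, G stands for dlogits, and n indexes examples in the batch.

```latex
\begin{aligned}
S_{nj} &= \sum_k H_{nk}\,(W_2)_{kj} + (b_2)_j, \qquad
G_{nj} = \frac{\partial L}{\partial S_{nj}}, \\
\frac{\partial L}{\partial (W_2)_{kj}}
  &= \sum_n G_{nj}\,\frac{\partial S_{nj}}{\partial (W_2)_{kj}}
   = \sum_n H_{nk}\,G_{nj}
   = (H^{\top} G)_{kj}, \\
\frac{\partial L}{\partial (b_2)_j} &= \sum_n G_{nj}, \\
\frac{\partial L}{\partial H_{nk}}
  &= \sum_j G_{nj}\,(W_2)_{kj}
   = (G\,W_2^{\top})_{nk}, \\
\frac{\partial L}{\partial (Z_1)_{nk}}
  &= \frac{\partial L}{\partial H_{nk}} \cdot \mathbb{1}\!\left[(Z_1)_{nk} > 0\right].
\end{aligned}
```

The first-layer gradients follow the same pattern with x in place of h and dz1 in place of dlogits. A useful habit is to confirm that each result has the shape of the quantity it differentiates: H^T G has the shape of W2, and G W2^T has the shape of H.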

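Checking the norms of these gradients gives a quick sanity check: values of comparable, moderate size in both layers (e.g., norm_dW1=0.999823, norm_dW2=0.993682) suggest that gradients are neither vanishing nor exploding at this step. It is a single snapshot, not proof that the backward pass is correct or that training will remain stable.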

4. Real-World NumPy Implementation

Translating the math into executable code reveals how frameworks actually operate under the hood. Here is a batched, robust implementation using NumPy:

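import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_backward(dout, cache_x):
    # ReLU'(z) is 1 where z > 0 and 0 elsewhere (0 is used at z == 0)
    return dout * (cache_x > 0).astype(float)

def softmax(x):
    # Subtract the row-wise max for numerical stability
    exps = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exps / np.sum(exps, axis=-1, keepdims=True)

# 1. Forward Pass
rng = np.random.default_rng(0)             # fixed seed so the printed values are reproducible
x = rng.standard_normal((32, 10))          # batch size 32, 10 features
W1 = rng.standard_normal((10, 64)) * 0.1
b1 = np.zeros((1, 64))
W2 = rng.standard_normal((64, 5)) * 0.1
b2 = np.zeros((1, 5))
targets = rng.integers(0, 5, size=(32,))   # class indices in [0, 5)

z1 = x @ W1 + b1           # (32, 64), cached for the ReLU mask
h = relu(z1)               # (32, 64), cached for dW2
logits = h @ W2 + b2       # (32, 5)
probs = softmax(logits)    # (32, 5), each row sums to 1

# 2. Backward Pass
batch_size = x.shape[0]
# Mean cross-entropy loss over the batch
loss = -np.mean(np.log(probs[np.arange(batch_size), targets]))

# Softmax-CE gradient: (probs - one_hot(targets)) / batch_size
dlogits = probs.copy()
dlogits[np.arange(batch_size), targets] -= 1
dlogits /= batch_size  # Average over batch, matching the mean loss

# Layer 2
dW2 = h.T @ dlogits                              # (64, 5), same shape as W2
db2 = np.sum(dlogits, axis=0, keepdims=True)     # (1, 5), same shape as b2
dh = dlogits @ W2.T                              # (32, 64)

# Layer 1
dz1 = relu_backward(dh, z1)                      # (32, 64)
dW1 = x.T @ dz1                                  # (10, 64), same shape as W1
db1 = np.sum(dz1, axis=0, keepdims=True)         # (1, 64), same shape as b1

print(f"Loss: {loss:.4f}")
print(f"Gradient norms: dW1={np.linalg.norm(dW1):.4f}, dW2={np.linalg.norm(dW2):.4f}")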
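One way to gain confidence in the implementation above is to compare it against central finite differences. The sketch below is self-contained and uses a smaller network so the check runs quickly; the helper names `forward_loss`, `backward`, `numeric_grad`, and `rel_err` are assumptions for illustration, not part of the accompanying script.

```python
import numpy as np

def forward_loss(x, W1, b1, W2, b2, targets):
    """Mean softmax cross-entropy of the two-layer MLP, plus cached values."""
    z1 = x @ W1 + b1
    h = np.maximum(0, z1)
    logits = h @ W2 + b2
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(x.shape[0]), targets].mean()
    return loss, (z1, h, np.exp(log_probs))

def backward(x, W2, targets, cache):
    """Analytical gradients for W1 and W2, same formulas as the article."""
    z1, h, probs = cache
    n = x.shape[0]
    dlogits = probs.copy()
    dlogits[np.arange(n), targets] -= 1
    dlogits /= n
    dW2 = h.T @ dlogits
    dz1 = (dlogits @ W2.T) * (z1 > 0)
    dW1 = x.T @ dz1
    return dW1, dW2

def numeric_grad(f, param, eps=1e-5):
    """Central difference (f(+eps) - f(-eps)) / (2 * eps), one entry at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = f()
        param[idx] = old - eps
        minus = f()
        param[idx] = old                  # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

def rel_err(a, b):
    return np.abs(a - b).max() / max(1e-12, np.abs(a).max() + np.abs(b).max())

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
W1, b1 = rng.standard_normal((3, 5)) * 0.5, np.zeros((1, 5))
W2, b2 = rng.standard_normal((5, 2)) * 0.5, np.zeros((1, 2))
targets = rng.integers(0, 2, size=4)

loss, cache = forward_loss(x, W1, b1, W2, b2, targets)
dW1, dW2 = backward(x, W2, targets, cache)

f = lambda: forward_loss(x, W1, b1, W2, b2, targets)[0]
err_W1 = rel_err(dW1, numeric_grad(f, W1))
err_W2 = rel_err(dW2, numeric_grad(f, W2))
print(f"relative error: dW1={err_W1:.2e}, dW2={err_W2:.2e}")
```

The check runs in float64 on purpose: finite differences lose several digits to cancellation, so float32 makes a correct backward pass look wrong.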

5. Personal Experience / Engineer's Perspective

From years of writing and debugging custom CUDA kernels and deep learning architectures, here are my takeaways on backpropagation in the wild:

The Memory Wall: Beginners often assume training is slow because of the arithmetic, but in practice it is frequently limited by memory, both bandwidth and capacity. During the forward pass, we have to stash the activations (like z1 and h) in HBM because the backward pass needs them. This is why techniques like Gradient Checkpointing exist: they drop some activations and recompute them during the backward pass, trading extra FLOPs for lower VRAM usage.

Broadcasting Bugs: In NumPy and PyTorch, implicit broadcasting is a silent killer. If you write db1 = dz1 and forget to sum over the batch axis, the gradient has shape (batch, hidden) instead of (1, hidden), and later operations may broadcast it without raising an error, producing garbage updates. Sum over the batch axis with keepdims=True so the result keeps the shape of b1, and assert that every gradient has the same shape as its parameter.

Gradient Checking: When writing a custom C++ or CUDA backward pass, make a finite-difference gradient checker one of the first things you write. Compare your analytical gradient against the central difference (f(theta + eps) - f(theta - eps)) / (2 * eps) for a small eps. As a rule of thumb, a relative error above about 1e-4 in double precision usually points to a bug, although kinks such as ReLU at zero can cause isolated mismatches.

Numerical Stability: Never compute np.exp(logits) directly. Always subtract the maximum logit first. exp overflows float32 for inputs above roughly 88.7, so a logit of 1000 becomes inf, the normalization turns into inf / inf = NaN, and the NaN gradients poison the entire network.
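Two of these pitfalls are easy to reproduce in a few lines. The sketch below uses made-up shapes and values purely for illustration: it shows a missing batch sum slipping through without an error, and a naive softmax producing NaN in float32.

```python
import numpy as np

# 1) Broadcasting: forgetting the batch sum does not raise an error
b1 = np.zeros((1, 4))                       # bias for 4 hidden units
dz1 = np.ones((8, 4))                       # stand-in gradient for a batch of 8
wrong_db1 = dz1                             # bug: no sum over the batch axis
right_db1 = dz1.sum(axis=0, keepdims=True)  # correct: shape (1, 4), like b1

updated = b1 - 0.1 * wrong_db1              # broadcasts silently
print(updated.shape)                        # (8, 4): the "bias" is now a matrix
assert right_db1.shape == b1.shape          # a shape assertion catches the bug
# Note: the in-place form `b1 -= 0.1 * wrong_db1` would raise, because
# an (8, 4) result cannot be written back into a (1, 4) array.

# 2) Overflow: naive softmax in float32 with a large logit
logits = np.array([1000.0, 0.0], dtype=np.float32)
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()   # inf / inf gives NaN

shifted = logits - logits.max()             # largest entry becomes 0
stable = np.exp(shifted) / np.exp(shifted).sum()

print(naive)                                # contains NaN
print(stable)                               # [1. 0.]
```

Subtracting the maximum does not change the softmax output mathematically, because the common factor cancels between numerator and denominator.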

6. Visualizing the Flow

The animation starts at the loss and lights up the gradient path through logits, hidden activations, and the first-layer weights.

When watching the animation, don't just follow the arrows. Observe which forward values each node must store and reuse during the backward pass. The next article studies how these gradients move the parameters and why different optimizers take different paths toward a minimum.

7. Backpropagation Verification Matrix

When reproducing this article, split the forward and backward pass into auditable stages. The table below turns "is the formula correct?" into visible evidence, so a decreasing loss is not mistaken for proof that every gradient is correct.

Stage	Values to cache or inspect	Common symptom when it fails
Forward cache	`x`, `z1`, `h`, `logits`, and `probs`.	The ReLU mask cannot be reconstructed, or gradient shapes are only "fixed" by broadcasting.
Softmax-CE	`dlogits = probs - one_hot(target)`, averaged over the batch.	The loss moves but gradients are too large and training is overly sensitive to batch size.
Matrix gradients	`dW2 = h.T @ dlogits` and `dW1 = x.T @ dz1`.	The gradient shape does not match the parameter, or a transpose error silently prevents learning.
Numerical stability	Subtract max before softmax and inspect `NaN`, `Inf`, and gradient norms.	Loss becomes `NaN` early, or a layer's gradient norm collapses to zero.
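The rows of the table can be turned into assertions that run after every backward pass while debugging. The helper `check_backward` below is a hypothetical sketch, and the small network around it exists only to make the example self-contained.

```python
import numpy as np

def check_backward(params, grads, probs, dlogits):
    """Hypothetical helper: assert the invariants listed in the table."""
    n = probs.shape[0]
    # Softmax-CE: rows of probs sum to 1, so rows of dlogits sum to 0
    assert np.allclose(probs.sum(axis=1), 1.0)
    assert np.allclose(dlogits.sum(axis=1), 0.0)
    # Batch averaging: |p - one_hot| sums to at most 2 per row before dividing by n
    assert np.all(np.abs(dlogits).sum(axis=1) <= 2.0 / n + 1e-12)
    norms = {}
    for name, param in params.items():
        grad = grads[name]
        # Matrix gradients: each gradient has the shape of its parameter
        assert grad.shape == param.shape, f"shape mismatch for {name}"
        # Numerical stability: no NaN or Inf
        assert np.all(np.isfinite(grad)), f"non-finite gradient for {name}"
        norms[name] = float(np.linalg.norm(grad))
    return norms

rng = np.random.default_rng(0)
n, d, hidden, classes = 6, 3, 4, 5
x = rng.standard_normal((n, d))
params = {
    "W1": rng.standard_normal((d, hidden)) * 0.1, "b1": np.zeros((1, hidden)),
    "W2": rng.standard_normal((hidden, classes)) * 0.1, "b2": np.zeros((1, classes)),
}
targets = rng.integers(0, classes, size=n)

# Forward pass with cached intermediates
z1 = x @ params["W1"] + params["b1"]
h = np.maximum(0, z1)
logits = h @ params["W2"] + params["b2"]
exps = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exps / exps.sum(axis=1, keepdims=True)

# Backward pass
dlogits = probs.copy()
dlogits[np.arange(n), targets] -= 1
dlogits /= n
dz1 = (dlogits @ params["W2"].T) * (z1 > 0)
grads = {
    "W2": h.T @ dlogits, "b2": dlogits.sum(axis=0, keepdims=True),
    "W1": x.T @ dz1, "b1": dz1.sum(axis=0, keepdims=True),
}

norms = check_backward(params, grads, probs, dlogits)
print(norms)
```

These checks catch shape, scaling, and stability mistakes, but they do not prove the formulas are right; a finite-difference comparison is still needed for that.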

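Running the Code

Environment: Python 3 + NumPy + Matplotlib

Install

cd deep-learning-math-lab
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

python src/mlp_backprop.py

Input: fixed weights, inputs, and labels for the two-layer MLP
Expected output: the loss, class probabilities, dW1/dW2 norms, and a CSV file with the backpropagation results.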
