Matrix Calculus for Neural Networks: MSE Gradient Derivation in NumPy

Reading info

Level: Intermediate Reading time: 13 min

Matrix Calculus
NumPy
Gradient Check

Matrix calculus in deep learning is often perceived as an abstract academic exercise, but it is fundamentally a practical tool. The point is not to make notation look difficult; it is a rigorous way to verify tensor shapes, confirm gradient directions, and validate code. Once you can confidently derive and implement the gradient of a simple linear layer y_hat = Wx + b, harder topics such as backpropagation, convolutional layers, and attention mechanisms become more tractable and much easier to debug.

This article dives deep into the anatomy of a single linear layer paired with a mean squared error (MSE) loss. Our goal is to demystify the mathematical formulas, connect them directly to hand calculations, and finally translate them into runnable, deterministic NumPy code that bridges theory and practice.

1. The Foundation: Dimension and Shape Tracking

In matrix calculus, keeping track of dimensions is half the battle. Let's define our variables:

x: A 3 x 1 column vector (input features).
W: A 2 x 3 weight matrix.
b: A 2 x 1 bias vector.
y: A 2 x 1 column vector (target labels).

The forward pass and loss function are defined as:

y_hat = W x + b
e     = y_hat - y
L     = 1/2 * e^T e

The most crucial habit to develop is shape checking at every step. The matrix multiplication W x yields a 2 x 1 vector. Consequently, the error vector e is also 2 x 1. By the standard convention in deep learning, the gradient of a scalar loss L with respect to a matrix W, written dL/dW, has exactly the same shape as W, so each gradient entry lines up with the weight it updates. Thus, dL/dW must be 2 x 3.
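The shape bookkeeping above can be sketched as a few assertions in NumPy. This is a minimal illustration, assuming placeholder zero-valued arrays (the values are not from the article; only the shapes matter), and it uses the product e x^T that the article's shape diagram and later derivation give for dL/dW.

```python
import numpy as np

# Placeholder arrays: values are arbitrary, only the shapes follow the article.
x = np.zeros((3, 1))   # input features, 3 x 1
W = np.zeros((2, 3))   # weight matrix, 2 x 3
b = np.zeros((2, 1))   # bias, 2 x 1
y = np.zeros((2, 1))   # target, 2 x 1

y_hat = W @ x + b      # (2 x 3)(3 x 1) + (2 x 1) -> 2 x 1
e = y_hat - y          # 2 x 1
dW = e @ x.T           # (2 x 1)(1 x 3) -> 2 x 3

assert y_hat.shape == (2, 1)
assert e.shape == (2, 1)
assert dW.shape == W.shape   # the gradient has the same shape as W
```

If any of these assertions fails in a real model, the mismatch is usually a row-versus-column orientation problem rather than a calculus mistake.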

Matrix shape diagram for a linear layer and MSE gradient — The error vector times the input transpose produces a gradient with the same shape as the weight matrix.

Visualizing the Forward and Backward Pass

To better conceptualize the flow of data and gradients, consider the following computational graph:

graph TD
    x[Input x: 3x1] --> Mul[Matrix Mul: W*x]
    W[Weights W: 2x3] --> Mul
    Mul --> Add[Add Bias: + b]
    b[Bias b: 2x1] --> Add
    Add --> y_hat[Prediction y_hat: 2x1]
    y_hat --> Error[Error e = y_hat - y]
    y[Target y: 2x1] --> Error
    Error --> Loss[Loss L = 1/2 * e^T * e]
    
    %% Backward pass
    Loss -.->|dL/de = e| Error
    Error -.->|dL/dW = e * x^T| W
    Error -.->|dL/db = e| b

2. Deriving the Gradient by Hand

Let's calculate the analytical gradient. Starting with the loss function L = 1/2 * e^T e, the derivative with respect to the error vector is straightforward: dL/de = e.

Using the multivariate chain rule on e = Wx + b - y, we can derive the gradients for the parameters. The derivative of Wx with respect to W involves an outer product with the input transpose:

dL/dW = e x^T
dL/db = e
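For readers who want the intermediate step, the chain rule can be written out element by element. The index notation below (i and m for output rows, j and k for input columns) is an assumption of this sketch rather than notation used in the article:

```latex
\begin{aligned}
e_i &= \sum_{k=1}^{3} W_{ik}\, x_k + b_i - y_i,
\qquad L = \tfrac{1}{2} \sum_{i=1}^{2} e_i^2 \\
\frac{\partial L}{\partial W_{ij}}
  &= \sum_{m=1}^{2} e_m \frac{\partial e_m}{\partial W_{ij}} = e_i\, x_j
  \quad\Longrightarrow\quad \frac{\partial L}{\partial W} = e\, x^{T} \\
\frac{\partial L}{\partial b_i}
  &= \sum_{m=1}^{2} e_m \frac{\partial e_m}{\partial b_i} = e_i
  \quad\Longrightarrow\quad \frac{\partial L}{\partial b} = e
\end{aligned}
```

Only the i-th error component depends on W_{ij} and b_i, which is why each sum collapses to a single term.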

Let's plug in some concrete numbers. Suppose the forward pass yields an error vector e = [0.2, 1.25]^T and our input was x = [1.5, -2.0, 0.5]^T. The gradient calculation becomes an outer product:

dL/dW =
[0.2 ] [ 1.5, -2.0, 0.5 ] = [ 0.300, -0.400, 0.100 ]
[1.25]                      [ 1.875, -2.500, 0.625 ]

This simple calculation is the bedrock of backpropagation. Every element W_{ij} is updated based on how much the j-th input feature contributed to the i-th output error.
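The hand calculation above can be reproduced in a few lines of NumPy. This is a minimal sketch that reuses the article's numbers for e and x; the variable name expected_dW is introduced here only for the comparison.

```python
import numpy as np

e = np.array([[0.2], [1.25]])          # error vector from the worked example, 2 x 1
x = np.array([[1.5], [-2.0], [0.5]])   # input from the worked example, 3 x 1

dW = e @ x.T                           # outer product, 2 x 3
db = e                                 # bias gradient equals the error vector

# Values copied from the hand calculation in the article.
expected_dW = np.array([[0.300, -0.400, 0.100],
                        [1.875, -2.500, 0.625]])
assert np.allclose(dW, expected_dW)
print(dW)
```

Row i of dW is the input x scaled by the i-th error, which is the same statement as the sentence above about W_{ij}.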

3. Validating with Code: Numerical vs. Analytical Gradients

To trust our analytical derivation, we should verify it computationally using finite differences. Finite differences perturb one parameter at a time and estimate the loss slope from the resulting change in loss, giving an independent check that does not rely on the derivation itself.

import numpy as np

def forward(W, b, x, y):
    """Return the loss 1/2 * e^T e and the error vector e."""
    y_hat = np.dot(W, x) + b      # (2x3)(3x1) + (2x1) -> (2x1)
    e = y_hat - y
    loss = 0.5 * np.sum(e ** 2)
    return loss, e

def analytical_gradient(e, x):
    # Outer product: (2x1) * (1x3) -> (2x3)
    dW = np.dot(e, x.T)
    # With a single sample the sum runs over one column, so db equals e;
    # keepdims=True preserves the (2x1) column shape.
    db = np.sum(e, axis=1, keepdims=True)
    return dW, db

def numeric_gradient_W(W, b, x, y, eps=1e-5):
    """Central-difference estimate of dL/dW, one weight at a time."""
    grad = np.zeros_like(W)
    for row in range(W.shape[0]):
        for col in range(W.shape[1]):
            original = W[row, col]

            W[row, col] = original + eps
            plus_loss, _ = forward(W, b, x, y)

            W[row, col] = original - eps
            minus_loss, _ = forward(W, b, x, y)

            W[row, col] = original  # restore
            grad[row, col] = (plus_loss - minus_loss) / (2 * eps)
    return grad

# Setup dummy data (seeded so every run is deterministic)
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
x = np.array([[1.5], [-2.0], [0.5]])
y = rng.standard_normal((2, 1))

# Compute
_, e = forward(W, b, x, y)
dW_analytical, db_analytical = analytical_gradient(e, x)
dW_numeric = numeric_gradient_W(W, b, x, y)

print("Analytical dW:\n", np.round(dW_analytical, 5))
print("Numeric dW:\n", np.round(dW_numeric, 5))
print("Max Difference:", np.max(np.abs(dW_analytical - dW_numeric)))
# Output should show Max Difference < 1e-8

When implementing custom CUDA kernels or custom autograd functions in PyTorch, always write a numeric gradient checker. A large disagreement between the analytical and numeric gradients usually points not to the optimizer but to a chain-rule mistake, an incorrect transpose, a broadcasting bug, or a shape mismatch.
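The checker in the article perturbs only W. As a minimal sketch of the same idea applied to the bias, the block below estimates dL/db by central differences and reports a scale-free relative error. The names loss_fn, numeric_gradient_b, and relative_error, the fixed seed, and the sign-flipped "buggy" gradient are all assumptions made for illustration, not part of the article's code.

```python
import numpy as np

def loss_fn(W, b, x, y):
    """Loss 1/2 * e^T e for a single sample."""
    e = W @ x + b - y
    return 0.5 * float(np.sum(e ** 2))

def numeric_gradient_b(W, b, x, y, eps=1e-5):
    """Central-difference estimate of dL/db, one bias entry at a time."""
    grad = np.zeros_like(b)
    for i in range(b.shape[0]):
        b_plus, b_minus = b.copy(), b.copy()
        b_plus[i, 0] += eps
        b_minus[i, 0] -= eps
        grad[i, 0] = (loss_fn(W, b_plus, x, y) - loss_fn(W, b_minus, x, y)) / (2 * eps)
    return grad

def relative_error(a, n):
    """Disagreement between two gradients, normalized by their magnitudes."""
    return np.linalg.norm(a - n) / max(np.linalg.norm(a) + np.linalg.norm(n), 1e-12)

rng = np.random.default_rng(0)   # fixed seed (an assumption) for reproducibility
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
x = np.array([[1.5], [-2.0], [0.5]])
y = rng.standard_normal((2, 1))

db_analytical = W @ x + b - y    # dL/db = e
db_numeric = numeric_gradient_b(W, b, x, y)
db_buggy = -db_analytical        # deliberately wrong sign, to show what a failure looks like

print("relative error (correct):", relative_error(db_analytical, db_numeric))
print("relative error (buggy):  ", relative_error(db_buggy, db_numeric))
```

A relative error is easier to threshold than a raw maximum difference because it does not depend on the scale of the gradient; the correct gradient should give a value many orders of magnitude below the sign-flipped one.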

4. Visualizing the Tensor Operations

The animation expands the outer product e x^T into the entries of dL/dW.

Watch the animation closely. Observe how the error vector strictly controls the output dimension (rows of the gradient), the input transpose controls the input dimension (columns of the gradient), and their outer product systematically populates the weight matrix gradient.
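What the animation shows can also be sketched as an explicit double loop, where the row index comes from the error vector and the column index from the input. This is a minimal illustration that reuses the numbers from the hand calculation; dW_loop is a name introduced here.

```python
import numpy as np

e = np.array([[0.2], [1.25]])          # 2 x 1: one entry per gradient row
x = np.array([[1.5], [-2.0], [0.5]])   # 3 x 1: one entry per gradient column

dW_loop = np.zeros((e.shape[0], x.shape[0]))
for i in range(e.shape[0]):        # output index selects the row
    for j in range(x.shape[0]):    # input index selects the column
        dW_loop[i, j] = e[i, 0] * x[j, 0]

# np.outer flattens its inputs, so it accepts the column vectors directly.
assert np.allclose(dW_loop, np.outer(e, x))
print(dW_loop)
```

The loop is slower than the vectorized outer product, but it makes the row and column roles explicit, which helps when debugging an orientation mistake.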

5. Engineer's Perspective: Real-World Pitfalls

From the Trenches: When moving from this math to massive production models, the challenges shift from formula derivations to hardware realities.

In a real engineering environment, you rarely write raw NumPy gradient updates, but understanding this math is critical for debugging distributed systems and optimizing memory.

Broadcasting Disasters: In NumPy, adding a shape (64,) array to a shape (64, 1) array produces a (64, 64) matrix because of broadcasting rules. If your bias vector b is broadcast incorrectly, your gradient dL/db becomes a matrix instead of a vector; at production sizes this can silently corrupt updates or trigger an Out of Memory (OOM) error on your GPU. Keep shapes explicit, for example with reshape(-1, 1) or with keepdims=True in reductions.
Memory Bandwidth vs. Compute: The outer product e x^T is mathematically simple, but in memory-constrained environments (like edge devices or large language model training), materializing large intermediate gradient and activation tensors is often a bottleneck. Techniques like gradient accumulation and recomputation (activation checkpointing) exist to manage the memory footprint of these operations.
Numerical Instability (NaNs): Notice that numeric_gradient_W uses eps=1e-5, which is a sensible choice for NumPy's default float64. In the float16 or bfloat16 formats commonly used for training on modern GPUs (like A100s or H100s), a perturbation that small is typically lost to rounding. In general, an epsilon that is too small suffers from catastrophic cancellation, while one that is too large gives an inaccurate slope, so run gradient checks in float64. Mixed precision training also requires careful gradient scaling to prevent the elements of dL/dW from underflowing to zero or overflowing to infinity.
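Two of the pitfalls above are easy to reproduce on a laptop. The sketch below is a minimal illustration with made-up arrays (flat, column, w16, and w64 are names introduced here); it uses float16 only, since NumPy has no built-in bfloat16 type.

```python
import numpy as np

# 1) Broadcasting: (64,) + (64, 1) silently becomes (64, 64).
flat = np.ones(64)            # shape (64,)
column = np.ones((64, 1))     # shape (64, 1)
surprise = flat + column
print(surprise.shape)         # (64, 64)

fixed = flat.reshape(-1, 1) + column   # make the orientation explicit
print(fixed.shape)            # (64, 1)

# 2) Precision: a 1e-5 perturbation of a weight near 1.0 vanishes in float16.
eps = 1e-5
w64 = np.float64(1.0)
w16 = np.float16(1.0)
print(w64 + eps == w64)                 # False: float64 resolves the step
print(w16 + np.float16(eps) == w16)     # True: the step is rounded away
print(np.finfo(np.float16).eps)         # spacing just above 1.0 in float16
```

When the perturbation is rounded away, the finite-difference numerator is exactly zero, so the check reports a zero gradient regardless of whether the analytical gradient is right.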

6. Engineering Checklist

Write down the exact shape of every tensor before writing a single line of formula or code.
Keep the 1/2 scaling factor in the squared-error loss while hand-checking gradients; it cancels the factor of 2 that comes from differentiating the square.
Make vector orientations (row vs. column) and bias broadcasting explicit during debugging.
Always run numerical gradient checks on a tiny, deterministic model before initiating training on a larger, stochastic one.

7. Gradient Derivation Audit Table

To keep this article from being only a formula walkthrough, use the table below as a reproduction audit. Each row asks for visible evidence: matching shapes, analytical values, finite-difference agreement, and explicit control of broadcasting. When these checks pass together, the derivation and implementation are genuinely aligned.

Check	Why it fails in practice	How this article verifies it
Tensor shape	Mixing row and column vectors can reverse the outer product.	`x` is `3 x 1`, `e` is `2 x 1`, so `e x^T` must be `2 x 3`.
Analytical gradient	The chain rule may be correct while the matrix multiplication order is wrong.	The numeric example expands the outer product element by element into `dL/dW`.
Numerical gradient	An `eps` that is too large or too small distorts finite differences.	Each `W[row, col]` is perturbed and compared against the analytical gradient.
Engineering boundary	Broadcasting, mixed precision, and memory bandwidth amplify small mistakes.	`keepdims=True`, gradient checking, and tiny deterministic models are treated as preflight checks.
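The "Numerical gradient" row can be made concrete with a small sweep over eps. One caveat: for the article's quadratic loss, central differences have no truncation error, so a large eps does no visible harm there. To show both failure modes, this sketch assumes a hypothetical scalar function f(w) = exp(w), which is not part of the article's model.

```python
import numpy as np

def central_difference(f, w, eps):
    """Two-sided finite-difference estimate of f'(w)."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 1.0
true_slope = np.exp(w)   # d/dw exp(w) = exp(w)

errors = {}
for eps in (1e-1, 1e-3, 1e-5, 1e-9, 1e-13):
    estimate = central_difference(np.exp, w, eps)
    errors[eps] = abs(estimate - true_slope)
    print(f"eps={eps:.0e}  abs error={errors[eps]:.2e}")
```

In float64 the error is largest at both ends of the sweep: truncation error dominates for large eps and rounding error dominates for very small eps, which is why a mid-range value such as 1e-5 is a common default.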

The next article will elevate this foundation, turning the linear layer into a node within a larger computation graph, and will rigorously derive backpropagation for a two-layer Multi-Layer Perceptron (MLP).

Run notes

Environment: Python 3 + NumPy + Matplotlib

Install

cd deep-learning-math-lab
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

python src/gradient_check.py

Input: Fixed toy vectors and matrices from the article
Expected output: Writes MSE loss, analytic gradients, finite-difference gradients, and gradient-difference CSV output.
