Model Training and Evaluation: Loss, Overfitting, and Accuracy
When people train their first model, they often focus only on accuracy. To understand whether a model is reliable, you need to know what training adjusts, what a loss function measures, why overfitting happens, and why test data must stay separate.
This article explains the basics of model training and evaluation: parameters, loss functions, epochs, overfitting, validation data, test data, and common classification metrics.
If the previous article explained how to organize a machine learning project, this one explains how to decide whether the training process is trustworthy.
1. What Does Training Adjust?
A model can be viewed as a function with parameters:
prediction = model(input_features, parameters)
Before training, parameters may be random or initialized with default values. Training adjusts those parameters so model output becomes closer to the true labels.
A very simple linear model looks like this:
y = w1 * x1 + w2 * x2 + b
Here, w1, w2, and b are parameters. Training tries to find better values for them.
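As a tiny sketch in Python (the parameter values are made up for illustration), the same model is just a function whose output changes as its parameters change:

# The linear model above, written as a plain Python function.
def predict(x1, x2, w1, w2, b):
    return w1 * x1 + w2 * x2 + b

# Before training: arbitrary starting parameters.
print(predict(1.0, 2.0, w1=0.0, w2=0.0, b=0.0))   # 0.0

# After training, the parameters have moved to values that fit the data better.
print(predict(1.0, 2.0, w1=0.5, w2=-0.3, b=1.0))  # 0.9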
2. What Is a Loss Function?
The model needs a way to measure how wrong a prediction is. That measurement is the loss function.
For regression, a simple loss can be squared error:
loss = (y_true - y_pred) ** 2
For classification, cross-entropy loss is common. You do not need to derive the formula at the beginning, but the intuition matters:
A confidently wrong prediction receives a large loss. A prediction close to the correct answer receives a smaller loss.
During training, the algorithm tries to reduce the overall loss.
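As a rough sketch (not a library implementation), squared error and a binary cross-entropy can be computed by hand; here y_true is 0 or 1 and p_pred is the predicted probability of the positive class:

import math

# Squared error for a regression prediction.
def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2

# Binary cross-entropy: confident wrong answers receive a much larger loss.
def binary_cross_entropy(y_true, p_pred):
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

print(binary_cross_entropy(1, 0.9))   # about 0.11: close to correct, small loss
print(binary_cross_entropy(1, 0.1))   # about 2.30: confidently wrong, large loss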
3. The Intuition Behind Gradient Descent
Many models use gradient descent or a variant of it to update parameters. Think of it as walking downhill:
- The current parameters produce a loss value
- The algorithm estimates which direction reduces loss
- The parameters move a small step in that direction
- The process repeats many times
An important hyperparameter is the learning rate. If it is too small, training is slow. If it is too large, training can bounce around or fail to converge.
new_weight = old_weight - learning_rate * gradient
This is not the full mathematical story, but it explains why training loops repeat parameter updates.
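To see the update rule in action, here is a minimal gradient descent loop that fits a single weight on made-up data; the data and the learning rate are illustrative only:

# Fit y = w * x with plain gradient descent on tiny made-up data.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]           # the underlying relationship is y = 2 * x

w = 0.0                        # arbitrary starting value
learning_rate = 0.05

for step in range(100):
    # Gradient of the mean squared error with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w = w - learning_rate * gradient

print(w)                       # ends up close to 2.0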
4. Epoch, Batch, and Iteration
Deep learning training often uses these terms:
- Epoch: one full pass through the training set
- Batch: a small group of samples used for one update step
- Iteration: one parameter update
If the training set has 1000 samples and the batch size is 100, one epoch contains 10 iterations.
Traditional machine learning libraries may not expose these terms directly, but the basic idea is similar: the model uses training data to adjust parameters.
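The bookkeeping for the example above can be written out directly; the epoch count here is arbitrary:

num_samples = 1000
batch_size = 100
num_epochs = 5

iterations_per_epoch = num_samples // batch_size    # 10

for epoch in range(num_epochs):
    for start in range(0, num_samples, batch_size):
        batch_indices = range(start, start + batch_size)
        # ...compute the loss on this batch and apply one parameter update (one iteration)

print(iterations_per_epoch * num_epochs)            # 50 iterations in total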
5. Why Overfitting Happens
Overfitting means the model performs well on training data but much worse on new data.
Common causes include:
- The model is complex enough to memorize noise and details in the training set
- The training data is too small to represent the real problem
- The features contain information that should not be available, also called data leakage
- The model trains for too long without validation monitoring
The danger is that training metrics can look excellent while real-world performance is poor.
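A quick way to see the gap for yourself is to compare training accuracy with test accuracy. A sketch with scikit-learn, assuming it is installed; the dataset and the unconstrained tree are only illustrative:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained decision tree can memorize the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(accuracy_score(y_train, model.predict(X_train)))  # typically 1.0
print(accuracy_score(y_test, model.predict(X_test)))    # noticeably lower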
6. Training, Validation, and Test Data
For reliable evaluation, data is often split into three parts:
- Training set: used to fit parameters
- Validation set: used to tune settings, select models, and watch for overfitting
- Test set: used at the end to estimate final generalization
For small practice projects, a training/test split can be enough. But remember: the test set should not be used repeatedly for tuning, or it becomes part of the decision process.
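When there is enough data for three parts, one simple approach is to call train_test_split twice; the dataset, ratios, and random seed below are just an example:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# First split off the test set, then split the rest into training and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)
# Roughly 60% training, 20% validation, 20% test.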
7. Common Classification Metrics
Classification should not be judged by accuracy alone. These metrics often appear together:
- Accuracy: the proportion of correct predictions
- Precision: among predicted positives, how many are truly positive
- Recall: among actual positive samples, how many were found
- F1-score: the harmonic mean of precision and recall
For medical screening, missing a real positive case may be costly, so recall may matter more. For automatic account blocking, falsely blocking normal users may be costly, so precision may matter more.
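All four metrics can be computed with scikit-learn on a small made-up set of predictions (1 means positive, 0 means negative):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

print(accuracy_score(y_true, y_pred))   # 0.7   (7 of 10 correct)
print(precision_score(y_true, y_pred))  # 2/3   (3 predicted positive, 2 truly positive)
print(recall_score(y_true, y_pred))     # 0.5   (4 actual positives, 2 found)
print(f1_score(y_true, y_pred))         # about 0.57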
8. Reading a Confusion Matrix
A confusion matrix compares predicted labels with true labels:
                 predicted negative    predicted positive
true negative    TN                    FP
true positive    FN                    TP
- TP: a positive sample predicted correctly
- TN: a negative sample predicted correctly
- FP: a negative sample incorrectly predicted as positive
- FN: a positive sample incorrectly predicted as negative
The advantage of a confusion matrix is that it shows not only how many mistakes happened, but also which direction those mistakes went.
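The same made-up predictions from the previous section produce the matrix below; with scikit-learn's default label order, the layout matches the table above, [[TN, FP], [FN, TP]]:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[5 1]
#  [2 2]]   ->  TN=5, FP=1, FN=2, TP=2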
9. Evaluation Checklist
When evaluating a model, check these questions:
- Was the test set isolated from training?
- Are the classes heavily imbalanced?
- Did you look beyond accuracy?
- Was the model compared with a simple baseline?
- Did you inspect some wrong predictions manually?
- Is the gap between training performance and test performance too large?
The point of training is not merely to push one metric upward. The point is to build a trustworthy evaluation process and understand when the model is likely to fail.
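The baseline question from the checklist is easy to automate. Here is a sketch using scikit-learn's DummyClassifier, which simply predicts the most frequent class; the dataset and model choices are illustrative:

from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(baseline.score(X_test, y_test))   # around 0.1 for ten roughly balanced classes
print(model.score(X_test, y_test))      # should be clearly higher than the baseline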
10. What a Trustworthy Training Record Includes
A useful training record should include at least these details:
- How training, validation, and test data were split
- The model, important parameters, and random seed
- Training metrics and test metrics, not just one final score
- Error analysis, especially for the most costly error types
- Comparison against a simple baseline model
These notes may look small, but they make the experiment auditable when you return to it later.
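No special tooling is needed to keep such a record; even a plain dictionary written to a JSON file covers the list above. The field names and numbers below are placeholders, not real results:

import json

record = {
    "data_split": {"train": 0.7, "validation": 0.15, "test": 0.15, "random_seed": 42},
    "model": {"type": "LogisticRegression", "max_iter": 1000},
    "metrics": {
        "train": {"accuracy": 0.99, "f1": 0.99},
        "test": {"accuracy": 0.96, "f1": 0.95},
    },
    "baseline_test_accuracy": 0.10,
    "error_notes": "most costly errors: real positives predicted as negatives",
}

with open("training_record.json", "w") as f:
    json.dump(record, f, indent=2)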
11. What to Read Next
The previous article is Machine Learning Workflow. After training and evaluation are clear, continue with Neural Network Basics to connect parameters with multi-layer function composition.