Machine Learning Workflow: From Data and Features to Predictions
Machine learning is not just feeding data to an algorithm. A reproducible machine learning project usually follows a stable workflow: define the problem, inspect the data, build features, train a model, evaluate the result, and then use the model for prediction.
This article does not try to cover every algorithm. Instead, it explains the workflow from an engineering perspective. Once this structure is clear, linear regression, logistic regression, decision trees, and neural networks become much easier to place.
While reading, focus on three questions: what data enters the system, what transformations happen in the middle, and which metrics tell you whether the output is reliable.
1. Define the Problem
Before writing model code, answer this question:
Given which inputs, what output should the model predict?
Common problem types include:
- Classification: predict a category, such as spam or not spam
- Regression: predict a continuous value, such as price, demand, or temperature
- Clustering: group data without labels, such as user segmentation
- Ranking: order candidate results, such as search or recommendation output
If the problem is vague, you may train a model but still have no reliable way to judge whether it is useful.
2. Understand Each Column
For beginners, the most common data shape is a table:
sample  feature1  feature2  feature3  label
1       ...       ...       ...       A
2       ...       ...       ...       B
3       ...       ...       ...       A
The key concepts are:
- Sample: usually one row of data
- Feature: an input field used for prediction
- Label: the known answer in supervised learning
Before writing code, understand what each column means, what unit it uses, what range it should have, and whether obvious bad values exist. Many machine learning failures come from misunderstood data rather than weak algorithms.
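That column-by-column check can start with a quick pandas inspection. A minimal sketch, with an invented table standing in for a real file you would normally load with read_csv:

```python
import pandas as pd

# Tiny illustrative table; in practice you would load your own data,
# e.g. df = pd.read_csv("data.csv")
df = pd.DataFrame({
    "age": [25, 31, 47, 52],
    "income": [30000, 42000, 58000, None],
    "label": ["A", "B", "A", "B"],
})

print(df.dtypes)        # column types: are numbers really numeric?
print(df.describe())    # ranges for numeric columns: any impossible values?
print(df.isna().sum())  # missing-value counts per column
```

Three lines of output already answer most of the questions above: types, plausible ranges, and where values are missing.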
3. Split Training and Test Data
A model should not be judged only on data it used for training. To check whether it learned a general pattern, split the data:
- Training set: used to fit model parameters
- Test set: used to estimate behavior on new data
A common scikit-learn pattern is:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
random_state fixes the split, which makes experiments easier to reproduce.
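For classification with imbalanced classes, passing stratify=y keeps the class proportions similar in both splits. A small sketch with synthetic labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10  # imbalanced: 90% class 0, 10% class 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits keep roughly the 90/10 class ratio
print(Counter(y_train), Counter(y_test))
```

Without stratify, a small test set can end up with almost no minority-class samples, which makes the evaluation in step 7 unreliable.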
4. Process Features
Models usually work with numbers, so raw data often needs conversion. Common feature processing steps include:
- Encoding text categories as numbers
- Handling missing values
- Standardizing numeric features
- Removing fields that are meaningless or leak the answer
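The first two steps, encoding text categories and handling missing values, can be sketched with pandas. The column names here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "SF", "NY", None],
    "size": [3.0, None, 2.0, 5.0],
})

# Fill missing values: mode for categories, median for numbers
df["city"] = df["city"].fillna(df["city"].mode()[0])
df["size"] = df["size"].fillna(df["size"].median())

# Encode the text category as integer codes
df["city_code"] = df["city"].astype("category").cat.codes
print(df)
```

This is one simple strategy among many; scikit-learn's SimpleImputer and OneHotEncoder do the same jobs in a form that plugs into pipelines.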
Standardization is common for methods that are sensitive to numeric scale, such as logistic regression, K-means, and neural networks:
x_scaled = (x - mean) / std
It preserves the relative ordering of values within each feature, but it puts different numeric features on more comparable scales.
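In scikit-learn this formula is implemented by StandardScaler. The key discipline is to fit it on the training data only and reuse those statistics on the test data, so no test information leaks in. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training mean and std

print(X_train_scaled.mean(axis=0))  # ~0 for each feature
print(X_train_scaled.std(axis=0))   # ~1 for each feature
```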
5. Choose a Baseline Model
Do not start with the most complex model. First build a baseline:
- For classification, try logistic regression or a decision tree
- For regression, try linear regression
- For clustering, try K-means
The baseline does not have to be the best model. It gives you a reference point. Later model changes, feature changes, and parameter changes should be compared against it.
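An even cheaper floor is a majority-class predictor, which scikit-learn provides as DummyClassifier. Any real model should beat it; a small sketch with synthetic labels:

```python
from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # 70% class 0

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
print(dummy.score(X, y))  # 0.7: accuracy of always predicting class 0
```

If your tuned model barely beats this number, the features or the problem definition need attention before the algorithm does.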
6. Train the Model
In scikit-learn, training is usually expressed with a consistent method call:
model.fit(X_train, y_train)
Behind this call, the model adjusts internal parameters so predictions become closer to labels in the training data.
Different algorithms have different parameter meanings, but the goal is the same: find parameters that reduce mistakes on training data without merely memorizing it.
7. Predict and Evaluate
After training, predict on the test set:
y_pred = model.predict(X_test)
Then measure performance. Common classification metrics include:
- Accuracy: the overall proportion of correct predictions
- Precision: among predicted positives, how many are truly positive
- Recall: among true positives, how many the model found
- F1-score: the harmonic mean of precision and recall
Do not rely on one number. Accuracy can be misleading when classes are imbalanced.
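A tiny sketch of why: with imbalanced synthetic labels, a model that always predicts the negative class still scores high accuracy while finding no positives at all:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)

# 95 negatives, 5 positives; the "model" predicts negative every time
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                    # 0.95, looks great
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0, found nothing
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```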
8. The Whole Workflow
Combined, a minimal workflow looks like this:
# 1. Prepare X and y
# 2. Split training and test data
# 3. Process features
# 4. Train a model
# 5. Predict on the test set
# 6. Compute evaluation metrics
Real projects may add logging, cross-validation, model persistence, deployment, and monitoring. But even complex systems still depend on this core sequence.
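The six comment steps above can be sketched end to end. This uses the built-in iris dataset as a stand-in for real data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Prepare X and y
X, y = load_iris(return_X_y=True)

# 2. Split training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Process features (fit the scaler on training data only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Train a model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 5. Predict on the test set
y_pred = model.predict(X_test)

# 6. Compute evaluation metrics
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
```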
9. A Good Practice Checklist
When practicing machine learning, write down answers to these questions:
- What are the input features and target label?
- How were training and test data split?
- Which feature processing steps were used?
- What baseline model was chosen?
- Which metric was used, and why?
- What do the model's mistakes have in common?
If you can answer these questions, you are no longer just copying code. You are starting to analyze problems in the machine learning workflow.
10. Common Mistakes
When building a first machine learning project, beginners often run into these problems:
- Processing the full dataset before splitting train and test data, which leaks test information into training
- Skipping a baseline model and jumping directly to complex algorithms
- Printing only accuracy without checking class balance or wrong predictions
- Trusting column names without confirming what each field actually means
If you actively avoid these issues, even a small project becomes much easier to trust.
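One way to avoid the first mistake, leakage through preprocessing, is to wrap preprocessing and the model in a scikit-learn Pipeline, so scaling statistics are computed from training data only whenever fit is called:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit only on the data passed to pipe.fit(),
# so test-set statistics never leak into preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(f"test accuracy: {pipe.score(X_test, y_test):.3f}")
```

The same pipeline object also works with cross_val_score, which repeats the fit-then-transform discipline inside every fold automatically.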
11. What to Read Next
The previous article is the AI Basics Learning Roadmap. After the full workflow is clear, continue with Model Training and Evaluation to understand loss functions, overfitting, and metrics.
