Python AI Mini Practice: A Classification Task with scikit-learn
The previous articles covered AI concepts, the machine learning workflow, model training and evaluation, and neural network basics. This article runs a small end-to-end practice project: a binary classification task with Python and scikit-learn.
The example uses the breast cancer dataset built into scikit-learn, so no external data file is required. The goal is not to chase the highest score. The goal is to walk through loading data, splitting data, standardizing features, training, predicting, and evaluating.
Note: this dataset is used here only for machine learning practice. It should not be used for medical decisions or real diagnosis. The article focuses on the classification workflow, not medical conclusions.
1. Prepare the Environment
Create a virtual environment and install the dependency:
python3 -m venv .venv
source .venv/bin/activate
pip install scikit-learn
This example uses only scikit-learn, not a deep learning framework. That keeps the focus on the basic machine learning workflow.
2. Complete Code
The following script can be run directly:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def main():
    dataset = load_breast_cancer()
    X = dataset.data
    y = dataset.target

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,
        random_state=42,
        stratify=y,
    )

    model = Pipeline(
        steps=[
            ("scaler", StandardScaler()),
            ("classifier", LogisticRegression(max_iter=500)),
        ]
    )

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("Classification report:")
    print(classification_report(y_test, y_pred, target_names=dataset.target_names))


if __name__ == "__main__":
    main()
Save it as ai_classification_demo.py and run:
python ai_classification_demo.py
If imports feel slow on the first run, confirm that the virtual environment is active, then run python -c "import sklearn; print(sklearn.__version__)" to verify that scikit-learn is installed.
3. What the Dataset Contains
load_breast_cancer() returns a binary classification dataset. Each sample contains numeric features, and the label indicates which class the sample belongs to.
In the script:
- X is the feature matrix, with one row per sample
- y is the label array, with one label per sample
- dataset.target_names contains the class names
The dataset is already prepared as numeric features, which makes it useful for practicing classification basics.
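A quick way to confirm what the loader returns is to print the shapes and class names directly; everything below comes straight from the load_breast_cancer() return value:

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# 569 samples, 30 numeric features per sample, 2 classes
print("X shape:", dataset.data.shape)          # (569, 30)
print("y shape:", dataset.target.shape)        # (569,)
print("classes:", list(dataset.target_names))  # ['malignant', 'benign']
```

Printing shapes like this before training is a cheap habit that catches loading mistakes early.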
4. Why Split Training and Test Data?
The script uses train_test_split():
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)
test_size=0.2 means 20% of the data is reserved for testing. stratify=y tries to preserve the class ratio after the split, which is useful for classification.
If you evaluate only on training data, the model may have memorized training examples instead of learning a pattern that generalizes.
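To see concretely what the split produces, you can print the resulting sizes and check that stratify=y kept the class proportions close; this sketch reuses the same parameters as the script:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 20% of 569 samples is rounded up to 114 test samples
print(len(X_train), len(X_test))  # 455 114

# With stratify=y, the proportion of class 1 is nearly
# identical in the training and test splits
print(round(float(y_train.mean()), 3), round(float(y_test.mean()), 3))
```

If you drop stratify=y and rerun with different random_state values, the two proportions drift apart more, which is exactly the instability stratification avoids.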
5. Why Use Pipeline?
The code uses Pipeline instead of manually standardizing first and training later:
model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=500)),
    ]
)
This has two benefits:
- Standardization and classification stay in one reproducible workflow
- The test set uses scaling parameters learned only from the training set, which avoids data leakage
Data leakage is a common beginner mistake. If you standardize the full dataset before splitting, information from the test set has already influenced training.
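The safe ordering can be made explicit without a Pipeline. A minimal sketch: fit the scaler on the training split only, then reuse its learned mean and standard deviation on the test split — which is exactly what the Pipeline does for you behind the scenes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Leaky order (do NOT do this): fitting on all of X lets
# test-set statistics influence the transformation
# scaler = StandardScaler().fit(X)

# Safe order: learn mean/std from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# The test set is transformed with the SAME training statistics
X_test_scaled = scaler.transform(X_test)
```

After this, the training columns have mean 0 and standard deviation 1, while the test columns are merely close to that — the small mismatch is the honest price of not peeking.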
6. Why Logistic Regression?
Logistic regression is a classic baseline for classification. It is fast, stable, and easier to explain than many more complex models.
This example does not start with a neural network because running the full workflow is more important at this stage. Once every line in this script is clear, replacing the classifier with a random forest, support vector machine, or neural network becomes more meaningful.
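Because the classifier sits behind a named step in the Pipeline, swapping it is a small, local change. A sketch using RandomForestClassifier as the replacement (n_estimators=200 is an arbitrary illustrative choice, not a tuned value):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same pipeline shape; only the "classifier" step changes.
# Tree ensembles do not need feature scaling, but keeping the
# scaler means both models are compared under identical preprocessing.
model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", RandomForestClassifier(n_estimators=200, random_state=42)),
    ]
)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print("Test accuracy:", acc)
```

The rest of the evaluation code — accuracy, confusion matrix, classification report — runs unchanged, which is the practical payoff of keeping the workflow in one object.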
7. How to Read the Evaluation
The script prints three kinds of results:
- Accuracy: the overall proportion of correct predictions
- confusion_matrix: which classes were predicted incorrectly
- classification_report: precision, recall, F1-score, and related metrics
Even if accuracy is high, do not stop there. Check the confusion matrix to see which class causes mistakes, then compare precision and recall to the requirements of the problem.
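For a binary task, the 2x2 confusion matrix can be unpacked into four named counts with ravel(), which makes precision and recall concrete rather than abstract formulas:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=500)),
    ]
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes;
# ravel() flattens the 2x2 matrix in row order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
print(tn, fp, fn, tp, round(precision, 3), round(recall, 3))
```

With the counts in hand, asking "are false negatives or false positives worse for this problem?" becomes a question about two specific numbers instead of a vague intuition.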
8. What to Try Next
After the script runs, try a few small experiments:
- Change test_size to 0.3 and see whether results stay stable
- Remove StandardScaler and compare the metrics
- Replace LogisticRegression with RandomForestClassifier
- Print dataset.feature_names and read what each feature means
- Find the indexes of wrong predictions and inspect those samples
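The last experiment on the list can be sketched with numpy: compare predictions to labels, collect the disagreeing indexes, and look at the model's own confidence on those samples. Note that these indexes refer to positions within the test split, not the original dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=500)),
    ]
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Test-set positions where the prediction disagrees with the label
wrong = np.where(y_pred != y_test)[0]
print("misclassified test samples:", wrong)

# Inspect the predicted class probabilities for each mistake
proba = model.predict_proba(X_test)[wrong]
for i, p in zip(wrong, proba):
    print(f"index {i}: true={y_test[i]}, pred={y_pred[i]}, proba={p.round(3)}")
```

Mistakes made with near-0.5 probabilities suggest genuinely ambiguous samples; confident mistakes are the interesting ones worth inspecting feature by feature.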
The key to learning AI foundations is to make each example explainable. In this practice project, you did not just run a classifier. You walked through a complete machine learning workflow.
9. Add This Practice to Your Notes
After running the code, record these details:
- How many samples, features, and classes the dataset contains
- How many samples are in the training and test sets
- The accuracy, precision, recall, and F1-score
- Which type of mistake appears more often in the confusion matrix
- What changes when you remove standardization or switch models
These notes are more useful than saving only one accuracy value because they help you explain the experiment, not just preserve the result.
10. Series Review
This article turns the previous concepts into code. To revisit the foundations, start again from the AI Basics Learning Roadmap, or return to the Blog page for the full series.