Handwritten Digit Project Basics: Understanding train.csv, test.csv, and Labels
This handwritten digit project is a good bridge between theory-heavy machine learning notes and a real classification workflow. The input is simple enough to inspect row by row, but the project still forces you to deal with data loading, normalization, model training, and prediction output in a coherent way.
The best place to start is not the training loop. It is the dataset structure. The C classifier, the browser playground, and the final submission file all depend on the same flat 28 by 28 pixel format, so understanding the CSV layout makes the rest of the project much easier to follow.
1. What files are in the project
- train.csv: the training set with 42000 labeled samples
- test.csv: the test set with 28000 unlabeled samples
- sample_submission.csv: the expected output format
- submission.csv: the prediction file generated by the current implementation
- digit_softmax_classifier.c: the C implementation used on the site
This layout is common in beginner-friendly supervised learning challenges because it keeps the separation of responsibilities clear: one file for learning parameters, one file for final predictions.
2. What one row in train.csv means
The first column is the label, which is the true digit for that image. The remaining 784 columns are grayscale pixel intensities between 0 and 255:
label,pixel0,pixel1,pixel2,...,pixel783
5,0,0,0,0,...,0
0,0,0,12,178,...,0
4,0,0,0,0,...,0
The important detail is that the original image has already been flattened into a feature vector. The program does not read image files. It reads numeric rows.
Because 28 x 28 = 784, every sample is effectively:
row 1 pixels + row 2 pixels + ... + row 28 pixels
= one 784-dimensional feature vector
That is why a plain linear classifier can still work on this task. To the model, the image is just a structured numeric input vector.
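If you want to check that mapping by hand, the arithmetic is plain row-major indexing. The pixel index 178 below is just an illustrative choice, not anything special in the dataset:
int k = 178;          /* hypothetical flat index, i.e. the pixel178 column */
int row = k / 28;     /* 178 / 28 = 6  */
int col = k % 28;     /* 178 % 28 = 10 */
/* so pixel178 lands at row 6, column 10 of the 28 by 28 grid */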
3. How test.csv differs from the training set
test.csv contains only pixels and no labels. That means the program cannot keep training on it. It must use the parameters learned from train.csv and produce predictions directly.
- Training: input features plus the correct answer
- Inference: input features only, no answer attached
This distinction matters because it forces the implementation to separate training logic from prediction logic. The exported submission.csv is simply the predicted label for each test sample written back into the required output format.
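As a sketch of that inference side, the export loop can stay very small. The predict helper, X_test, and test_count names are assumptions here, and the ImageId,Label header should be verified against sample_submission.csv:
FILE *out = fopen("submission.csv", "w");
fprintf(out, "ImageId,Label\n");    /* assumed header; confirm in sample_submission.csv */
for (int i = 0; i < test_count; i++) {
    fprintf(out, "%d,%d\n", i + 1, predict(X_test[i]));   /* ImageId counts from 1 */
}
fclose(out);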
4. How the C program loads the data
The loader is intentionally straightforward. It splits each CSV row by commas, stores the first field as the label, and turns the remaining 784 fields into numeric features.
y_train[sample_count] = atoi(tokens[0]);    /* first field: the digit label */
for (int j = 0; j < FEATURES; j++) {
    X_train[sample_count][j] = atof(tokens[j + 1]) / 255.0;    /* scale 0..255 to 0..1 */
}
Two implementation details matter here:
- The label is stored separately so the training loop can compute loss and accuracy
- The pixels are divided by 255 so the values stay in the 0 to 1 range
If you skip the normalization step and train directly on raw 0 to 255 pixel values, gradient-based optimization becomes less stable. For flat image tables like this one, simple scaling is the right default.
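One piece the snippet above takes for granted is the tokens array. A minimal sketch of that splitting step, assuming strtok-style parsing (with stdio.h and string.h included), an already-open FILE *fp, and rows with no quoted fields, could look like this:
char line[16384];                     /* large enough for 785 comma-separated values */
char *tokens[785];
fgets(line, sizeof line, fp);         /* consume the header row first */
while (fgets(line, sizeof line, fp)) {
    int n = 0;
    for (char *tok = strtok(line, ",\n"); tok != NULL && n < 785; tok = strtok(NULL, ",\n"))
        tokens[n++] = tok;
    if (n != 785) continue;           /* guard against short or malformed rows */
    /* store the label and 784 pixels as shown above */
}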
5. Why this format is good for learning
This project is useful because it removes a lot of incidental complexity:
- Simple input structure: no image decoding pipeline required
- Clear labels: ten classes, one digit per sample
- Direct debugging path: any row can be reshaped back into a 28 by 28 grid (see the sketch below)
That makes it a strong practice task for the full machine learning workflow: load data, normalize features, train parameters, run predictions, and export a CSV result.
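That reshape trick is worth a concrete sketch. Assuming pixels already scaled to 0 to 1 as in the loader, a helper like this (hypothetical, not part of the project source) prints any sample as rough ASCII art:
/* Render one flattened sample as a 28 by 28 character grid. */
void print_digit(const double pixels[784]) {
    for (int r = 0; r < 28; r++) {
        for (int c = 0; c < 28; c++)
            putchar(pixels[r * 28 + c] > 0.5 ? '#' : '.');   /* bright pixels become '#' */
        putchar('\n');
    }
}
Calling print_digit(X_train[0]) should show a recognizable digit; garbage output usually means the field offset or the reshape order is wrong.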
6. What to validate before training
If you implement your own version, check these first:
- Whether the header row is skipped correctly
- Whether the training and test counts are close to 42000 and 28000
- Whether each row contains exactly 785 fields in train.csv (the label plus 784 pixels) or 784 fields in test.csv
- Whether pixel values have been scaled to 0 to 1
- Whether labels still stay in the 0 to 9 range
These checks matter more than early model tweaks: many broken training runs trace back to bad CSV parsing, off-by-one field offsets, or missing normalization.
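Some of those checks are easy to automate right after loading. This sketch reuses the names from the loader snippet (sample_count, y_train, X_train, FEATURES); the 42000 expectation comes from the dataset description:
if (sample_count != 42000)
    printf("warning: expected 42000 training rows, got %d\n", sample_count);
for (int i = 0; i < sample_count; i++) {
    if (y_train[i] < 0 || y_train[i] > 9)
        printf("bad label %d at row %d\n", y_train[i], i);
    for (int j = 0; j < FEATURES; j++)
        if (X_train[i][j] < 0.0 || X_train[i][j] > 1.0)
            printf("unscaled value %.2f at row %d, column %d\n", X_train[i][j], i, j);
}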
7. What to read next
Once the dataset format makes sense, continue with the C softmax classifier article. That article walks through the weight matrix, softmax probabilities, gradient updates, and how the project produces submission.csv.
The downloadable files now live on the downloads page, and the lightweight interactive version is available in the handwritten digit tab inside the playground.