NLP Basics: Understanding Bag of Words and TF-IDF
After learning about image-based deep learning, the next most common field to explore is Natural Language Processing (NLP). Unlike images, which are fixed grids of pixels, text is a sequence of characters of varying length. Machines cannot read words directly. This brings us to the first core challenge in NLP: how do we convert words into numbers that a computer can process?
This article serves as an introduction to NLP. We will look at two of the most traditional and classic text representation methods: the Bag of Words (BoW) model and TF-IDF. Although they are simple, they remain highly effective for many basic classification tasks.
1. The Most Intuitive Approach: Bag of Words
Imagine a bag filled with words. When you receive a sentence, you only care about which words appear in it and how many times they appear, completely ignoring word order and grammatical structure.
The specific steps are very straightforward:
- Build a Vocabulary: Collect all the unique words that appear across all texts and put them in a fixed order. For example: ["AI", "is", "fun", "learning", "hard"].
- Count Frequencies: For any new sentence, count how many times each word in the vocabulary appears.
For example, if we remove words outside the vocabulary, the frequency vector for the sentence "learning AI is fun and learning is hard" might look like this:
# Vocabulary: ["AI", "is", "fun", "learning", "hard"]
# Vector: [1, 2, 1, 2, 1]
In this way, a piece of text of arbitrary length is converted into a fixed-length numerical vector. It can then be fed into logistic regression or a neural network for classification.
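Below is a minimal Python sketch of this counting step, using plain collections.Counter rather than any particular library; the vocabulary and sentence are the ones from the example above.

```python
from collections import Counter

# Fixed vocabulary and example sentence from the text above.
vocabulary = ["AI", "is", "fun", "learning", "hard"]
sentence = "learning AI is fun and learning is hard"

# Count raw occurrences; "and" is outside the vocabulary and is simply ignored.
counts = Counter(sentence.split())

# Read the counts in vocabulary order to get a fixed-length vector.
vector = [counts[word] for word in vocabulary]
print(vector)  # [1, 2, 1, 2, 1]
```

In practice a library such as scikit-learn's CountVectorizer builds the vocabulary and the count matrix in one step, but the underlying logic is the same counting shown here.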
2. Limitations of Bag of Words
Bag of Words is simple and intuitive, but it has several obvious flaws:
- Extremely Sparse Vectors: In real applications, a vocabulary might contain tens of thousands to hundreds of thousands of words, while a single sentence typically contains only a few dozen. The vast majority of positions in the generated vector will be 0, causing serious memory and computation waste.
- Ignores Semantic Relationships: "Good" and "excellent" mean roughly the same thing, but in the BoW model, they are two completely orthogonal dimensions with no connection whatsoever.
- Complete Loss of Order: "Dog bites man" and "Man bites dog" have exactly the same BoW representation, but their meanings in reality are entirely different.
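The order limitation is easy to verify: a tiny check (using the two sentences above) shows that both get exactly the same counts.

```python
from collections import Counter

# Bag of Words keeps only word counts, so these two sentences
# become indistinguishable once word order is discarded.
a = Counter("dog bites man".split())
b = Counter("man bites dog".split())
print(a == b)  # True
```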
3. Improving Word Frequency: TF-IDF
In the BoW model, common words like "the," "is," and "a" will appear frequently in almost every document. If we only look at word frequency, the algorithm might mistakenly think these words are the most important. TF-IDF was designed to solve this problem.
TF-IDF stands for Term Frequency - Inverse Document Frequency. It considers not only how frequently a word appears in the current document (TF) but also how rare it is across all documents (IDF).
An intuitive understanding of the formula:
- TF (Term Frequency): The number of times the word appears in this specific document. The more often it appears, the more strongly it represents the document's topic.
- IDF (Inverse Document Frequency): log(total documents / documents containing the word). If a word is present in almost all documents, its IDF approaches 0, reducing its weight.
The final weight is simply the product of the two, TF × IDF. As a result, TF-IDF aggressively downweights ubiquitous words like "the" while amplifying rare but informative terms like "machine" or "quantum", which makes it well suited to keyword extraction and simple text classification.
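To make the weighting concrete, here is a small sketch that applies the plain formula TF × log(total documents / documents containing the word) to a toy corpus. The three example documents are invented for illustration; real libraries (for example scikit-learn's TfidfVectorizer) use smoothed and normalized variants of this formula.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration).
documents = [
    "the machine is learning",
    "the movie is fun",
    "the quantum machine learning is hard",
]
tokenized = [doc.split() for doc in documents]
N = len(tokenized)

# Document frequency: how many documents contain each word at least once.
df = Counter()
for tokens in tokenized:
    for word in set(tokens):
        df[word] += 1

def tf_idf(word, tokens):
    tf = tokens.count(word)        # term frequency in this document
    idf = math.log(N / df[word])   # rare words get a larger IDF
    return tf * idf

# In the third document, "the" occurs in every document (IDF = 0),
# while "quantum" occurs only here, so it gets a much higher weight.
doc = tokenized[2]
for word in ["the", "quantum"]:
    print(word, round(tf_idf(word, doc), 3))
```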
4. A Stepping Stone to Deep Learning
Both the Bag of Words model and TF-IDF ultimately rely on the statistical features of words. The model doesn't truly understand the meaning of the words; it only remembers which words tend to appear together and how often.
Due to the curse of dimensionality, the lack of sequence information, and the inability to comprehend semantics, these classic methods fall short in complex tasks like dialogue understanding or machine translation. To overcome these shortcomings, the NLP field introduced concepts like "Word Embeddings" and "Recurrent Neural Networks (RNNs)," which will be the focus of our next article.
