RNN Basics: Handling Sequential Data with Memory
After exploring the Bag of Words model and TF-IDF, we found that they share a fatal weakness: they discard the sequential information of text. In human language, word order often dictates the entire meaning of a sentence. To handle data with a chronological order or sequential structure, deep learning introduced the Recurrent Neural Network (RNN).
This article will guide you through the fundamental ideas behind RNNs, the role of the hidden state, and why they hold a significant advantage over standard feedforward neural networks when it comes to natural language tasks.
1. Why Regular Neural Networks Fail at Sentences
A standard feedforward neural network (like the fully connected networks we used for handwritten digit classification) has two very strict limitations when processing inputs:
- Fixed Input Length: It requires every input vector to be exactly the same size (e.g., exactly 784 dimensions). But a sentence spoken by a human could be 3 words long, or 30 words long.
- Inputs are Independent: Forward propagation is a one-off computation. When the model processes the word "today", it does not remember that it just processed the word "weather".
It behaves like an amnesiac reader who can only look at one word at a time, instantly forgetting it right after. Naturally, this mechanism cannot comprehend paragraphs of text. We need a network that can "remember" what came before.
2. The Core of RNN: A Continuous Memory
The breakthrough of the Recurrent Neural Network (RNN) is that it adds an internal Hidden State to the network, acting as short-term memory.
You can think of an RNN as an assembly line. When it processes a long sentence, it reads the words one at a time, in order (usually as vectors produced by word embedding). At time step t, the RNN receives two inputs:
- The new word at the current time step, x_t.
- The hidden state passed from the previous time step, h_{t-1} (which contains a summary of all the words seen so far).
The network combines these two pieces of information, applies a linear transformation followed by a nonlinear activation (typically the tanh function), and generates the latest hidden state h_t for the current time step. This h_t can be used to predict the current output, and it is also passed along to the next time step t+1, repeating the cycle.
# Pseudocode for core RNN logic
h = initial_state                  # h_0, typically a zero vector
for word in sentence:              # each word as an embedding vector x_t
    h = tanh( W_hh * h + W_xh * word + bias )   # fold the new word into memory
output = W_hy * h                  # predict from the final hidden state
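The pseudocode above can be turned into a short runnable sketch in NumPy. Everything here is illustrative, not from any specific library: the dimensions, the random seed, and the weight names `W_xh`, `W_hh`, `W_hy` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim, hidden_dim, out_dim = 4, 8, 2                 # illustrative sizes
W_xh = rng.normal(0, 0.1, (hidden_dim, embed_dim))       # input -> hidden
W_hh = rng.normal(0, 0.1, (hidden_dim, hidden_dim))      # hidden -> hidden (the recurrence)
W_hy = rng.normal(0, 0.1, (out_dim, hidden_dim))         # hidden -> output
b_h = np.zeros(hidden_dim)

def rnn_forward(sentence):
    """sentence: a list of word-embedding vectors, one per time step."""
    h = np.zeros(hidden_dim)                             # h_0: empty memory
    for x_t in sentence:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)         # fold the new word into memory
    return W_hy @ h                                      # predict from the final hidden state

# Random vectors stand in for real word embeddings
sentence = [rng.normal(size=embed_dim) for _ in range(5)]
logits = rnn_forward(sentence)
print(logits.shape)   # (2,)
```

Note that the same three weight matrices are reused at every time step; only the hidden state changes. That weight sharing is what lets the loop handle a 3-word or a 30-word sentence with the same parameters.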
3. Common RNN Architectures
Because an RNN unrolls along a sequence, it can adapt flexibly to various tasks:
- Many-to-One: Input a complete sentence and output a single classification result at the end. Example: Sentiment analysis (judging whether a movie review is positive or negative).
- Many-to-Many: Input a sequence, and the network provides an output at every time step. Example: Named Entity Recognition (judging whether each word is a person's name or a location).
- Encoder-Decoder: Use one RNN to compress the original sentence into a memory vector, and use another RNN to generate a new sentence word by word based on that memory. Example: Machine Translation.
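The difference between many-to-one and many-to-many is only which hidden states you read outputs from; the recurrence itself is unchanged. A sketch with illustrative NumPy weights and random vectors standing in for word embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim, hidden_dim, out_dim = 4, 8, 3                 # illustrative sizes
W_xh = rng.normal(0, 0.1, (hidden_dim, embed_dim))
W_hh = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
W_hy = rng.normal(0, 0.1, (out_dim, hidden_dim))

def run_rnn(sentence):
    """Return the hidden state after every time step."""
    h = np.zeros(hidden_dim)
    states = []
    for x_t in sentence:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return states

sentence = [rng.normal(size=embed_dim) for _ in range(6)]
states = run_rnn(sentence)

many_to_one = W_hy @ states[-1]                # one prediction for the whole sentence
many_to_many = [W_hy @ h for h in states]      # one prediction per word (e.g. NER tags)

print(many_to_one.shape, len(many_to_many))    # (3,) 6
```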
4. The Dilemma of RNNs: Vanishing Gradients
While the concept of an RNN is elegant, in practical applications, it encounters the infamous Vanishing Gradient problem if the sentence is very long.
During backpropagation, the error needs to travel backwards along the time axis. Because it passes through the same matrix multiplication over and over, if the recurrent weights are small (roughly, if the largest singular value of the weight matrix is below 1), the gradient decays to almost zero after a dozen or so words (and, conversely, explodes if it is above 1). This turns the RNN into a network with a "goldfish memory": it can remember the last two or three words it just read, but is powerless to retain critical information from dozens of words ago.
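The decay is easy to see numerically. In the sketch below, a gradient vector is repeatedly multiplied by the transpose of a recurrent weight matrix whose largest singular value is below 1 (the tanh derivative, which is at most 1, only makes the decay faster, so it is omitted here); the matrix and its 0.9 scaling are assumptions chosen for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_dim = 8

# A recurrent weight matrix rescaled so its largest singular value is 0.9
W_hh = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)

# Backpropagation through time multiplies the error by W_hh^T
# once for every step it travels back along the sequence.
grad = np.ones(hidden_dim)
norms = []
for step in range(30):
    grad = W_hh.T @ grad          # one step further back in time
    norms.append(np.linalg.norm(grad))

print(f"after 1 step:   {norms[0]:.3f}")
print(f"after 30 steps: {norms[-1]:.2e}")   # shrinks at least as fast as 0.9**30
```

After thirty steps the gradient norm has shrunk by at least a factor of 0.9^30 ≈ 0.04, so words that far back contribute almost nothing to the weight updates.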
To solve this, researchers invented the LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). These add "gate" structures inside the RNN, allowing it to actively decide which information to remember and which to forget, greatly alleviating the long-range dependency problem.
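To make "gates" concrete, here is a minimal GRU-style update in NumPy. The weight names and sizes are illustrative, and real implementations (e.g. `torch.nn.GRU`) pack the weights differently, but the structure is the standard one: two sigmoid gates deciding how much to keep and how much to overwrite.

```python
import numpy as np

rng = np.random.default_rng(3)
hidden_dim, embed_dim = 8, 4                             # illustrative sizes

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Three pairs of weights: update gate (z), reset gate (r), candidate state
Wz, Uz = rng.normal(0, 0.1, (hidden_dim, embed_dim)), rng.normal(0, 0.1, (hidden_dim, hidden_dim))
Wr, Ur = rng.normal(0, 0.1, (hidden_dim, embed_dim)), rng.normal(0, 0.1, (hidden_dim, hidden_dim))
Wc, Uc = rng.normal(0, 0.1, (hidden_dim, embed_dim)), rng.normal(0, 0.1, (hidden_dim, hidden_dim))

def gru_step(h, x):
    z = sigmoid(Wz @ x + Uz @ h)               # update gate: how much new info to let in
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate: how much old memory to consult
    h_cand = np.tanh(Wc @ x + Uc @ (r * h))    # candidate new state
    return (1 - z) * h + z * h_cand            # blend old memory with the candidate

h = np.zeros(hidden_dim)
for x in [rng.normal(size=embed_dim) for _ in range(5)]:
    h = gru_step(h, x)
print(h.shape)   # (8,)
```

Because the final line blends the old state and the candidate, a gate value of z ≈ 0 lets information pass through a time step almost untouched, which is exactly what a plain tanh recurrence cannot do.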
5. Where to Next?
LSTMs and GRUs dominated the NLP field for many years and are extremely powerful. But the RNN family always had a structural Achilles' heel: it must compute sequentially, word by word. It cannot be highly parallelized on GPUs like a CNN. This makes training massive models on colossal datasets painstakingly slow.
This called for a new architecture that could completely break free from "recurrence" and "sequential reading." In the next article, we will introduce the technology that fundamentally altered the NLP landscape: the Attention mechanism and the Transformer.