TCP Reliability and Congestion Window: A Runnable Sequence Number Experiment
TCP Reliability and Congestion Window: A Runnable Sequence Number Experiment

TCP Reliability and Congestion Window: A Runnable Sequence Number Experiment

A web page loads reliably not because the Internet backbone provides lossless transit, but because the TCP state machine strictly enforces sequence continuity through mathematically rigorous congestion control. For High-Frequency Trading (HFT) platforms or massive Content Delivery Networks (CDNs), the standard TCP textbook explanation is inadequate. In these environments, you must tune the Linux kernel’s congestion window (cwnd) and receive window (rwnd) to combat bufferbloat, minimize 99th-percentile tail latency, and optimize pacing algorithms on Long Fat Networks (LFN).

In this deep dive, we will trace the exact lifecycle of TCP segments within the Linux net/ipv4/tcp_input.c source code, analyze the differential equations behind CUBIC and BBR congestion control, and utilize `perf` and flame graphs to debug performance limits at scale.

1. Sequence Space and Kernel State Machines

The 3-way handshake (SYN, SYN-ACK, ACK) is rarely the performance bottleneck unless you suffer from SYN flood attacks (mitigated by net.ipv4.tcp_syncookies). The true complexity of TCP lies in managing the Sequence Space during data transfer.

In the Linux kernel, every TCP connection is represented by a struct tcp_sock. When an ACK is received, the kernel invokes tcp_ack(). This massive function must determine if the ACK is a duplicate, if it contains Selective Acknowledgments (SACK), and whether it should update the cwnd.

/* Excerpt from linux/net/ipv4/tcp_input.c */
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
    struct tcp_sock *tp = tcp_sk(sk);
    u32 prior_snd_una = tp->snd_una;
    u32 ack_seq = TCP_SKB_CB(skb)->ack_seq;

    /* If the ACK is out of bounds, drop it */
    if (before(ack_seq, prior_snd_una))
        goto old_ack;

    /* Process SACK blocks to detect out-of-order delivery */
    if (tcp_is_sack(tp) && tcp_check_sack_reneging(sk, flag))
        tcp_retransmit_timer(sk);

    /* Update the congestion control state machine */
    tcp_cong_control(sk, ack_seq, prior_snd_una, flag);
    return 1;
}

If sequence numbers jump out of order, the kernel buffers the out-of-order packets in the Out-Of-Order (OOO) queue and returns a Duplicate ACK. Three Duplicate ACKs bypass the Retransmission Timeout (RTO) timer and trigger Fast Retransmit.

sequenceDiagram
    participant Client (tcp_output)
    participant Server (tcp_input)
    Note over Client,Server: RTT = 20ms, MSS = 1460
    Client->>Server: DATA, seq=1001, len=1460
    Server->>Client: ACK, ack=2461
    Client-xServer: DATA, seq=2461, len=1460 (Loss!)
    Client->>Server: DATA, seq=3921, len=1460
    Server->>Client: DUP ACK, ack=2461 (SACK 3921-5381)
    Client->>Server: DATA, seq=5381, len=1460
    Server->>Client: DUP ACK, ack=2461 (SACK 3921-6841)
    Note over Client: Fast Retransmit Triggered (3 DUP ACKs)
    Client->>Server: DATA, seq=2461, len=1460 (Retransmission)
    Server->>Client: ACK, ack=6841 (Cumulative ACK advances!)

2. The Mathematics of Congestion Control

When packet loss occurs, TCP reacts by modulating its sending window. Standard algorithms like NewReno use an Additive Increase, Multiplicative Decrease (AIMD) mathematical model.

The CUBIC Algorithm

The default in older Linux kernels is CUBIC. CUBIC governs window growth via a cubic function of time since the last congestion event, making it independent of RTT. The window ( W ) at time ( t ) is defined as:

$$ W_{cubic}(t) = C(t – K)^3 + W_{max} $$

Where ( K = sqrt[3]{ frac{W_{max} beta}{C} } ) and ( beta ) is the multiplicative decrease factor (usually 0.7 for CUBIC). When a loss event happens, CUBIC aggressively slashes the window ( W = W times beta ).

The BBR Algorithm (Bottleneck Bandwidth and RTT)

Unlike CUBIC, BBR does not view packet loss as congestion. BBR measures the exact Delivery Rate and Minimum RTT using Little’s Law (( L = lambda W )). It computes the optimal In-Flight data (Bandwidth-Delay Product, BDP):

$$ BDP = BtlBw times RTprop $$

BBR phases through PROBE_BW and PROBE_RTT. To probe bandwidth, it applies a Pacing Gain of ( 1.25 ) (sending 25% faster than the measured bottleneck). When queues build up and latency rises, BBR drops the pacing gain to ( 0.75 ) to drain the bottleneck buffer, completely immunizing the connection against non-congestion random packet loss.

CUBIC vs BBR Congestion Window trajectories
Mathematical trajectories: CUBIC’s saw-tooth penalty from loss vs BBR’s smooth capacity pacing based on RTT measurements.

3. Kernel Instrumentation: `perf` and Flame Graphs

In high-throughput environments, you must profile the TCP stack CPU overhead. Flame graphs generated via perf often reveal bottlenecks in checksum calculations or lock contention.

# Profile TCP retransmissions across the entire kernel
sudo perf record -e tcp:tcp_retransmit_skb -aR -g
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > tcp_retrans_flame.svg

# Monitor active connections using ss (socket statistics)
# Extract precise BBR bandwidth estimates and RTT
ss -nti | grep -A 1 'ESTAB'
# Example Output:
#  cwnd:10 pacing_rate 1200Mbps delivery_rate 950Mbps bbr:(bw:950Mbps,mrtt:15.2ms,pacing_gain:1.25)

By capturing tcp:tcp_retransmit_skb, engineers can correlate retransmission spikes directly with application layer latencies or GC (Garbage Collection) pauses.

4. Production Architecture Post-Mortem

Bufferbloat and the FQ-CoDel Intervention

During the launch of a live-streaming platform, we encountered massive stuttering on client video players. Our load balancers were capable of pushing 40Gbps, but user latency would inexplicably spike from 30ms to 2500ms. Analyzing the ss -nti output revealed that our TCP cwnd was enormous, but the RTT was catastrophic.

This was a classic case of Bufferbloat. The intermediate routers had massive buffers. CUBIC kept increasing its window, filling these buffers until they overflowed. Instead of dropping packets early to signal congestion, the routers held them, introducing astronomical queueing delay.

We resolved this by migrating the kernel Queuing Discipline (Qdisc) to fq_codel (Fair Queueing Controlled Delay) and switching the congestion algorithm to BBR. FQ-CoDel actively drops packets from fat flows if they sit in the queue too long, and BBR paced the TCP packets exactly to the bottleneck bandwidth. Tail latency immediately stabilized under 50ms.

sysctl -w net.core.default_qdisc=fq_codel
sysctl -w net.ipv4.tcp_congestion_control=bbr

5. Automated Event Simulation

For mathematical verification without relying on volatile internet links, we use a Python script running alongside a C11 socket program bound to 127.0.0.1 to export precise event matrices for AIMD and BBR simulations.

python src/tcp_reliability_cwnd.py
cc -std=c11 -Wall -Wextra -O2 src/tcp_loopback_echo.c -o /tmp/tcp_echo
/tmp/tcp_echo

Simulation outputs, including phase shifts and BDP estimates, are exported to tcp-cwnd-results.csv.

6. Animated Walkthrough

Animation detailing BBR’s PROBE_BW phase compared to CUBIC’s AIMD reaction to a dropped packet in the Fast Recovery state.

7. Engineering Heuristics & Anti-Patterns

  • Disabling Hardware Offloading: Never manually disable TSO (TCP Segmentation Offload) or LRO (Large Receive Offload) unless debugging a buggy NIC driver. Offloading saves up to 40% of CPU cycles by avoiding per-packet TCP header processing in the kernel.
  • Misinterpreting 0-Window: A TCP Zero Window packet from the client means the client’s application is not reading data from its socket buffer fast enough (application bottleneck), NOT that the network is congested.
  • Over-tuning Sysctls: Randomly increasing tcp_rmem and tcp_wmem to gigabytes does not increase speed; it only exacerbates bufferbloat and consumes kernel memory. Stick to autotuning defaults unless you are on a massive Long Fat Network (e.g., transatlantic 100Gbps links).

FAQ

Why did BBR completely replace CUBIC at Google?

Loss-based algorithms like CUBIC assume packet loss only happens because of congestion. On modern wireless networks (Wi-Fi, 5G), loss often happens due to radio interference. CUBIC halves its throughput for a random signal drop. BBR recognizes that the RTT didn’t spike, ignores the random loss, and maintains maximum throughput.

Does the 3-Way Handshake guarantee delivery?

Absolutely not. It merely establishes cryptographic sequence initialization and negotiates extensions (SACK, WScale). Actual data delivery is strictly governed by the sliding window and timeout retransmissions.

References

With reliability mathematically guaranteed by TCP, our final frontier is the application layer. The next article explores the protocol that runs atop TCP/QUIC: HTTP/3 and edge CDN caching.

Search questions

FAQ

Who is this article for?

This article is for readers who want an intermediate-level guide to TCP Reliability and Congestion Window: A Runnable Sequence Number Experiment. It takes about 13 min and focuses on TCP, Congestion Control, Python, C sockets.

What should I read next?

The recommended next step is HTTPS and TLS 1.3 Handshake: Keys, Certificates, and RTT in Practice, so the article connects into a longer learning route instead of ending as an isolated note.

Does this article include runnable code or companion resources?

Yes. Use the run notes, resource cards, and download links on the page to reproduce the example or inspect the companion files.

How does this article fit into the larger site?

It is connected to the article context block, learning routes, resources, and project timeline so readers can move from concept to implementation.

Article context

Network Fundamentals

A reproducible route through DNS, TCP, TLS, HTTP/3, proxy tunnels, load balancing, and shared caches with code and figures.

Level: Intermediate Reading time: 13 min
  • TCP
  • Congestion Control
  • Python
  • C sockets
Other language version TCP 三次握手、重传与拥塞窗口:可运行的序列号实验
Share summary TCP Reliability and Congestion Window: A Runnable Sequence Number Experiment

Track TCP sequence numbers, cumulative ACKs, loss, retransmission, and congestion-window changes with safe local experiments.

Download share card Open share center

Companion resources

Leave a Reply

Project timeline

Published posts

  1. DNS Resolution Explained: Build a TTL Cache and Packet Parser in Python A runnable DNS guide covering resolution paths, response headers, TTL cache latency, and deterministic Python/C experiments.
  2. CIDR, Longest Prefix Match, and MTU: Calculate IP Routing Step by Step Calculate CIDR ranges, longest-prefix route choice, and MTU/MSS payload segmentation with runnable Python and C examples.
  3. TCP Reliability and Congestion Window: A Runnable Sequence Number Experiment Track TCP sequence numbers, cumulative ACKs, loss, retransmission, and congestion-window changes with safe local experiments.
  4. HTTPS and TLS 1.3 Handshake: Keys, Certificates, and RTT in Practice Understand TLS 1.3 message flights, certificate authentication, ephemeral key agreement, and handshake latency with a safe teaching model.
  5. HTTP/2, HTTP/3, and CDN Caching: Read Page Speed from a Waterfall A deterministic browser-waterfall model for HTTP/2, HTTP/3, QUIC streams, and CDN cache hits or misses.
  6. Forward Proxy vs Reverse Proxy: Connection Paths, Trust Boundaries, and Latency A reproducible guide to forward proxies, reverse proxies, tunnels, TLS boundaries, and latency segments.
  7. HTTP CONNECT and HTTPS Proxy Tunnels: TLS Boundaries and Handshake Latency An RFC-based explanation of CONNECT tunnels, encrypted HTTPS payloads, and modeled first-request latency.
  8. SOCKS5 Proxy Explained: Protocol Bytes, DNS Resolution Boundaries, and Leakage Risk Decode safe SOCKS5 CONNECT bytes and compare local-DNS and proxy-side hostname resolution boundaries.
  9. Reverse Proxy Load Balancing: Queues, Health Checks, and a Reproducible Scheduler Compare round robin and load-aware queue selection while reasoning about health checks and retry boundaries.
  10. Proxy Cache Revalidation: Cache-Control, ETag, and Observable Correctness Use an RFC 9111 shared-cache model to calculate MISS, HIT, and 304 revalidation latency and correctness boundaries.

Published resources

  1. Network Fundamentals Lab README Setup, no-privilege safety boundary, ten Python experiments, and three C examples.
  2. Network fundamentals full lab bundle Bundles Python/C source, fixed scenarios, ten result CSVs, and protocol/proxy figures.
  3. DNS TTL results CSV HIT/MISS state, expiry, and latency for four fixed lookups.
  4. CIDR and MTU results CSV Longest-prefix route and 3600-byte payload segmentation results.
  5. TCP cwnd events CSV Per-round ACK, window, and deterministic retransmission events.
  6. TLS 1.3 flight results CSV Message direction, timing, and teaching shared value in a fixed RTT model.
  7. HTTP/CDN waterfall results CSV Phase timing for HTTP/2 and HTTP/3 in cold and warm cache models.
  8. Proxy path latency results CSV Phase timing for direct access, forward-proxy tunneling, and reverse-proxy cache paths.
  9. CONNECT/TLS timeline CSV Records CONNECT authority, tunnel establishment, and the encrypted HTTPS-request boundary.
  10. SOCKS5 DNS boundary CSV Stores ATYP, destination bytes, request length, and modeled local DNS counts.
  11. Proxy load-balancing queue CSV Compares backend selection and queue waiting for round robin and least queue.
  12. Proxy cache revalidation CSV Records MISS, HIT, 304 revalidation, object age, and response latency.
  13. Network request path visualizer Adjust TTL, prefixes, loss, handshake RTT, and cache paths in the browser.
  14. Network fundamentals topic share card A 1200x630 SVG card for the DNS, TLS, HTTP/3, proxy tunnel, and caching topic hub.

Next notes

  1. Add IPv6 and QUIC observation notes
  2. Review caching and protocol benefits with real-user metrics
Scroll down