CIDR, Longest Prefix Match, and MTU: Calculate IP Routing Step by Step
CIDR, Longest Prefix Match, and MTU: Calculate IP Routing Step by Step

CIDR, Longest Prefix Match, and MTU: Calculate IP Routing Step by Step

Once DNS provides an IP address, the network layer executes two critical pathing components: Longest Prefix Match (LPM) routing and Maximum Transmission Unit (MTU) adherence. For large-scale production networks—especially when dealing with 100Gbps+ links or heavily encapsulated overlay networks—CIDR, LPM, and MTU transcend administrative basics and become hard constraints dictating CPU cycles, kernel buffer allocations, and data-plane forwarding latency.

In this article, we dive into the Linux kernel networking stack to explore how the Forwarding Information Base (FIB) implements LPM via Level-Compressed Tries (LC-Tries), the mathematics of routing complexity, the mechanics of MTU path discovery (PMTUD), and how eBPF/XDP can bypass the traditional stack for line-rate routing.

1. The Mathematics and Data Structures of LPM

Classless Inter-Domain Routing (CIDR) eliminates fixed class boundaries. A router evaluating a destination IP must find the most specific subnet match among millions of BGP routes. Doing this via linear iteration is computationally impossible at line rate.

Modern Linux kernels use an LC-Trie (Level-Compressed Trie) for the IPv4 routing table. Unlike a standard binary trie which requires (O(W)) lookups where (W) is the IP address length (32 for IPv4), an LC-trie compresses sparse branches and path nodes.

The mathematical representation of path compression in an LC-Trie assumes that for a given node (v) with a branching factor (k), the search time is reduced. If ( n ) is the number of routes, the expected depth of an LC-trie is bounded by ( O(log n) ). The kernel represents routing tables as compressed arrays, pulling entire tree nodes into a single L1/L2 CPU cache line.

/* Excerpt from linux/net/ipv4/fib_trie.c */
struct key_vector {
    t_key key;
    unsigned char pos;      /* 2log(KEYLENGTH) bits needed */
    unsigned char bits;     /* 2log(KEYLENGTH) bits needed */
    unsigned char slen;
    union {
        /* This array's size is 2^bits */
        struct key_vector __rcu *tnode[0];
        struct fib_alias __rcu *leaf;
    };
};

The fib_lookup function traverses this structure. Because of Read-Copy-Update (RCU) synchronization, the data plane can perform lookups entirely lock-free, preventing multi-core contention on the routing table.

Destination IP: 10.22.18.7 (00001010.00010110.00010010.00000111)
LC-Trie Traversal:
- Check root node, shift to branch based on `pos` and `bits`.
- Match against `10.22.16.0/20` (Winner: Specific Service Subnet)
- Exceeds `10.0.0.0/8` precision.
LC-Trie structure and LPM evaluation
Kernel FIB lookup utilizing Level-Compressed Tries to find the longest matching prefix with minimal memory access.

Visualizing Route Selection with eBPF/XDP

In high-performance architectures (e.g., Cloudflare, Meta), packets never reach the standard IP stack. Instead, eBPF (Extended Berkeley Packet Filter) hooks at the XDP (eXpress Data Path) layer, querying the FIB directly from the NIC driver.


flowchart TD
    A[NIC Rx Ring
Dest: 10.22.18.7] --> B{XDP Hook
eBPF Program} B -->|bpf_fib_lookup| C((Kernel FIB
LC-Trie)) C -.->|Match 10.22.16.0/20| B B --> D{MTU Check} D -->|<= MTU| E[XDP_REDIRECT
Zero-Copy Forward] D -->|> MTU| F[XDP_PASS
Pushed to generic Skb stack] F --> G[ip_local_deliver / ip_forward]

2. MTU, MSS, and the Mathematics of Encapsulation

Routing determines the egress interface, which imposes a Maximum Transmission Unit (MTU). Standard Ethernet enforces a 1500-byte MTU. For TCP payloads, we must compute the Maximum Segment Size (MSS).

The equation for calculating MSS in heavily encapsulated data center environments (e.g., VXLAN overlays over IPsec) is:

$$ MSS = MTU_{phys} – H_{eth} – H_{ipsec} – H_{vxlan} – H_{outer_ip} – H_{inner_ip} – H_{tcp} $$

For standard IPsec over IPv4 with AES-GCM:

Base MTU: 1500 bytes
IPv4 Header: 20 bytes
ESP Header + IV + Trailer + ICV: ~40 bytes
Outer IPv4 Header: 20 bytes
TCP Header: 20 bytes
Effective MSS = 1500 - 20 - 40 - 20 - 20 = 1400 Bytes

Relying on the IP layer for fragmentation is a critical anti-pattern. IP fragmentation relies on the ip_fragment() kernel function, which allocates new sk_buff structures, copies payload fragments, and consumes massive CPU cycles. At 10Gbps+, IP fragmentation will immediately saturate kernel softirq (ksoftirqd), leading to packet drops. Transport layer segmentation (via TCP MSS clamping or TSO/GSO offloading) is mandatory for line-rate speeds.

3. Kernel-Level Debugging: `perf` and PMTUD

Path MTU Discovery (PMTUD) relies on ICMP Fragmentation Needed. However, when intermediate firewalls drop ICMP (a PMTU Blackhole), the sender’s TCP stack never receives the signal to shrink the MSS.

We can trace MTU failures and routing drops directly using perf and ftrace.

# Trace ICMP Need Frag reception in the kernel
sudo perf record -e icmp:icmp_unreach -a -g

# Trace TCP MTU probing (RFC 4821) when PMTUD fails
sudo perf probe -a tcp_mtu_probing
sudo perf record -e probe:tcp_mtu_probing -aR

# Investigate XDP redirects and FIB lookup failures
sudo bpftrace -e 'kprobe:bpf_fib_lookup { @[kstack] = count(); }'

In the Linux kernel, when a PMTU blackhole is detected, RFC 4821 Packetization Layer Path MTU Discovery (PLPMTUD) can dynamically adjust the MSS without relying on ICMP. This is enabled via sysctl net.ipv4.tcp_mtu_probing=1.

/* Excerpt from net/ipv4/tcp_timer.c handling PMTU probes */
static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    /* Probe for larger MTU, or drop MSS if blackhole detected */
    if (tcp_mtu_probe(sk)) {
        tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
    }
}

4. Production Architecture Post-Mortem

The eBPF / XDP Routing Bypass

During the design of a multi-terabit Edge CDN, traditional Linux IP forwarding (via ip_forward) bottlenecked at ~2Mpps per core due to sk_buff allocation overhead. We architected an XDP-based fast path. The eBPF program hooks directly into the NIC’s receive ring buffer, reads the destination IP, calls bpf_fib_lookup(), updates the MAC addresses, and issues XDP_REDIRECT to send the packet out the egress interface without ever allocating an sk_buff or triggering a kernel interrupt. This pushed throughput to 14Mpps per core.

However, we hit an MTU trap. XDP programs process raw frames. If a payload exceeded the egress MTU, XDP_REDIRECT would silently drop the frame. We had to implement custom eBPF logic to parse the MTU from the FIB lookup result, manually craft an ICMP Packet Too Big frame in eBPF, and reflect it back to the sender via XDP_TX to ensure PMTUD worked flawlessly.

5. Automated Data-Plane Validation

Modern networking relies on programmatic verification. Below is a Python script using pyroute2 to directly query the Linux Netlink socket for FIB routing decisions and PMTU metrics, bypassing standard shell commands.

from pyroute2 import IPRoute
import math

ipr = IPRoute()
destination = "10.22.18.7"

# Query Netlink socket for exact kernel routing decision
route = ipr.route('get', dst=destination)[0]
attrs = dict(route['attrs'])

egress_iface = attrs.get('RTA_OIF')
gateway = attrs.get('RTA_GATEWAY', 'Direct Connect')
mtu = attrs.get('RTA_METRICS', {}).get('RTAX_MTU', 1500)

print(f"Destination: {destination}")
print(f"Egress Interface Index: {egress_iface}")
print(f"Gateway: {gateway}")
print(f"Path MTU: {mtu} bytes")

# Hardware Segmentation Math (TSO)
ip_tcp_headers = 40
payload_size = 65535 # Max GSO Super-frame
segments = math.ceil(payload_size / (mtu - ip_tcp_headers))
print(f"NIC TSO will slice super-frame into {segments} hardware segments.")

Simulation profiles of zero-copy routing latency and MTU penalty drops can be found in cidr-mtu-results.csv.

6. Animated Walkthrough

The animation illustrates an LC-Trie lookup resolving the LPM, followed by zero-copy DPDK/XDP redirection, segmenting packets to strictly adhere to the egress MTU.

7. Engineering Heuristics & Anti-Patterns

  • Overusing Host Routes: Injecting millions of /32 host routes destroys LC-Trie efficiency and evicts L2/L3 CPU caches, causing massive lookup latency. Aggregate where possible.
  • Disabling ICMP Globally: Blocking ICMP Type 3 Code 4 breaks PMTUD, resulting in silent blackholes on overlay networks. Always allow PMTU ICMP packets through security groups.
  • Ignoring Jumbo Frames: In backend storage networks (e.g., Ceph, iSCSI), leaving MTU at 1500 bytes causes immense TCP header overhead and interrupts. Enable Jumbo Frames (9000 bytes) to dramatically boost Gbps throughput.
  • Misunderstanding TSO/GSO: Tools like tcpdump might show 65KB TCP packets exiting your server. This is not an MTU violation; it is TCP Segmentation Offload (TSO) passing a massive super-frame to the NIC hardware, which slices it into MTU-compliant packets at the physical layer.

FAQ

Why doesn’t the kernel just use a Hash Table for routing?

Hash tables map exact keys to values (O(1)), but IP routing requires Longest Prefix Match, where the destination address is not known to precisely match a specific prefix length. Tries naturally support hierarchical prefix matching. Hash tables are only used for exact-match flow caching (like the Conntrack table).

How does MTU affect BGP?

BGP runs over TCP. If your BGP routers establish peering over an IPsec tunnel or GRE with a mismatched MTU, the BGP OPEN messages (which are small) will succeed, but the subsequent BGP UPDATE messages carrying full routing tables will exceed the MTU and drop, causing BGP sessions to constantly flap.

References

With routing hardware offloads and MTU mechanics exposed, our next article explores how the TCP state machine reacts mathematically to the congestion signals these networks generate.

Search questions

FAQ

Who is this article for?

This article is for readers who want an intermediate-level guide to CIDR, Longest Prefix Match, and MTU: Calculate IP Routing Step by Step. It takes about 12 min and focuses on IPv4, CIDR, MTU, Python.

What should I read next?

The recommended next step is TCP Reliability and Congestion Window: A Runnable Sequence Number Experiment, so the article connects into a longer learning route instead of ending as an isolated note.

Does this article include runnable code or companion resources?

Yes. Use the run notes, resource cards, and download links on the page to reproduce the example or inspect the companion files.

How does this article fit into the larger site?

It is connected to the article context block, learning routes, resources, and project timeline so readers can move from concept to implementation.

Article context

Network Fundamentals

A reproducible route through DNS, TCP, TLS, HTTP/3, proxy tunnels, load balancing, and shared caches with code and figures.

Level: Intermediate Reading time: 12 min
  • IPv4
  • CIDR
  • MTU
  • Python
  • C
Other language version CIDR、子网掩码与最长前缀匹配:用代码算清 IP 路由和 MTU
Share summary CIDR, Longest Prefix Match, and MTU: Calculate IP Routing Step by Step

Calculate CIDR ranges, longest-prefix route choice, and MTU/MSS payload segmentation with runnable Python and C examples.

Download share card Open share center

Companion resources

Leave a Reply

Project timeline

Published posts

  1. DNS Resolution Explained: Build a TTL Cache and Packet Parser in Python A runnable DNS guide covering resolution paths, response headers, TTL cache latency, and deterministic Python/C experiments.
  2. CIDR, Longest Prefix Match, and MTU: Calculate IP Routing Step by Step Calculate CIDR ranges, longest-prefix route choice, and MTU/MSS payload segmentation with runnable Python and C examples.
  3. TCP Reliability and Congestion Window: A Runnable Sequence Number Experiment Track TCP sequence numbers, cumulative ACKs, loss, retransmission, and congestion-window changes with safe local experiments.
  4. HTTPS and TLS 1.3 Handshake: Keys, Certificates, and RTT in Practice Understand TLS 1.3 message flights, certificate authentication, ephemeral key agreement, and handshake latency with a safe teaching model.
  5. HTTP/2, HTTP/3, and CDN Caching: Read Page Speed from a Waterfall A deterministic browser-waterfall model for HTTP/2, HTTP/3, QUIC streams, and CDN cache hits or misses.
  6. Forward Proxy vs Reverse Proxy: Connection Paths, Trust Boundaries, and Latency A reproducible guide to forward proxies, reverse proxies, tunnels, TLS boundaries, and latency segments.
  7. HTTP CONNECT and HTTPS Proxy Tunnels: TLS Boundaries and Handshake Latency An RFC-based explanation of CONNECT tunnels, encrypted HTTPS payloads, and modeled first-request latency.
  8. SOCKS5 Proxy Explained: Protocol Bytes, DNS Resolution Boundaries, and Leakage Risk Decode safe SOCKS5 CONNECT bytes and compare local-DNS and proxy-side hostname resolution boundaries.
  9. Reverse Proxy Load Balancing: Queues, Health Checks, and a Reproducible Scheduler Compare round robin and load-aware queue selection while reasoning about health checks and retry boundaries.
  10. Proxy Cache Revalidation: Cache-Control, ETag, and Observable Correctness Use an RFC 9111 shared-cache model to calculate MISS, HIT, and 304 revalidation latency and correctness boundaries.

Published resources

  1. Network Fundamentals Lab README Setup, no-privilege safety boundary, ten Python experiments, and three C examples.
  2. Network fundamentals full lab bundle Bundles Python/C source, fixed scenarios, ten result CSVs, and protocol/proxy figures.
  3. DNS TTL results CSV HIT/MISS state, expiry, and latency for four fixed lookups.
  4. CIDR and MTU results CSV Longest-prefix route and 3600-byte payload segmentation results.
  5. TCP cwnd events CSV Per-round ACK, window, and deterministic retransmission events.
  6. TLS 1.3 flight results CSV Message direction, timing, and teaching shared value in a fixed RTT model.
  7. HTTP/CDN waterfall results CSV Phase timing for HTTP/2 and HTTP/3 in cold and warm cache models.
  8. Proxy path latency results CSV Phase timing for direct access, forward-proxy tunneling, and reverse-proxy cache paths.
  9. CONNECT/TLS timeline CSV Records CONNECT authority, tunnel establishment, and the encrypted HTTPS-request boundary.
  10. SOCKS5 DNS boundary CSV Stores ATYP, destination bytes, request length, and modeled local DNS counts.
  11. Proxy load-balancing queue CSV Compares backend selection and queue waiting for round robin and least queue.
  12. Proxy cache revalidation CSV Records MISS, HIT, 304 revalidation, object age, and response latency.
  13. Network request path visualizer Adjust TTL, prefixes, loss, handshake RTT, and cache paths in the browser.
  14. Network fundamentals topic share card A 1200x630 SVG card for the DNS, TLS, HTTP/3, proxy tunnel, and caching topic hub.

Next notes

  1. Add IPv6 and QUIC observation notes
  2. Review caching and protocol benefits with real-user metrics
Scroll down