DIY container networking: veths, netnses, and iproute2, oh my!

Container networking seems complicated. In fact, oftentimes, container networking is complicated. Who'da thunk it? Luckily, we can peel back the onion to get a pretty solid conceptual understanding of how container networking actually works under the hood.

A brief digression: containers aren't fancy

Containers often seem mystical and incomprehensible. But, like most things in science, life, and engineering, it turns out to be just a little bit of novelty on top of lots of already established knowledge and practice.

Containers are, at their core, just a collection of Linux kernel features presented in a logical way. The most important feature for the purpose of this article is Linux namespaces, and more specifically, network namespaces.

Other important features for containers include:

  • chroot — containers can only access a subset of the filesystem allowed by the host
  • control groups (cgroup) — containers can be limited to a subset of the CPU and memory of the host
  • process (PID) namespaces — containers have a limited view of the process tree, usually limited to only viewing processes from the same container
  • Linux capabilities — containers can be limited in the ways they're allowed to perform privileged operations with the kernel (such as changing the network configuration)

This isn't an exhaustive list, and container runtimes are not exactly simple, but at the very least, they build logically on many of these Linux primitives.
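
If you want to poke at a few of these primitives directly (entirely optional for the rest of this post), util-linux's unshare command is a handy sketchpad. As a rough sketch, something like the following should drop you into a shell that has its own network, PID, and mount namespaces:

# Start a shell in new network, PID, and mount namespaces. --mount-proc also
# remounts /proc so tools like ps only see processes inside the new namespace.
# (Run as root.)
$ unshare --net --pid --fork --mount-proc /bin/bash
# Inside that shell: only a loopback interface, and only your own processes.
$ ip link list
$ ps aux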

Network namespaces and me

A network namespace (NetNS) provides an isolated view of the network configuration. Each network namespace has its own interfaces, IP addresses, routing rules, and firewall configuration (we'll talk about each of these!).

Conceptually, each network namespace is like having a different computer in the same room. Just like in real life, these different namespaces can't communicate with each other unless we arrange for that (in much the same way that two computers can't communicate with each other unless we connect them somehow).

To see how network namespaces work, we'll use the ip command, which is provided by iproute2. The ip command ships with most Linux distributions nowadays and is essentially the Swiss Army knife of Linux networking.

Creating a network namespace is a single command.

$ ip netns add container-one

We can execute commands within the namespace with the ip netns exec command. For example, let's look at all the configured interfaces (links in iproute2 terminology) in the new namespace.

$ ip netns exec container-one ip link list
  1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

# For ip commands, we can also use the `-n` flag to specify the network
# namespace to execute in. The command above is equivalent to:
$ ip -n container-one link list
  1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

By default, new network namespaces only contain a loopback (i.e., localhost) interface. Importantly, this is actually a different loopback interface than the host's loopback interface, so a service listening on localhost:8888 on the host won't be accessible from within the container (and vice versa).

Two network namespaces: the host namespace contains a loopback interface and an ethernet interface named ens4 that is connected to the internet; the container namespace only contains a loopback interface and is not connected to the host namespace.

The new network namespace container-one isn't connected to anything else. It can't access the internet or anything else. I promise it gets more interesting!
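
Before it does, here's a quick way to convince yourself of the loopback isolation mentioned above. This is just a sketch (it assumes the OpenBSD flavor of netcat); any listener/client pair would make the same point:

# Listen on the host's loopback in the background...
$ nc -l 127.0.0.1 8888 &
# ...and try to reach it from inside the namespace. This should fail, since the
# namespace has its own (separate, and currently DOWN) loopback interface.
$ ip netns exec container-one nc -z -w 1 127.0.0.1 8888
# Clean up the background listener.
$ kill %1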

Veths and connecting namespaces

A veth is a virtual ethernet device. You can think of this like an ethernet cable. Critically, just like ethernet cables have two ends, veths always come in pairs.

We often use veths to connect two network namespaces together, just as if we were connecting two computers in the same room with an ethernet cable. In this case, we're connecting the two machines directly (i.e., without using a router or bridge — we'll get there, I promise!).

To do this, we need a few ip commands. Note that the argument syntax is a bit nonstandard: instead of the usual convention of --type=veth, the ip command uses the notation type veth. Since long strings of arguments can be hard to read, I've added line breaks to the commands below to make them easier to follow.

# Create a new veth pair.
# One side of the veth will be named veth1 and the other will be named veth1a.
$ ip link add veth1 \
  type veth \
  peer name veth1a

# Move the veth1a device to the container namespace. This effectively connects
# the host namespace to the container-one namespace.
$ ip link set veth1a \
  netns container-one

# We usually set the interface name to something standard in the container
# namespace. Interface names only need to be unique within a namespace, so
# multiple containers can each have their own eth0.
# Note the `-n container-one` syntax means we'll perform the operation in the
# container namespace; this is necessary since that side of the veth only
# exists in that namespace now!
$ ip -n container-one link set veth1a \
  name eth0

# Finally, we need to set the veth up. This tells the kernel to start
# considering this device for routing packets (though we'll need to assign
# IP addresses first!).
$ ip link set veth1 up
$ ip -n container-one link set eth0 up

We can see the state of the world as seen by the host and container network namespaces:

# From the host's perspective...
# (note: this output will depend on your host's network configuration)
$ ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 42:01:c0:a8:1f:64 brd ff:ff:ff:ff:ff:ff
    altname enp0s4
82: veth1@if81: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 72:4e:e3:d6:cf:6e brd ff:ff:ff:ff:ff:ff link-netns container-one

# From the container's perspective...
$ ip -n container-one link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
81: eth0@if82: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0e:98:71:83:72:81 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Graphically, we have something like this:

The host namespace now contains a device named veth1 that is connected to a device named eth0 in the container's network namespace.

The container-one NetNS is connected via a veth pair to the host NetNS. If the routing configuration is set up correctly in each namespace (hey! keep reading!), the two machines can communicate.

Adding IP addresses and routes

We need to assign IP addresses to the interfaces so they can communicate. In the real world, outside of contrived blog posts about container networking, this will all be set up by your operating system, but hey, you just so happen to be reading a contrived blog post about container networking. When in Rome!

Let's use the 192.168.99.0/24 network to choose IP addresses from (for the uninitiated, the 192.168.XXX.XXX range is reserved for local networking — like your home LAN — and the /24 means that any 192.168.99.XXX address belongs to our network).

# Add 192.168.99.1 to the host's interface
$ ip addr add 192.168.99.1/24 dev veth1
# Add 192.168.99.2 to the container's interface
$ ip -n container-one addr add 192.168.99.2/24 dev eth0

Okay, cool, wonderful and awesome! Now that we have IP addresses assigned to the interfaces, we can ping the container from the host (and vice versa)!

# Ping the container from the host
$ ping -c 1 192.168.99.2
PING 192.168.99.2 (192.168.99.2) 56(84) bytes of data.
64 bytes from 192.168.99.2: icmp_seq=1 ttl=64 time=0.062 ms

--- 192.168.99.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.062/0.062/0.062/0.000 ms

# Ping the host from the container
$ ip netns exec container-one ping -c 1 192.168.99.1
PING 192.168.99.1 (192.168.99.1) 56(84) bytes of data.
64 bytes from 192.168.99.1: icmp_seq=1 ttl=64 time=0.042 ms

--- 192.168.99.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.042/0.042/0.042/0.000 ms
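
If you'd like to actually watch those packets cross the veth, tcpdump on the host side of the pair works nicely (assuming you have tcpdump installed); run it in one terminal and re-run the ping in another:

# Watch ICMP traffic on the host end of the veth pair.
$ tcpdump -ni veth1 icmp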

A brief (but luckily for you, optional!) overview of kernel packet routing

By default, your operating system will configure network routes to reach a few places: localhost, your local area network (i.e., everything on this side of your router), and the internet (via your router). We can see these routes (excluding localhost, which is routed via a different table — that's outside the scope of this discussion, but you can see these routes with ip route list table local).

$ ip route list
default via 192.168.16.1 dev ens4
192.168.16.0/24 dev ens4 scope link
192.168.99.0/24 dev veth1 proto kernel scope link src 192.168.99.1

The first route is the default route. It's used if no other more specific route is available, and in this case, tells the kernel to forward that traffic to the router (which usually knows how to access the broader internet).

The second route is the LAN route. It says that anything in the 192.168.16.XXX range is reachable directly via the ens4 network device. This traffic skips the router hop since all IP addresses are directly reachable (excluding a long and not super interesting digression into ARP).

Finally, the third route exists because we added the address 192.168.99.1/24 to the device veth1 (and since the device is in state UP): the kernel implicitly adds a routing rule to route all packets in the 192.168.99.0/24 subnet through veth1 (that is, if a packet is to a 192.168.99.XXX address, it is sent out through the veth1 device). Assigning an entire /24 subnet to the veth might seem like overkill, but theoretically, we could assign multiple IP addresses to the container's interface.
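
As a quick (and purely hypothetical) illustration of that last point, we could hang a second address off the container's interface and the host's existing 192.168.99.0/24 route would cover it too; 192.168.99.12 below is just an arbitrary unused address from the subnet:

# Add a second address to the container's interface...
$ ip -n container-one addr add 192.168.99.12/24 dev eth0
# ...and the host can reach it via the same route through veth1.
$ ping -c 1 192.168.99.12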

Bridge interfaces

We're back to the non-optional stuff here. Pay attention, kids!

Usually we don't create veths directly between the host and the container (back to our physical computer analogy: we don't usually plug an ethernet cable directly between two computers; usually it goes through some kind of router, which can be connected to other computers!).

Instead, we create a bridge[1] and assign the host end of the veth (remember -- veths always come in pairs!) to the bridge. This allows multiple containers to communicate (if they're connected to the same bridge), which is usually pretty useful.

We can create a bridge in just a few commands:

$ ip link add bridge0 type bridge
$ ip link set bridge0 up

$ ip link
# ...<snip>...
83: bridge0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether aa:a8:75:ec:75:2f brd ff:ff:ff:ff:ff:ff

Back to our physical world analogies, this bridge isn't connected to anything: it's just sitting on your desk with no cables connected. Now we need to recreate the veth pair and configure it to be a member of the bridge (this is analogous to plugging one end of the veth into the bridge).

# Delete the veth (this deletes both ends)
$ ip link del veth1

# Create the veth...
$ ip link add veth1 \
  type veth \
  peer name veth1a

# ...and add it to the bridge
$ ip link set veth1 master bridge0

# And as before, configure the various ends of the veth
$ ip link set veth1a netns container-one
$ ip -n container-one link set veth1a name eth0
$ ip link set veth1 up
$ ip -n container-one link set eth0 up

Assigning addresses is just a little bit different. We assign the host-side IP address to the bridge rather than to the veth:

$ ip addr add 192.168.99.1/24 dev bridge0
$ ip -n container-one addr add 192.168.99.2/24 dev eth0

# Let's confirm that we can still reach the container...
$ ping -c 1 192.168.99.2
# ...<snip>...

# and that the container can still reach the host....
$ ip netns exec container-one ping -c 1 192.168.99.1
# ...<snip>....

Let's also quickly create another network namespace and connect it to the bridge!

$ ip netns add container-two
$ ip link add veth2 \
  type veth \
  peer name veth2a
$ ip link set veth2 master bridge0
$ ip link set veth2a netns container-two
$ ip -n container-two link set veth2a name eth0
$ ip link set veth2 up
$ ip -n container-two link set eth0 up

# Add a different IP address to the container's interface
$ ip -n container-two addr add 192.168.99.3/24 dev eth0

# Confirm that the host can reach the container and vice versa...
$ ping -c 1 192.168.99.3
$ ip netns exec container-two ping -c 1 192.168.99.1

We have something that looks like this:

Three network namespaces: the host namespace, which contains an interface that is connected to the internet as well as a bridge which the veth1 and veth2 devices are connected to; and two container namespaces which each contain an eth0 device connected to the respective veth device in the host namespace.

Both container-one and container-two are connected to the same bridge (which lives in the host's network namespace). When a packet reaches the bridge, it is forwarded to the correct veth if there is one that matches the address[2].
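
And since both containers sit on the same bridge and the same subnet, they should be able to reach each other directly. A quick sanity check:

# container-one (192.168.99.2) pings container-two (192.168.99.3)...
$ ip netns exec container-one ping -c 1 192.168.99.3
# ...and vice versa.
$ ip netns exec container-two ping -c 1 192.168.99.2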

Routing traffic to the internet

Finally, we probably want to let our containers reach the internet. Since our host only has one IP address connected to the LAN, we'll need to use NAT (network address translation). Put simply, NAT intercepts outgoing packets from an internal IP address, routes them to the internet using the host's IP address, and re-routes incoming packets back to the original internal IP address. Effectively, as far as the external internet is concerned, traffic from both containers will look like it's coming from the host's IP address.

To do this, we'll need to set up the host to forward packets between interfaces and then NAT packets coming from the bridge device.

Forwarding has a very particular meaning in the context of Linux packet routing. There are generally three pathways for a packet to flow through a network namespace (these correspond to the various iptables chains):

  • INPUT — packets that are received by the network namespace and are destined for a process that lives within that namespace (e.g., destined for a webserver listening on port 8080).
  • OUTPUT — packets that are leaving the network namespace (via some device within it) that originated from a process that lives within the network namespace (e.g., a curl request that is destined for the external internet).
  • FORWARD — packets that are neither being sent nor received by the network namespace, but are being routed through the namespace (e.g., a router receives a packet from a LAN that is destined for the internet: the packet is received on an ethernet port and then is forwarded to the internet via another ethernet port, often with NAT applied).

By default, the kernel will not forward packets between interfaces, so we need to tell the kernel that it should (and configure which pathways should be allowed for forwarding).

# First, we'll need to tell the kernel that it should forward packets between
# interfaces. By default, the kernel sees a packet coming in on `bridge0` and
# won't automatically forward it out of `ens4`.
$ sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1

# Second, we'll need to configure the iptables firewall rules to allow
# forwarding packets that come in on `bridge0` and are destined for the
# internet (which is reachable on ens4).
$ iptables -I FORWARD -i bridge0 -o ens4 -j ACCEPT

# And finally, we'll need to configure iptables to NAT packets. MASQUERADE
# here refers to the fact that packets from the containers appear to be coming
# from ens4's IP address.
$ iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE
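
If you're feeling paranoid (reasonable!), you can double-check that all of that stuck; these are just read-only inspection commands:

# Confirm the sysctl took effect.
$ sysctl net.ipv4.ip_forward
# List the FORWARD chain and the NAT POSTROUTING chain (with packet counters).
$ iptables -L FORWARD -v -n
$ iptables -t nat -L POSTROUTING -v -n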

We also need to make one minor modification to the container networks. The default IP routing rules only know how to route traffic destined for the 192.168.99.0/24 network. We need to install a default route that sends all other traffic to the gateway (in this case, the bridge):

# Let's confirm that we don't know how to route to the internet
$ ip -n container-one route list
192.168.99.0/24 dev eth0 proto kernel scope link src 192.168.99.2

# And now let's add a default rule to forward the packet to the bridge
# interface where the host's networking configuration will NAT it to the
# external internet
$ ip -n container-one route add default via 192.168.99.1

And finally, we can route traffic to the Internet from our container:

$ ip netns exec container-one ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=117 time=18.5 ms

The bridge device is now connected inside the host namespace to the ens4 device which is able to reach the internet.

Both containers are connected to a single bridge. The dashed line between the bridge and the ens4 device (which is the route to the broader internet) represents our IP forwarding config that we set up above. Packets from a container are NAT-ed when they exit the ens4 device (since the host only has one external IP address to use).
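
By the way, if you want to tear all of this down later, something like the following should do it. Deleting a namespace also destroys the veth end that lives inside it, which takes the host-side peer with it:

$ ip netns del container-one
$ ip netns del container-two
$ ip link del bridge0
# Remove the iptables rules we added earlier.
$ iptables -D FORWARD -i bridge0 -o ens4 -j ACCEPT
$ iptables -t nat -D POSTROUTING -o ens4 -j MASQUERADE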

Appendix: Firewalling containers and routing out of different network interfaces

The interaction between IP routing (iproute2) and the firewall (iptables) is non-obvious. Critically, some iptables rules are applied before making a decision on how to route the packet and some are made after. See this lovely diagram that describes where in the process of handling a packet the routing and firewall decisions are made (note: filter FORWARD and filter INPUT are where the ACCEPT/REJECT/DROP rules will be applied).

The origin of this blog post involved your handsome and charming author attempting to figure out how to route packets from a container out of one of two network devices on the host while denying access to any internal IP addresses, so here are a few lessons learned.

  • To filter packets originating from containers, use iptables -A FORWARD -i bridge0 ... rules. All your firewalling should happen within the FORWARD chain (never the nat/POSTROUTING chain, especially since it's impossible to filter based on input device in that chain).

  • It's helpful to have a dedicated routing table for the container traffic. You'll need to manually recreate the routes that would otherwise have been added by default. For example,

    # 88 is an arbitrary number; custom routing tables are identified by integers
    # in the range [1,252]. You can define a name by adding an entry to
    # /etc/iproute2/rt_tables.
    ip rule add oif bridge0 lookup 88
    ip rule add iif bridge0 lookup 88
    ip rule add to 192.168.99.0/24 lookup 88
    ip rule add from 192.168.99.0/24 lookup 88
    
    # This should be the gateway for the ens5 device
    ip route add default via 10.32.0.1 dev ens5 table 88
    # And we need to add a route to reach other containers on the bridge
    ip route add 192.168.99.0/24 dev bridge0 table 88
    

    Depending on your network configuration, you may need to add more routes.

  • If you, like me, wish to filter access to internal subnets, the following should work:

    RANGES="10.0.0.0/8 100.64.0.0/10 169.254.0.0/16 172.16.0.0/12 192.168.0.0/16"
    for cidr in $RANGES; do
      iptables -A FORWARD -i bridge0 -d $cidr -j REJECT
    done
    iptables -A FORWARD -i bridge0 -o ens5 -j ACCEPT
    
  • Make debugging easier by adding an iptables -I FORWARD -j LOG --log-prefix "[FORWARD] " rule. You can view these logs with dmesg.

  • Make sure that your container network namespaces are still able to reach a DNS server (since the system's resolver is often configured to a local IP address); a quick check is sketched below.
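
    A rough sketch of that check (nslookup and the 8.8.8.8 fallback resolver here are just examples):

    # Show the resolv.conf the namespace sees; ip netns exec bind-mounts
    # /etc/netns/container-one/resolv.conf over /etc/resolv.conf if it exists.
    $ ip netns exec container-one cat /etc/resolv.conf
    # Try resolving against a public resolver to separate "DNS is broken" from
    # "the configured local resolver isn't reachable from the namespace".
    $ ip netns exec container-one nslookup example.com 8.8.8.8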

Footnotes

  1. A bridge is a device (either physical or virtual) that connects several subnetworks (though not necessarily discrete subnets). Oftentimes, bridges just connect directly to individual devices (e.g., your computer is plugged directly into your router, which is acting as a bridge), though bridges can be connected to other devices, such as other bridges (e.g., your computer is connected to a bridge on your desk which is itself connected to your home's internet router).

  2. Technically (to the best of my very limited understanding), the bridge doesn't deal with IP addresses at all: it forwards at the ethernet layer. It learns which MAC addresses live behind which attached veth by watching traffic, and the namespaces themselves use ARP to figure out which MAC address corresponds to which IP address; each end of a veth has its own unique MAC address.
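
     If you're curious what the bridge has actually learned, iproute2 ships a companion bridge command that can dump the attached ports and the forwarding database; for example:

     # Show the ports attached to bridges and the MAC addresses bridge0 has learned.
     $ bridge link show
     $ bridge fdb show br bridge0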