Google Cloud NAT (network address translation) is a service that allows Cloud workloads, running on both servers and serverless environments, to access the Internet without the need for an external IP address. While the configuration is straightforward, there are some significant differences between Cloud NAT and a traditional NAT gateway that become important for a successful Cloud NAT configuration.
How traditional NAT gateways work
For two computers to communicate over the Internet, each needs a public Internet IP address. The sending computer creates a TCP or UDP packet with its IP address as the source, and the IP address of the computer it is communicating with as the destination. For example, let’s imagine that you, as an end user, are using a web browser on your home computer to visit Google’s home page. Your computer is directly connected to the Internet and has a public IP address of 126.96.36.199, which has been assigned by your Internet Service Provider (ISP). The Google web server has an IP address of 188.8.131.52. The packet looks like:
The source port is selected by your operating system, and the Google server port is 443 (the designated port for HTTPS traffic). When Google’s server responds, the response looks like:
Notice that the source information from the request becomes the destination information for the response, and the destination information becomes the source.
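The swap described above can be sketched in a few lines of Python. This is an illustrative model only: the `Packet` class and `make_reply` function are hypothetical names, and the source port 50000 is an assumed value standing in for whatever port the operating system would pick.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Only the addressing fields of a TCP/UDP packet header."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def make_reply(request: Packet) -> Packet:
    """Build the reply header by swapping source and destination."""
    return Packet(
        src_ip=request.dst_ip,
        src_port=request.dst_port,
        dst_ip=request.src_ip,
        dst_port=request.src_port,
    )


# 50000 is a hypothetical source port; the operating system picks the real one.
request = Packet("126.96.36.199", 50000, "188.8.131.52", 443)
response = make_reply(request)
print(request)
print(response)
```

Replying to the response yields the original request header again, which is why both sides can keep exchanging packets using only the four addressing fields.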
As simple and straightforward as this may be, it’s actually not a common configuration. Most end user computers do not use the IP address assigned by their ISP directly. Instead, they use private IP addresses, also known as RFC 1918 IP addresses, and they use a NAT gateway.
In a typical home network, a router, provided by the ISP or purchased by the end user, has a network interface that connects to the Internet. This interface is assigned a public IP address by the ISP. The router also has an interface that connects to the local area network, or LAN. This interface is assigned an IP address from one of the private IP address ranges.
When a computer wants to connect to a server on the Internet, that traffic traverses the router. But instead of simply passing the packet through to the Internet, the router rewrites the packet header that holds the address information. This process is referred to as network address translation: a home router is not just a router, it’s also a NAT gateway.
Below is the same request to Google for their default web page that we saw earlier, but this time processed by a NAT gateway.
Notice that in this example, our end user has an IP address of 192.168.1.10. This is an RFC 1918 private address, which is not a routable Internet address and cannot be used directly to target an Internet server. But because the NAT gateway receives all traffic from the local network that is destined for the Internet, it is able to intercede. Instead of simply forwarding the packet, the NAT gateway creates a new packet, using its own external, public IP address as the source of the traffic, and sends this new packet from the external, Internet-facing interface.
When Google’s server receives this request, it creates its response packet as follows:
Notice that in this case, the source is the Google server’s IP address, and the destination is the NAT gateway’s public IP address. When the NAT gateway receives the packet, it creates a new packet with the client’s private RFC 1918 IP address as the destination and sends this new packet out of its LAN interface to the end user computer.
To the end user computer, it appears as though it had a conversation directly with the Google server. To the Google server, it appears as though it had a conversation with the NAT gateway. The packet translation was invisible to both sender and recipient.
You may also have noticed that the port number remained 443 for the Google server, but was altered for the end user computer. This is because the NAT gateway is sharing its single public IP address with all the devices on the local network. To ensure that the source port is unique across TCP connections, the NAT gateway maps each request to one of its available ports. This will become very important when configuring Google Cloud NAT.
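To make the port mapping concrete, here is a minimal sketch of a port-translating NAT table in Python. It is a toy model, not how any particular router is implemented: the `NatGateway` class, the public address 203.0.113.5 (taken from a range reserved for documentation), and the client port 50000 are all assumptions for illustration, and mappings are never expired as a real gateway would do.

```python
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class NatGateway:
    """Toy port-translating NAT: one public IP shared by many private hosts."""

    def __init__(self, public_ip: str, first_port: int = 1024, last_port: int = 65535):
        self.public_ip = public_ip
        # Ports are handed out in order and never released in this sketch.
        self._free_ports = iter(range(first_port, last_port + 1))
        self._out: Dict[Endpoint, int] = {}  # private endpoint -> public port
        self._in: Dict[int, Endpoint] = {}   # public port -> private endpoint

    def translate_outbound(self, src: Endpoint, dst: Endpoint) -> Tuple[Endpoint, Endpoint]:
        """Rewrite the source of a LAN packet to the gateway's public IP and port."""
        if src not in self._out:
            try:
                port = next(self._free_ports)
            except StopIteration:
                raise RuntimeError("no free source ports left") from None
            self._out[src] = port
            self._in[port] = src
        return (self.public_ip, self._out[src]), dst

    def translate_inbound(self, src: Endpoint, dst: Endpoint) -> Tuple[Endpoint, Endpoint]:
        """Rewrite the destination of a reply back to the private endpoint."""
        _, public_port = dst
        return src, self._in[public_port]


gateway = NatGateway("203.0.113.5")      # hypothetical public address
client = ("192.168.1.10", 50000)         # hypothetical client source port
server = ("188.8.131.52", 443)

out_src, out_dst = gateway.translate_outbound(client, server)
reply_src, reply_dst = gateway.translate_inbound(server, out_src)
print(out_src, "->", out_dst)
print(reply_src, "->", reply_dst)
```

A second device that happens to use the same local source port receives a different public port, which is how the gateway keeps the two conversations apart while presenting a single public IP address.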
How is Cloud NAT different from a router’s NAT?
Depending on how many devices are on the local network, the NAT gateway may be processing a significant amount of traffic. This has the potential to be a performance bottleneck, as multiple devices can easily generate more traffic than a single NAT gateway can handle. In addition, a traditional NAT gateway appliance represents a single point of failure. Should the NAT gateway go offline, all the devices on the local network will lose their connection to the Internet.
While Cloud NAT may perform the same function as a traditional NAT gateway, its architecture is significantly different. Unlike a home router, it is not a single device. Actually, it is not a device at all. It is part of Google Andromeda, which is Google Cloud’s network virtualization stack. Network Address Translation services are “baked into” the network that your cloud workloads use to communicate (referred to as the VPC, or Virtual Private Cloud). This distributed architecture means that there are no single points of failure, and no chokepoints to traffic or bottlenecks to performance. It is scalable and reliable in a way that home or office routers could never be.
Cloud NAT configuration
While Cloud NAT’s distributed architecture offers significant advantages, it also presents some configuration challenges.
Each public IP address for a NAT gateway has access to 65,536 TCP source ports and 65,536 UDP source ports. The first 1,024 of each are considered “well-known” and are reserved for specific applications. That leaves 64,512 TCP ports and 64,512 UDP ports available. With a traditional NAT gateway, all of these ports are given out on a “first come, first served” basis. No thought needs to be given to how many ports any individual device on the local network may require.
But due to its distributed nature, Cloud NAT allocates ports differently. By default, it assigns 64 ports to each device that requests connections to the Internet, regardless of how many ports that device actually requires. This means that if a single VM opens more than 64 simultaneous TCP connections to the same destination IP address and port, any connections beyond that will fail. While it is possible to increase the number of ports assigned, it cannot be done for an individual device. The same number of ports will be assigned to all devices, regardless of need.
For example, let’s say you have twenty VMs that are sharing a Cloud NAT gateway. One of them needs to be able to open 10,000 simultaneous connections, and the rest need 10 to 20 connections. If you raise the number of ports per instance to 10,000, you will only be able to support six VMs with a single external IP address (64,512 total ports / 10,000 ports per VM = 6.45, rounded down to 6 VMs).
If you set Cloud NAT to automatically assign IP addresses, it will add IP addresses as necessary, increasing the total number of ports available. In the above example, you would need four external IP addresses to provide 10,000 ports for 20 devices. Keep in mind that as of February 1, 2024, Google will assess a charge for external IP addresses used by a NAT gateway, so you should consider this when allocating IP addresses. Also remember that if you choose to manually specify static IP addresses for Cloud NAT, it will not add external IP addresses automatically, so you need to create enough static IP addresses to accommodate your requirements.
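The sizing arithmetic above is easy to reproduce. The following Python sketch uses hypothetical helper names (`vms_per_ip`, `ips_needed`) and follows the article's simplification that every VM receives the same fixed number of ports, all drawn from a single external IP address.

```python
import math

# Usable source ports per external IP address: 65,536 minus the 1,024 well-known ports.
USABLE_PORTS_PER_IP = 65_536 - 1_024  # 64,512


def vms_per_ip(ports_per_vm: int) -> int:
    """How many VMs one external IP can serve at a fixed per-VM port count."""
    if ports_per_vm < 1:
        raise ValueError("ports_per_vm must be at least 1")
    return USABLE_PORTS_PER_IP // ports_per_vm


def ips_needed(vm_count: int, ports_per_vm: int) -> int:
    """How many external IPs are needed to give every VM its port allocation."""
    per_ip = vms_per_ip(ports_per_vm)
    if per_ip == 0:
        raise ValueError("ports_per_vm exceeds the ports available on one IP")
    return math.ceil(vm_count / per_ip)


print(vms_per_ip(10_000))      # 6 VMs per external IP
print(ips_needed(20, 10_000))  # 4 external IPs for 20 VMs
print(vms_per_ip(64))          # 1008 VMs per external IP at the default of 64 ports
```

Running the numbers this way before choosing a per-VM port count shows how quickly a large allocation multiplies the number of billable external IP addresses.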
Recently, Cloud NAT has added the concept of dynamic port allocation. Setting this configuration allows Cloud NAT to allocate a different number of ports to different VM instances based on each VM’s usage. This configuration requires you to set a minimum and maximum number of ports to be allocated to each VM. When a VM initially creates a connection to the Internet, it will be assigned the minimum number of ports you defined. As the VM creates additional connections and gets closer to exhausting the ports allocated to it, Cloud NAT will double the number of ports assigned. It will continue to do this until it reaches the maximum number of ports you defined. When a VM’s port usage significantly decreases, the ports are deallocated and made available to other VMs that use the NAT gateway. Keep in mind that this does not happen immediately, so there may still be situations where your workloads could temporarily run out of ports.
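The doubling behavior can be illustrated with a short Python sketch. It models only the sequence of allocation sizes a single VM would pass through; the function name `allocation_steps` and the example limits of 64 and 1,024 ports are assumptions, and the usage level at which Cloud NAT actually triggers each doubling is not modeled here.

```python
from typing import List


def allocation_steps(min_ports: int, max_ports: int) -> List[int]:
    """Return the per-VM port allocations reached by repeated doubling.

    Starts at min_ports and doubles until max_ports is reached,
    never exceeding the maximum.
    """
    if min_ports < 1 or max_ports < min_ports:
        raise ValueError("require 1 <= min_ports <= max_ports")
    steps = [min_ports]
    while steps[-1] < max_ports:
        steps.append(min(steps[-1] * 2, max_ports))
    return steps


# Hypothetical limits: a minimum of 64 and a maximum of 1,024 ports per VM.
print(allocation_steps(64, 1024))  # [64, 128, 256, 512, 1024]
```

With these example limits, a busy VM needs four doublings to reach its maximum, which is one reason a workload with a sudden burst of connections can still run short of ports for a moment.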
Once you have configured the port allocation method and limits based on your workloads, you should consider creating an alert to monitor for out-of-resources situations. You can do this in Google Cloud Monitoring by creating an alert that triggers on the “Cloud NAT Gateway - Sent packets dropped count” metric. This metric will increase when Cloud NAT does not have an available source port for a connection.
Google Cloud NAT is a good solution for workloads that require outbound Internet connectivity. It provides security, scalability, and fault tolerance, and is compatible with Compute Engine instances, Kubernetes clusters, and serverless workloads (Cloud Run, App Engine, Cloud Functions). However, planning around port usage is necessary to ensure that workloads do not experience connection failures.