Troubleshooting GCP HA VPN connections can be a wild goose chase if you don't take a pragmatic approach. There are two essentials to keep in mind before you begin. First, a network diagram showing the subnets and CIDR ranges of what you are trying to connect is a must. Second, working your way up the OSI model is a tried and proven framework for resolving networking issues. This article discusses both.
Step 1 - Have a Network Diagram
Not everyone has this available or even thought out, especially for non-production environments. However, it is the actual goal of what you are trying to build; it's your end state. If you don't know what that looks like, at a minimum with pencil and paper, troubleshooting becomes a cumbersome process of re-asking yourself or others for source and destination IPs, gateways, and CIDR ranges. Instead, everyone can reference this diagram as they troubleshoot. Diagram examples and some AWS requirements can be found here.
If you don't have a network diagram, begin drawing one. Ask yourself or your customer the following questions and draw a boundary for each answer. Asking these questions up front, before any troubleshooting, is a good start.
- Where are your two networks?
Example (Network A = GCP us-east, B = AWS us-east)
- What are the non-overlapping subnet IP ranges?
Example (GCP 10.0.1.0/24, AWS 10.0.2.0/24)
- What are the IPs of the servers in each subnet that you would ideally like to connect?
Example (GCP 10.0.1.10, AWS 10.0.2.10)
- What are the gateway IP addresses for GCP Cloud VPN and the AWS gateway?
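Once you have answers to these questions, a few lines of Python can sanity-check the addressing plan before you touch either console. This is only a sketch: the function name `check_plan` and the dictionary layout are made up for illustration, and the values are the example ranges and server IPs from the questions above. It uses nothing but the standard `ipaddress` module.

```python
import ipaddress

def check_plan(subnets, hosts):
    """Return a list of problems found in a two-network addressing plan.

    subnets: dict of network name -> CIDR string
    hosts:   dict of network name -> server IP expected to live in that subnet
    """
    problems = []
    # ip_network() raises ValueError on a malformed CIDR, which catches typos early.
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

    # Every pair of subnets must be disjoint, otherwise routing across the VPN is ambiguous.
    names = sorted(nets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"{a} {nets[a]} overlaps {b} {nets[b]}")

    # Each test server should sit inside the subnet it is drawn in.
    for name, ip in hosts.items():
        if ipaddress.ip_address(ip) not in nets[name]:
            problems.append(f"{ip} is not inside {name} {nets[name]}")
    return problems

# Example values from the questions above.
subnets = {"GCP": "10.0.1.0/24", "AWS": "10.0.2.0/24"}
hosts = {"GCP": "10.0.1.10", "AWS": "10.0.2.10"}
print(check_plan(subnets, hosts) or "addressing plan looks consistent")
```

An empty list means the ranges don't overlap and each server is where the diagram says it is. Anything else is worth fixing on paper before you troubleshoot the tunnel itself.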
Step 2 - Troubleshoot up the layers
A word of caution here. This isn't a be-all and end-all guide that provides precise steps to perform. Also, this guide isn't focused on addressing network delays or poor performance. The goal here is simple: can a server in network A make a connection to a server in network B through GCP HA VPN? These steps are geared towards making sure you start at the bottom of your networking stack and work up the OSI layers.
Work your way through these questions in order and you will pinpoint exactly where your problem exists.
Are the peering and VPC communication established?
- Network logs and cloud consoles should not show errors for IKEv2/BGP handshakes and authentication.
- The correct ASNs should be used (by default, AWS expects the Google side to be set to 65000).
- Are the gateways on both networks set to the correct IP that is in use by the other network?
- If you did this correctly, you will see all tunnels in the AWS and GCP consoles with a successful connection.
- AWS should have "Status" as "UP" in the console.
- GCP will have VPN tunnel "Status" as "Established" along with "BGP session status" as "BGP established".
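If you would rather check the AWS side from a script than from the console, the tunnel status is also reported in the `VgwTelemetry` list returned by boto3's `describe_vpn_connections` call. The sketch below only inspects such a response; the helper name `down_tunnels` is hypothetical, and the sample response is hand-written for illustration (made-up connection ID, documentation-range IPs, illustrative status messages) rather than captured from a real account.

```python
def down_tunnels(response):
    """Return (connection id, outside IP, status message) for every tunnel that is not UP."""
    down = []
    for conn in response.get("VpnConnections", []):
        for tunnel in conn.get("VgwTelemetry", []):
            if tunnel.get("Status") != "UP":
                down.append((
                    conn.get("VpnConnectionId"),
                    tunnel.get("OutsideIpAddress"),
                    tunnel.get("StatusMessage", ""),
                ))
    return down

# Hand-written sample. In practice the dictionary would come from something like:
#   boto3.client("ec2").describe_vpn_connections()
sample = {
    "VpnConnections": [
        {
            "VpnConnectionId": "vpn-0123456789abcdef0",  # made-up ID
            "VgwTelemetry": [
                {"OutsideIpAddress": "203.0.113.1", "Status": "UP", "StatusMessage": "1 BGP ROUTES"},
                {"OutsideIpAddress": "203.0.113.2", "Status": "DOWN", "StatusMessage": "IPSEC IS DOWN"},
            ],
        }
    ]
}

for conn_id, outside_ip, message in down_tunnels(sample):
    print(f"{conn_id} tunnel {outside_ip} is not UP: {message}")
```

Any tunnel this reports is a reason to stay on this question. There is little point in looking at routes or firewalls until every tunnel is up.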
Are routes importing and exporting as expected?
- Have you enabled route propagation in AWS?
- GCP subnets listed in the AWS console will have a value of "YES" for Propagated.
- If that isn't the case, then your ping over a Layer 3 connection won't know how to find the destination.
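To see why a missing propagated route breaks the ping, it helps to look at a route table the way a router does: find the most specific entry that covers the destination. The following is a simplified sketch of that lookup, not how AWS or GCP implement it. The function name `matching_route` is hypothetical, and the route tables are the example CIDR ranges from Step 1.

```python
import ipaddress

def matching_route(destination, routes):
    """Return the most specific route (longest prefix) covering destination, or None."""
    dest = ipaddress.ip_address(destination)
    covering = [net for net in map(ipaddress.ip_network, routes) if dest in net]
    return max(covering, key=lambda net: net.prefixlen, default=None)

# AWS route table before propagation: only the local subnet is known.
before = ["10.0.2.0/24"]
# After propagation, the GCP subnet learned over BGP appears as well.
after = ["10.0.2.0/24", "10.0.1.0/24"]

gcp_server = "10.0.1.10"
print("before propagation:", matching_route(gcp_server, before))
print("after propagation: ", matching_route(gcp_server, after))
```

When the lookup comes back empty for the remote server's IP, the packet has nowhere to go, no matter how healthy the tunnel looks.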
Can you ping an IP from one subnet to the next subnet without DNS?
- Try pinging by IP instead of DNS since that can resolve in ways that complicate your troubleshooting.
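One way to keep DNS out of the picture is to make your test refuse anything that isn't a literal IP address. The sketch below wraps the system `ping` command for that purpose. The function names are hypothetical, and the `-c` (count) flag assumes a Linux or macOS `ping`; Windows uses `-n` instead.

```python
import ipaddress
import subprocess

def ping_command(target, count=3):
    """Build a ping command for a literal IP; raise ValueError for hostnames."""
    ipaddress.ip_address(target)  # rejects anything that would need DNS resolution
    return ["ping", "-c", str(count), target]

def ping_by_ip(target, count=3):
    """Run ping against a literal IP and report whether it succeeded."""
    result = subprocess.run(ping_command(target, count), capture_output=True, text=True)
    return result.returncode == 0

# Example server IP from Step 1; run ping_by_ip() from the GCP VM to test for real.
print(" ".join(ping_command("10.0.2.10")))
```

Passing a hostname raises `ValueError` before any packet is sent, so a DNS problem can't masquerade as a VPN problem.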
Is the GCP network tag applied and permitting ingress/egress?
- Do your tags have the correct ports enabled for ingress and egress?
Are the AWS firewall and subnet ACLs permitting ingress/egress?
- Are your AWS Security Groups set correctly and is your VM/VPC using that security group?
- Are there any AWS network ACLs restricting ingress or egress?
Can you run tcpdump on both VMs?
- This will help you pin down firewall or ACL issues. With tcpdump running on both VMs, you should see a total of four entries for each ping, one for each of the following:
- Leaving GCP VM
- Entering AWS VM
- Leaving AWS VM
- Entering GCP VM
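The four entries can be checked mechanically if you save the capture from each VM. The sketch below assumes the ping is started from the GCP VM and that both captures were taken with something like `tcpdump -n icmp`, which prints lines containing "ICMP echo request" or "ICMP echo reply". The function name `observed_hops` is hypothetical and the sample lines are hand-written for illustration, not real capture output.

```python
import re

ICMP_ECHO = re.compile(r"ICMP echo (request|reply)")

def observed_hops(gcp_capture, aws_capture):
    """Map each of the four expected entries to whether it appears in the captures.

    Assumes the ping originates on the GCP VM, so requests flow GCP -> AWS
    and replies flow AWS -> GCP.
    """
    def kinds(lines):
        # Collect which ICMP echo types ("request", "reply") show up in a capture.
        return {m.group(1) for line in lines if (m := ICMP_ECHO.search(line))}

    gcp, aws = kinds(gcp_capture), kinds(aws_capture)
    return {
        "Leaving GCP VM": "request" in gcp,
        "Entering AWS VM": "request" in aws,
        "Leaving AWS VM": "reply" in aws,
        "Entering GCP VM": "reply" in gcp,
    }

# Illustrative captures: the request leaves GCP but nothing is seen on the AWS VM.
gcp_lines = ["12:00:01.000000 IP 10.0.1.10 > 10.0.2.10: ICMP echo request, id 1, seq 1, length 64"]
aws_lines = []

for hop, seen in observed_hops(gcp_lines, aws_lines).items():
    print(f"{hop}: {'seen' if seen else 'MISSING'}")
```

The first missing entry tells you where to look. In this sample the request left the GCP VM but never reached the AWS VM, which points at routes, security groups, or ACLs on the path into AWS rather than at the GCP VM.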