Troubleshooting Guide: Amazon EKS Networking

TL;DR

Amazon EKS is a fully managed Kubernetes service that abstracts away much of the complexity of Kubernetes cluster management, letting you focus on your applications rather than the underlying infrastructure. Yet, like all abstraction layers, EKS introduces complexity of its own, particularly around networking and authorization. In this blog post, we cover:

  • EKS networking considerations,
  • EKS VPC considerations,
  • Troubleshooting scenarios,
  • EKS networking recommendations.

Abstract

Kubernetes has proved to be one of the most widely adopted DevOps tools of the past few years. Thanks to its advantages, it has become the default container orchestration solution for many companies. With that widespread usage comes a set of problems that Kubernetes administrators may face while deploying their applications, and networking configuration is one of the most common sources of trouble. In this article, we will shed light on this category of issues, explain why they appear, and discuss possible troubleshooting approaches.

About Amazon EKS

Before jumping into networking issues and how to troubleshoot them, let's first talk about Amazon EKS.
Amazon developed Amazon EKS in response to the growing popularity of Kubernetes for container orchestration. Kubernetes has become the de facto standard for container orchestration, with many organizations adopting it for their containerized workloads. Deploying and managing Kubernetes clusters can be complex and time-consuming, especially in a cloud environment where there are many infrastructure components to manage. Amazon created EKS to simplify this by providing a fully managed service that abstracts away much of the complexity of Kubernetes cluster management, allowing customers to focus on their applications rather than the underlying infrastructure.

EKS Cluster Architecture. Image from Amazon

Amazon EKS: Networking Considerations

Before diving into troubleshooting networking issues with Amazon EKS, let's first discuss considerations related to VPCs, subnetting, etc.

VPC Considerations

Amazon VPC

A Virtual Private Cloud (VPC) is simply your own network in the cloud. Each VPC has an allocated range of IP addresses and lives in an AWS region. A region is a separate geographic area that contains several isolated data centers called Availability Zones. Inside a VPC, customers build their own subnets where AWS services are hosted. Customers get a default VPC in each region, and they can create, manage, and delete additional customized VPCs depending on their resource usage.

Amazon EKS VPC considerations

When creating a Kubernetes cluster from scratch, you can select whatever VPC and subnets you want to host both the K8s control plane nodes and their related worker nodes. In Amazon EKS, however, the control plane is hosted in an AWS-managed VPC while the worker nodes are hosted in the VPC specified by the customer. AWS manages the whole control plane and its related tasks, such as upgrading the cluster, scaling, and replacing unhealthy control plane instances. Unlike a traditional cluster that you might build yourself in the cloud, Amazon EKS imposes additional considerations that you need to respect while building it. Here is a short list of the most crucial ones:

• The selected Amazon VPC needs to have enough IP addresses to host the expected components of the Amazon EKS cluster.  

• If you suspect that the chosen Amazon VPC doesn’t have enough vacant IP addresses for the cluster, you can associate up to five additional IPv4 CIDR blocks with your VPC to fill your need for IP addresses.

• To resolve IP address deficiency, you can take advantage of a shared services VPC. You can simply connect your current VPC to it via a transit gateway.

• The VPC must have DNS hostname and DNS resolution support; otherwise, nodes can't register with your cluster. With both enabled, the cluster nodes can resolve the cluster endpoint and join it (a quick CLI check is sketched just after this list).
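
Here is a minimal sketch of how these VPC settings can be checked and adjusted with the AWS CLI; the VPC ID and the extra CIDR block below are hypothetical placeholders for your own values.

# Check whether DNS resolution and DNS hostnames are enabled on the VPC
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsHostnames

# Enable them if needed
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-hostnames "{\"Value\":true}"

# Associate an additional IPv4 CIDR block if the VPC is running out of addresses
aws ec2 associate-vpc-cidr-block --vpc-id vpc-0123456789abcdef0 --cidr-block 100.64.0.0/16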

Subnet Considerations

To create a healthy Amazon EKS cluster, there is a list of requirements that the subnets hosting the cluster need to respect. In this section, we detail and explain these requirements. We strongly recommend that you make sure your subnets fulfil them, especially if you are reusing existing subnets that already host other services.

• Hosting the Amazon EKS cluster in subnets with sufficient free IP addresses (not used by other services) for the upcoming Kubernetes cluster:

At least six available IP addresses are required to build a healthy Amazon EKS cluster. However, to avoid an eventual shortage of IP addresses, we recommend hosting the Amazon EKS cluster in subnets with a minimum of sixteen available IP addresses.

• Building Amazon EKS into either IPv4-only subnets or dual-stack (IPv4+IPv6) subnets:
The subnets hosting the cluster should support IP-based naming because Amazon EKS doesn't support Amazon EC2 resource-based naming. Therefore, you cannot build an Amazon EKS cluster in IPv6-only subnets because, by default, they use the resource-name hostname type.
In both dual-stack subnets and IPv4-only subnets, however, you can select either the IP name or the resource name. A quick check of a subnet's available IP addresses is sketched below.
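
As a minimal sketch, and assuming hypothetical subnet IDs, you can check how many free IP addresses each candidate subnet still has with the AWS CLI:

# How many IP addresses are still available in each candidate subnet?
aws ec2 describe-subnets --subnet-ids subnet-0aaa111 subnet-0bbb222 \
  --query 'Subnets[].{Subnet:SubnetId,FreeIPs:AvailableIpAddressCount}'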

IP-Based Naming vs Resource-Based Naming

IP-based naming and resource-based naming are the two ways to define a hostname type for AWS resources. The first relies mainly on the IP address of the resource to allocate a meaningful and unique name, while the second uses the EC2 resource ID itself.
In our context, we mainly focus on IP-based naming rather than resource-based naming. When you launch an instance in any region, the private IPv4 address of the instance is included in the hostname of the instance, following these structures:
– IP-based naming structure for the us-east-1 Region: private-ipv4-address.ec2.internal
– IP-based naming structure for any other AWS Region: private-ipv4-address.region.compute.internal

An example of IP-based naming is ip-10-128-0-15.us-east-2.compute.internal. From it, you can conclude that the resource is an Elastic Compute Cloud (EC2) instance, that its private IP address is 10.128.0.15, and that it is hosted in the us-east-2 region. Amazon EKS relies on IP-based naming rather than resource-based naming because of the useful information it encodes about the service, IP address, and region.
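
If you need to confirm or change the hostname type of an existing subnet, a sketch along the following lines should work; the subnet ID is a placeholder, and the --private-dns-hostname-type-on-launch option of aws ec2 modify-subnet-attribute is assumed to be available in your CLI version:

# Show the hostname type currently assigned to instances launched in this subnet
aws ec2 describe-subnets --subnet-ids subnet-0aaa111 \
  --query 'Subnets[].PrivateDnsNameOptionsOnLaunch'

# Switch the subnet to IP-based naming so that EKS nodes can register
aws ec2 modify-subnet-attribute --subnet-id subnet-0aaa111 \
  --private-dns-hostname-type-on-launch ip-name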

• Amazon EKS managed node and AWS Fargate Network Restrictions

It's important to be aware that AWS Fargate, Amazon EKS managed nodes, and self-managed nodes may all be used together to schedule pods on an Amazon EKS cluster. Each type of node fits specific use cases and comes with its own capabilities. For instance, Amazon EKS managed node groups automate the provisioning and lifecycle management of nodes, whereas running containers on EC2 Dedicated Hosts requires self-managed nodes.
If you intend to take advantage of Amazon EKS managed nodes or AWS Fargate, you cannot host your cluster in AWS Wavelength, an AWS Local Zone, or AWS Outposts. In those environments you can still build your cluster with self-managed nodes, a choice that can be challenging, especially for people new to Kubernetes and AWS.

Private Cluster Considerations

In this section, we will shed some light on hosting the Amazon EKS cluster in private subnets. Although Amazon EKS clusters work perfectly well in public subnets, it's recommended to build them in private subnets. In AWS, newly created subnets are private by default; a subnet becomes public once its route table contains a route to an internet gateway attached to the VPC, granting it access to the internet. The main reason AWS makes this recommendation is the safety and security of the worker nodes hosted in the subnet (nested in the customer-managed VPC). In other words, keeping the cluster in private subnets is the equivalent of keeping it secure.
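
A quick way to tell whether a subnet is effectively public is to look at its route table and check for a route pointing at an internet gateway. A minimal sketch, assuming a hypothetical subnet ID:

# List the routes of the route table associated with the subnet.
# A route whose Gateway starts with "igw-" means the subnet is public.
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-0aaa111" \
  --query 'RouteTables[].Routes[].{Destination:DestinationCidrBlock,Gateway:GatewayId,NatGateway:NatGatewayId}'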

Amazon EKS Troubleshooting Scenarios

After specifying the requirements and recommendations for a healthy Amazon EKS cluster, in this section we walk through the errors you may confront while building or managing a Kubernetes cluster on the AWS public cloud.

Node Creation Failure

Before diving deep into this troubleshooting scenario, let's first review the node concept in Kubernetes.

Nodes in K8S

A Kubernetes cluster that is built to orchestrate containerized applications is composed of one or many nodes.  

• K8s worker nodes: These are the instances on which the application's containers are running. Each container is hosted in a pod. Kubernetes takes advantage of the resources on the worker nodes to run the containers of the orchestrated application.

• K8s control plane: When there are several containers running on several nodes, there must be an entity that controls the worker nodes. This entity keeps track of the status of all the containers, which node they are running on, and whether they have crashed or are healthy. It also repairs and regenerates killed containers, and it decides on which worker node each pod will be hosted depending on the resources available on each worker node joined to the cluster. All these responsibilities and more are assigned to the Kubernetes control plane. You may find older documents referring to this same entity as the master nodes.

Unlike legacy Kubernetes clusters hosted on-premises and running on hypervisors, an Amazon EKS cluster uses independent Amazon EC2 instances as worker nodes, hosted in the customer-managed VPC, while the control plane is hosted in an AWS-managed VPC (out of your reach). The worker nodes and the control plane work in harmony to fulfil the required orchestration of the containers. For instance, on each worker node a kubelet agent reports the status of the pods running on it to the control plane's kube-apiserver. This same agent receives and applies the orders of the control plane whenever a pod needs to be created, deleted, recreated, or modified.
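
A first sanity check for node registration problems is simply to list the nodes the control plane knows about and inspect the ones that look unhealthy. A minimal sketch, reusing the hypothetical node name from earlier:

# List the worker nodes registered with the cluster and their status
kubectl get nodes -o wide

# Inspect the conditions and events of a node that is NotReady or missing
kubectl describe node ip-10-128-0-15.us-east-2.compute.internal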

Why Nodes May Fail to Join the Cluster?

There are a few reasons why you may see the NodeCreationFailure message.
This error means that the launched instances were not able to register with your Amazon EKS cluster.
The primary causes are:

• The cluster doesn't respect the VPC and subnet recommendations detailed above.

• The launched worker node cannot reach out to the Amazon EKS control plane because it doesn’t have the required permissions:  

Just like any other resource on AWS, the EC2 instance (the worker node) needs specific permissions to interact with other entities. The assigned permissions depend on the instance's mission in the architecture and on the other resources it's intended to interact with.
The service responsible for allocating permissions to resources on AWS is AWS Identity and Access Management (IAM).

From this same perspective, the Amazon EC2 instances functioning as worker nodes need a dedicated set of permissions called the 'node IAM role'. With this role, the kubelet on the worker node has permission to call the AWS APIs. Thanks to this predefined collection of permissions, the kubelet can register the launched instance with the control plane in the first place and then exchange API calls with the Amazon EKS kube-apiserver.

• The launched worker node cannot reach the Amazon EKS control plane because 'network security rules' are restricting access:
When the Amazon EKS cluster is created, a security group is created for it; if its rules are later tightened so that traffic between the control plane and the nodes is blocked, the nodes cannot register.

Troubleshooting:

After inspecting the reasons why this issue appears, you can already guess the resolution:

• Verify that your cluster is built according to AWS's recommendations related to the VPC and subnet considerations above.

• Inspect the security group related to the Amazon EKS cluster. Check whether it has been kept in its default state or, at least, still has the required outbound rules open so that cluster traffic isn't restricted (a verification sketch follows the IAM steps below).

• Verify that the Amazon EKS node IAM role is allocated to the launched node by following these steps:

1. Create the node-role-trust-relationship.json file:

cat >node-role-trust-relationship.json <<EOF
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Effect": "Allow",
     "Principal": {
       "Service": "ec2.amazonaws.com"
     },
     "Action": "sts:AssumeRole"
   }
 ]
}
EOF

2. Create the node IAM role

aws iam create-role \
  --role-name AmazonEKSNodeRole \
  --assume-role-policy-document file://"node-role-trust-relationship.json"

3. Attach the required IAM-managed policies to the created node IAM role.
There are three mandatory policies: AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly,
and AmazonEKS_CNI_Policy.

aws iam attach-role-policy --role-name AmazonEKSNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name AmazonEKSNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name AmazonEKSNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
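
Once the role and policies are in place, you can verify both the role attachments and the cluster's network configuration. A minimal verification sketch, assuming a cluster named my-cluster and a hypothetical security group ID:

# Confirm the three managed policies are attached to the node IAM role
aws iam list-attached-role-policies --role-name AmazonEKSNodeRole

# Show the VPC configuration of the cluster: subnets, security groups, endpoint access
aws eks describe-cluster --name my-cluster --query 'cluster.resourcesVpcConfig'

# Inspect the rules of the cluster security group returned above
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0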

No Route to Host

Understanding why this error may appear requires a solid grasp of container runtime software and its related networking configuration.

Container Runtime

The container runtime is the beating heart that enables and initiates containerization. In other words, without the container runtime, the container engine cannot communicate with the operating system, the containerization process never launches, and the container is never brought to life.
The container runtime handles all the tasks related to running containers. It mounts the container's filesystem and issues system calls, such as clone(), to the kernel of the operating system on which the containers run. Similar to fork(), clone() creates the new processes that host the containerized workload.

Overlapping Networks

Amazon-managed Kubernetes clusters take advantage of container runtime engines to get the hosted containers up and running on the worker nodes.
One very popular container runtime is Docker, which has gained a good reputation for easy and efficient management of containers.

According to the Amazon EKS User Guide, AWS reserves the 172.17.0.0/16 CIDR range for running Docker containers in Amazon EKS clusters. In other words, the private IP addresses in this range are meant to be kept for Docker only. Whenever you use this same range for other resources in your VPC, you run into an overlapping network issue.

In networking, a private IP address should be assigned to one single network interface in the subnet.
An overlapping network error in Amazon EKS may be displayed as follows:
Error: : error upgrading connection: error dialing backend: dial tcp 172.17.x.y:10250: getsockopt: no route to host

As you can see, the error message points to getsockopt, a function that reads options associated with a socket.
A socket is defined as the combination of an IP address and a port number, written as IP address:Port Number.
The TCP layer uses this pair to identify the application the data is destined for. In our context, the communication happens between the control plane (through the kube-apiserver) and a worker node (through the kubelet). The kubelet listens on TCP port 10250 (and, historically, on the read-only port 10255). The port displayed in the error message above is 10250, the port the kube-apiserver dials over HTTPS to reach the kubelet, for example when executing commands inside a pod.

Because of the overlapping network issue, the IP address in the socket no longer identifies a single, reachable interface. The correct port is already defined and recognized, but the route to the IP address is ambiguous. As a result, establishing the connection becomes impossible, which causes the networking failure above.
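
If you want to confirm whether the kubelet port is reachable at all, a quick connectivity probe from another instance in the same VPC can help. A minimal sketch, with the node's private IP left as a placeholder:

# Test TCP reachability of the kubelet port on the affected node
nc -zv <node-private-ip> 10250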

Troubleshooting:

Whenever you reuse an IP address range that is already spoken for, a networking conflict emerges and Kubernetes can no longer determine the correct route to your host.
By default, the 172.17.0.0/16 CIDR range is reserved for the Docker containers hosted on the Amazon EKS cluster. You simply have to avoid using IP addresses in this specific range.
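
To check whether your VPC overlaps this range, list its CIDR associations and compare them against 172.17.0.0/16. A minimal sketch with a hypothetical VPC ID:

# List every CIDR block associated with the VPC; none should overlap 172.17.0.0/16
aws ec2 describe-vpcs --vpc-ids vpc-0123456789abcdef0 \
  --query 'Vpcs[].CidrBlockAssociationSet[].CidrBlock'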

TLS handshake timeout

Before jumping right into this issue, there are a few concepts we need to shed light on. First and foremost, we have to understand what the TLS protocol is and how it's used in Kubernetes.

TLS Protocol

Transport Layer Security (TLS) is a standard protocol intended to guarantee communications security over a computer network by relying on cryptographic material such as certificates.

You may wonder: why do we need TLS/SSL certificates?
Ever since information started being exchanged remotely over networks, encryption has been a must. It keeps the information unintelligible to intruders and to anyone not meant to read it. Encryption can be established using various algorithms, such as symmetric-key algorithms, asymmetric-key algorithms, and elliptic-curve algorithms.

Encryption guarantees confidential communication: no one except those who hold the keys can decipher the content of the traffic. This is why each communicating party uses an asymmetric-key algorithm to generate a pair of keys (a private key and a public key). Each entity delivers its public key to the peer on the other side and keeps the private key to itself. After this exchange of keys, the entities can generate and exchange a symmetric key that will be used to encrypt their traffic. Although this approach looks sufficient to secure a communication at first sight, it is not. An attacker sitting between the two entities can create two key pairs of their own, distribute the two public keys to the end entities while convincing each that they are the true peer, and keep the private keys.
In this scenario, the "man in the middle" succeeds in deciphering the traffic between the two entities. To avoid this kind of attack, we need a way to prove to each entity the identity of its interlocutor and to tie that identity to its public key. The answer to this need is the electronic certificate.

What is mTLS?

A Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificate is a digital artefact that enables a computer to establish an encrypted network connection to another system after confirming its identity. In this same perspective, we can define mutual TLS, often known as mTLS, a technique for two-way authentication. By confirming that both parties hold the right private key, mTLS makes sure that the parties at either end of a network connection are who they claim to be.
Additional confirmation is provided by the data contained in each party's TLS certificate.

The Content of TLS Certificates

An electronic certificate is a set of data that enables the establishment of communication with a trusted party. It binds the identification of the holder (name, DNS name, etc.) to its public key. This bond is attested by a trusted entity called a Certificate Authority (CA).
The CA signs the holder's certificate by hashing its contents and signing the hash with the CA's private key. All these elements are parts of a Public Key Infrastructure (PKI).

So what is a PKI?

Digital certificates are issued in accordance with public key infrastructure (PKI) rules to safeguard sensitive information, provide distinct digital identities for users, devices, and applications, and ensure secure end-to-end interactions.

SSL/TLS In Kubernetes

To ensure secure communication between the worker nodes and the control plane, as well as between the interacting components within the control plane itself, the usage of TLS is crucial. Entities should communicate using cryptographic techniques that guarantee the confidentiality and integrity of the data. This is why Kubernetes requires PKI certificates for authentication over TLS.
For instance, the kube-apiserver has its own certificate that it uses to authenticate to other control plane entities, such as the etcd server and the kube-scheduler, as well as to the worker nodes (through the kubelet).

TLS handshake timeout

The initiation of a TLS session begins with a handshake. During a TLS handshake, the two communicating sides exchange messages to identify one another, authenticate one another, agree on session keys, and specify the cryptographic techniques they will employ.
The TLS handshake timeout issue may occur for two major reasons:

• The timeout is caused by improper usage of the certificates allocated to Kubernetes entities.

• The timeout is caused by a delay in the traffic going from the nodes to the cluster's public endpoint.

Troubleshooting

In a customer-managed Kubernetes cluster, a TLS handshake timeout may happen due to improper certificate usage.
For example, the handshake may fail because of an expired certificate. At this point, the Kubernetes administrator should verify the status of the certificates used on
both the control plane and the worker nodes, as well as whether the certificate authority (CA) bundle is defined and up to date. In addition, the administrator needs to inspect the traffic between the worker nodes and the Kubernetes API server by troubleshooting network access and firewall rules.
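
For a self-managed cluster, a quick way to look at the validity period of the certificate presented by the API server is to connect to its endpoint with openssl. A minimal sketch, with the endpoint host left as a placeholder:

# Print the notBefore/notAfter dates of the certificate presented by the API server
echo | openssl s_client -connect <api-server-host>:6443 2>/dev/null \
  | openssl x509 -noout -dates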

However, when this same error appears on an Amazon EKS cluster, we can rule out improper certificate usage.
In fact, the provisioning of X.509 credentials for the components of the Amazon EKS cluster is automated through the Kubernetes certificates API, which lets clients of the Kubernetes API request and obtain X.509 certificates from a Certificate Authority.

So, while running an Amazon EKS cluster, you just have to make sure that traffic from the nodes can reach the cluster's public endpoint.
To do this, examine the route tables and inspect the security groups attached to the failing nodes.
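
A minimal sketch of these checks with the AWS CLI, assuming a cluster named my-cluster and hypothetical subnet and instance IDs:

# Confirm the cluster endpoint and whether public access to it is enabled
aws eks describe-cluster --name my-cluster \
  --query 'cluster.{Endpoint:endpoint,PublicAccess:resourcesVpcConfig.endpointPublicAccess}'

# Check the routes available to the subnet hosting the failing node
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-0aaa111" \
  --query 'RouteTables[].Routes[]'

# Review the security groups attached to the failing node
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].SecurityGroups[]'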

Recommendations

This article provides a detailed explanation of possible networking issues on Amazon EKS. Before diving into the many possible solutions for the issue you are confronting, follow the recommendations below. First and foremost, we highly recommend sticking to the exact installation criteria published by AWS itself: following the instructions of the product's owner is the safest and shortest path to a well-functioning cluster orchestrating your applications. Secondly, reading this article alone is not enough to fully master the management of your Kubernetes cluster. Although it contains a detailed explanation of Amazon EKS cluster requirements and troubleshooting, we still recommend a step-by-step reading of the official Amazon EKS user guide, especially for those who operate clusters day to day. Thankfully, AWS provides a wide range of up-to-date white papers and technical guides that administrators can use to build solid knowledge. Every detail counts for good governance of the cluster and for a better understanding of the root causes of installation and operational issues.
