Hands-on Guide: Resolve Common EKS Cluster Autoscaler Problems

Tal Shladovsky
April 18, 2023
13 min. read

TL;DR

  • AWS EKS cluster autoscaling can optimize resource utilization by scaling up and down based on demand.
  • By using AWS EKS cluster autoscaling, you can reduce cost, improve availability and scalability, while simplifying management.
  • Find out about common issues and troubleshooting tips in this post, including pod placement, networking, and metrics issues.

Overview – What is EKS cluster autoscaling?

Cluster Autoscaler is a Kubernetes component that automatically adjusts the size of a cluster based on the current workload. In Amazon Elastic Kubernetes Service (EKS), Cluster Autoscaler can optimize resource utilization and reduce costs by scaling down nodes when they are not needed and scaling up when demand increases.

When running on EKS, Cluster Autoscaler works by integrating with the Amazon EC2 Auto Scaling group. When a node is needed to meet demand, Cluster Autoscaler sends a request to the Amazon EC2 Auto Scaling group to create a new node. Similarly, when a node is no longer needed, Cluster Autoscaler requests the Amazon EC2 Auto Scaling group to terminate the node.

To use Cluster Autoscaler with EKS, you need to create a Kubernetes deployment or DaemonSet that runs the Cluster Autoscaler container. The deployment or DaemonSet should be configured with the appropriate flags and environment variables to connect to the Amazon EC2 Auto Scaling group and monitor the cluster.
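
As a rough sketch (the flag values below are illustrative, not a definitive configuration), the Cluster Autoscaler container command typically looks something like this, with AWS credentials coming from the node role or a service account and the region from the AWS_REGION environment variable:

# Illustrative Cluster Autoscaler container command (values are examples)
./cluster-autoscaler \
  --cloud-provider=aws \
  --namespace=kube-system \
  --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<CLUSTER_NAME> \
  --balance-similar-node-groups \
  --skip-nodes-with-system-pods=false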

Cluster Autoscaler Policy

One important consideration when using Cluster Autoscaler with EKS is that it requires permission to interact with the Amazon EC2 Auto Scaling group. You can grant these permissions by creating an IAM policy and attaching it to the IAM role used by the EKS nodes.
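
For example, assuming you have already created such a policy (the role and policy names below are placeholders), you can attach it to the node IAM role with the AWS CLI:

# Attach the autoscaling policy to the node IAM role (names are placeholders)
aws iam attach-role-policy \
  --role-name <NODE_INSTANCE_ROLE> \
  --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/<CLUSTER_AUTOSCALER_POLICY>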

Benefits of using EKS Cluster Autoscaler

There are several benefits of using Cluster Autoscaler in Amazon Elastic Kubernetes Service (EKS):  

  1. Efficient use of resources: Cluster Autoscaler can monitor the workload and adjust the size of the cluster accordingly. This ensures that resources are used efficiently, and nodes are added or removed as needed to meet demand.
  2. Cost savings: By using Cluster Autoscaler to manage the cluster size, you can reduce cost by only running the necessary number of nodes. This can minimize idle resources and optimize resource usage.
  3. Improved availability: Cluster Autoscaler can help improve availability by ensuring the cluster has enough capacity to handle the workload. This can help to prevent service disruptions and ensure that your applications are always available.
  4. Increased scalability: With Cluster Autoscaler, you can easily scale your cluster up or down to meet changing demand. This can help to ensure that your applications can handle sudden spikes in traffic or usage.
  5. Simplified management: Cluster Autoscaler can automate the cluster scaling process, reducing the need for manual intervention. This can help to simplify management and reduce the risk of errors or downtime.

Common Issues, Troubleshooting and Tips

When using Cluster Autoscaler with Amazon Elastic Kubernetes Service (EKS), there may be instances where you encounter issues or errors. In this section, we will discuss some common troubleshooting tips and solutions to help you address these issues.

1. Autoscaling not working as expected:

Use-case: The Cluster Autoscaler is not scaling up/down the nodes as expected.  

Troubleshooting tips:  

  • Check the Cluster Autoscaler logs for any errors or warnings that may indicate why the autoscaling is not functioning as expected.  
  • Verify that the Cluster Autoscaler deployment is running and has the correct configuration.
  • Check for resource constraints, such as low CPU/memory, which may be preventing the autoscaling from functioning correctly.  
  • Check the configuration for errors, including the HPA configuration, to ensure that there are no syntax errors or other misconfigurations.
  • Check the metrics backend for errors, if applicable.

Code example:

# Check the Cluster Autoscaler logs
kubectl logs -f cluster-autoscaler-xxx-xxx-xxxxx -n kube-system
# Verify that the Cluster Autoscaler deployment is running
kubectl get deployments -n kube-system
# Check for resource constraints
kubectl top nodes
kubectl top pods
# Check the configuration for errors
kubectl describe hpa
# Check the metrics backend for errors
kubectl get pods -n monitoring

2. Pod placement issues:

Use-case: Pods remain pending because the Cluster Autoscaler is not provisioning nodes for them.

Troubleshooting tips:  

  • Check the node and pod status to see if there are any issues with the nodes or pods.  
  • Check the Cluster Autoscaler logs for any errors or warnings related to pod placement.  
  • Verify that the Kubernetes API server is reachable from the Cluster Autoscaler pod.

Code example:

# Check the node and pod status
kubectl get nodes
kubectl get pods
# Check the Cluster Autoscaler logs
kubectl logs -f cluster-autoscaler-xxx-xxx-xxxxx -n kube-system

You can verify that your API server is working properly in EKS by following these steps:

  1. Open your AWS Management Console and navigate to the EKS service.
  2. Select the cluster for which you want to verify the API server.
  3. Click on the “Configuration” tab and scroll down to the “Kubernetes endpoint” section. Here, you will see the “Endpoint” URL for your API server.
  4. Send a request to the endpoint’s /version path (for example, with curl). If the API server is working properly, you should see a JSON response with the Kubernetes version information.
  5. If you encounter any errors, it may indicate that there is an issue with your API server configuration or connectivity. In this case, you may need to review your cluster and API server configuration, check network connectivity, and consult the EKS documentation or support resources for further assistance.

You can verify that the API server is reachable from inside the cluster (where the Cluster Autoscaler runs) by following these steps:

If you are unable to exec into the Cluster Autoscaler pod because it uses a distroless image, you can run the curl command from a different pod within the same namespace.
For example, you can create a temporary test pod and run the ‘curl’ command from there to verify that the Kubernetes API server is reachable.

Testing that the K8s API server is reachable from inside the EKS cluster

Here are the steps:  

  • Create a temporary test pod with a shell container (the curlimages/curl image is used here so that the curl command is available; busybox does not include curl):
kubectl run -i --tty test-pod -n kube-system --image=curlimages/curl --restart=Never -- sh
  • Once the test pod is running, run the following command inside the pod to verify that the Kubernetes API server is reachable:
curl -k https://xxxxxxxxxxxxxxxxxxxxxxxxx.gr7.us-east-1.eks.amazonaws.com/version

This command should return the version information for the Kubernetes API server. If you encounter any errors or connection issues, it may indicate that there is an issue with your network configuration or connectivity. You can try checking the logs of the test pod and the Kubernetes API server pod to see if there are any error messages that can help diagnose the issue.
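
For example, you can inspect the test pod's output with kubectl (note that on EKS the API server itself is managed by AWS, so its logs are exposed through CloudWatch control plane logging rather than as a pod in the cluster):

# Inspect the output of the temporary test pod
kubectl logs test-pod -n kube-system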

  • After you have verified that the Kubernetes API server is reachable, you can delete the test pod:
kubectl delete pod test-pod -n kube-system

3. Configuration errors:

Use-case: The Cluster Autoscaler is not functioning correctly due to configuration errors.  

Troubleshooting tips:  

  • Review the configuration for syntax errors or other issues that may be preventing the Cluster Autoscaler from functioning correctly.  
  • Verify that the configuration is correct and matches the desired behavior.  
  • Check the Cluster Autoscaler logs for any errors or warnings related to configuration.  

Code example:

# Review the configuration
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <asg-name>

# ASG Name Example:  
# eksctl-lightlytics-eks-cluster-nodegroup-AZ3-N  

# Verify the configuration matches the desired behavior
kubectl get hpa  

# Check the Cluster Autoscaler logs  
kubectl logs -f cluster-autoscaler-xxx-xxx-xxxxx -n kube-system

4. Incorrect node labels:

Use-case: The Cluster Autoscaler is not scaling up/down the nodes because of incorrect node labels.

Troubleshooting tips:  

  • Check the node labels to ensure that they are correct and match the requirements of the Cluster Autoscaler.  
  • Verify that the labels are being applied correctly to new nodes.  
  • Check the Cluster Autoscaler logs for any errors or warnings related to node labels.  

Code example:

# Check the node labels
kubectl get nodes --show-labels

# Verify that the labels are being applied correctly to new nodes
kubectl describe node <node-name>

# Check the Cluster Autoscaler logs
kubectl logs -f cluster-autoscaler-xxx-xxx-xxxxx -n kube-system

5. Metrics not available:

Use-case: The Cluster Autoscaler is not able to scale up/down the nodes due to the unavailability of metrics.

Troubleshooting tips:

  • Verify that metrics-server is running: Metrics-server is a Kubernetes component that collects and serves cluster-wide resource usage data. The Cluster Autoscaler relies on the metrics server to retrieve node metrics and determine when to scale the cluster.

You can check the status of the metrics server using the following command:

kubectl get deployment metrics-server -n kube-system

If the deployment is not running or is in a failed state, you can try deleting and re-creating the deployment:

kubectl delete deployment metrics-server -n kube-system
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

  • Verify that the HPA is correctly configured: The Horizontal Pod Autoscaler (HPA) is responsible for scaling the number of pods running in the cluster based on resource utilization metrics. When the HPA scales a workload beyond the cluster's current capacity, the resulting unschedulable pods are what trigger the Cluster Autoscaler to add nodes.

You can check the status of the HPA using the following command:

kubectl describe hpa <hpa-name>

If the HPA is not configured correctly, you can try adjusting the target CPU/memory utilization thresholds or scaling policies.
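
As a quick sketch (the deployment name and thresholds below are placeholders), you can create or recreate an HPA imperatively with kubectl autoscale:

# Create an HPA targeting 50% average CPU utilization (values are placeholders)
kubectl autoscale deployment <DEPLOYMENT_NAME> --cpu-percent=50 --min=2 --max=10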

  • Verify that the Cluster Autoscaler is configured correctly: The Cluster Autoscaler needs to be configured with the correct credentials and permissions to access the Kubernetes API server and modify the cluster.

You can check the configuration of the Cluster Autoscaler using the following command:

kubectl describe deployment cluster-autoscaler -n kube-system --kubeconfig=path/to/kubeconfig

Ensure that the --kubeconfig flag points to the correct Kubernetes configuration file and that the Cluster Autoscaler has the necessary RBAC permissions to scale the cluster.
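
As a quick check (the service account name below is the common default and may differ in your setup), you can test the Cluster Autoscaler's RBAC permissions with kubectl auth can-i:

# Check whether the Cluster Autoscaler service account can list nodes
# (service account name is an assumption; adjust to your deployment)
kubectl auth can-i list nodes --as=system:serviceaccount:kube-system:cluster-autoscaler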

  • Verify that the nodes are correctly labelled: The Cluster Autoscaler relies on node labels to determine which nodes can be scaled.
    If the nodes are not correctly labelled, the Cluster Autoscaler may not be able to scale the cluster.  
    You can check the labels of the nodes using the following command:
kubectl describe node <node-name>

Ensure that the nodes are labelled with the correct node-role.kubernetes.io/<role> label, where <role> is the name of the node group that the node belongs to.
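
If a required label is missing, you can add it manually with kubectl label (the role value below is a placeholder):

# Apply a node-role label to a node (role name is a placeholder)
kubectl label node <node-name> node-role.kubernetes.io/<role>=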

  • Review the Cluster Autoscaler logs: The Cluster Autoscaler logs can provide valuable information about the scaling decisions and any errors or issues encountered during scaling.
    You can check the logs of the Cluster Autoscaler using the following command:
kubectl logs -f deployment/cluster-autoscaler -n kube-system

Look for any error messages or warnings that may indicate issues with metrics or scaling. If necessary, you can also adjust the logging level of the Cluster Autoscaler to provide more detailed information.
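
For example, you can raise the Cluster Autoscaler's klog verbosity by editing its deployment and adjusting the --v flag (level 4 is shown here purely as an illustration):

# Open the Cluster Autoscaler deployment for editing
kubectl edit deployment cluster-autoscaler -n kube-system
# ...then add or change the verbosity flag in the container command, e.g.:
#   --v=4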

6. IAM roles not configured correctly:

Use-case: The Cluster Autoscaler is not able to scale up/down the nodes due to incorrect IAM roles.

Troubleshooting tips:  

  • Verify that the IAM roles are correctly configured and have the required permissions to scale the nodes.  
  • Check that the IAM role ARN is specified correctly in the Cluster Autoscaler deployment configuration (for example, in the service account annotation when using IAM Roles for Service Accounts).
  • Verify that the IAM policy attached to the IAM role used by the EKS nodes has the necessary permissions to interact with the Amazon EC2 Auto Scaling group.  
  • Check the logs of the Cluster Autoscaler deployment for any errors related to IAM roles or permissions.  
  • Test the IAM permissions by manually creating or deleting an Auto Scaling group to verify that the IAM roles and permissions are configured correctly.

If IAM roles are not configured correctly in Cluster Autoscaler, it can cause issues with the autoscaling process. Here are some troubleshooting tips and solutions with code examples:

  • Verify IAM Roles: Verify that the IAM roles for your EKS nodes are configured correctly. Make sure that the IAM role used by the EKS nodes has the necessary permissions to interact with the Amazon EC2 Auto Scaling group. You can do this by checking the IAM policy attached to the IAM role.

To list the managed policies attached to an IAM role, you can use the AWS CLI command ‘aws iam list-attached-role-policies'; to view an inline policy, use ‘aws iam get-role-policy', which also requires the policy name.
Here are example commands:

aws iam list-attached-role-policies --role-name <IAM_ROLE_NAME>
aws iam get-role-policy --role-name <IAM_ROLE_NAME> --policy-name <POLICY_NAME>

Replace <IAM_ROLE_NAME> with the name of the IAM role used by your EKS nodes and <POLICY_NAME> with the name of the inline policy. These commands will return the policies attached to the IAM role.

  • Check IAM Role ARN: Make sure that the IAM role ARN is specified correctly in the Cluster Autoscaler deployment. If you use IAM Roles for Service Accounts (IRSA), the role ARN is set via the eks.amazonaws.com/role-arn annotation on the Cluster Autoscaler's service account. Separately, the --node-group-auto-discovery option tells the Cluster Autoscaler which Auto Scaling groups to manage, based on their tags.

Here is an example of the --node-group-auto-discovery option in the Cluster Autoscaler deployment configuration file:

--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<CLUSTER_NAME>

Replace <CLUSTER_NAME> with the name of your EKS cluster. Make sure that your Auto Scaling groups carry the tags specified in this option, and that the role ARN configured in the AWS Console matches the one referenced by the Cluster Autoscaler.
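
To confirm that an Auto Scaling group actually carries the discovery tags, you can inspect its tags with the AWS CLI (the group name is a placeholder):

# List the tags on an Auto Scaling group (name is a placeholder)
aws autoscaling describe-tags --filters "Name=auto-scaling-group,Values=<ASG_NAME>"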

  • Verify IAM Policy: Verify that the IAM policy attached to the IAM role used by the EKS nodes has the necessary permissions to interact with the Amazon EC2 Auto Scaling group. The IAM policy should include the necessary permissions to describe, create, and delete Auto Scaling groups.
Here is an example IAM policy that includes the necessary permissions for Cluster Autoscaler:
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Sid": "VisualEditor0",
     "Effect": "Allow",
     "Action": [
       "autoscaling:SetDesiredCapacity",
       "autoscaling:TerminateInstanceInAutoScalingGroup"
     ],
     "Resource": "*",
     "Condition": {
       "StringEquals": {
         "autoscaling:ResourceTag/Environment": "production"
       }
     }
   },
   {
     "Sid": "VisualEditor1",
     "Effect": "Allow",
     "Action": [
       "autoscaling:DescribeAutoScalingGroups",
       "autoscaling:DescribeAutoScalingInstances",
       "ec2:DescribeLaunchTemplateVersion",
       "autoscaling:DescribeTags",
       "autoscaling:DescribeLaunchConfiguration",
       "ec2:DescribeInstanceTypes"
     ],
     "Resource": "*"
   }
 ]
}

Make sure that this IAM policy is attached to the IAM role used by your EKS nodes.

  • Check for Errors in Logs: Check the logs of the Cluster Autoscaler deployment for any errors related to IAM roles or permissions. You can use the kubectl logs command to view the logs of the Cluster Autoscaler deployment.
kubectl logs -f cluster-autoscaler-xxx-xxx-xxxxx -n kube-system

Replace cluster-autoscaler-xxx-xxx-xxxxx with the name of the Cluster Autoscaler pod.

  • Test IAM Permissions: Test the IAM permissions by manually creating or deleting an Auto Scaling group. This can help verify that the IAM roles and permissions are configured correctly.  

Here is an example AWS CLI command to manually create an Auto Scaling group:

aws autoscaling create-auto-scaling-group --auto-scaling-group-name <ASG_NAME> --launch-template LaunchTemplateName=<LT_NAME> --min-size 1 --max-size 2 --vpc-zone-identifier <SUBNET_IDS>

Replace <ASG_NAME> with the name of the Auto Scaling group, <LT_NAME> with the name of the launch template, and <SUBNET_IDS> with the IDs of the subnets used by your EKS nodes (note that --min-size and --max-size are required by this command). After creating the Auto Scaling group, check the logs of the Cluster Autoscaler deployment to see if it detected the new Auto Scaling group and adjusted the node capacity accordingly.

Here is an example AWS CLI command to manually delete an Auto Scaling group (the --force-delete flag also terminates any instances running in the group):

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name <ASG_NAME> --force-delete

Replace <ASG_NAME> with the name of the Auto Scaling group that you created earlier. After deleting the Auto Scaling group, check the logs of the Cluster Autoscaler deployment to see if it detected the deleted Auto Scaling group and adjusted the node capacity accordingly.

If you are still having issues with IAM roles and permissions, you may need to consult the AWS documentation or contact AWS support for further assistance.

Best practices to optimize performance and scalability

Amazon Elastic Kubernetes Service (EKS) is a fully managed Kubernetes service that makes it easy to deploy, manage, and scale containerized applications. Cluster Autoscaler is a tool that can automatically adjust the size of a Kubernetes cluster based on resource utilization. When running workloads on EKS, configuring Cluster Autoscaler is a best practice to optimize performance and scalability. Here are some best practices for configuring Cluster Autoscaler in EKS:

  1. Enable horizontal pod autoscaling (HPA): Cluster Autoscaler works best when combined with HPA. HPA automatically scales the number of replicas of a pod based on CPU utilization, memory usage, or custom metrics. By enabling HPA, you can ensure that your workloads scale up and down based on resource utilization. You can configure HPA by creating a Kubernetes manifest file that specifies the scaling policy for each pod.
  2. Use the recommended settings: Amazon provides a set of recommended settings for configuring Cluster Autoscaler in EKS. These settings include minimum and maximum node group sizes, node group scaling policies, and other parameters. It’s important to configure Cluster Autoscaler with these recommended settings to ensure optimal performance and scalability. You can configure Cluster Autoscaler by creating a Kubernetes manifest file that specifies the recommended settings.
  3. Use the right instance types and sizes: The instance types and sizes you choose for your node groups can have a significant impact on performance and scalability. It’s important to choose instance types and sizes that are appropriate for your workload requirements. For example, if you have CPU-intensive workloads, you may want to choose instance types with higher CPU capacity. You can choose instance types and sizes when creating or updating node groups in your EKS cluster (see the example command after this list).
  4. Monitor and analyze resource utilization: To ensure that Cluster Autoscaler is working properly, it’s important to monitor and analyze resource utilization. This includes CPU and memory usage, network traffic, and other resource utilization metrics. You can use tools like Kubernetes Metrics Server, Prometheus, and CloudWatch Metrics to monitor resource utilization. You can create dashboards and alerts to monitor and analyze resource utilization.
  5. Test and validate your configuration: Before deploying your workloads to production, it’s important to test and validate your Cluster Autoscaler configuration. This includes testing how your cluster scales up and down based on resource utilization, testing failover scenarios, and ensuring that your workloads remain stable during scaling events. You can create a testing environment and simulate different scenarios to validate your configuration.
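
As an illustration of choosing instance types when creating a node group (the cluster name, node group name, and sizes below are placeholders), an eksctl command might look like this:

# Create a node group with a specific instance type (names and sizes are placeholders)
eksctl create nodegroup \
  --cluster <CLUSTER_NAME> \
  --name <NODEGROUP_NAME> \
  --node-type m5.large \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 10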

By following these best practices, you can configure Cluster Autoscaler to optimize performance and scalability in EKS. This can help ensure that your workloads are running efficiently and cost-effectively.

Debugging Tips & Tricks

Debugging is an essential skill for any developer or Kubernetes cluster administrator. Here are some debugging tips and tricks that can help you identify and resolve issues in your Kubernetes cluster:

  1. Review Logs: Logs can provide valuable information about errors, performance issues, and other problems. In Kubernetes, you can access logs for pods, containers, and nodes using Kubectl or a logging tool like the ElasticSearch, Fluentd, and Kibana (EFK) stack. Reviewing logs can help you identify the source of an issue and take appropriate action.
  2. Use Debug Containers: Debug containers can be used to troubleshoot issues in Kubernetes. Debug containers are containers that run in the same pod as the affected container, allowing you to access the same file system and network namespace. This makes it easier to diagnose and resolve issues. You can use tools like kubectl to launch a debug container in a pod and diagnose issues (see the sketch after this list).
  3. Check Resource Utilization: Resource utilization can have a significant impact on the performance of your Kubernetes cluster. It’s important to monitor resource usage and identify any bottlenecks. You can use tools like Kubernetes Metrics Server, Prometheus, and CloudWatch Metrics to monitor resource utilization. This can help you identify which resources are underutilized or over-utilized.
  4. Run Diagnostic Commands: Diagnostic commands can help you identify issues in Kubernetes. You can use Kubectl to run diagnostic commands on pods, containers, and nodes. Some common diagnostic commands include kubectl describe, kubectl logs, kubectl exec, kubectl top, and kubectl port-forward. Running diagnostic commands can provide you with detailed information about your resources and help you identify issues.
  5. Simulate Issues: Simulating issues can help you identify and resolve issues before they occur in production. You can use tools like Chaos Monkey, LitmusChaos, and Chaos Toolkit to simulate various failure scenarios in your Kubernetes cluster. This can help you identify weaknesses in your system and make improvements to prevent issues from occurring.
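
For instance, kubectl debug can attach an ephemeral debug container to a running pod (the pod and target container names below are placeholders):

# Attach an ephemeral debug container to a running pod
# (pod and target container names are placeholders)
kubectl debug -it <POD_NAME> --image=busybox --target=<CONTAINER_NAME>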

By using these debugging tips and tricks, you can identify and resolve issues in your Kubernetes cluster more effectively. It’s important to continuously monitor and analyze your cluster to ensure that it’s running efficiently and cost-effectively.

Conclusion

It is always a good idea to have a proper understanding of the technologies you are using and the common issues that may arise while working with them. Cluster Autoscaler is a powerful tool that can help you scale your Kubernetes cluster dynamically based on the workload. However, it can also cause problems if not configured correctly.

In this guide, we have covered some of the common issues that you may encounter while using Cluster Autoscaler in EKS and provided troubleshooting tips to help you resolve them. By following these tips, you can ensure that your Cluster Autoscaler deployment is configured correctly and working as expected, which can help you avoid downtime and other issues in your Kubernetes cluster. Remember, the key to successfully using Cluster Autoscaler is to have a deep understanding of the technology, know what to look for when things go wrong, and be prepared to troubleshoot issues as they arise.

With these tips and a little patience, you can ensure that your Cluster Autoscaler deployment is running smoothly and your Kubernetes cluster is ready to handle any workload that comes its way.
