Kubernetes — setting your service at the right capacity

Farooq Naveen A K
Jul 31, 2022

Whatever the scale of your service in a production Kubernetes cluster, reliability is one of the most important factors, and the capacity at which the service runs has a direct bearing on it. This article discusses some good practices around infrastructure resources and pod counts that help a service stay reliable.

This story assumes a basic understanding of how pods are deployed in Kubernetes.

Deciding the requests and limits for your pod

First, let us understand what the request and limit (for CPU or memory) are for in a practical scenario.

Request: Your pod handles a usual load during regular usage hours (or business hours), and that load has a corresponding CPU and memory usage, which you can derive from performance tests. This amount of CPU and memory should be your request: the resources your pod needs to function normally.

Kubernetes uses the declared values to find a node where these resources are available, places your pod there, and guarantees the pod the requested resources.
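
The placement rule can be sketched in a few lines of Python. This is a simplified model, not the scheduler's actual code: the function name `fits_on_node` and the sample numbers are illustrative assumptions, and the real scheduler also weighs CPU, affinity, taints and other factors. What it shows is that only requests, never limits or live usage, enter the fit check.

```python
MI = 1024 * 1024  # bytes in one mebibyte


def fits_on_node(allocatable_bytes, requests_on_node, pod_request_bytes):
    """Return True if the pod's memory request fits in what is still unreserved.

    Simplified model: unreserved = allocatable - sum of requests already placed.
    Limits and actual usage play no part in this check.
    """
    unreserved = allocatable_bytes - sum(requests_on_node)
    return pod_request_bytes <= unreserved


# Hypothetical node: 8192 Mi allocatable, ten pods already requesting 600 Mi each.
node_allocatable = 8192 * MI
existing_requests = [600 * MI] * 10

print(fits_on_node(node_allocatable, existing_requests, 500 * MI))   # True
print(fits_on_node(node_allocatable, existing_requests, 2500 * MI))  # False
```

With 6000 Mi already reserved, 2192 Mi remains unreserved, so a 500 Mi request fits and a 2500 Mi request does not.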

Limit: During regular operations, CPU and/or memory usage can spike because of increased load or other factors. When that happens, the pod draws the extra resources from the underlying node. The limit defines the point at which the pod is stopped from using more. When a container crosses its memory limit, it is killed (OOM) and restarted. When it reaches its CPU limit, it is throttled.

It is important to set the limit values correctly. Avoid a limit that is very large compared to the request: it can leave the node overcommitted, and such a pod can disrupt the other pods on the same node by using up resources they need. Let’s take an example to understand this.

Consider a pod deployed with the values below for request and limit. To keep things simple, we consider only memory and use some dummy values:

memory.request: 500 Mi
memory.limit: 5 Gi

Looking at the memory request of 500 Mi, Kubernetes picks a node that has 2 Gi of free memory. This node runs 10 other pods as well. Things go normally, with the pod using around 900 Mi. Then, on a very busy day, the pod receives far more requests than usual and starts using more memory. Its usage reaches 2 Gi and it wants more. At this point the node is starved of memory and slows down, which degrades the quality of service of the other pods, some of which have also started asking for slightly more memory.
Kubernetes only considered the “request” and not “limit” while allocating a pod to a node.
In this case, “limit” was far higher than “request” allowing it to shoot up memory usage.
The pod should have been killed before it clogged the node, and a lower value for “limit” would have done that. The alternative is to provision spare capacity for every such spike, which works only if you are not bothered about paying for under-utilised resources :).
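
The numbers in this example can be checked with a short Python sketch. The function `overcommit_report` and its field names are made up for illustration; all values are in Mi and come from the example above.

```python
def overcommit_report(request_mi, limit_mi, node_free_mi):
    """Summarise how far a pod's limit reaches beyond what the node can give.

    All values are in Mi. The scheduler only compares request_mi with
    node_free_mi; limit_mi is enforced later, at run time.
    """
    return {
        "schedulable": request_mi <= node_free_mi,
        "limit_to_request_ratio": limit_mi / request_mi,
        # Memory the pod is allowed to ask for but the node cannot supply.
        "unbacked_mi": max(0, limit_mi - node_free_mi),
    }


# Values from the example: request 500 Mi, limit 5 Gi, node with 2 Gi free.
report = overcommit_report(request_mi=500, limit_mi=5 * 1024, node_free_mi=2 * 1024)
print(report)
# {'schedulable': True, 'limit_to_request_ratio': 10.24, 'unbacked_mi': 3072}
```

The pod is schedulable, yet its limit is more than ten times its request and 3 Gi of that limit has no free memory behind it on this node.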

This is also an area where the teams involved (Development and Platform) need a clear, shared understanding of what they are committing to.

Take away: keep the limit-to-request ratio low | consider total capacity across all nodes.

High Availability Principle

This section is a prerequisite for the next one. We first need to understand how high availability is achieved in the cloud and in Kubernetes. I am talking about managed Kubernetes on any public cloud.

Public cloud providers offer multiple datacenters within a region, called Availability Domains or Availability Zones (let’s call them zones). This lets you design your infrastructure to be spread across these zones. The advantage is that, if your service is properly distributed across zones, the failure of a datacenter (or a part of one) does not affect the availability of your service, on the assumption that the biggest likely fault is at the datacenter level. Kubernetes is designed to be aware of this. When we create a Kubernetes cluster, the worker nodes can be distributed across multiple zones. When a node is added to the cluster, its zone is recorded in the node metadata as a label, “failure-domain.beta.kubernetes.io/zone” (deprecated in newer Kubernetes versions in favour of “topology.kubernetes.io/zone”). When a user deploys a service with a replica count of 3, Kubernetes will try to place the replicas across the 3 zones.

Take away: create the cluster across multiple failure domains | set the right number of pods (next section)

Setting the right number of replicas

In a cloud environment, your workloads are, after all, running on some computer sitting in some datacenter. That computer could fail for a number of reasons, including maintenance, hardware faults, power issues, and network issues in the datacenter. This means pods running in Kubernetes can be restarted or evicted at any time. For the service to stay up without users noticing when a pod or a set of pods goes down, the other pods in the ReplicaSet must be able to carry the traffic. To understand this, let’s take the example of a service that serves requests from clients (again, the scenario is simplified to keep things easy to follow).
Number of requests served in a minute on a business day: 30000
Replica count(number of pods): 6
Let’s assume the requests are balanced equally across the pods (5000 per pod), that one pod can serve up to 6000 requests without issues, and that Kubernetes has distributed the pods equally across 3 zones, two pods per zone (as described in the previous section).
Now, in this state, one zone goes down due to a power outage in the datacenter, making two pods unavailable. Until they come back up on other nodes, the traffic they were handling (5000 × 2 = 10000 requests) has to be handled by the remaining pods. So 10000 / 4 = 2500 additional requests reach each of the 4 remaining pods, for a total of 5000 + 2500 = 7500 requests per pod. Assuming there is no pod-level limit on the number of requests, the pods may not withstand the new load, as their capacity is 6000.
In this state, the other 4 pods could go into restarts (OOM kills, or errors due to failing health checks). When the two replacement pods come up, there may still not be enough capacity to carry the whole load, causing total downtime for the service.
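
The failure arithmetic above can be reproduced with a small Python sketch. The function name is made up, and the model rests on the same simplifications as the example: pods spread as evenly as possible across zones (3 zones are assumed here) and traffic balanced equally across the pods that survive.

```python
def load_per_pod_after_zone_loss(total_requests, replicas, zones):
    """Requests each surviving pod receives when the fullest zone goes down.

    Assumes pods are spread as evenly as possible across zones and that
    traffic is balanced equally across the pods that remain.
    """
    # Even spread: the fullest zone holds ceil(replicas / zones) pods.
    pods_in_fullest_zone = -(-replicas // zones)
    surviving = replicas - pods_in_fullest_zone
    if surviving == 0:
        raise ValueError("no pods survive the zone failure")
    return total_requests / surviving


per_pod = load_per_pod_after_zone_loss(total_requests=30000, replicas=6, zones=3)
print(per_pod)          # 7500.0
print(per_pod <= 6000)  # False: above the 6000-request capacity of one pod
```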

The solution is to keep enough replicas to handle the load during such failures. In this example, if the pod count were 8 in total, then for the same 30000 requests:
Each pod would have had 3750 requests.
Suppose 3 pods are in the zone that went down; the remaining 5 pods will have to absorb their 11250 requests.
An additional 2250 requests come to each remaining pod, giving 3750 + 2250 = 6000 requests per pod, which is within capacity.
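
To find such a replica count without trial and error, the same reasoning can be turned into a small search. This is a sketch under the example's assumptions (even spread across zones, equal load balancing, a fixed per-pod capacity); `min_replicas` is a hypothetical helper, not a Kubernetes feature.

```python
def min_replicas(total_requests, pod_capacity, zones):
    """Smallest replica count that survives the loss of the fullest zone.

    Same simplifying assumptions as the example: even spread across zones,
    equal load balancing, and a fixed per-pod capacity.
    """
    if zones < 2:
        raise ValueError("need at least 2 zones to survive a zone failure")
    if pod_capacity <= 0:
        raise ValueError("pod_capacity must be positive")
    replicas = zones  # start with at least one pod per zone
    while True:
        lost = -(-replicas // zones)  # ceil: pods in the fullest zone
        surviving = replicas - lost
        if surviving > 0 and total_requests / surviving <= pod_capacity:
            return replicas
        replicas += 1


print(min_replicas(total_requests=30000, pod_capacity=6000, zones=3))  # 8
```

Note that 8 replicas leave the surviving pods exactly at capacity, so in practice you would likely add some headroom on top of the computed minimum.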

Take away — set the replica count right!

Ephemeral storage

If your pods write to disk (say, to a Kafka state store), find the maximum disk space they will consume (by running a performance test) and declare it in the pod spec as ephemeral storage. If it is not declared, then when the node’s disk usage crosses the kubelet’s eviction threshold (for example 80%; the actual value depends on the kubelet configuration), pods that have not declared it are among the first the kubelet evicts. On top of that, declaring it ensures the pod is scheduled on a node with enough disk space.
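
A minimal sketch of what declaring it looks like. Kubernetes accepts manifests in JSON as well as YAML, so the snippet builds a container's resources section as a Python dictionary and prints it as JSON. The helper name, container name, image and sizes (`2Gi`, `4Gi`) are placeholders; `ephemeral-storage` under `requests` and `limits` is the resource name Kubernetes uses for local disk.

```python
import json


def container_with_ephemeral_storage(name, image, request, limit):
    """Build a container spec fragment that declares ephemeral storage."""
    return {
        "name": name,
        "image": image,
        "resources": {
            "requests": {"ephemeral-storage": request},
            "limits": {"ephemeral-storage": limit},
        },
    }


# Placeholder values; take the real sizes from your performance test.
container = container_with_ephemeral_storage(
    name="stream-processor",
    image="example.com/stream-processor:1.0",
    request="2Gi",
    limit="4Gi",
)
print(json.dumps({"spec": {"containers": [container]}}, indent=2))
```

The printed fragment is only the part of a pod spec that matters here; a full manifest also needs `apiVersion`, `kind` and `metadata`.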

Take away — declare it!

Programming languages that are not container aware

Older versions of Java are not container aware: in simple words, they do not know about the limits set by the container engine. When a container running such a Java process runs in Kubernetes with resource limits set, the Java process keeps using more and more memory for as long as the operating system provides it. When the container memory limit is reached, the container is killed and the kubelet restarts it. To prevent this, set the -Xmx value for the Java process so that its heap stays below the container limit. Upgrading to a later Java version is an even better option.
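
One way to derive an -Xmx value from the container limit, sketched in Python. The helper `xmx_flag` and the 75% heap fraction are assumptions for illustration, not a recommendation from the JVM documentation: the JVM also needs memory outside the heap (metaspace, thread stacks and so on), so the heap has to stay some way below the container limit.

```python
def xmx_flag(container_limit_mi, heap_fraction=0.75):
    """Return a -Xmx flag that keeps the heap below the container memory limit.

    heap_fraction is an assumed rule of thumb; the remainder is left for
    non-heap memory such as metaspace and thread stacks.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mi = int(container_limit_mi * heap_fraction)
    return f"-Xmx{heap_mi}m"


# For a container limited to 2 Gi (2048 Mi):
print(xmx_flag(2048))  # -Xmx1536m
```

The resulting flag can be passed on the `java` command line or through the `JAVA_TOOL_OPTIONS` environment variable in the container spec.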

Take away — set the limit at the language runtime!