
Monitor a Pod’s container resources in GCP

Recently, we switched our whole cluster to GKE (Google Kubernetes Engine) Autopilot. Autopilot is a mode of operation where Google manages the required VMs / node pools for you. You no longer need to specify, monitor, scale or even pay for VMs, and therefore for IaaS. Instead, you specify which workloads (services and cron jobs) you have and how much CPU and memory they need. These are called resource "requests". GKE needs this information to do proper horizontal auto-scaling. What you ultimately pay for with Autopilot is the total requested CPU and memory of all your Pods (instances), no matter how many VMs, and of what size, are needed to run all of this.
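To make that concrete, here is a minimal sketch of what such per-container requests look like in a Deployment manifest. The workload name, image and values are placeholders, not our real settings:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-backend            # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: customer-backend
  template:
    metadata:
      labels:
        app: customer-backend       # the "app" label we will filter on later
    spec:
      containers:
        - name: customer-backend
          image: europe-docker.pkg.dev/my-project/customer-backend:1.0.0   # placeholder image
          resources:
            requests:               # this is what Autopilot schedules and bills
              cpu: 250m
              memory: 512Mi
            limits:                 # Autopilot keeps limits equal to requests
              cpu: 250m
              memory: 512Mi
```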

The problem

How do you find the right values for how much CPU and memory each of your workloads actually uses?

If you go to the GCP console, the web UI of Google Cloud, you can drill down to your workload details and get something like this:

You see how many vCPUs all your Pods together need. The red line is what was specified in the deployment YAML files. First thought: a nice and quick overview.

But this can be misleading, as it shows the aggregated statistics of all container instances running in all your Pods. Especially with Autopilot, resource requests are managed strictly per container.

So the question is: how much CPU and memory does each of your containers really use and need? What do you have to specify in the YAML for which container?

The solution

You can monitor the resource usage of your containers in the Google Metrics Explorer.

CPU is found under Kubernetes Container > Container > CPU usage time. Memory is right next to it under Kubernetes Container > Container > Memory usage.

Then a lot of lines will be shown in a diagram: one for each container (not pod) in your cluster:

Not quite that helpful 😁.

Pro-tip: add a filter on “app”, which is the name of your workload:

Then you get all running containers for the given workload, like this, and on the right side you see the number of CPUs for the given container at each point in time.

I’d say this is better, but still not perfect. Our Pods normally have two or three containers running per Pod: for example, our own backend service, an Istio (mesh) sidecar and sometimes even a Google Cloud SQL proxy. All of these containers are shown in the above chart as well.

As you have to specify the resource requests per container, you normally want to look at a single container type: for example, workload (= app) “CustomerBackend”, container (= container_name) “istio-proxy”.

Pro-tip: additionally add a filter on container_name.

Now the chart looks better, and you can see reasonably well what the istio-proxy container of your customer-backend workload really uses:

Compare the values on the Y axis with the request values you have in your YAML or Helm chart and adjust them if needed.
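As a sketch of where such values typically end up, this is what the adjusted requests might look like in a Helm values file. Whether your chart exposes resources under these keys depends on the chart, so treat the structure as an assumption:

```yaml
# values.yaml sketch; keys follow a common Helm convention and may differ
# in your chart. Values are placeholders derived from the observed usage.
resources:
  requests:
    cpu: 250m        # roughly the observed peak plus some headroom
    memory: 512Mi
  limits:
    cpu: 250m        # Autopilot keeps limits equal to requests anyway
    memory: 512Mi
```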

Pro-tip: you can also use other features of the Metrics Explorer, such as grouping, e.g. by container_name. To count the instances running over time, you can filter by app, group by container_name and then aggregate with a count. You can go even further and build a custom GCP dashboard based on these metric settings with filter variables, so you can check each of your workloads via the filter toolbar.
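If you’d rather not rebuild such a chart in the Metrics Explorer every time, the same query can be captured in a dashboard definition and created with gcloud monitoring dashboards create --config-from-file=... . The sketch below uses the CPU metric and labels discussed above, but treat the exact filter and field paths as assumptions and double-check them against what the Metrics Explorer generates for you:

```yaml
# Sketch of a Cloud Monitoring dashboard, to be created with
#   gcloud monitoring dashboards create --config-from-file=dashboard.yaml
# Metric type and label paths are assumptions based on the k8s_container
# metrics shown in the Metrics Explorer; verify them there before using this.
displayName: "customer-backend container CPU"     # hypothetical dashboard name
gridLayout:
  widgets:
    - title: "CPU usage per container"
      xyChart:
        dataSets:
          - timeSeriesQuery:
              timeSeriesFilter:
                # one line per container of the customer-backend workload
                filter: >
                  metric.type="kubernetes.io/container/cpu/core_usage_time"
                  resource.type="k8s_container"
                  metadata.user_labels."app"="customer-backend"
                aggregation:
                  alignmentPeriod: 60s
                  perSeriesAligner: ALIGN_RATE
                  crossSeriesReducer: REDUCE_SUM
                  groupByFields:
                    - resource.label."container_name"
```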

A note on hitting the requested CPU and memory limits:

If CPU usage reaches your requested CPU value, the container gets throttled for a while by the underlying Linux kernel. Normally this is not helpful and causes other issues. Containers like the istio-proxy (e.g. used by Google Anthos) especially don’t like to be throttled, as they control all the network traffic in and out of your Pods. They also control network access to Google services like the GKE metadata server, which is needed to inject your workload settings (environment variables) into your application containers.

Hitting the memory limit behaves about as you would expect: classic out-of-memory errors will occur.

I guess you can imagine all sorts of strange errors coming out of these situations. So don’t just set the values right for your application container, but also for the supporting containers your Pod runs.
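For the istio-proxy sidecar specifically, you usually don’t declare the container yourself; the mesh injects it. Istio offers per-Pod annotations to override the injected proxy’s resources. A minimal sketch, assuming standard Istio/Anthos sidecar injection and that your mesh version honors these annotations (names and values are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-backend                 # hypothetical workload from above
spec:
  selector:
    matchLabels:
      app: customer-backend
  template:
    metadata:
      labels:
        app: customer-backend
      annotations:
        # Override the injected istio-proxy's requests; placeholder values,
        # to be derived from what the Metrics Explorer shows for istio-proxy.
        sidecar.istio.io/proxyCPU: "200m"
        sidecar.istio.io/proxyMemory: "256Mi"
    spec:
      containers:
        - name: customer-backend
          image: europe-docker.pkg.dev/my-project/customer-backend:1.0.0   # placeholder
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
```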

Summary

It’s hard to get the request values for CPU and memory right, as the containers have peaks and lows. But it’s important to get them right: on the one hand to keep everything running smoothly, and on the other hand to not pay too much for your cloud compute power.

The Google Metrics Explorer is your friend here and offers many metrics and features to nail down your cluster’s resource usage.

If you have the Google Anthos (Istio) mesh configured in your cluster, you get even more metrics out of it. You can go one step further and use KEDA to do event-based auto-scaling of your Pods based on many more metrics. All of this works great with GKE Autopilot on our clusters.
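To illustrate the KEDA idea (a sketch, not our actual configuration), a ScaledObject could scale the hypothetical customer-backend Deployment on a mesh metric via a Prometheus query; the Prometheus address, query and threshold below are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: customer-backend-scaler           # hypothetical name
spec:
  scaleTargetRef:
    name: customer-backend                # the Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # placeholder endpoint
        # Istio's request-rate metric for the workload; adjust to your mesh setup.
        query: 'sum(rate(istio_requests_total{destination_workload="customer-backend"}[2m]))'
        threshold: "50"                    # target requests/second per replica
```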
