Trying K8sGPT – AI For Kubernetes

Artificial Intelligence is showing up everywhere, including Kubernetes! The K8sGPT project is an official CNCF sandbox project, first announced at KubeCon Amsterdam in 2023. It was open-sourced under the Apache-2.0 license, and the GitHub project already has lots of stars and contributors. Its motto is …

Giving Kubernetes Superpowers to everyone

In this article I explain what I’ve learned as well as my first impressions of the project.

What Does It Actually Do?

Let’s get this question out of the way first. It is primarily a command-line tool that can:

  • Identify potential problems within a Kubernetes cluster
  • Report results and suggest possible solutions (the AI part)
  • Offer “Integrations” for scanning cloud resources or processing Trivy security CVEs (see the example below)
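
For example, the Trivy integration can be switched on straight from the CLI. This is a sketch from memory rather than something copied from my pipeline, so check k8sgpt integration --help for the exact syntax in your version:

# list available integrations and activate the Trivy scanner
k8sgpt integration list
k8sgpt integration activate trivy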

If the project catches on, I’m sure more features and capabilities will be added.

OK, But Where Is The AI Magic?

Obviously, the AI aspect has attracted the most attention. At the time of this writing, artificial intelligence is only used to enhance the scan results — offering suggestions on how to fix the problems it finds and explaining the report results in more natural language.

The scans are pre-programmed using what are called “analyzers”. Analyzers are coded logic for performing checks on objects in a Kubernetes cluster. Custom analyzers are also supported, so you can write your own.
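
For example, the analyze command accepts a --filter flag to run only specific analyzers (more on filters later):

# run only the Pod analyzer against the cluster
k8sgpt analyze --filter Pod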

Sorry if that is disappointing. But I do believe that at this early stage, it was a good decision not to use AI to perform any real actions inside the cluster. We all know that AI is subject to hallucinations and false positives. Let’s get back to the technical stuff.

Large Language Models (LLMs)

Since I wanted to try this on a real Kubernetes cluster at my company, one real concern was exposing private company data to OpenAI, the default LLM backend. Not to mention needing an OpenAI API token, which is not free.

After going through the list of supported backends, Ollama caught my eye (maybe it was the cute name). I learned that Ollama can be used to serve “local LLMs”. In other words, Ollama is a framework for serving many different sizes and flavours of local LLMs, and its API is compatible with OpenAI’s. That sounds interesting!

Next, I had to pick a local LLM from the model library. I chose to try Meta’s Llama 3, for no particular reason other than the blog article from Meta claiming that it is “The most capable openly available LLM to date”.

Ollama has some basic commands for downloading and serving LLMs:

# download the llama3 LLM
ollama pull llama3

# start ollama without running the desktop application
ollama serve

# list downloaded LLMs
ollama list
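
Since Ollama exposes an OpenAI-compatible API, a quick curl against the local endpoint is an easy way to verify that the model is actually being served before pointing K8sGPT at it (a minimal sketch, assuming Ollama is listening on its default port 11434):

# sanity-check the OpenAI-compatible endpoint served by Ollama
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'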

K8sGPT in a CI/CD Pipeline

Since I wanted to show my colleagues K8sGPT, and I didn’t want scans running from my computer, I created a GitLab CI/CD pipeline. I will describe all the pieces below.

Docker Image

First, I created a Docker image with Ollama and K8sGPT, tagged as k8sgpt:

FROM ubuntu

ENV DEBIAN_FRONTEND=noninteractive

ENV OLLAMA_HOST=0.0.0.0

...

RUN curl -fsSL https://ollama.com/install.sh | sh

RUN curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.24/k8sgpt_amd64.deb && \
    dpkg -i k8sgpt_amd64.deb && \
    rm k8sgpt_amd64.deb
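
The image is then built and pushed so the pipeline can pull it — a sketch, using the same registry placeholder as the job definition below:

# build the image and push it to the internal registry
docker build -t <image-registry-url>/k8sgpt .
docker push <image-registry-url>/k8sgpt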

GitLab Job

Next, I created a GitLab job to run the k8sgpt analyze command. Note the GitLab pipeline service used to run Ollama in the background:

analyze:
  stage: analyze
  image: <image-registry-url>/k8sgpt
  timeout: 2 hours 30 minutes
  variables:
    #CI_DEBUG_SERVICES: "true"
    OLLAMA_MODELS: $CI_PROJECT_DIR/.ollama/models
    CLUSTER_NAME: some-eks-cluster
  services:
    - name: <image-registry-url>/k8sgpt
      alias: ollama
      entrypoint: ["/usr/local/bin/ollama"]
      command: ["serve"]
  script:
    - mkdir -p $OLLAMA_MODELS
    - test -n "$(ls -A $OLLAMA_MODELS)" || ollama pull llama3  # pull llama3 if not detected in models directory
    - ollama list
    - k8sgpt auth add --backend localai --model llama3 --baseurl http://ollama:11434/v1
    - CONTEXT=$(kubectl config current-context)
    - k8sgpt analyze --explain --config k8sgpt.yaml --kubeconfig $HOME/.kube/$CLUSTER_NAME --kubecontext $CONTEXT | tee k8sgpt-${CLUSTER_NAME}-report.txt
  cache:
    key: ollama
    paths:
      - $OLLAMA_MODELS
  artifacts:
    paths:
      - k8sgpt-*.txt

Let’s have a more detailed look at the k8sgpt commands:

# configure the localai backend with the llama3 model, passing the URL to the API served by Ollama
k8sgpt auth add --backend localai --model llama3 --baseurl http://ollama:11434/v1

# run the analyze command, which actually performs the cluster scan
k8sgpt analyze --explain --config k8sgpt.yaml --kubeconfig $HOME/.kube/$CLUSTER_NAME --kubecontext $CONTEXT

The analyze command prints to standard output, which I pipe to the tee command so the report can later be saved as a GitLab artifact.
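
One piece not shown in the job is how the kubeconfig at $HOME/.kube/$CLUSTER_NAME gets there. Since the cluster runs on EKS, an earlier step along these lines would generate it (a sketch, assuming the job has AWS credentials that allow describing the cluster):

# write a kubeconfig for the EKS cluster to the location the analyze job expects
aws eks update-kubeconfig --name $CLUSTER_NAME --kubeconfig $HOME/.kube/$CLUSTER_NAME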

There is a set of standard “filters” which determine what kinds of resources are scanned, but this can be customized (see the example after the listing below):

k8sgpt filters list
Active: 
> ValidatingWebhookConfiguration
> PersistentVolumeClaim
> StatefulSet
> Node
> MutatingWebhookConfiguration
> Service
> Ingress
> CronJob
> Pod
> Deployment
> ReplicaSet
> Log
Unused: 
> HorizontalPodAutoScaler
> PodDisruptionBudget
> NetworkPolicy
> GatewayClass
> Gateway
> HTTPRoute
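
Filters can also be switched on and off directly from the CLI, which is handy for quick experiments before committing anything to a config file (a sketch using filter names from the list above):

# enable and disable individual analyzers from the command line
k8sgpt filters add NetworkPolicy
k8sgpt filters remove CronJob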

The contents of the k8sgpt.yaml config file passed to the --config option:

active_filters:
    - ValidatingWebhookConfiguration
    - PersistentVolumeClaim
    - StatefulSet
    - Node
    - MutatingWebhookConfiguration
    - Service
    - Ingress
#    - CronJob
    - Pod
    - Deployment
    - ReplicaSet
#    - Log
#    - Gateway
#    - HTTPRoute
    - HorizontalPodAutoScaler
    - PodDisruptionBudget
    - NetworkPolicy
#    - GatewayClass
ai:
    providers:
        - name: localai
          model: llama3
          baseurl: http://ollama:11434/v1
          temperature: 0.7
          topp: 0.5
          topk: 50
          maxtokens: 2048
    defaultprovider: localai

By the way, if you are using the OpenAI backend, there is an --anonymize option which (supposedly) hides private data. I did not use this option since I am using a localai backend, and the results are not shared outside my company.
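
For reference, it is just another flag on the analyze command; with OpenAI it would look something like this (a sketch, not something I ran):

# scan with the OpenAI backend, masking object names before they are sent to the API
k8sgpt analyze --explain --backend openai --anonymize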

The Results

This is the moment of truth. Let’s have a look at some example results, obfuscated for privacy.

From the HorizontalPodAutoscaler scan:

Error: HorizontalPodAutoscaler uses Deployment/<some-deployment> as ScaleTargetRef which does not exist.

Solution: 
1. Check if the correct object type (e.g., ReplicationController, StatefulSet) is used in the ScaleTargetRef.
2. Ensure that the namespace and name of the target object are correctly specified.
3. Verify that the scale target ref exists in the desired state.

From the Ingress scan:

<some-namespace>/<some-ingress>
- Error: Ingress uses the service <some-service> which does not exist.


Error: The Ingress is referencing non-existent services, including <some-service>.

Solution: 
1. Check if the services exist in the current namespace or another namespace.
2. Verify the spelling and naming conventions of the services.
3. Create the missing services using `kubectl expose` command or by deploying a new pod/service that provides the required API.

From the Pod Disruption Budget scan:

Error: InsufficientPods, expected pdb pod label app.kubernetes.io/instance=<some-service> and app.kubernetes.io/name=<some-app>.

Solution:
1. Verify the labels are correctly applied to the PDB (Pod Disruption Budget) pods.
2. Check if the PDB is associated with the correct namespace and service account.
3. Ensure the PDB is configured correctly and meets the minimum requirements.

Note: This error typically occurs when the expected number of replicas in the deployment is not met, or there's an issue with the Pod Disruption Budget (PDB) configuration.

From the Pod scan:

some-namespace/some-pod (Deployment/some-deployment)

- Error: back-off 5m0s restarting failed container=app pod=foobar-587b5976bc-c9s (326ab939-c84e-4469-af33-b45963e7afb4)
- Error: the last termination reason is OOMKilled container=app pod=foobar-587b5976bc-c9s 

Error: The container is experiencing Out Of Memory (OOM) issues and was terminated (OOMKilled). A back-off attempt is being made to restart the failed container.

Solution: 
1. Check application memory usage and optimize as needed.
2. Increase pod's resources (e.g., `requests` or `limits`) in deployment.yaml.
3. Consider upgrading node's RAM or adding more nodes to distribute load.
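
Acting on suggestion 2 can be as simple as patching the deployment’s memory settings with kubectl. A sketch using the obfuscated names from the report above; the values are purely illustrative:

# raise the memory request/limit for the "app" container of the affected deployment
kubectl -n some-namespace set resources deployment/some-deployment \
  -c app --requests=memory=256Mi --limits=memory=512Mi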

From the Service scan:

some-namespace/some-service
- Error: Service has no endpoints, expected label app.kubernetes.io/app=some-service

Solution:
1. Verify that the service is created with the expected labels.
2. Check if the service pods are running successfully (kubectl get pods -n <namespace>).
3. If the service pods are not running, check logs for errors (kubectl logs <pod_name> -n <namespace>).
4. If the issue persists, inspect the service YAML configuration file and update any missing or incorrect labels.
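
To follow up on a finding like this, comparing the Service selector with the pod labels usually pinpoints the mismatch (again using the obfuscated names from the report):

# check whether the Service currently has any endpoints
kubectl -n some-namespace get endpoints some-service

# compare the Service selector with the labels on the running pods
kubectl -n some-namespace get service some-service -o jsonpath='{.spec.selector}'
kubectl -n some-namespace get pods --show-labels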

My Impressions

As you can see from the scan results, the built-in analyzers do catch potential misconfigurations and some errors in the cluster. But personally, I expected more from the AI-enhanced explanations. Maybe it’s because the LLM I used was a smaller one.

Nonetheless, playing around with the tool allowed me to learn some new things about LLMs, especially using AI with local LLMs.

As for the usefulness of the scan results, if you already monitor your cluster with tools like Kyverno and Prometheus, some of the output from k8sgpt can seem a bit redundant. However, I do like the fact that there is very little configuration required to get started with k8sgpt — the built-in analyzers seem to do a good job.

This project is fun and something to keep an eye on, but like most things AI these days, it promises more than it delivers.
