Self-Hosting Open Source AI Agents on a Kubernetes Cluster

Deploying and scaling AI models is a significant challenge. Cloud solutions are costly and raise concerns about data privacy. Kube-AI bridges the gap, providing a Kubernetes operator that automates the deployment, scaling, and management of AI models.

Alec Di Vito
Your new AI coding agent that runs at home

The AI apocalypse is coming for our jobs, right? Okay, maybe not quite yet. But the rise of powerful AI models has shifted from a futuristic fantasy to a present-day capability. So what's the best way to slow them down and save your job? Stop giving them all of your damn data and run the models at home.

I'm giving it a try by using Kube-AI: a Kubernetes operator that allows you to self-host models and dynamically scale them based on demand. It deploys with a user-friendly interface called Open Web UI to make interacting with the models a breeze. Overall, it's a compelling solution for anyone wanting more control over their data and AI infrastructure.

Best of all, this can all be configured with GitOps. Restoring or duplicating the AI capabilities of a cluster is as easy as pointing ArgoCD at your git repo and boom! You've got your AI cluster back. Stop giving your money to Anthropic, Google, OpenAI or OpenRouter; you can now run those sweet, sweet open source models at home.
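To make the GitOps claim concrete, here is a minimal sketch of an ArgoCD Application that syncs a Kube-AI directory from a git repo. The repo URL, path and namespaces are placeholders for illustration, not my actual setup.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kubeai
  namespace: argocd
spec:
  project: default
  source:
    # Placeholder repo and path; point these at your own GitOps repository
    repoURL: https://github.com/your-user/home-cluster.git
    targetRevision: main
    path: apps/kubeai
  destination:
    server: https://kubernetes.default.svc
    namespace: kubeai
  syncPolicy:
    automated:
      prune: true
      selfHeal: true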

I'm already running this at home on my server: an i5-13600KF, 64GB of DDR5 and a 1660 Super with 6GB of VRAM. For example, here is me using Gemma3, a 4-billion-parameter model, running on my CPU.

The video has been sped up 2x because it's a cold start. Thanks, ffmpeg!

And that's not all: I can also run models on my GPU, just much smaller ones. For example, I run the llama3.2 1-billion-parameter model on my GPU at home. It's a little bit faster than using my CPU, that's for sure!


Join me as I delve into the world of self-hosting AI, sharing the tricks and techniques that make it all… well, enjoyable.

What I've got deployed at home

I've spent a considerable amount of time getting this working at home on consumer hardware. I only do this for projects I truly love, projects like Kanidm and Immich. I think Kube-AI with Open Web UI also offers a great experience for AI models. It's a complete solution that works, and I use it occasionally, so I wanted to share how I deploy it.

🎤
I did the math, and building out my full solution took about 3,000 lines of YAML and days of trial and error. For that reason, this post will only focus on the green box in the diagram below.
The overview of my deployment of Kube AI and how I am able to utilize my own AI models

The diagram above shows all of the software that I rely on to host my version of Kube-AI and Open Web UI. As you can see, I also host my own search engine (SearXNG), Redis (for web sockets), an authentication server (for SSO) and a Traefik load balancer (for terminating HTTPS requests). For learning purposes you won't need all of these components, but if you want to roll a solution like this out for a company, you should consider them.
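To show how the Traefik piece fits in, below is a hedged sketch of the kind of IngressRoute that terminates HTTPS in front of Open Web UI. The hostname and cert resolver are placeholders, and older Traefik releases use the traefik.containo.us/v1alpha1 API group instead.

apiVersion: traefik.io/v1alpha1  # older Traefik releases: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: openwebui
  namespace: kubeai
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`chat.example.internal`)  # placeholder hostname
      kind: Rule
      services:
        - name: openwebui  # matches the fullnameOverride used later in this post
          port: 8080
  tls:
    certResolver: letsencrypt  # assumes a cert resolver is already configured on Traefik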

What you can expect from this article is the YAML for a bare-bones deployment of Kube-AI and Open Web UI. Getting SSO working on Open Web UI, spinning up a Postgres database and hosting your own search engine with SearXNG won't be covered. I still expect you to use the official documentation to configure the software for your setup.

Kube AI

A self-hostable Kubernetes operator for autoscaling AI models on your overbuilt home lab


Kube-AI is a Kubernetes operator that manages your open and closed source AI models and deploys them in containers. The containers that run the AI models are based on Ollama or vLLM, and which engine is used is configurable per model. Ollama is normally recommended for CPU-based models, while vLLM is better suited for GPU-based models.

Models are managed through CRDs (Custom Resource Definitions). The definition is simple: you point at a model with a URL and let Kube-AI handle downloading it and running it in a container on the cluster.

Kube-AI will then act as the interface and load balancer for interacting with the models registered in the cluster.

Kube AI overview of how it works

The operator works by letting an administrator add AI model CRDs to the cluster, which consumers can then use through the OpenAI API interface that Kube-AI exposes. Depending on the configuration you've set up, Kube-AI will deploy a number of model replicas to support the given load.

Deploying Kube-AI

I suggest using the Helm chart provided by Kube-AI, as it provides a nice abstraction for configuring the deployment of the operator. There are two key attributes to configure: resourceProfiles and open-webui.

💡
Open Web UI configuration is much harder, so we'll cover it further down the page.

A resource profile represents a type of compute that exists on your cluster. You pair a resource profile with an AI model so that the compute the model expects is requested when the pod for the model is created. This is great when you have many different types of hardware in a bare-metal cluster, such as different CPUs and GPUs across many machines.

Below is what I've configured.

secrets:
  # Support pulling models from huggingface
  huggingface:
    create: false
    name: hugging-face-token

resourceProfiles:
  cpu-large:
    imageName: "cpu"
    requests:
      cpu: 6
      memory: "42Gi"
  cpu-small:
    imageName: "cpu"
    requests:
      cpu: 2
      memory: "6Gi"
  nvidia-1660-super:
    runtimeClassName: nvidia
    imageName: nvidia-1660-super
    limits:
      nvidia.com/gpu: "1"
    requests:
      nvidia.com/gpu: "1"
      cpu: "2"
      memory: "6Gi"
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

In the example above, I've created profiles for large and small CPU workloads and one for my 1660 Super. These resource profiles can be referenced when we declare our AI models, matching a given model with a given set of compute. The values declared here are applied to the pod that is created to run the AI model once it spins up.

💡
If you are wondering where the nvidia.com/gpu resource comes from, check out https://github.com/NVIDIA/k8s-device-plugin
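The nvidia-1660-super profile above also references runtimeClassName: nvidia. If your cluster doesn't already define that runtime class, it looks roughly like the sketch below. This assumes your container runtime has an nvidia handler configured by the NVIDIA container toolkit, so treat it as an illustration rather than a drop-in manifest.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# Assumes containerd (or your runtime) has a handler named "nvidia",
# which the NVIDIA container toolkit / GPU Operator setup provides.
handler: nvidia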

Kube AI Model CRD

Once you've created some resource profiles, you can start registering models in your cluster.

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: gemma3-4b
spec:
  engine: OLlama
  env:
    GIN_MODE: release
    OLLAMA_CONTEXT_LENGTH: "100000"
  features:
  - TextGeneration
  resourceProfile: cpu-large:1
  scaleDownDelaySeconds: 100
  targetRequests: 16
  url: ollama://gemma3:4b

The above is a definition for the gemma3:4b model that uses my cpu-large:1 resource profile (42GB of RAM, 6 CPUs) and is configured with a 100,000-token context length. It doesn't actually need that much compute, since it only takes up about 12.5GB of memory. The engine is Ollama, so the model is pulled from the Ollama registry.

Declaring a resource for my 1660 Super looks a bit different.

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama3-1b
spec:
  args:
  - --enforce-eager
  - --dtype=half
  - --max_model_len=48000
  - --chat-template=/mnt/chat-template.jinja
  engine: VLLM
  features:
  - TextGeneration
  files:
  - content: |
      {{- bos_token }}
      {{- "<|start_header_id|>system<|end_header_id|>\n\n" }}

      {{- "You are a helpful assistant. You reply directly in natural language" }}
      {{- "<|eot_id|>" }}

      {%- for message in messages %}
          {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
          {%- if message['content'] is string %}
              {{- message['content'] | trim}}
          {%- else %}
              {%- for content in message['content'] %}
                  {%- if content['type'] == 'text' %}
                      {{- content['text'] | trim }}
                  {%- endif %}
              {%- endfor %}
          {%- endif %}
          {{- '<|eot_id|>' }}
      {%- endfor %}
    path: /mnt/chat-template.jinja
  resourceProfile: nvidia-1660-super:1
  scaleDownDelaySeconds: 60
  targetRequests: 32
  url: hf://meta-llama/Llama-3.2-1B-Instruct

In this example, I'm using my nvidia-1660-super resource profile, which runs the vLLM container. vLLM expects a chat template to be passed in to bootstrap the conversation. I've created my own very simple one for testing purposes.

💡
There is more to chat templates than I cover here. Hugging Face has more information: https://huggingface.co/docs/transformers/main/en/chat_templating

Finally, I pass in some custom arguments to support running on an old GPU. Most models ship with BF16 tensors, which my 1660 Super doesn't support, so they won't run unless I set --dtype=half. If you have a 20xx-series card or newer, you probably won't hit this issue.

Keeping AI models running

When configuring a model, you can decide how it should run. For example, you might want to always keep one replica running with a maximum of five. The model should then scale up if the average number of in-flight requests goes above a threshold. Finally, when requests to the model slow down, it can be configured to wait some time before scaling down, because loading a model from a cold start can take a while.

Given the above description, let's take a look at the model CRD below. Here we have a model that always has one replica running on Ollama. If the average number of requests per replica goes over 5, it will scale up another replica. When the average number of requests drops back down, it will wait 60 seconds before scaling the model down.

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: deepcoder-1-5b
spec:
  engine: OLlama
  env:
    GIN_MODE: release
    OLLAMA_CONTEXT_LENGTH: "64000"
  features: [TextGeneration]
  resourceProfile: cpu-small:1
  minReplicas: 1
  maxReplicas: 5
  scaleDownDelaySeconds: 60
  targetRequests: 5
  url: ollama://deepcoder:1.5b

The average number of requests is calculated on a schedule that is configured in the operator's Helm chart at deployment time. The autoscaler uses these values to decide when a model should scale up or down.

Below is an example where, every 30 seconds, Kube-AI re-evaluates whether a model should scale up or down, averaging over a 3-minute time window.

modelAutoscaling:
  # Interval at which the autoscaler will scrape model server metrics
  # and calculate the desired number of replicas.
  interval: 30s
  # Time window the autoscaling algorithm will consider when calculating
  # the desired number of replicas.
  timeWindow: 3m

Model Performance

Not all models are created equal, and a newer model with more parameters might not always live up to your expectations (cough cough qwen2.5-14B). You'll need to test each one against your use cases to determine if it's a good fit for your work. I've tried out quite a few to figure out how they behave on my hardware.

A list of all of the models that I'm hosting on my cluster

Of the models I've deployed, I've really liked the performance of the llama3 family and gemma3. When trying to code a simple todo application, I found the gemma3 models performed better, but none of the models could figure out drag and drop, sadly 😦. Kube-AI makes testing out many models a breeze.

Open Web UI

For when you want a ChatGPT-style UI but would rather pay your power bill than a billion-dollar corporation

🏡 Home | Open WebUI
Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform designed to operate entirely offline. It supports various LLM runners like Ollama and OpenAI-compatible APIs, with built-in inference engine for RAG, making it a powerful AI deployment solution.

The problem with Kube-AI is that its OpenAI API interface is an unauthenticated HTTP endpoint. Kube-AI therefore recommends running Open Web UI in front of it to add authentication and a nice web UI so that normal people, like family and business folks, can use it. Users can select from the models registered in Kube-AI and chat with them just like they're used to with providers like ChatGPT and Gemini.

It has many features, some of which I still haven't made use of. One that I use sometimes is the web search feature, which uses a search engine to find some links and build a small RAG database that it then uses to answer your query. However, I found that the quality of the web search can be hit or miss.

Llama3.2-1B answering my query about a fun fact about the Roman Empire

There is also a community platform where people share add-ons for completing different tasks. For example, ChatGPT has the idea of a canvas: when HTML, CSS and JavaScript are produced, it renders them right in the chat you're in. In Open Web UI, you can download ArtifactV3, which will show a preview of the generated code in your chat window.


Many more community-created add-ons exist. The downside is that add-ons run on the Open Web UI server, so you should read any code you add to confirm it isn't doing anything that would compromise your server.

You can explore some of the other functions below. They can provide cool functionality like getting the current time or the weather for a given location, basically extending what the model is capable of doing (sort of like Anthropic's MCP).

Functions | Open WebUI Community

A list of functions that are available to add to your instance of Open Web UI

Aside from chat, you can also configure Open Web UI to create users when they log in through SSO. For example, I have it configured so that when someone logs in through Kanidm, a user is created for them. I did hit an issue where regular users couldn't see any of the models, so I made them all admins.

I'll need to review that later 😅 Luckily I don't have users that care all that much.

You can then also create API tokens which can be used for developing applications. I did this when I was researching how to build my own RAG database: instead of hosting a locally running LLM, I targeted my server.

One feature I've heard about that might be useful for business is that you can let users create API tokens tied to their accounts, then configure your instance to use something like OpenRouter to give everyone access to paid models such as Claude 3.7 Sonnet. You can then use a function (an Open Web UI term) to limit how much a given model can be used. With these add-ons, you can provide an experience similar to Cursor, where users get limited access to premium models up to a point, after which requests are denied and they have to fall back to the self-hosted models, which aren't going to be as good.

Running Open Web UI

You configure Open Web UI through the same Kube-AI Helm chart described in the last section; Kube-AI simply includes the official Open Web UI Helm chart as a subchart.

The hardest part of configuring Open Web UI is the sheer number of integrations you might want to consider. I believe there are over 100 environment variables, each enabling certain features. For my configuration, I set 48 of them.

Below is a simple example of a configuration that you can use. Pay particular attention to the fact that Ollama is disabled in Open Web UI, because all of our AI functionality should come through Kube-AI's OpenAI API interface instead.

open-webui:
  enabled: true
  ollama:
    enabled: false
  pipelines:
    enabled: false
  websocket:
    enabled: true
    manager: redis
    url: redis://redis:6379/1
    redis:
      enabled: false
  redis-cluster:
    enabled: false
  openaiBaseApiUrl: "http://kubeai/openai/v1"
  service:
    type: ClusterIP
    port: 8080
  fullnameOverride: "openwebui"
  ingress:
    enabled: false
  persistence:
    enabled: false
  autoscaling:
    enabled: false
  extraEnvVars:
    - name: MAIN_LOG_LEVEL
      value: "10"
    - name: PORT
      value: "8080"
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: open-webui-db-app
          key: uri
    - name: WEBUI_AUTH
      value: "true"
    - name: WEBUI_NAME
      value: "Open WebUI"
    - name: AIOHTTP_CLIENT_TIMEOUT
      value: "50000"
    - name: ENABLE_OLLAMA_API
      value: "false"
    - name: ENABLE_OPENAI_API
      value: "true"
    - name: OPENAI_API_KEY
      value: "not-defined"
    - name: RAG_EMBEDDING_ENGINE
      value: openai
    - name: RAG_EMBEDDING_MODEL
      value: "nomic-embed-text"
    - name: RAG_OPENAI_API_BASE_URL
      value: http://kubeai/openai/v1
    - name: RAG_OPENAI_API_KEY
      value: not-defined

💡
Pay special attention to AIOHTTP_CLIENT_TIMEOUT and set it to something high; otherwise you might find that AI responses which take longer than 5 minutes stop being written to the web page.

Once the configuration is up and running, you can start testing out AI models!

AI Models

Tips and tricks for getting AI models running on Kube-AI

Getting AI models to run can be very tricky. I've found the most user-friendly experience is to stick with Ollama models as much as possible, since they come with sensible defaults out of the box that let you avoid thinking too much about configuration. Ollama is also great if you only have access to a CPU, making larger models easier to run, albeit a bit slowly.

One key piece of information for running Ollama models is that the context length defaults to 2,048 tokens, which is nothing. I recommend changing it using the OLLAMA_CONTEXT_LENGTH environment variable; pick the biggest number the model supports that your computer can handle.

When using a GPU and running models with vLLM, remember to include a chat template; there are many examples to choose from for your given model. Depending on the model, you might find that it doesn't fit on your GPU, but you can extend the memory available to it with the --cpu-offload-gb=<number> flag so that the model also uses system RAM. In my experience this performed very poorly, so unless you have high bandwidth to CPU memory, I wouldn't recommend it. Also be sure to set the desired context length using the --max_model_len=<number> flag.
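For reference, here is a hedged sketch of how those flags fit into a Model definition. The model name, offload size and context length are illustrative values only, and whether CPU offloading helps at all depends on your hardware.

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: example-gpu-model    # illustrative name
spec:
  engine: VLLM
  features: [TextGeneration]
  resourceProfile: nvidia-1660-super:1
  url: hf://meta-llama/Llama-3.2-1B-Instruct
  args:
  - --dtype=half             # needed on older GPUs without BF16 support
  - --max_model_len=32000    # desired context length (illustrative value)
  - --cpu-offload-gb=4       # spill part of the model into system RAM (slow; use sparingly)
  # A --chat-template argument would normally also be included, as shown earlier.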

In Conclusion

Ultimately, this journey has demonstrated the power and versatility of Kube-AI with Open Web UI. While deploying and configuring these tools – especially when dealing with the nuances of GPU memory and context lengths – presented several challenges, the reward is a powerful, self-hosted AI environment.

This article highlights the potential for individuals to engage directly with and experiment on AI models and, with the right configuration, to open up opportunities for sharing and collaboration on self-hosted infrastructure. I believe that Kube-AI with Open Web UI, combined with tools like Ollama and vLLM, represents a significant step towards making access to AI technology easier and less expensive.