Artificial Intelligence (AI) has become a hot topic in the DevOps community, particularly in the context of Kubernetes. Recognising its significance, we decided to evaluate this topic, as it is highly relevant to our customers. Our primary focus was to assess the maturity of various AI tools in handling Kubernetes-specific operations and to identify the risks that must be considered when using such tools. Second, the evaluation serves as a foundation for future testing, as we believe it is important to continue exploring and refining these tools. The AI tools under consideration utilise Foundation Models to collect contextual data from Kubernetes clusters and build prompts with as much context as possible. In a nutshell, the key question driving this evaluation was: "Which model is best for handling Kubernetes-specific tasks, and what level of trust should be placed in these models for Kubernetes operations?"
Target of Evaluation
The evaluation aimed to determine the effectiveness of Large Language Models (LLMs) in solving specific, technically demanding tasks within the Kubernetes environment. This involved assessing the performance of multiple LLMs from different providers, including Google, Anthropic, Mistral AI, OpenAI, and Meta. The goal was to identify which model could handle Kubernetes-specific day-2 tasks most effectively.
How We Evaluated
To address the research question, "Are LLMs capable of solving specific, technically demanding tasks in the Kubernetes environment?", the evaluation process involved several steps:
- Approach and use cases: We studied two primary sets of use cases: Kubernetes day-2 operations (KillerCoda scenarios) and SecOps tasks (analysis of issues reported by Trivy). We formulated each problem as a set of basic and enriched prompts, ran them against multiple LLMs, graded the responses both manually and automatically, and captured the results with a prompt evaluation tool; a minimal sketch of this prompt-building flow follows this list. The problems were analysed from the perspectives of beginner, intermediate, and senior DevSecOps engineers. We also analysed and compared several additional use cases to explore design, architecture, and conceptual challenges across different LLMs.
- Selection of LLMs: We tested multiple LLMs from various providers to ensure a comprehensive evaluation. The selected models included those from Google, Anthropic, Mistral AI, OpenAI, and Meta.
- Setup: The setup combined LLM provider APIs, Promptfoo, kube-linter, and kind; the results were collected in JSON format. Figure 1 (Evaluation setup) illustrates this. A sketch of how kube-linter can be used to auto-grade generated manifests also follows this list.
- Evaluation Tool: We used Promptfoo, a tool for evaluating and comparing LLM prompts and outputs, which we applied here in a Platform Engineering context. It provided various metrics, including pass rates and performance scores. Promptfoo's interface supports creating new evaluations, managing prompts and datasets, and tracking progress, and it visualises results through bar charts, scatter plots, and detailed tables showing how different LLMs performed on specific tasks.
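To make the prompt-building step more concrete, below is a minimal sketch of how a basic and an enriched prompt could be assembled for a single day-2 task and sent to one model. The task text, the kubectl commands used for context, and the model name are illustrative placeholders, not the actual test cases; the real cases came from KillerCoda scenarios and Trivy reports and were executed through Promptfoo across all providers.

```python
"""Minimal sketch: basic vs. enriched prompt for one Kubernetes day-2 task.

The task, the kubectl commands, and the model are placeholders; the real
evaluation ran its test cases through Promptfoo against multiple providers.
"""
import subprocess
from openai import OpenAI  # pip install openai


def cluster_context() -> str:
    """Collect a small amount of live context from the (kind) cluster."""
    commands = [
        ["kubectl", "get", "nodes", "-o", "wide"],
        ["kubectl", "get", "events", "--all-namespaces", "--sort-by=.lastTimestamp"],
    ]
    outputs = [
        subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        for cmd in commands
    ]
    return "\n".join(outputs)


TASK = "A worker node reports NotReady after a reboot. Diagnose the cause and propose a fix."

basic_prompt = f"You are a DevSecOps engineer. {TASK}"
enriched_prompt = f"{basic_prompt}\n\nCurrent cluster state:\n{cluster_context()}"

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any provider/model under test can be substituted
    messages=[{"role": "user", "content": enriched_prompt}],
)
print(response.choices[0].message.content)
```

The enriched prompt simply concatenates live cluster output onto the basic prompt; the resulting responses were then graded from the beginner, intermediate, and senior perspectives described above.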
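For the automated part of the grading, responses containing Kubernetes manifests can be checked mechanically. The snippet below sketches that idea, assuming kube-linter is on the PATH and the YAML has already been extracted from the model's answer; in the actual setup such checks were wired into Promptfoo and the results stored as JSON.

```python
"""Sketch: mechanically grade an LLM-generated manifest with kube-linter.

Assumes kube-linter is installed and the YAML has already been extracted
from the model response; the real pipeline ran such checks via Promptfoo.
"""
import subprocess
import tempfile


def lint_manifest(manifest_yaml: str) -> bool:
    """Return True if kube-linter reports no issues for the given manifest."""
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as handle:
        handle.write(manifest_yaml)
        path = handle.name
    result = subprocess.run(
        ["kube-linter", "lint", path, "--format", "json"],
        capture_output=True,
        text=True,
    )
    # kube-linter exits with a non-zero code when it finds lint errors;
    # result.stdout holds the detailed JSON report for manual review.
    return result.returncode == 0


llm_answer_yaml = "..."  # placeholder: manifest extracted from the model response
print("PASS" if lint_manifest(llm_answer_yaml) else "FAIL")
```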
Quantitative Results
The evaluation yielded the following quantitative results:
- Overall Performance: The overall pass rates for the tested LLMs ranged between 51.7% and 53.9%, with the combined average being 52.7%.
- Task-Specific Performance: The performance of the LLMs varied significantly across different task categories. For instance, tasks like "Node-Join" and "Scheduling Priority" saw higher success rates, while more complex tasks like "Cluster Upgrade" and "NetworkPolicy" had lower success rates.
- Summary: The results indicate that the success rate of LLMs decreases as task complexity increases. The following illustration shows this in more detail.
Qualitative Results
In addition, our evaluation produced the following qualitative findings:
- The LLMs built into existing AI tools provide lower-quality responses than current-generation foundation models.
- The optimal way to use LLMs is through human-in-the-loop scenarios.
- LLMs can be thought of as unreliable search engines: response quality depends on the training corpus.
- As LLMs get better at handling simple and obvious problems, detecting their misrepresentations becomes more challenging.
- Critical thinking and experience are key differentiators in extracting maximum value from LLMs. Senior DevOps engineers gain more value than junior engineers: the more domain knowledge you have, the faster you can identify false positives and interpret LLM responses.
Our Recommendations
As a result of this evaluation, we recommend the following when using LLMs for DevOps work:
- Educate teams on the importance and methods of critical thinking through tailored training sessions and workshops.
- Be aware of the cognitive load involved (compensating for misleading LLM responses may take more time than consulting the documentation or asking a colleague).
- Use “calibrating” questions to infer each LLM's training corpus and gauge its ability to provide useful responses (“reverse” Retrieval-Augmented Generation, RAG); see the sketch after this list.
- For corporate users, it is especially important to implement AI/LLM FinOps practices to track costs and ROI across multiple scenarios (SaaS and on-premises solutions); the methodology we developed supports this.
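As an illustration of the “calibrating” questions mentioned above, the sketch below asks a few models what they know about recent Kubernetes releases, so the answers hint at how fresh each model's training corpus is. The probe question and the model identifiers are illustrative assumptions, not the questions used in the evaluation.

```python
"""Sketch: a "calibrating" question to probe how recent a model's Kubernetes knowledge is.

The probe question and the model names are illustrative placeholders; the idea
is to compare the answers against facts whose release dates you already know.
"""
from openai import OpenAI  # pip install openai

PROBE = (
    "What is the newest Kubernetes minor version you know about, "
    "and which notable features did it introduce?"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
for model in ["gpt-4o", "gpt-3.5-turbo"]:  # placeholders for the models under test
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROBE}],
    )
    # A vague or outdated answer suggests the training corpus predates recent releases.
    print(f"--- {model} ---\n{answer.choices[0].message.content}\n")
```

Comparing such answers against the official release notes gives a rough sense of each model's knowledge cutoff before trusting it with day-2 questions.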
Future Evaluations
Looking ahead, we defined several key areas for future assessments:
- Expansion to a Full Benchmark: The evaluation will be expanded to a comprehensive benchmark for Kubernetes tasks, including more granular assessments and additional tasks.
- Evaluation of Additional LLMs/AI Tools: Future evaluations will include more LLMs and AI tools to determine the impact of various factors, such as Retrieval-Augmented Generation (RAG).
- Monitoring Future Developments: Continuous observation of advancements in AI and Kubernetes will be crucial to keep the evaluation framework up-to-date and relevant.
In conclusion, while LLMs show promise in handling simpler Kubernetes tasks, there is significant room for improvement in tackling more complex challenges. The developed methodology provides a solid foundation for ongoing and future evaluations, ensuring that we stay at the forefront of AI and Kubernetes integration.