Artificial Intelligence (AI) has become a hot topic in the DevOps community, particularly in the context of Kubernetes. Recognising its significance, we decided to evaluate this topic, as it is highly relevant to our customers. Our primary focus was to assess the maturity of various AI tools in handling Kubernetes-specific operations and to identify the risks that must be considered when using such tools. Second, the evaluation serves as a foundation for future testing, as we believe it is important to continue exploring and refining these tools. The AI tools under consideration utilise Foundation Models to collect contextual data from Kubernetes clusters and build prompts with as much context as possible. In a nutshell, the key question driving this evaluation was: "Which model is best for handling Kubernetes-specific tasks, and what level of trust should be placed in these models for Kubernetes operations?"
Target of Evaluation
The evaluation aimed to determine the effectiveness of Large Language Models (LLMs) in solving specific, technically demanding tasks within the Kubernetes environment. This involved assessing the performance of multiple LLMs from different providers, including Google, Anthropic, Mistral AI, OpenAI, and Meta. The goal was to identify which model could handle Kubernetes-specific day-2 tasks most effectively.
How We Evaluated
To address the research question, "Are LLMs capable of solving specific, technically demanding tasks in the Kubernetes environment?", the evaluation process involved several steps:
- Approach and use cases: We studied two primary sets of use cases: Kubernetes day-2 operations (KillerCoda scenarios) and SecOps tasks (analysis of issues reported by Trivy). We formulated each problem as a set of basic and enriched prompts, ran them against multiple LLMs, graded the responses both manually and automatically, and captured the results with a prompt evaluation tool; a minimal sketch of this prompt-building flow follows this list. The problems were analysed from the perspectives of beginner, intermediate, and senior DevSecOps engineers. We also analysed and compared several additional use cases to explore design, architecture, and conceptual challenges across different LLMs.
- Selection of LLMs: We tested multiple LLMs from various providers to ensure a comprehensive evaluation. The selected models included those from Google, Anthropic, Mistral AI, OpenAI, and Meta.
- Setup: The setup combined LLM provider APIs, Promptfoo, kube-linter, and kind; the results were collected in JSON format. Figure 1 (Evaluation setup) illustrates this. A sketch of how kube-linter can be used to auto-grade generated manifests also follows this list.
- Evaluation Tool: We used Promptfoo, a tool for evaluating and comparing LLM prompts and outputs, which we applied here in a Platform Engineering context. It provided various metrics, including pass rates and performance scores. Promptfoo's interface supports creating new evaluations, managing prompts and datasets, and tracking progress, and it visualises results through bar charts, scatter plots, and detailed tables showing how different LLMs performed on specific tasks.
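To make the prompt-building step more concrete, below is a minimal sketch of how a basic and an enriched prompt could be assembled for a single day-2 task and sent to one model. The task text, the kubectl commands used for context, and the model name are illustrative placeholders, not the actual test cases; the real cases came from KillerCoda scenarios and Trivy reports and were executed through Promptfoo across all providers.

```python
"""Minimal sketch: basic vs. enriched prompt for one Kubernetes day-2 task.

The task, the kubectl commands, and the model are placeholders; the real
evaluation ran its test cases through Promptfoo against multiple providers.
"""
import subprocess
from openai import OpenAI  # pip install openai


def cluster_context() -> str:
    """Collect a small amount of live context from the (kind) cluster."""
    commands = [
        ["kubectl", "get", "nodes", "-o", "wide"],
        ["kubectl", "get", "events", "--all-namespaces", "--sort-by=.lastTimestamp"],
    ]
    outputs = [
        subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        for cmd in commands
    ]
    return "\n".join(outputs)


TASK = "A worker node reports NotReady after a reboot. Diagnose the cause and propose a fix."

basic_prompt = f"You are a DevSecOps engineer. {TASK}"
enriched_prompt = f"{basic_prompt}\n\nCurrent cluster state:\n{cluster_context()}"

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any provider/model under test can be substituted
    messages=[{"role": "user", "content": enriched_prompt}],
)
print(response.choices[0].message.content)
```

The enriched prompt simply concatenates live cluster output onto the basic prompt; the resulting responses were then graded from the beginner, intermediate, and senior perspectives described above.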
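For the automated part of the grading, responses containing Kubernetes manifests can be checked mechanically. The snippet below sketches that idea, assuming kube-linter is on the PATH and the YAML has already been extracted from the model's answer; in the actual setup such checks were wired into Promptfoo and the results stored as JSON.

```python
"""Sketch: mechanically grade an LLM-generated manifest with kube-linter.

Assumes kube-linter is installed and the YAML has already been extracted
from the model response; the real pipeline ran such checks via Promptfoo.
"""
import subprocess
import tempfile


def lint_manifest(manifest_yaml: str) -> bool:
    """Return True if kube-linter reports no issues for the given manifest."""
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as handle:
        handle.write(manifest_yaml)
        path = handle.name
    result = subprocess.run(
        ["kube-linter", "lint", path, "--format", "json"],
        capture_output=True,
        text=True,
    )
    # kube-linter exits with a non-zero code when it finds lint errors;
    # result.stdout holds the detailed JSON report for manual review.
    return result.returncode == 0


llm_answer_yaml = "..."  # placeholder: manifest extracted from the model response
print("PASS" if lint_manifest(llm_answer_yaml) else "FAIL")
```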
Quantitative Results
The evaluation yielded the following quantitative results:
- Overall Performance: The overall pass rates for the tested LLMs ranged between 51.7% and 53.9%, with the combined average being 52.7%.
- Task-Specific Performance: The performance of the LLMs varied significantly across different task categories. For instance, tasks like "Node-Join" and "Scheduling Priority" saw higher success rates, while more complex tasks like "Cluster Upgrade" and "NetworkPolicy" had lower success rates.
- Summary: The results indicate that the success rate of LLMs decreases as task complexity increases. The following illustration shows this in more detail.
Qualitative Results
In addition, our evaluation produced the following qualitative findings:
- The LLMs built into existing AI tools provide lower-quality responses than current-generation foundation models.
- The optimal way to use LLMs is through human-in-the-loop scenarios.
- LLMs can be thought of as unreliable search engines: response quality depends on the training corpus.
- As LLMs get better at handling simple and obvious problems, detecting their misrepresentations becomes more challenging.
- Critical thinking and experience are key differentiators in extracting maximum value from LLMs. Senior DevOps engineers gain more value than junior engineers: the more domain knowledge you have, the faster you can identify false positives and interpret LLM responses.
Our Recommendations
As a result of this evaluation, we recommend the following when using LLMs for DevOps work:
- Educate teams on the importance and methods of critical thinking through tailored training sessions and workshops.
- Be aware of the cognitive load involved (compensating for misleading LLM responses may take more time than consulting the documentation or asking a colleague).
- Use “calibrating” questions to infer each LLM's training corpus and gauge its ability to provide useful responses (“reverse” Retrieval-Augmented Generation, RAG); see the sketch after this list.
- For corporate users, it is especially important to implement AI/LLM FinOps practices to track costs and ROI across multiple scenarios (SaaS and on-premises solutions); the methodology we developed supports this.
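As an illustration of the “calibrating” questions mentioned above, the sketch below asks a few models what they know about recent Kubernetes releases, so the answers hint at how fresh each model's training corpus is. The probe question and the model identifiers are illustrative assumptions, not the questions used in the evaluation.

```python
"""Sketch: a "calibrating" question to probe how recent a model's Kubernetes knowledge is.

The probe question and the model names are illustrative placeholders; the idea
is to compare the answers against facts whose release dates you already know.
"""
from openai import OpenAI  # pip install openai

PROBE = (
    "What is the newest Kubernetes minor version you know about, "
    "and which notable features did it introduce?"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
for model in ["gpt-4o", "gpt-3.5-turbo"]:  # placeholders for the models under test
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROBE}],
    )
    # A vague or outdated answer suggests the training corpus predates recent releases.
    print(f"--- {model} ---\n{answer.choices[0].message.content}\n")
```

Comparing such answers against the official release notes gives a rough sense of each model's knowledge cutoff before trusting it with day-2 questions.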
Future Evaluations
Looking ahead, we defined several key areas for future assessments:
- Expansion to a Full Benchmark: The evaluation will be expanded to a comprehensive benchmark for Kubernetes tasks, including more granular assessments and additional tasks.
- Evaluation of Additional LLMs/AI Tools: Future evaluations will include more LLMs and AI tools to determine the impact of various factors, such as Retrieval-Augmented Generation (RAG).
- Monitoring Future Developments: Continuous observation of advancements in AI and Kubernetes will be crucial to keep the evaluation framework up-to-date and relevant.
In conclusion, while LLMs show promise in handling simpler Kubernetes tasks, there is significant room for improvement in tackling more complex challenges. The developed methodology provides a solid foundation for ongoing and future evaluations, ensuring that we stay at the forefront of AI and Kubernetes integration.