Introduction
Cloud native architectures have changed how modern IT environments are operated, and integrating them with agentic AI is now a pressing challenge. This article covers the foundations, tools, and best practices for building multi-agent systems on Kubernetes, illustrated by an ongoing project at Orange Innovation.

Why choose cloud native for agentic AI?
Agentic AI systems share many operational concerns with cloud native architectures: identity management, security policy enforcement, observability, and GitOps-style change management. Cloud native tooling already addresses these concerns, so agents can be structured as distinct Kubernetes workloads, which improves scalability, flexibility, and security.
Good to know
Projects under CNCF and Linux Foundation governance, such as cert-manager and Falco, are easier to justify and adopt in regulated environments.
Key points for building an agentic platform
1. Each agent is an independent Kubernetes workload
For each agent, we use a Kubernetes Deployment with resource limits, its own identity, and distinct restart rules. This approach enables:
- Progressive updates (canary rollouts).
- Namespace isolation.
- Efficient error handling (e.g., API timeout).
Caution
Avoid integrating all agents into a single process: this compromises system resilience in case of failure.
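As a rough illustration of the one-workload-per-agent idea, the sketch below builds a Deployment manifest for a single agent in Python. The agent name, namespace, image, and limit values are hypothetical placeholders, not values from the project, and the manifest only covers the points named above (resource limits, a dedicated identity, an incremental rollout).

```python
import json


def agent_deployment(name, namespace, image, cpu_limit="500m", memory_limit="512Mi"):
    """Return an apps/v1 Deployment manifest (as a dict) for one agent.

    All argument values are illustrative assumptions.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            # Replace pods one at a time so a bad release never removes
            # the running agent before its successor is ready.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # A dedicated ServiceAccount gives the agent its own identity.
                    "serviceAccountName": name,
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {
                                "limits": {"cpu": cpu_limit, "memory": memory_limit}
                            },
                        }
                    ],
                },
            },
        },
    }


def to_json(manifest):
    """Serialize a manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

Writing the output of `to_json(agent_deployment(...))` to a file and passing it to `kubectl apply -f` would create one such workload per agent, each in the namespace you choose.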
2. Inter-agent traffic: use mTLS, not a service mesh
Inter-agent messages are carried over the A2A protocol and contain detection rules and sensitive actions. Combining cert-manager (certificate issuance) with CiliumNetworkPolicy (restricting which workloads may talk to each other) provides mTLS without the operational complexity of a service mesh.
    kubectl apply -f cert-manager-configuration.yaml
This configuration ensures:
- Authentication based on agent identities.
- Granular protection of network communications.
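The contents of cert-manager-configuration.yaml are not shown in this article. As a hedged sketch of what the two pieces might look like, the Python below builds a cert-manager Certificate and a CiliumNetworkPolicy as plain dictionaries. The agent names, the `app` label, the ClusterIssuer name, and the port are all assumptions made for illustration.

```python
def agent_certificate(agent, namespace, issuer):
    """cert-manager Certificate usable for both sides of an mTLS connection.

    `issuer` is assumed to be the name of an existing ClusterIssuer.
    """
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": f"{agent}-mtls", "namespace": namespace},
        "spec": {
            # Secret where cert-manager stores the key pair and CA bundle.
            "secretName": f"{agent}-mtls",
            "dnsNames": [f"{agent}.{namespace}.svc"],
            # Both usages are needed: the agent acts as client and server.
            "usages": ["server auth", "client auth"],
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
        },
    }


def allow_only(receiver, sender, namespace, port):
    """CiliumNetworkPolicy letting only `sender` reach `receiver` on `port`."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": f"{receiver}-from-{sender}", "namespace": namespace},
        "spec": {
            # Pods the policy applies to (hypothetical `app` label).
            "endpointSelector": {"matchLabels": {"app": receiver}},
            "ingress": [
                {
                    "fromEndpoints": [{"matchLabels": {"app": sender}}],
                    "toPorts": [
                        {"ports": [{"port": str(port), "protocol": "TCP"}]}
                    ],
                }
            ],
        },
    }
```

The certificate covers authentication by agent identity, while the network policy narrows which agents can open a connection at all, so the two controls fail independently.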
3. Security constraints: adopt a policy-as-code approach
Instead of relying on LLM prompts, structure security constraints in versioned policies. For example, we chose:
- OPA rules for action validation.
- Escalation recognition via Kyverno.
These constraints are codified, versioned, and tested, providing increased transparency and reliability.
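To make the policy-as-code idea concrete, here is a minimal sketch of how an agent could ask an OPA server whether an action is allowed, using OPA's Data API (a POST to `/v1/data/<path>` with the action under `input`). The server URL, the policy path `agents/actions/allow`, and the shape of the action document are assumptions; the project's actual rules are not shown in this article.

```python
import json
from urllib import request


def build_opa_request(opa_url, policy_path, action):
    """Build the HTTP request that submits `action` to an OPA policy.

    `policy_path` is a hypothetical rule path such as "agents/actions/allow".
    """
    body = json.dumps({"input": action}).encode("utf-8")
    return request.Request(
        f"{opa_url.rstrip('/')}/v1/data/{policy_path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(response_body):
    """Interpret an OPA Data API response.

    OPA omits "result" when the rule is undefined, so anything other
    than an explicit true is treated as a denial.
    """
    document = json.loads(response_body)
    return document.get("result") is True


def check_action(opa_url, policy_path, action, timeout=2.0):
    """Send the request and return the decision (needs a reachable OPA)."""
    req = build_opa_request(opa_url, policy_path, action)
    with request.urlopen(req, timeout=timeout) as resp:
        return is_allowed(resp.read())
```

Because the decision lives in the policy and not in a prompt, a change to what agents may do goes through review and tests like any other code change.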
4. Enhanced observability through trace_id
Each A2A message includes a unique trace_id, essential for:
- Tracing the full decision chain.
- Monitoring agent performance via Prometheus and Cilium Hubble.
Structured logs allow identifying anomalies in minutes rather than hours.
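A small sketch of trace_id propagation follows. The message fields (`sender`, `payload`) and the log fields are hypothetical and are not the A2A schema; the point is only that a follow-up message inherits the trace_id of the message that triggered it, and that every log line carries it.

```python
import json
import time
import uuid


def new_message(sender, payload, parent=None):
    """Create a message; a reply or follow-up reuses the parent's trace_id."""
    trace_id = parent["trace_id"] if parent else str(uuid.uuid4())
    return {"trace_id": trace_id, "sender": sender, "payload": payload}


def log_line(message, event):
    """Render one structured (JSON) log line keyed by trace_id."""
    return json.dumps(
        {
            "ts": time.time(),
            "trace_id": message["trace_id"],
            "agent": message["sender"],
            "event": event,
        },
        sort_keys=True,
    )
```

Filtering logs on a single trace_id then returns every step taken by every agent for one decision, in timestamp order.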
5. Classic anomaly model before LLM activation
An Isolation Forest filters events before they are sent to LLM agents. This optimizes costs related to LLM usage while ensuring rapid identification of significant anomalies:
    from sklearn.ensemble import IsolationForest
    model = IsolationForest(n_estimators=100)
The anomaly threshold is dynamically adjustable via reviewer policies.
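The pre-filtering step could look roughly like the sketch below, which assumes events have already been turned into numeric feature vectors. The feature layout, the baseline data, and the default threshold of 0.0 are illustrative assumptions; in practice the threshold would come from the reviewer policies mentioned above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def fit_prefilter(baseline, random_state=0):
    """Fit an Isolation Forest on feature vectors of normal activity."""
    model = IsolationForest(n_estimators=100, random_state=random_state)
    model.fit(np.asarray(baseline))
    return model


def select_for_llm(model, events, threshold=0.0):
    """Return only the events anomalous enough to forward to LLM agents.

    decision_function is lower for more abnormal samples and negative
    for outliers, so events scoring below `threshold` are kept.
    """
    events = np.asarray(events)
    scores = model.decision_function(events)
    return events[scores < threshold]
```

Raising the threshold forwards more events to the LLM agents (higher cost, fewer misses); lowering it forwards fewer.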
Keeping humans in the loop
Each critical decision resolves to one of three states:
- Auto-execute: decision applied automatically.
- Auto-reject: decision blocked automatically.
- Human escalation: sent to a SOC analyst via Mattermost.
These escalations are not errors but predictable cases. Each process is supported by versioned artifacts such as GitOps policies, strengthening collaboration between teams.
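The three states can be expressed as a small routing function. The article does not say what drives the choice, so the sketch below assumes a hypothetical risk score between 0 and 1 and two hypothetical thresholds that would be stored with the other versioned policies.

```python
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto-execute"
    AUTO_REJECT = "auto-reject"
    HUMAN_ESCALATION = "human-escalation"


def route(risk, execute_below=0.2, reject_above=0.8):
    """Map a risk score in [0, 1] to one of the three states.

    Scores between the two thresholds are the expected grey zone and
    are handed to a human analyst; they are not treated as failures.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be between 0 and 1")
    if risk < execute_below:
        return Decision.AUTO_EXECUTE
    if risk > reject_above:
        return Decision.AUTO_REJECT
    return Decision.HUMAN_ESCALATION
```

Keeping the thresholds as parameters means a change in escalation behaviour shows up as a reviewed diff in Git, not as a silent code change.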
Tip
Consolidate your artifacts in a centralized Git repository to minimize surprises during escalations.
How to organize work between teams
Regular collaboration between teams is essential. Responsibilities are typically split as follows:
- SOC team: responsible for security policies and detection.
- Platform team: cluster management and GitOps pipelines.
- AI team: maintenance of models and agent interfaces.
Conclusion
Adopting a cloud native approach and open governance such as the CNCF's is essential for developing robust and scalable agentic AI. Tools such as Kubernetes, cert-manager, and Argo CD turn this complexity into a maintainable system. To go deeper, look for KubeCon sessions and Slack discussions on the topic, or contact CNCF experts to share your feedback.
About the author
Willem Berroubache, Chief Security Architect at Orange Innovation, is a specialist in cloud native security and an active contributor to the CNCF.



