Cloud native and agentic AI: an essential duo

Introduction

Cloud native architectures have transformed how modern IT environments are managed, and integrating them with agentic AI is now a critical challenge. This article covers the foundations, tools, and best practices for building multi-agent systems on Kubernetes, illustrated by an ongoing project at Orange Innovation.

Figure 1: Overview of the multi-agent system.

Why choose cloud native for agentic AI?

Agentic AI systems share many operational challenges with cloud native architectures: identity management, security policies, advanced observability, and GitOps. Cloud native tooling already addresses these concerns, so each agent can be structured as a distinct Kubernetes workload, which improves scalability, flexibility, and security.

Good to know

Projects under CNCF and Linux Foundation governance, such as cert-manager and Falco, make adoption in regulated environments more dependable.

Key points for building an agentic platform

1. Each agent is an independent Kubernetes workload

For each agent, we use a Kubernetes Deployment with resource limits, its own identity, and distinct restart rules. This approach enables:

Progressive updates (canary rollouts)
Namespace isolation
Efficient error handling (e.g., API timeout).
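As a rough illustration of the points above, the sketch below builds a Deployment manifest for a single agent as a plain Python dictionary. The agent name, namespace, image, ServiceAccount naming scheme, and resource values are all hypothetical and not taken from the Orange Innovation project. The built-in rolling update shown here only replaces pods gradually; a real canary rollout would need additional tooling that is not shown.

```python
from typing import Any, Dict


def agent_deployment(name: str, namespace: str, image: str) -> Dict[str, Any]:
    """Build a Deployment manifest for one agent (all values are illustrative)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            # Rolling updates replace pods gradually instead of all at once.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # One ServiceAccount per agent gives it its own identity.
                    "serviceAccountName": f"{name}-sa",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Limits keep one misbehaving agent from starving the others.
                            "resources": {
                                "requests": {"cpu": "100m", "memory": "128Mi"},
                                "limits": {"cpu": "500m", "memory": "512Mi"},
                            },
                        }
                    ],
                },
            },
        },
    }


# Hypothetical agent: one namespace and one Deployment per agent.
manifest = agent_deployment(
    "triage-agent", "agents-triage", "registry.example/triage-agent:0.1.0"
)
```

Since kubectl accepts JSON as well as YAML, the dictionary can be serialized with `json.dumps(manifest)` and stored in the GitOps repository or piped to `kubectl apply -f -`.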

Caution

Avoid integrating all agents into a single process: this compromises system resilience in case of failure.

2. Inter-agent traffic: use mTLS, not a service mesh

Inter-agent messages use the A2A protocol and carry detection rules and sensitive actions. Using cert-manager to issue certificates and CiliumNetworkPolicy to restrict traffic secures these exchanges with mTLS without the complexity of a service mesh.

⚡PowerShell

kubectl apply -f cert-manager-configuration.yaml

This configuration ensures:

Authentication based on agent identities.
Granular protection of network communications.
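The article does not show the contents of cert-manager-configuration.yaml, so the following is only a guess at the kind of resources such a setup might involve, expressed as Python dictionaries. It builds a cert-manager Certificate valid for both client and server authentication, plus a CiliumNetworkPolicy that lets a single caller reach a single callee on one port. The agent names, issuer name, port, and the assumption that both agents live in the same namespace are hypothetical.

```python
from typing import Any, Dict


def agent_certificate(agent: str, namespace: str, issuer: str) -> Dict[str, Any]:
    """cert-manager Certificate usable on both sides of an mTLS handshake."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": f"{agent}-mtls", "namespace": namespace},
        "spec": {
            # cert-manager writes the key pair into this Secret.
            "secretName": f"{agent}-mtls-tls",
            "dnsNames": [f"{agent}.{namespace}.svc.cluster.local"],
            "usages": ["server auth", "client auth"],
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
        },
    }


def allow_only(callee: str, caller: str, namespace: str, port: int) -> Dict[str, Any]:
    """CiliumNetworkPolicy letting only `caller` pods reach `callee` on one port."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": f"{callee}-from-{caller}", "namespace": namespace},
        "spec": {
            "endpointSelector": {"matchLabels": {"app": callee}},
            "ingress": [
                {
                    "fromEndpoints": [{"matchLabels": {"app": caller}}],
                    "toPorts": [
                        {"ports": [{"port": str(port), "protocol": "TCP"}]}
                    ],
                }
            ],
        },
    }


# Hypothetical names: a detector agent calling a triage agent over port 8443.
cert = agent_certificate("triage-agent", "agents", "internal-ca")
policy = allow_only("triage-agent", "detector-agent", "agents", 8443)
```

The certificate establishes who the agent is, while the network policy limits who may connect at all; the two are complementary rather than interchangeable.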

3. Security constraints: adopt a policy-as-code approach

Instead of relying on LLM prompts, structure security constraints in versioned policies. For example, we chose:

OPA rules for action validation.
Escalation recognition via Kyverno.

These constraints are codified, versioned, and tested, providing increased transparency and reliability.
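To make the idea concrete without reproducing actual OPA or Kyverno rules (which are written in Rego and YAML respectively), here is a plain-Python stand-in. It shows the general shape of policy-as-code: the constraint is versioned data, evaluated by deterministic code, and easy to unit test. The action types, the target limit, and the version string are invented for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Policy:
    """A versioned, immutable policy (hypothetical fields)."""
    version: str
    allowed_actions: FrozenSet[str]
    max_targets: int


POLICY = Policy(
    version="example-1",
    allowed_actions=frozenset({"quarantine_pod", "open_ticket"}),
    max_targets=5,
)


def validate(action: dict, policy: Policy = POLICY) -> Tuple[bool, str]:
    """Return (allowed, reason); deny anything not explicitly permitted."""
    if action.get("type") not in policy.allowed_actions:
        return False, f"action type not allowed by policy {policy.version}"
    if len(action.get("targets", [])) > policy.max_targets:
        return False, f"too many targets for policy {policy.version}"
    return True, f"allowed by policy {policy.version}"


# An agent proposes actions; the policy, not the prompt, has the final word.
ok, reason = validate({"type": "quarantine_pod", "targets": ["pod-a"]})
denied, denied_reason = validate({"type": "delete_namespace", "targets": []})
```

Because the check is ordinary code over ordinary data, it can be reviewed in a pull request and covered by tests, which is not possible for an instruction buried in a prompt.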

4. Enhanced observability through trace_id

Each A2A message includes a unique trace_id, essential for:

Tracing the full decision chain.
Monitoring agent performance via Prometheus and Cilium Hubble.

Structured logs make it possible to identify anomalies in minutes rather than hours.
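The sketch below illustrates how a trace_id might be carried from one message to the next and written into structured logs. The envelope fields are hypothetical and are not the A2A wire format; only the principle of reusing the same trace_id along the chain comes from the text above.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def new_message(
    sender: str, receiver: str, payload: dict, trace_id: Optional[str] = None
) -> dict:
    """Create a message envelope; reuse trace_id when continuing a chain."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,
        "sender": sender,
        "receiver": receiver,
        "payload": payload,
    }


def log_line(message: dict, event: str) -> str:
    """Render one structured log record as a JSON string."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "trace_id": message["trace_id"],
        "sender": message["sender"],
        "receiver": message["receiver"],
    }
    return json.dumps(record, sort_keys=True)


# The first message starts a chain; the follow-up reuses its trace_id.
first = new_message("detector", "triage", {"alert": "suspicious exec"})
second = new_message(
    "triage", "responder", {"action": "quarantine_pod"}, trace_id=first["trace_id"]
)
print(log_line(first, "received"))
print(log_line(second, "forwarded"))
```

Filtering the logs on a single trace_id then returns every hop of one decision, in order of timestamp.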

5. Classic anomaly model before LLM activation

An Isolation Forest filters events before they are sent to LLM agents. This optimizes costs related to LLM usage while ensuring rapid identification of significant anomalies:

🐍Python

from sklearn.ensemble import IsolationForest

# 100 trees; the model must be fitted on baseline events before scoring new ones
model = IsolationForest(n_estimators=100)

The anomaly threshold is dynamically adjustable via reviewer policies.
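One way to picture an adjustable threshold is sketched below. It assumes the scores come from scikit-learn's `IsolationForest.score_samples`, where lower values mean more abnormal, and that the threshold is read from a reviewer-maintained policy. The policy structure, the threshold value, and the scores are made up for illustration; the scoring step itself is left out so the example stays self-contained.

```python
from typing import Iterable, List


def select_for_llm(
    events: Iterable[dict], scores: Iterable[float], threshold: float
) -> List[dict]:
    """Keep events whose score is below the threshold (lower = more anomalous)."""
    return [event for event, score in zip(events, scores) if score < threshold]


# Hypothetical reviewer policy; in practice it would live in a versioned file.
reviewer_policy = {"anomaly_threshold": -0.6}

events = [{"id": 1}, {"id": 2}, {"id": 3}]
# Made-up values in the style of score_samples output.
scores = [-0.45, -0.72, -0.50]

# Only the events below the threshold are forwarded to the LLM agents.
forwarded = select_for_llm(events, scores, reviewer_policy["anomaly_threshold"])
```

Raising the threshold (for example to -0.48) forwards more events to the LLM agents; lowering it forwards fewer and reduces cost.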

Keeping humans in the loop

Critical decisions follow three states:

Auto-execute: decision applied automatically.
Auto-reject: decision blocked automatically.
Human escalation: sent to a SOC analyst via Mattermost.

These escalations are not errors but expected cases. Each process is backed by versioned artifacts such as GitOps policies, which strengthens collaboration between teams.
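The three states can be sketched as a small routing function. The inputs (a confidence score and a policy verdict) and the 0.9 cutoff are assumptions made for the example; the article does not say which signals drive the decision.

```python
from enum import Enum


class Outcome(Enum):
    AUTO_EXECUTE = "auto-execute"
    AUTO_REJECT = "auto-reject"
    HUMAN_ESCALATION = "human-escalation"


def route(confidence: float, policy_allows: bool, execute_above: float = 0.9) -> Outcome:
    """Map a proposed decision to one of the three states."""
    if not policy_allows:
        # A policy violation is blocked regardless of confidence.
        return Outcome.AUTO_REJECT
    if confidence >= execute_above:
        return Outcome.AUTO_EXECUTE
    # Allowed but uncertain: hand over to a human analyst.
    return Outcome.HUMAN_ESCALATION


decision = route(confidence=0.7, policy_allows=True)
print(decision.value)
```

Treating escalation as a regular return value, rather than as an exception, mirrors the point that escalations are an expected path and not a failure.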

Tip

Consolidate your artifacts in a centralized Git repository to minimize surprises during escalations.

How to organize work between teams

Regular collaboration between teams is essential. Responsibilities are typically divided as follows:

SOC team: responsible for security policies and detection.
Platform team: cluster management and GitOps pipelines.
AI team: maintenance of models and agent interfaces.

Conclusion

Adopting a cloud native approach under open governance, such as that of the CNCF, is essential for developing robust and scalable agentic AI. Tools such as Kubernetes, cert-manager, and Argo CD turn this complexity into a maintainable system. To go deeper, look for the relevant KubeCon sessions and Slack channels, or contact CNCF experts to share your feedback.

About the author

Willem Berroubache, Chief Security Architect at Orange Innovation, is a specialist in cloud native security and an active contributor to the CNCF.

About the author

Houssem MAKHLOUF
