IAMinerva
© 2026 IAMinerva. All rights reserved.

Microsoft Copilot: #copilot, #evaluate AI agents, #Microsoft Foundry

Multi-Turn Evaluators: Quality and Recommendations in Copilot

Discover how Microsoft Foundry's multi-turn evaluators improve complex AI agent interactions through in-depth analysis.

Houssem MAKHLOUF
June 28, 2026
4 min read


Introduction

Evaluating AI agents over multi-turn interactions is essential to judging their real-world quality. Single-turn metrics score individual responses on criteria such as relevance or tone, whereas multi-turn sessions call for a more holistic, session-level approach. Microsoft Foundry sets out standards for validating and calibrating these evaluators so that they remain reliable in complex scenarios such as those handled by Microsoft Copilot.

Understanding Multi-Turn Evaluators

Multi-turn evaluators were designed to analyze entire sessions, taking into account elements such as:

  • The agent's ability to accomplish a complete task (Task Completion)
  • Overall user satisfaction (CSAT)
  • Conversational coherence between turns (Conversation Coherence)
  • Whether the agent's statements are supported by sources (Groundedness)

Available Evaluator Types

Choosing the appropriate evaluator type is essential. Here is a comparison of different evaluator families:

Evaluator Family | What They Evaluate | Unit of Analysis
Single-turn | One input/output pair against a fixed rubric | One turn
Multi-turn | An entire session | One conversation
Adaptive | A complete session with a rubric generated for the group | One conversation

Good to Know

Multi-turn approaches focus on evaluating properties at the session level, ensuring comprehensive analysis instead of limiting themselves to isolated responses.

How the Evaluators Are Evaluated

Within the Microsoft Foundry framework, the reliability and validity of multi-turn evaluators are analyzed through:

  • Reference Datasets: Specific datasets are selected to isolate each property.
  • Multi-Judge Tests: Several different judge models are run to examine both their accuracy and their consistency.
  • Key Metrics: Measurement axes include validity, reliability, and robustness.
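As a rough illustration of the multi-judge step, the sketch below computes each judge's accuracy against reference labels and the average pairwise agreement between judges. The judge names, the labels, and the helper functions are hypothetical; the study's actual metrics and tooling are not described here, so treat this as one simple way to look at accuracy and consistency, not as the Foundry method.

```python
from itertools import combinations


def accuracy(predictions, reference):
    """Fraction of sessions where the prediction matches the reference label."""
    return sum(p == r for p, r in zip(predictions, reference)) / len(reference)


def pairwise_agreement(judge_outputs):
    """Mean fraction of sessions on which each pair of judges gives the same verdict."""
    pairs = list(combinations(judge_outputs.values(), 2))
    return sum(accuracy(a, b) for a, b in pairs) / len(pairs)


# Hypothetical Task Completion verdicts for four sessions.
reference = [True, False, True, True]
judges = {
    "judge_a": [True, False, True, True],
    "judge_b": [True, False, False, True],
    "judge_c": [True, True, False, True],
}

per_judge_accuracy = {name: accuracy(v, reference) for name, v in judges.items()}
agreement = pairwise_agreement(judges)
```

Accuracy speaks to validity (does the judge match the reference?), while pairwise agreement is a simple proxy for consistency across judges.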

Study Results

The study's main findings are:

  • Task Completion: Largely reliable with little variability between judges, suitable for session scores.
  • CSAT: Extremely solid, particularly with advanced judges like GPT-5.5 and Claude Opus 4.7.
  • Groundedness: More difficult to stabilize; recommended as a triage signal rather than a fixed threshold.
  • Conversation Coherence: Reliable, although some smaller judges show gaps in incoherent cases.
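Using Groundedness as a triage signal rather than a fixed threshold could look like the sketch below: instead of passing or failing sessions at a cutoff, the lowest-scoring sessions are queued for human review first. The function name, the session ids, and the review budget are assumptions made for illustration.

```python
def triage_by_groundedness(scores, review_count=2):
    """Return the ids of the lowest-scoring sessions, to be reviewed first.

    `scores` maps a session id to a Groundedness score on the 1-5 Likert scale.
    No session is passed or failed here; the score only orders the review queue.
    """
    if review_count < 1:
        raise ValueError("review_count must be at least 1")
    ranked = sorted(scores, key=lambda session_id: scores[session_id])
    return ranked[:review_count]


# Hypothetical scores for five sessions.
scores = {"s1": 5, "s2": 2, "s3": 4, "s4": 1, "s5": 3}
review_queue = triage_by_groundedness(scores, review_count=2)
```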

Evaluator | Property | Output
Task Completion | Did the agent fully accomplish the user's task? | Binary (success / failure) + details
CSAT | Level of user satisfaction | Likert scale 1-5
Groundedness | Statements supported by sources | Likert scale 1-5
Conversation Coherence | Smooth progression between turns | Likert scale 1-5
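To make the output formats concrete, here is a minimal sketch of a record holding one session's results: a binary Task Completion verdict with details, plus three Likert scores. The class and field names are hypothetical and do not reflect any Foundry SDK type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionEvaluation:
    """Hypothetical container for one session's multi-turn scores."""

    task_completed: bool  # Task Completion: binary success / failure
    task_details: str     # details accompanying the binary verdict
    csat: int             # CSAT, Likert 1-5
    groundedness: int     # Groundedness, Likert 1-5
    coherence: int        # Conversation Coherence, Likert 1-5

    def __post_init__(self) -> None:
        # Reject scores that fall outside the 1-5 Likert scale.
        for name in ("csat", "groundedness", "coherence"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


example = SessionEvaluation(
    task_completed=True,
    task_details="Booking confirmed in turn 4",
    csat=4,
    groundedness=3,
    coherence=5,
)
```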

Critical Implementation Points

1. Choose the Right Evaluator

Adapt the evaluator to the property you want to measure. Single-turn evaluators are not suitable for evaluating complete sessions.

2. Use Reliable Judges

Prefer advanced models like GPT-5.5 for critical properties such as fact-checking (Groundedness). Recalibrate small judges if you must use them.

3. Test on Varied Domains

Test on diverse corpora so that your conclusions are not limited to specific benchmarks. This helps the results generalize.

4. Combine Multiple Judges

Favor cross-evaluation with multiple judges to minimize bias and balance performance.
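One simple way to combine judges, sketched below under assumed inputs, is to take the median of Likert scores and a strict majority vote for binary verdicts, and to record the spread between judges as a disagreement signal. These aggregation rules are illustrative choices, not a method prescribed by the study.

```python
from statistics import median


def combine_likert(scores):
    """Median of the judges' Likert scores; less sensitive to one outlier judge than the mean."""
    return median(scores)


def combine_binary(verdicts):
    """Strict majority vote; a tie counts as failure."""
    return sum(verdicts) * 2 > len(verdicts)


def spread(scores):
    """Gap between the highest and lowest score, a rough disagreement signal."""
    return max(scores) - min(scores)


# Hypothetical outputs from three judges for one session.
coherence_scores = [4, 5, 2]
task_verdicts = [True, True, False]

combined_coherence = combine_likert(coherence_scores)
combined_task = combine_binary(task_verdicts)
coherence_spread = spread(coherence_scores)
```

A large spread suggests the session deserves a closer look even when the combined score seems acceptable.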

Practical Recommendations

To ensure optimal results, here are some essential tips:

  • Calibrate decision thresholds according to your specific data.
  • Avoid small judges for Groundedness evaluation, except as trend indicators.
  • Measure the quality of the evaluators themselves to ensure the integrity of generated scores.
  • Use multi-turn evaluators for session-level testing and tuning before production deployment.
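Calibrating a decision threshold on your own data can be as simple as the sketch below: given judge scores and human pass/fail labels for the same sessions, pick the Likert cutoff whose decisions match the labels most often. The data and the function are hypothetical, and accuracy is only one possible selection criterion.

```python
def calibrate_threshold(scores, labels, candidates=range(1, 6)):
    """Return the cutoff t for which 'score >= t' best matches the human labels.

    `scores` are judge scores on the 1-5 Likert scale and `labels` are the
    corresponding human pass/fail judgments. Ties go to the lowest cutoff.
    """
    def accuracy_at(threshold):
        hits = sum((s >= threshold) == y for s, y in zip(scores, labels))
        return hits / len(labels)

    return max(candidates, key=accuracy_at)


# Hypothetical labeled sample from your own sessions.
scores = [5, 4, 4, 3, 2, 1]
labels = [True, True, True, False, False, False]
threshold = calibrate_threshold(scores, labels)
```

In practice the labeled sample should come from the same domain as the sessions you plan to score, since a cutoff tuned on one dataset may not transfer to another.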

Important

Public benchmarks often provide only outcome labels, not process labels. Keep the limitations of LLM-based scores in mind when no deterministic oracle is available.

Conclusion

Multi-turn evaluators are essential for navigating the complexities of deep conversational interactions. Through rigorous testing and solid methodology, Microsoft Foundry provides tools that enable developers to master these dimensions and create robust and reliable AI experiences.

Additional Resources

  • Start Building with Microsoft Foundry
  • Build Session BRK252
  • Discover the Documentation
  • Join the Community

Tip

Leverage insights from multi-turn evaluators to continuously improve your agents' performance, especially during critical development phases.


Houssem MAKHLOUF

Microsoft 365 enthusiast & IT professional.
