Introduction
Evaluating how AI agents perform across multi-turn interactions is essential to delivering consistent quality. Single-turn metrics score individual responses on criteria such as relevance or tone, whereas multi-turn sessions call for evaluators that consider the conversation as a whole. Microsoft Foundry establishes rigorous standards to validate and calibrate these evaluators, ensuring their reliability and relevance in complex scenarios such as those managed by Microsoft Copilot.
Understanding Multi-Turn Evaluators
Multi-turn evaluators are designed to analyze entire sessions, taking into account properties such as:
- The agent's ability to accomplish a complete task (Task Completion)
- Overall user satisfaction (CSAT)
- Conversational coherence between turns (Conversation Coherence)
- Whether the agent's statements are supported by sources (Groundedness)
Available Evaluator Types
Choosing the appropriate evaluator type is essential. Here is a comparison of different evaluator families:
| Evaluator Family | What They Evaluate | Unit of Analysis |
|---|---|---|
| Single-turn | A single input/output pair against a fixed rubric | One turn |
| Multi-turn | An entire session | One conversation |
| Adaptive | A complete session with a rubric generated for the group | One conversation |
Good to Know
Multi-turn evaluators measure properties at the session level, so they capture behavior that cannot be seen by looking at isolated responses.
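To make the difference in unit of analysis concrete, here is a minimal Python sketch. The `Turn` and `Session` types and both toy checks are hypothetical illustrations, not part of any Microsoft Foundry API; real evaluators would call a judge model rather than apply string rules.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Turn:
    user: str
    agent: str


@dataclass(frozen=True)
class Session:
    turns: List[Turn]


# Hypothetical signatures: the unit of analysis is the only difference.
SingleTurnEvaluator = Callable[[Turn], float]
MultiTurnEvaluator = Callable[[Session], float]


def non_empty_reply(turn: Turn) -> float:
    """Toy single-turn check: 1.0 if the agent replied with any text."""
    return 1.0 if turn.agent.strip() else 0.0


def no_repeated_requests(session: Session) -> float:
    """Toy session-level check: 1.0 if the user never repeated a message."""
    seen = set()
    for turn in session.turns:
        key = turn.user.strip().lower()
        if key in seen:
            return 0.0
        seen.add(key)
    return 1.0


def score_turns(session: Session, evaluator: SingleTurnEvaluator) -> List[float]:
    """Apply a single-turn evaluator to every turn independently."""
    return [evaluator(turn) for turn in session.turns]
```

In a session where the user has to ask the same question twice, every turn can pass `non_empty_reply` while `no_repeated_requests` fails, which is the kind of signal that only exists at the session level.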
Methodology for Evaluating the Evaluators
Within the Microsoft Foundry framework, the reliability and validity of multi-turn evaluators are analyzed through:
- Reference Datasets: Specific datasets are selected to isolate each property.
- Multi-Judge Tests: Several judge models are run on the same data to examine both their accuracy and their consistency.
- Key Metrics: Measurement axes include validity, reliability, and robustness.
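As an illustration of how two of these axes might be quantified, the sketch below uses accuracy against reference labels as a stand-in for validity and Cohen's kappa between two judges as a stand-in for reliability. These particular metrics are an assumption chosen for the example; the source does not state which statistics the study used, and robustness is not covered here.

```python
from collections import Counter
from typing import Hashable, Sequence


def accuracy(predicted: Sequence[Hashable], reference: Sequence[Hashable]) -> float:
    """Share of items where a judge's label matches the reference label."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def cohen_kappa(judge_a: Sequence[Hashable], judge_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two judges on the same items."""
    if not judge_a or len(judge_a) != len(judge_b):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    # Agreement expected by chance, from each judge's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges always emit the same single label
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 means the judges agree far more often than chance would predict, while a value near 0 means their agreement is no better than chance.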
Study Results
The evaluation produced the following findings:
- Task Completion: Largely reliable, with little variability between judges; suitable for session-level scores.
- CSAT: Highly reliable, particularly with advanced judges like GPT-5.5 and Claude Opus 4.7.
- Groundedness: Harder to stabilize; recommended as a triage signal rather than a fixed threshold.
- Conversation Coherence: Reliable, although some smaller judges are weaker on incoherent conversations.
| Evaluator | Property | Output |
|---|---|---|
| Task Completion | Did the agent fully accomplish the user's task? | Binary (success / failure) + details |
| CSAT | Level of user satisfaction | Likert scale 1-5 |
| Groundedness | Statements supported by sources | Likert scale 1-5 |
| Conversation Coherence | Smooth progression between turns | Likert scale 1-5 |
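The two output shapes in the table could be represented as follows. This is a sketch with hypothetical class names (`TaskCompletionResult`, `LikertResult`), not the schema Microsoft Foundry actually returns; it only shows how the binary-plus-details and 1-5 Likert formats might be validated before scores are aggregated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCompletionResult:
    """Binary outcome plus free-text details (hypothetical shape)."""
    success: bool
    details: str = ""


@dataclass(frozen=True)
class LikertResult:
    """A 1-5 score for CSAT, Groundedness, or Conversation Coherence."""
    evaluator: str
    score: int

    def __post_init__(self) -> None:
        # Reject bools explicitly, since bool is a subclass of int in Python.
        is_int = isinstance(self.score, int) and not isinstance(self.score, bool)
        if not is_int or not 1 <= self.score <= 5:
            raise ValueError(f"Likert score must be an integer from 1 to 5, got {self.score!r}")
```

Validating at construction time keeps a malformed judge response, such as a score of 0 or 7, from silently skewing session averages.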
Critical Implementation Points
Choose the Right Evaluator
Match the evaluator to the property you want to measure. Single-turn evaluators are not suitable for evaluating complete sessions.
Use Reliable Judges
Prefer advanced models like GPT-5.5 for critical properties such as Groundedness, where statements must be checked against sources. If you must use small judges, recalibrate them first.
Test on Varied Domains
Run tests on diverse corpora so that conclusions are not limited to specific benchmarks. This helps confirm that results generalize.
Combine Multiple Judges
Cross-check scores across multiple judges to reduce the bias of any single model and to balance out their individual weaknesses.
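One simple way to combine judges is sketched below: take the median for Likert scores and a strict majority vote for binary outcomes. The aggregation rules and function names are assumptions made for illustration; the source does not prescribe a specific combination method.

```python
from statistics import median
from typing import Sequence


def combine_likert(scores: Sequence[int]) -> float:
    """Median of the judges' 1-5 scores; robust to one outlier judge."""
    if not scores:
        raise ValueError("at least one score is required")
    return float(median(scores))


def combine_binary(votes: Sequence[bool]) -> bool:
    """Strict majority vote; a tie counts as failure (conservative assumption)."""
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(votes) * 2 > len(votes)


def disagreement(scores: Sequence[int]) -> int:
    """Spread between the highest and lowest judge score."""
    if not scores:
        raise ValueError("at least one score is required")
    return max(scores) - min(scores)
```

A large `disagreement` value is a useful flag in its own right: sessions where judges diverge widely are good candidates for human review.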
Practical Recommendations
To ensure optimal results, here are some essential tips:
- Calibrate decision thresholds according to your specific data.
- Avoid small judges for Groundedness evaluation, except as trend indicators.
- Measure the quality of the evaluators themselves to ensure the integrity of generated scores.
- Use multi-turn evaluators for session-level testing and release gating before production deployment.
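The first tip, calibrating thresholds on your own data, can be sketched as a small sweep. The example assumes you have Likert scores from a judge and matching human pass/fail labels for the same sessions; the function name and the choice of accuracy as the selection criterion are illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple


def calibrate_threshold(
    scores: Sequence[int],
    labels: Sequence[bool],
    candidates: Sequence[int] = (1, 2, 3, 4, 5),
) -> Tuple[Optional[int], float]:
    """Pick the cut-off where `score >= threshold` best matches human labels."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    best_threshold, best_accuracy = None, -1.0
    for threshold in candidates:
        predictions = [score >= threshold for score in scores]
        acc = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
        # Strict comparison keeps the lowest threshold when several tie.
        if acc > best_accuracy:
            best_threshold, best_accuracy = threshold, acc
    return best_threshold, best_accuracy
```

With a small labeled sample the chosen threshold can be unstable, so it is worth re-running the sweep as more labeled sessions become available.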
Important
Public benchmarks often provide only outcome labels, not process labels. Keep in mind the limitations of LLM-based scores when no deterministic oracle is available to check them against.
Conclusion
Multi-turn evaluators are essential for navigating the complexities of deep conversational interactions. Through rigorous testing and solid methodology, Microsoft Foundry provides tools that enable developers to master these dimensions and create robust and reliable AI experiences.
Additional Resources
- Start Building with Microsoft Foundry
- Build Session BRK252
- Discover the Documentation
- Join the Community
Tip
Leverage insights from multi-turn evaluators to continuously improve your agents' performance, especially during critical development phases.



