Multi-Turn Evaluators: Quality and Recommendations in Copilot

Introduction

Evaluating how AI agents perform across multi-turn interactions is essential to delivering reliable experiences. Single-turn metrics score individual responses on criteria such as relevance or tone, whereas multi-turn sessions call for a holistic view of the whole conversation. Microsoft Foundry sets rigorous standards for validating and calibrating these evaluators, so that they stay reliable and relevant in complex scenarios such as those handled by Microsoft Copilot.

Understanding Multi-Turn Evaluators

Multi-turn evaluators are designed to analyze entire sessions, taking into account properties such as:

The agent's ability to accomplish a complete task (Task Completion)
Overall user satisfaction (CSAT)
Conversational coherence between turns (Conversation Coherence)
Whether the agent's statements are supported by verified facts (Groundedness)

Available Evaluator Types

Choosing the appropriate evaluator type is essential. The table below compares the evaluator families:

Evaluator Family	What It Evaluates	Unit of Analysis
Single-turn	One input/output pair against a fixed rubric	One turn
Multi-turn	An entire session	One conversation
Adaptive	A complete session against a rubric generated for the group	One conversation

Good to Know

Multi-turn evaluators measure properties at the session level, so the analysis covers the whole conversation rather than isolated responses.
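
To make the difference in unit of analysis concrete, here is a minimal sketch. The `Turn` type and both helper functions are hypothetical names chosen for illustration; they are not part of any Microsoft Foundry API.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One exchange in a session (hypothetical structure)."""
    user: str
    agent: str


def single_turn_units(session):
    """One unit per turn: each (input, output) pair is scored in isolation."""
    return [(turn.user, turn.agent) for turn in session]


def multi_turn_unit(session):
    """One unit per conversation: the full transcript is scored as a whole."""
    return "\n".join(f"User: {t.user}\nAgent: {t.agent}" for t in session)


session = [
    Turn("Book a flight to Oslo.", "Which date would you like?"),
    Turn("Friday.", "Your flight to Oslo is booked for Friday."),
]

print(len(single_turn_units(session)))  # number of single-turn units
print(multi_turn_unit(session))         # the single multi-turn unit
```

A single-turn evaluator would see the two pairs separately and could not tell whether the booking task was completed; a multi-turn evaluator receives the whole transcript.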

Methodology for Evaluating the Evaluators

Within Microsoft Foundry, the reliability and validity of multi-turn evaluators are assessed through:

Reference Datasets: Specific datasets are selected to isolate each property.
Multi-Judge Tests: Several judge models are used to examine both accuracy and consistency.
Key Metrics: Measurement axes include validity, reliability, and robustness.
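
As an illustration of two of these axes, the sketch below treats validity as agreement with reference labels and reliability as mean pairwise agreement between judges. These are simplifying assumptions made for the example, not the exact metrics used in the study, and the judge names and labels are invented.

```python
from itertools import combinations


def agreement(labels_a, labels_b):
    """Fraction of sessions on which two label lists match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("label lists must not be empty")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def mean_pairwise_agreement(labels_by_judge):
    """Average agreement over every pair of judges."""
    pairs = list(combinations(labels_by_judge.values(), 2))
    if not pairs:
        raise ValueError("need at least two judges")
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical pass/fail labels (1 = pass) for five sessions.
reference = [1, 0, 1, 1, 0]
labels_by_judge = {
    "judge_a": [1, 0, 1, 1, 0],
    "judge_b": [1, 0, 1, 0, 0],
    "judge_c": [1, 1, 1, 0, 0],
}

# Validity proxy: how often each judge matches the reference labels.
validity = {name: agreement(labels, reference)
            for name, labels in labels_by_judge.items()}

# Reliability proxy: how often the judges match each other.
reliability = mean_pairwise_agreement(labels_by_judge)

print(validity)
print(round(reliability, 3))
```

Raw agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa is a common alternative when label distributions are skewed.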

Study Results

The evaluation produced the following findings:

Task Completion: Largely reliable with little variability between judges; suitable for session-level scores.
CSAT: Very robust, particularly with advanced judges such as GPT-5.5 and Claude Opus 4.7.
Groundedness: Harder to stabilize; recommended as a triage signal rather than a fixed threshold.
Conversation Coherence: Reliable, although some smaller judges show gaps on incoherent cases.
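
One way to read "triage signal rather than a fixed threshold" for Groundedness is to rank sessions by score and send the lowest-scoring share to human review, instead of passing or failing each session at a cutoff. The sketch below assumes scores in the range 0 to 1; the function name, the `review_fraction` parameter, and the sample scores are hypothetical.

```python
import math


def triage_queue(scores, review_fraction=0.2):
    """Return session IDs to review, lowest Groundedness score first."""
    if not 0 < review_fraction <= 1:
        raise ValueError("review_fraction must be in (0, 1]")
    ranked = sorted(scores, key=scores.get)  # ascending by score
    count = math.ceil(len(ranked) * review_fraction)
    return ranked[:count]


# Hypothetical Groundedness scores per session.
scores = {"s1": 0.91, "s2": 0.42, "s3": 0.77, "s4": 0.15, "s5": 0.88}

queue = triage_queue(scores, review_fraction=0.4)
print(queue)
```

Because the queue depends only on the ordering of scores, it is less sensitive to a judge whose absolute scores drift than a fixed pass/fail cutoff would be.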

Critical Implementation Points

Choose the Right Evaluator

Match the evaluator to the property you want to measure. Single-turn evaluators are not suited to evaluating complete sessions.

Use Reliable Judges

Prefer advanced models such as GPT-5.5 for critical properties such as fact-checking (Groundedness). If you must use small judges, recalibrate them first.

Test on Varied Domains

Test on diverse corpora so that your conclusions are not tied to specific benchmarks and the results generalize.

Combine Multiple Judges

Favor cross-evaluation with multiple judges to reduce the bias of any single judge and to balance their individual strengths and weaknesses.
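
A minimal sketch of one possible combination rule: take the median of the judges' scores and flag the session for review when the judges disagree widely. The median, the `max_spread` parameter, and the 1 to 5 scale are assumptions made for this example, not a rule prescribed by the study.

```python
from statistics import median


def combine_judges(scores_by_judge, max_spread=1.0):
    """Aggregate per-judge scores for one session.

    Returns the median score, the spread between the highest and lowest
    judge, and whether the spread exceeds max_spread.
    """
    values = list(scores_by_judge.values())
    if not values:
        raise ValueError("need at least one judge score")
    spread = max(values) - min(values)
    return {
        "score": median(values),
        "spread": spread,
        "needs_review": spread > max_spread,
    }


# Hypothetical scores on a 1-5 scale from three judges.
result = combine_judges({"judge_a": 4, "judge_b": 5, "judge_c": 2}, max_spread=2)
print(result)
```

The median keeps one outlying judge from dominating the session score, while the spread flag surfaces the sessions where judges disagree most.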

Practical Recommendations

To get the most reliable results, follow these recommendations:

Calibrate decision thresholds according to your specific data.
Avoid small judges for Groundedness evaluation, except as trend indicators.
Measure the quality of the evaluators themselves to ensure the integrity of generated scores.
Use multi-turn evaluators for session-level testing and release checks before production deployment.
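
Calibrating a decision threshold on your own data can be sketched as a sweep over candidate cutoffs on a small human-labeled sample, keeping the cutoff that best separates passing from failing sessions. Using F1 as the selection criterion is an assumption of this example, as are the function names and the sample data.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the rule 'score >= threshold means pass' against human labels."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum(not p and l for p, l in zip(predicted, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(scores, labels):
    """Return the observed score that maximizes F1 when used as the cutoff."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical evaluator scores and human pass/fail labels for six sessions.
scores = [0.2, 0.35, 0.5, 0.6, 0.8, 0.9]
labels = [False, False, True, False, True, True]

threshold = calibrate_threshold(scores, labels)
print(threshold)
```

In practice the threshold should be chosen on one labeled sample and checked on a separate held-out sample, so that it is not fitted to noise.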

Important

Public benchmarks often provide only outcome labels, not process labels. Keep in mind the limitations of LLM-based scores when no deterministic oracle is available to check them against.

Conclusion

Multi-turn evaluators are essential for assessing long, complex conversational interactions. Through rigorous testing and a sound methodology, Microsoft Foundry gives developers tools to measure these dimensions and build robust, reliable AI experiences.

Additional Resources

Tip

Use insights from multi-turn evaluators to continuously improve your agents' performance, especially during critical development phases.
