Training and evaluating reliable AI agents

Key takeaways

  • Agentic AI success depends on high-quality training data and "golden path" benchmarking, in which experts annotate the ideal sequence of API calls and reasoning steps to create a ground truth.
  • When deploying AI agents, developers must address three primary challenges: reliability (preventing compound errors in multistep workflows), specification (resolving ambiguous instructions and tool overlaps) and data drift (ensuring performance holds up against real-world queries compared to handpicked examples).
  • To build production-ready systems, the agent's trajectories must be analyzed to identify where logical failures occur and ensure that tool orchestration is efficient rather than circuitous.

There is a growing interest around the potential of large language model (LLM) agents in enabling intelligent automation across various industries. Gartner predicts that 40% of enterprise applications will embed AI agents by the end of 2026, up from less than 5% in 2025. However, deploying them in real-world systems requires satisfying several key requirements.

Agents must:

  • Interact seamlessly with both humans and tools over long context windows to incrementally gather information and resolve intents;
  • Accurately adhere to complex policies and rules specific to a task or domain; and
  • Maintain consistency and reliability at scale, across millions of interactions.

For example, consider an AI travel agent assisting a client with a multistep itinerary change. When a traveler decides to shift their vacation from one destination to another, the agent must execute several high-level tasks:

  • Engage in a dynamic dialogue to gather new preferences, such as updated travel dates, budget constraints and specific accommodation needs.
  • Cross-reference the original bookings against supplier-specific terms and conditions to determine cancellation penalties or credit eligibility.
  • Navigate global distribution systems (GDS) or proprietary partner application programming interfaces (APIs) to source real-time availability for flights, lodging and ground transport.
  • Finalize the rebooking process, ensuring all vouchers and confirmations are synchronized.

To be effective, the agent should provide the same high level of service and follow the same logical protocols for every client with an identical request, regardless of the user's communication style. Similarly, the agent must remain focused on the end goal even if the conversation becomes non-linear, such as when a client provides information out of order.

Understanding agentic trajectories

An agent typically operates in a cycle. To fulfill a request, it engages in a continuous process of observation, reasoning and action, which requires:

  • Perceiving the current state (observation) - receiving information from sensors, images, text or other inputs
  • Analyzing the situation (reasoning or thought) - processing state information using its policy
  • Taking a concrete step toward resolution (action) - executing a decision to affect the environment
  • Evaluating the outcome - which generates a new observation and reward

If the observation reveals an error or incomplete information, the agent re-enters the cycle to refine its approach. The complete sequence of these observation-reasoning-action cycles is called a trajectory, representing the path of states, reasoning and actions the agent took to reach its goal.
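This cycle can be sketched in a few lines of Python. The environment and policy below are toy stand-ins, and every name is a hypothetical illustration rather than a production framework:

```python
class EchoEnv:
    """Toy environment: the task is considered done once the agent acts to finish it."""
    def reset(self):
        return "user asked to finish the task"

    def step(self, action):
        done = action == "finish"
        observation = "task complete" if done else "still pending"
        reward = 1 if done else 0
        return observation, reward, done


class RulePolicy:
    """Toy policy: immediately decides to finish."""
    def decide(self, observation):
        thought = f"Observed '{observation}'; the goal is met, so finish."
        return thought, "finish"


def run_agent(env, policy, max_steps=10):
    """Minimal observation-reasoning-action loop that records the trajectory."""
    trajectory = []
    observation = env.reset()                            # perceive the initial state
    for _ in range(max_steps):
        thought, action = policy.decide(observation)     # reasoning step
        observation, reward, done = env.step(action)     # act, then observe the outcome
        trajectory.append({"thought": thought, "action": action,
                           "observation": observation, "reward": reward})
        if done:                                         # goal reached
            break
    return trajectory
```

The returned list is the trajectory itself: each element is one observation-reasoning-action cycle, which is exactly the record that later evaluation steps inspect.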

An illustration of the agent trajectory.

As frameworks like chain-of-thought (CoT) and reasoning and acting (ReAct) enable dynamic planning by interleaving tool interaction with reasoning, AI agents have evolved to execute complex, multistep tasks. A trajectory represents a complete trace, the end-to-end record of an agent's execution path composed of multiple spans, where each span captures an individual operation or step within that path. These spans might include LLM calls, tool invocations, retrieval operations or reasoning steps, each with their own timing, inputs, outputs and metadata.

For instance, an agent might:

  • Compare user input with session history to disambiguate a term
  • Look up a policy document or search a knowledge base
  • Invoke multiple APIs to gather information
  • Reason through intermediate steps before taking an action

Each of these operations becomes a span within the overall trace and together they form the complete trajectory. Understanding this trajectory with all its constituent spans, their relationships, timing and data flow is essential for debugging failures, optimizing performance and ensuring the agent behaves as intended across its multistep execution path.
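As a rough sketch, a trace and its spans can be represented like this; the field names are illustrative assumptions, not the schema of any particular tracing SDK:

```python
from dataclasses import dataclass, field
import time


@dataclass
class Span:
    """One operation within a trace: an LLM call, tool invocation,
    retrieval or reasoning step, with its own timing, I/O and metadata."""
    name: str                  # e.g. "tool:search_flights"
    kind: str                  # "llm" | "tool" | "retrieval" | "reasoning"
    inputs: dict
    outputs: dict = field(default_factory=dict)
    start: float = field(default_factory=time.time)
    end: float = 0.0
    metadata: dict = field(default_factory=dict)


@dataclass
class Trace:
    """End-to-end record of one agent execution path, composed of ordered spans."""
    trace_id: str
    spans: list = field(default_factory=list)

    def record(self, span):
        span.end = time.time()
        self.spans.append(span)

    def tool_sequence(self):
        """The ordered tool calls, useful for comparing against a golden path."""
        return [s.name for s in self.spans if s.kind == "tool"]
```

Keeping spans structured this way is what makes the later trajectory-level checks possible: the tool sequence, timing and argument payloads can all be read back out of the trace.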

Training an agent for multistep reasoning, tool orchestration and policy adherence

In a recent webinar with Tommy Guy, principal applied researcher on Microsoft's Copilot Studio team, we explored the practical realities of moving agents from prototype to production.

Overcoming the challenges of AI agent creation, training and evaluation

Join us for an enlightening discussion with two pioneers in the field of AI: Tommy Guy, principal applied researcher at Microsoft Copilot Studio and Steve Nemzer, senior director of AI growth and innovation at TELUS Digital.

The context problem remains central

As Tommy Guy suggests, the difficulty lies not in the model’s intelligence, but in the strategy used to identify and apply the correct context for a specific domain. To solve this, system prompts must be optimized for context-aware tool orchestration. System messages serve as the persistent logic layer for every interaction, defining how an agent reasons, navigates available tools and structures its decision-making process. Without a system prompt designed for orchestration, the agent lacks the framework to resolve functional overlaps or ambiguous tool descriptions. Rather than relying on external task clarity, the system prompt must encode the "common sense" heuristics required to:

  • Disambiguate tool selection: Provide the internal logic to distinguish between similar tools based on the current environmental state.
  • Operationalize domain context: Act as the mechanism that finds and applies the "right context" for a problem, ensuring the agent doesn't improvise in unpredictable ways.

The mathematics of compound failure

Due to the probabilistic nature of LLMs and resulting output variability, an agent might fail on an input it previously handled correctly. In multistep workflows, errors compound exponentially and a single misstep in an agent's reasoning chain can derail an entire process. Consider a task requiring four sequential tool calls, each with a 99% success rate. The overall success rate? Only 96%. If each tool has a more realistic 95% success rate, the system fails one time in five.

"That pernicious math, that combinatorial math for these multi-tool call scenarios, this is what really makes me nervous," Guy noted. "How do we instill trust in something where everything has to go right in the entire progression of a task?"

Goldilocks data: The "just right" challenge

"The training data needs to be just right," Guy explained. "It needs to be just right in the sense that sometimes I get a success, sometimes I get a failure when I sample. Then I've got enough signal to compare what the model did when it succeeded versus what it did when it failed."

Model developers need goldilocks datasets, where domain experts curate scenarios that:

  • Challenge the agent without being impossible
  • Provide clear ground truth while allowing for multiple valid solution paths
  • Generate the mixture of successes and failures needed for effective reinforcement learning

To train and measure an agent's ability to interact reliably with both human users and programmatic systems, while adhering to domain-specific policies, model developers need comprehensive agentic AI training frameworks that include:

  1. Designing API specifications and databases that mirror production systems, including authentic tool names, arguments and real-world constraints like rate limits, data inconsistencies and error conditions.
  2. Developing complex, multistep tasks that require sequential tool invocation, conditional logic and the integration of information gathered across multiple interaction turns.
  3. Crafting diverse user scenarios that span the full spectrum of interactions an agent might encounter, from straightforward requests to adversarial attempts to circumvent policies, each with corresponding ground truth annotations.
  4. Enumerating the correct sequence of API calls for each task. This provides the ground truth against which agent performance can be measured. Annotation must account for multiple valid solution paths while identifying approaches that violate policies or produce incorrect final states.

For context-aware tool orchestration, every tool in the agent's environment must be well designed, with:

  • A clear description of what it does
  • A working implementation
  • Strongly-typed arguments to prevent errors
  • Optionally, typed outputs for easier integration with other tools
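As a minimal sketch, such a tool might look like the following in Python. The flight-search tool, its argument names and its stubbed results are hypothetical illustrations, not any real GDS or partner API:

```python
from dataclasses import dataclass

# Clear description of what the tool does
SEARCH_FLIGHTS_DESCRIPTION = "Search one-way flights to a destination on a given date."


@dataclass(frozen=True)
class FlightSearchArgs:
    """Strongly-typed arguments to prevent malformed calls."""
    origin: str              # IATA code, e.g. "YVR"
    destination: str         # IATA code, e.g. "NRT"
    date: str                # ISO 8601, e.g. "2026-03-01"
    max_price: float = float("inf")


@dataclass(frozen=True)
class FlightOption:
    """Typed output, easier to chain into other tools."""
    flight_number: str
    price: float


def search_flights(args: FlightSearchArgs):
    """Working implementation, stubbed here so the example runs without a live API."""
    results = [FlightOption("TD123", 840.0), FlightOption("TD456", 1210.0)]
    return [f for f in results if f.price <= args.max_price]
```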

An example of an expert creating the ideal agentic trajectories with tool calls and parameters in our proprietary platform, Fine-Tune Studio.

Three-tiered evaluation framework: Response, trajectory and step-level analysis

Consider an AI travel agent that provides the correct flight itinerary but only after querying the wrong GDS systems, making unnecessary hotel availability checks and taking a circuitous path through multiple booking APIs. The final itinerary might be accurate, but the inefficiency, latency and potential for failure make this agent unreliable for production use. To ensure reliability, we must evaluate the joint probability of the correct output and the optimal execution path. At TELUS Digital, we employ a comprehensive evaluation strategy that provides both breadth and depth of system inspection.

Level 1: Final response evaluation

This method evaluates only the user's input and the agent's final answer, ignoring the steps taken to reach it. It assesses:

  • Response quality: Is the output high-quality and does it make sense based on the context?
  • Semantic accuracy: Does it match the reference answer?

This approach is quick, cost-effective and easily automated, which is why many benchmark datasets adopt it. For example, tau-bench evaluates agent performance using the final response plus database state changes. However, while efficient and flexible, this method doesn't reveal why an agent failed or identify inefficient execution paths, which is why we complement it with deeper evaluation methods.
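A final response check can be as simple as comparing the answer to the reference. The sketch below uses token overlap as a deliberately crude stand-in for semantic comparison; real pipelines typically rely on embedding similarity or an LLM judge instead:

```python
def final_response_match(response: str, reference: str, threshold: float = 0.5) -> bool:
    """Level-1 check: Jaccard token overlap between the agent's final answer
    and the reference answer, as a simple proxy for semantic accuracy."""
    resp = set(response.lower().split())
    ref = set(reference.lower().split())
    if not resp or not ref:
        return False
    return len(resp & ref) / len(resp | ref) >= threshold
```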

Level 2: Trajectory evaluation

This method examines the complete execution path the agent took to reach its answer. While some evaluation frameworks perform basic trajectory checks, most limit their analysis to whether the agent called certain tools or reached a final state. At TELUS Digital, we've implemented comprehensive trajectory evaluation that examines every aspect of the agent's decision-making process. Through granular JavaScript object notation (JSON) annotation on Fine-Tune Studio, we evaluate:

  • Reasoning (The "why"): Did the CoT reasoning logically precede the action? Did the agent understand the context before attempting a solution? Did it cycle through thought, action and observation coherently?
  • Tool selection and use: Is the agent choosing appropriate functions for each sub-task? Is the sequence of operations efficient or circuitous?
  • Argument mapping: Are the arguments in the JSON payload semantically correct and properly structured?
  • Hallucination detection: Is the agent fabricating information, inventing non-existent tools or making up parameter values?
  • Error handling: Does the agent appropriately respond to failures and edge cases?

This comprehensive approach helps pinpoint exactly where in the reasoning process a failure occurred, whether it's a logical error, incorrect tool selection, malformed arguments, hallucinated information or an inefficient execution path.
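Some of these checks can be automated directly on the recorded tool-call sequence. The sketch below is a simplified illustration with hypothetical tool names, covering only sequence matching, hallucinated-tool detection and path efficiency; a full evaluation would also inspect reasoning, argument payloads and error handling:

```python
def evaluate_trajectory(agent_calls, golden_paths, known_tools):
    """Minimal trajectory-level checks against one or more valid golden paths."""
    # Hallucination detection: tools the agent invoked that do not exist
    hallucinated = [t for t in agent_calls if t not in known_tools]
    # Sequence match: does the trajectory follow any valid solution path?
    path_match = any(agent_calls == path for path in golden_paths)
    # Efficiency signal: extra steps beyond the shortest valid path
    shortest = min(len(p) for p in golden_paths)
    return {
        "path_match": path_match,
        "hallucinated_tools": hallucinated,
        "extra_steps": len(agent_calls) - shortest,  # > 0 suggests a circuitous path
    }
```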

In addition, our experts provide detailed feedback on what went wrong at each step of the agent's reasoning process. This human-in-the-loop approach combines automated verification with expert analysis to deliver actionable insights. Our experts annotate:

  • The specific decision point where the agent deviated from optimal behavior
  • Why the chosen action was incorrect or suboptimal
  • What the correct action should have been and the reasoning behind it
  • Contextual factors the agent missed or misinterpreted

This attribute-level inspection, enriched with expert feedback, allows developers to:

  • Identify the exact step where reasoning breaks down or incorrect tools are called, along with expert diagnosis of the root cause
  • Generate high-quality training datasets by converting corrected trajectories with expert annotations into examples for fine-tuning or trajectory tuning
  • Obtain trajectory quality scores that can be used as reward signals in reinforcement learning pipelines
  • Understand not just that something failed, but why it failed and how to prevent similar failures

This granular level of annotation provides developers with a complete diagnostic and improvement toolkit: level 1 catches failures, while level 2 diagnoses where and how they occurred across every dimension of the trajectory and generates the granular data and expert insights needed to systematically improve agent reasoning through targeted training interventions.

An example of an expert reviewing and providing feedback on an agent’s trajectory in our proprietary platform, Fine-Tune Studio.

Evaluation methodologies

Offline evaluation with golden path benchmarking

This involves creating comprehensive evaluation datasets exported in JSON format. These datasets capture the full lifecycle of multiturn conversations where each entry includes both the inputs and the assessment criteria:

  • Input context: User query and system instructions
  • Ground truth trajectory: The expected sequence of tool calls and intermediate responses (the "golden path")
  • Reference final answer: The gold-standard output for semantic comparison
  • Performance review: A structured assessment of the agent's actual response, including rubric-based ratings, quality scores and detailed feedback on where divergence occurred

The approach of combining the "golden path" with post-inference scoring and review establishes a baseline for regression testing, allowing teams to quantify exactly how close their agent is to the target behavior.
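A single dataset entry might look like the following; the field names and values are illustrative assumptions, not the actual export schema:

```python
import json

# Illustrative golden-path dataset entry (all identifiers are hypothetical)
entry = {
    "input_context": {
        "system": "You are a travel agent. Follow the cancellation policy.",
        "user": "Move my Tokyo trip to Osaka, same dates.",
    },
    "ground_truth_trajectory": [
        {"tool": "lookup_policy", "args": {"booking_id": "BK-1001"}},
        {"tool": "search_flights", "args": {"destination": "OSA"}},
        {"tool": "rebook", "args": {"booking_id": "BK-1001", "flight": "TD123"}},
    ],
    "reference_final_answer": "Your trip is rebooked to Osaka on flight TD123.",
    "performance_review": {"rubric_score": 4, "divergence": "queried hotels unnecessarily"},
}

print(json.dumps(entry, indent=2))
```

Because every entry pairs inputs with both the expected trajectory and a scored review, re-running the agent against the same entries quantifies drift from the target behavior over time.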

Interactive trace analysis with dynamic prompt modification

For exploratory testing and edge-case discovery, Fine-Tune Studio provides an interactive trace pane. This allows expert evaluators to inspect the live inference chain, modify prompts dynamically and annotate traces in real-time to expand the evaluation set.

Both methods leverage our granular JSON annotation capabilities, providing turn-by-turn, attribute-level annotation for precise model correction.

TELUS Digital’s comprehensive training and evaluation services

Assessing an AI agent's reasoning, behavior and tool usage is critical for developing reliable systems. However, achieving this at scale demands specialized expertise, advanced tools and established processes.

We provide:

  • Comprehensive training data creation: To ensure agents can navigate complex policies and long context windows, we create "Goldilocks" datasets with calibrated difficulty. Our domain experts craft multistep tasks and ground-truth annotations that push agent reasoning without exceeding operational limits, for consistent and real-world autonomous performance.
  • Expert trajectory analysis: Our trained reviewers examine detailed agent trajectories, pinpointing problems in reasoning or implementation and tagging them for refinement. They assess both functional performance and non-functional requirements like safety and bias.

Contact us to learn how we can help you move from impressive demos to production-ready systems. Let's build the future of AI together, with quality at its core.
