Optimizing AI agent performance: Why expert-curated Goldilocks datasets are critical

Key takeaways
- Effective training relies on a Goldilocks balance: data that is neither too simple, which leads the agent to copy explicit instructions repeated in its training data, nor too complex, which makes training ineffectual and wastes compute. This precise calibration forces the agent to develop genuine reasoning skills rather than simply follow explicit instructions.
- Because high-quality training data requires an understanding of subtle nuances and critical edge cases, it can’t be crowdsourced or left to generalists. Instead, domain experts need to oversee the data to ensure the agent’s logic aligns with real-world complexity.
- To bridge the gap between theory and execution, agents learn to reason through golden path trajectories — multilayered execution maps that provide a clear blueprint for successful problem-solving.
- This training evolves from simple, single-tool tasks into advanced, multi-turn orchestrations. Developers further refine this process using techniques like ablation — systematically removing hints — to isolate and identify the high-value data where the agent’s logic is most rigorously tested.
Forget the chatbots of the past that were just built to talk — today’s AI agents are built to do. To move from conversation to execution, these agents must master complex, multistep decision-making — a feat that hinges largely on the quality of their training data. Success requires a Goldilocks dataset: one that isn't so prescriptive it stifles the model’s autonomy, yet isn't so difficult that it misses the mark on critical goals.
Mastering this balance is the secret to moving beyond basic automation. By hitting that “just right” training data sweet spot, you build a foundation for agents that can actually reason through ambiguity, handle real-world curveballs and recover from errors.
What is a Goldilocks dataset?
Training AI agents is fundamentally different from training standard AI models because of their extended capabilities — they reason, orchestrate multiple tools, map parameters across complex workflows and make sequential decisions that compound over time. This means their training data has to capture not just the what, but also the how and the why of expert decision-making.
A Goldilocks training dataset is a precisely calibrated collection of data used to train AI agents. Unlike standard datasets, it’s neither too simple (which leads to the agent simply following explicit instructions) nor too complex (which provides little training value and wastes compute). These datasets challenge agents at just the right difficulty level, teaching them how to handle uncertainty and fix their own mistakes.
To effectively teach AI agents how to reason, not just what to execute, the dataset must achieve the correct level of complexity, incorporate expert domain knowledge and document both successful and failed outcomes. The real measure of an AI agent lies not in its ability to handle straightforward tasks, but in whether its training has prepared it for the ambiguous and unpredictable real-world scenarios that characterize live operations.
The Goldilocks principle of AI agent training
Defining a Goldilocks dataset is the first step. Understanding the underlying mechanics that make it work — the Goldilocks principle — is where the real engineering begins. Essentially, this principle represents the pursuit of the "just right" level of complexity, sitting between the extremes of underfitting and overfitting. Translating this theoretical balance into a functional training environment requires constant monitoring of how the agent responds to the data’s difficulty level.
During training phases, when an agent achieves a high degree of accuracy on a dataset and minimal variation in its execution paths, it can be a red flag. It means the dataset may lack the complexity that mirrors real-world scenarios. Essentially, the agent may just be following explicit instructions or repeating common workflows rather than learning to navigate uncertainty, handle edge cases and recover from errors. Alternatively, a dataset that causes consistent failure signals a mismatch between the agent's current capabilities and the task complexity, leading to wasted training cycles rather than growth. If high accuracy is a red flag for stagnation, then the solution lies in finding the tension point where the model is challenged but not overwhelmed — a concept discussed in the webinar, Overcoming the challenges of AI agent creation, training and evaluation:
“The training data needs to be just right in that sometimes I get a success, sometimes I get a failure when I sample,” says Tommy Guy, principal applied researcher for Microsoft Copilot Studio. “Then I can compare and train a model away from failure and towards success. But the key thing is I need a mixture of successes and failures. That is the hard problem — how do we generate data that is just hard enough?”
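The sampling test Guy describes can be sketched as a simple filter: sample several rollouts per training example and keep only the examples whose success rate shows a genuine mixture of successes and failures. The thresholds and data shape below are illustrative assumptions, not a description of any production pipeline.

```python
def goldilocks_filter(outcomes_by_example, low=0.2, high=0.8):
    """Keep examples whose sampled success rate shows a mixture of
    successes and failures, i.e. tasks that are neither trivial nor
    impossible for the current model."""
    kept = {}
    for example_id, outcomes in outcomes_by_example.items():
        success_rate = sum(outcomes) / len(outcomes)  # 1 = success, 0 = failure
        if low <= success_rate <= high:
            kept[example_id] = success_rate
    return kept

# Eight sampled rollouts per hypothetical training example
samples = {
    "task_a": [1, 1, 1, 1, 1, 1, 1, 1],  # always solved: too easy
    "task_b": [1, 0, 1, 0, 0, 1, 1, 0],  # mixed outcomes: just right
    "task_c": [0, 0, 0, 0, 0, 0, 0, 0],  # never solved: too hard
}
print(goldilocks_filter(samples))  # {'task_b': 0.5}
```

Only `task_b` survives, which is exactly the "mixture of successes and failures" the model can be trained away from and towards.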

To achieve this, the Goldilocks principle guides researchers to seek the Zone of Proximal Development — a learning theory borrowed from educational psychology. This zone represents the spot where data is challenging enough to push the agent beyond its current boundaries, forcing it to develop novel strategies and improve generalization capabilities, while remaining structured enough to provide clear learning signals.
To find this narrow window of optimal difficulty, developers often use techniques like ablation, which involves the systematic removal of components or instructions within the training data to quantify their impact. By stripping away redundant information that the agent already understands, developers can filter out "easy" data and redirect their focus toward high-value, challenging edge cases. For instance, if removing 50% of a specific instruction doesn't hinder performance, that data is deemed redundant; however, if a minor change causes the agent to fail, you have found the critical logic threshold.
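A minimal sketch of this ablation loop: strip a hint from every prompt, re-run evaluation and compare success rates. The `toy_eval` harness and the 5% threshold are stand-in assumptions for an actual evaluation pipeline.

```python
def ablate_and_classify(run_eval, examples, hint, threshold=0.05):
    """Compare agent success with and without a given hint in the prompt.
    `run_eval` is any callable mapping a list of prompts to a success rate."""
    with_hint = run_eval(list(examples))
    without_hint = run_eval([ex.replace(hint, "") for ex in examples])
    drop = with_hint - without_hint
    if drop <= threshold:
        return "redundant"   # the agent no longer needs this hint
    return "critical"        # the hint carries logic the agent hasn't learned

examples = [
    "Refund the order. Hint: call check_policy() before issue_refund().",
    "Cancel the trip. Hint: call check_policy() before issue_refund().",
]
hint = "Hint: call check_policy() before issue_refund()."

# Toy harness: pretend the agent only succeeds when the ordering hint is present.
def toy_eval(prompts):
    return sum(1 for p in prompts if hint in p) / len(prompts)

print(ablate_and_classify(toy_eval, examples, hint))  # critical
```

When the classification comes back "redundant", the hint can be removed permanently, pushing the dataset toward the harder, higher-value end of the spectrum.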
The role of domain experts in creating Goldilocks AI training data
Finding the ideal training dataset for an AI agent requires more than just mathematical techniques like ablation. While these techniques offer the framework, they can’t determine what constitutes correct behavior in a nuanced, professional environment. Therefore, to convert raw data into a reliable learning signal, the training process needs to be guided by experts who deeply understand the specific industry's complexities and consequences. These domain experts bring an essential understanding of the nuances, terminology, context and edge cases that define a professional field. They are uniquely positioned to:
- Identify and label subtle distinctions: They can accurately differentiate between highly similar concepts, intents or outcomes that a layperson might conflate.
- Annotate with contextual depth: Their knowledge allows them to label data in a way that reflects the why and how behind each entry, capturing the context that is crucial for an agent's decision-making.
- Prioritize critical edge cases: Experts know which rare scenarios are most important for an agent to handle correctly, ensuring the training set focuses on scenarios with the highest risk or reward.
- Ensure linguistic and conceptual fidelity: They ensure the training data accurately reflects the terminology, regulatory requirements and conventional wisdom of the profession, making the agent's eventual output authoritative and trustworthy.
For example, in training an insurance claims AI agent, domain experts contribute:
Real-world scenario knowledge:
- Understanding of common fraud patterns (e.g., claims filed just after policy purchase)
- Recognition of seasonal variations (e.g., water damage claims spiking in winter)
- Familiarity with industry-specific terminology and abbreviations
- Regulatory compliance requirements
Edge case identification:
- What happens when a claim spans multiple policy periods?
- How to handle claims during policy renewal gaps?
- Processing claims for partially covered events
- Dealing with subrogation scenarios
Error pattern documentation:
- Common mistakes novice adjusters make
- Ambiguous queries that require clarification
- Tool invocation sequences that seem logical but violate business rules
- Parameter combinations that are technically valid but practically nonsensical
Reasoning trace validation:
- Ensuring the agent's decision logic matches expert judgment
- Identifying where the agent should express uncertainty
- Defining acceptable variation in execution paths
- Calibrating confidence thresholds
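One way to make this expert input machine-usable is a structured annotation schema that records the intent label, the reasoning behind it, and the edge-case and uncertainty flags described above. The field names below are a hypothetical sketch, not an established format.

```python
from dataclasses import dataclass, field

@dataclass
class ExpertAnnotation:
    """Hypothetical schema for one expert-labelled agent training example."""
    query: str
    intent: str                           # expert-disambiguated intent label
    reasoning_trace: list                 # the why/how behind the expected decision
    edge_case: bool = False               # prioritized rare-but-critical scenario
    expresses_uncertainty: bool = False   # where the agent should hedge
    regulatory_notes: list = field(default_factory=list)

example = ExpertAnnotation(
    query="Claim filed three days after policy purchase for water damage",
    intent="escalate_fraud_review",
    reasoning_trace=[
        "Claim timing matches a known fraud pattern (filed just after purchase)",
        "Water damage alone is seasonal-normal; the timing is the deciding signal",
    ],
    edge_case=True,
)
```

Capturing the reasoning trace alongside the label is what lets the agent learn the why, not just the what.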
Pinpointing this high-value data naturally leads to the question: What actually constitutes this ideal training material? While the Goldilocks principle establishes the target difficulty, it’s the golden path trajectories that provide the necessary structure and depth to guide an agent through these complex scenarios.
The anatomy of a golden path trajectory
These trajectories are multilayered execution maps that capture the complex behavior of an expert agent and provide the optimal challenge for learning. Essentially, a golden path trajectory is the sequence of actions that leads an agent from its starting point to its goal in the most efficient and successful way possible. It includes the following:
Detailed execution trees:
- The complete thought process, from the initial user query to the final response
- Decision points in the reasoning where the agent must select from several valid options
- Branch points or path divergences that result in the use of different tools
- Recovery paths that include strategies for overcoming failures when initial approaches are unsuccessful
Reasoning traces:
- Step-by-step thought processes that capture the granular sequence of events, mirroring expert decision-making
- Contextual awareness at each decision point
- Justifications for why certain tools or parameters were selected
- Indications of an agent’s confidence levels and how uncertainty was addressed
Tool usage patterns:
- Determine the optimal tools to invoke and the order in which they should be executed
- Address dependencies, ensuring prerequisite tools complete successfully before dependent tools start
- Establish clear strategies for handling tool failures, including primary tool failure fallbacks
- Decide between parallel or sequential execution for tool orchestration
Argument mapping and parameter binding:
- Identify and extract relevant parameters from natural language queries
- Implement type conversions and validation logic to ensure data integrity
- Define default values for optional parameters
- Address parameter dependencies when coordinating multiple tool calls
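The four layers above can be represented as a small data structure. This is an illustrative sketch of how a trajectory might be stored, using hypothetical class and field names.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolCall:
    name: str
    arguments: dict                  # bound parameters for this invocation
    fallback: Optional[str] = None   # tool to invoke if this one fails

@dataclass
class DecisionPoint:
    question: str        # what the agent must choose between
    options: list
    chosen: str
    justification: str   # reasoning trace behind the choice

@dataclass
class GoldenPathTrajectory:
    user_query: str
    decision_points: list = field(default_factory=list)
    tool_sequence: list = field(default_factory=list)
    recovery_paths: dict = field(default_factory=dict)  # failed tool -> fallback

step = ToolCall(
    name="search_flights",
    arguments={"origin": "JFK", "destination": "LHR"},
    fallback="search_alternate_airports",
)
trajectory = GoldenPathTrajectory(
    user_query="Rebook my flight to an earlier date under $800",
    tool_sequence=[step],
    recovery_paths={"search_flights": "search_alternate_airports"},
)
```

Keeping decision points and recovery paths as first-class fields is what distinguishes a trajectory from a flat log of tool calls.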
To see how these technical elements — like execution trees and reasoning traces — function in a real-world scenario, consider a common agentic task: rebooking a flight under strict constraints. The user might say, "I need to change my New York to London flight scheduled for March 15th to an earlier date, but the total cost must not exceed $800."
A golden path trajectory for this scenario would capture the:
Detailed execution tree
- Initial reasoning: Parse the request to identify the original booking, new constraints and budget limit
- Decision point: Should the agent first check the existing booking details or search for new flights?
- Chosen path: Retrieve existing booking to understand current costs and change fees
- Branch point: If the change fee plus the new flight exceeds budget, explore alternative airports or dates
- Recovery path: If no direct rebooking works, consider the cancel and rebook strategy
Reasoning traces
- "User has an existing booking that needs modification. Before searching for new flights, I need to understand the change policy and fees associated with the current ticket."
- "The current booking shows a $150 change fee. This leaves $650 for the new flight cost if we modify or $800 if we cancel and rebook from scratch."
- "March 10-14 flights are available, but March 12th offers the best price at $620, staying within budget after the change fee."
Tool usage patterns
- Sequential execution: get_booking_details() → check_change_policy() → search_flights() → calculate_total_cost() → process_rebooking()
- Fallback strategy: If modify_booking() fails due to fare class restrictions, invoke cancel_booking() then create_new_booking()
- Parallel opportunity: While waiting for flight search results, simultaneously check get_travel_credits() to see if user has applicable credits
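The fallback strategy above can be sketched as a try/except orchestration. The tool functions here are hypothetical stubs standing in for real airline-API integrations; `modify_booking()` is hard-coded to fail so the recovery path runs.

```python
class ToolError(Exception):
    pass

# Hypothetical booking tools; real implementations would call airline APIs.
def modify_booking(booking_id, new_date):
    raise ToolError("fare class does not permit changes")

def cancel_booking(booking_id):
    return {"status": "cancelled", "booking_id": booking_id}

def create_new_booking(origin, destination, date):
    return {"status": "booked", "date": date}

def rebook_with_fallback(booking_id, origin, destination, new_date):
    """Primary path: modify in place. Recovery path: cancel and rebook."""
    try:
        return modify_booking(booking_id, new_date)
    except ToolError:
        cancel_booking(booking_id)
        return create_new_booking(origin, destination, new_date)

result = rebook_with_fallback("BK123", "JFK", "LHR", "2026-03-12")
print(result["status"])  # booked
```

A golden path trajectory would record both branches: the failed primary call and the successful recovery sequence.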
Argument mapping and parameter binding
- Extract from query: origin="New York", destination="London", original_date="2026-03-15", max_budget=800, direction="earlier"
- Derive parameters: new_date_range=["2026-03-01" to "2026-03-14"] (inferred from "earlier")
- Type conversion: "March 15th" → ISO 8601 date format "2026-03-15"
- Parameter dependency: available_flight_budget = max_budget - change_fee (calculated after retrieving change fee)
- Validation: Ensure new_date < original_date and total_cost <= max_budget before confirming
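The derivation and validation steps above can be written out directly. This is a minimal sketch of the parameter-binding logic from the worked example; the function name and signature are assumptions for illustration.

```python
from datetime import date

def bind_rebooking_parameters(max_budget, change_fee, original_date, new_date, new_fare):
    """Derive dependent parameters and validate the constraints from the query."""
    available_flight_budget = max_budget - change_fee   # parameter dependency
    total_cost = change_fee + new_fare
    assert new_date < original_date, "user asked for an earlier flight"
    assert total_cost <= max_budget, "total cost exceeds the stated budget"
    return {"available_flight_budget": available_flight_budget,
            "total_cost": total_cost}

params = bind_rebooking_parameters(
    max_budget=800,
    change_fee=150,
    original_date=date(2026, 3, 15),
    new_date=date(2026, 3, 12),
    new_fare=620,
)
print(params)  # {'available_flight_budget': 650, 'total_cost': 770}
```

With a $150 change fee and a $620 fare, the $770 total clears the $800 ceiling, matching the reasoning trace above.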
Ultimately, this golden path trajectory demonstrates that a truly effective AI agent is defined by its ability to transform a vague human goal into a structured execution strategy that prioritizes logical reasoning and budget-conscious precision.
Expert-generated AI training data
Creating effective training data for AI agents requires an approach that balances complexity, incorporates domain expertise and captures both successful executions and realistic failure modes. At TELUS Digital, we have developed a comprehensive framework for generating high-quality agentic training data that addresses these needs.
Calibrated complexity distribution
Effective agent training requires datasets with intentional difficulty stratification. Our approach involves creating execution scenarios across three complexity tiers:
- Simple scenarios: Single-tool invocations with straightforward parameter mapping to establish baseline skills
- Intermediate complexity: Multistep multi-tool orchestrations with parameter dependencies
- Complex scenarios: Advanced multi-turn conversations, parallel tool orchestration and sophisticated reasoning chains
Agent training starts with providing the agent with detailed goals, task sequences and code snippets for executing API calls. As the agent succeeds, instructions and data are removed until performance degrades, marking the boundary of the difficult training set. Including failure modes like errors and edge cases ensures agents learn to recognize and recover from obstacles.
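That progressive removal can be sketched as a loop that strips one layer of scaffolding at a time until performance degrades. The layer names and toy evaluator below are assumptions standing in for a real evaluation harness.

```python
def find_difficulty_boundary(scaffolding, evaluate, min_success=0.5):
    """Strip scaffolding (goals, task steps, code snippets) one layer at a
    time until agent performance degrades, marking the boundary of the
    hard-but-learnable training set. `evaluate` maps the remaining
    scaffolding to a success rate."""
    remaining = list(scaffolding)
    while remaining:
        candidate = remaining[:-1]          # remove the last hint layer
        if evaluate(candidate) < min_success:
            break                           # degradation boundary found
        remaining = candidate
    return remaining

# Toy evaluator: pretend the agent succeeds only while the API snippet survives.
def toy_evaluate(layers):
    return 0.9 if "api_snippet" in layers else 0.2

boundary = find_difficulty_boundary(["api_snippet", "task_steps", "goal"], toy_evaluate)
print(boundary)  # ['api_snippet']
```

The surviving layer marks the critical logic threshold: remove it and the agent fails, so examples near this boundary carry the most training value.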
Balanced capability development
Training data must develop the full spectrum of agent capabilities. Our methodology ensures balanced coverage across:
- Knowledge retrieval scenarios: Test the agent's ability to search, extract and synthesize information from knowledge bases
- Tool use scenarios: Focus on application programming interface (API) orchestration, parameter mapping and sequential tool invocation
- Hybrid scenarios: Require sophisticated combinations of knowledge retrieval and tool use within single execution chains
This distribution prevents agents from over-indexing on any single capability while ensuring they develop integrated competencies.
Domain-specific AI training
Realistic training requires authentic operational environments. Our approach includes designing:
- Specialized knowledge databases: Cross-industry data structures reflect real-world complexity, schemas and volumes for domains like insurance, IT and logistics.
- Integrated API frameworks: Domain-specific APIs that emulate production environments, allowing agents to perform verifiable calls using authentic logic, response formats and error handling.
- Ground truth references: Expert-validated documentation of expected behaviors, including tool sequences, parameter mapping and decision logic.
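A mock domain API of this kind can be very small and still exercise authentic response formats and error handling. The endpoint, policy data and status codes below are hypothetical, chosen to mirror the insurance examples earlier in this article.

```python
# Minimal sketch of a domain-specific mock API emulating a production
# claims endpoint: realistic response shapes and error handling.
POLICIES = {"POL-100": {"active": True, "coverage": ["water_damage"]}}

def get_claim_decision(policy_id, peril):
    if policy_id not in POLICIES:
        return {"status": 404, "error": "policy_not_found"}
    policy = POLICIES[policy_id]
    if not policy["active"]:
        return {"status": 409, "error": "policy_inactive"}
    return {"status": 200, "covered": peril in policy["coverage"],
            "policy_id": policy_id}

print(get_claim_decision("POL-100", "water_damage"))  # status 200, covered True
print(get_claim_decision("POL-999", "fire"))          # status 404 error
```

Because the error branches are deterministic, every agent call against the mock can be verified against a ground truth reference.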
Expert-led quality assurance process
Our three-stage validation approach ensures training data meets the highest standards.
Stage 1: Query editing and quality control
Domain experts review and refine user queries to ensure they:
- Reflect authentic business scenarios and terminology
- Include appropriate contextual metadata
- Present realistic parameter variations and ambiguities
- Match the complexity distribution targets
Stage 2: Rubric generation and validation
Custom evaluation rubrics with weighted criteria — between 2 and 15 per rubric — are developed and mapped to specific queries. These rubrics ensure objective evaluation across:
- Reasoning quality and logical coherence
- Tool selection accuracy and sequencing
- Parameter mapping correctness
- Error handling and recovery strategies
- Response completeness and accuracy
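Weighted rubric scoring of this kind reduces to a weighted average over criteria. The weights and judgment values below are hypothetical, chosen only to illustrate the mechanics.

```python
def score_with_rubric(rubric, judgments):
    """Weighted rubric score: each criterion carries a weight; `judgments`
    maps criterion -> score in [0, 1] from an expert or judge model."""
    total_weight = sum(rubric.values())
    earned = sum(weight * judgments[name] for name, weight in rubric.items())
    return earned / total_weight

rubric = {  # hypothetical weights for the criteria listed above
    "reasoning_quality": 3,
    "tool_selection": 2,
    "parameter_mapping": 2,
    "error_recovery": 2,
    "response_accuracy": 3,
}
judgments = {"reasoning_quality": 1.0, "tool_selection": 1.0,
             "parameter_mapping": 0.5, "error_recovery": 0.0,
             "response_accuracy": 1.0}
print(round(score_with_rubric(rubric, judgments), 2))  # 0.75
```

Weighting lets a rubric with anywhere from 2 to 15 criteria emphasize the dimensions, such as reasoning quality, that matter most for a given query.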
Stage 3: Execution trace annotation
Using structured annotation features, every input, expected response and evaluation criterion is validated against ground truth references. This includes:
- Detailed reasoning traces at each decision point
- Tool invocation sequences with full parameter documentation
- Alternative valid execution paths
- Failure mode documentation with recovery strategies
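Validating a trace against ground truth, including the alternative valid paths noted above, can be sketched as a membership check over documented tool sequences. The data shapes here are illustrative assumptions.

```python
def validate_trace(trace, ground_truth):
    """Check an annotated execution trace against expert ground truth:
    the tool sequence must match the golden path or one of the
    documented alternative valid execution paths."""
    valid_paths = [ground_truth["golden_path"]] + ground_truth.get("alternatives", [])
    return trace["tools"] in valid_paths

ground_truth = {
    "golden_path": ["get_booking_details", "search_flights", "process_rebooking"],
    "alternatives": [["get_booking_details", "get_travel_credits",
                      "search_flights", "process_rebooking"]],
}
trace = {"tools": ["get_booking_details", "search_flights", "process_rebooking"]}
print(validate_trace(trace, ground_truth))  # True
```

A production validator would also compare parameters and reasoning traces at each step, but even sequence-level checks catch tool orderings that violate the golden path.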
Comprehensive output generation
For each agent domain, the framework produces extensive training materials, including:
- Detailed execution transcripts: Featuring diverse, realistic interactions including multi-turn conversations and autonomous agent behaviors
- Varied file types: Including different formats, lengths and complexity levels to ensure robust generalization
- Annotated reasoning traces: Step-by-step documentation of agent decision-making with expert validation
- Custom rubric themes: Domain-specific evaluation criteria that capture industry best practices
Benchmarking against real-world workflows
The training data creation process itself is benchmarked against established workflows to ensure it produces:
- Prompts that authentically represent user needs
- Responses that match expert-level quality
- Ground truth outcomes validated by domain specialists
- Evaluation criteria aligned with business objectives
This systematic approach ensures AI agents receive training signals that are neither too easy nor too hard, but precisely calibrated to push agents toward expert-level performance in their specific domains.
Leveraging the Goldilocks approach as a strategic advantage
Achieving autonomous execution demands more than just large datasets — it requires a sophisticated architectural approach to learning. In the agentic era, the competitive advantage will ultimately rest not with those who have the most data, but with those whose data is the most precisely calibrated. Mastering the Goldilocks principle will ensure agents aren’t merely executing pre-programmed scripts, but are truly prepared to navigate the unpredictable, high-stakes environment of high-value work.
By maintaining the delicate balance of complexity, incorporating domain expert knowledge and systematically documenting both success and failure paths, we create training datasets that don't just teach agents what to do; they teach agents how to reason. Contact us today to learn more.



