
The evolution of post-training in the age of reasoning models

Posted May 12, 2025

Large language models (LLMs) have transformed natural language processing, revolutionizing tasks ranging from text generation to complex reasoning. With an ever-increasing number of product use cases built on these reasoning models, it is crucial for machine learning and artificial intelligence (AI) teams to understand how to effectively adapt a base LLM into a tool that not only understands language, but also interacts in a human-centric, responsible and contextually aware manner.

In this article, we’ll walk through the multi-stage journey of LLM adaptation — from pre- and post-training refinements to the latest advancements in reinforcement learning algorithms.

Pretraining and the industry-standard approaches to fine-tuning large language models

The development of LLMs involves a structured, multi-phase process, each stage contributing to the model’s final performance and usability.

The initial stage, known as pretraining, forms the foundation. During this phase, the model is exposed to an extensive and diverse dataset gathered at internet scale. This data often comes from unreliable sources and can contain misinformation, propaganda, conspiracy theories and other harmful content that needs further refinement.

In the post-training stage, supervised fine-tuning (SFT) introduces curated, high-quality datasets, such as content from human domain experts. This phase molds the model, enhancing its ability to produce useful, contextually appropriate, accurate and acceptable responses. However, SFT is limited by its one-sided nature: It only shows the model good examples. This leads us to a more dynamic and iterative post-training method: reinforcement learning (RL).

Why is post-training essential?

While pretraining enables LLMs to learn broad patterns from internet-scale data, it also has clear limitations that make post-training essential.

For example, pretrained models don't truly "understand" language; they merely predict the next token. This makes them unreliable for nuanced instructions or complex tasks. Left unchecked, they can echo harmful, low-quality or clickbait content from their training sources. Further, extending models through continued pretraining carries risks like catastrophic forgetting, where previously learned knowledge degrades and the model struggles to align with new objectives. These challenges are compounded by a growing scarcity of high-quality, diverse, human-authored data as the internet slows in growth and becomes increasingly saturated with AI-generated content. This scarcity directly undermines the effectiveness of pretraining by limiting the model’s exposure to rich and varied linguistic patterns. Finally, from a practical standpoint, pretraining is extremely resource-intensive.

Instead of pretraining from scratch, model builders increasingly rely on foundation models and apply post-training techniques like SFT, instruction tuning and reinforcement learning to align them with specific tasks, domains or safety requirements. This approach addresses the shortcomings of pretraining and enables more efficient and targeted model deployment.

Advancing post-training with reinforcement learning

While methods like SFT have improved LLM performance, they are inherently limited when it comes to teaching dynamic reasoning or adaptive behavior. Reinforcement learning (RL) builds on these methods, offering a more interactive and iterative improvement loop.

RL shifts the paradigm from static, example-driven learning to a more dynamic, exploration-based process. It allows a model to learn through trial and error by interacting with an environment, receiving both positive and negative feedback and improving in less-scripted, often multi-objective settings.

Unlike SFT, which shows only correct answers, RL enables a model to learn what not to do. This makes RL especially suitable for tasks requiring sequential decision-making or complex planning, such as robotics, game playing or strategic reasoning.

What is reinforcement learning?

From mastering complex games to enabling the next generation of autonomous vehicles, reinforcement learning is at the forefront of AI's push into new territories — particularly in tasks that involve sequential decision-making or benefit from trial-and-error optimization approaches. RL involves a model interacting with an environment, receiving observations and rewards based on its actions. The state refers to the model’s current situation, which changes dynamically as it acts, while actions are the choices it makes to influence future states. Rewards are the reinforcement signals that indicate the quality of each action. The strategy the model uses to choose actions based on states is called its policy.

The ultimate goal in reinforcement learning is to learn a policy that maximizes the total reward the model receives over time. This training loop creates a closed, incrementally improving system: the model sends an action to the environment, which responds with a new observation and a reward.
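To make this loop concrete, here is a minimal Python sketch of the interaction described above. The `environment`, `policy` and `learner` objects are hypothetical placeholders for illustration, not a specific library's API.

```python
# A minimal sketch of the RL interaction loop: act, observe, receive a reward, update.
# `environment`, `policy` and `learner` are hypothetical stand-ins, not a real library API.

def run_episode(environment, policy, learner, max_steps=1000):
    state = environment.reset()                 # initial observation
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy.select_action(state)                  # choose an action for the current state
        next_state, reward, done = environment.step(action)   # environment returns feedback
        learner.update(state, action, reward, next_state)     # refine the policy from that feedback
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward                         # the quantity RL aims to maximize over time
```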

Reinforcement learning from human feedback

Reinforcement learning from human feedback (RLHF) is a specific application of RL focused on aligning LLM behavior with human preferences. It integrates human judgment directly into the RL loop by using curated feedback to guide model updates.

In a typical RLHF pipeline, the model produces multiple responses. Human evaluators then compare these outputs and choose the one that best meets the desired criteria. These preferences are used to train a reward model, which scores future outputs. Reinforcement learning algorithms such as proximal policy optimization (PPO) are then used to optimize the model against this reward model.

The foundation of RLHF lies in effective data collection. Human feedback is typically gathered in two forms: scalar feedback, where a single response is scored, and ranking feedback, where multiple responses are ordered by quality. Through repeated feedback and optimization, the model progressively aligns its behavior with human preferences, producing outputs that are more helpful, accurate and aligned with user needs.
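As a rough illustration of how ranking feedback is typically turned into a reward model, the PyTorch sketch below shows a pairwise (Bradley-Terry style) training loss. The `reward_model` callable, which scores a prompt-response pair, is a hypothetical assumption; the architecture behind it is left open.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of training a reward model from ranking feedback.
# `reward_model` is assumed to map a (prompt, response) pair to a scalar score tensor;
# how it is implemented (e.g. an LLM with a scalar head) is left open.

def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # score for the preferred response
    r_rejected = reward_model(prompt, rejected)  # score for the rejected response
    # Bradley-Terry style objective: push the chosen score above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Minimizing this loss widens the margin between preferred and rejected responses, which is what lets the trained reward model score future outputs during RL.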

At TELUS Digital, our tech-led solutioning is optimized for complex human-in-the-loop and LLM-interactivity tasks such as RLHF. Fine-Tune Studio, our feature-rich task execution platform, enables a variety of fine-tuning tasks, including RL, for model improvements and alignment. Meanwhile, our sourcing platform, Experts Engine, algorithmically matches the best qualified individuals to the task to be performed, such as ranking model outputs or generating preference signals.

Reinforcement learning algorithms

Reinforcement learning, whether applied with or without human feedback, relies on a range of algorithms to optimize model behavior. At its core, there are two primary strategies for learning an optimal policy that maximizes rewards over time: value-based methods and policy-based methods.

Value-based methods, like Q-learning, estimate the value of being in a state or taking an action. While value-based methods help estimate the long-term gain of actions, they work well only in discrete action spaces where a model selects from a fixed list of options. They cannot handle continuous action spaces, where the range of possible actions is vast or even infinite, as in real-world scenarios like autonomous driving or robotic manipulation. Additionally, Q-learning is inherently deterministic: Given the same input state, it will always produce the same output action. This rigidity becomes a problem in stochastic environments, where flexible, probabilistic behaviors are needed to respond effectively.
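For a concrete picture of a value-based method, here is a minimal tabular Q-learning sketch in Python. The hashable state and action representations, learning rate and exploration rate are simplifying assumptions for illustration.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch for a discrete action space.
# States and actions are assumed to be hashable (e.g. small integers or tuples).

Q = defaultdict(float)  # Q-values default to 0.0 for unseen (state, action) pairs

def q_learning_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q[(next_state, a)] for a in actions)            # value of the best next action
    td_target = reward + gamma * best_next                          # bootstrapped target
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])  # move the estimate toward the target

def choose_action(state, actions, epsilon=0.1):
    # Epsilon-greedy: mostly exploit current estimates, occasionally explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```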

Policy-based methods, on the other hand, directly learn a policy in a given environment. Unlike Q-learning, which estimates the value of actions, policy-based methods focus on optimizing the policy itself. In these methods, the model chooses actions based on probabilities defined by its policy. Initially, the policy might be random, but as the model interacts with the environment and gathers experience, it adjusts the policy to increase the likelihood of actions that lead to higher rewards.

The key idea in methods like policy-gradient algorithms is to adjust the policy by following the gradient of the expected reward with respect to the policy parameters. This is typically done using gradient ascent or descent techniques, where the policy is updated in the direction that improves performance. Over time, the model refines its policy, allowing it to take actions that maximize long-term rewards. This helps the model to try different actions over time, reducing the chance of it getting stuck in less effective strategies and improving performance in complex environments. Now, we will turn our attention to some of the most commonly used policy-based algorithms.
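The sketch below illustrates this with a minimal REINFORCE-style policy-gradient update in PyTorch. Treating `policy` as a module that returns an action distribution for a state is an assumption for illustration, not a prescribed implementation.

```python
import torch

# Minimal REINFORCE-style policy-gradient sketch.
# `policy` is assumed to be a torch.nn.Module that returns a torch.distributions
# object over actions for a given state tensor.

def policy_gradient_update(policy, optimizer, trajectory, gamma=0.99):
    # trajectory: list of (state, action, reward) tuples from one episode
    returns, G = [], 0.0
    for _, _, reward in reversed(trajectory):
        G = reward + gamma * G            # discounted return from each step onward
        returns.insert(0, G)

    loss = 0.0
    for (state, action, _), G in zip(trajectory, returns):
        dist = policy(state)               # action distribution under the current policy
        loss -= dist.log_prob(action) * G  # raise the log-probability of high-return actions

    optimizer.zero_grad()
    loss.backward()   # ascent on expected return, implemented as descent on the negated objective
    optimizer.step()
```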

Actor-critic methods

Actor-critic methods separate learning into two roles: the actor selects actions, while the critic evaluates and provides feedback on these actions. This helps reduce errors and speeds up model learning. The approach is ideal for industrial automation and warehouse robotics, particularly in robotic-arm training, where the critic offers continuous feedback on grasp success or failure. This feedback allows the actor to quickly refine motions, minimizing trial noise. The method's ability to deliver both sample efficiency (to reduce costly hardware wear) and rapid convergence (to shorten training time) makes it a powerful choice.
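A minimal one-step actor-critic update might look like the PyTorch sketch below; `actor`, `critic` and a single `optimizer` covering both modules are illustrative assumptions rather than a specific framework's API.

```python
import torch

# Minimal one-step actor-critic sketch.
# `actor` is assumed to return an action distribution for a state; `critic` returns
# a scalar state-value tensor. `optimizer` is assumed to cover both modules' parameters.

def actor_critic_update(actor, critic, optimizer, state, action, reward,
                        next_state, done, gamma=0.99):
    value = critic(state)
    next_value = torch.zeros_like(value) if done else critic(next_state).detach()
    td_error = reward + gamma * next_value - value   # the critic's feedback on the action

    actor_loss = -(actor(state).log_prob(action) * td_error.detach()).mean()  # actor follows the critic's signal
    critic_loss = td_error.pow(2).mean()                                      # critic refines its value estimate

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```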

Proximal policy optimization

This is a popular RL algorithm used to train models in environments where they need to learn a sequence of actions to achieve a goal, like playing a game. Developed by OpenAI, proximal policy optimization (PPO) aims to improve the stability and reliability of the training process compared to earlier policy gradient methods. The algorithm uses the advantage function to estimate how much better a specific action taken in a particular state was compared to the average action the model might have taken in that same state. A positive advantage means the action was better than average; a negative advantage means it was worse.

The key innovation in PPO is how it updates the policy using this advantage information. Older methods tried to rapidly increase the probability of actions with high advantages. However, this can lead to large, unstable updates where the policy changes too drastically, potentially causing performance to collapse. PPO prevents this by being more conservative. It calculates a ratio between the probability of taking an action under the new updated policy and the old policy. If this ratio suggests a large change (either increasing the probability of a good action too much or decreasing the probability of a bad action too much), PPO "clips" the objective function. This clipping essentially puts a limit on how much the policy can change in a single update step. It ensures the new policy stays "proximal" (close) to the old policy, preventing destructive updates.

The model iteratively collects experience using its current policy, calculates advantages and then updates the policy using the clipped objective for several optimization steps. This cycle repeats, allowing the policy to gradually improve in a stable manner.

Suppose we are training a model to play a simple video game where the goal is to collect coins while avoiding obstacles. The model plays the game using its current policy (strategy). It might move left, right or jump. It receives positive rewards for collecting coins and negative rewards (penalties) for hitting obstacles. After playing for a while, the PPO algorithm looks back at the actions taken. Let's say in one situation, jumping led to getting a coin (positive advantage). In another, moving right led to hitting an obstacle (negative advantage). PPO calculates how much it wants to increase the probability of jumping in the first situation and decrease the probability of moving right in the second. However, it checks if these changes are too large compared to the previous strategy. If increasing the likelihood of jumping would change the policy too dramatically, PPO clips the update. So, instead of a huge jump, it might increase the probability by only 15% or 20% in this update cycle. Similarly, it limits how much it decreases the probability of moving right if that change is too drastic.
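The clipped objective at the heart of PPO can be written in a few lines. The PyTorch sketch below is illustrative only; a full implementation would also include value-function and entropy terms, and the tensor names are assumptions rather than a fixed interface.

```python
import torch

# Minimal sketch of PPO's clipped surrogate objective.
# `log_probs_new` / `log_probs_old`: log-probabilities of the taken actions under the
# updated and previous policies; `advantages`: estimates from the advantage function.

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)   # new / old action probability
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the more pessimistic term so large policy shifts are not rewarded.
    return -torch.min(unclipped, clipped).mean()
```

With `clip_eps=0.2`, an action's probability ratio is only credited within roughly 20% of the old policy, which is the "proximal" constraint described above.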

Group relative policy optimization

This is a newer training method, popularized by DeepSeek’s R1 model and designed for tasks that require strong reasoning, like solving math problems or writing code. Unlike earlier methods that rely on a separate value model to estimate how good each response is, group relative policy optimization (GRPO) compares a group of responses to one another.

For each prompt, the model generates multiple responses. Instead of judging each one on its own, GRPO looks at how each response compares to the others in the same group. The reward for each response is adjusted based on the average reward of the group. This helps the model learn which responses are better relative to others, making training more stable and efficient.

Suppose the model is asked to solve a math problem and produces five different answers. GRPO scores each answer, calculates the average score and then adjusts each individual score by comparing it to that average. If one answer is much better than the others, the model learns to prefer that kind of reasoning — even without knowing the exact value of each response. This approach reduces training complexity and speeds up learning in tasks where reasoning matters most.
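A minimal sketch of the group-relative advantage computation might look like this in PyTorch. Normalizing by the group's mean and standard deviation is one common formulation; the reward values in the example are made up for illustration.

```python
import torch

# Minimal sketch of GRPO's group-relative advantage: several responses to the same
# prompt are scored, then each score is compared against the group statistics.

def group_relative_advantages(rewards, eps=1e-8):
    # rewards: 1-D tensor with one reward per sampled response for a single prompt
    mean, std = rewards.mean(), rewards.std()
    # Responses better than the group average get positive advantages, worse ones
    # negative advantages; no separate value network is required.
    return (rewards - mean) / (std + eps)

# Example: five answers to the same math problem, scored by some reward function.
advantages = group_relative_advantages(torch.tensor([0.2, 0.9, 0.1, 0.4, 0.3]))
```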

Decoupled clip and dynamic sampling policy optimization

Building on GRPO, decoupled clip and dynamic sampling policy optimization (DAPO) is a new method that introduces two key techniques to enhance learning stability and exploration.

Decoupled clipping: DAPO addresses large policy updates that destabilize learning by applying separate limits (clipping) to changes in the model's predicted values and changes in its action policy, allowing each to update at a suitable pace independently.

Dynamic sampling: Instead of using a fixed method to generate or select training examples, DAPO changes the sampling strategy dynamically, introducing more variety in the responses the model sees and explores. This helps the model find better solutions and helps prevent it from getting stuck in repetitive patterns, leading to faster learning, more diverse outputs and better generalization.

Suppose the model is tasked with generating multiple steps to solve a difficult math problem. Decoupled clipping would ensure that a positive result (getting the correct final answer) doesn't cause both the model's belief about the value of each step and its method for generating steps to change too abruptly. Meanwhile, dynamic sampling would push the model to explore different reasoning paths or phrasing for the steps, ensuring it doesn't just repeat one successful pattern.

Each algorithm addresses a specific trade-off, such as variance versus bias, stability versus speed or safety versus exploration, making it better suited to particular target applications.

Why reinforcement learning works

Regardless of the algorithm or method used, RL has proven more effective than SFT when used as a fine-tuning technique in model development. This success is explained by three key hypotheses, highlighting where RL offers advantages over SFT.

  1. RL introduces preference diversity. SFT expects specific answers. For example, if the model is trained that “Spanish” is a correct answer to “What’s an example of a language?”, it might mark “Japanese” as wrong — even though it’s also a valid answer.
  2. RL introduces negative feedback, which is crucial for understanding and avoiding undesired model behavior. Supervised learning shows only good answers; however, to improve, models also need to see what not to do.
  3. In RL, the model actively tests its hypotheses and receives feedback, refining its internal understanding. In contrast, SFT is a passive process where the model learns solely from pre-labeled correct answers, limiting its ability to improve through experimentation.

Building effective AI training datasets

Two datasets drive the learning process in RLHF. The prompt dataset contains a wide range of real-world prompts — questions, commands, instructions — that the model responds to. The preference dataset captures human judgments on those responses.

Developing strong datasets is critical to the success of RLHF. High-quality data must be clean, consistent, relevant and diverse to ensure the model generalizes well and avoids bias. Clean data eliminates errors and inconsistencies, consistent formatting ensures uniformity, relevance aligns the dataset with the model’s intended tasks and diversity broadens the model’s exposure to different scenarios, improving its robustness. Ensuring high inter-annotator agreement is equally important: human evaluators must apply clear, consistent criteria when distinguishing between responses to minimize ambiguity in the preference dataset.
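For illustration, the hypothetical records below show what one entry in each dataset might look like; the field names are invented for this example and are not a required schema.

```python
# Hypothetical example records; field names are illustrative, not a prescribed format.

prompt_record = {
    "prompt": "Explain the difference between supervised fine-tuning and RLHF.",
}

preference_record = {
    "prompt": "Explain the difference between supervised fine-tuning and RLHF.",
    "chosen": "SFT trains on curated examples of good responses, while RLHF ...",
    "rejected": "They are the same thing.",
    "annotator_id": "expert_042",  # useful for tracking inter-annotator agreement
}
```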

For reward model training, it is typical to begin with 5,000 to 10,000 labeled examples, scaling as needed to improve performance. During RLHF, a few hundred to a few thousand carefully selected examples are generally sufficient. Maintaining balanced class distributions helps prevent outcome bias and ensures that the model does not overfit to particular response patterns.

How TELUS Digital supports the future of post-training

At TELUS Digital, we help model developers build the next generation of AI with high-quality, human-validated training data for post-training. Our expertise in building diverse prompt datasets and carefully labeled preference data ensures your models learn from the best possible feedback. If you're looking to stay ahead as post-training practices continue to advance, connect with our experts to explore how we can help you build AI that adapts, improves and truly aligns with human needs.
