How Language Models Learn to Follow Instructions

By Samara Garcia

Large language models are initially trained to predict the next token, which doesn’t always guarantee that they’ll follow instructions or reflect human preferences. Instruction-following behavior emerged as a core requirement for real-world applications like coding assistants, chatbots, and research tools, prompting the rise of training methods based on human feedback. This article walks through the RLHF pipeline, newer direct alignment algorithms, practical challenges in data collection, and the skills engineers need to work in this area.

Key Takeaways

  • Modern large language models are first pretrained on internet-scale text, then refined through supervised fine-tuning and reinforcement learning from human feedback to reliably follow instructions.

  • Reinforcement learning from human feedback became a standard alignment technique around 2022 with systems like InstructGPT, combining supervised fine-tuning, reward model training, and deep reinforcement learning.

  • Human preferences are translated into numerical rewards, which guide the language model to generate outputs that are more helpful, honest, and safe than raw pretrained models.

  • Alternative approaches such as the direct alignment algorithms DPO and KTO (Kahneman-Tversky Optimization), along with online RLAIF, are now widely used to simplify training or reduce dependence on costly human data collection while maintaining strong alignment with human intent.

  • Working on RLHF and alignment requires skills across machine learning, reinforcement learning, large-scale data engineering, and practical software development in Python and modern deep learning frameworks.

How Pretrained Language Models Work Before Human Feedback

A base language model is trained with self-supervised learning on large text corpora to model token distributions rather than to follow explicit instructions. Pretraining corpora typically combine web text, books, and code, and more recent data mixes may also include synthetic data and reasoning traces, so the model absorbs statistical patterns from trillions of tokens. This pretraining phase is extremely resource-intensive. Frontier models of the GPT-5, Gemini 2, and Claude 4 era are generally believed to be trained on trillions to tens of trillions of tokens using distributed accelerator clusters, although exact figures are rarely disclosed, and training can require months of computation across thousands of accelerators.

Such pretrained language models often fail to follow instructions because they tend to mirror patterns from training data, may ignore user intent, and can produce unsafe content. The model has learned to predict likely continuations of text, not to answer questions truthfully or refuse harmful requests. This creates what researchers call misalignment, which means that base models are not optimized for human preferences, safety constraints, or conversational usefulness.
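To make the pretraining objective concrete, here is a minimal sketch of the next-token loss on a toy example. The three-token vocabulary, the probabilities, and the function name `next_token_loss` are illustrative assumptions, not taken from any real model.

```python
import math

def next_token_loss(probs_per_step, target_ids):
    """Average negative log-likelihood of the observed next tokens.

    probs_per_step[i] is the model's predicted distribution at position i,
    target_ids[i] is the index of the token that actually came next.
    """
    total = 0.0
    for probs, target in zip(probs_per_step, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# Toy vocabulary of 3 tokens; each row is one predicted distribution.
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]
targets = [0, 1]
loss = next_token_loss(predicted, targets)
```

Nothing in this loss refers to instructions, truthfulness, or safety. It only rewards assigning high probability to whatever token followed in the training text, which is the gap that later training stages try to close.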

Companies and research groups in computer science typically keep the pretraining objective stable over time. Instead, they focus alignment work on later stages where human feedback directly shapes model behavior. This approach is economically practical because pretraining is so expensive that modifying it is prohibitive, while fine-tuning language models on smaller curated datasets is comparatively affordable.

Supervised Fine-Tuning: Teaching Language Models to Follow Instructions

Supervised fine-tuning is the process of training a pretrained language model on curated human-written prompt-response pairs, typically with explicit instructions and high-quality answers. This stage exposes the model to examples of desired model behavior across many task types.

Data collection for supervised fine-tuning works by hiring expert human annotators on platforms like Scale AI to write responses to thousands of instructions covering complex tasks like Q&A, summarization, and coding. Higher quality outputs are achieved through human feedback, which helps the model prioritize coherence, accuracy, and usefulness. Supervised fine-tuning is used to prime the model to generate responses in the expected format by exposing it to human-written examples before reinforcement learning is applied.

Supervised fine-tuning updates the model directly on human-produced outputs without a separate reward model or reinforcement learning loop. In that respect it resembles the direct alignment algorithms discussed later, although that term is usually reserved for preference-based methods such as DPO. The supervised model learns by imitation, absorbing patterns from the demonstration dataset. This improves instruction following compared to the raw base model, making the system respond in more structured, user-friendly formats.
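As a rough illustration of imitation on prompt-response pairs, the sketch below computes a supervised loss only over the response tokens. Masking out the prompt tokens is a common convention but is an assumption here, as are the function name `sft_loss` and the toy log-probabilities.

```python
import math

def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs[i] is the log-probability the model assigns to token i
    of the concatenated "prompt + response" sequence; is_response_token[i]
    marks which tokens belong to the human-written response.
    """
    selected = [-lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not selected:
        raise ValueError("example contains no response tokens")
    return sum(selected) / len(selected)

# Two prompt tokens followed by two response tokens (toy values).
log_probs = [math.log(0.5), math.log(0.5), math.log(0.9), math.log(0.6)]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)
```

The objective is the same next-token loss used in pretraining. What changes is the data: the targets are curated demonstrations rather than arbitrary text.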

Why Supervised Fine-Tuning Is Necessary but Insufficient

Supervised fine-tuning provides a strong baseline because the model directly imitates high-quality human-written responses. However, it is expensive to scale and cannot cover every instruction or edge case a user might ask. For this reason, it usually serves as only the first step of a three-step process: supervised fine-tuning, reward model training, and reinforcement learning optimization.

Key limitations of supervised fine-tuning alone:

  • It does not allow fine-grained ranking of “better” versus “okay” responses, since it usually relies on a single target answer rather than nuanced human preferences over multiple responses

  • The supervised model can still hallucinate, produce toxic output, or ignore user constraints because the training objective does not explicitly penalize those behaviors

  • It cannot communicate complex goals about what makes one answer better than another

Work after 2021, such as InstructGPT and later ChatGPT-style systems, layered reinforcement learning from human feedback on top of supervised fine-tuning to refine these early gains and address the limitations.

What is Reinforcement Learning from Human Feedback?

Reinforcement learning from human feedback is a method where humans compare model outputs to create human preference data, a reward model is trained on those preferences, and deep reinforcement learning then optimizes the language model to maximize the learned reward. 

The RLHF process typically involves three main steps: supervised fine-tuning, reward model training, and reinforcement learning optimization using techniques like Proximal Policy Optimization. RLHF enhances AI models by allowing them to learn from human preferences, which helps align their outputs with user expectations and improve overall performance.

| Stage | Training Signal | Typical Behavior |
| --- | --- | --- |
| Base Model | Next token prediction on web text | Verbose, may ignore instructions, can produce harmful outputs |
| SFT Model | Human demonstrations of good responses | Follows instructions better, structured format, still hallucinates |
| RLHF Model | Learned reward function from human comparisons | Strongly instruction-following, refuses unsafe requests, more truthful |

Reward Model Training: Turning Human Preferences into AI Feedback

Data collection for reward models typically works as follows: humans are shown a prompt and two or more candidate completions from the current model, then asked to pick or rank the best responses according to written guidelines. These rankings can then be converted into a score for each output, for example with the Elo rating system.
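The Elo scoring mentioned above can be sketched as follows. This is the standard Elo update applied to pairwise wins between candidate completions. The starting rating of 1000, the K-factor of 32, and the labels "A", "B", "C" are illustrative assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Three candidate completions, all starting from the same rating.
ratings = {"A": 1000.0, "B": 1000.0, "C": 1000.0}
# Each tuple is (winner, loser) as judged by a human annotator.
comparisons = [("A", "B"), ("A", "C"), ("B", "C")]
for winner, loser in comparisons:
    ratings[winner], ratings[loser] = elo_update(
        ratings[winner], ratings[loser], a_wins=True
    )
```

After these three comparisons the ratings order the completions as A, then B, then C, and the total rating is conserved because each update is zero-sum.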

These pairwise comparisons train a separate reward model, a neural network that outputs a scalar score reflecting how well a completion matches human preferences for helpfulness, honesty, and harmlessness. The reward model in fine-tuning language models translates human preferences into a numerical reward signal, which is crucial for guiding the model’s learning process.
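A common way to train on pairwise comparisons is a Bradley-Terry style loss: the negative log-sigmoid of the difference between the scores of the preferred and rejected completions. The sketch below assumes the reward model has already produced two scalar scores. The function names are illustrative.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Loss is small when the chosen completion scores well above the rejected one."""
    return -log_sigmoid(reward_chosen - reward_rejected)

undecided = pairwise_reward_loss(0.0, 0.0)   # model has no preference yet
correct = pairwise_reward_loss(2.0, 0.0)     # model agrees with the human
wrong = pairwise_reward_loss(0.0, 2.0)       # model disagrees with the human
```

Only the difference between the two scores matters, so the absolute scale of the reward is not pinned down by this loss.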

Pairwise comparisons allow effective learning from a relatively small amount of data, because each comparison only asks which of two outputs is better rather than requiring an absolute score. The InstructGPT reward model, for example, was trained on rankings collected for roughly 33,000 prompts. Once trained, the reward model can rapidly evaluate thousands of candidate outputs without human input for every sample, enabling scalable alignment.
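One reason comparison data goes a long way is that a single ranking of K completions implies K(K-1)/2 pairwise comparisons. The sketch below expands a best-first ranking into (preferred, rejected) pairs. It assumes annotators give a full ranking with no ties, and the function name is illustrative.

```python
from itertools import combinations

def ranking_to_pairs(ranked_completions):
    """Expand a best-first ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first element of each
    pair is always ranked above the second.
    """
    return list(combinations(ranked_completions, 2))

pairs = ranking_to_pairs(["best", "good", "okay", "worst"])
```

Here four ranked completions yield six training pairs for the reward model.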

Reinforcement learning from AI feedback extends this approach by using model-generated feedback based on constitutional principles, reducing the volume of human feedback data required.

Deep Reinforcement Learning with PPO: Optimizing the Policy Model

In the RLHF phase, the language model itself is treated as a policy that generates outputs, which are then scored by the reward model. The reward signal is combined with a KL penalty that keeps the updated policy close to the supervised fine-tuning model, preventing the model from diverging too far from its baseline behavior.
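The combination of reward and KL penalty can be sketched as a single scalar per sampled completion. The summed log-probability ratio below is a single-sample estimate of the KL divergence between the policy and the reference model. The coefficient `beta=0.1` and the function name are illustrative assumptions.

```python
def kl_penalized_reward(reward_model_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward model score minus beta times the policy/reference log-ratio.

    policy_logprobs and reference_logprobs hold the per-token log-probabilities
    that each model assigns to the same sampled completion.
    """
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - beta * log_ratio

# The policy assigns higher probability to this completion than the SFT model,
# so part of the reward is taken back by the penalty.
shaped = kl_penalized_reward(1.0, [-1.0, -1.0], [-2.0, -2.0], beta=0.1)
```

When the policy and reference agree on every token, the penalty is zero and the shaped reward equals the reward model score.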

Proximal Policy Optimization is a commonly used algorithm in the final step of fine-tuning language models, optimizing the model’s policy based on the reward model’s feedback. PPO was introduced in 2017 as a simpler and more stable alternative to earlier policy gradient methods such as TRPO. It updates the policy by gradient ascent on a clipped objective, avoiding destabilizing jumps in behavior.
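The clipped objective can be illustrated for a single action (for a language model, a single token). This is a minimal sketch of the per-sample term from the PPO paper. The clip range of 0.2 is a typical default and the function name is illustrative.

```python
import math

def ppo_clipped_term(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# The new policy makes a good action 1.5x more likely; the gain is capped at 1.2.
capped_gain = ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0)
# The new policy makes a bad action half as likely; the credit is capped at 0.8.
capped_credit = ppo_clipped_term(math.log(0.5), 0.0, advantage=-1.0)
```

Because the objective stops improving once the probability ratio leaves the clip range in the favorable direction, a single update has little incentive to move the policy far from the one that generated the samples.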

The PPO training loop for language models:

  1. Sample prompts from a dataset chosen to reflect the prompts the model will see in production

  2. Generate outputs autoregressively

  3. Compute rewards using the reward model plus the KL term

  4. Estimate advantages and compute policy gradients

  5. Update model parameters to increase expected reward
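Step 4 of the loop above can be sketched with Generalized Advantage Estimation (GAE), which many PPO implementations use. Whether a particular RLHF system uses GAE is an assumption here, as are the default `gamma` and `lam` values and the toy reward layout, in which the reward arrives only on the final token.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one generated sequence.

    rewards has one entry per generated token. values has len(rewards) + 1
    entries: a value estimate for each step plus a bootstrap value for the
    state after the final token (0.0 when the sequence has ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at position t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Reward only at the end of the completion; value estimates are all zero.
adv = gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=0.5)
```

With `lam=0.5` the final reward is credited to earlier tokens with geometrically decreasing weight. Setting `lam=1.0` would credit every token equally, and `lam=0.0` would credit only the last one.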

Practical implementations often mix in the original language modeling loss (PPO-ptx style) to prevent catastrophic forgetting of general knowledge while still aligning with human feedback. Deep reinforcement learning in this context is computationally expensive but produces instruction-following behavior that significantly outperforms both the base and SFT models on human evaluations. More broadly, learning from human feedback has also been reported to improve the robustness and exploration of reinforcement learning agents in other domains.
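Written out, the PPO-ptx style objective combines the three ingredients described in this section: the reward model score, the KL penalty toward the SFT model, and a pretraining log-likelihood term. The notation below follows the InstructGPT paper and is not defined elsewhere in this article: \pi_\phi^{RL} is the policy being trained, \pi^{SFT} is the supervised fine-tuned reference model, r_\theta is the reward model, \beta is the KL coefficient, and \gamma weights the pretraining term.

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x,y) \sim D_{\pi_\phi^{RL}}}
  \left[ r_\theta(x, y)
         - \beta \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}}
  \left[ \log \pi_\phi^{RL}(x) \right]
```

Setting \gamma = 0 recovers plain PPO with a KL penalty. A positive \gamma trades some reward for retaining pretraining behavior.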

Comparing Alignment Methods: RLHF, Direct Preference Optimization, and RLAIF

RLHF is no longer the only major approach for aligning language models with human preferences. Newer direct alignment methods have gained traction because they reduce complexity and computing costs.

Direct Preference Optimization (DPO) simplifies alignment by removing the separate reward model used in RLHF. Instead, it trains directly on preferred versus rejected responses using a supervised learning-style objective, making it easier and cheaper to implement.
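The DPO objective for one preference pair can be sketched as below. It takes the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The value `beta=0.1` and the function names are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair of responses."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

# At initialization the policy equals the reference, so the loss is log(2).
initial = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After training, the policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss has the same log-sigmoid form as a pairwise reward model loss, but the "reward" is implicit in the log-probability ratio between the policy and the reference, which is why no separate reward model is trained.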

Other approaches, like reinforcement learning from AI feedback, reduce the amount of human labeling required by using AI-generated feedback guided by predefined rules or principles.

| Method | Requires Reward Model | Uses Deep RL | Primary Data Source | Typical Use |
| --- | --- | --- | --- | --- |
| RLHF | Yes | Yes (PPO) | Human pairwise comparisons | Core alignment at major labs |
| DPO | No | No | Human preference data | Efficient fine-tuning, smaller teams |
| RLAIF | Sometimes | Varies | AI-generated feedback | Scaling alignment, domain expansion |

Many industrial systems now combine approaches. A company might use RLHF for core alignment on human values like harmlessness and truthfulness, then use DPO-style refinement for specific task instructions. Practical choices depend on compute budgets, data collection capacity, and risk tolerance for aligning language models.

Real-World Impact: From InstructGPT to Modern Chat Systems

InstructGPT, documented in a 2022 paper, was a family of GPT-3-based models fine-tuned with RLHF that significantly outperformed the original GPT-3 on human preference evaluations. Outputs from the 1.3B parameter InstructGPT model were preferred by human labelers over those of the 175B parameter GPT-3, even though it has more than 100x fewer parameters. RLHF can therefore be more parameter-efficient: a smaller model trained with human feedback can outperform a much larger one trained without it.

This approach led directly to the launch of ChatGPT in late 2022, followed by steadily more aligned models using refined RLHF and direct alignment algorithms for better safety and instruction following. The RLHF approach significantly enhances the ability of AI models to generate helpful, truthful, and non-harmful responses, improving alignment with human intent.

Specific improvements observed in production systems:

  • Stricter refusal behavior on unsafe requests

  • More consistent formatting in model outputs

  • Better adherence to user instructions in code generation and summarization

  • Reduced toxic output generation

RLHF can also improve user satisfaction by guiding models toward more engaging and contextually appropriate responses. Other organizations across the artificial intelligence ecosystem have adopted similar pipelines, adjusting the balance between supervised fine-tuning, reward modeling, and deep reinforcement learning to suit the goals of their own deployment contexts.

Challenges, Limitations, and Open Problems in RLHF

RLHF is powerful but far from a complete solution to AI alignment. Research has highlighted several limitations, including reward hacking, where models exploit weaknesses in the reward system instead of genuinely improving behavior. This can lead to issues like fabricated citations, overly cautious responses, or misleading outputs optimized for approval rather than accuracy.

RLHF also depends heavily on the quality and consistency of human feedback, which can introduce bias or poor generalization. Other challenges include encoding complex human values into simple rewards, detecting hallucinations in real-world settings, and avoiding over-optimization that reduces response diversity.

Newer research areas include scalable oversight, constitutional AI, interpretability-assisted alignment, and hybrid human-plus-AI feedback systems designed to improve alignment at larger scales.

Skills Engineers Need to Work on RLHF and Alignment

RLHF research and implementation sit at the intersection of machine learning, deep reinforcement learning, large-scale systems, and practical product engineering. The field requires engineers who can successfully train complex systems and understand both theoretical foundations and practical implementation.

Core technical skills include:

  • Strong grounding in probability and statistics

  • Deep learning expertise (transformers, optimization, regularization)

  • Hands-on experience with PyTorch or JAX for training large neural networks, and ideally with Triton for writing custom GPU kernels

  • Familiarity with ranking and preference data, and with generating and scoring model outputs at scale

Reinforcement learning expertise is particularly relevant, including understanding policy gradient methods, PPO, value estimation, and exploration versus exploitation tradeoffs. While RLHF uses relatively structured environments compared to robotics, the training method draws from the same theoretical foundations.

Data engineering skills related to feedback pipelines are essential:

  • Building annotation tools for incorporating human feedback

  • Managing large preference datasets

  • Implementing quality checks and inter-annotator agreement metrics

  • Working with distributed training infrastructure
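As one example of the quality checks listed above, inter-annotator agreement on pairwise preferences can be measured with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is a minimal sketch for two annotators. The labels "A" and "B" (which completion was preferred) are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items (0.75), but half of that agreement is expected by chance, so kappa is 0.5. Low kappa on a batch is a signal to review the guidelines or the annotators before the data reaches reward model training.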

Summary

Large language models are first pretrained on massive text datasets to predict the next token, but this alone does not make them reliable at following instructions or aligning with human preferences. To improve usefulness and safety, companies use supervised fine-tuning and reinforcement learning from human feedback (RLHF), where human preferences guide models toward more helpful, honest, and structured responses.

RLHF became a major alignment technique after the 2022 InstructGPT research, combining supervised training, reward models, and reinforcement learning methods like PPO. Human feedback is converted into numerical rewards that help models improve instruction following, reduce harmful outputs, and better match user intent. Newer approaches such as DPO and RLAIF simplify this pipeline: DPO removes the separate reward model and reinforcement learning loop, while RLAIF reduces the need for expensive human labeling.

Modern alignment research focuses on improving safety, reducing hallucinations, and scaling feedback systems more efficiently. Engineers working in RLHF and alignment need strong skills in machine learning, reinforcement learning, large-scale data systems, Python, and deep learning frameworks like PyTorch or JAX, making the field one of the most technically demanding areas in modern AI development.

FAQ

What is reinforcement learning from human feedback (RLHF), and how does it work?

How does RLHF train language models to follow instructions?

What is the difference between RLHF, supervised fine-tuning, and DPO?

What are the biggest challenges and limitations of RLHF?

What skills do engineers need to work on RLHF and alignment at AI companies?