
RLHF (Reinforcement Learning from Human Feedback)
After a model is trained, it might be technically correct but not always helpful, safe, or polite. RLHF helps fix that. Here's how it works: humans first review and rank different responses the model gives to the same prompt, marking which ones are more helpful or appropriate. Those rankings are used to train a reward model that predicts how a human would score a response, and the original model is then fine-tuned with reinforcement learning to favor responses the reward model scores highly, which means favoring the kinds of answers people actually preferred. A rough code sketch of these two stages follows below.
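The sketch below is a minimal, simplified illustration of the two RLHF stages, not a production recipe: the `RewardModel` class, the toy response embeddings, and the training setup are all assumptions made for brevity. Real systems use a full language model as the reward model and an RL algorithm such as PPO for the second stage.

```python
# Minimal RLHF sketch (illustrative only).
import torch
import torch.nn as nn

# --- Stage 1: learn a reward model from human rankings ---------------------
# Each training example is a pair of responses to the same prompt, where
# humans judged `chosen` to be better than `rejected`.

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size response embedding to a scalar score."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder embeddings standing in for encoded (prompt, response) pairs.
chosen = torch.randn(32, 128)    # responses humans preferred
rejected = torch.randn(32, 128)  # responses humans ranked lower

# Pairwise preference loss: push the preferred response's score above
# the rejected one's.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()

# --- Stage 2: reinforcement learning against the reward model --------------
# In practice, the language model generates responses, the reward model scores
# them, and an RL algorithm (commonly PPO) updates the language model to raise
# that score while a KL penalty keeps it close to its original behavior.
```

The key design point the sketch illustrates is that human preferences are never fed to the model directly; they are distilled into a reward model, which then acts as an automated judge during reinforcement learning.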
This is one of the key techniques that made models like ChatGPT more conversational, safer, and more useful in real-world settings.