AI Alignment & Safety: Making LLMs Behave
Kishore Gunnam
Developer & Writer
Complete Guide to LLMs · Part 6 of 8
A raw language model is powerful but uncontrolled. It might generate harmful content, provide dangerous instructions, or confidently state falsehoods.
Alignment makes models actually helpful and actually safe.
The beginner-friendly way to think about alignment is: we’re shaping default behavior. Raw models try to produce likely text. Aligned models try to be helpful, safe, and honest by default.
The Problem
- Harmful content: raw models can generate offensive or dangerous text
- Misinformation: confident wrong answers (hallucinations)
- Misuse: instructions for weapons, hacking, etc.
- Alignment: train models to be helpful AND safe
As covered in Part 5, models are trained on internet text - which contains everything good and bad.
RLHF: Learning From Humans
The breakthrough technique used in ChatGPT:
RLHF: The Three-Step Process
- Supervised Fine-Tuning: Train on examples of ideal responses
- Reward Model Training: Humans rank responses; model learns preferences. A reward model (RM) scores answers the way a human rater would.
- RL Optimization: Use PPO to maximize learned reward. PPO is an optimization method that nudges the model toward higher reward scores without changing it too drastically (see the sketch after this list).
- Result: Model produces human-preferred outputs
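To make the "without changing it too drastically" part concrete, here is a minimal sketch of PPO's clipped objective in PyTorch. It is not from any production RLHF codebase; it assumes we already have log-probabilities under the new and old policies and an advantage value derived from the reward model's score, and the function and variable names are illustrative.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Toy PPO clipped objective: chase higher reward, but cap how far the
    updated policy can drift from the old one in a single step."""
    ratio = torch.exp(logp_new - logp_old)   # how much the policy changed on these tokens
    unclipped = ratio * advantage            # plain policy-gradient term
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Taking the minimum means drastic policy changes earn no extra credit
    return -torch.min(unclipped, clipped).mean()

# Toy usage: the advantage would come from the reward model's score minus a baseline
logp_new = torch.tensor([-1.0, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.2, -1.8])
advantage = torch.tensor([0.5, -0.3])
print(ppo_clipped_loss(logp_new, logp_old, advantage))
```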
A tiny RLHF intuition (toy example)
Suppose the model writes two answers to the same question: one clear and accurate, one vague. A human rater picks the clear one. Repeat this over thousands of pairs and the reward model learns to score answers the way the rater would; RL training then nudges the model toward answers like the preferred ones.
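In code, the heart of reward model training is just a comparison loss: the preferred answer should score higher than the rejected one. Here is a minimal PyTorch sketch with illustrative names and toy scores, not an actual training pipeline:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    """Preference (Bradley-Terry style) loss: push the preferred answer's
    scalar score above the rejected answer's score."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy scores from a hypothetical reward model for two ranked pairs
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.8, 0.9])
print(reward_model_loss(chosen, rejected))  # loss shrinks as the score gap widens
```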
Challenges with RLHF
One issue: reward hacking. The model learns to game the reward signal rather than genuinely helping. For example, if raters tend to prefer long, confident-sounding answers, the model may learn to pad and over-assert instead of being accurate.
Constitutional AI: Anthropic's Approach
Claude uses a different method:
Constitutional AI:
- Write principles: Define a "constitution" of behaviors
- Self-critique: Model generates, then critiques itself
- Revision: Model revises based on its own critique
- RLAIF: Train on AI feedback, not just human feedback
Instead of training on "what humans prefer," train on "what follows our principles."
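Here is a minimal sketch of that critique-and-revision loop. The `generate` helper and the principle text are hypothetical placeholders for illustration, not Anthropic's actual prompts or implementation.

```python
PRINCIPLE = (
    "Choose the response that is most helpful while avoiding harmful, "
    "deceptive, or dangerous content."
)

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to any language model API."""
    return f"<model output for: {prompt[:40]}...>"

def constitutional_revision(user_request: str) -> str:
    # 1. Draft an initial answer
    draft = generate(user_request)
    # 2. Ask the model to critique its own draft against the principle
    critique = generate(
        f"Principle: {PRINCIPLE}\n"
        f"Response: {draft}\n"
        "List any ways this response violates the principle."
    )
    # 3. Ask the model to revise the draft using its own critique
    revision = generate(
        f"Original response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the response so it fully follows the principle."
    )
    # Revised outputs (and AI-generated preferences over them) become training data: RLAIF
    return revision

print(constitutional_revision("Explain how vaccines work."))
```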
DPO: Simpler Alternative
Direct Preference Optimization skips the reward model:
RLHF vs DPO:
RLHF has more moving parts: collect preferences → train a reward model → run an RL step (like PPO). Flexible, but complex.
DPO has fewer moving parts: train directly on preference pairs, using a frozen copy of the original model as a reference, to push the model toward preferred outputs. Simpler pipeline, but it still needs good preference data and careful evaluation.
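For intuition, here is a minimal sketch of the DPO loss in PyTorch. It assumes we have already summed the log-probabilities of the chosen and rejected responses under both the model being trained and the frozen reference model; the names and toy numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: favor the chosen response relative to a frozen reference model,
    with no separate reward model and no RL loop."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen        # how much we upweight the chosen answer
    rejected_margin = policy_logp_rejected - ref_logp_rejected  # how much we upweight the rejected one
    # Widen the gap between the two margins; beta controls how aggressively
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy log-probabilities for one preference pair
print(dpo_loss(torch.tensor([-5.0]), torch.tensor([-6.0]),
               torch.tensor([-5.5]), torch.tensor([-5.8])))
```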
What Models Refuse
Aligned models refuse certain requests:
- Instructions for weapons or violence
- Content sexualizing minors
- Detailed hacking instructions
- Medical or legal advice presented as if it came from a licensed professional
Different companies draw the lines differently: Claude tends to be more cautious, while some open-weight models have fewer restrictions.
The Alignment Tax
Safety training isn't free: an over-tuned model may refuse borderline but legitimate requests or hedge more than it needs to. That cost in helpfulness is the alignment tax, and finding the right balance is ongoing work.
Key Takeaways
RLHF trains models to produce the outputs human raters prefer.
Constitutional AI trains on principles, not just preferences.
DPO simplifies the process.
Alignment tax - safety constraints have trade-offs.
Jailbreaking is an ongoing arms race.
Common beginner mistakes
- Thinking alignment is “extra safety rules bolted on.” In practice, it changes the model’s default behavior across many situations.
- Assuming refusals mean the model is dumb. Often it’s a policy/safety boundary or uncertainty management choice.
- Assuming RLHF guarantees truth. RLHF helps behavior, but it doesn’t turn the model into a fact database.
What's Next?
In Part 7, we get practical. Prompting, RAG, function calling - everything you need to build AI applications.
Complete Guide to LLMs · Part 6 of 8