
AI Alignment & Safety: Making LLMs Behave

Kishore Gunnam

Developer & Writer

A raw language model is powerful but uncontrolled. It might generate harmful content, provide dangerous instructions, or confidently state falsehoods.

Alignment makes models actually helpful and actually safe.

The beginner-friendly way to think about alignment is: we’re shaping default behavior. Raw models try to produce likely text. Aligned models try to be helpful, safe, and honest by default.


The Problem

  • Problem 1 - Harmful content: raw models can generate offensive or dangerous text.
  • Problem 2 - Misinformation: confident wrong answers (hallucinations).
  • Problem 3 - Misuse: instructions for weapons, hacking, etc.
  • The solution - Alignment: train models to be helpful AND safe.

As covered in Part 5, models are trained on internet text - which contains everything good and bad.


RLHF: Learning From Humans

The breakthrough technique used in ChatGPT:

RLHF: The Three-Step Process

  1. Supervised fine-tuning: train on examples of ideal responses.
  2. Reward model training: humans rank responses, and a reward model (RM) learns to score answers the way a human rater would.
  3. RL optimization: use PPO to maximize the learned reward. PPO is an optimization method that nudges the model toward higher reward scores without changing it too drastically.

The result: a model that produces human-preferred outputs.

A tiny RLHF intuition (toy example)

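To make step 2 concrete, here is a minimal sketch (in PyTorch) of the pairwise loss a reward model is typically trained with: it nudges the score of the response humans chose above the score of the one they rejected. The two scores below are made-up numbers standing in for a real reward model's outputs.

```python
import torch
import torch.nn.functional as F

# Toy setup: pretend these are reward-model scores for one preference pair.
# In real RLHF the reward model is a neural network that reads the prompt and
# response; here we use two learnable scalars just to show the loss.
score_chosen = torch.tensor(0.2, requires_grad=True)    # response the rater preferred
score_rejected = torch.tensor(0.5, requires_grad=True)  # response the rater rejected

# Pairwise (Bradley-Terry) loss: push the chosen score above the rejected one.
loss = -F.logsigmoid(score_chosen - score_rejected)
loss.backward()

print(f"loss = {loss.item():.3f}")
print(f"gradient on chosen score:   {score_chosen.grad.item():+.3f}")   # negative -> raise it
print(f"gradient on rejected score: {score_rejected.grad.item():+.3f}") # positive -> lower it
```

Train this over thousands of human-ranked pairs and you get a reward model that scores answers roughly the way the raters would; step 3 then uses those scores to steer the language model.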


Challenges with RLHF

RLHF has clear strengths and weaknesses:

  • Helpfulness: much improved, but the model can be sycophantic.
  • Safety: refuses harmful requests, but is sometimes over-cautious.
  • Data: uses human preferences, which are expensive to collect.

One issue is reward hacking: the model learns to game the reward signal rather than genuinely help. For example, if human raters tend to prefer long, confident-sounding answers, the model may learn to pad its responses instead of actually improving them.


Constitutional AI: Anthropic's Approach

Claude uses a different method:

Constitutional AI:

  1. Write principles: Define a "constitution" of behaviors
  2. Self-critique: Model generates, then critiques itself
  3. Revision: Model revises based on its own critique
  4. RLAIF: Train on AI feedback, not just human feedback

Instead of training on "what humans prefer," train on "what follows our principles."
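To make the loop concrete, here is a rough sketch of how critique-and-revision could be wired up. The `generate` helper is a hypothetical stand-in for whatever model call you use, and the two-principle constitution is invented for illustration - this is not Anthropic's actual implementation.

```python
# Sketch of the Constitutional AI critique-and-revise loop (illustrative only).

CONSTITUTION = [
    "Choose the response that is most helpful while avoiding harmful content.",
    "Choose the response that is honest about what it doesn't know.",
]

def generate(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to your language model.
    return f"(model output for: {prompt[:40]}...)"

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        # Self-critique: ask the model whether the draft violates the principle.
        critique = generate(
            f"Principle: {principle}\nResponse: {draft}\n"
            "Point out any way this response violates the principle."
        )
        # Revision: ask the model to rewrite the draft using its own critique.
        draft = generate(
            f"Response: {draft}\nCritique: {critique}\n"
            "Rewrite the response so it follows the principle."
        )
    return draft  # revised outputs become training data (the RLAIF step)

print(constitutional_revision("How do I stay safe online?"))
```

The key move is that the model's own critiques and revisions, guided by written principles, become the training signal - hence "AI feedback" rather than purely human feedback.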


DPO: Simpler Alternative

Direct Preference Optimization skips the reward model:

RLHF vs DPO:

RLHF has more moving parts: collect preferences → train a reward model → run an RL step (like PPO). Flexible, but complex.

DPO has fewer moving parts: it trains directly on preference pairs, pushing the model toward the preferred output in each pair. The trade-off: a simpler pipeline, but it still needs good preference data and careful evaluation.
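For the curious, this is roughly what the DPO objective looks like in code. It's a sketch that assumes you already have summed token log-probabilities from the model being trained and from a frozen reference copy; the variable names are mine, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities,
    one value per (prompt, response) pair in the batch.
    """
    # How much more the trained model likes each response than the reference does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # Push the chosen response's ratio above the rejected one's;
    # beta controls how hard we push away from the reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Tiny fake batch, just to show the call shape.
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-12.0, -9.5]),
    policy_rejected_logp=torch.tensor([-11.0, -10.0]),
    ref_chosen_logp=torch.tensor([-12.5, -9.8]),
    ref_rejected_logp=torch.tensor([-10.5, -9.9]),
)
print(loss)  # a single scalar - no reward model or RL loop needed
```

Notice there is no reward model and no PPO step: the preference data is used to shape the model's probabilities directly.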


What Models Refuse

Aligned models refuse certain requests:

  • Instructions for weapons or violence
  • Content sexualizing minors
  • Detailed hacking instructions
  • Medical/legal advice presented as professional

Different companies draw the lines differently: Claude tends to be more cautious, while some open models have fewer restrictions.


The Alignment Tax

Alignment brings trade-offs:

  • Safety: refuses harmful requests, but may also refuse valid ones.
  • Helpfulness: adds caveats, which can feel overly cautious.
  • Capability: the underlying model is the same, but it can be perceived as less capable.

Finding the right balance is ongoing work.


Key Takeaways

  • RLHF trains models to maximize human preferences.
  • Constitutional AI trains on principles, not just preferences.
  • DPO simplifies the process.
  • Alignment tax: safety constraints have trade-offs.
  • Jailbreaking is an ongoing arms race.


Common beginner mistakes

  • Thinking alignment is “extra safety rules bolted on.” In practice, it changes the model’s default behavior across many situations.
  • Assuming refusals mean the model is dumb. Often it’s a policy/safety boundary or uncertainty management choice.
  • Assuming RLHF guarantees truth. RLHF helps behavior, but it doesn’t turn the model into a fact database.

What's Next?

In Part 7, we get practical. Prompting, RAG, function calling - everything you need to build AI applications.