
AI Alignment & Safety: Making LLMs Behave

Kishore Gunnam

Developer & Writer

A raw language model is powerful but uncontrolled. It might generate harmful content, provide dangerous instructions, or confidently state falsehoods.

Alignment makes models actually helpful and actually safe.

The beginner-friendly way to think about alignment is: we’re shaping default behavior. Raw models try to produce likely text. Aligned models try to be helpful, safe, and honest by default.


The Problem

  • Problem 1 - Harmful content: raw models can generate offensive or dangerous text.
  • Problem 2 - Misinformation: confident wrong answers (hallucinations).
  • Problem 3 - Misuse: instructions for weapons, hacking, etc.
  • The solution - Alignment: train models to be helpful AND safe.

As covered in Part 5, models are trained on internet text - which contains everything good and bad.


RLHF: Learning From Humans

The breakthrough technique used in ChatGPT:

RLHF: The Three-Step Process

  1. Supervised fine-tuning: train on examples of ideal responses.
  2. Reward model training: humans rank responses, and a reward model (RM) learns to score answers the way a human rater would.
  3. RL optimization: use PPO to maximize the learned reward. PPO is an optimization method that nudges the model toward higher reward scores without changing it too drastically.

The result: a model that produces human-preferred outputs.

A tiny RLHF intuition (toy example)

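To make step 2 concrete, here is a minimal sketch (in PyTorch) of the pairwise loss a reward model is typically trained with: it nudges the score of the response humans chose above the score of the one they rejected. The two scores below are made-up numbers standing in for a real reward model's outputs.

```python
import torch
import torch.nn.functional as F

# Toy setup: pretend these are reward-model scores for one preference pair.
# In real RLHF the reward model is a neural network that reads the prompt and
# response; here we use two learnable scalars just to show the loss.
score_chosen = torch.tensor(0.2, requires_grad=True)    # response the rater preferred
score_rejected = torch.tensor(0.5, requires_grad=True)  # response the rater rejected

# Pairwise (Bradley-Terry) loss: push the chosen score above the rejected one.
loss = -F.logsigmoid(score_chosen - score_rejected)
loss.backward()

print(f"loss = {loss.item():.3f}")
print(f"gradient on chosen score:   {score_chosen.grad.item():+.3f}")   # negative -> raise it
print(f"gradient on rejected score: {score_rejected.grad.item():+.3f}") # positive -> lower it
```

Train this over thousands of human-ranked pairs and you get a reward model that scores answers roughly the way the raters would; step 3 then uses those scores to steer the language model.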


Challenges with RLHF

RLHF has clear strengths and weaknesses:

  • Helpfulness: much improved, but the model can be sycophantic.
  • Safety: refuses harmful requests, but is sometimes over-cautious.
  • Data: uses human preferences, which are expensive to collect.

One issue is reward hacking: the model learns to game the reward signal rather than genuinely help. For example, if human raters tend to prefer long, confident-sounding answers, the model may learn to pad its responses instead of actually improving them.


Constitutional AI: Anthropic's Approach

Claude uses a different method:

Constitutional AI:

  1. Write principles: Define a "constitution" of behaviors
  2. Self-critique: Model generates, then critiques itself
  3. Revision: Model revises based on its own critique
  4. RLAIF: Train on AI feedback, not just human feedback

Instead of training on "what humans prefer," train on "what follows our principles."
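To make the loop concrete, here is a rough sketch of how critique-and-revision could be wired up. The `generate` helper is a hypothetical stand-in for whatever model call you use, and the two-principle constitution is invented for illustration - this is not Anthropic's actual implementation.

```python
# Sketch of the Constitutional AI critique-and-revise loop (illustrative only).

CONSTITUTION = [
    "Choose the response that is most helpful while avoiding harmful content.",
    "Choose the response that is honest about what it doesn't know.",
]

def generate(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to your language model.
    return f"(model output for: {prompt[:40]}...)"

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        # Self-critique: ask the model whether the draft violates the principle.
        critique = generate(
            f"Principle: {principle}\nResponse: {draft}\n"
            "Point out any way this response violates the principle."
        )
        # Revision: ask the model to rewrite the draft using its own critique.
        draft = generate(
            f"Response: {draft}\nCritique: {critique}\n"
            "Rewrite the response so it follows the principle."
        )
    return draft  # revised outputs become training data (the RLAIF step)

print(constitutional_revision("How do I stay safe online?"))
```

The key move is that the model's own critiques and revisions, guided by written principles, become the training signal - hence "AI feedback" rather than purely human feedback.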


DPO: Simpler Alternative

Direct Preference Optimization skips the reward model:

RLHF vs DPO:

RLHF has more moving parts: collect preferences → train a reward model → run an RL step (like PPO). Flexible, but complex.

DPO has fewer moving parts: it trains directly on preference pairs, pushing the model toward the preferred output in each pair. The trade-off: a simpler pipeline, but it still needs good preference data and careful evaluation.
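For the curious, this is roughly what the DPO objective looks like in code. It's a sketch that assumes you already have summed token log-probabilities from the model being trained and from a frozen reference copy; the variable names are mine, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities,
    one value per (prompt, response) pair in the batch.
    """
    # How much more the trained model likes each response than the reference does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # Push the chosen response's ratio above the rejected one's;
    # beta controls how hard we push away from the reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Tiny fake batch, just to show the call shape.
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-12.0, -9.5]),
    policy_rejected_logp=torch.tensor([-11.0, -10.0]),
    ref_chosen_logp=torch.tensor([-12.5, -9.8]),
    ref_rejected_logp=torch.tensor([-10.5, -9.9]),
)
print(loss)  # a single scalar - no reward model or RL loop needed
```

Notice there is no reward model and no PPO step: the preference data is used to shape the model's probabilities directly.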


What Models Refuse

Aligned models refuse certain requests:

  • Instructions for weapons or violence
  • Content sexualizing minors
  • Detailed hacking instructions
  • Medical/legal advice presented as professional

Different companies draw the lines differently: Claude tends to be more cautious, while some open models have fewer restrictions.


The Alignment Tax

Alignment brings trade-offs:

  • Safety: refuses harmful requests, but may also refuse valid ones.
  • Helpfulness: adds caveats, which can feel overly cautious.
  • Capability: the underlying model is the same, but it can be perceived as less capable.

Finding the right balance is ongoing work.


Key Takeaways

  • RLHF trains models to maximize human preferences.
  • Constitutional AI trains on principles, not just preferences.
  • DPO simplifies the process.
  • Alignment tax: safety constraints have trade-offs.
  • Jailbreaking is an ongoing arms race.


Common beginner mistakes

  • Thinking alignment is “extra safety rules bolted on.” In practice, it changes the model’s default behavior across many situations.
  • Assuming refusals mean the model is dumb. Often it’s a policy/safety boundary or uncertainty management choice.
  • Assuming RLHF guarantees truth. RLHF helps behavior, but it doesn’t turn the model into a fact database.

What's Next?

In Part 7, we get practical. Prompting, RAG, function calling - everything you need to build AI applications.