Building with LLMs: A Practical Guide for 2025

Kishore Gunnam

Developer & Writer

You understand how LLMs work. Now let's build with them.

The practical LLM app stack:

  1. UI / API: Your app receives a user request
  2. Prompt: Turn that request into a clear instruction + context
  3. Model: The LLM generates a draft answer or chooses a tool
  4. Tools (optional): Call databases/APIs/functions for real data or actions
  5. Return: Show a final, user-ready response (with sources when possible)
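
Here's a minimal sketch of that loop, assuming the official OpenAI Python client; handle_request and lookup_account are illustrative names, not a real API, and the optional tool step is covered later in this post.

```python
# Minimal sketch of the request -> prompt -> model -> return loop.
# Assumes the OpenAI Python client; handle_request() and lookup_account()
# are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def lookup_account(user_message: str) -> str:
    # Stand-in for a real database or CRM lookup (the optional tools step).
    return "Plan: Pro, last invoice: 2025-01-12"

def handle_request(user_message: str) -> str:
    # 2. Prompt: turn the raw request into instruction + context
    prompt = (
        "You are a support agent for Acme Billing.\n"
        f"Account context: {lookup_account(user_message)}\n"
        f"User message: {user_message}\n"
        "Reply in three short bullets."
    )
    # 3. Model: generate a draft answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # 5. Return: a user-ready string for your UI or API layer
    return response.choices[0].message.content
```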

Choosing Your Model

| Use Case | Priority | Recommended |
| --- | --- | --- |
| High volume, simple tasks | Fast & cheap | GPT-4o-mini, Claude Haiku |
| Complex reasoning | Best quality | GPT-4, Claude Opus |
| Document analysis | Long context | Gemini 2.5 Pro |
| Local deployment | Privacy | LLaMA 4 |

Compare all options in Part 4: Model Landscape.


Prompt Engineering

Effective Prompt Structure:

  1. Role: Define who the AI should be
  2. Context: Background information needed
  3. Task: Clearly state what you want
  4. Format: Specify output format
  5. Examples: 1-3 examples of desired output

A prompt template you can copy:

Role: You are a helpful support agent.

Context:
- Product: Acme Billing
- Plan: Pro
- User message: "I was charged twice"

Task:
Explain what might have happened and ask 2 clarifying questions.

Format:
- 3 bullet explanation
- 2 questions
- 1 next action

Beginner trick: be explicit about the output format. It reduces rambling and makes the response easier to use in a UI.

Prompts work best when they read like a spec you'd hand to a teammate. If the model keeps "going off track," the task is usually ambiguous or the output format isn't constrained.

Key Techniques

  • Few-shot: Include examples for new task formats
  • Chain of Thought: Use "think step by step" for complex reasoning
  • Structured Output: Request JSON for parsing programmatically
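
Structured output is worth a concrete example. A minimal sketch (call_llm is a placeholder for whichever client you use, returning a canned reply here): ask for JSON only, then parse defensively.

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your real client call.
    return '{"category": "billing", "urgent": true}'

prompt = """Classify the support ticket below.
Return ONLY valid JSON shaped like {"category": "...", "urgent": true/false}.

Ticket: "I was charged twice and my event is tomorrow."
"""

try:
    result = json.loads(call_llm(prompt))
except json.JSONDecodeError:
    result = {"category": "other", "urgent": False}  # safe fallback

print(result["category"], result["urgent"])  # billing True
```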

RAG: Retrieval-Augmented Generation

LLMs have knowledge cutoffs. They don't know your private data. RAG solves this.

Your docs stay in your system. At question time, fetch the most relevant snippets and paste them into the prompt. The model isn't "remembering your database"—it's reading pages you picked for it.

Key insight: RAG quality is mostly determined by the retrieved chunks. If the retrieved text is irrelevant, the model will still hallucinate—just more confidently.

How RAG Works:

  1. Index Documents: Split into chunks, generate embeddings
  2. User Asks: Receive the question
  3. Retrieve: Find relevant chunks via semantic search
  4. Augment: Add context to the prompt
  5. Generate: LLM answers using provided context

This reduces hallucinations by grounding responses in actual data.
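
To make the five steps concrete, here's a toy sketch. The word-overlap embed function is only a stand-in for a real embedding model so the example runs without any external service; in practice you'd use an embeddings API and a vector store.

```python
from collections import Counter

DOCS = [
    "Annual plans can be refunded within 30 days of purchase.",
    "Monthly plans renew automatically on the billing date.",
    "Support is available 24/7 via chat and email.",
]

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. Replace with a real embedding model.
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> int:
    return sum((a & b).values())  # crude overlap score, not cosine

# 1. Index: embed every chunk once, up front
index = [(doc, embed(doc)) for doc in DOCS]

def build_prompt(question: str) -> str:
    # 3. Retrieve: rank chunks by similarity to the question
    q = embed(question)
    best_chunk = max(index, key=lambda item: similarity(q, item[1]))[0]
    # 4. Augment: paste the retrieved chunk into the prompt
    return (
        f"Answer using ONLY this context:\n{best_chunk}\n\n"
        f"Question: {question}\n"
        "If the context doesn't contain the answer, say you don't know."
    )

# 5. Generate: send this prompt to your LLM client of choice
print(build_prompt("What's our refund policy for annual plans?"))
```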

RAG Walkthrough

A walkthrough has three stages: the user query, the retrieved chunks, and the model answer.

User asks: "What's our refund policy for annual plans?"

Same question, different retrieval: most 'RAG failures' are retrieval failures, not model failures.


Function Calling

Modern LLMs can call external functions:

Function Calling Flow:

  1. Define Tools: Describe available functions
  2. User Request: "What's the weather in Tokyo?"
  3. Model Decides: LLM outputs: call get_weather()
  4. Execute: Your code calls the actual API
  5. Return Results: Feed result back to LLM
  6. Final Response: "It's 22°C and sunny"

This enables agents - LLMs that take actions. More on this in Part 8: Future of AI.
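
A provider-agnostic sketch of that round trip; the JSON shape and the helper names (call_llm, get_weather) are illustrative, since each provider's tool-calling format differs slightly.

```python
import json

def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 22, "conditions": "sunny"}  # stub API

def call_llm(prompt: str) -> str:
    # 3. Model decides: it returns a structured request, not the answer.
    return '{"tool": "get_weather", "arguments": {"city": "Tokyo"}}'

request = json.loads(call_llm("What's the weather in Tokyo?"))

# 4. Execute: YOUR code runs the function the model asked for
result = None
if request["tool"] == "get_weather":
    result = get_weather(**request["arguments"])

# 5. Return results: feed the tool output back to the model
followup = (
    f"Tool result: {json.dumps(result)}\n"
    "Write a one-sentence answer for the user."
)
# 6. Final response: send `followup` back through call_llm()
```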

RAG vs Tools:

  • RAG: Use when the model needs to answer from text sources (docs, policies). It "reads" retrieved snippets you provide.
  • Tools: Use when you need live data or actions (database lookup, calendar event, payment status). Your code runs the tool; the model formats the request.

Important: People think the model "calls the API." It doesn't. The model only outputs a structured request such as get_weather({ city: "Tokyo" }). Your code decides whether to actually run it, with permissions, validation, rate limits, and logging. That separation is the difference between a fun demo and a safe product.
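
A sketch of that guard layer, with an explicit allowlist and argument checks before anything runs (the names are illustrative):

```python
import json
import logging

ALLOWED_TOOLS = {"get_weather": {"city": str}}  # explicit allowlist

def approve_tool_request(raw: str) -> tuple[str, dict]:
    request = json.loads(raw)
    name, args = request.get("tool"), request.get("arguments", {})

    schema = ALLOWED_TOOLS.get(name)
    if schema is None:
        raise PermissionError(f"Tool not allowed: {name}")

    for key, expected_type in schema.items():  # validate every argument
        if not isinstance(args.get(key), expected_type):
            raise ValueError(f"Bad argument {key!r} for {name}")

    logging.info("tool call approved: %s(%s)", name, args)
    return name, args  # only now dispatch to the real implementation
```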


Cost Optimization

Cost optimization techniques:

  • Model tiering: Use smaller models when possible (10-50x cheaper)
  • Caching: Cache identical requests (100% savings on cache hits)
  • Prompt compression: Remove unnecessary context (proportional savings)
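
Caching is the easiest win. A minimal exact-match cache sketch (call_llm is a placeholder for your real client; in production you'd likely back this with Redis and add a TTL):

```python
import hashlib

_cache: dict[str, str] = {}

def call_llm(prompt: str, model: str) -> str:
    return "canned answer"  # placeholder for the real, billable call

def cached_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]            # cache hit: zero API cost
    answer = call_llm(prompt, model)  # cache miss: pay once
    _cache[key] = answer
    return answer
```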

Price Comparison (per 1M tokens)

| Model | Price per 1M tokens |
| --- | --- |
| GPT-4o-mini | $0.15 |
| Claude Haiku | $0.25 |
| GPT-4o | $5 |
| GPT-4 | $30 |

Local Deployment

Don't want to send data to external APIs? Run a model locally. LLaMA 4 and other open models make this viable.
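
For example, with Ollama running on its default port, a local model is one HTTP call away (the model name here is just an example; use whichever open model you've pulled):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    json={
        "model": "llama3",                   # example model name
        "prompt": "Summarize RAG in one sentence.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```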


Common Patterns

Summarization

Summarize in 3 bullets, focusing on conclusions.

Classification

Classify as: billing, technical, feature_request, other. Return only the category.

Extraction

Extract all company names. Return as JSON array.
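
Each of these patterns pairs a tightly constrained prompt with defensive parsing. An extraction sketch (call_llm is a placeholder returning a canned response):

```python
import json

def call_llm(prompt: str) -> str:
    return '["Acme Corp", "Globex"]'  # placeholder for your real client

prompt = (
    "Extract all company names from the text below. "
    "Return ONLY a JSON array of strings.\n\n"
    "Text: Acme Corp signed a partnership with Globex last quarter."
)
companies = json.loads(call_llm(prompt))
print(companies)  # ['Acme Corp', 'Globex']
```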


Key Takeaways

Model selection - Match capability to task complexity.

Prompts are programs - Structure them clearly.

RAG extends knowledge - Combine retrieval with generation.

Functions enable agents - LLMs can take actions.

Costs are controllable - Tiering and caching help.


Quick Check

When should you reach for RAG first?



Common beginner mistakes

  • Treating prompts like magic spells instead of clear specs.
  • Adding RAG, but not checking whether retrieval is actually returning the right chunks.
  • Letting the model “call tools” without strict permissions/validation in your code.

What's Next?

In Part 8, we look ahead. Agents, AGI debates, and predictions for 2026 and beyond.