Building with LLMs: A Practical Guide for 2025

Kishore Gunnam

Developer & Writer

You understand how LLMs work. Now let's build with them.

The practical LLM app stack:

  1. UI / API: Your app receives a user request
  2. Prompt: Turn that request into a clear instruction + context
  3. Model: The LLM generates a draft answer or chooses a tool
  4. Tools (optional): Call databases/APIs/functions for real data or actions
  5. Return: Show a final, user-ready response (with sources when possible)
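
Here's a minimal sketch of that loop, assuming the official OpenAI Python client; handle_request and lookup_account are illustrative names, not a real API, and the optional tool step is covered later in this post.

```python
# Minimal sketch of the request -> prompt -> model -> return loop.
# Assumes the OpenAI Python client; handle_request() and lookup_account()
# are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def lookup_account(user_message: str) -> str:
    # Stand-in for a real database or CRM lookup (the optional tools step).
    return "Plan: Pro, last invoice: 2025-01-12"

def handle_request(user_message: str) -> str:
    # 2. Prompt: turn the raw request into instruction + context
    prompt = (
        "You are a support agent for Acme Billing.\n"
        f"Account context: {lookup_account(user_message)}\n"
        f"User message: {user_message}\n"
        "Reply in three short bullets."
    )
    # 3. Model: generate a draft answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # 5. Return: a user-ready string for your UI or API layer
    return response.choices[0].message.content
```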

Choosing Your Model

| Use Case | Priority | Recommended |
| --- | --- | --- |
| High volume, simple tasks | Fast & cheap | GPT-4o-mini, Claude Haiku |
| Complex reasoning | Best quality | GPT-4, Claude Opus |
| Document analysis | Long context | Gemini 2.5 Pro |
| Local deployment | Privacy | LLaMA 4 |

Compare all options in Part 4: Model Landscape.


Prompt Engineering

Effective Prompt Structure:

  1. Role: Define who the AI should be
  2. Context: Background information needed
  3. Task: Clearly state what you want
  4. Format: Specify output format
  5. Examples: 1-3 examples of desired output

A prompt template you can copy:

Role: You are a helpful support agent.

Context:
- Product: Acme Billing
- Plan: Pro
- User message: "I was charged twice"

Task:
Explain what might have happened and ask 2 clarifying questions.

Format:
- 3 bullet explanation
- 2 questions
- 1 next action

Beginner trick: be explicit about the output format. It reduces rambling and makes the response easier to use in a UI.

Prompts work best when they read like a spec you'd hand to a teammate. If the model keeps "going off track," the task is usually ambiguous or the output format isn't constrained.

Key Techniques

  • Few-shot: Include examples for new task formats
  • Chain of Thought: Use "think step by step" for complex reasoning
  • Structured Output: Request JSON for parsing programmatically
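
Structured output is worth a concrete example. A minimal sketch (call_llm is a placeholder for whichever client you use, returning a canned reply here): ask for JSON only, then parse defensively.

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your real client call.
    return '{"category": "billing", "urgent": true}'

prompt = """Classify the support ticket below.
Return ONLY valid JSON shaped like {"category": "...", "urgent": true/false}.

Ticket: "I was charged twice and my event is tomorrow."
"""

try:
    result = json.loads(call_llm(prompt))
except json.JSONDecodeError:
    result = {"category": "other", "urgent": False}  # safe fallback

print(result["category"], result["urgent"])  # billing True
```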

RAG: Retrieval-Augmented Generation

LLMs have knowledge cutoffs. They don't know your private data. RAG solves this.

Your docs stay in your system. At question time, fetch the most relevant snippets and paste them into the prompt. The model isn't "remembering your database"—it's reading pages you picked for it.

Key insight: RAG quality is mostly determined by the retrieved chunks. If the retrieved text is irrelevant, the model will still hallucinate—just more confidently.

How RAG Works:

  1. Index Documents: Split into chunks, generate embeddings
  2. User Asks: Receive the question
  3. Retrieve: Find relevant chunks via semantic search
  4. Augment: Add context to the prompt
  5. Generate: LLM answers using provided context

This reduces hallucinations by grounding responses in actual data.
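
To make the five steps concrete, here's a toy sketch. The word-overlap embed function is only a stand-in for a real embedding model so the example runs without any external service; in practice you'd use an embeddings API and a vector store.

```python
from collections import Counter

DOCS = [
    "Annual plans can be refunded within 30 days of purchase.",
    "Monthly plans renew automatically on the billing date.",
    "Support is available 24/7 via chat and email.",
]

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. Replace with a real embedding model.
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> int:
    return sum((a & b).values())  # crude overlap score, not cosine

# 1. Index: embed every chunk once, up front
index = [(doc, embed(doc)) for doc in DOCS]

def build_prompt(question: str) -> str:
    # 3. Retrieve: rank chunks by similarity to the question
    q = embed(question)
    best_chunk = max(index, key=lambda item: similarity(q, item[1]))[0]
    # 4. Augment: paste the retrieved chunk into the prompt
    return (
        f"Answer using ONLY this context:\n{best_chunk}\n\n"
        f"Question: {question}\n"
        "If the context doesn't contain the answer, say you don't know."
    )

# 5. Generate: send this prompt to your LLM client of choice
print(build_prompt("What's our refund policy for annual plans?"))
```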

RAG Walkthrough

A walkthrough has three stages: the user query, the retrieved chunks, and the model answer.

User asks: "What's our refund policy for annual plans?"

Same question, different retrieval: most 'RAG failures' are retrieval failures, not model failures.


Function Calling

Modern LLMs can call external functions:

Function Calling Flow:

  1. Define Tools: Describe available functions
  2. User Request: "What's the weather in Tokyo?"
  3. Model Decides: LLM outputs: call get_weather()
  4. Execute: Your code calls the actual API
  5. Return Results: Feed result back to LLM
  6. Final Response: "It's 22°C and sunny"

This enables agents - LLMs that take actions. More on this in Part 8: Future of AI.
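
A provider-agnostic sketch of that round trip; the JSON shape and the helper names (call_llm, get_weather) are illustrative, since each provider's tool-calling format differs slightly.

```python
import json

def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 22, "conditions": "sunny"}  # stub API

def call_llm(prompt: str) -> str:
    # 3. Model decides: it returns a structured request, not the answer.
    return '{"tool": "get_weather", "arguments": {"city": "Tokyo"}}'

request = json.loads(call_llm("What's the weather in Tokyo?"))

# 4. Execute: YOUR code runs the function the model asked for
result = None
if request["tool"] == "get_weather":
    result = get_weather(**request["arguments"])

# 5. Return results: feed the tool output back to the model
followup = (
    f"Tool result: {json.dumps(result)}\n"
    "Write a one-sentence answer for the user."
)
# 6. Final response: send `followup` back through call_llm()
```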

RAG vs Tools:

  • RAG: Use when the model needs to answer from text sources (docs, policies). It "reads" retrieved snippets you provide.
  • Tools: Use when you need live data or actions (database lookup, calendar event, payment status). Your code runs the tool; the model formats the request.

Important: People think the model "calls the API." It doesn't. The model only outputs a structured request such as get_weather({ city: "Tokyo" }). Your code decides whether to actually run it, with permissions, validation, rate limits, and logging. That separation is the difference between a fun demo and a safe product.
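
A sketch of that guard layer, with an explicit allowlist and argument checks before anything runs (the names are illustrative):

```python
import json
import logging

ALLOWED_TOOLS = {"get_weather": {"city": str}}  # explicit allowlist

def approve_tool_request(raw: str) -> tuple[str, dict]:
    request = json.loads(raw)
    name, args = request.get("tool"), request.get("arguments", {})

    schema = ALLOWED_TOOLS.get(name)
    if schema is None:
        raise PermissionError(f"Tool not allowed: {name}")

    for key, expected_type in schema.items():  # validate every argument
        if not isinstance(args.get(key), expected_type):
            raise ValueError(f"Bad argument {key!r} for {name}")

    logging.info("tool call approved: %s(%s)", name, args)
    return name, args  # only now dispatch to the real implementation
```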


Cost Optimization

Cost optimization techniques:

  • Model tiering: Use smaller models when possible (10-50x cheaper)
  • Caching: Cache identical requests (100% savings on cache hits)
  • Prompt compression: Remove unnecessary context (proportional savings)
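
Caching is the easiest win. A minimal exact-match cache sketch (call_llm is a placeholder for your real client; in production you'd likely back this with Redis and add a TTL):

```python
import hashlib

_cache: dict[str, str] = {}

def call_llm(prompt: str, model: str) -> str:
    return "canned answer"  # placeholder for the real, billable call

def cached_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]            # cache hit: zero API cost
    answer = call_llm(prompt, model)  # cache miss: pay once
    _cache[key] = answer
    return answer
```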

Price Comparison (per 1M tokens)

| Model | Price per 1M tokens |
| --- | --- |
| GPT-4o-mini | $0.15 |
| Claude Haiku | $0.25 |
| GPT-4o | $5 |
| GPT-4 | $30 |

Local Deployment

Don't want to send data to external APIs? Run a model locally. LLaMA 4 and other open models make this viable.
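
For example, with Ollama running on its default port, a local model is one HTTP call away (the model name here is just an example; use whichever open model you've pulled):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    json={
        "model": "llama3",                   # example model name
        "prompt": "Summarize RAG in one sentence.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```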


Common Patterns

Summarization

Summarize in 3 bullets, focusing on conclusions.

Classification

Classify as: billing, technical, feature_request, other. Return only the category.

Extraction

Extract all company names. Return as JSON array.
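
Each of these patterns pairs a tightly constrained prompt with defensive parsing. An extraction sketch (call_llm is a placeholder returning a canned response):

```python
import json

def call_llm(prompt: str) -> str:
    return '["Acme Corp", "Globex"]'  # placeholder for your real client

prompt = (
    "Extract all company names from the text below. "
    "Return ONLY a JSON array of strings.\n\n"
    "Text: Acme Corp signed a partnership with Globex last quarter."
)
companies = json.loads(call_llm(prompt))
print(companies)  # ['Acme Corp', 'Globex']
```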


Key Takeaways

Model selection - Match capability to task complexity.

Prompts are programs - Structure them clearly.

RAG extends knowledge - Combine retrieval with generation.

Functions enable agents - LLMs can take actions.

Costs are controllable - Tiering and caching help.


Quick Check

When should you reach for RAG first?



Common beginner mistakes

  • Treating prompts like magic spells instead of clear specs.
  • Adding RAG, but not checking whether retrieval is actually returning the right chunks.
  • Letting the model “call tools” without strict permissions/validation in your code.

What's Next?

In Part 8, we look ahead. Agents, AGI debates, and predictions for 2026 and beyond.