Why Your LLM Application Feels Slow

Introduction

You spend weeks building an AI-powered application. The model gives great answers. Then users start complaining that it feels slow. You look at the model's response time and it seems fine, so what is the problem?

Here is the key insight: LLM latency is usually not the model's fault. In most production systems, the model itself is only one of many stages that add delay. The real bottlenecks are architectural, how requests flow through your system, which operations happen sequentially when they could run in parallel, and whether you are streaming output or waiting for the full response before showing anything to the user.

This article breaks down where latency actually comes from, why it matters so much for user experience, and how to fix it systematically.

The Transformer model architecture diagram showing encoder and decoder stacks with attention layers and feed-forward networks — **Figure:** The Transformer architecture at the core of modern LLMs. The prefill phase, where the model processes the entire input prompt before generating the first token, is directly responsible for Time to First Token (TTFT) latency. Longer prompts mean longer prefill, which is why prompt size is a primary latency lever. Source: Yuening Jia / Wikimedia Commons (CC BY-SA 3.0)

Problem Statement

When an LLM application feels slow, the instinctive response is to blame the model or switch to a faster one. This instinct is almost always wrong. The model is a single stage in a multi-stage pipeline, and optimising one stage while ignoring the others produces diminishing returns.

The deeper problem is that most teams do not measure latency at the right granularity. They check total response time and stop there. What they miss is that total response time is a composite of many smaller delays, network round trips, database queries, embedding calls, context assembly, and each one has a different root cause and a different fix.

Without decomposing latency into its components, you cannot know which component to fix. You end up optimising the wrong thing.

Core Concepts and Terminology

Term	Definition	Why It Matters
TTFT (Time to First Token)	Time elapsed from when the request is sent until the first token of the response is received	Determines how long the user perceives the app as "not responding", the most important metric for perceived speed
TBT (Time Between Tokens)	The interval between each successive output token during streaming	Determines how smooth the streaming experience feels; high TBT makes text appear in jerky bursts
E2E (End-to-End Latency)	Total wall-clock time from request sent to final token received	For short responses TTFT dominates; for long responses TBT accumulates and dominates
Prefill Phase	The stage in which the model processes the entire input prompt before generating the first output token	Longer prompts mean longer prefill time, which directly increases TTFT
RAG (Retrieval-Augmented Generation)	An architecture that retrieves relevant documents and injects them into the prompt before calling the model	Adds retrieval latency before every model call; frequently the dominant bottleneck in production
Cold Start	The delay caused by initialising a serverless function or container before it can handle a request	Can add hundreds of milliseconds to TTFT for the first request in a given time window
Prompt Caching	A feature offered by some LLM providers that reuses the computed key-value cache for repeated prompt prefixes	Eliminates repeated prefill cost for static system prompts, reducing TTFT significantly
Top-K Retrieval	The number of documents returned by a vector search query	Larger K means more context injected into the prompt, which slows both retrieval and prefill

How It Works

To understand where latency comes from, you need to follow a request through every stage of an LLM application pipeline. Think of it like tracking a letter through a postal system, the final delivery time is the sum of every leg of the journey, not just the time spent in the mail truck.

A typical LLM request travels through these stages in order:

Request reception. Your server receives the user's message and begins processing. Network latency and connection overhead apply here.
Authentication and validation. The request is checked for permissions and input validity. This is usually fast but adds a fixed overhead.
Embedding generation. In a RAG system, the user's query is converted into a vector by an embedding model. This is an API call or a local model inference step.
Vector database search. The query embedding is used to find the most similar document chunks in a vector store. Network latency to the database applies here.
Context assembly. Retrieved documents are formatted and inserted into the prompt template along with the user's question and any system instructions.
Model inference (prefill). The LLM reads the entire assembled prompt. This is the prefill phase, its duration scales with the number of input tokens. Only after this phase does the model begin generating output.
Model inference (generation). The model generates output tokens one by one. TTFT ends when the first token arrives; TBT measures the gap between subsequent tokens.
Post-processing. The response is parsed, formatted, filtered, or evaluated before being sent to the user.
Response delivery. The final response is sent back over the network to the client.

The critical insight is that if these stages run sequentially, each one waiting for the previous to finish, the total latency is the sum of all stages. Stages that could run in parallel instead compound the wait time.

Think of it like a restaurant where every order must go through the host, the waiter, the chef, and the runner, each waiting for the previous person to finish before they start. If the host and waiter could work simultaneously, every order would be faster. The same logic applies to your LLM pipeline.

Diagram of a single attention head in a Transformer model showing Query, Key, and Value vector transformations and the weighted scoring process — **Figure:** A single attention head in a Transformer, showing how Query, Key, and Value vectors are computed and scored. During the prefill phase, this computation runs across every token in the input prompt simultaneously, which is why a 5,000-token prompt causes substantially higher TTFT than a 500-token prompt. Source: Shuang Zhang et al. / Wikimedia Commons (CC BY 4.0)

Practical Example

Imagine a customer support chatbot built on top of a RAG pipeline. A user asks: "What is your return policy for international orders?"

In a naive sequential implementation, the pipeline runs like this. First, the request arrives and is authenticated, taking around 30 milliseconds. Then the query is embedded using an external embedding API, taking 90 milliseconds. The vector database is searched for relevant policy documents, taking 120 milliseconds. The top five retrieved chunks, three of which are irrelevant to the user's specific question, are assembled into a prompt along with a verbose 2,000-token system prompt, taking 10 milliseconds. The model runs prefill on this large prompt, taking 400 milliseconds. Generation of the answer takes another 1,800 milliseconds. Post-processing and delivery take 40 milliseconds. Total: roughly 2,490 milliseconds.

Now consider the optimised version. Authentication and embedding run in parallel immediately on request arrival, saving 30 milliseconds. Top-K retrieval is reduced from 5 to 3 chunks, cutting context size and shaving 60 milliseconds off prefill. The system prompt is cached, so its prefill cost is paid only once, saving another 150 milliseconds. Streaming is enabled so the user sees the first word at 640 milliseconds instead of waiting 2,490 milliseconds for silence. Logging is moved to an asynchronous background task. The result is a dramatically more responsive experience even though the model itself did not change.

Advantages

Decomposing latency by stage gives you a precise target. Instead of blindly switching models, you identify whether retrieval, prefill, or network is your bottleneck and fix only that stage. This is far more efficient than wholesale infrastructure changes.
Streaming transforms perceived responsiveness without changing actual generation time. A response that streams from second one feels five times faster than the same response delivered at second five as a block of text. This is one of the highest-impact changes you can make.
Parallelising independent pipeline stages compounds savings across every request. Running embedding and authentication simultaneously, or embedding retrieval in parallel with validation, reduces wall-clock time proportional to the number of stages you overlap.
Prompt caching eliminates redundant prefill cost for repeated system prompts. If your system prompt is 2,000 tokens and you pay the prefill cost on every request, caching that prefix removes a fixed overhead from every single API call.
Reducing top-K retrieval is a quick win with compound benefits. Fewer retrieved chunks mean a shorter prompt, faster prefill, lower token cost, and often better model output quality because irrelevant context is not diluting the relevant material.

Limitations and Trade-offs

Streaming requires client-side support. Not all interfaces can handle server-sent events or chunked transfer encoding. If your clients include legacy systems or simple HTTP consumers, streaming may require additional infrastructure work.
Reducing top-K retrieval can hurt recall. Fetching fewer documents speeds things up but risks missing relevant context. You need to measure retrieval quality at each K value before reducing it aggressively.
Parallelising stages adds architectural complexity. Asynchronous pipelines are harder to reason about, debug, and test than sequential ones. The latency savings are real, but the code becomes more complex.
Prompt caching is provider-specific. Not all LLM APIs support prefix caching, and those that do require specific prompt structures (the cached portion must be at the very beginning of the prompt). Migrating to a provider that supports caching may require prompt restructuring.
Warming serverless functions trades cost for latency. Keeping a function warm prevents cold starts but means you pay for idle compute time. This is a genuine trade-off, not a free optimization.

Common Mistakes

Blaming the model before measuring. Teams often switch to a smaller or faster model before auditing the pipeline. If retrieval takes 800 milliseconds, switching to a model that is 200 milliseconds faster does not solve the problem. Measure first.
Sending the full conversation history on every turn. Many early chatbot implementations append every previous message to the prompt on each request. For long conversations, this inflates input tokens dramatically. Use rolling summarisation or truncation based on relevance.
Retrieving far more documents than needed. Top-50 retrieval is common in early prototypes. In production, start with top-3 or top-5 and increase only if quality degrades. Every extra chunk is more prefill time and more API cost.
Treating logging as synchronous work. Writing logs to disk or a database synchronously inside the request path adds latency to every request. Move all logging, analytics, and memory updates to asynchronous background tasks that do not block the response.
Deploying the model in a different region from the application. Cross-region latency is constant and unavoidable, it cannot be optimised away. Deploy inference endpoints in the same region as your application server from the start.
Never enabling streaming for interactive applications. If a user sees no response for five seconds, they will assume the application is broken and either refresh or leave. Streaming is not a polish feature, it is a baseline expectation for interactive AI products.

Best Practices

Measure TTFT, TBT, and E2E separately for every request. Do not collapse them into a single "response time" metric, they point to different problems and have different fixes.
Enable streaming for all interactive endpoints as a non-negotiable baseline. Apply it first before any other optimization.
Start retrieval as early in the pipeline as possible, in parallel with validation and authentication rather than after it.
Set a strict top-K limit on retrieval and measure quality at each K value before settling on a number.
Use prompt caching for large, static system prompts so you pay the prefill cost once rather than on every request.
Move all non-blocking work, logging, analytics, memory persistence, to background tasks that execute after the response is delivered.
Keep your inference endpoint in the same cloud region as your application server. Treat cross-region latency as a hard constraint, not a configuration detail.
Profile your pipeline end-to-end before every major change to confirm that optimisations are having the expected effect.

Comparison: Sequential vs Parallelised Pipeline

Pipeline Design	Typical Total Latency	User Experience	Complexity	When to Use
Sequential, no streaming	Highest (all stages sum)	Long silent wait, then full response appears	Low, easy to implement and debug	Prototypes, internal tools where latency is not a priority
Sequential with streaming	Same total, but TTFT unchanged	Still waits for retrieval and prefill, then streams	Low, one additional parameter to enable	Quick improvement to any existing sequential pipeline
Parallelised, no streaming	Lower (overlapping stages)	Faster, but still silent until full response is ready	Medium, requires async coordination logic	Batch or non-interactive workloads
Parallelised with streaming	Lowest TTFT and E2E	Responsive immediately, text appears as generated	Medium-high, async pipeline plus streaming client	Production interactive applications where user experience matters
Cached prompts, optimised top-K	Further reduction in TTFT	Noticeably snappier for repeat-query patterns	Medium, requires prompt structure discipline	High-traffic production systems with stable system prompts

Frequently Asked Questions

Is TTFT or E2E more important to optimise?

It depends on your use case. For interactive chat applications, TTFT dominates user perception, users tolerate a slow stream far better than they tolerate silence. For batch processing or programmatic consumers that wait for the full response before doing anything with it, E2E matters more. Most user-facing products should prioritise TTFT first.

How much does streaming actually help if the total generation time is the same?

Streaming does not reduce total generation time, but it dramatically improves perceived responsiveness. Research on user experience consistently shows that users rate systems as faster when they receive partial output quickly, even if the full response takes the same amount of time. For a five-second response, the difference between streaming from second one and delivering at second five is the difference between an app that feels alive and one that feels broken.

What is the easiest first optimisation to make?

Enable streaming. It is usually a single parameter change in your LLM API call and requires minimal changes to your application logic. It has the highest impact on perceived responsiveness relative to the effort required and should be the first change made to any interactive LLM application.

Does reducing top-K retrieval always help latency?

Yes, but it may hurt retrieval quality if taken too far. Reducing top-K shortens the prompt (faster prefill) and reduces the vector database search overhead. The right trade-off depends on your corpus and query distribution. Start at top-5, measure retrieval quality, and only reduce further if quality remains acceptable.

When does prompt caching make sense?

Prompt caching is most valuable when you have a large, static system prompt that is repeated verbatim on every request. If your system prompt is 500 tokens, the savings are modest. If it is 3,000 tokens and you make thousands of requests per hour, caching pays for itself immediately. The catch is that the cached content must appear at the very beginning of the prompt and must be identical across requests for the cache to be valid.

References

Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
Leviathan, Y., Kalman, M., and Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
Pope, R., et al. (2023). Efficiently Scaling Transformer Inference. MLSys 2023.
vLLM Documentation. Efficient LLM Serving

Key Takeaways

LLM latency is primarily a system architecture problem. Audit the full pipeline before blaming the model, the model is rarely the bottleneck.
TTFT dominates user perception of speed. Optimise for it specifically rather than for total generation time alone.
In RAG systems, retrieval is frequently the dominant bottleneck. Cache embeddings, limit top-K retrieval aggressively, and start retrieval as early in the pipeline as possible.
Parallelise independent pipeline stages. Stages that do not depend on each other should not wait for each other.
Make streaming non-negotiable for interactive applications. It is the single highest-impact change you can make to how your application feels, and it costs almost nothing to implement.
Measure TTFT, TBT, and E2E separately. Each metric points to a different class of problem with a different fix.