LLM · February 12, 2026

How to Generate Better Embeddings for Vector Search

A practical guide to improving retrieval quality with chunking, cleaning, metadata, and embedding strategies

by Perivitta · 20 mins read · Advanced

Introduction

If you have ever built a chatbot or a search tool backed by a vector database, you may have noticed something frustrating: sometimes the system retrieves completely wrong documents, even when the answer is clearly in your knowledge base.

The usual suspects (Pinecone, Weaviate, Qdrant, Chroma, FAISS) tend to get blamed. But the database is rarely the problem. In most cases, the issue is the quality of the embeddings going into the database.

This article explains what embeddings are, why retrieval fails, and how to fix it step by step, without needing a PhD in machine learning.


What Is an Embedding? (Quick Primer)

An embedding is a list of numbers that represents the meaning of a piece of text. An embedding model reads your text and produces something like:

[0.12, -0.44, 0.87, 0.03, ...]

The numbers themselves are not meaningful to a human. But here is the key insight: two pieces of text with similar meaning will produce vectors that are mathematically close together. Two unrelated pieces of text will produce vectors that are far apart.

This is what makes vector search powerful. Instead of matching keywords ("does the word 'price' appear?"), you match meaning ("is this document about pricing?").
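
To make "mathematically close" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three short vectors are invented for illustration; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example only.
pricing_doc = [0.9, 0.1, 0.0]
pricing_query = [0.8, 0.2, 0.1]
unrelated_doc = [0.0, 0.1, 0.9]

# The pricing pair scores much higher than the unrelated pair.
print(cosine_similarity(pricing_query, pricing_doc))
print(cosine_similarity(pricing_query, unrelated_doc))
```

A vector database performs essentially this comparison at scale, usually with an approximate index so it does not have to score every stored vector.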

Figure: Singular Value Decomposition diagram showing the U, Sigma, and V matrix factorization. SVD underpins classical dimensionality reduction methods like LSA/LSI. Modern neural embedding models use transformer encoders rather than SVD, but the goal is the same: mapping high-dimensional text into a compact vector space where semantic similarity corresponds to geometric closeness. Source: Georg-Johann / Wikimedia Commons (CC BY-SA 3.0)

What Does "Better Embeddings" Actually Mean?

Better embeddings do not mean "fancier numbers." They mean your retrieval system works better in practice:

  • The correct document comes back more often.
  • Results are consistent even when users phrase questions differently.
  • Fewer irrelevant documents appear in the results.
  • The system handles vague natural-language questions well.

Improving embedding quality is primarily about fixing the text that goes into the embedding model, not necessarily upgrading the model itself.


Common Symptoms of Bad Embeddings

Before diving into fixes, it helps to recognise when you have an embedding problem:

  • Retrieval returns documents that share keywords but are on different topics.
  • Rephrasing the same question produces completely different results.
  • Short queries like "pricing" return random, unrelated chunks.
  • The system works in testing but breaks with real user questions.
  • Retrieved chunks feel incomplete: they cut off mid-sentence or lack context.

These are almost always caused by problems in how the text was prepared and split, not by the embedding model itself.


Step 1: Clean the Text Before Embedding

Garbage in, garbage out. If you embed messy text, you get messy vectors.

A common mistake is embedding raw scraped HTML, messy PDF output, or unprocessed markdown without cleaning it first. Imagine embedding a page that includes navigation menus, cookie banners, and repetitive footer text. Those fragments will pollute your vectors.

A solid preprocessing pipeline should strip out:

  • Navigation menus and repeated headers/footers
  • Cookie notices and privacy popups
  • Ads and sidebar content unrelated to the topic
  • Page numbers, watermarks, and PDF conversion artifacts
  • Extra whitespace and broken line breaks

A simple rule: if a human reader would skip it, do not embed it.
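
As a rough illustration of such a pipeline, the sketch below drops boilerplate lines, collapses extra whitespace, and rejoins hard-wrapped lines inside each paragraph. The patterns in `BOILERPLATE` are hypothetical examples, not a complete list; in practice you would build them from what actually appears in your own sources.

```python
import re

# Hypothetical patterns for this example; derive your own from your sources.
BOILERPLATE = [
    re.compile(r"accept (all )?cookies", re.IGNORECASE),
    re.compile(r"^page \d+( of \d+)?$", re.IGNORECASE),
    re.compile(r"^(home|about|contact)( \| (home|about|contact))*$", re.IGNORECASE),
]

def clean_text(raw: str) -> str:
    """Remove boilerplate lines, normalise whitespace, and repair broken line breaks."""
    lines = []
    for line in raw.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if line and any(pattern.search(line) for pattern in BOILERPLATE):
            continue  # skip navigation, cookie notices, page numbers
        lines.append(line)
    # Blank lines separate paragraphs; rejoin wrapped lines inside each one.
    paragraphs = re.split(r"\n\s*\n", "\n".join(lines))
    joined = [" ".join(paragraph.split()) for paragraph in paragraphs]
    return "\n\n".join(paragraph for paragraph in joined if paragraph)

raw_page = """Home | About | Contact

We use cookies. Accept all cookies to continue.

Rate   limits apply to
all API calls.

Page 3 of 12
"""

print(clean_text(raw_page))
```

Line-level regular expressions are a blunt tool. For HTML it is usually more reliable to remove navigation and footer elements while the markup is still available, before converting the page to text.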


Step 2: Chunking Matters More Than Model Choice

Chunking means splitting your documents into smaller pieces before embedding them. Each chunk becomes one entry in the vector database.

This step has a larger impact on retrieval quality than almost anything else, including which embedding model you pick. Here is why: the embedding model compresses a chunk into a single vector. If the chunk is too large, the vector becomes vague and matches too many queries. If the chunk is too small, it loses context and does not make sense on its own.

Chunking mistakes that hurt retrieval

  • Splitting text every N tokens regardless of sentence or paragraph boundaries
  • Cutting chunks in the middle of sentences
  • Embedding an entire document as one chunk
  • Embedding only headings without the content below them
  • Separating bullet lists from the paragraph that explains them

A good chunk should be something a human could read on its own and understand.


Chunk Size: What Works in Practice

There is no universal answer, but most production RAG systems land in predictable ranges:

Chunk Size      | Best For                   | Typical Result
150–300 tokens  | FAQs, short Q&A datasets   | High precision, weaker context
300–700 tokens  | General RAG pipelines      | Strong balance of context and relevance
700–1200 tokens | Technical docs and manuals | More context, but risk of being too broad

For most chatbot and documentation retrieval use cases, 400–800 tokens is a reliable starting point.

To validate your chunk size, ask these questions about each chunk:

  • Can this chunk answer a question on its own?
  • Does it contain enough context to make sense in isolation?
  • Is it too broad to match a specific query?

Step 3: Use Overlap to Avoid Losing Context at Boundaries

When you split text into chunks, important information can get cut off at the boundary between two chunks. Chunk overlap solves this by repeating a small portion of text at the start of each new chunk.

Think of it like overlapping shingles on a roof: each one covers the gap from the one before it.

Common overlap values:

  • 10% overlap for general text
  • 15–20% overlap for technical documentation
  • 30% overlap only when chunks are very small

A practical starting point is 100–200 tokens of overlap. More than that and you get duplicated results that waste your context window.
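
A minimal sketch of overlapping chunks as a sliding window. It assumes the text has already been tokenised into a list; the `t0`, `t1`, ... tokens and the tiny sizes are placeholders chosen so the overlap is easy to see, and a real pipeline would count tokens with the tokenizer of its embedding model.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk after the first begins with the last token of the previous one. This fixed-size window is only the overlap mechanism; it still cuts wherever the count runs out, so it works best combined with boundary-aware splitting.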


Step 4: Split by Meaning, Not by Token Count

The best improvement you can make to chunking is splitting by document structure instead of arbitrary token counts. This is often called structure-aware chunking, and it is sometimes grouped under the broader label of semantic chunking.

Split at natural boundaries such as:

  • Headings and subheadings
  • Paragraph breaks
  • Code blocks and their accompanying explanation
  • Lists and their introductory sentences
  • Table sections

Chunks that follow structure represent real concepts, not random fragments.
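
The sketch below shows one way to split on structure, assuming markdown-style input where headings start with `#` and blank lines separate paragraphs. Both the input format and the `split_by_structure` helper are assumptions made for illustration; other formats need their own boundary rules.

```python
import re

def split_by_structure(markdown: str) -> list[dict]:
    """Group paragraphs under the heading they belong to."""
    sections = []
    current = {"heading": "", "paragraphs": []}
    for block in re.split(r"\n\s*\n", markdown.strip()):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            # A heading opens a new section; store the previous one if it has content.
            if current["heading"] or current["paragraphs"]:
                sections.append(current)
            heading, _, rest = block.partition("\n")
            current = {"heading": heading.lstrip("#").strip(), "paragraphs": []}
            if rest.strip():
                current["paragraphs"].append(rest.strip())
        else:
            current["paragraphs"].append(block)
    if current["heading"] or current["paragraphs"]:
        sections.append(current)
    return sections

doc = """# Rate Limiting

Rate limits apply to all API calls.

Limits reset every minute.

# Pricing

Plans are billed monthly.
"""

sections = split_by_structure(doc)
for section in sections:
    print(section["heading"], len(section["paragraphs"]))
```

This simple version would mistake a `#` comment inside a code block for a heading, so a production splitter needs to track code fences as well. Sections that come out too long can then be split further at paragraph boundaries.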


Step 5: Add Context Headers Inside Each Chunk

This is one of the most underrated improvements in embedding pipelines. When you extract a chunk from a document, that chunk loses context: it no longer knows what document it came from or what section it belongs to. But that context matters for the embedding.

The fix is simple: prepend a short header to each chunk before embedding it.

Instead of embedding just:

Rate limits apply to all API calls...

Embed this:

Document: API Documentation
Section: Rate Limiting

Rate limits apply to all API calls...

This small change often improves retrieval quality significantly because the embedding now captures both the topic and the context.
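
In code, the header step can be a small helper applied just before the embedding call. The function name `with_context_header` is an assumption for this sketch; the header layout follows the example above.

```python
def with_context_header(chunk_text: str, document_title: str, section: str) -> str:
    """Prepend the document title and section name to the text that will be embedded."""
    header = f"Document: {document_title}\nSection: {section}"
    return f"{header}\n\n{chunk_text.strip()}"

text_to_embed = with_context_header(
    "Rate limits apply to all API calls...",
    document_title="API Documentation",
    section="Rate Limiting",
)
print(text_to_embed)
```

One design choice to consider: you can embed the version with the header but store the original chunk text for display, so the header improves matching without appearing in what the user or the LLM reads.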


Step 6: Handle Tables and Code Blocks Carefully

Tables and code blocks are common sources of embedding failure. When PDF or HTML extraction pulls out a table, it often produces unreadable collapsed text like:

Plan Price Limit Basic 10 1000 Pro 25 10000

This is nearly impossible for an embedding model to interpret correctly. Instead, represent tables in a structured format:

Plan       | Price  | Request Limit
Basic      | $10    | 1000 requests
Pro        | $25    | 10,000 requests
Enterprise | Custom | Unlimited
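
One possible way to produce structured table text is to label every cell with its column name, so each row still makes sense if it ends up in a chunk without the header row. This labelled-row layout is a variation chosen for the sketch, not the only valid format, and `table_to_rows` is a hypothetical helper.

```python
def table_to_rows(headers: list[str], rows: list[list[str]]) -> list[str]:
    """Render each table row as 'Column: value' pairs so it is readable on its own."""
    lines = []
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("every row must have one value per header")
        lines.append(" | ".join(f"{header}: {value}" for header, value in zip(headers, row)))
    return lines

headers = ["Plan", "Price", "Request Limit"]
rows = [
    ["Basic", "$10", "1000 requests"],
    ["Pro", "$25", "10,000 requests"],
    ["Enterprise", "Custom", "Unlimited"],
]

table_lines = table_to_rows(headers, rows)
print("\n".join(table_lines))
```

The hard part in practice is recovering the headers and rows from the PDF or HTML in the first place; this helper only covers the final rendering step.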

For code blocks, embedding works best when the code is kept together with the surrounding explanation text.

  • Keep code blocks together with the paragraph explaining them.
  • Do not embed minified or auto-generated code. It has no natural language meaning.
  • Never separate a code example from its heading.

Step 7: Choose the Right Embedding Model for Your Domain

Once your chunking and preprocessing are solid, embedding model choice becomes more important. When evaluating models, look for:

  • Strong performance on your content domain (technical, medical, legal, general)
  • Consistent similarity behaviour across different phrasings of the same question
  • Reasonable vector dimensions for your scale (smaller = cheaper storage)
  • Affordable cost per embedding call at your ingestion volume
  • Low latency for real-time query embedding

Published benchmarks are a useful starting point, but your own dataset is the only real benchmark that matters. Test on a sample of your actual content before committing to a model.


Step 8: Normalise Queries and Documents

A common mistake is embedding documents in one format and user queries in a completely different format. Documents are often long and structured. User queries are short and informal. This gap hurts retrieval.

You can improve results by lightly normalising queries:

  • Expand abbreviations ("db" → "database")
  • Remove noisy punctuation and extra whitespace
  • Convert vague follow-up questions into complete, standalone questions

This is especially important in chatbots, where users ask things like "what about pricing?" without any context. That query is nearly impossible to embed correctly without expansion.
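
A minimal sketch of the first two normalisation steps. The `ABBREVIATIONS` map is a hypothetical example and should be built from the vocabulary your users actually type; turning vague follow-ups into standalone questions needs conversation context and is not attempted here.

```python
import re

# Hypothetical abbreviation map for this example.
ABBREVIATIONS = {"db": "database", "auth": "authentication", "config": "configuration"}

def normalise_query(query: str) -> str:
    """Collapse repeated punctuation and whitespace, then expand known abbreviations."""
    text = re.sub(r"[!?.,;:]{2,}", lambda m: m.group(0)[0], query)  # "??" becomes "?"
    text = re.sub(r"\s+", " ", text).strip()

    def expand(match: re.Match) -> str:
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)

    # Match whole words only, so "db" expands but "reconfigure" is left alone.
    return re.sub(r"[A-Za-z]+", expand, text)

print(normalise_query("  how do i reset   the db??  "))
```

Keep the normalisation light. Aggressive rewriting with fixed rules can change what the user meant, for example when an abbreviation has a second meaning in your domain.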


Step 9: Use Query Rewriting for Better Recall

Query rewriting is one of the strongest production techniques for improving retrieval. Instead of embedding the user's raw query, you first run a lightweight LLM prompt that rewrites the query into a clearer, more descriptive form.

Example:

User query: "how do i do this?"
Rewritten query: "How do I configure vector database indexing for better similarity search performance?"

The rewritten query produces a far more useful embedding, which leads to better retrieval.
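
The rewriting step can be sketched as a prompt template plus a call to whatever model you use. Everything here is an assumption for illustration: the prompt wording, the `rewrite_query` helper, and the `llm` parameter, which stands for any function that takes a prompt string and returns text. `fake_llm` is a stub so the example runs without a model.

```python
from typing import Callable

# Hypothetical prompt wording; adjust it for your model and domain.
REWRITE_TEMPLATE = (
    "Rewrite the user's latest question as a complete, standalone search query.\n"
    "Use the conversation for context. Return only the rewritten query.\n\n"
    "Conversation:\n{history}\n\n"
    "Latest question: {query}\n"
    "Rewritten query:"
)

def rewrite_query(query: str, history: list[str], llm: Callable[[str], str]) -> str:
    """Ask the model for a standalone version of the query before embedding it."""
    prompt = REWRITE_TEMPLATE.format(history="\n".join(history) or "(none)", query=query)
    rewritten = llm(prompt).strip()
    # Fall back to the raw query if the model returns nothing usable.
    return rewritten or query

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; it always returns the same rewrite.
    return "How do I configure vector database indexing for better similarity search performance?"

history = ["User: I am setting up a vector database.", "Assistant: Which part do you need help with?"]
print(rewrite_query("how do i do this?", history, fake_llm))
```

Passing recent conversation turns matters: without them the model has no way to know what "this" refers to and can only guess.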


Step 10: Use Metadata Filters to Reduce Noise

Better retrieval is not only about vectors. Metadata filtering is equally important in real systems.

Store metadata alongside every chunk such as:

  • Document type (blog, docs, FAQ)
  • Category (pricing, troubleshooting, setup)
  • Language
  • Source URL
  • Timestamp or version
  • Product name

Then filter retrieval based on context. If the user asks a pricing question, filter to only pricing documents. This dramatically reduces irrelevant results.
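
The idea can be shown with plain Python dictionaries. This is only a sketch of the logic: the chunk layout and the `filter_chunks` helper are assumptions, and a real vector database applies such filters inside the search itself using its own filter syntax.

```python
def filter_chunks(chunks: list[dict], **required: str) -> list[dict]:
    """Keep only chunks whose metadata matches every required field."""
    return [
        chunk
        for chunk in chunks
        if all(chunk.get("metadata", {}).get(key) == value for key, value in required.items())
    ]

chunks = [
    {"text": "The Pro plan costs $25 per month.",
     "metadata": {"category": "pricing", "language": "en"}},
    {"text": "Restart the service to clear the error.",
     "metadata": {"category": "troubleshooting", "language": "en"}},
]

pricing_only = filter_chunks(chunks, category="pricing", language="en")
print([chunk["text"] for chunk in pricing_only])
```

Deciding which filter to apply is a separate problem, typically solved with a classifier, rules on the query, or context from the application such as the product page the user is on.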


Step 11: Use Hybrid Search (Dense + Keyword)

Vector (dense) search is excellent for semantic meaning, but it struggles with exact matches like:

  • Model version numbers ("GPT-4o", "claude-opus-4-5")
  • API endpoints ("/v1/embeddings")
  • Error codes ("404", "EINVAL")
  • File paths and identifiers

Hybrid search combines dense vector search with traditional keyword search, typically scored with a ranking function such as BM25. The results from both are merged and reranked. If you work with technical documentation, hybrid search almost always improves retrieval quality.
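
One widely used way to merge the two result lists is reciprocal rank fusion (RRF), which needs only the rank positions and not the raw scores. The article does not prescribe a merging method, so treat this as one option among several. The document ids are invented, and `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document earns 1 / (k + rank) from every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_results = ["doc_a", "doc_b", "doc_c"]    # from vector search
keyword_results = ["doc_c", "doc_a", "doc_d"]  # from keyword search

fused = reciprocal_rank_fusion([dense_results, keyword_results])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, which is the behaviour you want when an exact identifier and the surrounding meaning both match.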


Step 12: Evaluate Embedding Quality Properly

One of the biggest mistakes teams make is relying on random manual spot-checks to evaluate retrieval quality. This misses systematic problems.

A proper evaluation dataset contains:

  • Real user queries taken from logs
  • The expected relevant chunk for each query
  • A set of irrelevant chunks for comparison

Key metrics to measure:

  • Recall@K: does the correct chunk appear in the top K results?
  • Precision@K: how many of the top K results are actually relevant?
  • MRR (Mean Reciprocal Rank): how highly ranked is the first correct result?

Measuring these on your own dataset turns guesswork into evidence.
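
The three metrics are short enough to compute yourself. The sketch below assumes each evaluation case holds the ranked chunk ids your system returned and the set of ids judged relevant; the two cases and their ids are invented for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant chunks."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0

# Invented evaluation cases: (ranked results, relevant ids).
cases = [
    (["c1", "c2", "c3"], {"c2"}),  # correct chunk at rank 2
    (["c9", "c8", "c7"], {"c4"}),  # correct chunk missing
]

mean_recall = mean([recall_at_k(r, rel, k=3) for r, rel in cases])
mean_precision = mean([precision_at_k(r, rel, k=3) for r, rel in cases])
mrr = mean([reciprocal_rank(r, rel) for r, rel in cases])
print(mean_recall, mean_precision, mrr)
```

Track these numbers across every change to cleaning, chunking, or model choice, so you can tell whether a change helped across the whole query set and not just on the examples you happened to look at.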


Step 13: Remove Duplicate and Near-Duplicate Chunks

Duplicate chunks cause the vector database to return the same content multiple times. This wastes space in your context window and makes results feel repetitive.

  • Use hashing to detect and remove exact duplicates.
  • Use similarity matching to remove near-duplicates (chunks that say the same thing slightly differently).
  • Remove boilerplate text that appears across many pages (standard disclaimers, repeated navigation text).
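
A minimal sketch of both checks. Exact duplicates are caught by hashing a normalised form of the text, and near-duplicates by Jaccard similarity over word sets. Jaccard is a stand-in chosen to keep the example dependency-free, and the 0.8 threshold is an assumption to tune on your own data; comparing embedding vectors is a common alternative.

```python
import hashlib
import re

def _words(text: str) -> list[str]:
    """Lowercase the text and keep only word characters, so spacing and punctuation are ignored."""
    return re.findall(r"\w+", text.lower())

def jaccard(a: str, b: str) -> float:
    """Shared words divided by total distinct words across both texts."""
    set_a, set_b = set(_words(a)), set(_words(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def deduplicate(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each chunk; drop exact and near-duplicates."""
    seen_hashes = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(" ".join(_words(chunk)).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalisation
        if any(jaccard(chunk, other) >= threshold for other in kept):
            continue  # near-duplicate of a chunk we already kept
        seen_hashes.add(digest)
        kept.append(chunk)
    return kept

chunks = [
    "Rate limits apply to all API calls.",
    "Rate limits  apply to all API calls.",       # same text, extra space
    "Rate limits apply to all API calls today.",  # near-duplicate
    "Plans are billed monthly.",
]
unique_chunks = deduplicate(chunks)
print(unique_chunks)
```

The near-duplicate check compares every chunk against every kept chunk, which is fine for thousands of chunks but too slow for millions. At that scale, techniques such as MinHash or a similarity search over the embeddings themselves are the usual replacement.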

Step 14: Add a Reranker for Better Precision

Even with great embeddings, vector search is not perfect. Many teams improve final results by adding a reranker as a second stage.

The pipeline looks like this:

  1. Vector search retrieves the top 20 candidate chunks (fast but approximate).
  2. A reranker model scores each candidate against the query more carefully.
  3. The system selects only the top 5 most relevant chunks.

A reranker reads the query and the candidate document together, which gives it far more context than a similarity score alone. This improves precision significantly and is usually worth the extra latency in production systems where accuracy matters.
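
The second stage can be sketched as a function that rescores candidates and keeps the best few. The `score_fn` parameter stands for any model that scores a query and a text together. `overlap_score` is a toy stand-in that counts shared words so the example runs without a model; it is not how a real reranker works.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every candidate against the query and return the top_n best."""
    scored = [(score_fn(query, text), index, text) for index, text in enumerate(candidates)]
    # Highest score first; the original position breaks ties so the order stays stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_n]]

def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a reranker model: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

candidates = [  # pretend these came back from vector search
    "Pricing plans overview",
    "How to reset your api key",
    "Api rate limits",
]
top = rerank("reset api key", candidates, overlap_score, top_n=2)
print(top)
```

Because the reranker runs once per candidate, the size of the first-stage candidate list directly controls the added latency. That trade-off is why the pipeline retrieves a modest number of candidates and not hundreds.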


Step 15: Cache Embeddings to Reduce Cost

Embedding API calls can become expensive at scale, especially when embedding queries in real time.

  • Cache embeddings for repeated user queries.
  • Cache document embeddings for stable content that rarely changes.
  • Avoid re-embedding documents that have not been updated.

Caching does not improve accuracy, but it keeps your costs manageable as you scale.
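
A minimal in-memory sketch of the idea, keyed by a hash of the text. `EmbeddingCache` and `fake_embed` are hypothetical names for this example, and `fake_embed` stands in for a real embedding call; a production cache would normally live in a persistent store.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """Call the embedding function only for texts that have not been seen before."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of real embedding calls made

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]

def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding API call.
    return [float(len(text)), 0.0]

cache = EmbeddingCache(fake_embed)
first = cache.get("what does the pro plan cost?")
second = cache.get("what does the pro plan cost?")  # served from the cache
print(cache.misses)  # 1
```

If you ever switch embedding models, the cached vectors are no longer comparable with new ones, so include the model name in the cache key or clear the cache when the model changes.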


A Recommended Embedding Pipeline

If you want a solid production baseline, follow these steps in order:

  1. Clean input text and remove boilerplate noise.
  2. Chunk by headings and paragraph boundaries (not raw token count).
  3. Apply overlap of 100–200 tokens.
  4. Prepend each chunk with the document title and section name.
  5. Generate embeddings using a reliable embedding model.
  6. Store metadata fields (source, category, version, timestamp).
  7. Deduplicate chunks before inserting into the database.
  8. Evaluate retrieval using Recall@K and MRR on real queries.

Most RAG systems improve significantly just from following this pipeline consistently.


Common Mistakes That Kill Retrieval Quality

  • Embedding raw HTML and PDF artifacts without cleaning.
  • Splitting text every N tokens without respecting structure.
  • Storing chunks without document or section context headers.
  • Keeping duplicate chunks in the database.
  • Retrieving too many chunks and letting noise reach the LLM.
  • Evaluating quality with manual spot-checks instead of systematic metrics.
  • Ignoring metadata filters and relying on similarity alone.

Conclusion

Most embedding and retrieval problems are not caused by the embedding model. They are caused by messy preprocessing, poor chunking, and a lack of evaluation.

The teams that get vector search right are not the ones using the most expensive model. They are the ones treating embedding quality as an engineering problem, with clean pipelines, consistent evaluation, and systematic improvement.

If you fix the text going into the embedding model, retrieval improves dramatically, often without needing to change your vector database or switch models at all.


Key Takeaways

  • Chunking and preprocessing have more impact than model choice, so fix those first.
  • Prepend each chunk with the document title and section name to preserve context that would otherwise be lost.
  • Use hybrid search (dense vector + BM25 keyword) for technical content with exact terms like version numbers and error codes.
  • Evaluate with real user queries using Recall@K and MRR, not manual spot-checks. You need data, not intuition.

