The Problem Nobody Warns You About
You add an AI chat feature to your internal tool. You point it at the codebase docs and call it done. Then a teammate asks the bot "why does the payment processor retry on 429?" and it confidently explains rate-limiting in general terms, completely ignoring the three custom retry strategies your team spent a sprint building.
That is the gap between "AI that knows about code" and "AI that knows your code." General knowledge of how software works is not the same as understanding the actual logic, the naming conventions, the weird legacy files, and the architectural decisions baked into your specific repo.
Retrieval-Augmented Generation (RAG) over a codebase is the standard answer, but the standard answer glosses over almost all of the hard parts. This guide covers the full picture: what to index, how to chunk source files properly, what embedding model to use in 2025, how to build the retrieval pipeline, and the failure modes that will catch you out in production.
What RAG Actually Does (and Does Not Do)
RAG is not fine-tuning. Fine-tuning bakes knowledge into model weights through additional training. RAG retrieves relevant documents at query time and stuffs them into the context window alongside the question. The model then generates its answer using both its pre-trained knowledge and the retrieved content.
For a codebase this means:
- You embed every file (or chunk) into vectors at index time.
- When a user asks something, you embed the query, find the nearest vectors, pull back the corresponding source code, and send that code to the model as context.
- The model answers using what you gave it, not from memory.
This is powerful for codebases because the model was never trained on your private repo. But RAG has a ceiling: if the relevant code is not in the retrieved chunks, the model cannot know it. The quality of your retrieval is the quality of your system.
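To make the retrieve step concrete, here is a minimal in-memory sketch of "embed the query, find the nearest vectors." The three-dimensional vectors and chunk IDs are hand-written placeholders, not output from a real embedding model, and `nearestChunks` is a hypothetical helper name; a production system would delegate this search to a vector store.

```typescript
// Toy illustration only: real embeddings have hundreds or thousands of dimensions.
interface IndexedChunk {
  id: string;
  text: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const ai = a[i] ?? 0;
    const bi = b[i] ?? 0;
    dot += ai * bi;
    normA += ai * ai;
    normB += bi * bi;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest-neighbour search: score every chunk, keep the best k.
function nearestChunks(queryVector: number[], index: IndexedChunk[], k: number): IndexedChunk[] {
  return index
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}

// Hypothetical index: placeholder vectors standing in for real embeddings.
const toyIndex: IndexedChunk[] = [
  { id: "retry.ts::0", text: "function retryOn429() { /* ... */ }", vector: [0.9, 0.1, 0.0] },
  { id: "auth.ts::0", text: "function validateJwt() { /* ... */ }", vector: [0.0, 0.2, 0.9] },
  { id: "db.ts::0", text: "function connect() { /* ... */ }", vector: [0.1, 0.9, 0.1] },
];

const topChunks = nearestChunks([1, 0, 0], toyIndex, 2);
console.log(topChunks.map((c) => c.id));
```

Whatever ends up in `topChunks` is everything the model will see, which is why the rest of this guide spends so much time on what goes into the index and how it is ranked.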
Step 1: Deciding What to Index
The naive approach is to index every file. That works until you have 50,000 files with 300MB of minified build output, generated GraphQL types, lock files, and migration snapshots. You will burn tokens and retrieve noise.
A practical indexing filter for a TypeScript/Node monorepo:
find . -type f \( \
    -name "*.ts" -o \
    -name "*.tsx" -o \
    -name "*.md" -o \
    -name "*.json" \
  \) \
  ! -path "*/node_modules/*" \
  ! -path "*/.next/*" \
  ! -path "*/dist/*" \
  ! -path "*/build/*" \
  ! -name "package-lock.json" \
  ! -name "*.d.ts" \
  ! -name "*.generated.ts" \
  > files_to_index.txt
wc -l files_to_index.txt
Note that a "*.lock" exclusion would be a no-op here, because the extension filter never matches .lock files in the first place; the lock file that does slip through the "*.json" pattern is package-lock.json, so that is the one to exclude by name.
Beyond extension filtering, you should also skip files below a meaningful size threshold. A 12-line re-export barrel file (export * from './foo') has no answerable content. Set a floor of around 80 tokens (roughly 300 characters).
For *.json files, index package.json at the repo root and workspace roots (they document intent and dependencies), but skip deeply nested config snapshots.
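The size floor and the package.json rule can be sketched as a single predicate applied to each path from the file list. This is an illustrative sketch: `shouldIndex`, `isWorkspaceManifest`, and the `workspaceRoots` parameter are hypothetical names, and exempting manifests from the size floor is an assumption rather than something the rules above require.

```typescript
const MIN_CHARS = 300; // roughly 80 tokens at ~4 characters per token

// True only for package.json files sitting directly at a known workspace root.
// workspaceRoots uses "." for the repo root, e.g. [".", "packages/api"].
function isWorkspaceManifest(filePath: string, workspaceRoots: string[]): boolean {
  return workspaceRoots.some((root) => {
    const expected = root === "." ? "package.json" : `${root}/package.json`;
    return filePath === expected;
  });
}

function shouldIndex(filePath: string, content: string, workspaceRoots: string[]): boolean {
  if (filePath.endsWith(".json")) {
    // Only manifests are kept; nested config snapshots are dropped.
    // Assumption: manifests are kept even when short, since they document intent.
    return isWorkspaceManifest(filePath, workspaceRoots);
  }
  // Everything else must clear the size floor (barrel files fall out here).
  return content.trim().length >= MIN_CHARS;
}
```

Running this over files_to_index.txt before chunking keeps barrel files and stray JSON out of the embedding bill entirely.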
Step 2: Chunking Source Code Properly
This is where most tutorials fail. They say "split at 512 tokens with 50-token overlap" and move on. That strategy was designed for prose. Source code has structure that prose does not: functions, classes, imports, comments. Splitting across a function boundary produces chunks that make no sense without their context.
A better strategy is AST-aware chunking. Parse the file, then emit one chunk per top-level declaration. For TypeScript this means one chunk per exported function, class, type alias, or constant that carries meaningful logic.
// chunker.ts
import { Project, Node } from "ts-morph";

export interface CodeChunk {
  filePath: string;
  chunkId: string;
  text: string;
  startLine: number;
  endLine: number;
  symbolName?: string;
  kind: string;
}

const CHUNK_TOKEN_FLOOR = 80;
const CHUNK_TOKEN_CEILING = 1500;
const OVERLAP_LINES = 10;

function estimateTokens(text: string): number {
  // cl100k_base approximation: ~4 chars per token
  return Math.ceil(text.length / 4);
}

export function chunkFile(filePath: string, source: string): CodeChunk[] {
  const project = new Project({ useInMemoryFileSystem: true });
  // Keep the .tsx extension so files containing JSX parse correctly
  const tempName = filePath.endsWith(".tsx") ? "__temp__.tsx" : "__temp__.ts";
  const sf = project.createSourceFile(tempName, source);
  const chunks: CodeChunk[] = [];
  let index = 0;

  const emit = (node: Node, name?: string) => {
    // Include JSDoc in both the text and the start line so line numbers match the text
    const text = node.getText(true);
    const nodeStart = node.getStartLineNumber(true);
    const tokens = estimateTokens(text);
    if (tokens < CHUNK_TOKEN_FLOOR) return;

    if (tokens > CHUNK_TOKEN_CEILING) {
      // Fallback: sliding window at line boundaries
      const lines = text.split("\n");
      let buffer: string[] = [];
      let bufStart = nodeStart;
      let flushedThrough = -1; // index of the last line already emitted in a window
      for (let i = 0; i < lines.length; i++) {
        buffer.push(lines[i]);
        if (estimateTokens(buffer.join("\n")) >= CHUNK_TOKEN_CEILING) {
          chunks.push({
            filePath,
            chunkId: `${filePath}::${index++}`,
            text: buffer.join("\n"),
            startLine: bufStart,
            endLine: bufStart + buffer.length - 1,
            symbolName: name,
            kind: "overflow",
          });
          flushedThrough = i;
          // Carry the last lines over as overlap; the next window starts at their first line
          buffer = buffer.slice(-OVERLAP_LINES);
          bufStart = nodeStart + i + 1 - buffer.length;
        }
      }
      // Emit the tail only if it contains lines beyond the overlap already emitted
      if (buffer.length && flushedThrough < lines.length - 1) {
        chunks.push({
          filePath,
          chunkId: `${filePath}::${index++}`,
          text: buffer.join("\n"),
          startLine: bufStart,
          endLine: bufStart + buffer.length - 1,
          symbolName: name,
          kind: "overflow_tail",
        });
      }
      return;
    }

    chunks.push({
      filePath,
      chunkId: `${filePath}::${index++}`,
      text,
      startLine: nodeStart,
      endLine: node.getEndLineNumber(),
      symbolName: name,
      kind: node.getKindName(),
    });
  };

  sf.getFunctions().forEach((fn) => emit(fn, fn.getName()));
  sf.getClasses().forEach((cls) => emit(cls, cls.getName()));
  sf.getTypeAliases().forEach((t) => emit(t, t.getName()));
  sf.getInterfaces().forEach((i) => emit(i, i.getName()));
  // A variable statement can declare several names; join them for the symbol label
  sf.getVariableStatements().forEach((vs) =>
    emit(vs, vs.getDeclarations().map((d) => d.getName()).join(", "))
  );

  // If the file produced no declarations, emit the whole file as one chunk
  if (chunks.length === 0) {
    const tokens = estimateTokens(source);
    if (tokens >= CHUNK_TOKEN_FLOOR) {
      chunks.push({
        filePath,
        chunkId: `${filePath}::0`,
        text: source,
        startLine: 1,
        endLine: source.split("\n").length,
        kind: "file",
      });
    }
  }
  return chunks;
}
Notice the overflow handling. A 3,000-line class cannot fit in a single chunk; you fall back to sliding windows with line-level overlap so at least the boundary context survives. This is not pretty, but it beats silently truncating.
Step 3: Embedding Models and Vector Stores
In 2025 the dominant choices for code embedding are:
Voyage AI voyage-code-3 (1,024-dimensional by default, 32K context). Voyage code models consistently outperform general text embeddings on code retrieval benchmarks. If you are calling an external API and willing to pay for quality, this is the strongest off-the-shelf option as of mid-2025.
nomic-embed-text-v2 (MoE) runs locally, is Apache-licensed, and delivers competitive quality at zero variable cost. Good choice if your codebase contains anything sensitive that should not leave your network.
OpenAI text-embedding-3-small is cheap and acceptable. If your stack is already OpenAI-first and you are not indexing sensitive code, this removes one vendor dependency.
For the vector store, if you already run Postgres, pgvector is the path of least resistance. For a standalone service, Qdrant is fast and has a clean API. Pinecone works well if you want a fully managed service and do not mind the cost.
Here is a minimal indexer using the Voyage API and pgvector:
// indexer.ts
import { chunkFile, type CodeChunk } from "./chunker";
import { Pool } from "pg";
import fs from "fs";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function embedBatch(texts: string[]): Promise<number[][]> {
  const res = await fetch("https://api.voyageai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
    },
    // input_type "document" marks these texts as corpus entries rather than queries
    body: JSON.stringify({ model: "voyage-code-3", input: texts, input_type: "document" }),
  });
  if (!res.ok) throw new Error(`Voyage API error: ${res.status}`);
  const data = (await res.json()) as { data: { embedding: number[] }[] };
  return data.data.map((d) => d.embedding);
}

async function upsertChunks(chunks: CodeChunk[], embeddings: number[][]) {
  const client = await pool.connect();
  try {
    // One transaction per batch: either every chunk in the batch lands or none does
    await client.query("BEGIN");
    for (let i = 0; i < chunks.length; i++) {
      const c = chunks[i];
      // pgvector accepts a vector literal of the form "[0.1,0.2,...]"
      const vec = `[${embeddings[i].join(",")}]`;
      await client.query(
        `INSERT INTO code_chunks
           (chunk_id, file_path, symbol_name, kind, start_line, end_line, content, embedding)
         VALUES ($1,$2,$3,$4,$5,$6,$7,$8::vector)
         ON CONFLICT (chunk_id) DO UPDATE
           SET content = EXCLUDED.content,
               embedding = EXCLUDED.embedding,
               updated_at = now()`,
        [c.chunkId, c.filePath, c.symbolName ?? null, c.kind, c.startLine, c.endLine, c.text, vec]
      );
    }
    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  } finally {
    client.release();
  }
}

export async function indexFiles(filePaths: string[]) {
  const EMBED_BATCH = 64;
  const allChunks: CodeChunk[] = [];
  for (const fp of filePaths) {
    const source = fs.readFileSync(fp, "utf8");
    allChunks.push(...chunkFile(fp, source));
  }
  for (let i = 0; i < allChunks.length; i += EMBED_BATCH) {
    const batch = allChunks.slice(i, i + EMBED_BATCH);
    const texts = batch.map((c) => c.text);
    const embeddings = await embedBatch(texts);
    await upsertChunks(batch, embeddings);
    console.log(`Indexed ${Math.min(i + EMBED_BATCH, allChunks.length)} / ${allChunks.length}`);
  }
}
The pgvector schema:
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE code_chunks (
  chunk_id    TEXT PRIMARY KEY,
  file_path   TEXT NOT NULL,
  symbol_name TEXT,
  kind        TEXT NOT NULL,
  start_line  INT,
  end_line    INT,
  content     TEXT NOT NULL,
  embedding   vector(1024),
  updated_at  TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON code_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
Use HNSW, not IVFFlat. IVFFlat requires you to specify the number of lists at creation time and derives its cluster centroids from whatever rows exist when the index is built, so it has to be created after the table is populated and rebuilt as the data drifts. HNSW builds incrementally and delivers better recall at comparable query latency for collections under a few million vectors, at the cost of slower index builds and more memory.
Step 4: The Retrieval Pipeline
Pure vector similarity search on code has a known failure mode: queries that describe intent ("find where we validate JWT tokens") return high-scoring chunks that mention JWTs in comments or type definitions, not the actual validation function. Code is dense with domain vocabulary that appears in many contexts.
The solution most teams land on is a two-stage pipeline: vector search for candidate recall, followed by a fast re-ranker to reorder by relevance.
// retriever.ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function embedQuery(query: string): Promise<number[]> {
  const res = await fetch("https://api.voyageai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
    },
    // input_type "query" marks this text as a search query rather than a corpus entry
    body: JSON.stringify({ model: "voyage-code-3", input: [query], input_type: "query" }),
  });
  if (!res.ok) throw new Error(`Voyage embeddings error: ${res.status}`);
  const data = (await res.json()) as { data: { embedding: number[] }[] };
  return data.data[0].embedding;
}

// Returns indices into `documents`, most relevant first (at most 8)
async function rerankWithVoyage(
  query: string,
  documents: string[]
): Promise<number[]> {
  const res = await fetch("https://api.voyageai.com/v1/rerank", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
    },
    body: JSON.stringify({
      model: "rerank-2",
      query,
      documents,
      top_k: 8,
    }),
  });
  if (!res.ok) throw new Error(`Voyage rerank error: ${res.status}`);
  const data = (await res.json()) as { data: { index: number }[] };
  return data.data.map((d) => d.index);
}

export async function retrieve(query: string, topK = 20) {
  const queryVec = await embedQuery(query);
  const vecStr = `[${queryVec.join(",")}]`;
  // <=> is pgvector's cosine distance operator, so similarity = 1 - distance
  const { rows } = await pool.query<{
    chunk_id: string;
    file_path: string;
    symbol_name: string | null;
    content: string;
    similarity: number;
  }>(
    `SELECT chunk_id, file_path, symbol_name, content,
            1 - (embedding <=> $1::vector) AS similarity
     FROM code_chunks
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [vecStr, topK]
  );
  if (rows.length === 0) return [];

  // Re-rank the top-K candidates
  const rerankedIndices = await rerankWithVoyage(
    query,
    rows.map((r) => r.content)
  );
  return rerankedIndices.map((i) => rows[i]);
}
Voyage rerank-2 is fine for code. Cohere rerank-english-v3.0 is another option. If you want zero external calls, a keyword-based re-rank over the candidate set (BM25, a simple TF-IDF, or a library such as flexsearch) works reasonably well and costs nothing.
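The zero-external-calls option can be sketched as a small BM25 scorer run only over the candidate set. This is an illustrative sketch, not a replacement for a tuned search library: `bm25Rerank` and `tokenize` are hypothetical names, the camelCase splitting is an assumption about what helps for code, and `k1 = 1.2`, `b = 0.75` are the conventional BM25 defaults. It returns indices best-first, the same shape a re-ranker call would hand back.

```typescript
// Split identifiers like validateJwt into "validate" and "jwt" before lowercasing.
function tokenize(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .toLowerCase()
    .split(/[^a-z0-9_]+/)
    .filter(Boolean);
}

// Returns indices into `documents`, best match first.
function bm25Rerank(query: string, documents: string[], k1 = 1.2, b = 0.75): number[] {
  const docs = documents.map(tokenize);
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);
  const queryTerms = Array.from(new Set(tokenize(query)));

  // Document frequency: in how many candidates does each query term appear?
  const df = new Map<string, number>();
  for (const term of queryTerms) {
    df.set(term, docs.filter((d) => d.includes(term)).length);
  }

  const scores = docs.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const n = df.get(term) ?? 0;
      // Rare terms get a higher weight than terms found in most candidates.
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      // Term frequency saturates via k1; b normalises for document length.
      const lengthNorm = 1 - b + b * (doc.length / (avgLen || 1));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return score;
  });

  return scores
    .map((score, index) => ({ score, index }))
    .sort((x, y) => y.score - x.score)
    .map((entry) => entry.index);
}
```

Because the statistics are computed over 20 or so candidates rather than the whole corpus, the IDF values are crude, but they are usually enough to push the chunk that actually contains the identifier above chunks that merely mention it.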
Step 5: Putting It Together, the Answer Generation Step
You have retrieved 8 relevant chunks. Now you need to send them to a language model with enough context for it to reason about your specific code, not generic code.
The system prompt matters more than most people think. A prompt that just says "You are a helpful coding assistant" produces generic answers. A prompt that explains the repo's architecture, naming conventions, and where to find canonical implementations guides the model to give grounded answers.
// answer.ts
import Anthropic from "@anthropic-ai/sdk";
import { retrieve } from "./retriever";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY

const SYSTEM_PROMPT = `You are a code assistant for this repository.
When answering questions, cite specific file paths and function names from the provided context.
If the context does not contain enough information to answer, say so explicitly rather than guessing.
Do not describe what code typically does in general; describe what THIS code does based on the provided context.
Assume the user is a senior engineer who does not need basic concepts explained.`;

export async function answerQuery(userQuery: string): Promise<string> {
  const chunks = await retrieve(userQuery, 20);
  if (chunks.length === 0) {
    return "No relevant code found for this query. Try rephrasing or narrowing the question.";
  }

  // Chunks arrive in re-ranked order, so [1] is the most relevant
  const contextBlock = chunks
    .map(
      (c, i) =>
        `### [${i + 1}] ${c.file_path}${c.symbol_name ? ` > ${c.symbol_name}` : ""}\n\`\`\`\n${c.content}\n\`\`\``
    )
    .join("\n\n");

  const message = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 2048,
    system: SYSTEM_PROMPT,
    messages: [
      {
        role: "user",
        content: `Here is the relevant code from the repository:\n\n${contextBlock}\n\n---\n\nQuestion: ${userQuery}`,
      },
    ],
  });
  const block = message.content[0];
  return block.type === "text" ? block.text : "";
}
One thing to pay attention to: ordering the chunks by relevance descending, with the most relevant chunk first, produces better answers than random or by-file ordering. Models tend to use content at the start and end of a long context more reliably than content buried in the middle, a well-documented phenomenon sometimes called "lost in the middle." Put your best chunk at position 1.
Tradeoffs and Failure Modes
Chunk boundary problems
Even with AST-aware chunking, some code is genuinely contextual. A 40-line method in a service class makes no sense without knowing the class's fields, injected dependencies, and constructor. One mitigation is to always prepend the class header (up to the first method) to any method-level chunk. This adds tokens but dramatically improves answer quality for OOP-heavy codebases.
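The class-header mitigation can be sketched with plain line arithmetic, assuming the chunker already knows each method's line span inside the class (for example from the AST). `methodChunksWithHeader` and `MethodSpan` are hypothetical names, and the sketch treats everything before the first listed method (fields, constructor) as the header.

```typescript
// Line spans are 1-based, inclusive, and relative to the start of the class text.
interface MethodSpan {
  name: string;
  startLine: number;
  endLine: number;
}

function methodChunksWithHeader(
  classText: string,
  methods: MethodSpan[]
): { name: string; text: string }[] {
  if (methods.length === 0) return [];
  const lines = classText.split("\n");
  // Header = class declaration, fields, and anything else above the first method.
  const firstMethodStart = Math.min(...methods.map((m) => m.startLine));
  const header = lines.slice(0, firstMethodStart - 1).join("\n");

  return methods.map((m) => {
    const body = lines.slice(m.startLine - 1, m.endLine).join("\n");
    // Mark the gap when earlier methods were skipped between header and body.
    const gap = m.startLine > firstMethodStart ? "  // ...\n" : "";
    return { name: m.name, text: `${header}\n${gap}${body}\n}` };
  });
}

// Hypothetical class used only to illustrate the output shape.
const exampleClass = [
  "class PaymentService {",
  "  private retries = 3;",
  "  constructor(private http: HttpClient) {}",
  "  charge(amount: number) {",
  "    return this.http.post('/charge', { amount });",
  "  }",
  "  refund(id: string) {",
  "    return this.http.post('/refund', { id });",
  "  }",
  "}",
].join("\n");

const methodChunks = methodChunksWithHeader(exampleClass, [
  { name: "charge", startLine: 4, endLine: 6 },
  { name: "refund", startLine: 7, endLine: 9 },
]);
```

Each resulting chunk repeats the header, so the token cost grows with the number of methods; that is the price of letting a retrieved method explain itself.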
Stale index
The index goes stale the moment someone commits new code. A lazy mitigation is to re-index on a nightly schedule. A better approach is to hook into CI: after every merge to main, run an incremental indexer that only reprocesses files changed in the diff.
git diff --name-only HEAD~1 HEAD | \
  grep -E '\.(ts|tsx|md)$' | \
  node ./scripts/index-files.js --from-stdin
Keeping a file-level content hash in the database lets you skip re-embedding files that have not changed, which matters if your repo has thousands of files and the embedding API charges per token.
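A minimal sketch of the content-hash check, assuming the stored hashes have already been loaded from the database into a map keyed by file path. `contentHash` and `filesNeedingReindex` are hypothetical helper names, and SHA-256 is one reasonable choice of hash rather than a requirement.

```typescript
import { createHash } from "node:crypto";

// Stable fingerprint of a file's contents.
function contentHash(source: string): string {
  return createHash("sha256").update(source, "utf8").digest("hex");
}

// Compare current file contents against previously stored hashes and
// return only the paths whose content is new or has changed.
function filesNeedingReindex(
  currentFiles: Map<string, string>, // path -> source text
  storedHashes: Map<string, string>  // path -> hash recorded at last index time
): string[] {
  const changed: string[] = [];
  currentFiles.forEach((source, filePath) => {
    if (storedHashes.get(filePath) !== contentHash(source)) {
      changed.push(filePath);
    }
  });
  return changed;
}
```

Only the paths this returns need to go through chunking and embedding; the stored hash for each is then updated alongside its chunks.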
Hallucination under sparse retrieval
If the query is about something genuinely not in the indexed codebase, the model will sometimes make up a plausible-sounding answer using its pre-trained knowledge of similar patterns. This is more dangerous than an "I don't know" because the answer looks authoritative.
One guard: include the similarity scores in your context and instruct the model to lower its confidence when all retrieved chunks are below a threshold (say, cosine similarity below 0.72 for this embedding model). You can also do this server-side and return a low-confidence flag to the UI.
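The server-side variant of this guard can be sketched as a small function over the similarity scores returned by the vector query. `assessRetrieval` is a hypothetical name, and the 0.72 default is simply the example figure from the paragraph above; the right threshold depends on the embedding model and should be calibrated against your own queries.

```typescript
interface ScoredChunk {
  chunk_id: string;
  similarity: number; // cosine similarity, as returned by the vector query
}

const LOW_CONFIDENCE_THRESHOLD = 0.72; // example value; tune per embedding model

// Flags a retrieval as low-confidence when nothing came back
// or when even the best chunk falls below the threshold.
function assessRetrieval(
  chunks: ScoredChunk[],
  threshold = LOW_CONFIDENCE_THRESHOLD
): { lowConfidence: boolean; bestSimilarity: number | null } {
  if (chunks.length === 0) {
    return { lowConfidence: true, bestSimilarity: null };
  }
  const best = chunks.reduce((max, c) => Math.max(max, c.similarity), -Infinity);
  return { lowConfidence: best < threshold, bestSimilarity: best };
}
```

The UI can use `lowConfidence` to show a warning banner, or the server can skip the LLM call entirely and return a "not found in the indexed code" response.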
Token budget overrun
8 chunks at up to 1,500 tokens each is 12,000 tokens of context before you even add the system prompt and the query. With a 200K context model this seems irrelevant, but cost scales with tokens and latency grows too. Keep your retrieved context to under 8,000 tokens total for routine queries. Use the re-ranker's output and trim aggressively.
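One way to enforce the budget is a greedy pass over the re-ranked chunks, sketched below. `trimToBudget` is a hypothetical helper, the ~4 characters per token estimate is the same rough approximation used in the chunker, and skipping (rather than stopping at) a chunk that does not fit is a design choice, not the only reasonable one.

```typescript
interface RankedChunk {
  chunk_id: string;
  content: string;
}

// Rough estimate: ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walk the chunks in relevance order and keep each one that still fits.
// A chunk that would overflow the budget is skipped, so a smaller,
// lower-ranked chunk can still use the remaining space.
function trimToBudget(chunks: RankedChunk[], maxTokens = 8000): RankedChunk[] {
  const kept: RankedChunk[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk.content);
    if (used + cost > maxTokens) continue;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Because the input is already in relevance order, the output is too, so the best chunk still lands at position 1 after trimming.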
The Gotcha Nobody Writes About
Here is something I learned the hard way: embedding models and re-rankers are not symmetric in what they find.
When we first built this for a Node.js service, the embedding retrieval was consistently pulling back configuration files and type definitions as top results for questions about business logic. The cosine similarity was high because the type names in the query appeared in those files. The re-ranker improved things, but there was a persistent pattern of type stubs outranking implementations.
The fix was a hybrid retrieval approach: run a parallel BM25 keyword search alongside the vector search, merge the result sets, then re-rank the merged pool. BM25 is good at exact identifier matching. Vector search is good at semantic intent. The merged pool has better recall than either alone.
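// bm25-retriever.ts (simplified, using pg full-text search as a proxy)
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function bm25Retrieve(query: string, topK = 20) {
  // Strip punctuation and tsquery operators, then OR the terms together and let
  // ts_rank do the ordering; AND-ing every word of a natural-language question
  // rarely matches any single chunk.
  const terms = query
    .replace(/[^a-zA-Z0-9_ ]/g, " ")
    .trim()
    .split(/\s+/)
    .filter(Boolean);
  if (terms.length === 0) return [];
  const tsQuery = terms.join(" | ");

  // Use the same text search configuration on both sides of the match
  const { rows } = await pool.query(
    `SELECT chunk_id, file_path, symbol_name, content,
            ts_rank(to_tsvector('english', content), to_tsquery('english', $1)) AS rank
     FROM code_chunks
     WHERE to_tsvector('english', content) @@ to_tsquery('english', $1)
     ORDER BY rank DESC
     LIMIT $2`,
    [tsQuery, topK]
  );
  return rows;
}

// In retriever.ts, merge both result sets before re-ranking
// (needs: import { bm25Retrieve } from "./bm25-retriever";)
export async function hybridRetrieve(query: string) {
  // Note: retrieve() as written already re-ranks and returns at most 8 rows,
  // so the vector side arrives pre-trimmed and is re-ranked a second time below.
  const [vecResults, bm25Results] = await Promise.all([
    retrieve(query, 20),
    bm25Retrieve(query, 20),
  ]);

  // De-duplicate by chunk_id, keeping the first occurrence
  const seen = new Set<string>();
  const merged: typeof vecResults = [];
  for (const r of [...vecResults, ...bm25Results]) {
    if (!seen.has(r.chunk_id)) {
      seen.add(r.chunk_id);
      merged.push(r);
    }
  }
  if (merged.length === 0) return [];

  // Re-rank the merged pool
  const rerankedIndices = await rerankWithVoyage(
    query,
    merged.map((r) => r.content)
  );
  return rerankedIndices.map((i) => merged[i]).slice(0, 8);
}
This hybrid approach is now standard enough that Qdrant, Weaviate, and others have built-in fusion support. If you are starting fresh on a managed vector store, look for "hybrid search" in the docs before implementing it yourself.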
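For a sense of what "fusion" means in those products, here is a sketch of reciprocal rank fusion (RRF), one common way to combine ranked lists without a re-ranker call. This is a different technique from the merge-then-re-rank code above, shown only for comparison; `reciprocalRankFusion` is a hypothetical name and `k = 60` is the conventional smoothing constant, not a tuned value.

```typescript
// Each ranking is a list of chunk IDs, best first.
// A chunk's fused score is the sum over lists of 1 / (k + rank), with rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      const contribution = 1 / (k + position + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical result lists from the two retrievers.
const vectorRanking = ["a", "b", "c"];
const keywordRanking = ["c", "a", "d"];
const fused = reciprocalRankFusion([vectorRanking, keywordRanking]);
console.log(fused);
```

RRF only looks at positions, never at raw scores, which is why it works across retrievers whose scores are not comparable (cosine similarity versus ts_rank, for instance). Chunks that appear in both lists float to the top.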
How to Apply This to Your Project
If you want to go from zero to a working codebase RAG in a weekend, here is the sequencing that avoids wasted effort:
Day 1, morning: indexing pipeline. Get the file list working, verify your chunker on 10 representative files, check that the chunks make sense when you print them. Do not touch embeddings yet. Bad chunking is the most common root cause of bad retrieval, and you cannot debug it once it is buried in vectors.
Day 1, afternoon: first embeddings and a sanity check. Index a subset of 200 files. Run 5 representative queries against pure vector search. Look at what comes back. If the top result for "where does the app bootstrap its database connection?" is a migration file, your chunker has a problem, not your embedding model.
Day 2, morning: re-ranking and hybrid search. Add the BM25 path and the re-ranker. Run the same 5 queries. Compare. You should see meaningful improvement.
Day 2, afternoon: wire up the LLM call and test it end-to-end. Spend time on the system prompt. Try asking questions you already know the answer to, so you can verify the model is using the retrieved code rather than general knowledge.
Then set up the CI incremental indexer before you ship anything. A stale index erodes trust faster than almost any other failure mode, because the system confidently answers questions about code that no longer exists.
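The before-and-after comparisons in the Day 1 and Day 2 steps are easier to trust if they produce a number. Below is a minimal sketch of a hit-rate check, assuming you have written down, for each test query, the file you expect to see, and recorded the file paths each retrieval configuration returned. `hitRateAtK`, `EvalCase`, and the sample paths are all hypothetical.

```typescript
interface EvalCase {
  query: string;
  expectedFile: string; // the file a correct retrieval should surface
}

// resultsByQuery maps each query to the file paths returned, best first.
// Returns the fraction of cases whose expected file appears in the top k.
function hitRateAtK(
  cases: EvalCase[],
  resultsByQuery: Record<string, string[]>,
  k: number
): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const topK = (resultsByQuery[c.query] ?? []).slice(0, k);
    if (topK.includes(c.expectedFile)) hits++;
  }
  return hits / cases.length;
}

// Hypothetical evaluation set and recorded results for one configuration.
const evalCases: EvalCase[] = [
  { query: "where does the app bootstrap its database connection?", expectedFile: "src/db/connect.ts" },
  { query: "why does the payment processor retry on 429?", expectedFile: "src/payments/retry.ts" },
];
const vectorOnlyResults: Record<string, string[]> = {
  "where does the app bootstrap its database connection?": ["migrations/001_init.ts", "src/db/connect.ts"],
  "why does the payment processor retry on 429?": ["src/types/payment.ts", "src/config/http.ts"],
};
console.log(hitRateAtK(evalCases, vectorOnlyResults, 1), hitRateAtK(evalCases, vectorOnlyResults, 2));
```

Recording one results map per configuration (vector only, with re-ranker, hybrid) and comparing the hit rates turns "you should see meaningful improvement" into something you can check after every change to the chunker.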
Takeaways
- Index with intention, not by default. Filter out generated files, type stubs, and lock files before you embed a single token.
- Chunk at AST boundaries, not at character counts. A split function is worse than useless as a retrieval unit.
- Hybrid retrieval (vector plus BM25) consistently outperforms either alone for code. Add it early.
- Re-rank before sending to the LLM. Raw cosine similarity order is not relevance order.
- Put the most relevant chunk first in the context. Do not rely on the model to find the important part in 12,000 tokens.
- Stale indexes destroy user trust. Hook re-indexing into CI on merge.
- When retrieval misses, the model does not fail gracefully. Build confidence thresholds and communicate uncertainty to the user explicitly.
A well-built codebase RAG is not magic: it is a pipeline where every step can be inspected, measured, and improved. If the answers are wrong, you can usually trace it to chunking, retrieval, or prompt construction rather than something opaque inside a model. That debuggability is what makes it worth building properly.