AI engineer is no longer one interview shape
"AI engineer" can mean at least five things in 2026:
- LLM application engineer building product features on model APIs.
- ML engineer working with training, fine-tuning or model evaluation.
- AI infra engineer working on inference, scaling and reliability.
- LLMOps engineer owning deployment, monitoring, evals and safety gates.
- Forward deployed AI engineer turning customer workflows into working systems.
That is why generic AI interview prep wastes time. The role title is not enough. Read the job description for clues: RAG, tool calling, evals, latency, model serving, customer discovery, prompt design, safety, agents, vector search, data pipelines or infrastructure.
Public sources point the same way. OpenAI's and Anthropic's hiring pages list broad engineering roles around models, tools and agents, not only classical ML. See OpenAI careers, Anthropic jobs, Stack Overflow's 2025 AI survey, and candidate discussions on ML and AI interview advice. Candidates also report that interview loops vary widely, which is why role-specific prep matters.
Expect a mixed loop
A realistic AI engineer loop may include:
| Round | What it tests |
|---|---|
| Coding | General programming, data handling, APIs, tests |
| LLM app design | RAG, prompts, tools, evals, latency, cost |
| System design | Scaling, reliability, observability, safety |
| ML fundamentals | Embeddings, classification, evaluation, fine-tuning |
| Product judgement | When AI is useful, failure modes, UX |
| Behavioural | Ownership, ambiguity, incidents, communication |
For most product AI roles, you do not need to be a research scientist. You do need to understand how model-powered systems fail. Hallucination, stale retrieval, prompt injection, latency spikes, rate limits, cost overruns and silent quality regressions are production problems.
Know the core LLM app architecture
A common interview prompt is: "Design an AI support assistant for our documentation."
A solid architecture:
- Ingest documents.
- Chunk and embed content.
- Store chunks in a vector database.
- Retrieve relevant chunks at query time.
- Build a prompt with citations.
- Call the model.
- Return an answer with source links.
- Log feedback and quality signals.
```typescript
// A chunk returned by retrieval, with enough metadata to cite it.
type RetrievedChunk = {
  id: string;
  url: string;
  title: string;
  text: string;
  score: number;
};

export function buildSupportPrompt(question: string, chunks: RetrievedChunk[]): string {
  // Number each source so the model can cite it as [1], [2], ...
  const context = chunks
    .map((chunk, index) => {
      return `[${index + 1}] ${chunk.title}\nURL: ${chunk.url}\n${chunk.text}`;
    })
    .join("\n\n");

  // Instructions first, then the question, then the numbered sources.
  return [
    "Answer the user using only the provided sources.",
    "If the sources do not contain the answer, say you do not know.",
    "Cite sources by bracket number.",
    "",
    `Question: ${question}`,
    "",
    `Sources:\n${context}`,
  ].join("\n");
}
```
This code is not the whole system. It shows a habit interviewers care about: constrain the model, pass sources, and make unsupported answers explicit.
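The "retrieve relevant chunks at query time" step can be sketched in the same spirit. This is a minimal illustration, assuming embeddings are already computed and small enough to hold in memory. The names `EmbeddedChunk`, `ScoredChunk`, `cosineSimilarity`, `retrieveTopK` and the `minScore` threshold are hypothetical, and a real system would usually delegate this search to a vector database.

```typescript
// Hypothetical shape: a chunk plus its precomputed embedding vector.
type EmbeddedChunk = {
  id: string;
  url: string;
  title: string;
  text: string;
  embedding: number[];
};

// Same fields as the chunk, minus the embedding, plus a similarity score.
type ScoredChunk = Omit<EmbeddedChunk, "embedding"> & { score: number };

// Cosine similarity between two vectors of equal length.
// Returns 0 when either vector has zero magnitude.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions do not match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k search: score every chunk, drop weak matches,
// sort by descending score, keep the first k.
function retrieveTopK(
  queryEmbedding: number[],
  chunks: EmbeddedChunk[],
  k: number,
  minScore = 0,
): ScoredChunk[] {
  return chunks
    .map((chunk) => ({
      id: chunk.id,
      url: chunk.url,
      title: chunk.title,
      text: chunk.text,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .filter((chunk) => chunk.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

The `minScore` cutoff matters for the refusal behaviour: if nothing clears the threshold, the prompt gets no sources and the assistant has a clear reason to say it does not know.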
Prepare evals, not just prompts
Prompt iteration is not enough. A serious AI engineer needs evaluation.
At minimum, define:
- Golden questions with expected source documents.
- Refusal cases where the assistant should say it does not know.
- Safety cases for prompt injection or policy-sensitive requests.
- Regression checks before changing prompts, models or retrieval settings.
- Human review for ambiguous quality.
Example eval record:
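```typescript
type EvalCase = {
  id: string;
  question: string;
  requiredSourceUrls: string[];
  mustContain: string[];
  mustNotContain: string[];
};

// Case-insensitive substring check on the answer text only.
// requiredSourceUrls is not checked here; verifying it needs the
// retrieved or cited sources, not just the answer string.
export function scoreAnswer(answer: string, evalCase: EvalCase): boolean {
  const lower = answer.toLowerCase();

  // Every required term must appear in the answer.
  const containsRequired = evalCase.mustContain.every((term) =>
    lower.includes(term.toLowerCase()),
  );

  // No forbidden term may appear in the answer.
  const avoidsForbidden = evalCase.mustNotContain.every(
    (term) => !lower.includes(term.toLowerCase()),
  );

  return containsRequired && avoidsForbidden;
}
```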
In a real product, scoring can be more sophisticated. The interview point is that you measure behaviour across changes. OpenAI's frontier governance framework and safety materials from OpenAI Safety show why model releases and safeguards need structured evaluation, not vibes.
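To make "measure behaviour across changes" concrete, here is a minimal sketch of a regression check that compares two eval runs, for example before and after a prompt or model change. The types `EvalResult` and `RunComparison`, the functions `passRate`, `compareRuns` and `shouldBlockChange`, and the `maxDrop` threshold are all hypothetical names chosen for illustration, not part of any particular eval framework.

```typescript
// Hypothetical result of scoring one eval case in one run.
type EvalResult = { caseId: string; passed: boolean };

type RunComparison = {
  baselinePassRate: number;
  candidatePassRate: number;
  newlyFailing: string[]; // cases that passed in baseline but fail now
};

// Fraction of cases that passed; 0 for an empty run.
function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

// Compare a candidate run against a baseline run of the same suite.
function compareRuns(baseline: EvalResult[], candidate: EvalResult[]): RunComparison {
  const baselinePassed = new Set(
    baseline.filter((r) => r.passed).map((r) => r.caseId),
  );
  const newlyFailing = candidate
    .filter((r) => !r.passed && baselinePassed.has(r.caseId))
    .map((r) => r.caseId);
  return {
    baselinePassRate: passRate(baseline),
    candidatePassRate: passRate(candidate),
    newlyFailing,
  };
}

// Assumed gate policy: block the change if any previously passing case
// now fails, or if the overall pass rate dropped by more than maxDrop.
function shouldBlockChange(comparison: RunComparison, maxDrop = 0): boolean {
  const drop = comparison.baselinePassRate - comparison.candidatePassRate;
  return comparison.newlyFailing.length > 0 || drop > maxDrop;
}
```

Tracking newly failing cases separately from the pass rate is a deliberate choice in this sketch: two runs can share the same pass rate while one has quietly broken a case that used to work.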
Be ready for tradeoff questions
Common prompts:
- When would you use RAG instead of fine-tuning?
- How would you reduce hallucinations?
- How would you handle prompt injection?
- How would you manage model cost?
- How would you compare two model providers?
- What would you log, and what would you avoid logging?
Good answers are conditional. RAG helps when knowledge changes and sources matter. Fine-tuning can help with style, task format or specialised behaviour, but it does not magically add fresh private knowledge. Smaller models may reduce cost and latency but need quality checks. Caching can help repeated queries but can serve stale answers if invalidation is weak.
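The caching tradeoff is easier to discuss with something concrete. The sketch below assumes answers are cached per normalised question and invalidated in two ways: a time-to-live, and a version string that changes whenever the document index is rebuilt. `AnswerCache`, `CacheEntry`, `ttlMs` and `indexVersion` are hypothetical names for illustration.

```typescript
type CacheEntry = {
  answer: string;
  storedAt: number; // epoch milliseconds
  indexVersion: string; // version of the document index used for this answer
};

class AnswerCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Normalise case and whitespace so trivially different questions share a key.
  private key(question: string): string {
    return question.trim().toLowerCase().replace(/\s+/g, " ");
  }

  set(question: string, answer: string, indexVersion: string, now: number = Date.now()): void {
    this.entries.set(this.key(question), { answer, storedAt: now, indexVersion });
  }

  // Returns the cached answer, or undefined if it is missing, expired,
  // or was produced against an older index version.
  get(question: string, indexVersion: string, now: number = Date.now()): string | undefined {
    const k = this.key(question);
    const entry = this.entries.get(k);
    if (!entry) return undefined;
    const expired = now - entry.storedAt > this.ttlMs;
    const stale = entry.indexVersion !== indexVersion;
    if (expired || stale) {
      this.entries.delete(k);
      return undefined;
    }
    return entry.answer;
  }
}
```

Exact-match keys like this only help with repeated questions. Semantic caching on embeddings would hit more often but raises its own correctness questions, which is a reasonable point to raise in the interview.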
For privacy, do not log raw sensitive user data by default. Store enough metadata to debug quality and cost, but redact or hash where appropriate. If the role touches regulated data, ask about retention, access control and audit requirements.
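A minimal sketch of "store metadata, redact the rest" might look like the following. The patterns are deliberately simplistic and are an assumption for illustration only: they catch obvious email addresses and long digit runs, not personal data in general. `QueryLogRecord`, `redact` and `buildLogRecord` are hypothetical names.

```typescript
// Hypothetical log record: quality and cost metadata plus a redacted question.
type QueryLogRecord = {
  requestId: string;
  redactedQuestion: string;
  questionLength: number;
  latencyMs: number;
  promptTokens: number;
  completionTokens: number;
};

// Illustrative patterns only; real PII detection needs far more than this.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// Seven or more digits, optionally separated by single spaces or hyphens.
const LONG_DIGITS_PATTERN = /\d(?:[ -]?\d){6,}/g;

// Replace obvious emails and long numbers with placeholders.
function redact(text: string): string {
  return text
    .replace(EMAIL_PATTERN, "[EMAIL]")
    .replace(LONG_DIGITS_PATTERN, "[NUMBER]");
}

// Build the record to log; the raw question never leaves this function.
function buildLogRecord(input: {
  requestId: string;
  question: string;
  latencyMs: number;
  promptTokens: number;
  completionTokens: number;
}): QueryLogRecord {
  return {
    requestId: input.requestId,
    redactedQuestion: redact(input.question),
    questionLength: input.question.length,
    latencyMs: input.latencyMs,
    promptTokens: input.promptTokens,
    completionTokens: input.completionTokens,
  };
}
```

Whether even a redacted question should be stored depends on the retention and audit requirements mentioned above; in stricter settings the record might keep only the length and token counts.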
Build one serious portfolio project
An AI engineer portfolio project should include:
- A real user workflow.
- Retrieval or tool use.
- Evals with failing cases.
- Observability: latency, token cost, errors.
- A README that explains tradeoffs.
- A demo that can be run without secrets, or clear setup instructions.
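For the observability item, even a small summary over recorded calls goes a long way in a README. The sketch below assumes each model call is logged as a `CallRecord` and that pricing is expressed per 1,000 tokens. The type names, `percentile`, `summarize` and any prices passed in are hypothetical placeholders, not real provider rates.

```typescript
// Hypothetical record of one model call.
type CallRecord = {
  latencyMs: number;
  promptTokens: number;
  completionTokens: number;
  ok: boolean;
};

// Placeholder pricing in USD per 1,000 tokens; supply your provider's rates.
type Pricing = { promptPer1k: number; completionPer1k: number };

// Nearest-rank percentile; returns 0 for an empty list.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] ?? 0;
}

// Aggregate latency percentiles, error rate and estimated cost.
function summarize(records: CallRecord[], pricing: Pricing) {
  const latencies = records.map((r) => r.latencyMs);
  const totalCostUsd = records.reduce(
    (sum, r) =>
      sum +
      (r.promptTokens / 1000) * pricing.promptPer1k +
      (r.completionTokens / 1000) * pricing.completionPer1k,
    0,
  );
  const errors = records.filter((r) => !r.ok).length;
  return {
    calls: records.length,
    p50LatencyMs: percentile(latencies, 50),
    p95LatencyMs: percentile(latencies, 95),
    errorRate: records.length === 0 ? 0 : errors / records.length,
    totalCostUsd,
  };
}
```

Reporting a tail percentile alongside the median is the design choice worth explaining: latency spikes hide in averages.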
Avoid yet another generic PDF chatbot with no evals. A smaller project with honest failure analysis is stronger than a polished demo that cannot explain its own errors.
The market research notes that candidates need proof of production skill because AI has made polished artefacts cheap. That is consistent with HackerRank's real-world skills argument, Built In's AI cheating debate, and discussions around real-codebase interviews.
Continue your prep
Anchor your prep in the exact AI role: