Question 1

You are asked to red team a tool-using support agent that reads web pages, calls internal APIs, and drafts customer replies. How would you test it for indirect prompt injection?

Accepted Answer

Start by mapping the trust boundaries: user instructions, retrieved web content, tool outputs, system prompts, and privileged internal data. Build payloads where hostile content appears in external pages or documents, then check whether the agent treats that content as instructions rather than data. A strong answer separates exfiltration, unauthorised tool use, data corruption, and policy bypass as distinct failure classes. Good candidates discuss both manual probing and an eval harness that measures attack success rate across model and prompt versions. The common trap is testing only direct jailbreaks in chat and missing the real agentic path through retrieved content.

Question 2

Walk me through how you would classify the major families of LLM jailbreaks. What are the key mechanistic differences between, say, a role-play jailbreak, a many-shot jailbreak, and a gradient-based adversarial suffix like those produced by GCG?

Accepted Answer

A strong answer distinguishes semantic jailbreaks (role-play, hypothetical framing, persona injection) from optimization-based attacks (GCG, AutoDAN, PAIR) and from structural exploits (context stuffing, many-shot in-context learning). The candidate should explain why gradient-based suffixes transfer poorly across models, why many-shot attacks are hard to defend with RLHF alone, and how each family maps to different threat actors and deployment risk.

Question 3

A lot of people use prompt injection and jailbreaking interchangeably. How do you distinguish them, and why does the distinction matter when you are writing a threat model for an AI-powered product?

Accepted Answer

A jailbreak targets the model's alignment to override safety behaviors, typically by the end user. Prompt injection targets the application layer: untrusted content in retrieved data or tool outputs hijacks the model's instructions on behalf of a third party. The distinction matters for threat modeling because they have different attack surfaces, different mitigations (system prompt isolation vs. RLHF), and different responsible parties. A strong answer gives a concrete example of indirect prompt injection via a malicious document in a RAG pipeline.

Question 4

A target deployment has a confidential system prompt and you have black-box API access. Walk me through the techniques you would try to extract or reconstruct that system prompt, in order of effort.

Accepted Answer

A strong answer starts with low-effort probes: asking the model to repeat, summarize, or translate its instructions. Next comes indirect elicitation: asking what it is not allowed to do, or prompting in a foreign language where the guard rails may be weaker. More advanced techniques include differential probing (comparing behavior with and without particular phrasings to infer instruction content) and using in-context learning to get the model to roleplay as a chatbot that shows its prompt. The candidate should also mention that some models echo fragments under certain failure modes and that defense involves output filtering plus constitutional fine-tuning.

Question 5

You have just joined a team releasing a new instruction-tuned chat model. You are given two weeks to design and run a safety evaluation suite before launch. Walk me through what you would build and how you would prioritize.

Accepted Answer

A strong answer covers multiple axes: harm category coverage (CSAM, weapons, self-harm, bias, privacy), attack surface coverage (direct jailbreaks, indirect injection, multi-turn, multilingual, code generation), and both automated and human evaluation. The candidate should mention using existing benchmarks (AdvBench, HarmBench, StrongREJECT) alongside custom red-team prompts, using safety classifiers like LLaMA Guard or WildGuard as automated judges, calibrating false-positive and false-negative rates separately, and defining thresholds for launch versus hold. Prioritization should follow expected harm severity and breadth, with chemical/bio uplift and CSAM at the top.

Question 6

A startup is building a legal document assistant that uses GPT-4 with retrieval augmented generation over the client's case files. Walk me through how you would build a threat model for this product.

Accepted Answer

A strong answer uses a structured approach: identify assets (case files, client PII, attorney-client privileged content, the system prompt), identify threat actors (malicious clients, disgruntled insiders, competitor intelligence, nation-state actors), enumerate attack surfaces (client-supplied documents as injection vectors, API access, multi-tenancy isolation). Threats include indirect prompt injection via uploaded documents, cross-tenant data leakage if retrieval is not properly scoped, hallucinated legal citations leading to malpractice, and system prompt extraction leaking the product's IP. STRIDE or PASTA are good frameworks to mention.

Browse by topic

Top ai red team engineer interview questions

Test an agent for indirect prompt injectionRole-specificmediumVery common

As asked

Sample answer outline

Expect these follow-ups

Classify and compare major jailbreak categoriesRole-specificmediumVery common

As asked

Sample answer outline

Expect these follow-ups

Distinguish prompt injection from jailbreakingRole-specificeasyVery common

As asked

Sample answer outline

Expect these follow-ups

Techniques for extracting a hidden system promptRole-specificmediumVery common

As asked

Sample answer outline

Expect these follow-ups

Design a safety evaluation suite for a new chat modelRole-specifichardVery common

As asked

Sample answer outline

Expect these follow-ups

Build a threat model for an LLM-powered productRole-specificmediumVery common

As asked

Sample answer outline

Expect these follow-ups

Also known as

Solve coding problems in a live editor

Practice this role with our tools

Browse by topic

Top ai red team engineer interview questions

Test an agent for indirect prompt injectionRole-specificmediumVery common

As asked

Sample answer outline

Expect these follow-ups

Classify and compare major jailbreak categoriesRole-specificmediumVery common

As asked

Sample answer outline

Expect these follow-ups

Distinguish prompt injection from jailbreakingRole-specificeasyVery common

As asked

Sample answer outline

Expect these follow-ups

Techniques for extracting a hidden system promptRole-specificmediumVery common

As asked

Sample answer outline

Expect these follow-ups

Design a safety evaluation suite for a new chat modelRole-specifichardVery common

As asked

Sample answer outline

Expect these follow-ups

Build a threat model for an LLM-powered productRole-specificmediumVery common

As asked

Sample answer outline

Expect these follow-ups

Also known as

Solve coding problems in a live editor

Practice this role with our tools