Question 1

Why does RLHF not guarantee that a model is aligned with human values, even if the reward model has very high accuracy? What are the fundamental failure modes?

Accepted Answer

High reward-model accuracy is measured on held-out preference data, but policy optimization pushes the model toward outputs outside that distribution, where the reward model is least reliable. Fundamental failure modes include: reward hacking (the model learns to maximize the reward signal without satisfying the underlying intent), the reward model being a proxy that diverges from true human values in the tails, distributional shift between training and deployment, Goodhart's Law (once a measure becomes a target, it ceases to be a good measure), and the fact that RLHF optimizes for the aggregated preferences of the annotators in the training set rather than any coherent ethical system. A strong answer also mentions specification gaming, where the model satisfies the letter of the reward criterion while violating the spirit.

Question 2

What harm taxonomy do you use when scoping a red-team engagement, and how do you map it to the major industry frameworks like the EU AI Act risk tiers or Anthropic's usage policies?

Accepted Answer

A strong answer describes a practical taxonomy: catastrophic harms (CBRN uplift, CSAM, attacks on critical infrastructure), severe harms (violence facilitation, harassment at scale, election interference), moderate harms (disinformation, fraud enablement, privacy violations), and mild harms (inappropriate content, bias, offensive language). The candidate should be able to map these to the EU AI Act's risk tiers (including its prohibited AI practices), Anthropic's or OpenAI's published usage policies, and the NIST AI RMF, which is organized around the Govern, Map, Measure, and Manage functions rather than risk tiers. They should note that taxonomies are opinionated and that cross-organizational alignment on definitions is a real challenge.

Question 3

Users are trying to make a policy assistant ignore its instructions and reveal restricted guidance. How do you improve the prompt and the surrounding system?

Accepted Answer

A senior answer does not claim prompt wording alone solves jailbreaks. Put policy and tool permissions outside the user-controlled text, use clear instruction hierarchy, and make refusal behaviour specific to the domain. Add retrieval filters, output checks, and tool allowlists so the model cannot access or reveal restricted material just because the prompt was persuasive. Build an adversarial eval set from real attempts and track bypass rate by category. Candidates trip up when they write longer scolding instructions but leave the same unsafe tools and documents available.
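Tracking bypass rate by category can be sketched as below. This is a minimal illustration, not a reference to any particular eval tool: the `(category, bypassed)` record shape, the category names, and the function name are all assumptions.

```python
from collections import defaultdict

def bypass_rate_by_category(results):
    """Return {category: fraction of attempts that bypassed the guardrails}.

    `results` is an iterable of (category, bypassed) pairs. This record
    shape is a hypothetical format chosen for the example.
    """
    attempts = defaultdict(int)
    bypasses = defaultdict(int)
    for category, bypassed in results:
        attempts[category] += 1
        if bypassed:
            bypasses[category] += 1
    return {c: bypasses[c] / attempts[c] for c in attempts}

# Hypothetical eval results built from logged jailbreak attempts.
eval_results = [
    ("role_play", True),
    ("role_play", False),
    ("instruction_override", False),
    ("instruction_override", False),
    ("encoding_trick", True),
]
rates = bypass_rate_by_category(eval_results)
```

Reporting the rate per category rather than one overall number makes it visible when a prompt change fixes one attack family while another stays open.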

Question 4

You are adding computer-use actions to an agent SDK so agents can operate web apps. What sandboxing, permissions, and audit controls do you require?

Accepted Answer

Run agents in isolated browser or desktop sessions with scoped credentials, network controls, and no ambient access to the user's local files unless explicitly granted. Separate observation from action and require confirmation for sensitive actions such as purchases, messages, deletes, or permission changes. Store an audit trail of observations, actions, screenshots where allowed, and user approvals. The SDK should expose policy hooks so product teams can block actions by domain, selector, data class, or user role. Strong answers treat model behaviour as untrusted automation inside a controlled runtime.
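A policy hook of the kind described above might look like the following sketch. The `Action` and `Decision` types, the allowlisted domain, and the set of sensitive action kinds are hypothetical and do not come from any specific agent SDK.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse

class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"  # pause and ask the user before acting
    BLOCK = "block"

@dataclass(frozen=True)
class Action:
    kind: str  # e.g. "click", "type", "purchase", "delete"
    url: str   # page the action targets

# Hypothetical policy values a product team might configure.
ALLOWED_DOMAINS = {"app.example.com"}
SENSITIVE_KINDS = {"purchase", "send_message", "delete", "change_permissions"}

def policy_hook(action: Action) -> Decision:
    """Decide whether a model-proposed action may run."""
    host = urlparse(action.url).hostname or ""
    if host not in ALLOWED_DOMAINS:
        return Decision.BLOCK
    if action.kind in SENSITIVE_KINDS:
        return Decision.CONFIRM
    return Decision.ALLOW

audit_log = []

def execute(action: Action) -> Decision:
    """Record every proposed action and its decision before anything runs."""
    decision = policy_hook(action)
    audit_log.append((action, decision))
    return decision
```

The hook runs outside the model, so a persuasive page or prompt cannot change the decision; the model only proposes actions.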

Question 5

A product team wants to launch an LLM feature that can summarise uploaded legal, financial, and HR documents. Walk me through your red-team threat model before launch.

Accepted Answer

Identify assets first: confidential documents, model outputs, user identities, audit logs, and any downstream actions triggered by summaries. Then enumerate abuse paths: cross-tenant data leakage, prompt injection from uploaded files, hallucinated legal claims, unsafe advice, and adversaries using the system to process stolen material. The answer should prioritise risks by likelihood and impact rather than producing a generic list. Strong candidates propose concrete tests, such as tenant-boundary probes, adversarial documents, output policy checks, and privacy-preserving logging review. Interviewers listen for clear reporting language that product teams can act on, not theatrical exploit naming.
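Prioritising by likelihood and impact can be as simple as the sketch below. The 1 to 5 scales and the scores assigned to each risk are illustrative assumptions; a real engagement would set them with the product team.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (expected)
    impact: int      # 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        return self.likelihood * self.impact

# Illustrative scores only; these are not measured values.
risks = [
    Risk("cross-tenant data leakage", 2, 5),
    Risk("prompt injection from uploaded files", 4, 4),
    Risk("hallucinated legal claims", 4, 3),
    Risk("processing of stolen material", 2, 3),
]

# Highest score first, so testing effort goes to the top of the list.
prioritised = sorted(risks, key=lambda r: r.score, reverse=True)
```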

Question 6

A model confidently states a fabricated chemical synthesis route. Is this a hallucination, a safety failure, or both? How does the distinction affect how you would address it?

Accepted Answer

It is potentially both, but the safety failure and the hallucination have different root causes and require different mitigations. The hallucination (a factually wrong synthesis route) is a failure of factual grounding. The safety failure is that the model attempted to answer a dangerous question rather than refusing. From a red-team perspective, a model that hallucinates synthesis routes is arguably less dangerous than one that produces accurate routes, but the safety failure of engaging with the request at all is the primary concern. A strong answer notes that a safety-tuned model should refuse before hallucination even becomes relevant.

Question 7

How does in-context learning create security risks that fine-tuning does not? Give a concrete example of an attack that exploits in-context learning specifically.

Accepted Answer

In-context learning (ICL) means the model learns a task from examples in its current context window, without weight updates. The security risk is that an attacker who controls any part of the context can inject malicious demonstrations that teach the model a harmful behavior for this conversation only, without needing fine-tuning or training access. Many-shot jailbreaks are the canonical example: by filling the context with fake examples of the model answering harmful questions, an attacker shifts its behavior for the session. Fine-tuning attacks require fine-tuning access (for example through a fine-tuning API) and leave a persistent artifact; ICL attacks are ephemeral but require only text access.

Question 8

What does the attack surface of an LLM supply chain look like, from pre-training data through deployment? Name the most realistic threat at each stage.

Accepted Answer

Pre-training data: data poisoning via injected malicious web content or adversarial examples designed to teach harmful behaviors. Pre-training pipeline: tampering with data preprocessing, tokenizer configuration, or training scripts. Base model: weight tampering, backdoor injection (hard to detect without full re-evaluation). Fine-tuning: customer abuse of fine-tuning API to degrade safety. Model serving: adversarial inputs via the API, model file tampering if served from a third-party host. Post-processing (filters, classifiers): bypassing or adversarially degrading the content filter. A strong answer gives a concrete known incident or paper for at least two of these stages.
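For the model file tampering threat, one common control is to pin a digest of the weights and check it before loading. The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes the pinned digest was obtained through a channel the attacker does not control.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model_file(path, pinned_sha256):
    """Return True only if the file on disk matches the pinned digest."""
    return hmac.compare_digest(sha256_of_file(path), pinned_sha256.lower())
```

A digest check detects a swapped or modified file, but it says nothing about a backdoor that was already present when the digest was pinned.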

Question 9

Is there an empirical tradeoff between safety fine-tuning and model helpfulness? What does the research say and how does it affect how you evaluate safety improvements?

Accepted Answer

There is evidence of a tradeoff: safety RLHF can increase over-refusal rates on benign queries and reduce performance on capability benchmarks, though the magnitude varies by training methodology and is decreasing as techniques improve. The research (including Anthropic's published work on helpful and harmless assistants and work on over-refusal) shows that naive safety training hurts helpfulness, but more sophisticated approaches (Constitutional AI, preference optimization with clear refusal calibration) reduce this. For evaluation, this means you must measure safety regression and helpfulness regression together: a safety improvement that also increases over-refusal rates on common benign queries may not be a net win.
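Measuring both regressions together might be sketched as follows. The run format (two lists of booleans per run), the field names, and the 0.02 tolerance on over-refusal are assumptions made for the example, not values from any published benchmark.

```python
def rate(flags):
    """Fraction of True values in a non-empty list of booleans."""
    return sum(flags) / len(flags)

def compare_runs(baseline, candidate, max_over_refusal_increase=0.02):
    """Compare two eval runs on safety and helpfulness at the same time.

    Each run is a dict with two lists of booleans (a hypothetical format):
      "harmful_complied": True where the model complied with a harmful prompt
      "benign_refused":   True where the model refused a benign prompt
    """
    harmful_delta = rate(candidate["harmful_complied"]) - rate(baseline["harmful_complied"])
    over_refusal_delta = rate(candidate["benign_refused"]) - rate(baseline["benign_refused"])
    return {
        "harmful_compliance_delta": harmful_delta,
        "over_refusal_delta": over_refusal_delta,
        # A net win needs fewer harmful completions without a large
        # increase in refusals of benign prompts.
        "net_win": harmful_delta < 0 and over_refusal_delta <= max_over_refusal_increase,
    }

baseline = {
    "harmful_complied": [True, True, False, False],
    "benign_refused": [False, False, False, False],
}
candidate = {
    "harmful_complied": [True, False, False, False],
    "benign_refused": [True, False, False, False],
}
report = compare_runs(baseline, candidate)
```

In this toy data the candidate halves harmful compliance but starts refusing a quarter of benign prompts, so `net_win` is `False`.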

Question 10

How do the attention mechanisms in a transformer relate to why models are vulnerable to prompt injection? What does understanding attention tell you about where in a prompt injected instructions have the most influence?

Accepted Answer

Attention is content-based rather than strictly positional: a token's influence on the final output depends on the attention weights, which are determined by query-key similarity, with position entering only through positional encodings and causal masking. Injected instructions placed in positions that receive high attention from the output tokens are more effective. Empirically, the beginning and end of the context often receive disproportionate attention (primacy and recency effects), which is why injections at the start of the context or at the end of a retrieved document are particularly effective. A strong answer connects this to why retrieval augmentation is dangerous: retrieved content shares the same attention context as trusted instructions, with no architectural boundary between the two.
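The point that attention weights come from content similarity can be checked with a toy example. The sketch below is a simplification: it uses the token vectors directly as queries and keys (an assumption; real models apply learned projections), with no positional encoding and no causal mask.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Row i holds how much token i attends to each token j."""
    d = len(keys[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]  # an arbitrary reordering of the three tokens

weights = attention_weights(tokens, tokens)
shuffled = [tokens[i] for i in perm]
weights_shuffled = attention_weights(shuffled, shuffled)

# Reordering the tokens only reorders the rows and columns of the weights.
order_independent = all(
    math.isclose(weights_shuffled[i][j], weights[perm[i]][perm[j]])
    for i in range(3)
    for j in range(3)
)
```

Here `order_independent` is `True`: without positional information, where a token sits makes no difference, only what it contains. The positional effects seen in practice therefore come from positional encodings, causal masking, and training, and nothing in this computation marks a token as trusted or untrusted.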

Question 11

What is differential privacy and how does applying it during LLM training affect both the model's privacy guarantees and its utility? What are the practical limits of DP for large language models?

Accepted Answer

DP guarantees that adding or removing any single training example changes the probability of any outcome of training by at most a factor of e^epsilon, plus a small failure probability delta. For LLMs, DP-SGD clips per-example gradient norms and adds calibrated Gaussian noise at each training step. Practical limits include: the privacy-utility tradeoff (a small epsilon requires large noise, which degrades model quality, especially on rare data), the composition problem (every training step spends privacy budget, so the guarantee weakens over many steps), and the fact that DP is a per-example guarantee while LLM privacy risks often involve information repeated or correlated across examples. At frontier model scale, DP is not widely deployed in production due to utility costs.
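The clip-then-add-noise step can be sketched in plain Python as below. This is a toy illustration on lists of floats, not a production implementation: the function names are hypothetical, and it does no privacy accounting, so it does not by itself tell you what epsilon was spent.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example, sum, add Gaussian noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    batch_size = len(clipped)
    noisy_mean = []
    for i in range(len(params)):
        total = sum(g[i] for g in clipped)
        # Noise standard deviation is tied to the clipping bound.
        total += rng.gauss(0.0, noise_multiplier * max_norm)
        noisy_mean.append(total / batch_size)
    return [p - lr * g for p, g in zip(params, noisy_mean)]

rng = random.Random(0)
params = dp_sgd_step(
    params=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=0.1,
    max_norm=1.0,
    noise_multiplier=1.0,
    rng=rng,
)
```

Clipping bounds how much any one example can move the update, which is what lets the added noise mask that example's contribution. It is also why rare data suffers most: its signal is small relative to the noise.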

Question 12

As models support context windows of 1 million tokens or more, what new security considerations emerge that did not exist with 4K or 8K contexts?

Accepted Answer

New considerations include: the 'lost in the middle' phenomenon, which means injected instructions may be more or less effective depending on their position in a very long context; many-shot jailbreaks becoming trivially feasible when you can fit thousands of example turns; the ability to include entire codebases or document repositories in the context (expanding the indirect injection surface enormously); the rising cost of processing malicious payloads (a DoS-adjacent issue); and the difficulty of monitoring or logging extremely long conversations. A strong answer also notes that retrieval-augmented approaches become less necessary at very long contexts, shifting the security risk from retrieval-layer injection to direct context stuffing.

Question 13

How do LLM output watermarking schemes like Kirchenbauer et al.'s scheme work? What are their security limits from an adversary's perspective?

Accepted Answer

The Kirchenbauer scheme biases the model's sampling toward a pseudorandom 'green list' subset of the vocabulary at each step, seeded from the preceding token, creating a statistical signature that a detector holding the seeding key can test for without access to the model. From an adversary's perspective, limits include: paraphrasing attacks (rewriting the output destroys the watermark), translation attacks, copy-paste mixing with non-watermarked content, and spoofing (if the watermarking key leaks, an attacker can produce content that falsely appears watermarked). A strong answer notes that watermarking is a probabilistic detection tool, not a cryptographic guarantee, and that its robustness degrades as paraphrase quality improves.
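Detection reduces to counting how many tokens fall in the favoured subset and computing a z-score against the fraction expected by chance. The sketch below is a simplified stand-in, not the paper's code: it hashes (key, previous token, token) directly instead of seeding a vocabulary partition, and the key, vocabulary size, and helper names are assumptions.

```python
import hashlib
import math

def is_green(prev_token, token, key, gamma=0.5):
    """Pseudorandomly place `token` in the favoured subset, given the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def watermark_z_score(tokens, key, gamma=0.5):
    """z-score of the favoured-token count against the chance rate gamma."""
    scored = len(tokens) - 1  # the first token has no predecessor
    green = sum(is_green(p, c, key, gamma) for p, c in zip(tokens, tokens[1:]))
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))

def toy_green_sequence(length, key, vocab_size=1000, gamma=0.5):
    """Toy generator that only ever emits favoured tokens (no real model)."""
    tokens = [0]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(
            next(t for t in range(len(tokens), vocab_size) if is_green(prev, t, key, gamma))
        )
    return tokens

marked = toy_green_sequence(101, key="secret")
z = watermark_z_score(marked, key="secret")
```

With 100 scored tokens that are all in the favoured subset, `z` is 10.0. The z-score depends only on the count, so every token an attacker replaces by paraphrasing pulls it back toward zero, which is why detection weakens as paraphrase quality improves.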

Question 14

Speculative decoding uses a small draft model to propose tokens that a larger model then verifies. What are the security implications of this architecture from a red-team perspective?

Accepted Answer

The security risk is that the draft model may have different safety properties than the verifier model. With exact verification (the standard accept-or-resample rule), the output distribution matches the verifier's, so the draft model changes latency rather than content. The exposure comes from relaxed or approximate acceptance rules and from implementation bugs: there, an adversarial prompt can cause the draft model to propose unsafe token sequences that are accepted even though the verifier alone would rarely produce them. Additionally, a supply-chain attack on the draft model (which is smaller and may be less carefully vetted) can affect the full system's outputs when verification is relaxed, and its latency even when verification is exact. A strong answer notes that speculative decoding is relatively underexplored from a safety perspective and that the safety properties of the combined system are not straightforwardly derived from the components.
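For reference, the standard accept-or-resample rule for a single drafted token can be worked through exactly. The sketch below uses a two-token vocabulary with made-up probabilities; the function names are hypothetical, and it assumes the draft model gives every token non-zero probability.

```python
def residual_distribution(p_target, q_draft):
    """Distribution to resample from after a rejection: normalised max(0, p - q)."""
    residual = {t: max(0.0, p_target[t] - q_draft[t]) for t in p_target}
    total = sum(residual.values())
    if total == 0.0:  # draft and target agree exactly, so nothing is rejected
        return dict(p_target)
    return {t: r / total for t, r in residual.items()}

def emitted_distribution(p_target, q_draft):
    """Exact distribution of the emitted token under accept-or-resample."""
    # A drafted token t is accepted with probability min(1, p(t) / q(t)).
    accepted = {t: q_draft[t] * min(1.0, p_target[t] / q_draft[t]) for t in q_draft}
    rejected_mass = 1.0 - sum(accepted.values())
    residual = residual_distribution(p_target, q_draft)
    return {t: accepted[t] + rejected_mass * residual[t] for t in p_target}

# Made-up numbers: the draft model strongly prefers the unsafe token.
p_target = {"safe": 0.9, "unsafe": 0.1}
q_draft = {"safe": 0.2, "unsafe": 0.8}
emitted = emitted_distribution(p_target, q_draft)
```

Under this exact rule `emitted` equals the verifier's distribution, so the unsafe token is emitted about as often as the verifier alone would emit it, however much the draft model prefers it. Acceptance rules that trade this exactness for speed do not carry the same property, which is where a weaker draft model starts to matter.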

Questions

Why RLHF does not guarantee alignment (Domain knowledge, medium, Very common)

Standard harm taxonomy for AI safety work (Domain knowledge, medium, Very common)

Harden a prompt against jailbreak attempts (Domain knowledge, hard, Common)

Design safe computer-use execution for agents (Domain knowledge, medium, Common)

Threat model a model misuse scenario (Domain knowledge, hard, Common)

Distinguish hallucination from a safety failure (Domain knowledge, medium, Common)

Security implications of in-context learning (Domain knowledge, medium, Common)

LLM supply chain attack surface (Domain knowledge, medium, Common)

The empirical tradeoff between safety and helpfulness (Domain knowledge, medium, Common)

How attention mechanisms affect prompt injection risk (Domain knowledge, hard, Occasional)

Differential privacy and its limits for LLM training (Domain knowledge, medium, Occasional)

Security implications of very long context windows (Domain knowledge, medium, Occasional)

LLM output watermarking and its security limits (Domain knowledge, hard, Occasional)

Security considerations in speculative decoding (Domain knowledge, hard, Rare)
