As asked
How would you build and maintain an evaluation set for jailbreak resistance in a customer-facing AI assistant?
Sample answer outline
Separate broad coverage from high-signal regressions: one set should cover known jailbreak families, while another should contain failures your product has actually seen. Label expected behaviour precisely (for example, "refuse and offer a safe alternative" rather than "handle safely"), because vague labels make eval scores meaningless. Include multi-turn attacks, role-play, encoding tricks, tool-use attempts, and benign prompts that look suspicious, so that tuning against the set does not push the system toward over-refusal. Strong answers mention versioning, holdout sets, scorer calibration, and review by policy or legal stakeholders when categories are sensitive. A weak answer treats jailbreak resistance as one static prompt list.
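One way to make the outline concrete in an interview is to sketch the case schema and the two numbers it yields. The Python below is a minimal illustration under stated assumptions, not a reference implementation: `EvalCase`, `Expected`, `summarize`, the suite and family names, and the placeholder prompts are all hypothetical, and the scorer that produces `OBSERVED` is assumed to exist elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Expected(Enum):
    REFUSE = "refuse"  # adversarial prompt: the assistant should decline
    COMPLY = "comply"  # benign look-alike: the assistant should help


@dataclass(frozen=True)
class EvalCase:
    """One labelled case (hypothetical schema, for illustration only)."""
    case_id: str
    suite: str      # "coverage" (known families) or "regression" (seen in production)
    family: str     # e.g. "role_play", "encoding", "tool_use", "benign"
    turns: tuple    # one string per user turn, so multi-turn attacks fit
    expected: Expected
    holdout: bool = False  # held-out cases are never used for tuning


def summarize(cases, observed):
    """Return attack-success and over-refusal rates per suite.

    `observed` maps case_id to the Expected value a scorer assigned to the reply.
    A rate is None when the suite has no cases of that kind.
    """
    counts = {}
    for case in cases:
        bucket = counts.setdefault(
            case.suite,
            {"attacks": 0, "attack_successes": 0, "benign": 0, "over_refusals": 0},
        )
        got = observed[case.case_id]
        if case.expected is Expected.REFUSE:
            bucket["attacks"] += 1
            if got is Expected.COMPLY:
                bucket["attack_successes"] += 1
        else:
            bucket["benign"] += 1
            if got is Expected.REFUSE:
                bucket["over_refusals"] += 1

    report = {}
    for suite, b in counts.items():
        report[suite] = {
            "attack_success_rate": (
                b["attack_successes"] / b["attacks"] if b["attacks"] else None
            ),
            "over_refusal_rate": (
                b["over_refusals"] / b["benign"] if b["benign"] else None
            ),
        }
    return report


# Placeholder prompts only; real cases would hold the actual conversation turns.
CASES = [
    EvalCase("c1", "coverage", "role_play", ("<turn 1>", "<turn 2>"), Expected.REFUSE),
    EvalCase("c2", "coverage", "encoding", ("<encoded request>",), Expected.REFUSE),
    EvalCase("c3", "coverage", "benign", ("<suspicious-looking but fine>",), Expected.COMPLY),
    EvalCase("r1", "regression", "tool_use", ("<past incident>",), Expected.REFUSE, holdout=True),
]

# Hypothetical scorer output for one model version.
OBSERVED = {
    "c1": Expected.REFUSE,
    "c2": Expected.COMPLY,
    "c3": Expected.REFUSE,
    "r1": Expected.REFUSE,
}

REPORT = summarize(CASES, OBSERVED)
```

Reporting attack success and over-refusal side by side, split by suite, keeps a falling attack rate from hiding a rise in refused benign prompts. Versioning and scorer calibration are left out of the sketch.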
Expect these follow-ups
- How do you prevent overfitting to the public jailbreak set?
- What metric would you report to executives?
- How do you handle prompts where reviewers disagree on the correct response?