As asked
Users are trying to make a policy assistant ignore its instructions and reveal restricted guidance. How do you improve the prompt and the surrounding system?
Sample answer outline
A senior answer does not claim that prompt wording alone solves jailbreaks. Put policy and tool permissions outside the user-controlled text, use a clear instruction hierarchy, and make refusal behaviour specific to the domain. Add retrieval filters, output checks, and tool allowlists so the model cannot access or reveal restricted material just because a user's prompt was persuasive. Build an adversarial eval set from real attempts and track bypass rate by category. Candidates trip up when they write longer scolding instructions but leave the same unsafe tools and documents available.
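To make the outline concrete, a candidate might sketch the enforcement in application code, as below. This is a minimal illustration, not a reference design: the role names, the tool names, the clearance levels, the `Document` type, and all three functions are hypothetical, and a real system would load permissions from its own access-control store.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical permission tables. They live in application config,
# never in text the user (or the model) can rewrite.
TOOL_ALLOWLIST = {
    "employee": {"search_policies"},
    "compliance_officer": {"search_policies", "read_restricted_guidance"},
}
CLEARANCE = {"employee": 0, "compliance_officer": 1}


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    required_clearance: int  # assumed scale: 0 = general, 1 = restricted


def filter_documents(role, docs):
    """Drop documents the caller is not cleared for, before prompt assembly."""
    level = CLEARANCE.get(role, -1)  # unknown roles see nothing
    return [d for d in docs if d.required_clearance <= level]


def authorize_tool_call(role, tool_name):
    """Check a model-requested tool call against the caller's allowlist."""
    return tool_name in TOOL_ALLOWLIST.get(role, set())


def bypass_rate_by_category(results):
    """Compute bypass rate per attack category.

    results: iterable of (category, bypassed) pairs from an adversarial eval run.
    """
    totals = defaultdict(int)
    bypasses = defaultdict(int)
    for category, bypassed in results:
        totals[category] += 1
        bypasses[category] += int(bypassed)
    return {c: bypasses[c] / totals[c] for c in totals}


docs = [
    Document("faq", "Public leave policy", 0),
    Document("r-1", "Restricted guidance", 1),
]
print([d.doc_id for d in filter_documents("employee", docs)])  # ['faq']
print(authorize_tool_call("employee", "read_restricted_guidance"))  # False
print(bypass_rate_by_category(
    [("roleplay", True), ("roleplay", False), ("encoding", False)]
))  # {'roleplay': 0.5, 'encoding': 0.0}
```

The point of the sketch is that both checks run on the caller's authenticated role, so a persuasive message cannot widen what the model can retrieve or call. The per-category rate shows which attack family regressed instead of hiding it in one aggregate number.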
Expect these follow-ups
- What belongs in the system prompt versus application code?
- How do you measure jailbreak resistance without overfitting?
- When should the assistant escalate to a human?