As asked
You are asked to red team a tool-using support agent that reads web pages, calls internal APIs, and drafts customer replies. How would you test it for indirect prompt injection?
Sample answer outline
Start by mapping the trust boundaries: user instructions, retrieved web content, tool outputs, system prompts, and privileged internal data. Build payloads where hostile content appears in external pages or documents, then check whether the agent treats that content as instructions rather than data. A strong answer separates exfiltration, unauthorised tool use, data corruption, and policy bypass as distinct failure classes. Good candidates discuss both manual probing and an eval harness that measures attack success rate across model and prompt versions. The common trap is testing only direct jailbreaks in chat and missing the real agentic path through retrieved content.
Expect these follow-ups
- What signals prove the model followed the web page rather than the user task?
- How would you turn one successful exploit into a regression test?
- Which mitigations belong in prompting and which belong in the tool layer?