As asked
You are not training a model from scratch; you are building a feature on top of a foundation model with prompting, retrieval, and maybe light fine-tuning. How does bias-variance thinking still apply to the choices you make, and how do you tell whether your system is underfitting or overfitting the task?
Sample answer outline
Translate the concept into the applied-AI setting. Underfitting shows up as a system that is too generic: weak prompts, no retrieval, and a model that gives plausible but off-target answers because it lacks the task context. Overfitting shows up as a system tuned so tightly to a handful of examples that it fails on real, varied inputs: brittle few-shot prompts, a fine-tune on a tiny biased dataset, or retrieval that only works for the queries you tested. The applied controls are the modern analogues of regularisation and capacity: richer context and retrieval reduce bias, while a held-out evaluation set, diverse examples, and resisting the urge to over-tune to a demo reduce variance. To tell the two apart, compare scores on the examples you tuned against with scores on inputs the system has never seen: low scores on both point to underfitting, while a large gap between them points to overfitting. The AI-engineer signal is measuring this with an evaluation harness on representative inputs rather than eyeballing a few prompts, and knowing when a prompt change is genuinely better rather than merely fitted to your test cases.
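One way to make the measurement concrete is to score the same system on the examples it was tuned against and on a held-out set, then compare. The sketch below is illustrative only: the names `accuracy`, `diagnose`, `brittle_system` and `generic_system` are hypothetical, exact-match scoring stands in for whatever metric fits the task, and the `target` and `max_gap` thresholds are arbitrary assumptions rather than recommended values.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected answer)


def accuracy(system: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples where the system's output matches exactly."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(1 for text, expected in examples if system(text).strip() == expected)
    return hits / len(examples)


def diagnose(dev_score: float, heldout_score: float,
             target: float = 0.8, max_gap: float = 0.1) -> str:
    """Rough label; the thresholds are illustrative assumptions."""
    if dev_score - heldout_score > max_gap:
        return "overfitting"   # good on tuned examples, worse on unseen ones
    if dev_score < target:
        return "underfitting"  # weak even on the examples it was tuned against
    return "ok"


# Toy data: 'dev' is what the prompt was tuned on, 'heldout' was never looked at.
dev = [("refund for order 12", "billing"), ("reset my password", "account")]
heldout = [("charged twice this month", "billing"), ("cannot log in", "account")]

# Stand-in for a prompt fitted to the demo: it only knows the tuned inputs.
memorised = dict(dev)


def brittle_system(text: str) -> str:
    return memorised.get(text, "other")


# Stand-in for a system with no task context: same generic answer every time.
def generic_system(text: str) -> str:
    return "other"


for name, system in [("brittle", brittle_system), ("generic", generic_system)]:
    label = diagnose(accuracy(system, dev), accuracy(system, heldout))
    print(name, label)
```

In a real harness the stub systems would be replaced by calls to the prompted or fine-tuned model, and the held-out set would be drawn from representative production-like inputs that were never used while iterating on prompts.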
Expect these follow-ups
- How do you build an evaluation set that reveals overfitting to your examples?
- When does adding retrieval reduce bias versus just adding noise?
- Why can a fine-tune on a small dataset make a model worse on real inputs?