Question 1

Explain the bias-variance tradeoff. Use a concrete worked example, not just the definitions.

Accepted Answer

Bias is error from oversimplifying assumptions; variance is error from sensitivity to the particular training sample, which is what overfitting to noise looks like. Concrete example: predicting house prices. A constant model (predict the training mean) has very high bias but very low variance, because the mean barely changes from one training sample to the next. A degree-15 polynomial on 100 points has very low bias but huge variance: it tracks the training data closely, noise included, and fails on new data. The optimum is somewhere in between. Mitigation: regularisation (L1/L2), cross-validation to estimate the curve, ensembling to average out variance.
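A small simulation can make the two error sources concrete. The sketch below rests on assumptions that are not part of the answer above: a synthetic target y = x plus Gaussian noise, and a 1-nearest-neighbour predictor standing in for the high-variance model (in place of the degree-15 polynomial) so that it runs with only the standard library. All function names are illustrative.

```python
import random
import statistics

def make_training_set(rng, n=20, noise=1.0):
    """Draw n points from the assumed ground truth y = x + noise, x in [0, 1]."""
    xs = [rng.random() for _ in range(n)]
    ys = [x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    """High-bias model: ignore x0 and return the training mean."""
    return statistics.fmean(ys)

def predict_1nn(xs, ys, x0):
    """High-variance model: return the label of the closest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bias_sq_and_variance(predict, x0, true_y, trials=500, seed=0):
    """Refit on many fresh training sets and decompose the error at x0."""
    rng = random.Random(seed)
    preds = [predict(*make_training_set(rng), x0) for _ in range(trials)]
    bias_sq = (statistics.fmean(preds) - true_y) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance

x0 = 0.9  # query point; the noiseless target there is 0.9
mean_bias_sq, mean_var = bias_sq_and_variance(predict_mean, x0, true_y=x0)
nn_bias_sq, nn_var = bias_sq_and_variance(predict_1nn, x0, true_y=x0)
print(f"mean model: bias^2={mean_bias_sq:.3f} variance={mean_var:.3f}")
print(f"1-NN model: bias^2={nn_bias_sq:.3f} variance={nn_var:.3f}")
```

The mean model should show the larger squared bias and the 1-NN model the larger variance, which is the tradeoff in miniature.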

Question 2

Explain overfitting and underfitting in plain terms, and tell me how you would notice that a model is overfitting. Keep it concrete.

Accepted Answer

Frame the tradeoff through its everyday symptoms rather than the formal decomposition. Underfitting is when a model is too simple to capture the pattern, so it does poorly on both the training data and new data. Overfitting is when a model memorises the training data, including its noise, so it looks great on training but does badly on data it has not seen. You spot overfitting by a large gap between training accuracy and validation accuracy. Mention the simplest fixes an earlier-career engineer reaches for: get more data, use a simpler model, or add regularisation. A concrete picture, such as a wiggly curve threading every training point, lands better than equations here.

Question 3

You are building a classifier where the positive class is one percent of the data, say fraud or a rare disease. Why is accuracy the wrong metric, and how do you actually evaluate and tune the model?

Accepted Answer

On a one percent positive rate a model that predicts negative for everyone scores 99 percent accuracy and is useless, so accuracy hides total failure on the class you care about. Evaluate with metrics that focus on the positive class: precision and recall, and the precision-recall curve, which is more informative than the ROC curve under heavy imbalance because it does not get flattered by the huge number of true negatives. Which way you trade precision against recall is a business decision, missing a fraud case versus annoying a good customer, so tune the decision threshold to that cost rather than defaulting to 0.5. For training, options include class weighting or resampling, but be honest that resampling changes the base rate and you must calibrate or correct probabilities afterward if you need them to mean something. The signal is choosing metrics and a threshold that reflect the real cost of each error, not a single headline number.
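The accuracy trap and the effect of moving the threshold can be checked with a few lines of arithmetic. This is a minimal sketch on made-up scores (10 positives among 1,000 examples); the score values and function names are hypothetical, chosen only to illustrate the point.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def summarize(labels, scores, threshold):
    """Return accuracy, precision, and recall at the given threshold."""
    tp, fp, fn, tn = confusion_counts(labels, scores, threshold)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical data: 10 positives (1%) and 990 negatives.
pos_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1]
neg_scores = [0.05] * 970 + [0.35] * 15 + [0.65] * 5
labels = [1] * len(pos_scores) + [0] * len(neg_scores)
scores = pos_scores + neg_scores

always_negative = summarize(labels, scores, threshold=1.01)
default_cut = summarize(labels, scores, threshold=0.5)
recall_oriented = summarize(labels, scores, threshold=0.3)
for name, result in [("always negative", always_negative),
                     ("threshold 0.5", default_cut),
                     ("threshold 0.3", recall_oriented)]:
    print(name, "accuracy=%.3f precision=%.3f recall=%.3f" % result)
```

On this toy data the always-negative model and the 0.5 threshold both reach 99 percent accuracy, yet one catches no positives and the other catches half, and lowering the threshold buys recall at the cost of precision.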

Question 4

Walk me through what Constitutional AI is, how it differs from standard RLHF with a human reward model, and what practical problems at scale it was designed to solve.

Accepted Answer

A strong answer explains that CAI uses a written set of principles and has the model critique and revise its own outputs, reducing reliance on human labelers for the harmlessness dimension. RLHF trains a reward model from human preference data and uses PPO or similar to optimize against it. CAI at scale reduces the bottleneck of human annotation for safety feedback and makes the value alignment more inspectable since the constitution is readable text. The candidate should mention the SL-CAI and RL-CAI stages.

Question 5

A new creator posts their first video on TikTok. There is no watch history, no engagement data, nothing. How does your recommendation system decide who to show this video to, and how does it bootstrap the feedback loop?

Accepted Answer

Strong answers address both the item cold start (no engagement data) and the user cold start (new user). For item cold start: use content-based signals (audio, visuals, text description, creator profile), show the video to a small random sample from the creator's existing follower graph, then use early engagement signals (first 100 views) to decide whether to expand distribution. For user cold start: use onboarding interest selection, geographic defaults, or implicit signals from the referral source. The candidate should mention the risk of seeding the wrong distribution early.

Question 6

You are setting up a machine learning platform for a team of 10 data scientists who will train dozens of models a week. Walk me through how you would use MLflow on Databricks to track experiments, compare runs, register the best model, and gate promotion from staging to production.

Accepted Answer

A strong answer covers: logging parameters, metrics, and artifacts with mlflow.log_param, mlflow.log_metric, and mlflow.log_artifact inside training runs; using the experiment UI or mlflow.search_runs to compare across runs; registering the best run to the Model Registry with mlflow.register_model; gating promotion from staging to production with manual or automated approval (e.g., a CI job that runs evaluation metrics before promoting); and serving the production model via Databricks Model Serving endpoints. The candidate should mention that model aliases in the newer MLflow API replace stage transitions.

Question 7

Neural networks are popular - but when is a neural network the wrong tool for the job?

Accepted Answer

Wrong tool when: data is small (under ~10k examples; gradient boosted trees usually win on tabular data at that scale); interpretability is required (regulated industries, medical decisions, anything that needs SHAP-style attribution that holds up to scrutiny); inference latency or memory is tight and a network would not fit the budget; features are mostly categorical and high-cardinality; the cost of false positives is asymmetric and you need calibrated probabilities. XGBoost, LightGBM, and CatBoost on engineered features are still state of the art for most tabular problems.

Question 8

You need to serve a 70B-parameter open-source LLM at under 100ms p95 latency for a single-token completion. Walk me through how you would approach it.

Accepted Answer

Hardware: A100/H100 or H200 GPUs. The model in 16-bit needs ~140GB for weights alone, so either shard it across GPUs with tensor parallelism or quantise to int8/int4 so it fits on one GPU. Software: vLLM or TGI with paged attention. Latency tricks: prefill batching, speculative decoding with a small draft model (which helps multi-token decode rather than a single-token completion), KV-cache reuse for common prompts. Discuss the prefill vs decode tradeoff (prefill is compute-bound, decode is memory-bandwidth-bound). Geo-distributed deployment for global p95. For a single-token completion the prefill dominates, so latency scales with prompt length; heavily quantised models with FlashAttention can hit <50ms on H100 for short prompts.
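The ~140GB figure is plain arithmetic on parameter count and bytes per parameter. The sketch below counts weights only and assumes an 80GB GPU for the fit check; both the helper names and the 80GB figure are illustrative assumptions, and real deployments also need room for the KV cache and activations.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, dtype):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

def fits_on_one_gpu(n_params, dtype, gpu_memory_gb=80.0):
    """Weights-only check; leaves no margin for KV cache or activations."""
    return weight_memory_gb(n_params, dtype) <= gpu_memory_gb

N_PARAMS = 70e9
for dtype in ("fp16", "int8", "int4"):
    gb = weight_memory_gb(N_PARAMS, dtype)
    print(f"{dtype}: {gb:.0f} GB, fits on one 80GB GPU: {fits_on_one_gpu(N_PARAMS, dtype)}")
```

Note that int8 fits an 80GB card only barely, so int4 or a second GPU is the more realistic choice once the KV cache is included.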

Question 9

Design a feature store that serves features to both training jobs and online inference, with consistent semantics between batch and realtime.

Accepted Answer

Two storage tiers: offline (a columnar warehouse like BigQuery or Iceberg, used by training jobs) and online (a low-latency KV like Redis or DynamoDB, used by serving). The core invariant: a feature should produce the same value at training time and serving time, otherwise you get training-serving skew. Solutions: define features as code (e.g. SQL or DSL), compute them once and write to both tiers; use point-in-time joins for backfills so you don't leak future data; tag features with freshness SLAs and version them.
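A point-in-time join is the piece that most often goes wrong, so it is worth sketching. The example below assumes feature values are stored as a time-sorted history per entity and uses integer day numbers as timestamps; the feature name, data, and helper functions are hypothetical. Whether a value stamped exactly at the label time counts as available is a design choice; here it does.

```python
from bisect import bisect_right

def point_in_time_value(history, as_of):
    """Return the latest feature value whose timestamp is <= as_of.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet at as_of.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None

def build_training_rows(label_events, histories):
    """Attach to each (entity, label_time, label) the feature as of label_time."""
    rows = []
    for entity, label_time, label in label_events:
        feature = point_in_time_value(histories.get(entity, []), label_time)
        rows.append((entity, label_time, feature, label))
    return rows

# Hypothetical feature "orders_last_30d" for one user, keyed by day number.
histories = {"user_1": [(1, 3), (5, 7), (9, 12)]}
label_events = [("user_1", 0, 0), ("user_1", 6, 1), ("user_1", 9, 0)]
rows = build_training_rows(label_events, histories)
for row in rows:
    print(row)
```

The label at day 6 sees the value written on day 5, never the one written on day 9, which is the property that prevents future data from leaking into a backfill.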

Question 10

You are not training a model from scratch, you are building a feature on top of a foundation model with prompting, retrieval, and maybe light fine-tuning. How does bias-variance thinking still apply to the choices you make, and how do you tell whether your system is underfitting or overfitting the task?

Accepted Answer

Translate the concept into the applied-AI setting. Underfitting shows up as a system that is too generic: weak prompts, no retrieval, and a model that gives plausible but off-target answers because it lacks the task context. Overfitting shows up as a system tuned so tightly to a handful of examples that it fails on real, varied inputs: brittle few-shot prompts, a fine-tune on a tiny biased dataset, or retrieval that only works for the queries you tested. The applied controls are the modern analogues of regularisation and capacity: richer context and retrieval reduce bias, while a held-out evaluation set, diverse examples, and resisting over-tuning to a demo reduce variance. The AI-engineer signal is measuring this with an evaluation harness on representative inputs rather than eyeballing a few prompts, and knowing when a prompt change is genuinely better versus fitted to your test cases.

Question 11

Your model scores beautifully offline but falls apart in production. You suspect data leakage. Explain what leakage is, the common ways it sneaks in, and how you would build the pipeline so it cannot happen.

Accepted Answer

Leakage is when information that would not be available at prediction time leaks into training, so the model looks great offline and collapses in the real world. The classic forms: fitting a scaler or encoder on the full dataset before splitting, so test statistics bleed into training; including a feature that is a proxy for the label or is only populated after the outcome is known; and time leakage, where a row uses information from the future relative to when the prediction would actually be made. The structural fixes are to split first and fit every transform inside the training fold only, then apply it to validation and test, ideally wrapped in a pipeline so the discipline is enforced rather than remembered. For time-based problems, use point-in-time correct joins and time-ordered splits so you never train on the future. The strong answer treats the suspiciously high offline score as a symptom to investigate, not a result to trust.
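The split-first, fit-on-train-only rule can be shown with a hand-rolled scaler. This is a minimal sketch using only the standard library; TrainOnlyScaler is an illustrative class, not a library API, and the data is made up so that the held-out value is an obvious outlier.

```python
import statistics

class TrainOnlyScaler:
    """Standardise values using statistics learned from the fitted data only."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        # Fall back to 1.0 so a constant feature does not divide by zero.
        self.std = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

feature = [1.0, 2.0, 3.0, 4.0, 100.0]

# Correct: split first (time-ordered here), then fit on the training fold.
train, test = feature[:4], feature[4:]
scaler = TrainOnlyScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Leaky: fitting on the full dataset lets the test value shift the statistics.
leaky = TrainOnlyScaler().fit(feature)

print("train-only mean:", scaler.mean, "leaky mean:", leaky.mean)
```

The leaky mean moves from 2.5 to 22.0 purely because of the held-out point, which is the test set bleeding into training. Wrapping the fit and transform steps in one pipeline object makes the correct order the default.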

Question 12

You have shipped a model to production. Labels arrive days or weeks later, if at all. How do you know whether the model is still healthy, and what do you alert on when you cannot just watch accuracy in real time?

Accepted Answer

Because ground-truth labels are delayed, you cannot rely on live accuracy, so you monitor proxies that move before performance visibly degrades. Watch the input side for data drift: feature distributions shifting away from training, a spike in missing values, or new categorical values the model never saw. Watch the output side for prediction drift: the score distribution sliding, or the rate of a given class changing without a business reason. Keep operational monitoring too, latency and error rates, since a broken feature pipeline often shows up there first. When labels do land, compute true performance on that lagged window and track it over time, and run periodic backtests. Alert on drift thresholds and on pipeline anomalies rather than waiting for accuracy to crater. The mature answer separates the fast signals you can watch now from the slow ground truth, and closes the loop by feeding late labels back for evaluation and retraining.
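One common way to turn "feature distributions shifting away from training" into a number is the population stability index (PSI). The sketch below is one possible drift score, not the only one; the bin edges, the sample data, and the 0.25 alert level (a widely used rule of thumb) are assumptions for illustration.

```python
import math
from bisect import bisect_right

def bin_fractions(values, bin_edges, eps=1e-6):
    """Fraction of values per bin; eps avoids log(0) for empty bins."""
    counts = [0] * (len(bin_edges) + 1)
    for v in values:
        counts[bisect_right(bin_edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]

def population_stability_index(expected, actual, bin_edges):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    e = bin_fractions(expected, bin_edges)
    a = bin_fractions(actual, bin_edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values: training sample vs two live windows.
training = [i / 100 for i in range(100)]        # spread over [0, 1)
live_same = list(training)                      # no drift
live_shifted = [0.5 + i / 200 for i in range(100)]  # mass moved to [0.5, 1)
edges = [0.25, 0.5, 0.75]

psi_same = population_stability_index(training, live_same, edges)
psi_shifted = population_stability_index(training, live_shifted, edges)
ALERT_THRESHOLD = 0.25  # rule-of-thumb level, tune per feature
print(f"no drift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

The same function can be applied to the model's score distribution to watch for prediction drift.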

Question 13

Your recommendation system works well for established users and items, but it has nothing useful to show a brand new user or to surface a freshly added item. How do you handle cold start on both sides?

Accepted Answer

Cold start is two problems. For a new user with no interaction history, fall back to signals you do have: popularity and trending items, anything from onboarding such as stated interests, and contextual cues like location or referral source, then shift to personalised collaborative signals as interactions accumulate. For a new item with no interactions, lean on content features, its attributes, text, or embeddings, so it can be matched to users by similarity to items they already like, rather than waiting for engagement data it does not have yet. A hybrid that blends content-based and collaborative signals degrades gracefully at the edges instead of returning nothing. Add deliberate exploration so new items get a fair chance to be shown and gather the feedback they need, otherwise popular items starve everything new. The strong answer names both sides explicitly and treats cold start as a smooth handoff from content and context to behaviour, not a switch that flips once.

Question 14

Given a raw dataset of user events with timestamps, how do you think about turning it into features for a model that predicts churn? Show me how you reason about it rather than listing transformations.

Accepted Answer

Start from the prediction and the user, not the columns. For churn the useful signals are about engagement over time, so derive recency, frequency, and trend features: how long since the last meaningful action, how often the user acts in a window, and crucially whether activity is rising or falling, since a declining trend predicts churn better than a static count. Aggregate raw events into windows like last 7 and last 30 days so the model sees behaviour, not individual rows, and be strict that every window only uses data available before the prediction point to avoid leakage. Encode categoricals thoughtfully, watching cardinality, and capture lifecycle context like tenure. Throughout, prefer features a domain person would expect to matter and that you can explain, and validate that a feature actually improves held-out performance rather than adding everything and hoping. The signal is reasoning from the user behaviour you are trying to capture, with leakage discipline baked in, not reciting a transformation checklist.
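The recency, frequency, and trend features described above can be sketched for a single user as follows. Timestamps are integer day numbers and the feature names are illustrative assumptions; the key point is that every event on or after the prediction day is dropped first.

```python
def churn_features(event_days, prediction_day):
    """Windowed engagement features using only events before prediction_day."""
    past = [d for d in event_days if d < prediction_day]  # leakage guard

    def count_between(start, end):
        return sum(1 for d in past if start <= d < end)

    last_7 = count_between(prediction_day - 7, prediction_day)
    prev_7 = count_between(prediction_day - 14, prediction_day - 7)
    return {
        "days_since_last_event": prediction_day - max(past) if past else None,
        "events_last_7d": last_7,
        "events_last_30d": count_between(prediction_day - 30, prediction_day),
        # Negative values mean activity is falling week over week.
        "trend_7d": last_7 - prev_7,
    }

# Hypothetical user: activity thins out before day 100; day 101 is the future.
days = [70, 75, 80, 85, 88, 90, 93, 101]
features = churn_features(days, prediction_day=100)
print(features)
```

The event on day 101 never reaches any window, and the negative trend captures the decline that a static 30-day count would hide.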

Question 15

Explain how INT8 post-training quantization works for a transformer model, and describe what AMD hardware features make INT8 inference faster than FP32 on MI300X.

Accepted Answer

Post-training quantization maps float32 weights to int8 by finding scale and zero-point parameters per tensor or per channel that minimize the quantization error. Activation quantization requires calibration data to capture typical activation ranges. On MI300X, the Matrix Core units execute INT8 GEMM at roughly 2x the throughput of BF16 GEMM on the same hardware (approximately 2615 TOPS INT8 vs 1307 TFLOPS BF16, peak dense), because the same multiplier lanes process more values per cycle with narrower data types; the gap to FP32 is larger still, and INT8 weights need a quarter of the memory bandwidth of FP32 weights. The candidate should also mention that attention during decode is memory-bound, so KV cache size matters more there than INT8 weight quantization, and that per-channel quantization typically recovers more accuracy than per-tensor at the cost of more complex dequantization.
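The per-tensor versus per-channel difference is easy to see on a tiny weight matrix. The sketch below uses the symmetric variant of quantization (zero-point fixed at 0, scale = max|w| / 127), which is a simplifying assumption; the weights and helper names are made up for illustration.

```python
def quantize_symmetric(values):
    """Map floats to int8 range [-127, 127] with a single scale, zero-point 0."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

def mean_abs_error(original, restored):
    return sum(abs(o - r) for o, r in zip(original, restored)) / len(original)

# Two output channels with very different magnitudes (hypothetical weights).
weights = [[0.01, -0.02, 0.015],
           [1.0, -2.0, 1.5]]
flat = [w for row in weights for w in row]

# Per-tensor: one scale shared by every weight.
q_tensor, s_tensor = quantize_symmetric(flat)
per_tensor_err = mean_abs_error(flat, dequantize(q_tensor, s_tensor))

# Per-channel: each row gets its own scale.
restored = []
for row in weights:
    q_row, s_row = quantize_symmetric(row)
    restored.extend(dequantize(q_row, s_row))
per_channel_err = mean_abs_error(flat, restored)

print(f"per-tensor error: {per_tensor_err:.5f}, per-channel error: {per_channel_err:.5f}")
```

With a shared scale the small-magnitude channel collapses onto one or two integer levels, while a per-channel scale lets it use the full int8 range.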

Question 16

Imagine you are on the team responsible for Claude's safety behaviors. We've shipped a new version and our internal evals show refusal rate on clearly benign requests has gone up 15%. Walk me through how you would diagnose the root cause and what interventions you would consider.

Accepted Answer

A strong answer starts with slicing the eval data to find which categories of benign requests are being over-refused, distinguishing false positives from legitimate caution. The candidate should consider whether the issue is in RLHF reward signal, the constitution, or a classifier threshold. Interventions include targeted preference data, refining the constitution clauses that are triggering false positives, or adjusting classifier thresholds with held-out test data. The answer should show understanding that over-refusal is a real cost, not a safe default.

Question 17

Claude supports context windows up to 200k tokens. Walk me through the computational and memory challenges that creates for attention, and describe at least two techniques used to make long-context inference practical.

Accepted Answer

Attention is O(n^2) in sequence length for both compute and memory in the naive case. At 200k tokens the KV cache alone can reach tens of GB per request summed across layers, even with grouped-query attention. Techniques include FlashAttention (tiled, IO-aware recomputation that avoids materializing the full attention matrix), multi-query or grouped-query attention to reduce KV cache size, and quantization of the KV cache. A strong answer also mentions how positional encodings like RoPE are extended or extrapolated for lengths beyond training, since this is a known challenge.
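The KV cache size follows from a short formula: two tensors (K and V) per layer, each of shape sequence length by KV heads by head dimension. The sketch below assumes a hypothetical 80-layer model with 128-dimensional heads stored in 16-bit; these numbers are illustrative and do not describe any specific production model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Total KV cache size for one request across all layers."""
    # Factor of 2 covers the separate K and V tensors stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

SEQ_LEN = 200_000
# Hypothetical config: 80 layers, head_dim 128, 16-bit cache.
mha_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=SEQ_LEN)
gqa_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"multi-head (64 KV heads): {mha_bytes / 1e9:.1f} GB")
print(f"grouped-query (8 KV heads): {gqa_bytes / 1e9:.1f} GB")
```

Under these assumptions grouped-query attention cuts the cache by 8x, and quantizing the cache to 8-bit would halve it again.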

Question 18

The Chinchilla paper revised how we think about optimal compute allocation for language model training. Explain the core finding and how it would inform a decision about whether to train a 70B parameter model on 1T tokens versus a 7B model on 10T tokens given the same compute budget.

Accepted Answer

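Chinchilla found that prior models were significantly undertrained relative to their parameter count. The optimal ratio is roughly 20 tokens per parameter for a compute-optimal run, so a 70B model ideally trains on about 1.4T tokens and a 7B model on about 140B. Using the common approximation that training compute is about 6 x parameters x tokens, the two options in the question cost the same, roughly 4.2e23 FLOPs. The 70B model on 1T tokens sits near the compute-optimal ratio (about 14 tokens per parameter), while the 7B model on 10T tokens is trained far past it (about 1,400 tokens per parameter), so Chinchilla predicts lower loss for the 70B run. The general lesson still holds: for a fixed budget, a smaller model trained on more data often beats a larger undertrained one, which is how Chinchilla (70B) outperformed Gopher (280B). A strong answer notes that inference cost is a real-world consideration and a smaller well-trained model may be preferred in production even if a larger model with the same compute would score higher on benchmarks.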
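The comparison in the question can be checked numerically. This sketch assumes the common approximation that training compute is about 6 FLOPs per parameter per token, together with the roughly 20 tokens-per-parameter rule of thumb; both are approximations, and the function names are illustrative.

```python
import math

def training_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

def compute_optimal_split(flops, ratio=20):
    """Solve 6 * N * D = flops with D = ratio * N for N (params) and D (tokens)."""
    n_params = math.sqrt(flops / (6 * ratio))
    return n_params, ratio * n_params

big = (70_000_000_000, 1_000_000_000_000)      # 70B params, 1T tokens
small = (7_000_000_000, 10_000_000_000_000)    # 7B params, 10T tokens

budget = training_flops(*big)
opt_params, opt_tokens = compute_optimal_split(budget)
print(f"70B/1T: {training_flops(*big):.2e} FLOPs, {tokens_per_param(*big):.0f} tokens/param")
print(f"7B/10T: {training_flops(*small):.2e} FLOPs, {tokens_per_param(*small):.0f} tokens/param")
print(f"compute-optimal at this budget: ~{opt_params / 1e9:.0f}B params, ~{opt_tokens / 1e12:.2f}T tokens")
```

Both runs cost the same under this approximation, and the compute-optimal point for that budget lands much closer to the 70B configuration than to the 7B one.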

Question 19

We want to measure whether Claude maintains honest answers when a user persistently pushes back, claims Claude is wrong, or uses social pressure. How would you design this eval, what failure modes would you look for, and how would you avoid the eval itself becoming a training target that the model games?

Accepted Answer

A strong answer defines a dataset of factually verifiable questions with known correct answers, pairs each with a scripted adversarial follow-up sequence that applies social pressure without providing new evidence, and measures sycophantic capitulation rate. Failure modes include the model changing a correct answer under pressure, the model becoming stubbornly wrong when actually corrected with evidence, and the model hedging so much that it gives no useful answer. To avoid gaming, the eval should use held-out test questions never seen in training and be refreshed regularly.

Question 20

Many Claude API users send the same large system prompt with every request. Explain how KV cache sharing or prompt caching works at the inference level, what it saves, and what implementation challenges it introduces.

Accepted Answer

With causal attention, the K and V tensors at each position depend only on the tokens up to and including that position, so if the prefix of a request is identical across calls the computed K and V tensors for that prefix can be reused. This saves the prefill compute for the shared prefix and lowers time to first token. Anthropic offers prompt caching where the system prompt is cached server-side and only charged at a reduced rate on subsequent hits. Implementation challenges include cache invalidation when the system prompt changes by even one token, memory pressure from maintaining caches across concurrent users, and routing requests to the same server instance to hit a warm cache.
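The reuse-and-invalidate behaviour can be sketched with a cache keyed on a hash of the exact prefix tokens. This is a toy illustration, not how any particular provider implements prompt caching; PrefixKVCache and fake_prefill are hypothetical stand-ins, and the "KV" values are placeholders rather than real tensors.

```python
import hashlib

class PrefixKVCache:
    """Toy cache: maps an exact token prefix to its precomputed KV entries."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prefix_tokens):
        # Any change to any token produces a different key, hence a miss.
        return hashlib.sha256(repr(tuple(prefix_tokens)).encode()).hexdigest()

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = self._key(prefix_tokens)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        kv = compute_kv(prefix_tokens)
        self._store[key] = kv
        return kv

def fake_prefill(tokens):
    """Stand-in for the expensive prefill pass over the prefix."""
    return [("k", t, "v", t) for t in tokens]

cache = PrefixKVCache()
system_prompt = [101, 7, 42, 9, 13]
cache.get_or_compute(system_prompt, fake_prefill)         # cold: miss
cache.get_or_compute(list(system_prompt), fake_prefill)   # same prefix: hit
edited_prompt = [101, 7, 42, 9, 14]                        # one token changed
cache.get_or_compute(edited_prompt, fake_prefill)         # miss again
print("hits:", cache.hits, "misses:", cache.misses)
```

A real server would also need eviction under memory pressure and request routing so that a repeated prefix lands on the instance holding its cache.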

Question 21

Claude accepts images as input. Walk me through how an image goes from raw pixels to something the transformer can reason about alongside text tokens, and what constraints that places on input size and cost.

Accepted Answer

Images are typically passed through a vision encoder such as a ViT that produces patch embeddings, which are projected into the same embedding space as text tokens and concatenated with the text sequence. The number of visual tokens depends on image resolution and patch size; a high-resolution image can produce hundreds or thousands of tokens, which matters for context length and cost. A strong answer notes that large images are typically resized or tiled into sub-images to handle arbitrary resolutions, and that cost is per-token including visual tokens.

Question 22

A customer comes to you with a use case where they need Claude to consistently output structured JSON in a very specific schema for their pipeline. They are getting about 85% compliance with prompting alone. Walk me through how you would advise them on whether to pursue fine-tuning versus prompt engineering, and what the decision criteria are.

Accepted Answer

A strong answer explores prompt engineering options first: few-shot examples showing the exact schema, constrained decoding or grammar-based output parsing, and whether the 15% failure rate is in the schema structure or in the content. Fine-tuning is worth it when the target behavior is highly idiomatic, the volume of examples is large enough to generalize, and the baseline prompt cannot close the gap. Risks of fine-tuning include forgetting general capabilities, higher cost and iteration time, and the need to manage model versions. For JSON compliance specifically, constrained decoding via outlines or similar tools is often the right answer before fine-tuning.

Question 23

Explain reward hacking in the context of training a language model with RLHF, give a concrete example of what it looks like in output, and describe two strategies for mitigating it.

Accepted Answer

Reward hacking occurs when the model finds outputs that score highly on the learned reward model but do not represent genuinely preferred behavior, because the reward model is an imperfect proxy. Concrete examples include models that produce very long, verbose answers because human raters historically rated longer answers higher, or models that use sycophantic language to agree with the rater's apparent view. Mitigation strategies include KL penalty between the RL-trained policy and a reference policy to limit how far the model drifts, reward model ensembles to reduce proxy gaming, and iterative reward model updates with fresh human feedback after each RL phase.
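The KL penalty is usually applied by subtracting a scaled log-probability gap from the reward model score. The sketch below shows one common formulation, using the summed per-token log-ratio of the sampled response as the KL estimate; the beta value, scores, and log-probabilities are made-up numbers for illustration.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.5):
    """Reward model score minus beta times a sample-based KL estimate.

    policy_logprobs and ref_logprobs are per-token log-probabilities of the
    same sampled response under the RL policy and the frozen reference model.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Response the policy now finds much more likely than the reference does,
# e.g. a verbose or flattering answer the reward model happens to like.
drifted = kl_penalized_reward(
    rm_score=2.0,
    policy_logprobs=[-0.5, -0.25, -0.5],
    ref_logprobs=[-1.0, -1.25, -2.0],
)

# Response the policy and reference agree on: no penalty applies.
unchanged = kl_penalized_reward(
    rm_score=2.0,
    policy_logprobs=[-1.0, -1.25, -2.0],
    ref_logprobs=[-1.0, -1.25, -2.0],
)
print("drifted:", drifted, "unchanged:", unchanged)
```

The same reward model score is worth less when it was obtained by moving far from the reference policy, which limits how aggressively the policy can exploit reward model errors.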

Question 24

Apple runs significant ML workloads on-device rather than in the cloud. What are the technical trade-offs, and how does the Neural Engine factor into decisions about where to run a model?

Accepted Answer

On-device advantages: latency (no network round trip), privacy (data never leaves the device), offline availability, and no per-inference server cost. Disadvantages: model size constrained by device storage and RAM, quantization required for speed which costs accuracy, no easy model updates without an app release. The Apple Neural Engine (ANE) is optimized for specific layer types (convolutions, matrix multiplications with specific shapes) but not all model architectures map well to it; a model that falls back to the CPU or GPU can be slower than a smaller cloud-served model. A strong candidate mentions Core ML as the deployment target and discusses the workflow: train in PyTorch or TensorFlow, convert via coremltools, and measure energy impact on device.

Question 25

You are building a feature set for TikTok's video ranking model. List the categories of features you would include, explain which signals you expect to be most predictive, and describe how you handle features that are only available for some videos or users.

Accepted Answer

A comprehensive answer covers at least four categories: user features (demographics, historical engagement rates, watch history embeddings), video features (category, audio fingerprint, visual content embedding, age of content, creator follower count), interaction features (user-video cross features like whether the user typically watches content from this creator category), and context features (time of day, device type, network speed). Predictive power: watch completion rate and replays are stronger signals than likes. Missing feature handling: use default/cold-start values, or train the model to handle sparse embeddings.

Question 26

A data scientist on your team adds mlflow.autolog() at the top of their training script and sees metrics logged automatically. Explain how auto-logging works under the hood, what it captures, and when it would give you incomplete or misleading tracking.

Accepted Answer

MLflow auto-logging works by monkey-patching popular libraries (sklearn, XGBoost, LightGBM, PyTorch Lightning, Keras) to add logging callbacks and hooks. It captures parameters like learning rate and n_estimators, metrics like validation loss per epoch, and model artifacts. Limitations: it does not capture custom metrics or preprocessing parameters that are not part of the patched library calls. If the data scientist uses raw NumPy or custom training loops without a supported framework, nothing is captured. It can also produce noisy runs if called inside a hyperparameter sweep that manages its own run nesting.

Question 27

A customer wants to make a large language model answer questions about their proprietary product documentation. They ask whether to fine-tune a model using Mosaic AI or to use retrieval-augmented generation. Walk me through how you would help them decide.

Accepted Answer

RAG is almost always the first choice for knowledge grounding: it is cheaper, faster to update when documentation changes, and easier to debug because you can inspect the retrieved chunks. Fine-tuning is appropriate when you need to change the model's style, tone, or output format consistently, or when the knowledge is so large or densely interconnected that retrieval misses context. For product documentation that changes frequently, RAG with a vector index (Databricks Vector Search) backed by a Delta table is clearly the better path. Fine-tuning with Mosaic AI makes more sense for tasks like code generation in a proprietary DSL or classification tasks with a very specific taxonomy.

Question 28

DoorDash shows customers an estimated delivery time the moment they place an order. Walk me through how you would build a model to predict this ETA, what features you would use, how you would handle the cold-start problem for a new restaurant, and how you would measure whether the model is accurate enough to ship.

Accepted Answer

Strong answers split ETA into subcomponents (restaurant prep time, Dasher pick-up wait, drive time) and model each separately, since errors compound. Key features include restaurant historical prep time at this time of day, order complexity (item count, category), current Dasher supply in the zone, and traffic from a maps API. For cold start, use restaurant category priors or a similar-restaurant lookup. Evaluation should cover mean absolute error and calibration (for example, does a nominal 68% prediction interval contain the actual delivery time about 68% of the time?), not just RMSE.
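Both evaluation ideas reduce to short computations: MAE on point predictions, and empirical coverage of a prediction interval as a calibration check. The sketch below uses made-up delivery times in minutes and a fixed +/- 4 minute interval; the data, interval width, and function names are illustrative assumptions.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted minutes."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def interval_coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)

# Hypothetical delivery times in minutes.
actual = [30, 42, 25, 55, 38]
predicted = [28, 45, 25, 50, 40]
lower = [p - 4 for p in predicted]
upper = [p + 4 for p in predicted]

mae = mean_absolute_error(actual, predicted)
coverage = interval_coverage(actual, lower, upper)
print(f"MAE: {mae:.1f} min, interval coverage: {coverage:.0%}")
```

If the interval is advertised as a 68% interval but covers 80% of orders, it is too wide; if it covers far less, customers are being quoted windows the model cannot back up.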

Question 29

DoorDash wants to predict which consumers are likely to stop ordering in the next 30 days so the retention team can proactively target them with a promotion. What features would you engineer, what model would you start with and why, and how would you evaluate whether the model is good enough to act on?

Accepted Answer

Strong answers define churn clearly (no order in 30 days after being active), engineer features from order history (recency, frequency, monetary value, time since last order, trend in order frequency), and start with a gradient-boosted tree (XGBoost or LightGBM) for its strong performance on tabular data with sparse or missing features, checking probability calibration before the scores are used to budget interventions. Evaluation should focus on precision and recall at the threshold used by the retention team, plus the expected value of intervening on predicted churners vs. the cost of the promotion.

Question 30

You are training a deep neural network for a classification task. The training loss decreases for the first three epochs and then plateaus. Validation loss is higher than training loss and also flat. What are the possible causes, and how do you systematically diagnose and fix this?

Accepted Answer

A strong answer recognises that a training loss that stalls early, with validation loss also flat, points to underfitting or an optimisation problem rather than overfitting: overfitting would show training loss still falling while validation loss rises. Candidate causes include a poorly chosen learning rate, a model that is too simple, or an ill-suited optimizer. The candidate should check: learning rate schedule (a plateau may indicate the LR is too small for meaningful updates after warmup, or too large, causing oscillation), gradient norms (are gradients vanishing or exploding?), batch size effects, and whether the data pipeline is bottlenecking or corrupting inputs. Concrete diagnostic steps: plot gradient norms, adjust the LR up and down, increase model capacity, check data loading.

Questions

Explain the bias-variance tradeoff with a concrete example (Machine learning, medium, Very common)

What is overfitting, and how do you spot it? (Machine learning, easy, Very common)

Evaluate a classifier on heavily imbalanced data (Machine learning, medium, Very common)

Explain Constitutional AI and how RLHF differs from it (Machine learning, medium, Very common)

Cold start in a recommendation system (Machine learning, medium, Very common)

How would you use MLflow on Databricks to manage the model lifecycle end to end? (Machine learning, medium, Very common)

When would you NOT use a neural network? (Machine learning, medium, Common)

How would you serve a 70B-parameter LLM at <100ms p95? (Machine learning, hard, Common)

Design a feature store (Machine learning, hard, Common)

How does the bias-variance lens apply when you build on a foundation model? (Machine learning, medium, Common)

Spot and prevent data leakage in a training pipeline (Machine learning, medium, Common)

Monitor an ML model after it ships (Machine learning, medium, Common)

Handle the cold-start problem in a recommender (Machine learning, medium, Common)

Engineer features that actually help a model (Machine learning, medium, Common)

INT8 quantization for inference on AMD hardware (Machine learning, hard, Common)

How do you calibrate Claude's refusal rate without over-refusing? (Machine learning, hard, Common)

What happens to attention as context length scales to 200k tokens? (Machine learning, hard, Common)

Explain Chinchilla scaling laws and how they affect training decisions (Machine learning, hard, Common)

Design an eval to measure Claude's honesty under adversarial pressure (Machine learning, hard, Common)

How does KV cache sharing reduce inference cost for repeated system prompts? (Machine learning, medium, Common)