Question 1

A researcher says yesterday's experiment improved benchmark accuracy by 3 percent, but nobody can reproduce it. What do you check first?

Accepted Answer

Start with the experiment record: exact code commit, model config, data snapshot, preprocessing version, random seeds, hardware, and library versions. Then check whether the reported metric came from the intended split and whether the evaluation script changed between runs. Strong answers discuss determinism limits on GPUs while still insisting on reproducible artefacts and comparable results. The candidate should propose a minimal rerun and then a full rerun under tracked configuration. A common failure is focusing only on the seed while ignoring data leakage or an untracked preprocessing change.
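One way to make the experiment record concrete is a small run manifest that can be hashed and diffed between the original run and the rerun. The sketch below is illustrative only: the function names and manifest fields are hypothetical, and a real record would also capture hardware and library versions from whatever tracking tool the team already uses.

```python
import hashlib
import json
import platform
import sys


def build_run_manifest(commit, config, data_snapshot, preprocessing_version, seed):
    """Collect the fields needed to rerun an experiment under tracked configuration."""
    return {
        "commit": commit,
        "config": config,
        "data_snapshot": data_snapshot,
        "preprocessing_version": preprocessing_version,
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


def manifest_fingerprint(manifest, ignore=("python", "platform")):
    """Hash the manifest with sorted keys so two runs can be compared by one string."""
    stable = {k: v for k, v in manifest.items() if k not in ignore}
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diff_manifests(a, b):
    """Return the keys whose values differ between two run manifests."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))
```

Diffing the manifest of the claimed run against the failed reproduction points at the first thing to investigate, for example a changed `preprocessing_version` when the seed and commit are identical.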

Question 2

A fraud model performs well offline but starts making poor decisions in production. How would you investigate training-serving skew?

Accepted Answer

Compare feature values used during offline training with the values produced by the online serving path for the same entity and timestamp. Look for point-in-time leakage, stale online features, different default values, mismatched transformations, and schema evolution that did not reach both paths. Strong answers include replaying production events through the offline pipeline and logging model inputs at serving time with privacy controls. The candidate should separate skew from drift, because the first is a pipeline consistency bug and the second is a world-change problem. A common miss is checking only model metrics and not the raw feature distributions.
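The per-entity comparison described above can be sketched as a small diff over two feature dictionaries, one produced by the offline pipeline and one logged at serving time. The function name, tolerance, and feature names are assumptions for illustration, not part of any particular feature store.

```python
import math


def find_feature_skew(offline, online, rel_tol=1e-6):
    """Compare two feature dicts for the same entity and timestamp.

    Returns {feature_name: (offline_value, online_value)} for every mismatch,
    including features present on only one path (reported as None).
    """
    mismatches = {}
    for name in sorted(set(offline) | set(online)):
        off, on = offline.get(name), online.get(name)
        both_numeric = isinstance(off, (int, float)) and isinstance(on, (int, float))
        if both_numeric:
            equal = math.isclose(off, on, rel_tol=rel_tol)
        else:
            equal = off == on
        if not equal:
            mismatches[name] = (off, on)
    return mismatches
```

A value that is populated offline but arrives as `0.0` online usually indicates a different default on the serving path, while a feature that appears on only one side suggests a schema change that reached one pipeline and not the other.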

Question 3

Design a shared GPU cluster for research training jobs, batch fine-tunes, and low-latency inference. How do you schedule work fairly and efficiently?

Accepted Answer

Separate workload classes because inference and long training jobs have different latency and preemption tolerance. Use quotas, priority classes, gang scheduling for distributed training, and preemptible queues for exploratory work. Strong answers discuss GPU fragmentation, topology awareness, checkpointing, fair-share policies, and visibility into queue time by team. Inference should usually live in reserved pools with autoscaling and admission control rather than competing directly with training. The common failure is maximising utilisation while making urgent production workloads wait behind low-priority experiments.
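To illustrate how priority classes, gang scheduling, and preemptible queues interact, here is a deliberately simplified admission loop. It is a toy model under stated assumptions: the `Job` fields and the `schedule` function are hypothetical, and it ignores topology, fair-share accounting, checkpointing, and reserved inference pools.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int          # all GPUs must be granted together (gang scheduling)
    priority: int      # higher runs first
    preemptible: bool = False


def schedule(pending, running, free_gpus):
    """Admit jobs in priority order; evict lower-priority preemptible jobs if needed.

    Returns (admitted, preempted, free_gpus). A job is admitted only when its
    full GPU request can be met, never partially.
    """
    running = list(running)
    admitted, preempted = [], []
    for job in sorted(pending, key=lambda j: -j.priority):
        # Victims: preemptible running jobs with strictly lower priority, lowest first.
        victims = sorted(
            (r for r in running if r.preemptible and r.priority < job.priority),
            key=lambda r: r.priority,
        )
        reclaimable = free_gpus + sum(v.gpus for v in victims)
        if reclaimable < job.gpus:
            continue  # cannot satisfy the whole gang, so do not start any part of it
        while free_gpus < job.gpus:
            victim = victims.pop(0)
            running.remove(victim)
            preempted.append(victim)
            free_gpus += victim.gpus
        free_gpus -= job.gpus
        running.append(job)
        admitted.append(job)
    return admitted, preempted, free_gpus
```

The all-or-nothing check is what prevents a distributed job from holding half its GPUs while waiting for the rest, which is one source of the fragmentation the answer mentions.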

Question 4

Standard self-attention has O(n^2) time and memory complexity in sequence length. Walk me through exactly where that quadratic term comes from, and describe at least two concrete algorithmic approaches that reduce it.

Accepted Answer

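The quadratic cost comes from the score matrix QK^T: each of the n queries is dotted with each of the n keys, giving n^2 scores. For head dimension d this costs O(n^2 d) compute, plus O(n^2) memory if the matrix is stored, and multiplying the softmaxed scores by the values costs another O(n^2 d). FlashAttention avoids materializing the full n x n matrix in GPU high-bandwidth memory by tiling the computation through on-chip SRAM and fusing the softmax. This brings memory down to O(n) while keeping attention exact, although compute remains O(n^2). Sparse attention methods (Longformer's sliding window plus global tokens, BigBird's random, window, and global patterns) reduce the compute itself by restricting which token pairs attend to each other, trading exact for approximate attention.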
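The reason tiling can stay exact is the running-max, running-sum form of softmax. The sketch below shows that idea for a single query in plain Python. It is not FlashAttention itself (no SRAM management, no fused GPU kernel, and the 1/sqrt(d) scaling is omitted), and the function names and block size are illustrative.

```python
import math


def attention_naive(q, keys, values):
    """Exact attention for one query: materialises all n scores at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, but visits keys block by block with a running max and sum."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max before adding this block.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Only one block of scores exists at any time, so memory per query depends on the block size and not on n, yet every query-key pair is still visited, which is why the compute stays quadratic.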

Question 5

Explain what the key-value cache is during autoregressive decoding. Given a model with 32 layers, 32 attention heads, head dimension 128, and sequence length 4096, how much GPU memory does the KV cache consume per request in float16?

Accepted Answer

The KV cache stores the key and value projections for all past tokens so they do not need to be recomputed at each new generation step. Assuming standard multi-head attention, where every attention head has its own keys and values, memory is 2 (K and V) times layers times heads times head_dim times sequence_length times bytes_per_element: 2 * 32 * 32 * 128 * 4096 * 2 bytes = 2,147,483,648 bytes = 2 GiB per request in float16. The cache grows linearly with sequence length and with the number of concurrent requests, so at scale it becomes a primary memory bottleneck for serving. This motivates techniques like grouped-query attention (GQA), which reduces the number of KV heads, and PagedAttention, which manages the cache in non-contiguous blocks.
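The arithmetic is easy to check with a short helper. The function name is hypothetical, and the 8-KV-head GQA configuration is an assumed example that does not come from the question.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_element=2):
    """Bytes of KV cache for one request: K and V for every layer, head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_element


GIB = 2 ** 30

# Configuration from the question: 32 layers, 32 heads, head_dim 128, 4096 tokens, float16.
full_mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
print(full_mha / GIB)  # 2.0

# Assumed GQA variant with 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
print(gqa / GIB)  # 0.5
```

Because every factor enters linearly, cutting KV heads by 4x cuts the cache by 4x, and doubling the context length doubles it.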

Question 6

You are training a large transformer and run out of GPU memory. A colleague suggests gradient checkpointing. Explain what it does, what the exact tradeoff is, and where you would place checkpoints in the network to get the best tradeoff.

Accepted Answer

Gradient checkpointing stores only a subset of activations (the checkpoints) during the forward pass and discards the rest, then recomputes the discarded activations from the nearest checkpoint during backpropagation when they are needed for gradients. The cost is roughly one extra forward pass over the checkpointed segments. Because the forward pass is about a third of the total forward-plus-backward compute, runtime increases by roughly 30-35% while activation memory drops significantly. For a chain of n layers, placing a checkpoint about every sqrt(n) layers gives O(sqrt(n)) activation memory instead of O(n), balancing memory savings against recompute time. In practice checkpoints are placed at transformer block boundaries because each block's input is a clean recompute boundary.
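The sqrt(n) spacing can be seen from a simple counting model: with a checkpoint every k layers, peak storage is about n/k checkpoints plus the k activations of the one segment being recomputed. The sketch below uses that simplified model, with hypothetical function names and every layer's activation counted as one unit. It does not measure real framework memory.

```python
import math


def peak_stored_activations(n_layers, segment_len):
    """Units held at peak: one checkpoint per segment plus one segment being recomputed."""
    n_checkpoints = math.ceil(n_layers / segment_len)
    return n_checkpoints + segment_len


def best_segment_len(n_layers):
    """Segment length that minimises peak storage under the counting model above."""
    return min(
        range(1, n_layers + 1),
        key=lambda k: peak_stored_activations(n_layers, k),
    )


n = 64
k = best_segment_len(n)
print(k, peak_stored_activations(n, k))  # 8 16
```

For 64 layers the minimum falls at a segment length of 8, which is sqrt(64), and peak storage drops from 64 units to 16 in this model.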

Top AI research engineer interview questions

Make a training run reproducible (Machine learning, medium, very common)

Investigate training-serving skew (Machine learning, medium, very common)

Design scheduling for a shared GPU cluster (System design, hard, very common)

Why does self-attention scale quadratically and what fixes it? (Role-specific, medium, very common)

KV cache: what it stores, why it grows, and how to manage it (Role-specific, medium, very common)

Gradient checkpointing: the compute-memory tradeoff (Role-specific, medium, very common)
