Question 1

Explain the math behind DDPM. What is the forward process doing to the data, what distribution does it converge to, and how does the reverse process reconstruct samples? Where does the neural network actually sit in this picture?

Accepted Answer

The forward process is a fixed Markov chain that gradually adds Gaussian noise over T steps following a variance schedule, driving the data distribution toward a standard normal. The reverse process learns to denoise step by step, and the network is trained to predict the noise (or the score) added at each step. A strong answer mentions the reparameterization that lets you jump to any noise level in one shot, which is what the L_simple loss exploits.
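
The one-shot jump mentioned above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names (linear_betas, alpha_bars, q_sample) are hypothetical, the schedule endpoints 1e-4 and 0.02 with T = 1000 are assumed defaults, and plain Python lists stand in for image tensors. The L_simple objective would then regress a network's output onto the returned eps.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    # Assumed linear variance schedule: beta_start at t=0, beta_end at t=T-1.
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    # alpha_bar_t is the running product of (1 - beta_s) for s <= t.
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    # Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) without simulating t steps.
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal, noise = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    xt = [signal * x + noise * e for x, e in zip(x0, eps)]
    return xt, eps  # eps is the regression target for a noise-predicting network

T = 1000
abar = alpha_bars(linear_betas(T))
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]
xt, eps = q_sample(x0, T - 1, abar, rng)
print(abar[0], abar[T - 1])  # signal fraction at the first and last step
```

At the last step alpha_bar is close to zero, so x_t is almost pure noise, which is the convergence to a standard normal the answer describes.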

Question 2

How does classifier-free guidance work in diffusion models, and why does turning up the guidance scale improve prompt alignment but hurt sample diversity? What is actually happening to the score estimate?

Accepted Answer

CFG trains a single model for both conditional and unconditional generation by randomly dropping the conditioning signal. At inference, the score is extrapolated beyond the conditional direction: score = unconditional + w * (conditional - unconditional). Higher w sharpens the conditional direction, which reduces the effective entropy of the output distribution and collapses toward the mode, trading diversity for fidelity.
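
The extrapolation formula can be written directly in code. This is a sketch under the assumption that the two score (or noise) predictions are plain lists of floats; cfg_combine is a hypothetical helper name.

```python
def cfg_combine(uncond, cond, w):
    # score = unconditional + w * (conditional - unconditional), elementwise.
    return [u + w * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0]
cond = [1.0, 3.0]
print(cfg_combine(uncond, cond, 0.0))  # w = 0 gives the unconditional prediction
print(cfg_combine(uncond, cond, 1.0))  # w = 1 gives the conditional prediction
print(cfg_combine(uncond, cond, 2.0))  # w > 1 extrapolates past the conditional one
```

With w above 1 the result moves beyond the conditional prediction along the (conditional - unconditional) direction, which is the sharpening effect described in the answer.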

Question 3

Explain how LoRA fine-tuning works at the weight level. Why is it valid to represent the weight update as a product of two low-rank matrices, and how do rank and alpha hyperparameters interact in practice?

Accepted Answer

LoRA freezes the pretrained weights W0 and adds a trainable low-rank bypass, W = W0 + (alpha/r) BA, where B is d x r and A is r x k for a d x k weight matrix, with r much smaller than d and k. B is initialized to zero, so training starts exactly from the pretrained model. The hypothesis is that the intrinsic dimensionality of the fine-tuning update is low, so a rank-r update captures most of the signal. Alpha scales the contribution: the update is multiplied by alpha/r, so increasing alpha without changing rank amplifies the LoRA contribution and behaves much like a larger learning rate on the adapter. A strong answer notes that merging BA into W0 at inference eliminates latency overhead.
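
A small worked example of the merge, assuming the common convention that the low-rank update is multiplied by alpha/r. The matrices are toy values and matmul and lora_merge are hypothetical helpers written with plain lists so the sketch has no dependencies.

```python
def matmul(X, Y):
    # Plain-list matrix product: rows of X against columns of Y.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_merge(W0, B, A, alpha):
    # Fold the low-rank update into the frozen weight: W = W0 + (alpha / r) * B @ A.
    r = len(A)  # rank = number of rows of A
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]

W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3 x 3 weight
B = [[1.0], [0.0], [2.0]]   # d x r with r = 1
A = [[0.5, 0.0, 1.0]]       # r x k
merged = lora_merge(W0, B, A, alpha=2.0)

# The merged weight gives the same output as the two-branch forward pass.
x = [[1.0], [2.0], [3.0]]
two_branch = [[base[0] + 2.0 * upd[0]] for base, upd in zip(matmul(W0, x), matmul(B, matmul(A, x)))]
print(merged)
print(matmul(merged, x), two_branch)
```

Because the merged matrix has the same shape as W0, serving it costs nothing extra, which is the latency point made in the answer.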

Question 4

Describe the KV cache in transformer generation. How does memory consumption scale with sequence length and batch size, and what practical problems does this cause when you try to run long-context inference at scale?

Accepted Answer

During autoregressive decoding, keys and values for all previous tokens are cached to avoid recomputing them. Memory grows as O(layers x heads x seq_len x head_dim x batch_size), so a 4096-token sequence with a large batch can consume tens of gigabytes just for the cache. This forces tradeoffs between batch size and max sequence length, and motivates techniques like paged attention (vLLM), sliding window attention, and multi-query attention.
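
To make the scaling concrete, here is a back-of-the-envelope calculator. The model shape (32 layers, 32 heads, head_dim 128) and the FP16 element size are illustrative assumptions, not a specific model; the leading factor of 2 accounts for storing both keys and values.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; one entry per layer, head, position, and sequence.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 2 ** 30
full_mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=16)
fewer_kv_heads = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096, batch=16)
print(full_mha / GIB)        # 32.0
print(fewer_kv_heads / GIB)  # 8.0
```

The second call shows why sharing key/value heads across query heads (multi-query or grouped-query attention) is attractive: the cache shrinks in proportion to the number of KV heads.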

Question 5

DDIM is often described as deterministic. What does that actually mean at the math level, how does DDIM generalize DDPM, and what is the practical consequence of reducing steps from 1000 to 20 or 50?

Accepted Answer

DDIM defines a family of non-Markovian forward processes whose marginals q(x_t | x_0) match those of DDPM, so a network trained with the DDPM objective can be reused unchanged. The corresponding sampler has a noise parameter eta: eta = 1 recovers DDPM-style ancestral sampling, and eta = 0 removes the stochastic noise injection at each step. With eta = 0 the mapping from the initial noise x_T to the image is deterministic, enabling meaningful interpolations in latent space. With 20 to 50 steps DDIM loses quality gradually, whereas DDPM sampling degrades faster because each of its fewer, larger steps still injects fresh noise. The tradeoff is that DDIM at very low steps (under 10) tends to produce slightly over-smooth results.

Question 6

How does Rotary Position Embedding encode position, and why does it handle relative positions implicitly? When you try to extend a model's context window beyond its training length using RoPE scaling, what breaks and how do methods like YaRN fix it?

Accepted Answer

RoPE rotates query and key vectors in 2D subspaces by angles proportional to position, so the dot product between a query at position m and a key at position n depends only on the offset m - n, which encodes relative distance. When context exceeds training length, the model sees rotation angles it never encountered, causing attention to degrade. YaRN builds on NTK-aware scaling: it rescales different frequency dimensions differently, leaving the high-frequency dimensions largely untouched while interpolating the low-frequency ones, and adds a temperature adjustment to the attention logits to recover quality at extended lengths.
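
The relative-position property can be checked numerically. This is a minimal sketch with a hypothetical rope helper, assuming the usual frequency layout (pair i rotates at base^(-2i/d) with base = 10000) and toy 4-dimensional vectors.

```python
import math

def rope(vec, pos, base=10000.0):
    # Rotate each consecutive (even, odd) pair by an angle proportional to pos.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # i steps by 2, so this is base^(-2*pair/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
near = dot(rope(q, 5), rope(k, 2))      # offset 3 near the start of the sequence
far = dot(rope(q, 103), rope(k, 100))   # same offset 3, much later positions
print(near, far)  # equal up to floating-point error
```

Only the offset between the two positions matters, not their absolute values, which is why RoPE encodes relative distance implicitly.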

Question 7

Standard scaled dot-product attention is memory-bound on modern GPUs. Explain how Flash Attention restructures the computation to reduce HBM reads and writes, and why it enables training longer sequences without running out of memory.

Accepted Answer

Standard attention materializes the full N x N attention matrix in HBM, costing O(N^2) memory reads and writes. Flash Attention tiles Q, K, V into blocks that fit in SRAM and computes softmax incrementally using the online normalization trick, so the N x N matrix is never written to HBM. Memory for attention activations drops from O(N^2) to O(N), enabling substantially longer sequences for the same GPU memory, at the cost of a slightly more complex backward pass that recomputes attention blocks instead of storing them.
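
The online normalization trick can be illustrated for a single query row. This is a simplified sketch, assuming scalar values and pure Python in place of tiled GPU kernels; online_softmax_attention and reference_attention are hypothetical names. It shows only the running-max and running-sum bookkeeping that lets blocks be processed one at a time.

```python
import math

def online_softmax_attention(scores, values, block=2):
    # Process key/value blocks one at a time, never holding all softmax weights.
    m = float("-inf")  # running max of the scores seen so far
    l = 0.0            # running softmax denominator, relative to m
    acc = 0.0          # running weighted sum of values, relative to m
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # rescales old statistics; 0.0 on the first block
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return acc / l

def reference_attention(scores, values):
    # Standard softmax-weighted average that materializes every weight.
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

scores = [0.1, 2.0, -1.0, 0.5, 3.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(online_softmax_attention(scores, values), reference_attention(scores, values))
```

Both functions return the same result; the blockwise version only ever keeps three running numbers per query row, which is what removes the need to store the full score matrix.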

Question 8

Stable Diffusion runs diffusion in a compressed latent space rather than pixel space. How does the VAE compression ratio affect generation quality, and what kinds of artifacts does a weak decoder introduce that the diffusion model cannot compensate for?

Accepted Answer

The VAE encodes images to a spatial latent with typically 8x downsampling and 4 to 16 channels. At high compression ratios, the decoder must hallucinate fine details the latent discards, introducing blurriness or tiling artifacts in textures and text. The diffusion model only controls structure in latent space, so VAE reconstruction quality sets a hard ceiling on output sharpness. A strong answer notes that SDXL retrained the autoencoder (same architecture, larger batch size, EMA weights) specifically to improve reconstruction of fine detail.

Question 9

In RLHF pipelines, models often learn to exploit weaknesses in the reward model rather than truly improving. How does reward hacking manifest in practice for a text or image generation model, and what techniques do teams use to prevent or detect it?

Accepted Answer

Reward hacking occurs when the policy finds out-of-distribution outputs that score high on the proxy reward model but are low quality by human judgment. For text, this looks like verbose, repetitive outputs that game length-correlated rewards. Mitigation includes KL penalties from the reference policy (PPO's KL term), periodic refreshing of the reward model, using an ensemble of reward models, and interleaving human evaluations during RL. A strong answer names the KL-controlled objective: r(x) - beta * KL(policy || ref).

Question 10

You are considering whether to use DPO or PPO to align a generative model with human preferences. Walk me through the core difference in how each algorithm uses preference data, and what practical stability issues you have seen or would expect from each.

Accepted Answer

PPO requires a separate trained reward model and runs online RL, which is notoriously sensitive to learning rate, KL coefficient, and the quality of rollouts. DPO reformulates the RL objective as a supervised contrastive loss directly on preference pairs, removing the reward model and the RL training loop. DPO is simpler and more stable but is offline and susceptible to distributional shift if the preference data does not cover the model's generation distribution. A strong answer notes that DPO can degrade on long-tail prompts where reference model outputs are poor.
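
The supervised contrastive loss DPO uses can be sketched for a single preference pair. The inputs are sequence log-probabilities under the policy and the frozen reference model; dpo_loss is a hypothetical helper, and beta = 0.1 and the numeric log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # -log sigmoid(beta * (chosen log-ratio minus rejected log-ratio)).
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # policy favors the chosen answer more
print(at_init, improved)
```

At initialization the loss is log 2 regardless of the pair, and it falls as the policy raises the chosen response's likelihood relative to the reference faster than the rejected one. No reward model or rollout appears anywhere, which is the simplification the answer describes.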

Question 11

Explain how gradient checkpointing reduces memory during training of large transformer or UNet models. What is the exact compute cost you are paying, and when is it not worth using?

Accepted Answer

Checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass as needed. This reduces activation memory from O(layers) to O(sqrt(layers)) with optimal checkpoint placement, at the cost of roughly one additional forward pass. Since the backward pass costs about twice the forward pass, that is up to a third more compute per step, and in practice throughput drops around 20 to 30 percent. It is not worth using when the run is compute-bound and already fits in memory, or when you already fit comfortably with a large batch and do not need to increase it further.

Question 12

The UNet architecture is central to many diffusion models. What role do the encoder-to-decoder skip connections play specifically in the denoising task, and what happens empirically when you remove or weaken them?

Accepted Answer

Skip connections pass high-frequency spatial information from the encoder to the corresponding decoder resolution, allowing the model to recover fine structure that the bottleneck compresses away. For denoising, the model needs to reconstruct exact edges and textures, so these shortcuts are critical. Without them, outputs are blurry because the decoder has only the semantic bottleneck to work from. A strong answer may note that transformer-based diffusion models (DiT) abandon the UNet in favor of full self-attention and find that the patch sequence contains enough spatial information.

Question 13

FP8 training is becoming common for large generative models. What numerical stability problems arise compared to BF16, how do you handle them, and what parts of the model do people typically keep at higher precision?

Accepted Answer

FP8 has a much smaller dynamic range than BF16, so underflow and overflow are frequent. Mixed FP8 training uses per-tensor or per-block scaling factors and typically keeps the master weights in BF16 or FP32 for optimizer state accumulation. The softmax, layer norm, and loss computation are kept in higher precision because they are sensitive to rounding errors. A strong answer mentions that E4M3 (more mantissa bits) is preferred for activations and E5M2 (more range) for gradients.

Question 14

Byte-pair encoding tokenizers have known failure modes for generative models. Describe at least three situations where BPE tokenization causes problems at generation time and how they manifest in output quality.

Accepted Answer

BPE struggles with rare or out-of-vocabulary strings: long numbers tokenize unpredictably leading to arithmetic errors; code in unusual languages or with unusual formatting fragments into many tokens increasing sequence length and degrading coherence; non-Latin scripts often have bloated token counts relative to English which creates unfair context budget usage. A strong answer also notes that leading whitespace handling varies by tokenizer and can cause subtle prefix-matching issues in constrained decoding.

Question 15

Explain the speculative decoding algorithm. How does the verification step guarantee that outputs are statistically identical to what the target model would have produced without speculative decoding?

Accepted Answer

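A small draft model generates k tokens autoregressively. The target model then scores all k positions in one forward pass, since the draft tokens are already known and can be processed in parallel. Each draft token is checked left to right with a rejection sampling test: if p_target(token) >= p_draft(token) it is accepted, otherwise it is accepted with probability p_target(token) / p_draft(token). At the first rejection the remaining draft tokens are discarded and a replacement token is sampled from the corrected distribution proportional to max(0, p_target - p_draft); if all k tokens are accepted, one extra token is sampled from the target distribution. This provably preserves the target model's distribution while amortizing the per-token cost of the large model over up to k tokens at once.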
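
The verification rule for one drafted token can be sketched as below. The three-token distributions and the helper names sample_categorical and verify_token are made up for illustration; on rejection the replacement is drawn from the residual distribution proportional to max(0, p_target - p_draft), which is the standard correction step assumed here.

```python
import random

def sample_categorical(weights, rng):
    # Draw an index with probability proportional to its (unnormalized) weight.
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return max(range(len(weights)), key=lambda i: weights[i])  # rounding fallback

def verify_token(token, p_target, p_draft, rng):
    # Accept the drafted token with probability min(1, p_target / p_draft).
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return True, token
    # On rejection, resample from the normalized positive part of (target - draft).
    residual = [max(0.0, pt - pd) for pt, pd in zip(p_target, p_draft)]
    return False, sample_categorical(residual, rng)

p_target = [0.6, 0.3, 0.1]
p_draft = [0.2, 0.5, 0.3]
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(20000):
    drafted = sample_categorical(p_draft, rng)
    _, emitted = verify_token(drafted, p_target, p_draft, rng)
    counts[emitted] += 1
print([c / 20000 for c in counts])  # empirically close to p_target
```

Even though every proposal comes from the draft distribution, the emitted tokens follow the target distribution, which is the guarantee the question asks about.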

Question 16

In Stable Diffusion, how is the CLIP text embedding injected into the UNet, and why is cross-attention preferred over concatenation or simple addition for this conditioning?

Accepted Answer

The CLIP text encoder produces a sequence of token embeddings. These are projected and used as keys and values in cross-attention layers within the UNet, while the spatial features from the noisy image serve as queries. Cross-attention lets each spatial location attend selectively to whichever text tokens are most relevant, which is far more expressive than broadcasting a single vector or concatenating a fixed text feature to every spatial position. Simple addition would lose the spatial selectivity that enables grounded text-image correspondence.

Question 17

A researcher wants to fine-tune a diffusion model on 10 reference images of a specific subject. Compare DreamBooth and textual inversion at the gradient level: what parameters change, what is preserved, and where does each approach fail?

Accepted Answer

Textual inversion learns only a new embedding vector in the text encoder's token space, keeping all model weights frozen, so the model's priors are fully preserved but expression is limited to what can be encoded as a token. DreamBooth fine-tunes the full UNet (and optionally the text encoder) on the subject images plus a class-preservation prior loss, allowing much richer subject fidelity but risking overfitting and language drift. DreamBooth fails mainly when the subject is too similar to training data concepts or when the prior loss weight is poorly tuned.

Question 18

FID is the most commonly reported metric for image generative models, but it has well-known failure modes. Describe at least three situations where a model with a lower FID is actually worse by human judgment, and what alternative metrics address those cases.

Accepted Answer

FID is the Fréchet distance between Gaussians fitted to Inception-v3 features of real and generated images, so it is insensitive to individual sample quality, cannot separate mode coverage from precision, and is unreliable when the evaluation domain is far from what the Inception features were trained on (e.g., medical images or satellite imagery). A model that memorizes training images gets a near-zero FID against the training set while generating nothing new. CLIP-FID, Precision and Recall metrics (Kynkäänniemi et al.), and learned human-preference scores (CLIP-based aesthetic predictors, PickScore) address complementary failure modes.

Question 19

During distributed training of a large generative model, your all-reduce step is becoming the bottleneck. Walk me through how NCCL implements ring all-reduce, what network topology assumptions it makes, and what you would do if you were training across multiple nodes with high inter-node latency.

Accepted Answer

Ring all-reduce splits each tensor into N chunks (N GPUs) and runs a reduce-scatter around the ring followed by an all-gather. Each GPU sends and receives 2(N-1)/N times the tensor size, which is near bandwidth-optimal and almost independent of N, but the 2(N-1) sequential steps make it sensitive to per-hop latency. It assumes a ring in which every link has similar bandwidth, since the slowest link bounds throughput. With multiple nodes, intra-node NVLink is fast but inter-node InfiniBand or Ethernet is a bottleneck. Solutions include hierarchical all-reduce (reduce locally first, then across nodes), using FSDP's reduce-scatter to overlap with backward computation, or gradient compression.
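
The two phases can be simulated in a few lines. This is a toy model of the algorithm, not NCCL's implementation: each "GPU" holds a list with one number per chunk, ring_all_reduce is a hypothetical name, and all sends within a step are collected before being applied to mimic simultaneous transfers.

```python
def ring_all_reduce(data):
    # data[i][c] is worker i's local value for chunk c; there are n workers and n chunks.
    n = len(data)
    bufs = [list(row) for row in data]
    chunks_sent_per_worker = 0
    # Reduce-scatter: after n-1 steps, worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        msgs = [(i, (i - step) % n, bufs[i][(i - step) % n]) for i in range(n)]
        for i, c, v in msgs:
            bufs[(i + 1) % n][c] += v
        chunks_sent_per_worker += 1
    # All-gather: circulate the completed chunks so every worker ends with all of them.
    for step in range(n - 1):
        msgs = [(i, (i + 1 - step) % n, bufs[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, v in msgs:
            bufs[(i + 1) % n][c] = v
        chunks_sent_per_worker += 1
    return bufs, chunks_sent_per_worker

data = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
result, sent = ring_all_reduce(data)
print(result[0], sent)
```

Each worker sends 2(N-1) chunks of size 1/N of the tensor, which gives the 2(N-1)/N traffic factor, and the 2(N-1) sequential steps are where high inter-node latency hurts.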

Question 20

You are deploying a text generation model and need to choose between optimizing for throughput and optimizing for latency. Explain how batch size, continuous batching, and request scheduling affect both dimensions, and how you would decide which to optimize for a given use case.

Accepted Answer

Large static batches amortize model weight loading over many requests but increase time-to-first-token latency and require all requests in a batch to finish before new ones are admitted. Continuous batching (iteration-level scheduling in vLLM or TGI) adds new requests as others complete, improving GPU utilization without forcing a fixed batch boundary. For latency-sensitive use cases (interactive chat), you prioritize small batches and streaming. For throughput-optimized uses (offline batch generation), large batches and higher per-GPU utilization matter more.

Question 21

Near-duplicate documents in pretraining datasets degrade generalization and increase memorization risk. What deduplication strategies do you know, and how do you decide between exact deduplication, MinHash LSH, and embedding-based approaches at the scale of hundreds of billions of tokens?

Accepted Answer

Exact deduplication on URL or MD5 hash is fast but misses paraphrased or reformatted copies. MinHash LSH approximates Jaccard similarity over shingled n-grams and scales to trillion-token corpora with tunable band/row parameters controlling recall vs precision. Embedding-based near-dedup catches semantic duplicates but is expensive (requires GPU inference) and often overkill for raw web text. At petabyte scale, MinHash is the standard choice, with suffix array methods used for exact substring dedup, as in Lee et al.'s deduplication of C4.
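
A minimal MinHash sketch, to show how signatures approximate Jaccard similarity. Word 3-gram shingles, 128 hash functions, and salted BLAKE2b from hashlib are illustrative choices; the helper names are hypothetical, and the LSH banding step that makes candidate lookup sub-quadratic is left out.

```python
import hashlib

def shingles(text, n=3):
    # Set of overlapping word n-grams.
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash(shingle_set, num_perm=128):
    # One minimum per salted hash function; the salt plays the role of a permutation.
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in shingle_set
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the old mill today"
sh_a, sh_b = shingles(doc_a), shingles(doc_b)
true_jaccard = len(sh_a & sh_b) / len(sh_a | sh_b)
print(true_jaccard, estimate_jaccard(minhash(sh_a), minhash(sh_b)))
```

In a real pipeline the signature would be split into bands and rows, and only documents colliding in at least one band would be compared, which is where the recall versus precision tuning comes from.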

Question 22

Explain exactly what temperature, top-p nucleus sampling, and top-k sampling are doing to the logit distribution before sampling. If you apply all three, in what order should they be applied, and what pathological outputs does each setting prevent?

Accepted Answer

Temperature scales logits before softmax: T<1 sharpens the distribution (less diversity), T>1 flattens it. Top-k truncates the vocabulary to the k most probable tokens before renormalizing. Top-p (nucleus) takes the smallest set of tokens whose cumulative probability exceeds p. The canonical order is: scale by temperature, then apply top-k, then apply top-p, then sample. Top-k prevents sampling from the very long tail at all times; top-p adapts to distribution sharpness, keeping more candidates when the model is uncertain.
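
The three filters, applied in the order described, can be sketched as a single function returning the final sampling distribution. This is a simplified illustration with a hypothetical filter_probs helper: the nucleus is computed on the distribution renormalized after top-k, which is an assumption about how the steps compose rather than a specification of any particular library.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    # 1) Temperature: divide logits by T, then softmax (shifted by the max for stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # 2) Top-k: keep only the k most probable tokens (0 disables the filter).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # 3) Top-p: smallest prefix whose renormalized cumulative mass reaches top_p.
    mass = sum(probs[i] for i in order)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i] / mass
        if cum >= top_p:
            break
    # Renormalize the survivors; every other token gets probability zero.
    total = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / total
    return out

logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(filter_probs(logits, top_k=3, top_p=0.8))   # only the two most likely tokens survive
print(filter_probs(logits, temperature=0.5))      # sharper than the original distribution
```

Sampling then draws from the returned distribution. Note how top-p keeps a variable number of tokens depending on how peaked the distribution is, while top-k always caps the count.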

Question 23

Mixture-of-experts models can suffer from routing collapse where a few experts handle all tokens. What causes this, how do you detect it during training, and what loss terms or architectural choices prevent it?

Accepted Answer

Routing collapse happens because the router learns early that certain experts reduce loss faster and over-routes to them, starving the rest. Detection: monitor expert load distribution (tokens per expert) across training steps, looking for entropy collapse. The auxiliary load-balancing loss (used in Switch Transformer and Mixtral) penalizes unequal routing by encouraging uniform expert utilization. Expert capacity limits (dropping tokens above a per-expert cap) add a hard constraint. A strong answer mentions that top-2 routing combined with a load-balancing loss is a common configuration.
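
One common form of the auxiliary loss, in the style of the Switch Transformer, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert. The sketch below assumes top-1 routing and toy inputs; load_balance_loss is a hypothetical name and the loss coefficient is omitted.

```python
def load_balance_loss(assignments, router_probs, n_experts):
    # f[e]: fraction of tokens dispatched to expert e (hard assignment).
    # P[e]: mean router probability given to expert e (soft, differentiable in practice).
    n_tokens = len(assignments)
    f = [assignments.count(e) / n_tokens for e in range(n_experts)]
    P = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

balanced = load_balance_loss([0, 1, 2, 3], [[0.25] * 4 for _ in range(4)], 4)
collapsed = load_balance_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0] for _ in range(4)], 4)
print(balanced, collapsed)  # 1.0 4.0
```

The loss takes its minimum value of 1 under uniform routing and grows toward the number of experts as routing collapses onto one expert, so it doubles as a cheap monitoring signal.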

Question 24

Describe one concrete watermarking scheme for detecting LLM-generated text. How does the watermark get embedded without degrading generation quality, and what are its failure modes against a determined attacker?

Accepted Answer

The Kirchenbauer et al. red-green list watermark partitions the vocabulary into red (suppressed) and green (boosted) tokens at each position using a hash of the preceding context as a seed. During generation, logits for green tokens are slightly boosted, so generated text has a detectable statistical excess of green tokens. Detection runs a z-test on the green-token count. Failure modes: paraphrasing attacks disrupt the hash sequence, translation removes it entirely, and short outputs have insufficient power to detect the signal. A strong answer notes that an attacker who knows the watermarking algorithm can adapt.
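
The detection statistic is a one-proportion z-test. In this sketch gamma is the assumed fraction of the vocabulary placed on the green list (0.5 here), watermark_z is a hypothetical helper, and the token counts are invented to show how text length affects detection power.

```python
import math

def watermark_z(green_count, total_tokens, gamma=0.5):
    # Under the null (no watermark) each token is green with probability gamma.
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

long_text = watermark_z(green_count=140, total_tokens=200)   # 70% green over 200 tokens
short_text = watermark_z(green_count=14, total_tokens=20)    # 70% green over 20 tokens
print(round(long_text, 2), round(short_text, 2))  # 5.66 1.79
```

The same 70% green rate is overwhelming evidence over 200 tokens but falls below a typical detection threshold over 20 tokens, which is the short-output failure mode mentioned in the answer.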

Question 25

Your training run on an 80GB A100 crashes with CUDA out-of-memory in the middle of epoch 2. Walk me through your debugging process: what tools do you use, what information do you collect, and what are the most common culprits you look for first?

Accepted Answer

First check torch.cuda.memory_summary() just before the OOM to see the reserved vs allocated breakdown. Use torch.profiler or Nsight Systems to identify which operation triggers the spike. Common culprits: activation accumulation when gradient checkpointing is off, evaluation buffers that grow because they are not cleared between batches, optimizer states left on the GPU when ZeRO offload was expected to move them to CPU, and large KV caches in generation steps during eval. A strong answer also mentions that memory can fragment over time even if peak usage is not obviously exceeded, making torch.cuda.empty_cache() a useful diagnostic step.

Question 26

ControlNet fine-tunes diffusion models on spatial conditioning signals like depth maps or edges. What is the purpose of the zero-convolution layers it introduces, and why would training fail or damage the pretrained model without them?

Accepted Answer

ControlNet copies the UNet encoder into a trainable branch and connects it back to the decoder via zero-initialized 1x1 convolutions. At the start of training, zero weights mean the conditioning branch contributes nothing, so the base model's outputs and gradients are preserved exactly. As training proceeds, the zeros slowly learn to inject the conditioning signal. Without this, random initial weights from the conditioning branch would corrupt the pretrained decoder immediately, destroying the quality the base model learned.

Question 27

You are training a large generative model and see a sudden loss spike at step 12,000 that does not recover on its own. What are the most likely causes, and what is your step-by-step process for diagnosing and recovering the run?

Accepted Answer

Common causes: a corrupted or anomalous data batch (check the gradient norm at the spike step and look for NaN or inf), a too-high learning rate hitting an unstable region, a gradient norm explosion that the clip threshold does not contain, or a hardware bit-flip error (NaN from a failing GPU memory cell). Process: roll back to the last checkpoint before the spike, verify the data batch that caused it by replaying it, check gradient norms leading up to the spike for pre-spike buildup, and consider lowering the learning rate or adding warmup. A strong answer mentions that loss spikes in early Chinchilla-style training often come from web-crawl data outliers.

Question 28

You need to deploy a 70B parameter model on a machine with 48GB of GPU memory using llama.cpp or similar. Walk me through the available quantization formats (Q4_K_M, Q5_K_S, Q8_0, etc.), what the K-quant grouping is doing, and how you decide which to use.

Accepted Answer

GGUF quantization formats store weights in small blocks that share a scale, with Q4/Q5/Q8 indicating the nominal bits per weight. K-quants group blocks into super-blocks whose per-block scales are themselves quantized, and the _S/_M/_L variants mix precisions across tensors: Q4_K_M uses 4-bit for most weights but 6-bit for a subset of the more sensitive attention and feed-forward tensors. Q8_0 is near-lossless but nearly doubles memory vs Q4. For 70B at 48GB, Q4_K_M fits (roughly 40GB) with a small perplexity degradation, on the order of 0.1 to 0.3 points, though the remaining headroom limits the KV cache and therefore context length. Q5_K_M needs roughly 50GB, so it only fits if you have that much and care about quality. Q8 does not fit. A strong answer mentions that calibration-based quantization methods like GPTQ or AWQ are alternatives with better calibrated accuracy.

Question 29

How does ALiBi add position information to a transformer, and why does it generalize better to sequences longer than those seen during training compared to learned absolute positional embeddings?

Accepted Answer

ALiBi adds a fixed negative linear bias to the pre-softmax attention scores proportional to the distance between query and key positions, with different slopes per head. This biases the model to prefer nearby tokens without encoding absolute position. At inference with longer sequences, the bias continues to grow linearly, so tokens farther away than anything seen in training are simply penalized more, and each query keeps attending mostly to a local window whose relative distances it has already seen. Learned absolute embeddings cannot extrapolate because unseen position indices have no learned representation.
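
The bias itself is simple to construct. This sketch assumes the geometric slope schedule commonly used when the number of heads is a power of two (2^(-8h/n) for head h = 1..n) and adds a causal mask as -inf, which is an illustrative choice rather than part of ALiBi's definition; both helper names are hypothetical.

```python
def alibi_slopes(n_heads):
    # Geometric sequence of per-head slopes, assuming n_heads is a power of two.
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    # bias[q][k] = -slope * (q - k) for keys at or before the query, -inf otherwise.
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
print(slopes[0], slopes[-1])  # steepest and shallowest slope
print(bias[3])                # penalties for the last query: [-1.5, -1.0, -0.5, -0.0]
```

The matrix is added to the attention scores before softmax. Extending seq_len needs no new parameters, since the penalty for any distance is defined by the same linear rule.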

Question 30

The original DDPM used a linear beta schedule. Why did Improved DDPM switch to a cosine schedule, and what artifact does the linear schedule produce at the final timesteps that cosine fixes?

Accepted Answer

The linear schedule increases beta (noise variance) at a constant rate from a small value to 0.02 across T steps. This causes the data signal to be almost entirely destroyed well before step T, so the late timesteps carry nearly no useful gradient signal for the model. The cosine schedule defines alpha_bar(t) = f(t)/f(0) with f(t) = cos((t/T + s)/(1+s) * pi/2)^2, which changes slowly near t = 0 and t = T and falls roughly linearly in between, keeping meaningful signal across a wider range of timesteps and making the loss more informative throughout training. In practice this shows up as improved sample quality, especially when sampling with fewer than 100 steps.
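
The difference is easy to see numerically. The sketch below assumes T = 1000, a linear beta range of 1e-4 to 0.02, and the offset s = 0.008; the function names are hypothetical, and the cosine value is normalized by its value at t = 0 so that alpha_bar(0) = 1.

```python
import math

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    # Cumulative product of (1 - beta_t) under a linear beta schedule.
    out, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def cosine_alpha_bar(t, T, s=0.008):
    # Squared-cosine schedule, normalized by its value at t = 0.
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

T = 1000
linear = linear_alpha_bar(T)
for t in (250, 500, 800):
    print(t, round(linear[t], 4), round(cosine_alpha_bar(t, T), 4))
```

Halfway through the chain the linear schedule has already removed most of the signal, while the cosine schedule still retains roughly half, so fewer timesteps are spent on inputs that are effectively pure noise.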

Question 31

Your monitoring shows that the hallucination rate of your deployed instruction-tuned model has increased significantly over the past week, but no model code was changed. What do you do?

Accepted Answer

First hypothesis: the input distribution changed (new user segments, new prompt patterns from a product feature launch). Compare recent prompt distributions against the baseline period. Second: check if serving infrastructure changed (batching, quantization, sampling parameters). Third: check if a data preprocessing or tokenization bug was introduced. Fourth: evaluate on a fixed benchmark to separate real quality regression from distribution shift. A strong answer resists immediately retraining and focuses on diagnosis first.

Question 32

You were planning a 256-GPU training run but the cluster allocation was cut in half. You still have the same deadline. What do you do?

Accepted Answer

A strong answer makes concrete tradeoffs: reduce model size proportionally, train for fewer steps and rely on better data quality instead, apply more aggressive quantization to reduce memory and increase per-GPU throughput, or cut non-essential ablation experiments and focus the compute on the single most important run. They should also check whether training longer on fewer GPUs hits the deadline or whether parallel experiments need to be serialized. Finally, communicate the adjusted scope proactively.

Question 33

During final evaluation before shipping a new fine-tuned model, the safety eval finds that it produces harmful content for 0.5% of a specific adversarial prompt category that the previous model handled safely. The release is scheduled for tomorrow. What do you do?

Accepted Answer

Do not ship. 0.5% of adversarial prompts is a real risk depending on scale. Immediate steps: notify the team lead and safety reviewers, characterize the failure mode (is it a specific topic, prompt structure, or random?), check whether the fine-tuning data introduced it or if the base model was already vulnerable. Mitigation options while retraining: apply a prompt-level safety filter for the affected category, or roll back to the previous model. Retraining should add the failing prompts with correct outputs to the RLHF or SFT data. Delay the release and communicate the reason clearly.

Question 34

Your automated benchmarks show a new model is 5% better on your primary task metric, but the LLM-as-judge evaluation says it is 3% worse on helpfulness, and a small human eval (50 raters) is inconclusive. You need to make a go/no-go decision by end of day. How do you proceed?

Accepted Answer

A strong answer diagnoses why the signals conflict before deciding: check whether the primary benchmark correlates with the helpfulness dimension historically, examine specific examples where the LLM judge rated the new model lower to understand the failure mode, and assess whether the human eval sample is too small to be conclusive (50 raters for a 3% difference likely is). If the new model is better on the primary task but worse on helpfulness, a partial deployment (canary or A/B test) with real user feedback collection is the right call rather than a full rollout.

Question 35

A colleague suspects that your latest training dataset may contain benchmark evaluation data, which would inflate your model's reported scores. How do you investigate this claim, and what do you do if you confirm it?

Accepted Answer

Run exact and near-exact string matching between benchmark questions/answers and the training corpus using n-gram overlap or embedding similarity. Check specific hard examples: does the model get them right only when phrased exactly as in the benchmark but fail on paraphrases? If contamination is confirmed, exclude the contaminated subset, retrain or at minimum recompute scores on a clean benchmark subset, and report both the inflated and corrected numbers. A strong answer addresses the reporting obligation: inflated results that drove product decisions need to be corrected internally even if not published.

Question 36

After deploying a new model version, p95 inference latency increased by 40% even though the model parameter count and architecture are identical. Where do you look first?

Accepted Answer

Check in this order: the KV cache configuration (sequence length ceiling, block size, number of blocks), batch scheduling parameters (max batch size, max tokens per batch), whether the tokenizer changed and is producing longer token sequences for the same prompts, any change in sampling parameters (more tokens per response), and whether the serving library version changed (vLLM or TGI). Profile a representative request with nsight or torch.profiler to identify which operation is slower. The 40% increase for an identical architecture most likely points to configuration or batching changes rather than the model weights.

Question 37

A product team asks you to choose between self-hosting an open-source generative model and using a proprietary API for a new feature that processes sensitive user documents and requires occasional fine-tuning on domain data. Walk through the technical and operational tradeoffs specific to generative model deployment and give a concrete recommendation.

Accepted Answer

Key dimensions: data privacy (sensitive documents should not leave your infrastructure, which favors open-source hosting), latency requirements (self-hosted allows lower p99 with dedicated GPU; APIs have variable latency), cost at scale (API per-token cost vs fixed GPU infra cost, crossover depends on request volume), fine-tuning needs (open-source allows task-specific adaptation), and operational burden (hosting a model requires MLOps investment). For sensitive documents the privacy argument usually tips toward open-source self-hosting with appropriate security controls.

Question 38

You are fine-tuning a 13B model with LoRA on 2x A100 80GB GPUs and hitting OOM errors during the backward pass. You cannot get more GPUs. What are your options in order of what you would try first?

Accepted Answer

Try in this order: (1) reduce per-GPU batch size and increase gradient accumulation steps to maintain effective batch size, (2) enable gradient checkpointing on the base model to reduce activation memory, (3) switch to QLoRA (quantize the base model to 4-bit NF4, keeping only LoRA adapters in full precision), (4) reduce LoRA rank, (5) freeze more layers of the base model and apply LoRA only to fewer modules. Each step trades compute or quality for memory, and they compose: QLoRA plus gradient checkpointing is a common combination for very large models.

Question 39

Explain the math behind DDPM. What is the forward process doing to the data, what distribution does it converge to, and how does the reverse process reconstruct samples? Where does the neural network actually sit in this picture?

Accepted Answer

The forward process is a fixed Markov chain that gradually adds Gaussian noise over T steps following a variance schedule, driving the data distribution toward a standard normal. The reverse process learns to denoise step by step, and the network is trained to predict the noise (or the score) added at each step. A strong answer mentions the reparameterization that lets you jump to any noise level in one shot, which is what the L_simple loss exploits.

Question 40

How does classifier-free guidance work in diffusion models, and why does turning up the guidance scale improve prompt alignment but hurt sample diversity? What is actually happening to the score estimate?

Accepted Answer

CFG trains a single model for both conditional and unconditional generation by randomly dropping the conditioning signal. At inference, the score is extrapolated beyond the conditional direction: score = unconditional + w * (conditional - unconditional). Higher w sharpens the conditional direction, which reduces the effective entropy of the output distribution and collapses toward the mode, trading diversity for fidelity.

Question 41

Explain how LoRA fine-tuning works at the weight level. Why is it valid to represent the weight update as a product of two low-rank matrices, and how do rank and alpha hyperparameters interact in practice?

Accepted Answer

LoRA freezes the pretrained weights and adds a low-rank bypass, W = W0 + (alpha/r) BA, where B is d x r and A is r x d, with r much smaller than d. The hypothesis is that the intrinsic dimensionality of the fine-tuning task is low, so a rank-r update captures most of the signal. Alpha scales the contribution: the update is multiplied by alpha/r, so increasing alpha without changing rank amplifies the LoRA contribution and behaves much like a larger learning rate on the adapter. A strong answer notes that merging the scaled BA into W0 at inference eliminates latency overhead.
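Merging the adapter into the base weights can be sketched with nested lists. This is an illustration only, assuming the usual convention that the low-rank product is multiplied by alpha / r before being added; matmul and lora_merge are hypothetical helper names.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_merge(W0, B, A, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(A)
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]

W0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2 x 2)
B = [[1.0], [2.0]]              # d x r with r = 1
A = [[3.0, 4.0]]                # r x d
merged = lora_merge(W0, B, A, alpha=2.0)
```

After merging, inference uses a single dense matrix of the original shape, which is why the adapter adds no latency.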

Question 42

Describe the KV cache in transformer generation. How does memory consumption scale with sequence length and batch size, and what practical problems does this cause when you try to run long-context inference at scale?

Accepted Answer

During autoregressive decoding, keys and values for all previous tokens are cached to avoid recomputing them. Memory grows as O(layers x heads x seq_len x head_dim x batch_size), so a 4096-token sequence with a large batch can consume tens of gigabytes just for the cache. This forces tradeoffs between batch size and max sequence length, and motivates techniques like paged attention (vLLM), sliding window attention, and multi-query attention.
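A quick back-of-the-envelope calculator makes the scaling concrete. The model shape below is a hypothetical configuration chosen for round numbers, not a specific model, and the factor of 2 accounts for storing both keys and values.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Total KV cache size in bytes; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical shape: 32 layers, 32 KV heads of dim 128, FP16 cache,
# 4096-token sequences, batch of 16.
total = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch_size=16)
gib = total / 2**30
```

Under these assumptions the cache takes 32 GiB, which is why reducing the number of KV heads (multi-query or grouped-query attention) or paging the cache matters at scale.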

Question 43

DDIM is often described as deterministic. What does that actually mean at the math level, how does DDIM generalize DDPM, and what is the practical consequence of reducing steps from 1000 to 20 or 50?

Accepted Answer

DDIM defines a family of non-Markovian forward processes whose marginals q(x_t | x_0) match those of DDPM, so the same trained noise-prediction network can be reused without retraining. Setting the per-step noise scale (eta) to zero removes the stochastic noise injection from the reverse update. The mapping from initial noise to image is then deterministic, which enables meaningful interpolations in latent space. At low step counts DDIM trades quality smoothly, whereas DDPM degrades faster because each of its fewer steps must inject random noise. The tradeoff is that DDIM at very low steps (under 10) tends to produce slightly over-smooth results.

Question 44

How does Rotary Position Embedding encode position, and why does it handle relative positions implicitly? When you try to extend a model's context window beyond its training length using RoPE scaling, what breaks and how do methods like YaRN fix it?

Accepted Answer

RoPE rotates query and key vectors in 2D subspaces by angles proportional to position, so the dot product of a rotated query and key depends only on the difference of their positions, which encodes relative distance. When context exceeds training length, the model sees rotation angles it never encountered, causing attention to degrade. YaRN builds on NTK-aware interpolation: it rescales frequency dimensions differently, interpolating the low-frequency dimensions while leaving the high-frequency ones largely untouched, and adds a temperature adjustment to the attention logits. This spreads the extrapolation burden more evenly across the frequency spectrum and recovers quality at extended lengths.
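The relative-position property can be checked numerically. A minimal sketch, assuming the common formulation where pair i is rotated by pos * base^(-2i/d) with base 10000; rope and dot are illustrative helpers.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # i = 0, 2, 4, ... gives base^(-2k/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
near = dot(rope(q, 5), rope(k, 3))      # offset of 2 near the start
far = dot(rope(q, 102), rope(k, 100))   # same offset much later in the sequence
```

The two scores agree up to floating-point error because only the offset between the positions enters the product.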

Question 45

Standard scaled dot-product attention is memory-bound on modern GPUs. Explain how Flash Attention restructures the computation to reduce HBM reads and writes, and why it enables training longer sequences without running out of memory.

Accepted Answer

Standard attention materializes the full N x N attention matrix in HBM, costing O(N^2) memory reads and writes. Flash Attention tiles Q, K, V into blocks that fit in SRAM and computes softmax incrementally using the online normalization trick, so the N x N matrix is never written to HBM. Memory for activations drops from O(N^2) to O(N), which allows substantially longer sequences for the same GPU memory, at the cost of a more complex backward pass that recomputes attention blocks instead of storing them.
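The online normalization trick is the piece that makes tiling possible, and it can be shown for a single query row. This sketch uses scalar values and a tiny block size for readability; it illustrates the running-max rescaling only and is not the actual kernel.

```python
import math

def online_softmax_weighted_sum(scores, values, block=2):
    """Compute sum(softmax(scores) * values) one block at a time."""
    m = float("-inf")   # running maximum of the scores seen so far
    l = 0.0             # running sum of exp(score - m)
    acc = 0.0           # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        correction = math.exp(m - m_new)   # rescales earlier partial sums
        weights = [math.exp(s - m_new) for s in s_blk]
        l = l * correction + sum(weights)
        acc = acc * correction + sum(w * v for w, v in zip(weights, v_blk))
        m = m_new
    return acc / l

def reference(scores, values):
    """Standard softmax that needs every score at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

scores = [1.0, 3.0, -2.0, 0.5, 2.0]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
tiled = online_softmax_weighted_sum(scores, values)
```

Only the three running quantities are carried between blocks, so the full row of scores never has to be stored.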

Question 46

Stable Diffusion runs diffusion in a compressed latent space rather than pixel space. How does the VAE compression ratio affect generation quality, and what kinds of artifacts does a weak decoder introduce that the diffusion model cannot compensate for?

Accepted Answer

The VAE encodes images to a spatial latent with typically 8x downsampling and 4 to 16 channels. At high compression ratios, the decoder must hallucinate fine details the latent discards, introducing blurriness or tiling artifacts in textures and text. The diffusion model only controls structure in latent space, so VAE reconstruction quality sets a hard ceiling on output sharpness. A strong answer notes that SDXL improved the VAE decoder specifically to address color saturation and fine-detail issues.

Question 47

In RLHF pipelines, models often learn to exploit weaknesses in the reward model rather than truly improving. How does reward hacking manifest in practice for a text or image generation model, and what techniques do teams use to prevent or detect it?

Accepted Answer

Reward hacking occurs when the policy finds out-of-distribution outputs that score high on the proxy reward model but are low quality by human judgment. For text, this looks like verbose, repetitive outputs that game length-correlated rewards. Mitigation includes KL penalties from the reference policy (PPO's KL term), periodic refreshing of the reward model, using an ensemble of reward models, and interleaving human evaluations during RL. A strong answer names the KL-controlled objective: r(x) - beta * KL(policy || ref).

Question 48

You are considering whether to use DPO or PPO to align a generative model with human preferences. Walk me through the core difference in how each algorithm uses preference data, and what practical stability issues you have seen or would expect from each.

Accepted Answer

PPO requires a separate trained reward model and runs online RL, which is notoriously sensitive to learning rate, KL coefficient, and the quality of rollouts. DPO reformulates the RL objective as a supervised contrastive loss directly on preference pairs, removing the reward model and the RL training loop. DPO is simpler and more stable but is offline and susceptible to distributional shift if the preference data does not cover the model's generation distribution. A strong answer notes that DPO can degrade on long-tail prompts where reference model outputs are poor.
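The supervised contrastive loss that DPO uses can be sketched for one preference pair. This assumes the inputs are summed sequence log-probabilities under the policy and the frozen reference model; the beta default of 0.1 is an assumed illustrative value, and dpo_loss is a hypothetical helper name.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # softplus(-margin), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

at_init = dpo_loss(-1.0, -2.0, -1.0, -2.0)    # policy equals reference
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)   # policy favors the chosen answer more
```

No reward model or rollout appears anywhere in the loss, which is the source of both DPO's simplicity and its reliance on the offline preference data covering what the model actually generates.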

Question 49

Explain how gradient checkpointing reduces memory during training of large transformer or UNet models. What is the exact compute cost you are paying, and when is it not worth using?

Accepted Answer

Checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass as needed. This reduces activation memory from O(layers) to O(sqrt(layers)) with optimal checkpoint placement, at the cost of roughly one additional forward pass worth of compute, so throughput drops around 20 to 30 percent. It is not worth using when training is compute-bound and activations already fit in memory, or when you already fit comfortably with a large batch and do not need to increase it further.

Question 50

The UNet architecture is central to many diffusion models. What role do the encoder-to-decoder skip connections play specifically in the denoising task, and what happens empirically when you remove or weaken them?

Accepted Answer

Skip connections pass high-frequency spatial information from the encoder to the corresponding decoder resolution, allowing the model to recover fine structure that the bottleneck compresses away. For denoising, the model needs to reconstruct exact edges and textures, so these shortcuts are critical. Without them, outputs are blurry because the decoder has only the semantic bottleneck to work from. A strong answer may note that transformer-based diffusion models (DiT) abandon the UNet in favor of full self-attention and find that the patch sequence contains enough spatial information.

Question 51

FP8 training is becoming common for large generative models. What numerical stability problems arise compared to BF16, how do you handle them, and what parts of the model do people typically keep at higher precision?

Accepted Answer

FP8 has a much smaller dynamic range than BF16, so underflow and overflow are frequent. Mixed FP8 training uses per-tensor or per-block scaling factors and typically keeps the master weights in BF16 or FP32 for optimizer state accumulation. The softmax, layer norm, and loss computation are kept in higher precision because they are sensitive to rounding errors. A strong answer mentions that E4M3 (more mantissa bits) is preferred for activations and E5M2 (more range) for gradients.

Question 52

Byte-pair encoding tokenizers have known failure modes for generative models. Describe at least three situations where BPE tokenization causes problems at generation time and how they manifest in output quality.

Accepted Answer

BPE struggles with rare or out-of-vocabulary strings: long numbers tokenize unpredictably leading to arithmetic errors; code in unusual languages or with unusual formatting fragments into many tokens increasing sequence length and degrading coherence; non-Latin scripts often have bloated token counts relative to English which creates unfair context budget usage. A strong answer also notes that leading whitespace handling varies by tokenizer and can cause subtle prefix-matching issues in constrained decoding.

Question 53

Explain the speculative decoding algorithm. How does the verification step guarantee that outputs are statistically identical to what the target model would have produced without speculative decoding?

Accepted Answer

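A small draft model generates k tokens speculatively. The target model then scores all k drafted positions in a single forward pass, since the drafted tokens are already known and can be processed in parallel. Each drafted token goes through a rejection sampling check: if p_target(token) >= p_draft(token) it is accepted, otherwise it is accepted with probability p_target / p_draft. At the first rejection, the remaining drafted tokens are discarded and a replacement is sampled from the corrected distribution proportional to max(0, p_target - p_draft); if all k tokens are accepted, one extra token is sampled from the target model. This provably preserves the target model's distribution while amortizing the per-token cost of the large model over up to k tokens at once.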
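The acceptance rule for a single drafted token can be sketched as follows. This is a toy illustration over a three-token vocabulary with made-up probabilities; verify_token is a hypothetical helper, and the residual distribution max(0, p_target - p_draft) is the standard correction used when a draft token is rejected.

```python
import random

def verify_token(token, p_target, p_draft, rng=random):
    """Accept with probability min(1, p_target/p_draft); otherwise resample."""
    ratio = p_target[token] / p_draft[token]
    if rng.random() < min(1.0, ratio):
        return token, True
    # Rejected: draw from the normalized residual max(0, p_target - p_draft).
    residual = [max(0.0, pt - pd) for pt, pd in zip(p_target, p_draft)]
    replacement = rng.choices(range(len(residual)), weights=residual)[0]
    return replacement, False

# Empirical check that the emitted tokens follow the target distribution.
p_target = [0.6, 0.3, 0.1]
p_draft = [0.2, 0.5, 0.3]
rng = random.Random(0)
counts = [0, 0, 0]
n_trials = 20000
for _ in range(n_trials):
    drafted = rng.choices(range(3), weights=p_draft)[0]
    emitted, _accepted = verify_token(drafted, p_target, p_draft, rng)
    counts[emitted] += 1
freqs = [c / n_trials for c in counts]
```

The emitted frequencies should sit close to p_target even though every proposal came from the draft distribution.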

Question 54

In Stable Diffusion, how is the CLIP text embedding injected into the UNet, and why is cross-attention preferred over concatenation or simple addition for this conditioning?

Accepted Answer

The CLIP text encoder produces a sequence of token embeddings. These are projected and used as keys and values in cross-attention layers within the UNet, while the spatial features from the noisy image serve as queries. Cross-attention lets each spatial location attend selectively to whichever text tokens are most relevant, which is far more expressive than broadcasting a single vector or concatenating a fixed text feature to every spatial position. Simple addition would lose the spatial selectivity that enables grounded text-image correspondence.

Question 55

A researcher wants to fine-tune a diffusion model on 10 reference images of a specific subject. Compare DreamBooth and textual inversion at the gradient level: what parameters change, what is preserved, and where does each approach fail?

Accepted Answer

Textual inversion learns only a new embedding vector in the text encoder's token space, keeping all model weights frozen, so the model's priors are fully preserved but expression is limited to what can be encoded as a token. DreamBooth fine-tunes the full UNet (and optionally the text encoder) on the subject images plus a class-preservation prior loss, allowing much richer subject fidelity but risking overfitting and language drift. DreamBooth fails mainly when the subject is too similar to training data concepts or when the prior loss weight is poorly tuned.

Question 56

FID is the most commonly reported metric for image generative models, but it has well-known failure modes. Describe at least three situations where a model with a lower FID is actually worse by human judgment, and what alternative metrics address those cases.

Accepted Answer

FID measures the distance between Inception-v3 feature distributions of real and generated images, so it says little about individual sample quality, folds mode coverage and precision into a single number, and becomes unreliable when the evaluation domain is far from the data the Inception features were trained on (e.g., medical images or satellite imagery). A model that memorizes training images gets a near-zero FID while generating nothing new. CLIP-FID, Precision and Recall metrics (Kynkäänniemi et al.), and human preference scores (CLIP-aesthetics, PickScore) address complementary failure modes.

Question 57

During distributed training of a large generative model, your all-reduce step is becoming the bottleneck. Walk me through how NCCL implements ring all-reduce, what network topology assumptions it makes, and what you would do if you were training across multiple nodes with high inter-node latency.

Accepted Answer

Ring all-reduce splits each tensor into N chunks (one per GPU) and runs a reduce-scatter around the ring followed by an all-gather. Each GPU sends and receives 2(N-1)/N times the tensor size in total, which is near bandwidth-optimal and almost independent of N. It assumes a ring topology in which every peer-to-peer link has similar bandwidth, since the slowest link sets the pace. With multiple nodes, intra-node NVLink is fast but inter-node InfiniBand or Ethernet is a bottleneck. Solutions include hierarchical all-reduce (reduce locally first, then across nodes), using FSDP's reduce-scatter to overlap with backward computation, or gradient compression.
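The two phases are easier to follow in a small simulation. This sketch models each GPU as a list of N scalar chunks and performs the sends of one step "simultaneously" by collecting messages before applying them; it illustrates the communication pattern only and is not how NCCL is implemented.

```python
def ring_all_reduce(data):
    """data[i][c] is chunk c on worker i. Returns (buffers, chunks sent per worker)."""
    n = len(data)
    buf = [list(row) for row in data]
    sent = 0
    # Phase 1, reduce-scatter: after n-1 steps worker i owns the full sum of chunk (i+1) % n.
    for s in range(n - 1):
        msgs = [(i, (i - s) % n, buf[i][(i - s) % n]) for i in range(n)]
        for i, c, val in msgs:
            buf[(i + 1) % n][c] += val
        sent += 1
    # Phase 2, all-gather: circulate the fully reduced chunks around the ring.
    for s in range(n - 1):
        msgs = [(i, (i + 1 - s) % n, buf[i][(i + 1 - s) % n]) for i in range(n)]
        for i, c, val in msgs:
            buf[(i + 1) % n][c] = val
        sent += 1
    return buf, sent

workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result, chunks_sent = ring_all_reduce(workers)
```

Each worker sends 2(N-1) chunks of size 1/N of the tensor, which is where the 2(N-1)/N figure comes from.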

Question 58

You are deploying a text generation model and need to choose between optimizing for throughput and optimizing for latency. Explain how batch size, continuous batching, and request scheduling affect both dimensions, and how you would decide which to optimize for a given use case.

Accepted Answer

Large static batches amortize model weight loading over many requests but increase time-to-first-token latency and require all requests in a batch to finish before new ones are admitted. Continuous batching (iteration-level scheduling in vLLM or TGI) adds new requests as others complete, improving GPU utilization without forcing a fixed batch boundary. For latency-sensitive use cases (interactive chat), you prioritize small batches and streaming. For throughput-optimized uses (offline batch generation), large batches and higher per-GPU utilization matter more.

Question 59

Near-duplicate documents in pretraining datasets degrade generalization and increase memorization risk. What deduplication strategies do you know, and how do you decide between exact deduplication, MinHash LSH, and embedding-based approaches at the scale of hundreds of billions of tokens?

Accepted Answer

Exact deduplication on URL or MD5 hash is fast but misses paraphrased or reformatted copies. MinHash LSH approximates Jaccard similarity over shingled n-grams and scales to trillion-token corpora, with tunable band/row parameters controlling recall vs precision. Embedding-based near-dedup catches semantic duplicates but is expensive (requires GPU inference) and often overkill for raw web text. At petabyte scale, MinHash is the standard choice, with suffix array methods used for exact substring dedup, as in the deduplication of C4 by Lee et al. (2021).
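The MinHash estimate itself fits in a short sketch. This version uses word 3-gram shingles and salted BLAKE2b hashes from the standard library as a stand-in for random permutations; the LSH banding step is omitted, and all function names are illustrative.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_perm=128):
    """One minimum hash value per salted hash function."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "completely unrelated sentence about gradient descent and learning rate schedules"
sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near_dup = estimated_jaccard(sig_a, sig_b)
unrelated = estimated_jaccard(sig_a, sig_c)
```

In a real pipeline the signatures would be split into bands and bucketed so that only documents sharing a bucket are compared, which is what makes the approach scale.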

Question 60

Explain exactly what temperature, top-p nucleus sampling, and top-k sampling are doing to the logit distribution before sampling. If you apply all three, in what order should they be applied, and what pathological outputs does each setting prevent?

Accepted Answer

Temperature scales logits before softmax: T<1 sharpens the distribution (less diversity), T>1 flattens it. Top-k truncates the vocabulary to the k most probable tokens before renormalizing. Top-p (nucleus) takes the smallest set of tokens whose cumulative probability exceeds p. The canonical order is: scale by temperature, then apply top-k, then apply top-p, then sample. Top-k prevents sampling from the very long tail at all times; top-p adapts to distribution sharpness, keeping more candidates when the model is uncertain.
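The order described above can be sketched as one filtering function that returns the renormalized distribution to sample from. This is a minimal illustration over a list of logits; filter_probs is a hypothetical helper, and top_k=0 is used here to mean "disabled".

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p; return renormalized probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]                      # keep the k most probable tokens
    total = sum(probs[i] for i in order)
    kept, cum = [], 0.0
    for i in order:                                # smallest prefix reaching mass top_p
        kept.append(i)
        cum += probs[i] / total
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out

logits = [2.0, 1.0, 0.0, -1.0]
only_top2 = filter_probs(logits, top_k=2)
nucleus = filter_probs(logits, top_p=0.5)
sharper = filter_probs(logits, temperature=0.5)
```

Sampling is then a single draw from the returned list, for example with random.choices.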

Question 61

Mixture-of-experts models can suffer from routing collapse where a few experts handle all tokens. What causes this, how do you detect it during training, and what loss terms or architectural choices prevent it?

Accepted Answer

Routing collapse happens because the router learns early that certain experts reduce loss faster and over-routes to them, starving the rest. Detection: monitor expert load distribution (tokens per expert) across training steps, looking for entropy collapse. The auxiliary load-balancing loss (used in Switch Transformer and Mixtral) penalizes unequal routing by encouraging uniform expert utilization. Expert capacity limits (dropping tokens above a per-expert cap) add a hard constraint. A strong answer mentions that Top-2 routing with the load balancing loss is the current standard.
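The auxiliary loss can be sketched in the form used by Switch Transformer: the number of experts times the sum over experts of (fraction of tokens routed to the expert) times (mean router probability for the expert). The version below assumes top-1 routing and takes per-token router probabilities as plain lists; load_balancing_loss is an illustrative name.

```python
def load_balancing_loss(router_probs):
    """router_probs[t][e] is the router probability of expert e for token t."""
    n_tokens, n_experts = len(router_probs), len(router_probs[0])
    counts = [0] * n_experts
    for p in router_probs:
        counts[max(range(n_experts), key=lambda e: p[e])] += 1   # top-1 assignment
    f = [c / n_tokens for c in counts]                           # routed fraction
    P = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, P))

balanced = load_balancing_loss([[0.6, 0.4], [0.4, 0.6]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.8, 0.2]])
```

The loss equals 1 under perfectly uniform routing and grows as tokens and probability mass concentrate on a few experts, so tracking it alongside tokens per expert gives an early signal of collapse.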

Question 62

Describe one concrete watermarking scheme for detecting LLM-generated text. How does the watermark get embedded without degrading generation quality, and what are its failure modes against a determined attacker?

Accepted Answer

The Kirchenbauer et al. red-green list watermark partitions the vocabulary into red (suppressed) and green (boosted) tokens at each position using a hash of the preceding context as a seed. During generation, logits for green tokens are slightly boosted, so generated text has a detectable statistical excess of green tokens. Detection runs a z-test on the green token count. Failure modes: paraphrasing attacks disrupt the hash sequence, translation removes it entirely, and short outputs have insufficient power to detect the signal. A strong answer notes that an attacker who knows the watermark algorithm can adapt.
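The detection statistic is a one-proportion z-test. A minimal sketch, assuming gamma is the fraction of the vocabulary placed on the green list at each position (0.5 here is an illustrative choice) and that the detector has already counted green tokens.

```python
import math

def watermark_z_score(green_count, total_tokens, gamma=0.5):
    """z = (observed green - expected green) / standard deviation under no watermark."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

unmarked = watermark_z_score(100, 200)     # exactly the expected count
marked = watermark_z_score(150, 200)       # clear excess of green tokens
short_text = watermark_z_score(15, 20)     # same green fraction, far fewer tokens
```

The same green fraction yields a much smaller z-score on short text, which is the low-power failure mode mentioned above.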

Question 63

Your training run on an 80GB A100 crashes with CUDA out-of-memory in the middle of epoch 2. Walk me through your debugging process: what tools do you use, what information do you collect, and what are the most common culprits you look for first?

Accepted Answer

First check torch.cuda.memory_summary() just before the OOM to see the reserved vs allocated breakdown. Use torch.profiler or Nsight Systems to identify which operation triggers the spike. Common culprits: activation accumulation when gradient checkpointing is off, evaluation buffers that grow because they are not cleared between batches, optimizer states that are not offloaded to CPU when using ZeRO, and large KV caches in generation steps during eval. A strong answer also mentions that memory can fragment over time even if peak usage is not obviously exceeded, making torch.cuda.empty_cache() a useful diagnostic step.

Question 64

ControlNet fine-tunes diffusion models on spatial conditioning signals like depth maps or edges. What is the purpose of the zero-convolution layers it introduces, and why would training fail or damage the pretrained model without them?

Accepted Answer

ControlNet copies the UNet encoder into a trainable branch and connects it back to the decoder via zero-initialized 1x1 convolutions. At the start of training, zero weights mean the conditioning branch contributes nothing, so the base model's outputs and gradients are preserved exactly. As training proceeds, the zero-convolution weights gradually move away from zero and learn to inject the conditioning signal. Without this, random initial weights from the conditioning branch would corrupt the pretrained decoder immediately, destroying the quality the base model learned.

Questions

Walk through the DDPM forward and reverse processes (Role-specific, medium, Very common)

Classifier-free guidance: mechanism and quality vs diversity tradeoff (Role-specific, medium, Very common)

LoRA: why low-rank decomposition works for fine-tuning (Role-specific, medium, Very common)

KV cache internals and memory growth during generation (Role-specific, medium, Very common)

DDIM sampling: determinism and step count tradeoffs (Role-specific, medium, Common)

RoPE positional encoding: why it works and context length extension (Role-specific, hard, Common)

Flash Attention: IO complexity and memory footprint improvement (Role-specific, hard, Common)

How the VAE bottleneck shapes latent diffusion quality (Role-specific, medium, Common)

Reward model hacking in RLHF and how to detect it (Role-specific, hard, Common)

DPO vs PPO: training stability and data requirements (Role-specific, hard, Common)

Gradient checkpointing: memory vs compute tradeoff (Role-specific, medium, Very common)

What the UNet skip connections are doing in a diffusion model (Role-specific, medium, Common)

FP8 training: stability challenges and loss scaling (Role-specific, hard, Occasional)

BPE tokenization failures and their impact on generation (Role-specific, medium, Common)

Speculative decoding: how the draft model accelerates generation (Role-specific, hard, Occasional)

How CLIP text conditioning is injected into a diffusion UNet (Role-specific, medium, Common)

DreamBooth vs textual inversion: what each actually learns (Role-specific, medium, Common)

FID score limitations for evaluating generative models (Role-specific, medium, Common)

NCCL collective operations and their latency in distributed training (Role-specific, hard, Occasional)

Batching strategy for maximizing throughput vs minimizing latency (Role-specific, medium, Common)