Q: When training a large diffusion model or LLM, Adam (or AdamW) is almost always preferred over SGD. Explain what the first and second moment estimates are tracking, why the per-parameter adaptive step size matters for transformer and diffusion UNet weight matrices specifically, and what goes wrong if you use vanilla SGD on a billion-parameter generative model.

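Adam maintains exponential moving averages of the gradient (first moment, a smoothed estimate of the gradient direction) and of the squared gradient (second moment, an estimate of the gradient's uncentered variance, i.e. its typical squared magnitude). After bias correction, the update for each parameter is learning_rate * first_moment / (sqrt(second_moment) + eps), so the step is normalized by the recent gradient magnitude rather than scaled by it. This gives each parameter its own effective learning rate, which is often described as a rough diagonal proxy for local curvature. It matters in transformers and diffusion UNets because embeddings, attention projections, normalization gains, and feedforward or convolution weights can see gradient scales that differ by orders of magnitude. With vanilla SGD a single global learning rate has to serve all of them: a rate small enough to keep the largest-gradient parameters stable leaves the rest barely moving, so training tends to be slow, or unstable if the rate is raised. SGD with momentum is sometimes preferred for convolutional models, where gradient scales are more uniform, when a carefully tuned LR schedule is available.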
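A minimal sketch of one Adam update for a single scalar parameter may make the normalization concrete. The function name `adam_step` and the hyperparameter defaults are illustrative assumptions, not taken from any particular library. It shows that two parameters whose gradients differ by six orders of magnitude take nearly the same first step under Adam, while plain SGD steps scale directly with the gradient.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v).

    Hypothetical helper for illustration; t is the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * grad         # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters with very different gradient scales, both starting at 0.
small = adam_step(0.0, 1e-3, 0.0, 0.0, t=1)
large = adam_step(0.0, 1e3, 0.0, 0.0, t=1)

# Plain SGD with the same learning rate, for comparison.
lr = 1e-3
sgd_small = 0.0 - lr * 1e-3
sgd_large = 0.0 - lr * 1e3

print("Adam steps:", small[0], large[0])
print("SGD steps: ", sgd_small, sgd_large)
```

Both Adam steps come out close to the learning rate in magnitude, whereas the SGD steps differ by the same factor as the gradients.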

Q: Explain what a denoising diffusion model is learning in plain terms. What function is the neural network approximating, and why is the iterative denoising approach necessary rather than just training a single-step generator?

The network is learning to predict the noise that was added to a clean sample to produce the noisy input at a given timestep. Up to a scale factor that depends on the noise level, this is equivalent to learning the score function (the gradient of the log probability density) of the noised data distribution at each noise level. Iterative denoising helps because mapping directly from pure noise to data in one step is an extremely complex function that is hard to learn reliably with a simple regression loss. Breaking it into small steps makes each step a simpler, more learnable denoising task and allows the model to course-correct across steps.
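The link between noise prediction and the score can be checked numerically in one dimension. The sketch below assumes the common DDPM-style parameterization x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps; the names `add_noise`, `score_from_noise`, and `alpha_bar` are illustrative. It only verifies the identity for the conditional distribution q(x_t | x0); a trained network approximates the corresponding quantity averaged over the data.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Forward process sample: returns the noisy value and the noise used."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def score_from_noise(eps, alpha_bar):
    """Score of q(x_t | x0) with respect to x_t, written in terms of the noise."""
    return -eps / math.sqrt(1.0 - alpha_bar)

rng = random.Random(0)
x0, alpha_bar = 2.0, 0.5
x_t, eps = add_noise(x0, alpha_bar, rng)

# Direct Gaussian score: -(x_t - mean) / variance.
direct = -(x_t - math.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)
print(score_from_noise(eps, alpha_bar), direct)
```

The two printed values agree, which is the sense in which predicting the noise and predicting the score differ only by a noise-level-dependent scale.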

Q: How does overfitting manifest differently when fine-tuning a large generative model compared to training a classifier? What regularization and monitoring techniques do you use to detect and prevent it?

In a classifier, overfitting usually shows up as a widening gap between training and validation accuracy. In generative fine-tuning it shows up as validation loss rising while training loss keeps falling, catastrophic forgetting of the base model's general capabilities, reduced diversity in outputs (mode collapse toward fine-tuning examples), and verbatim memorization of training examples. Monitoring: track validation loss and perplexity on a held-out set, and also run capability benchmarks from the base model to check for forgetting. Regularization: early stopping, a lower learning rate, fewer epochs, LoRA (which limits the capacity of the update), weight decay, and dropout on the fine-tuned layers (for example LoRA dropout).
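Early stopping on held-out loss can be sketched in a few lines. The helper name `early_stop_epoch`, the `patience` and `min_delta` parameters, and the loss values are all made up for illustration; a real run would also track the capability benchmarks mentioned above.

```python
def early_stop_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the best epoch once validation loss has failed to
    improve for `patience` consecutive epochs, or None if that never happens."""
    best, best_epoch, bad = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                return best_epoch
    return None

# Illustrative validation losses: improving, then drifting upward.
val = [2.10, 1.95, 1.90, 1.93, 1.99, 2.08]
print(early_stop_epoch(val))
```

In this made-up trace the checkpoint to keep is the one from epoch index 2, the last epoch before validation loss started rising.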

Q: KL divergence appears in both VAE training and RLHF objectives. Explain what KL(P||Q) is measuring, why the direction matters, and what it is penalizing in each context.

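KL(P||Q) measures the expected extra bits (or nats, with natural logs) needed to encode samples from P using a code designed for Q. It is asymmetric. When fitting Q to a fixed P, KL(P||Q) heavily penalizes Q for placing low mass on regions where P has high mass (mode-covering), while KL(Q||P) heavily penalizes Q for placing mass where P has low mass (mode-seeking). In VAEs, KL(approximate posterior||prior) pulls each encoded latent distribution toward the prior, which keeps the latent space compact and makes samples drawn from the prior decodable; a larger weight on this term is sometimes used to encourage disentanglement, but the term alone does not guarantee it. In RLHF, KL(policy||reference) is an expectation over the policy's own samples and penalizes the policy for drifting too far from the SFT model, which limits reward hacking.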
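The asymmetry is easy to see on a small discrete example. The sketch below uses made-up distributions: a bimodal `p`, a flat `q_cover`, and a single-mode `q_seek`. The forward direction KL(p||q) prefers the flat approximation, while the reverse direction KL(q||p) prefers the single-mode one.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as lists.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.49, 0.02, 0.49]        # bimodal target (illustrative numbers)
q_cover = [1 / 3, 1 / 3, 1 / 3]  # spreads mass over everything
q_seek = [0.96, 0.02, 0.02]   # commits to one mode

print("forward KL:", kl(p, q_cover), kl(p, q_seek))
print("reverse KL:", kl(q_cover, p), kl(q_seek, p))
```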

Q: Summarize the Chinchilla scaling laws. What does the paper conclude about the optimal relationship between model parameters and training tokens, and what are the practical limits of applying these laws to real training runs?

Chinchilla found that prior large models (GPT-3, Gopher) were significantly undertrained for their size, and that for a fixed compute budget the lowest loss comes from scaling parameters and training tokens in roughly equal proportion: as compute grows, both should grow at about the same rate, which works out to approximately 20 tokens per parameter (about 20N tokens for a model of N parameters). Practical limits: the laws predict loss on the pretraining distribution, not downstream task performance; inference compute cost is not in the objective, so inference-efficient smaller models trained longer (LLaMA style) may be practically superior; the law assumes a single training run, not staged or continued training.
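The 20-tokens-per-parameter rule can be turned into a back-of-the-envelope calculation. The sketch below assumes the commonly used approximation that training costs about 6 * N * D FLOPs for N parameters and D tokens; that approximation, the helper name `compute_optimal`, and the example budget are assumptions for illustration rather than values taken from the text above.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens), assuming
    flops ~= 6 * params * tokens and tokens = tokens_per_param * params."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal(5.76e23)      # illustrative budget
n4, d4 = compute_optimal(4 * 5.76e23)

print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
print(f"4x compute -> params x{n4 / n:.2f}, tokens x{d4 / d:.2f}")
```

Under these assumptions the example budget lands near 70 billion parameters and 1.4 trillion tokens, and quadrupling compute doubles both, which is what scaling the two in equal proportion means.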

Q: Modern large language models use SwiGLU activation in their feedforward layers rather than the GELU or ReLU used in earlier transformers. What is SwiGLU doing, and what empirical advantage does it have?

SwiGLU replaces the single up-projection of a standard feedforward layer with two parallel projections of the input: one passes through a Swish (SiLU) nonlinearity and acts as a gate, the other stays linear, and the two are multiplied elementwise before the down-projection. This gating mechanism gives the model more expressive control over which features pass through. Because the gated layer has three weight matrices instead of two, models using it typically set the hidden width to 8/3 * model_dim instead of 4 * model_dim so that parameter count and FLOPs match the ungated baseline. Empirically, at matched parameters and compute, SwiGLU reaches slightly lower loss than GELU or ReLU. The gain is modest but consistent enough that it is the standard choice in PaLM and LLaMA; Gemma uses the closely related GeGLU variant.
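A small pure-Python sketch of the gated feedforward and of the 8/3 sizing arithmetic follows. The weight names `w_gate`, `w_up`, and `w_down` and the width 3072 are illustrative assumptions; biases are omitted.

```python
import math

def silu(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feedforward: down( silu(gate(x)) * up(x) ), no biases."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def ffn_params(d_model, d_hidden, n_matrices):
    """Weight count for a feedforward block with n_matrices of size d_model x d_hidden."""
    return n_matrices * d_model * d_hidden

d = 3072
standard = ffn_params(d, 4 * d, 2)     # up + down at 4x width
gated = ffn_params(d, 8 * d // 3, 3)   # gate + up + down at 8/3 width
print(standard, gated)

out = swiglu_ffn([1.0, 2.0], [[1, 0], [0, 1]], [[1, 1], [1, -1]], [[1, 1]])
print(out)
```

With a width divisible by 3 the two parameter counts match exactly, which is the reason for the 8/3 factor.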

Q: What is the difference between causal (autoregressive) language modeling and masked language modeling (BERT-style), and when would you choose one over the other for a generative application?

Causal LM predicts each token from its left context only, so training matches the way text is produced at inference and directly supports generation. Masked LM replaces a random subset of tokens with a mask token and predicts them from bidirectional context, producing richer encoder representations but not directly supporting left-to-right generation without an additional decoder or an iterative unmasking procedure. For generation tasks (chat, image captioning, code completion) causal LM is standard. Masked LM encoders are better for discriminative tasks (classification, span extraction) where bidirectional context is available at inference time.
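The difference in training targets can be shown by building the examples each objective sees from the same token list. The helper names, the `[MASK]` string, and the masking probability are illustrative assumptions; real masked LM recipes add further details that are left out here.

```python
import random

def causal_examples(tokens):
    """(context, target) pairs: each token is predicted from its left context only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mlm_example(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Corrupt a random subset of positions; targets are the original tokens
    at the masked positions, predicted from the full corrupted sequence."""
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = ["the", "cat", "sat", "down"]
for context, target in causal_examples(tokens):
    print(context, "->", target)

corrupted, targets = mlm_example(tokens, mask_prob=0.3, rng=random.Random(0))
print(corrupted, targets)
```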

Q: Modern large language models have moved from LayerNorm to RMSNorm, and diffusion UNets use GroupNorm in the convolutional blocks. Explain why batch normalization is unsuitable for autoregressive generation, what LayerNorm provides instead, and why RMSNorm has become the default in generative transformer models despite dropping the mean-centering step.

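Batch norm computes statistics across the batch dimension, which requires a sufficiently large batch to be stable and creates a dependency between training samples. At inference with a single sample (or with variable sequence lengths in a batch), per-batch statistics are unstable, so the layer has to fall back on running mean/variance estimates from training that may not match the sequence being generated. Layer norm normalizes across the feature dimension independently per token, making statistics independent of batch size and sequence length. This is critical for autoregressive decoding where each step may have batch size one. RMSNorm drops the mean-centering step and the bias and rescales each token's features by their root mean square only. It is slightly cheaper to compute and in practice trains about as well as LayerNorm, which is the usual explanation for why it has become the default in recent generative transformers.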
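The per-token normalizations named in the question can be compared directly. This is a minimal sketch over one token's feature vector, with the learned gain and bias omitted and the `eps` default chosen for illustration. It shows that LayerNorm centers and rescales, RMSNorm only rescales, and the two coincide when the features already have zero mean.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's features to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def rms_norm(x, eps=1e-5):
    """Rescale one token's features by their root mean square (no centering, no gain)."""
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(ms + eps) for xi in x]

x = [1.0, 2.0, 3.0, 6.0]
print(layer_norm(x))
print(rms_norm(x))

centered = [-1.0, 0.0, 1.0]
print(layer_norm(centered), rms_norm(centered))
```

Neither function looks at any other token or any other sequence in the batch, which is the property that makes both usable at batch size one.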

Question 1

Explain the core components of a transformer as used in modern generative models: multi-head self-attention, the feedforward sublayer, layer normalization, and residual connections. How do the specific design choices in production generative models (pre-norm rather than post-norm, RMSNorm over LayerNorm, SwiGLU feedforward, GQA over MHA) change the picture from the original architecture?

Accepted Answer

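Attention computes context-dependent weighted averages across the sequence, allowing each position to integrate information from anywhere. Multi-head splits attention into parallel subspaces, learning different interaction types. The feedforward sublayer is a per-position MLP that transforms representations nonlinearly after attention has mixed token information. Layer norm keeps activation scales stable across layers, which prevents exploding or vanishing gradients in deep stacks. Residual connections allow gradients to flow directly through depth, enabling training of very deep models. In production generative models, pre-norm places the normalization before each sublayer so the residual path stays an identity, which makes deep stacks easier to train than the original post-norm layout; RMSNorm replaces LayerNorm with a cheaper rescaling by root mean square; SwiGLU replaces the ReLU feedforward with a gated one; and grouped-query attention (GQA) shares each key/value head across a group of query heads, shrinking the KV cache and memory traffic at inference relative to full multi-head attention (MHA).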
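The "context-dependent weighted average" description corresponds to a few lines of code. This is a minimal single-head sketch in pure Python with no masking, no learned projections, and made-up inputs; the function names are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, lists of vectors in and out."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # one weight per position, summing to 1
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Identical keys give uniform weights, so the output is the plain mean of the values.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
print(attention([[5.0, 5.0]], keys, values))
```

When the keys differ, the weights shift toward the positions whose keys best match the query, which is what makes the average context dependent.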

Question 2

Explain why cross-entropy is the standard training objective for autoregressive language models. What quantity is the model actually learning to approximate, and how does minimizing cross-entropy relate to maximizing likelihood?

Accepted Answer

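Cross-entropy minimization is equivalent to maximizing the log likelihood of the training data under the model's learned conditional distribution P(x_t | x_{<t}): the loss is the average negative log probability the model assigns to each observed next token. Because the entropy of the data is fixed, minimizing it also minimizes the KL divergence from the data distribution to the model, so the model learns to approximate the true conditional distribution of the next token given context. Perplexity is exp(cross-entropy), with cross-entropy averaged per token in nats, and measures how surprised the model is by the data on average. A strong answer notes that minimizing cross-entropy on web-scale data corresponds to a form of compression, and that the objective does not explicitly penalize hallucination.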
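The relationship between cross-entropy, likelihood, and perplexity can be checked with a few numbers. The sketch below takes the probabilities a hypothetical model assigned to the observed next tokens (made-up values) and computes the average negative log likelihood and its exponential.

```python
import math

def cross_entropy(probs_of_target):
    """Average negative log likelihood in nats per token, given the model's
    probability for each observed next token."""
    return -sum(math.log(p) for p in probs_of_target) / len(probs_of_target)

def perplexity(probs_of_target):
    return math.exp(cross_entropy(probs_of_target))

probs = [0.5, 0.25, 0.125, 0.5]   # illustrative per-token probabilities
ce = cross_entropy(probs)
sequence_likelihood = math.prod(probs)

print(ce, perplexity(probs))
# Lower cross-entropy is the same thing as higher sequence likelihood.
print(sequence_likelihood, math.exp(-len(probs) * ce))
```

A model that spreads probability uniformly over four choices at every step has perplexity 4, which is one way to read the number.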
