As asked
Explain the core components of a transformer as used in modern generative models: multi-head self-attention, the feedforward sublayer, layer normalization, and residual connections. How do the specific design choices in production generative models (pre-norm rather than post-norm, RMSNorm over LayerNorm, SwiGLU feedforward, GQA over MHA) change the picture from the original architecture?
Sample answer outline
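Self-attention computes, for each position, a context-dependent weighted average of value vectors across the sequence, so any position can integrate information from any other (from any earlier position, in a causally masked generative model). Multi-head attention splits the model dimension into parallel subspaces, each with its own attention pattern, so different heads can learn different interaction types. The feedforward sublayer is a per-position MLP that transforms each representation nonlinearly after attention has mixed information across tokens. Layer normalization rescales each position's activations to a consistent scale, which keeps activation and gradient magnitudes under control in deep stacks. Residual connections add each sublayer's output back to its input, giving gradients a direct path through depth and making very deep models trainable. Production generative models keep this skeleton but change the details: pre-norm normalizes the sublayer input instead of the result of the residual addition, which generally trains more stably at depth; RMSNorm drops LayerNorm's mean subtraction and bias and rescales by the root mean square only; SwiGLU replaces the two-matrix MLP with a gated unit that uses three weight matrices; and GQA lets groups of query heads share key and value heads, which shrinks the KV cache at inference time.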
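To make the outline concrete, here is a minimal NumPy sketch of one decoder block that combines the four design choices named in the question: pre-norm residual wiring, RMSNorm, grouped-query attention, and a SwiGLU feedforward. All function names, parameter keys, shapes, and the initialization scale are assumptions chosen for illustration; it handles a single unbatched sequence and omits positional encoding, dropout, and KV caching, so treat it as a sketch rather than any particular model's implementation.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: divide by the root mean square over features.
    # Unlike LayerNorm there is no mean subtraction and no bias.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gqa_attention(x, wq, wk, wv, wo, n_heads, n_kv_heads):
    # x: (T, d). Queries get n_heads heads; keys/values get n_kv_heads heads.
    T, d = x.shape
    hd = d // n_heads
    q = (x @ wq).reshape(T, n_heads, hd).transpose(1, 0, 2)     # (H, T, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd).transpose(1, 0, 2)  # (Hkv, T, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd).transpose(1, 0, 2)  # (Hkv, T, hd)
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)  # each K/V head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)             # (H, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)          # causal mask
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v                                   # (H, T, hd)
    return out.transpose(1, 0, 2).reshape(T, d) @ wo

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU: SiLU(x W_gate) gates (x W_up) elementwise, then project down.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))
    return (silu * (x @ w_up)) @ w_down

def pre_norm_block(x, p, n_heads, n_kv_heads):
    # Pre-norm: normalize only the sublayer input; the residual path is untouched.
    h = rms_norm(x, p["g1"])
    x = x + gqa_attention(h, p["wq"], p["wk"], p["wv"], p["wo"], n_heads, n_kv_heads)
    x = x + swiglu_ffn(rms_norm(x, p["g2"]), p["w_gate"], p["w_up"], p["w_down"])
    return x

def init_params(d, n_heads, n_kv_heads, d_ff, seed=0):
    # Hypothetical small random initialization, for demonstration only.
    rng = np.random.default_rng(seed)
    hd = d // n_heads
    w = lambda m, n: rng.normal(0.0, 0.02, size=(m, n))
    return {
        "g1": np.ones(d), "g2": np.ones(d),
        "wq": w(d, d), "wo": w(d, d),
        "wk": w(d, n_kv_heads * hd), "wv": w(d, n_kv_heads * hd),
        "w_gate": w(d, d_ff), "w_up": w(d, d_ff), "w_down": w(d_ff, d),
    }

params = init_params(d=16, n_heads=4, n_kv_heads=2, d_ff=32)
x = np.random.default_rng(1).normal(size=(5, 16))  # 5 tokens, model dim 16
y = pre_norm_block(x, params, n_heads=4, n_kv_heads=2)
print(y.shape)
```

Note that the key and value projections here are half the width of the query projection: with 4 query heads and 2 KV heads, only 2 heads' worth of keys and values would need to be cached per token, which is where the GQA memory saving comes from.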
Expect these follow-ups
- What is the difference between pre-norm and post-norm layer normalization, and which is more stable during training?
- Why is the feedforward hidden dimension typically 4x the model dimension?
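Both follow-ups can be grounded with a few lines of plain Python. The sketch below contrasts the two normalization orderings on a generic sublayer, then counts feedforward weights to show why a SwiGLU layer is commonly sized at roughly 8/3 of the model dimension when the goal is to match the parameter budget of a 4x two-matrix MLP. The function names and the example dimension of 4096 are illustrative assumptions, and biases are ignored.

```python
def post_norm_step(x, sublayer, norm):
    # Original ordering: normalize after the residual addition,
    # so the normalization sits on the main path through the network.
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm):
    # Pre-norm ordering: normalize only the sublayer input;
    # the residual path remains a pure identity.
    return x + sublayer(norm(x))

def mlp_params(d_model, d_ff):
    # Two weight matrices: up projection and down projection.
    return 2 * d_model * d_ff

def swiglu_params(d_model, d_ff):
    # Three weight matrices: gate, up, and down projections.
    return 3 * d_model * d_ff

d = 4096  # hypothetical model dimension
print(mlp_params(d, 4 * d))          # two-matrix MLP at 4x: 8 * d^2 weights
print(swiglu_params(d, 8 * d // 3))  # SwiGLU at about (8/3)x: nearly the same
```

With a sublayer that outputs zero, `pre_norm_step` returns its input unchanged while `post_norm_step` still rescales it, which is one way to see why pre-norm leaves a cleaner gradient path through depth.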