Question 1

Explain the bias-variance tradeoff. How does it manifest when you are choosing between a linear model and a deep neural network for a tabular dataset with 10,000 examples?

Accepted Answer

Bias is the systematic error from a model's assumptions; variance is its sensitivity to fluctuations in the training data. A linear model has low variance but high bias if the true relationship is nonlinear. A deep network has low bias but high variance and tends to overfit 10,000 examples unless it is regularized. With 10k tabular examples, gradient boosted trees or small, well-regularized neural nets often outperform large deep models because they control variance better on small data. The right choice depends on how complex the feature-target relationship is.

Question 2

Explain exactly what batch normalization does during training and inference. Why does it speed up training, and what are the failure modes in small batch or sequence model settings?

Accepted Answer

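During training, BN normalizes each feature (or channel) to zero mean and unit variance using the statistics of the current mini-batch, then applies a learned scale gamma and shift beta. The original paper attributes the speedup to reduced internal covariate shift; later work argues the main effect is a smoother loss landscape. Either way, it permits higher learning rates. During inference, it uses running estimates of the mean and variance accumulated during training, not the batch statistics. It fails with very small batches because the batch statistics are too noisy, and in RNNs because activation statistics vary across timesteps and sequence lengths. Layer normalization is the standard fix for sequence models; group normalization is a common alternative for small-batch convolutional networks.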
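
The train/inference split described above can be sketched for a single feature in plain Python. This is a minimal illustration, not a framework implementation: the function names, the `running` dictionary, and the momentum value of 0.1 are assumptions made for the example, and the running variance is updated with the biased batch variance for simplicity.

```python
import math

def batchnorm_train(batch, gamma, beta, running, momentum=0.1, eps=1e-5):
    """Normalize one feature over a batch and update the running statistics."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
    out = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
    # Exponential moving averages, kept for use at inference time.
    running["mean"] = (1 - momentum) * running["mean"] + momentum * mean
    running["var"] = (1 - momentum) * running["var"] + momentum * var
    return out

def batchnorm_eval(batch, gamma, beta, running, eps=1e-5):
    """Inference path: uses the stored running statistics, not the batch."""
    denom = math.sqrt(running["var"] + eps)
    return [gamma * (x - running["mean"]) / denom + beta for x in batch]

running = {"mean": 0.0, "var": 1.0}
train_out = batchnorm_train([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0, running=running)
eval_out = batchnorm_eval([2.5], gamma=1.0, beta=0.0, running=running)
```

Note that `batchnorm_eval` works on a batch of one, which is exactly the case where the training path would break down.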

Question 3

Explain the mathematical difference between L1 and L2 regularization and what it implies about the sparsity of the learned weights. When would you choose one over the other in practice?

Accepted Answer

L2 adds a penalty proportional to the squared magnitude of the weights, so its gradient shrinks with the weight and pulls weights toward zero but rarely exactly to zero. L1 adds a penalty proportional to the absolute magnitude, and because its (sub)gradient has constant magnitude no matter how small the weight is, it can push weights exactly to zero, producing sparse models. L1 is useful for feature selection when you expect most features to be irrelevant. L2 is more stable when features are correlated. Elastic net combines both. In neural networks, L2 weight decay is most common; explicit L1 is rare because sparsity is usually handled better with structured pruning.
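
A small sketch of why the two penalties behave differently near zero. It applies only the penalty term (no data loss) to a single weight; the L1 update uses a proximal soft-threshold step, which is a choice made for this example rather than something the answer prescribes, and the learning rate and penalty strength are arbitrary.

```python
def l2_step(w, lr, lam):
    """Gradient step on lam * w**2: multiplicative shrink, never exactly zero."""
    return w * (1 - 2 * lr * lam)

def l1_prox_step(w, lr, lam):
    """Soft-threshold step for lam * |w|: constant pull, clipped at zero."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0

w_l2 = w_l1 = 0.5
for _ in range(100):
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.1)
    w_l1 = l1_prox_step(w_l1, lr=0.1, lam=0.1)

print("L2 weight:", w_l2)  # small but still positive
print("L1 weight:", w_l1)  # exactly zero
```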

Question 4

What is the difference between ROC-AUC and precision-recall AUC, and in what situations does ROC-AUC give a misleadingly optimistic picture?

Accepted Answer

The ROC curve plots true positive rate against false positive rate at all thresholds, and ROC-AUC is the area under it. When negative examples vastly outnumber positives, a model can have a high TPR while its FPR remains low even with many false positives, because the FPR denominator (all actual negatives, FP + TN) is large. The PR curve plots precision against recall, and precision compares false positives with true positives rather than with the size of the negative class, so it degrades quickly as false positives accumulate. For fraud detection or rare disease classification at a 0.1 percent positive rate, PR-AUC is the more informative metric.
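
To make the imbalance effect concrete, here is a worked example with hypothetical confusion-matrix counts (one million cases at a 0.1 percent positive rate; the counts are invented for illustration). A single operating point that looks excellent on the ROC axes has poor precision.

```python
def rates(tp, fp, fn, tn):
    """Return (TPR/recall, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision

# Hypothetical counts: 1,000 positives and 999,000 negatives.
tpr, fpr, precision = rates(tp=900, fp=9_990, fn=100, tn=989_010)
print(f"TPR={tpr:.3f} FPR={fpr:.3f} precision={precision:.3f}")
```

Here the FPR is only 1 percent, yet fewer than one in ten flagged cases is a true positive.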

Question 5

Walk me through the key components of a transformer encoder: what each component does, the dimensionalities at each step for an input of (batch, seq, d_model), and what is lost if you remove positional encodings.

Accepted Answer

The embedding layer maps token IDs to (batch, seq, d_model). Positional encodings add position information because self-attention on its own is permutation-equivariant: shuffling the input tokens only shuffles the outputs. Multi-head attention projects to Q, K, V of (batch, heads, seq, d_head) with d_head = d_model / heads, computes scaled dot-product attention with scores of shape (batch, heads, seq, seq), and concatenates heads back to (batch, seq, d_model). A feed-forward sublayer applies two linear transformations with a nonlinearity, expanding to (batch, seq, d_ff) and projecting back. Layer norm and residual connections wrap both sublayers. Without positional encodings, the model cannot distinguish word order: 'dog bites man' and 'man bites dog' produce the same per-token representations in a different order, and any pooled sentence representation is identical.
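
The dimensionalities can be traced with a shape-only sketch. It performs no tensor math; the function name, the dictionary keys, and the example sizes (d_model = 512, 8 heads, d_ff = 2048) are assumptions chosen for illustration.

```python
def encoder_shapes(batch, seq, d_model, heads, d_ff):
    """Trace tensor shapes through one encoder layer (shapes only, no math)."""
    assert d_model % heads == 0, "d_model must split evenly across heads"
    d_head = d_model // heads
    return {
        "embeddings + positions": (batch, seq, d_model),
        "Q, K, V per head": (batch, heads, seq, d_head),
        "attention scores": (batch, heads, seq, seq),
        "attention output per head": (batch, heads, seq, d_head),
        "concatenated + output projection": (batch, seq, d_model),
        "feed-forward hidden": (batch, seq, d_ff),
        "layer output": (batch, seq, d_model),
    }

shapes = encoder_shapes(batch=2, seq=128, d_model=512, heads=8, d_ff=2048)
for name, shape in shapes.items():
    print(f"{name:34s} {shape}")
```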

Question 6

How do you detect that a neural network is overfitting, and what interventions do you apply in what order?

Accepted Answer

The primary signal is a growing gap between training and validation loss over epochs, typically with validation loss flattening or rising while training loss keeps falling. First verify the gap is real and not due to data leakage or a train/validation distribution mismatch. Interventions in rough priority order: increase training data, reduce model capacity, add dropout, add weight decay, use early stopping, and apply data augmentation. For large pretrained models, fine-tuning only the last few layers or using LoRA limits overfitting on small datasets. Learning rate warmup can also help indirectly by stabilizing early updates, although it is primarily an optimization aid rather than a regularizer.

Question 7

Walk me through the memory hierarchy in a modern AMD GPU, from the farthest off-chip memory to the closest on-chip storage a shader sees. How does each level affect kernel performance?

Accepted Answer

A strong answer covers HBM2e or HBM3 at the board level (high bandwidth but high latency), the L2 cache shared across all compute units, the L1/texture cache per compute unit, Local Data Share (LDS, AMD's name for shared memory) inside each compute unit, and the vector and scalar register files (VGPRs and SGPRs) that serve the SIMD units inside each compute unit. The candidate should quantify typical bandwidth numbers for RDNA3 or CDNA2, explain cache line sizes, and name the access patterns that cause bank conflicts in LDS.

Question 8

Claude's behavior is shaped by three parties: Anthropic, operators who access the API to build products, and end users. Explain how Claude is supposed to handle conflicts between instructions from each of these parties, and give a concrete example of a conflict and the expected resolution.

Accepted Answer

Anthropic's values and guidelines take highest precedence and are baked into the model via training. Operators can customize behavior within Anthropic's policies via the system prompt, such as restricting topics or enabling adult content on appropriate platforms. Users can adjust within what the operator permits. A conflict example: an operator says 'never discuss competitor products' but a user asks about them. Claude should follow the operator restriction because it has a plausible business rationale and is not an attempt to harm the user. Claude should not follow operator instructions that actively harm users or deceive them in damaging ways.

Question 9

Compare stochastic gradient descent with momentum to Adam. In what settings does SGD with momentum outperform Adam on a test set, and what does this say about generalization?

Accepted Answer

Adam adapts per-parameter learning rates using first and second moment estimates, which makes it easier to train and more forgiving of learning rate choice. However, several papers have shown that SGD with momentum and a well-tuned learning rate achieves better generalization on image classification benchmarks, possibly because Adam finds sharper minima. The intuition is that SGD's uniform step size leads to flatter minima that generalize better. In practice, Adam is preferred for transformers and NLP, while SGD is competitive for CNNs with careful tuning.

Question 10

Explain what it means for a model to be well-calibrated. How do you measure calibration and what techniques improve it?

Accepted Answer

A calibrated model produces predicted probabilities that match empirical frequencies: if it predicts 0.8 probability for 100 examples, about 80 should actually be positive. Measure calibration with a reliability diagram or Expected Calibration Error (ECE). Common miscalibrations: modern neural networks are overconfident; Naive Bayes pushes probabilities toward 0 and 1 because its independence assumption double-counts correlated evidence; max-margin models such as SVMs tend to be underconfident. Platt scaling fits a logistic regression on predicted scores on a held-out set. Temperature scaling is a single-parameter version that divides logits by a learned temperature and is highly effective for neural networks.
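
A minimal sketch of Expected Calibration Error for the binary case, assuming equal-width bins over the predicted positive-class probability. The bin count and the toy inputs are illustrative choices, not a standard setting from the answer.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: weighted mean of |observed positive rate - mean predicted prob| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(frac_pos - avg_conf)
    return ece

# Ten predictions of 0.8 with 8 positives: the bin is calibrated.
well = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
# Same predictions with only 5 positives: overconfident by about 0.3.
over = expected_calibration_error([0.8] * 10, [1] * 5 + [0] * 5)
```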

Question 11

Explain how Word2Vec CBOW or Skip-gram learns embeddings, and intuitively why vectors that capture semantic relationships emerge from a simple prediction task on text.

Accepted Answer

Skip-gram trains a shallow network to predict surrounding context words given a center word; CBOW does the reverse, predicting the center word from its context. Words appearing in similar contexts get similar gradient updates, so their vectors converge to similar directions. The resulting geometry reflects distributional semantics: words used similarly are nearby. The famous vector analogy, king - man + woman landing nearest to queen, emerges because the difference vector encodes the gender relation roughly consistently across many word pairs in the training data.

Question 12

In what situations does k-fold cross-validation give an overly optimistic estimate of generalization performance for an ML model?

Accepted Answer

K-fold is misleading when the data has temporal ordering and future data leaks into the training folds: use a time-series split instead. It overestimates performance when hyperparameter selection was done on the same folds used for evaluation: use nested cross-validation. It overestimates when duplicate or near-duplicate examples, or rows from the same group (patient, user, session), appear in both train and validation folds: deduplicate or use a grouped split. Separately, when the dataset is too small, variance across folds is high and the estimate is unstable in either direction rather than reliably optimistic.
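
The temporal-leakage case can be shown with index arithmetic alone. The sketch below assumes rows are already sorted by time; both helper functions are hypothetical names written for this example, and the expanding-window scheme is one common form of time-series split.

```python
def kfold_indices(n, k):
    """Plain k-fold on time-ordered rows: training rows can come after validation rows."""
    fold = n // k
    for i in range(k):
        val = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in val]
        yield train, val

def expanding_window_indices(n, k):
    """Time-ordered split: always train on the past, validate on the next block."""
    fold = n // (k + 1)
    for i in range(1, k + 1):
        train = list(range(0, i * fold))
        val = list(range(i * fold, (i + 1) * fold))
        yield train, val

# True means some training row is later in time than a validation row.
leaky = [max(train) > min(val) for train, val in kfold_indices(12, 3)]
# True means every training row precedes every validation row.
safe = [max(train) < min(val) for train, val in expanding_window_indices(12, 3)]
```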

Question 13

You have a CUDA kernel that uses warp-level primitives like __shfl_down_sync and warp shuffle reductions. Walk me through how you would port that kernel to HIP for AMD GPUs, and what differences you would watch for.

Accepted Answer

A strong answer explains that AMD's wavefront is 64 lanes wide on CDNA hardware (MI series), not 32 like an NVIDIA warp, which means warp-level primitives need a mask or size change. In HIP, __shfl_down_sync has traditionally been translated to __shfl_down (recent ROCm releases also provide _sync variants that take a 64-bit mask), and the implicit warp size assumption of 32 must be replaced with warpSize or the wavefront size constant. The candidate should mention hipify-perl or hipify-clang for automated translation and flag the cases those tools miss, such as warp-size-dependent logic and texture fetch intrinsics. They should note that RDNA3 consumer GPUs use 32-lane wave32 as the default mode (unlike CDNA, which runs wave64), so code targeting both GPU families must query or configure the wavefront size explicitly rather than assuming either value.

Question 14

Explain compute unit occupancy on AMD GPUs. How do register usage, LDS usage, and wavefront count interact, and how would you diagnose and improve low occupancy on a matrix multiplication kernel?

Accepted Answer

A strong answer defines occupancy as the ratio of active wavefronts to the maximum the hardware can support per compute unit. It explains that each register file is a fixed-size pool shared across wavefronts: more registers per thread means fewer wavefronts fit. LDS is similarly partitioned among the workgroups resident on a compute unit. The candidate should mention using rocprof or Omniperf to read actual occupancy, using __launch_bounds__ or the __attribute__((amdgpu_num_vgpr(N))) hint to cap register usage, and explain the occupancy-latency hiding tradeoff: high occupancy hides memory latency, but too many wavefronts fighting for cache, or register caps that force spilling, can hurt performance.

Question 15

What is the key architectural difference between the MI300X and previous MI200-series cards, and why does that difference matter for large model inference?

Accepted Answer

A strong answer explains that MI300X is a 3D-stacked chiplet design: eight accelerator dies (XCDs) stacked on four I/O dies, presented to software as a single GPU with 192 GB of HBM3 in one unified memory pool. The APU variant with CPU cores on the same package is the MI300A; MI300X itself still attaches to the host over PCIe. The candidate should contrast this with MI250X, which uses two GCDs connected by Infinity Fabric, each with its own HBM stacks and each exposed as a separate device, explaining how MI300X's larger unified memory pool lets large LLMs be served on one device and reduces the need for tensor parallelism across multiple cards. Mention of 5.3 TB/s peak HBM3 bandwidth and how that affects memory-bound decoding and attention layer throughput is a differentiator.

Question 16

You have a HIP kernel running slower than the roofline model predicts. Describe your profiling workflow using AMD's tooling to identify the bottleneck, from first suspicion to a confirmed fix.

Accepted Answer

A strong answer walks through rocprof for a first-pass timeline and hardware counter collection, then Omniperf for a detailed roofline analysis that separates compute-bound from memory-bound bottlenecks. The candidate should mention specific Omniperf panels (L2 cache efficiency, LDS utilization, vector ALU utilization), explain how to read the roofline chart to see which arithmetic intensity puts the kernel in which bound region, and describe the fix pattern for each: loop unrolling for compute-bound, coalescing or prefetching for memory-bound, LDS padding for bank conflicts.

Question 17

Enterprise customers embed Claude in their products and pass untrusted user content in the prompt. Describe a realistic prompt injection attack against such a system and at least three layers of defense you would build into the platform.

Accepted Answer

A strong answer describes the attack concretely: malicious text in user-supplied input instructs the model to ignore the system prompt and exfiltrate or reveal sensitive data. Defenses include structural separation of trusted system instructions from untrusted user content using clearly delimited sections, model-level training to treat differently-sourced content with appropriate skepticism, output classifiers that detect anomalous responses like data exfiltration patterns, and application-level monitoring for unusual tool call or action patterns. The candidate should note there is no single silver bullet and defense-in-depth is the right framing.

Question 18

Claude's extended thinking feature lets you allocate a token budget for the model to reason before answering. How would you advise a customer on choosing that budget, and what engineering trade-offs does a large thinking budget create for the serving infrastructure?

Accepted Answer

Budget choice depends on task complexity: simple factual questions get no benefit from large budgets; hard multi-step reasoning or coding tasks may saturate at a certain budget and see diminishing returns. Infra trade-offs include longer time-to-first-token for the user, higher GPU memory occupancy per request blocking other requests, and higher cost per call. A strong answer also notes the budget must be chosen per use case rather than globally, and that Anthropic's evals can help identify where returns flatten.

Question 19

When Anthropic prepares a model card or responsible scaling policy evaluation before a major Claude release, what categories of harm do they assess, and how do those assessments influence the release decision?

Accepted Answer

Anthropic's model cards and the Responsible Scaling Policy cover uplift to CBRN weapons, cyberoffense capability, persuasion and deception at scale, and autonomy risks from agentic systems. Assessments use red-teaming, structured threat modeling, and third-party auditors. The RSP defines safety levels and stipulates what mitigations must be in place before a model can be deployed at a given capability tier. A strong answer shows the candidate has read the actual RSP or model cards rather than speaking in generalities.

Question 20

Anthropic is about to ship a new capability that lets Claude autonomously browse the web. You are on the red team. Describe your methodology for finding failure modes, what threat models you would focus on, and how you would prioritize findings for the release decision.

Accepted Answer

A strong answer starts with threat modeling: what new attack surfaces does web browsing open? Prompt injection via malicious web pages is the first-order risk. Other threats include data exfiltration, click fraud, account compromise if credentials are accessible, and the model being manipulated into taking harmful actions in the real world. Methodology includes structured attack scenarios, open-ended exploration by diverse red teamers including domain experts, and automated probing with adversarial page content. Findings are prioritized by impact times likelihood, with any critical path to CBRN uplift or mass harm blocking release.

Question 21

A customer wants a single governance layer across multiple Databricks workspaces in different cloud regions. They also want to know which notebooks and jobs read from or write to a given Delta table. How does Unity Catalog address both of those requirements?

Accepted Answer

Unity Catalog provides a three-level namespace (catalog, schema, table) that is shared across all workspaces attached to the same metastore. Permissions are granted once at the catalog or schema level and apply in every attached workspace. Because a metastore is scoped to a region, workspaces in different regions attach to different metastores, and data is shared across them with Delta Sharing. For lineage, Unity Catalog captures column-level and table-level lineage automatically from Spark, SQL, and Delta operations, storing which queries, notebooks, jobs, and users read or wrote each table. This lineage is queryable through the Unity Catalog UI and API without any manual tagging by engineers.

Question 22

Explain how the MESI protocol keeps caches coherent across cores on an Intel multi-core processor. Walk me through what happens at the hardware level when two cores both have a cache line in the Shared state and one of them tries to write to it.

Accepted Answer

A strong answer traces the write from the Shared state: the writing core sends an Invalidation request on the interconnect, the other core transitions its line to Invalid, the writing core transitions to Modified, and the cache controller handles snoop responses before the store completes. The candidate should mention what happens if the Modified owner has not flushed before another core requests the line (write-back or intervention), and ideally connect this to software consequences like false sharing.
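
The Shared-to-Modified transition can be sketched as a tiny state update for one cache line. This is a toy model under simplifying assumptions: it ignores snoop acknowledgements, store buffers, and the interconnect, and the function name and message strings are invented for the example.

```python
def write_from_shared(states, writer):
    """Apply a store by `writer` to a line that it holds in the Shared state.

    `states` maps core id -> one of "M", "E", "S", "I" for a single cache line.
    """
    if states[writer] != "S":
        raise ValueError("this sketch only models a write that starts in Shared")
    messages = []
    for core, state in states.items():
        if core != writer and state == "S":
            messages.append(f"invalidate -> core {core}")
            states[core] = "I"  # other sharers drop their copy
    states[writer] = "M"  # writer now holds the only, dirty copy
    return messages

line = {0: "S", 1: "S", 2: "I"}
log = write_from_shared(line, writer=0)
print(line, log)
```

If core 1 then writes to a different variable on the same line, it must fetch the line back and invalidate core 0, which is the ping-pong pattern behind false sharing.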

Question 23

You run a CUDA kernel and Nsight Compute reports 8 TFLOPS of achieved throughput on an A100 that has 19.5 TFLOPS FP32 peak and 2 TB/s memory bandwidth. What does this tell you about the kernel's bottleneck, and how would you confirm it?

Accepted Answer

A strong answer explains the roofline model: the ridge point is at peak FLOPS divided by peak bandwidth, roughly 9.75 FLOP/byte for A100. If the kernel's arithmetic intensity falls below that ridge, it is memory-bound; above it, compute-bound. 8 TFLOPS at only 41% of peak suggests either the kernel is memory-bound and the bandwidth ceiling is actually what limits it, or it is compute-bound but has high warp stall due to data dependencies. Confirmation steps include checking Nsight Compute's 'Memory Throughput %' and 'SM Active Cycles vs Eligible Cycles', and looking at 'Warp State Statistics' to see if stalls are on L1TEX, L2, or DRAM. The candidate should also mention that Tensor Core utilization matters for mixed-precision workloads.
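
The ridge-point arithmetic can be checked with a few lines. The peak numbers come from the question; the function name and the idea of solving for the arithmetic intensity at which the memory roof equals 8 TFLOPS are illustrative additions.

```python
def roofline(peak_flops, peak_bw, arithmetic_intensity):
    """Return (ridge point, attainable FLOP/s, bound) for a simple roofline."""
    ridge = peak_flops / peak_bw
    attainable = min(peak_flops, peak_bw * arithmetic_intensity)
    bound = "memory" if arithmetic_intensity < ridge else "compute"
    return ridge, attainable, bound

PEAK_FLOPS = 19.5e12  # FP32 peak from the question, FLOP/s
PEAK_BW = 2.0e12      # memory bandwidth from the question, bytes/s

# Arithmetic intensity at which the memory roof equals the observed 8 TFLOPS.
ai_at_8_tflops = 8e12 / PEAK_BW
ridge, attainable, bound = roofline(PEAK_FLOPS, PEAK_BW, ai_at_8_tflops)
print(ridge, ai_at_8_tflops, attainable, bound)
```

So if the profiler reports an arithmetic intensity near 4 FLOP/byte, the kernel is already sitting on the bandwidth roof; an intensity well above the ridge point would instead point to stalls inside the SMs.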

Question 24

A team has a PyTorch model for real-time video classification that is too slow at inference. Walk me through how you would use TensorRT to optimize it, including the export steps, precision choices, and how you would validate accuracy after optimization.

Accepted Answer

A strong answer covers: (1) exporting the model to ONNX using torch.onnx.export with dynamic axes for variable batch and sequence sizes; (2) building a TensorRT engine using the ONNX parser, setting the optimization profile for input shapes, and choosing INT8 or FP16 precision; (3) for INT8, running calibration with a representative dataset to build a calibration cache; (4) serializing the engine and loading it at runtime with a TRT IRuntime; (5) validating accuracy by comparing logits from PyTorch FP32 and TRT INT8 on a held-out set, and checking for ops unsupported by TRT that require a custom plugin. The candidate should note that Polygraphy or trtexec are standard tools for this pipeline.

Question 25

Standard self-attention on a sequence of length N uses O(N squared) memory. Explain the core insight behind FlashAttention that reduces HBM memory reads and writes, and why this matters for training large models on NVIDIA GPUs.

Accepted Answer

A strong answer explains that standard attention materializes the full N-by-N attention matrix in HBM, which at N=8192 is about 128 MB per head in FP16 and grows to gigabytes or tens of gigabytes per layer once multiplied by the head count and batch size of a 13B model. FlashAttention avoids this by tiling the Q, K, and V matrices into blocks that fit in on-chip SRAM (shared memory), computing the softmax incrementally with a running maximum and running sum (online softmax, which is exact rather than an approximation), and never writing the full attention matrix to HBM; the backward pass recomputes attention blocks instead of storing them. The result is O(N) extra HBM memory instead of O(N squared), trading some additional compute for far less memory traffic. On Ampere and Hopper, where compute is cheap but HBM bandwidth is the bottleneck for this kernel, the attention kernel is typically reported as 2 to 4x faster than a standard implementation, with smaller end-to-end gains. FlashAttention-2 adds further optimizations around work partitioning across warps and thread blocks to maximize Tensor Core utilization.
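
The online softmax idea can be sketched for a single query with scalar values. This is a simplified illustration of the rescaling trick only, not the FlashAttention kernel: there is no tiling over queries, no SRAM management, and the block size and inputs are arbitrary.

```python
import math

def softmax_weighted_sum(scores, values):
    """Reference: full softmax over all scores, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / z

def online_softmax_weighted_sum(scores, values, block=2):
    """Process scores block by block, keeping only a running max, sum, and output."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            e = math.exp(s - new_max)
            running_sum += e
            acc += e * v
        running_max = new_max
    return acc / running_sum

scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
full = softmax_weighted_sum(scores, values)
tiled = online_softmax_weighted_sum(scores, values, block=2)
```

The blockwise version never holds more than one block of scores at a time, yet matches the full computation up to floating-point rounding.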

Question 26

A team wants to quantize a transformer model from FP32 to INT8 to run faster on NVIDIA Turing or Ampere GPUs. Explain how INT8 quantization works, what calibration is, and how you would decide whether the accuracy loss is acceptable.

Accepted Answer

A strong answer explains that INT8 quantization maps floating-point tensors to 8-bit integers using a scale and zero-point, reducing memory by 4x relative to FP32 and enabling INT8 Tensor Core compute, whose peak throughput is about 2x that of FP16 Tensor Cores on Turing and Ampere. Post-training quantization (PTQ) requires calibration: running the model on a representative dataset to determine the activation range for each layer, then choosing the scale factor. Symmetric vs asymmetric quantization and per-tensor vs per-channel quantization are tradeoffs to explain. Accuracy loss is assessed by comparing task metrics (BLEU, F1, accuracy) on a held-out eval set against the FP32 baseline and a tolerance agreed in advance; layers sensitive to quantization (like the final projection or embeddings) can be kept in FP16. QAT (quantization-aware training) fine-tunes with fake quant nodes to recover accuracy if PTQ degrades too much.
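
A minimal sketch of symmetric per-tensor quantization with max-abs calibration. Max-abs is only one calibration strategy (entropy and percentile methods are also used in practice), and the function names and sample values are assumptions made for illustration.

```python
def calibrate_symmetric(samples, num_bits=8):
    """Max-abs calibration: pick a scale so the observed range maps onto int8."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(x) for x in samples)
    return max_abs / qmax

def quantize(x, scale, num_bits=8):
    """Round to the nearest integer step and clamp to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    q = round(x / scale)
    return max(-qmax, min(qmax, q))

def dequantize(q, scale):
    """Map the integer code back to an approximate float value."""
    return q * scale

calib = [-1.27, -0.4, 0.0, 0.3, 0.9]  # stand-in for observed activations
scale = calibrate_symmetric(calib)
q = [quantize(x, scale) for x in calib]
recovered = [dequantize(v, scale) for v in q]
max_err = max(abs(a - b) for a, b in zip(calib, recovered))
```

For values inside the calibrated range, the round-trip error is bounded by half a quantization step; values outside it are clipped, which is why the calibration set has to be representative.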

Question 27

Walk me through two or three architectural changes in the H100 Hopper GPU compared to A100 Ampere that are most relevant to large model training and inference workloads.

Accepted Answer

A strong answer covers: (1) the Transformer Engine, which dynamically selects FP8 or FP16 precision per layer using per-tensor scaling factors. H100 SXM5 delivers roughly 1 PFLOPS of dense FP16 Tensor Core throughput (about 2 PFLOPS with structured sparsity) compared to 312 TFLOPS dense on A100, and roughly 2 PFLOPS of dense FP8 (about 3.9 PFLOPS with structured sparsity); (2) NVLink 4.0 raising per-GPU bidirectional bandwidth to 900 GB/s from 600 GB/s on A100 NVLink 3.0, a 1.5x increase that reduces all-reduce bottlenecks in data-parallel training; (3) the Thread Block Clusters feature (sm_90) that allows shared memory to span multiple SMs via distributed shared memory, useful for fused attention kernels that exceed a single SM's 228 KB shared memory limit. The candidate should show they understand these are not just marketing numbers but affect the design of kernels and training recipes.

Question 28

You are implementing distributed training and need to synchronize gradients across 64 GPUs. Explain the NCCL collective operations available, which one you use for gradient synchronization, and what the communication complexity is.

Accepted Answer

A strong answer covers the main NCCL collectives: AllReduce (every GPU ends with the sum of all GPUs' tensors, used for gradient synchronization), AllGather (every GPU ends with a concatenation of all GPUs' tensors, used in ZeRO-3 to reconstruct full parameters), ReduceScatter (each GPU gets a different slice of the reduced tensor, used in ZeRO-2 gradient sharding), Broadcast (one GPU sends to all others, used for parameter broadcast at the start of training), and Reduce (the reduced result lands on a single GPU). For gradient synchronization, AllReduce is standard. In a ring all-reduce, each GPU sends 2 times (N-1)/N times the message size, which approaches 2x the tensor size regardless of GPU count, so the bandwidth cost is nearly constant as GPUs are added; the number of steps, and therefore the latency term, still grows linearly with N. The candidate should mention that NCCL uses NVLink for intra-node and InfiniBand for inter-node traffic, and that it chooses between ring and tree algorithms based on topology and message size.
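
The per-GPU communication volume of a ring all-reduce follows directly from its two phases. The sketch below counts bytes only; it does not model latency or link bandwidth, and the 1B-parameter FP32 gradient size is a hypothetical figure.

```python
def ring_allreduce_bytes_per_gpu(message_bytes, n_gpus):
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    if n_gpus < 2:
        return 0.0
    chunk = message_bytes / n_gpus
    steps = 2 * (n_gpus - 1)  # (N-1) reduce-scatter steps + (N-1) all-gather steps
    return steps * chunk

grad_bytes = 4 * 1_000_000_000  # hypothetical: 1B FP32 gradients
for n in (2, 8, 64):
    sent = ring_allreduce_bytes_per_gpu(grad_bytes, n)
    print(n, "GPUs:", round(sent / grad_bytes, 3), "x message size")
```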

Question 29

Walk me through the components of a transformer decoder layer and tell me which operations dominate compute time for a large model during training. Where does the FLOP budget go?

Accepted Answer

A strong answer covers the attention block (Q/K/V projections, scaled dot-product attention, output projection) and the MLP block (two linear layers with an activation, or three with a gated unit such as SwiGLU in modern models). For large models with d_model = 8192 and an FFN dimension of 4 times d_model, the MLP accounts for roughly two-thirds of the matmul FLOPs per layer and attention for roughly one-third at typical sequence lengths below 8k. Per layer, the forward FLOPs are about 16 times B times S times d_model squared for the MLP (8 d_model squared parameters at 2 FLOPs per multiply-accumulate; a SwiGLU MLP sized to the same parameter count costs the same), 8 times B times S times d_model squared for the four attention projections, and O(B times S squared times d_model) for the score and value matmuls. At long context lengths (32k plus), the S-squared term starts to dominate. The candidate should know that the backward pass is approximately 2x the forward pass FLOPs.
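
A rough FLOP counter makes the split concrete. It counts only matmuls at 2 FLOPs per multiply-accumulate, assumes a plain two-matrix MLP with a 4x expansion, and ignores causal masking, softmax, and normalization; all of these are simplifying assumptions for the sketch.

```python
def forward_flops_per_layer(batch, seq, d_model, ffn_mult=4):
    """Approximate forward matmul FLOPs for one decoder layer."""
    tokens = batch * seq
    qkvo_proj = 2 * tokens * 4 * d_model * d_model       # Q, K, V, output projections
    attn_scores = 2 * 2 * batch * seq * seq * d_model    # QK^T and attention-times-V
    mlp = 2 * tokens * 2 * ffn_mult * d_model * d_model  # up and down projections
    return {"qkvo_proj": qkvo_proj, "attn_scores": attn_scores, "mlp": mlp}

def mlp_share(batch, seq, d_model):
    """Fraction of the layer's forward FLOPs spent in the MLP."""
    flops = forward_flops_per_layer(batch, seq, d_model)
    return flops["mlp"] / sum(flops.values())

short_ctx = mlp_share(batch=1, seq=2048, d_model=8192)
long_ctx = mlp_share(batch=1, seq=65536, d_model=8192)
print(f"MLP share at 2k context:  {short_ctx:.2f}")
print(f"MLP share at 64k context: {long_ctx:.2f}")
```

Under these assumptions the MLP share falls from roughly two-thirds at short context to well under a third at 64k, as the sequence-squared attention term takes over.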

Question 30

OpenAI publishes a significant amount of technical work: GPT papers, the o-series reasoning models, InstructGPT, RLHF, and alignment research. How closely do you follow that research, and can you walk me through one paper or technique from the past year that changed how you think about building AI systems?

Accepted Answer

The candidate should name a specific paper or technique, not a vague area. A strong answer explains the key contribution in plain terms, then connects it to a concrete implication for how they would build or evaluate systems. This is not about reciting papers but showing genuine intellectual engagement with the field.

Questions

Explain the bias-variance tradeoff in practical model selection terms (Domain knowledge, easy, Very common)

How does batch normalization work and why does it help training? (Domain knowledge, medium, Very common)

Compare L1 vs L2 regularization and when to use each (Domain knowledge, easy, Very common)

ROC-AUC versus precision-recall AUC for imbalanced classification (Domain knowledge, easy, Very common)

Walk me through the transformer architecture components (Domain knowledge, medium, Very common)

How do you detect and address overfitting in a neural network? (Domain knowledge, easy, Very common)

GPU memory hierarchy: HBM, VRAM, L2, shared memory (Domain knowledge, medium, Very common)

Explain Claude's operator, user, and Anthropic trust hierarchy (Domain knowledge, medium, Very common)

Compare SGD, Adam, and their practical tuning differences (Domain knowledge, medium, Common)

What is model calibration and how do you measure and fix it? (Domain knowledge, medium, Common)

How are word or entity embeddings learned and why do they capture semantics? (Domain knowledge, easy, Common)

When is k-fold cross-validation misleading for ML model selection? (Domain knowledge, medium, Common)

ROCm vs CUDA: porting a CUDA kernel to HIP (Domain knowledge, hard, Common)

Compute unit occupancy and how to tune it (Domain knowledge, hard, Common)

MI300X architecture and its AI workload advantages (Domain knowledge, medium, Common)

Profiling a slow ROCm kernel with Omniperf (Domain knowledge, hard, Common)

How would you defend the Claude API against prompt injection attacks? (Domain knowledge, hard, Common)

How do you decide the right token budget for extended thinking? (Domain knowledge, medium, Common)

What categories of harm does Anthropic assess before releasing a model? (Domain knowledge, medium, Common)

How would you red-team a new Claude capability before release? (Domain knowledge, hard, Common)