Question 1

Walk me through what happens when you call loss.backward() in PyTorch. How does autograd build the computation graph, and what are the memory implications of retaining it?

Accepted Answer

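A strong answer explains that PyTorch builds a dynamic computation graph during the forward pass: each tensor produced by an operation on inputs that require gradients carries a grad_fn node pointing back to those inputs. backward() traverses the graph in reverse topological order, applying the chain rule at each node and accumulating the results into the .grad attribute of leaf tensors. By default the tensors saved for backward are freed as the pass runs. retain_graph=True is needed for multiple backward passes over the same graph, but it keeps those saved intermediate activations alive until the graph goes out of scope, which can substantially raise peak GPU memory.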
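To make the graph-building and reverse traversal concrete, here is a toy scalar autograd in plain Python. It is a teaching sketch, not PyTorch's implementation: the `Value` class and its fields are hypothetical names, and only addition and multiplication are supported. Note that each node stores the operand values it needs for its local gradients, which is the toy analogue of the saved activations that retain_graph keeps alive.

```python
class Value:
    """Toy scalar that records how it was produced, like a tensor's grad_fn."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # inputs to the op that produced this value
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # The local gradients are the operand values saved during the forward pass.
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad  # gradients accumulate


x, w, b = Value(3.0), Value(2.0), Value(1.0)
loss = x * w + b
loss.backward()
print(x.grad, w.grad, b.grad)  # 2.0 3.0 1.0
```

For loss = x * w + b, the gradient with respect to x is w, with respect to w is x, and with respect to b is 1, which is what the traversal produces.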

Question 2

Describe how torch.cuda.amp works under the hood. What is the GradScaler doing, and what training instabilities have you seen when using mixed precision?

Accepted Answer

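AMP's autocast runs eligible ops such as matrix multiplies and convolutions in float16 for speed and memory savings, while the model's parameters stay in float32 and precision-sensitive ops such as softmax and normalization layers are kept in float32. GradScaler multiplies the loss by a large factor before backward to avoid underflow in float16 gradients, then unscales before the optimizer step and checks for infs or NaNs. If any are found, it skips the update and lowers the scale; after a run of clean steps it raises the scale again. Common instabilities include loss spikes or repeated skipped steps when the scale factor is too aggressive, and overflow or divergence in numerically sensitive computations when values exceed the float16 range.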
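The skip-and-rescale behaviour described above can be sketched in a few lines of plain Python. `ToyLossScaler` is a hypothetical stand-in, not torch's GradScaler; its default arguments are chosen to mirror the defaults documented for GradScaler, and gradients are modelled as a flat list of floats.

```python
import math


class ToyLossScaler:
    """Hypothetical stand-in that mirrors dynamic loss scaling; not torch's GradScaler."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled grads, or None when the optimizer step should be skipped."""
        unscaled = [g / self.scale for g in scaled_grads]
        if any(math.isinf(g) or math.isnan(g) for g in unscaled):
            self.scale *= self.backoff_factor  # overflow: shrink the scale, skip the update
            self.good_steps = 0
            return None
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            self.scale *= self.growth_factor   # stable for a while: try a larger scale
            self.good_steps = 0
        return unscaled


scaler = ToyLossScaler(init_scale=1024.0, growth_interval=2)
skipped = scaler.step([float("inf"), 2048.0])  # overflow: step skipped, scale halves
grads = scaler.step([512.0, 1024.0])           # clean step at the reduced scale
print(skipped, grads, scaler.scale)
```

A scale that keeps collapsing under repeated overflows is a useful signal to log during training, since it usually points at a genuinely unstable layer and not at the scaler.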

Question 3

Your training job shows GPU utilization fluctuating between 40 and 60 percent. Walk me through how you would diagnose and fix this.

Accepted Answer

Low GPU utilization usually means the GPU is starved for data, waiting on CPU-side preprocessing, or blocked on communication. Start with nvidia-smi or PyTorch Profiler to identify the idle periods. Profile the data loader by checking DataLoader num_workers and whether pin_memory is enabled. Check if memory transfers are saturating PCIe bandwidth. If the bottleneck is communication in distributed training, verify NCCL configuration and whether gradient compression helps.

Question 4

Your model's prediction distribution has shifted over the past week, but labels are not available yet. How do you detect and diagnose the root cause?

Accepted Answer

Without labels, you monitor input feature distributions using statistical tests like Population Stability Index or Kolmogorov-Smirnov against a reference window. If PSI is above 0.2 on a key feature that is high in SHAP importance, that feature is the likely culprit. Check upstream data pipelines for schema changes, null rate spikes, or encoding changes. Model output distribution monitoring with Jensen-Shannon divergence can flag drift before labels arrive.
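As a rough illustration of the PSI check, the sketch below computes PSI for one feature over fixed bin edges. The function name `psi`, the bin edges, and the sample values are all made up for the example; a production monitor would typically derive the edges from reference-window quantiles.

```python
import math


def psi(reference, current, edges, eps=1e-6):
    """Population Stability Index over the bins defined by `edges`."""

    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # index of the bin containing v
            counts[idx] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_fractions(reference), bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted = [0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9]
edges = [0.25, 0.5, 0.75]

print(psi(reference, reference, edges))  # identical samples give 0.0
print(psi(reference, shifted, edges))    # well above the 0.2 rule of thumb
```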

Question 5

Walk me through the decision of whether to serve a ranking model in real time versus running it in daily batch. What factors tip you toward each approach?

Accepted Answer

Batch inference is cheaper and simpler when predictions can be precomputed, like email campaign scoring or overnight recommendation refreshes. Real-time serving is required when context changes rapidly or per-request features are needed, like session-based recommendations. Cost, latency SLA, feature freshness requirements, and user session lifetime all factor in. A common hybrid is precomputing candidate sets in batch and re-ranking in real time.

Question 6

You need to run a hyperparameter sweep over learning rate, batch size, and weight decay for a transformer fine-tuning job. How do you set this up in Weights and Biases and avoid wasting compute on clearly bad configurations early?

Accepted Answer

W&B Sweeps support random, grid, and Bayesian search strategies. For a three-dimensional space, Bayesian search with early termination using the Hyperband scheduler eliminates low-performing runs before they finish, typically reducing compute by 60 to 80 percent. Log validation loss at each epoch so the scheduler can prune early. Use a logarithmic prior on learning rate and weight decay. Set a minimum number of steps before pruning to avoid killing runs that need warmup.
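A sweep configuration along these lines might look like the dictionary below. The ranges, the batch sizes, the `val_loss` metric name, and the Hyperband settings are illustrative assumptions, not recommendations; the metric name has to match whatever key the training script actually logs.

```python
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        # Log-uniform priors for the two scale-sensitive hyperparameters.
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-6, "max": 1e-3},
        "weight_decay": {"distribution": "log_uniform_values", "min": 1e-4, "max": 1e-1},
        "batch_size": {"values": [16, 32, 64]},
    },
    # min_iter delays pruning so runs that need warmup are not killed early.
    "early_terminate": {"type": "hyperband", "min_iter": 3, "eta": 3},
}

# With the wandb package installed, the sweep would then be registered and run with:
#   sweep_id = wandb.sweep(sweep_config, project="my-project")  # project name is hypothetical
#   wandb.agent(sweep_id, function=train, count=30)             # `train` is your own function
print(sorted(sweep_config["parameters"]))
```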

Question 7

Your fraud detection dataset has 0.1 percent positive labels. What techniques would you apply at the data, model, and evaluation level, and how do you choose between them?

Accepted Answer

At the data level, undersample the majority class or oversample with SMOTE, or use weighted random sampling in the DataLoader. At the model level, use focal loss or set class_weight='balanced' in sklearn models. At the evaluation level, use precision-recall AUC or F1 rather than accuracy or ROC-AUC, and set the decision threshold to the business-optimal point on the precision-recall curve. For fraud specifically, the cost of false negatives versus false positives should drive the threshold choice.
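To show why focal loss helps at this imbalance, the sketch below compares it with plain cross-entropy on one easy negative and one missed fraud case. The gamma and alpha values are illustrative defaults, and the two example probabilities are invented.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class weighting term
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    return -math.log(p if y == 1 else 1.0 - p)


easy_negative = (0.01, 0)  # legitimate transaction, confidently scored
missed_fraud = (0.10, 1)   # fraud case scored as unlikely
for p, y in (easy_negative, missed_fraud):
    print(round(cross_entropy(p, y), 4), focal_loss(p, y))
```

The (1 - p_t)^gamma factor shrinks the loss on the easy negative by several orders of magnitude while leaving most of the loss on the missed fraud case, so the abundant easy examples stop dominating the gradient.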

Question 8

When would you choose PyTorch FSDP over DDP for a training job, and what are the concrete memory and communication tradeoffs you are making?

Accepted Answer

DDP replicates the full model on every GPU and syncs gradients via all-reduce, so each GPU must fit the whole model. FSDP shards parameters, gradients, and optimizer state across GPUs, allowing models that do not fit on a single GPU, but at the cost of additional all-gather and reduce-scatter communication. FSDP is the right choice when a single GPU cannot hold the model or optimizer state.

Question 9

You are building a feature store for a real-time recommendation system. Walk me through how you keep the online store consistent with the offline store, and how you handle feature staleness during serving.

Accepted Answer

The offline store holds historical features computed in batch and is used for training, while the online store holds the latest feature values served at low latency. Consistency is maintained via a materialization pipeline that pushes updated features from the offline store to a low-latency store like Redis. Training-serving skew occurs when the online feature computation differs from training, so both pipelines should share transformation logic. Staleness is managed with TTLs and monitoring dashboards tracking feature freshness.

Question 10

Explain the difference between post-training quantization and quantization-aware training. When does INT8 quantization cause unacceptable accuracy degradation, and how do you recover from it?

Accepted Answer

Post-training quantization calibrates scale factors from a representative dataset without retraining, which is fast but can lose accuracy on models with wide activation distributions. QAT simulates quantization noise during training so the model adapts, typically recovering most of the accuracy loss. Degradation is worst for models with outlier activations, like large transformer models, where per-channel or per-token quantization is needed. SmoothQuant or AWQ can migrate outlier difficulty from activations to weights.

Question 11

Explain how the KV cache works in transformer inference, how it grows with sequence length and batch size, and what tradeoffs you make when deciding how large to allow it.

Accepted Answer

The KV cache stores computed key and value tensors from past tokens so they are not recomputed on each generation step. Its memory footprint scales as 2 x num_layers x num_heads x head_dim x sequence_length x batch_size x bytes_per_element, where the leading 2 accounts for keys and values and, for grouped-query or multi-query attention, num_heads is the number of key/value heads. For large batches or long sequences this dominates GPU memory. Techniques like PagedAttention in vLLM manage the KV cache as non-contiguous pages, improving utilization. Flash attention reduces attention activation memory, most noticeably during prefill, but it does not shrink the KV cache itself.
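The formula is easy to turn into a quick capacity check. The model shape below (32 layers, 32 heads, head dimension 128) is a hypothetical configuration picked for round numbers, and the parameter is named `num_kv_heads` on the assumption that grouped-query models would pass their smaller key/value head count.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_element=2):
    """Bytes for cached keys and values; the leading 2 covers K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element


total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=8)
print(total / 2**30, "GiB")
```

With float16 elements this configuration needs 0.5 MiB of cache per token, so 8 sequences of 4096 tokens come to 16 GiB before any weights or activations are counted.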

Question 12

Describe how you would design a model registry and promotion workflow so that moving a model from staging to production is safe and auditable, and rollback is fast.

Accepted Answer

A model registry stores versioned artifacts with metadata like training run ID, dataset version, and evaluation metrics. Promotion gates require passing offline eval thresholds and a shadow deployment or A/B test in staging. Rollback is fast when the serving infrastructure reads a model version pointer rather than baking the model into an image, so changing the pointer and restarting serves the previous version in seconds. MLflow, Vertex AI Model Registry, or SageMaker Model Registry all support this pattern.

Question 13

Your offline AUC improved by 1.5 points but the A/B test showed no significant lift in the business metric. What are the most likely explanations and how do you debug this?

Accepted Answer

The most common causes are evaluation data leakage or label bias that makes the offline set unrepresentative of live traffic; position bias in logged labels, which were collected under the old system's rankings and do not reflect how the new model's results are presented online; or an offline gain too small to show up as a detectable business lift given the power of the A/B test. Check the offline eval pipeline for temporal leakage, verify that label construction matches online behavior, and run a power analysis to confirm the test duration was sufficient to detect the expected lift.
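The power analysis can be approximated with the standard normal-approximation formula for comparing two proportions. This is a rough sketch: the function name, the 5 percent baseline conversion rate, and the lifts are hypothetical, and it assumes a two-sided test with equal-sized arms.

```python
from statistics import NormalDist


def samples_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2


small_lift = samples_per_arm(0.05, 0.01)   # 1 percent relative lift
large_lift = samples_per_arm(0.05, 0.10)   # 10 percent relative lift
print(round(small_lift), round(large_lift))
```

Required sample size grows roughly with the inverse square of the lift, so a model whose offline gain translates into a 1 percent relative improvement needs on the order of a hundred times more traffic than one producing a 10 percent improvement.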

Question 14

Explain exactly what gradient checkpointing does to memory during training and what you pay in compute to get those savings.

Accepted Answer

Gradient checkpointing trades memory for compute by not storing all intermediate activations during the forward pass. During backward, it recomputes the discarded activations from the nearest checkpoint. This reduces activation memory from O(num_layers) to O(sqrt(num_layers)) with the standard uniform checkpoint strategy, at the cost of approximately 33 percent more compute. It is most useful when activation memory is the binding constraint, not parameters or optimizer state.
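Both numbers in the answer can be reproduced with a small calculation. The sketch assumes every layer's activations cost one unit of memory, that checkpoints are placed every `segment_size` layers, and that a backward pass costs about twice a forward pass; all three are simplifying assumptions.

```python
import math


def peak_activation_units(num_layers, segment_size):
    """Stored checkpoints plus the activations of the one segment being recomputed."""
    num_checkpoints = math.ceil(num_layers / segment_size)
    return num_checkpoints + segment_size


num_layers = 64
best = min(range(1, num_layers + 1), key=lambda k: peak_activation_units(num_layers, k))
peak = peak_activation_units(num_layers, best)
print(best, peak, "vs", num_layers, "units without checkpointing")

# Compute: one extra forward pass on top of the usual forward + backward.
forward, backward = 1.0, 2.0  # assumed relative costs
overhead = (forward + backward + forward) / (forward + backward) - 1.0
print(round(overhead, 3))
```

For 64 layers the minimum falls at a segment size of 8, the square root of the depth, giving 16 units where storing everything takes 64, and the extra forward pass adds one third to the compute.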

Question 15

When running inference with ONNX Runtime, how do you choose and configure an execution provider, and what graph optimizations does ORT apply automatically?

Accepted Answer

ONNX Runtime does not pick an execution provider on its own: you pass a priority-ordered providers list when creating the InferenceSession, and nodes a provider cannot handle fall through to the next one. Typical choices are CUDA EP for NVIDIA GPUs, TensorRT EP for further optimization, CoreML EP for Apple Silicon, and CPU EP as the fallback. ORT applies graph optimizations in levels: basic optimizations like constant folding and redundant node elimination, extended optimizations like operator fusion, and layout optimizations. TensorRT EP provides additional kernel auto-tuning for the specific GPU. Setting graph_optimization_level on SessionOptions and enabling the memory arena can significantly affect throughput.
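One way to build the provider list is to filter a preferred order against what the installed ORT build offers. `pick_providers` and `PREFERRED_ORDER` are hypothetical helper names, and `model.onnx` is a placeholder path; the onnxruntime calls are shown as comments so the sketch runs without the package.

```python
PREFERRED_ORDER = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available):
    """Hypothetical helper: keep preferred providers that this build actually offers."""
    chosen = [name for name in PREFERRED_ORDER if name in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")  # always leave a fallback
    return chosen


# With onnxruntime installed, the list would feed a session roughly like this:
#   import onnxruntime as ort
#   opts = ort.SessionOptions()
#   opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
#   session = ort.InferenceSession("model.onnx", sess_options=opts,
#                                  providers=pick_providers(ort.get_available_providers()))
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```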

Question 16

What does idempotency mean in the context of a daily feature engineering DAG, and what concrete design decisions make a pipeline idempotent versus fragile to reruns?

Accepted Answer

An idempotent DAG produces identical output if run multiple times for the same logical date. Concretely, write output to a partition keyed on the logical execution date, overwrite rather than append, and avoid using wall-clock time inside tasks. Use Airflow's ds macro for date context. If a task reads from a mutable source like a database table, snapshot it at the start of the run so reruns see the same input. Avoid deleting and recreating tables when an INSERT OVERWRITE partition is sufficient.
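The partition-overwrite pattern can be sketched without Airflow at all. In the example below `write_partition` is a hypothetical task body that receives the logical date as a string (the value the ds macro would supply), and local JSON files stand in for the real warehouse.

```python
import json
import os
import tempfile


def write_partition(root, logical_date, rows):
    """Overwrite the partition for one logical date; reruns replace, never append."""
    partition_dir = os.path.join(root, f"ds={logical_date}")
    os.makedirs(partition_dir, exist_ok=True)
    final_path = os.path.join(partition_dir, "part-0.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(sorted(rows, key=lambda r: r["user_id"]), f)  # deterministic order
    os.replace(tmp_path, final_path)  # swap the finished file into place
    return final_path


def read_text(path):
    with open(path) as f:
        return f.read()


root = tempfile.mkdtemp()
rows = [{"user_id": 2, "clicks": 5}, {"user_id": 1, "clicks": 3}]
first = read_text(write_partition(root, "2024-01-01", rows))
second = read_text(write_partition(root, "2024-01-01", rows))  # simulated rerun
files = os.listdir(os.path.join(root, "ds=2024-01-01"))
print(first == second, files)
```

Writing to a temporary file and swapping it in also means a task that crashes mid-write leaves the previous partition intact, so a retry starts from a clean state.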

Question 17

Your recommendation model has a p50 latency of 20ms but a p99 of 400ms. What are the most common causes and how do you systematically reduce p99 without hurting throughput?

Accepted Answer

P99 tail latency spikes often come from garbage collection pauses in JVM-based serving stacks, CUDA kernel launch overhead under bursty load, dynamic batching queue stalls, or CPU-GPU memory transfer jitter. Profile with percentile histograms at each layer: load balancer, serving framework, and kernel. Fixes include pinning memory for async transfers, setting a stricter max_queue_delay, using streaming batch inference with a fixed window, and moving the serving runtime to C++ or Rust to eliminate GC pauses.

Question 18

Describe how you use DVC to version datasets and link them to model training runs. What breaks when someone skips versioning and how have you seen that play out?

Accepted Answer

DVC tracks dataset files by content hash and stores the actual data in a remote like S3 or GCS, committing only the .dvc pointer files to Git. This means a Git tag on a training commit captures the exact dataset used. When versioning is skipped, teams lose the ability to reproduce a model, debug performance regressions, or audit what data a production model was trained on. A common failure mode is a data pipeline silently overwriting the training parquet files after the model ships.

Question 19

Your Spark feature pipeline is slow because a few user IDs generate millions of events while most generate dozens. How do you address the data skew?

Accepted Answer

Identify the skewed keys with a groupBy count and sort descending. Solutions include salting the skewed keys by appending a random suffix and aggregating in two stages, using broadcast joins instead of shuffle joins when one side is small, or applying AQE's skew join optimization in Spark 3 which automatically splits skewed partitions. For feature aggregation specifically, you can precompute heavy-hitter user features in a separate job that reads a filtered partition.
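The two-stage salted aggregation is easier to see outside Spark. The sketch below uses plain Python counters in place of shuffle partitions; `salted_count`, the number of salts, and the synthetic event list are all assumptions made for illustration.

```python
import random
from collections import Counter


def salted_count(events, num_salts=4, seed=0):
    """Two-stage count per user: partial counts per (user, salt), then merge."""
    rng = random.Random(seed)
    partial = Counter()
    for user_id in events:
        partial[(user_id, rng.randrange(num_salts))] += 1  # stage 1: spread hot keys
    final = Counter()
    for (user_id, _salt), count in partial.items():
        final[user_id] += count                            # stage 2: drop the salt
    return partial, final


events = ["whale"] * 1000 + ["u1"] * 3 + ["u2"] * 2
partial, final = salted_count(events)
largest_whale_group = max(c for (u, _), c in partial.items() if u == "whale")
print(final["whale"], largest_whale_group)
```

The final counts match an unsalted aggregation, but in stage 1 no single group holds all of the heavy user's events, which is what relieves the straggler task.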

Question 20

Describe exactly how you would implement a shadow deployment for a new ranking model. What data do you collect, how long do you shadow, and what signals trigger a full promotion?

Accepted Answer

In shadow mode, live traffic is duplicated and sent to the new model, but only the existing model's predictions are served. Both models' predictions are logged with the same request context. You compare prediction distributions, latency profiles, and resource consumption. After collecting statistically meaningful volume, typically a few days to a week, you recompute the offline evaluation metrics for the new model on the logged live traffic and check for any edge-case crashes. Promotion is triggered when metrics pass thresholds and no anomalies appear in the shadow period.

Question 21

How do you write tests for an ML training pipeline? What do you test with unit tests, integration tests, and end-to-end smoke tests?

Accepted Answer

Unit tests cover individual transforms and feature logic with hand-crafted inputs and known outputs, catching regressions in preprocessing logic. Integration tests run the pipeline on a tiny synthetic dataset to verify the full execution graph, checking shapes, dtypes, and that loss decreases over a handful of steps. End-to-end smoke tests run a short training job on a small data slice and assert that final eval metrics are within a reasonable range. You also want tests that catch training-serving skew by running the same feature transform through both the training and serving code paths.

Question 22

You need to fine-tune a 7B parameter model across 4 GPUs using Hugging Face Accelerate. Walk me through the accelerate config and what to watch for to avoid OOM errors.

Accepted Answer

Run accelerate config to generate a config file specifying multi-GPU with DDP or DeepSpeed backend, mixed precision, and gradient accumulation steps. For a 7B model, DeepSpeed ZeRO stage 2 or 3 is usually needed to avoid OOM. Monitor per-GPU memory with torch.cuda.memory_reserved() at the first step. Common OOM causes are the optimizer state (8 bytes per param in Adam), activations from large batch sizes, and the tokenizer expanding sequence length beyond expectation.
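A back-of-the-envelope estimate shows why plain DDP does not fit. The sketch below counts only model state under mixed-precision Adam, using the common accounting of 2 bytes for half-precision weights, 2 for gradients, and 12 for fp32 master weights plus the two Adam moments (the 8 bytes mentioned above). Activations, buffers, and fragmentation are excluded, and the function name is hypothetical.

```python
def training_bytes_per_gpu(num_params, num_gpus, zero_stage):
    """Model-state memory per GPU for mixed-precision Adam; activations excluded."""
    params = 2 * num_params      # bf16/fp16 weights
    grads = 2 * num_params       # bf16/fp16 gradients
    optimizer = 12 * num_params  # fp32 master weights (4) + Adam moments (8)
    if zero_stage >= 1:
        optimizer /= num_gpus    # ZeRO-1 shards optimizer state
    if zero_stage >= 2:
        grads /= num_gpus        # ZeRO-2 also shards gradients
    if zero_stage >= 3:
        params /= num_gpus       # ZeRO-3 also shards parameters
    return params + grads + optimizer


for stage in (0, 2, 3):
    print(stage, training_bytes_per_gpu(7e9, 4, stage) / 1e9, "GB")
```

Under these assumptions a 7B model needs about 112 GB per GPU without sharding, 38.5 GB with ZeRO stage 2, and 28 GB with stage 3 on 4 GPUs, all before activations.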

Question 23

Walk me through how LoRA works mathematically, how you choose the rank, and how you decide which weight matrices to apply it to.

Accepted Answer

LoRA freezes the pretrained weights W and adds a low-rank decomposition delta_W = A * B where A is (d, r) and B is (r, k) with r much smaller than d and k. The rank r controls expressivity versus parameter count; typical values are 4 to 64 with 8 or 16 being a common default. Higher rank helps on tasks far from pretraining but risks overfitting on small datasets. Applying LoRA to attention query and value projections is the standard starting point; adding it to the MLP layers or all linear layers helps for more task-specific adaptation.
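The decomposition can be sketched with tiny matrices in plain Python, keeping the answer's notation of A as (d, r) and B as (r, k). Two details are taken from the original LoRA formulation and are not stated in the answer above: the update is scaled by alpha / r, and one factor is initialized to zero so training starts from the pretrained model. All values and helper names are illustrative.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, alpha, r):
    """y = x @ (W + (alpha / r) * A @ B), with W frozen and only A, B trainable."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)  # never materializes the (d, k) delta
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def lora_params(d, k, r):
    return r * (d + k)


d, k, r, alpha = 4, 3, 2, 4
W = [[0.1 * (i + j) for j in range(k)] for i in range(d)]
A = [[0.01 * (i - j) for j in range(r)] for i in range(d)]  # small init, shape (d, r)
B = [[0.0] * k for _ in range(r)]                            # zero init, shape (r, k)
x = [[1.0, 2.0, 3.0, 4.0]]

assert lora_forward(x, W, A, B, alpha, r) == matmul(x, W)  # starts identical to the base model
print(4096 * 4096, lora_params(4096, 4096, 8))
```

For a 4096 x 4096 projection, rank 8 trains 65,536 parameters where the full matrix has 16,777,216, a 256-fold reduction per adapted layer.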

Question 24

Explain how continuous batching works in a system like vLLM, how it differs from static batching, and what specific GPU utilization problem it solves.

Accepted Answer

Static batching groups a fixed set of requests and waits for all to finish before starting the next batch, so a long request blocks short ones from completing and the GPU idles while waiting. Continuous batching inserts new requests into the batch as soon as a slot frees up from a completed sequence. This dramatically improves GPU utilization when request lengths vary widely, which is typical in production. vLLM's PagedAttention manages the KV cache as dynamic pages so arriving and departing sequences do not require contiguous memory.
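A small simulation makes the utilization gap visible. It is deliberately simplified: each request is reduced to the number of decode steps it needs, prefill cost and memory limits are ignored, and the function names and request lengths are invented.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = busy = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        steps += max(batch)
        busy += sum(batch)
    return steps, busy / (steps * batch_size)


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue at the next decode step."""
    queue, active, steps, busy = list(lengths), [], 0, 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        busy += len(active)
        active = [n - 1 for n in active if n > 1]  # drop sequences that just finished
    return steps, busy / (steps * batch_size)


lengths = [100, 5, 5, 5, 100, 5, 5, 5]  # tokens to generate per request
static_result = static_batching_steps(lengths, batch_size=4)
continuous_result = continuous_batching_steps(lengths, batch_size=4)
print(static_result)
print(continuous_result)
```

With this workload static batching takes 200 decode steps because each batch waits on its one long request, while continuous batching finishes in 105 steps by overlapping the two long requests and slotting the short ones in around them.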

Question 25

You are serving a pipeline that runs a preprocessing step, a main model, and a postprocessing step. How do you configure this in Triton, and how does dynamic batching interact with ensemble scheduling?

Accepted Answer

Triton ensemble models define a pipeline graph in config.pbtxt where each step is a model with named inputs and outputs wired together. Dynamic batching is configured per model, in that model's own config.pbtxt, with preferred_batch_size and max_queue_delay_microseconds. In an ensemble, requests flow through the steps sequentially, so dynamic batching applies to each constituent model independently. The preprocessing and postprocessing steps often use the Python backend, optionally with Business Logic Scripting (BLS) when a step needs to call other models, while the main model uses the TensorRT or ONNX Runtime backend for peak throughput.

Question 26

You try to export a PyTorch model to TensorRT and the builder throws an unsupported op error on a custom attention variant. What is your step-by-step approach?

Accepted Answer

First export to ONNX and inspect the graph with Netron to identify which op is unsupported. Options include implementing a TensorRT plugin in CUDA C++, decomposing the op into supported primitives before export, or using Torch-TensorRT, for example through its torch.compile backend, which compiles the supported subgraphs and leaves unsupported ops running in PyTorch. If the custom op is a performance optimization, check whether the ONNX path with a standard fallback meets latency requirements before spending time on a plugin.

Question 27

Standard self-attention is quadratic in sequence length. What approximations exist, when do they preserve model quality, and what have you seen fail in practice?

Accepted Answer

Standard attention is O(n^2) in both time and memory. Flash Attention is an exact algorithm: compute stays O(n^2), but memory drops to O(n) because it tiles the computation and never materializes the full attention matrix. Approximate methods like Longformer (sliding window plus global tokens), BigBird (sparse patterns), and Linformer (low-rank projection) trade off some accuracy for linear complexity. In practice, Flash Attention is almost always preferred over approximations for quality-sensitive tasks, while sparse attention works well for document retrieval where many positions are irrelevant.
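The reason tiling can be exact is the online-softmax trick: a running maximum and running sum let earlier partial results be rescaled as new blocks of keys arrive. The sketch below shows that idea for a single query in plain Python. It illustrates the numerics only, not the fused GPU kernel, and the function names and inputs are invented.

```python
import math


def attention_row_naive(q, keys, values):
    """Softmax-weighted sum for one query, materializing all n scores."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def attention_row_tiled(q, keys, values, block=2):
    """Same result, visiting keys block by block with a running max and sum."""
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk, v_blk = keys[start:start + block], values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale earlier partial results
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
naive = attention_row_naive(q, keys, values)
tiled = attention_row_tiled(q, keys, values)
print(naive)
print(tiled)
```

The two outputs agree up to floating-point rounding, while the tiled version only ever holds one block of scores at a time.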

Question 28

What adversarial attack vectors concern you most for a production image classification API, and what defenses would you put at the model and infrastructure level?

Accepted Answer

The most practical threats are adversarial perturbations crafted to flip predictions, and model extraction attacks where an adversary queries the API to clone the model. Defenses at the model level include adversarial training and input smoothing. At the infrastructure level, rate limiting, anomaly detection on request distributions, and watermarking outputs deter extraction. For high-stakes applications, randomized smoothing provides certifiable robustness bounds. Input preprocessing like JPEG compression or random resizing can disrupt gradient-based attacks without certified guarantees.

Question 29

Your distributed training job hangs indefinitely after a few thousand steps. nvidia-smi shows GPUs are idle. How do you diagnose and resolve NCCL communication issues?

Accepted Answer

NCCL hangs usually mean one rank is stuck waiting on a collective while others have moved on, often due to a Python exception on one rank, an OOM on one GPU that is not propagated cleanly, or a network connectivity issue in multi-node jobs. Set NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL to get verbose logs. Use torch.distributed.monitored_barrier(), which requires a Gloo process group, to detect which rank is lagging. In multi-node setups, verify InfiniBand or RDMA connectivity, check that NCCL_SOCKET_IFNAME matches the right interface, and confirm all nodes can reach each other on the ports used for rendezvous and NCCL traffic.

Questions

How does PyTorch autograd compute gradients during backward pass? (Role-specific, medium, Very common)

Explain automatic mixed precision training and its failure modes (Role-specific, medium, Very common)

Diagnosing low GPU utilization during a training run (Role-specific, medium, Very common)

Detecting and responding to feature drift in production pipelines (Role-specific, medium, Very common)

When to choose batch inference over real-time serving (Role-specific, easy, Very common)

Organizing hyperparameter sweeps and comparing runs in W&B (Role-specific, easy, Very common)

Handling extreme class imbalance in a fraud detection model (Role-specific, medium, Very common)

DDP vs FSDP for large model training tradeoffs (Role-specific, hard, Common)

Online vs offline feature stores: consistency and staleness (Role-specific, hard, Common)

INT8 quantization: calibration, accuracy loss, and when to use it (Role-specific, hard, Common)

KV cache in transformer inference: memory and batching tradeoffs (Role-specific, hard, Common)

Model registry promotion workflow and rollback strategy (Role-specific, medium, Common)

Offline eval metrics versus online A/B test outcomes: when they diverge (Role-specific, medium, Common)

Gradient checkpointing: memory savings vs compute overhead (Role-specific, medium, Common)

ONNX Runtime execution providers and graph optimizations (Role-specific, medium, Common)

Designing idempotent Airflow DAGs for ML feature pipelines (Role-specific, medium, Common)

Reducing p99 latency in a low-latency model serving endpoint (Role-specific, hard, Common)

DVC and dataset versioning for reproducible ML experiments (Role-specific, easy, Common)

Handling data skew in Spark feature engineering jobs (Role-specific, medium, Common)

Shadow deployment pattern for safe model rollout (Role-specific, medium, Common)