Question 1

A data scientist gives you a gradient boosting model that achieves great offline AUC but you notice inference is 200ms per sample. Explain how gradient boosting builds an ensemble compared to random forests, which architectural properties drive that latency, and what MLOps levers you have to bring it within a 10ms serving SLA.

Accepted Answer

A strong answer explains that random forests build trees independently, in parallel, on bootstrapped samples and average their predictions (bagging), while gradient boosting builds trees sequentially, each one fitting the residuals (more generally, the negative gradient of the loss) of the ensemble built so far (boosting). Gradient boosting is more prone to overfitting because each new tree corrects the remaining errors, noise included; the key regularization knobs are learning rate (shrinkage), max depth (tree complexity), min samples per leaf, subsampling rate, and number of estimators.
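The sequential residual-fitting loop can be sketched in a few lines. This is a toy illustration only, assuming squared-error loss, a single numeric feature, and depth-1 trees (stumps); the function names are hypothetical and it is not how any production library implements boosting.

```python
# Toy gradient boosting for squared error with depth-1 trees (stumps) on one feature.
# Assumes xs contains at least two distinct values.

def fit_stump(xs, targets):
    """Pick the split that minimizes squared error; return (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value: it would leave the right side empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_estimators=50, learning_rate=0.1):
    """Each round fits a stump to the residuals of the ensemble built so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrinkage: only a fraction of each correction is applied.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict_boosting(model, x):
    """Inference visits every tree in the ensemble and sums the shrunken outputs."""
    correction = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * correction


model = fit_boosting([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
```

Because every round depends on the predictions of the previous rounds, training cannot be parallelized across trees the way a random forest can, and prediction cost grows with the number of estimators.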

Question 2

A feature preprocessing step that fits a StandardScaler runs inside your Spark training job but you need to replicate it in a low-latency Python inference service. Walk me through how min-max normalization and z-score standardization differ, where training-serving skew is most likely to creep in during this replication, and how you validate the serving path matches the training path.

Accepted Answer

Min-max normalization rescales each feature to [0, 1] using the training minimum and maximum, so it is sensitive to outliers; use it for neural networks with bounded activations or for image pixels. Z-score standardization subtracts the mean and divides by the standard deviation, giving zero mean and unit variance; use it for linear models, SVMs, or when the feature distribution is roughly Gaussian. Fitting the scaler on the full dataset rather than on the training split alone causes data leakage: the scaler learns statistics from the test set, inflating evaluation metrics and giving an overly optimistic estimate of production performance.
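One way to check that the serving path matches the training path is to export the fitted statistics and replay a set of golden inputs through both. The sketch below uses only the standard library and hypothetical helper names; it also shows one concrete skew source: Spark ML's StandardScaler divides by the corrected sample standard deviation (ddof=1), while scikit-learn's StandardScaler uses the population standard deviation (ddof=0), so recomputing statistics on the serving side with a different library can silently change the output.

```python
import json
import statistics


def fit_scaler(train_values, ddof=1):
    """Fit z-score parameters on the training split only."""
    mean = statistics.mean(train_values)
    if ddof == 1:
        std = statistics.stdev(train_values)   # sample standard deviation
    else:
        std = statistics.pstdev(train_values)  # population standard deviation
    return {"mean": mean, "std": std, "ddof": ddof}


def z_score(params, x):
    return (x - params["mean"]) / params["std"]


def fit_min_max(train_values):
    """Fit min-max parameters on the training split only."""
    return {"min": min(train_values), "max": max(train_values)}


def min_max(params, x):
    return (x - params["min"]) / (params["max"] - params["min"])


train = [10.0, 12.0, 14.0, 16.0, 18.0]

# Training side: fit once, then export the parameters as an artifact.
train_params = fit_scaler(train, ddof=1)
artifact = json.dumps(train_params)

# Serving side: load the exported parameters instead of refitting.
serving_params = json.loads(artifact)

# Parity check on golden inputs recorded from the training job.
golden = [(x, z_score(train_params, x)) for x in train]
max_abs_diff = max(abs(z_score(serving_params, x) - expected) for x, expected in golden)

# Skew example: the serving side refits with the other ddof convention.
refit_params = fit_scaler(train, ddof=0)
skew_gap = abs(z_score(refit_params, 18.0) - z_score(train_params, 18.0))
```

In this sketch the parity check passes exactly, while the refit variant shifts the scaled value of 18.0 by roughly 0.15, which is the kind of difference a golden-input test in CI is meant to catch.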

Question 3

Why is k-fold cross-validation inappropriate for time-series data, and what cross-validation strategy do you use instead when evaluating a model that predicts next-week churn?

Accepted Answer

Standard k-fold, shuffled or not, produces folds whose training split contains observations that come after the validation split, so future data leaks into training and inflates validation metrics for temporally correlated data. Time-series cross-validation uses walk-forward (expanding window or sliding window) splits: each fold's validation set is always strictly after its training set in time. For weekly churn, a sliding window of 6 months training and 1 month validation, advanced by 1 month per fold, gives 12 non-leaking evaluation points when the validation months span a full year, which requires 18 months of history; with only 12 months of data the same scheme yields 6 folds.
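The sliding-window scheme can be sketched as a small split generator. Months are represented as integer indices and the function name is hypothetical; the point is that every validation index is strictly later than every training index in the same fold, and that the fold count depends on how much history is available.

```python
def sliding_window_splits(n_months, train_months=6, val_months=1, step=1):
    """Return (train_indices, val_indices) pairs over month indices 0..n_months-1."""
    splits = []
    start = 0
    while start + train_months + val_months <= n_months:
        train = list(range(start, start + train_months))
        val = list(range(start + train_months, start + train_months + val_months))
        splits.append((train, val))
        start += step
    return splits


folds_18 = sliding_window_splits(18)  # 18 months of history
folds_12 = sliding_window_splits(12)  # 12 months of history

# Every fold validates strictly after it trains.
no_leak = all(max(train) < min(val) for train, val in folds_18)
```

With these defaults, 18 months of history produce 12 folds and 12 months produce 6. An expanding-window variant would keep the training start fixed at index 0 instead of advancing it.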

Question 4

You are setting up automated retraining for a fraud detection model on a 0.1% positive class dataset. A model that always predicts not-fraud achieves 99.9% accuracy. What techniques would you apply during training, what metrics would you gate model promotion on in your CI pipeline instead of accuracy, and how do you monitor for class imbalance shift in production after deployment?

Accepted Answer

A strong answer covers: using precision-recall AUC and F-beta score instead of accuracy, applying class weights in the loss function (inversely proportional to class frequency), SMOTE or random oversampling of the minority class, undersampling the majority class, and adjusting the decision threshold on predicted probability rather than using 0.5. It should note that SMOTE can cause overfitting on small fraud datasets and that ensemble methods like balanced random forest are purpose-built for this.
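A minimal sketch of three of these pieces follows: inverse-frequency class weights, average precision (a common estimate of precision-recall AUC), and F-beta. It uses only the standard library, the function names are hypothetical, and the weight formula n / (n_classes * count) is one common convention rather than the only one. The average precision helper assumes no tied scores.

```python
def class_weights(labels):
    """Weights inversely proportional to class frequency: n / (n_classes * count)."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def average_precision(labels, scores):
    """Mean of precision@rank taken at each positive, ranking by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank
    return ap / total_pos


def f_beta(precision, recall, beta=2.0):
    """beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# 0.1% positive rate, as in the question.
labels = [0] * 999 + [1]
weights = class_weights(labels)

# The always-negative baseline: high accuracy, zero recall.
baseline_accuracy = sum(1 for y in labels if y == 0) / len(labels)
baseline_recall = 0 / 1

ap = average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.2, 0.1])
```

Here the single fraud example receives weight 500.0 against roughly 0.5 for each legitimate example, and the baseline scores 0.999 accuracy with zero recall, which is why accuracy is not a usable promotion gate.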

Question 5

A medical diagnosis model is being promoted to production and you are designing the evaluation gate in CI. The clinical team wants to minimise false negatives. Explain the precision-recall tradeoff, how you encode the acceptable false-negative rate as a hard threshold in the promotion gate, and how you would surface threshold drift to model owners if the score distribution shifts post-deployment.

Accepted Answer

A strong answer defines precision (of flagged patients, what fraction truly need follow-up) and recall (of all true cases, what fraction are flagged), explains that lowering the threshold increases recall at the cost of precision (more false alarms), and notes that for this use case recall should be prioritized to minimize missed diagnoses, accepting lower precision (more unnecessary follow-ups). The threshold choice should involve clinical stakeholders, who agree on the acceptable false-negative rate and on the false-positive workload that follows from it.
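One way to encode the agreed false-negative rate as a hard gate is sketched below. It picks the highest threshold whose validation false-negative rate stays within the limit, then fails the gate if precision at that threshold falls under a floor. The function names, the precision floor, and the flag-rate drift check are illustrative assumptions, not a prescribed design.

```python
def confusion(labels, scores, threshold):
    """Return (tp, fp, fn) when scores >= threshold are flagged positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp, fp, fn


def promotion_gate(labels, scores, max_fnr, min_precision):
    """Pick the highest threshold meeting max_fnr, then check the precision floor."""
    if sum(labels) == 0:
        raise ValueError("validation set has no positive cases")
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, fn = confusion(labels, scores, threshold)
        fnr = fn / (tp + fn)
        if fnr <= max_fnr:
            precision = tp / (tp + fp)
            return {
                "threshold": threshold,
                "fnr": fnr,
                "precision": precision,
                "passed": precision >= min_precision,
            }
    raise ValueError("no threshold satisfies the false-negative limit")


def flag_rate(scores, threshold):
    """Fraction of cases flagged; compare validation vs production to surface drift."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


val_labels = [1, 1, 1, 0, 0, 0, 0, 1]
val_scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2, 0.1, 0.4]
gate = promotion_gate(val_labels, val_scores, max_fnr=0.25, min_precision=0.5)

# Post-deployment: a shifted score distribution changes the flagged fraction.
prod_scores = [0.95, 0.85, 0.75, 0.7, 0.65, 0.6, 0.2, 0.1]
drift = flag_rate(prod_scores, gate["threshold"]) - flag_rate(val_scores, gate["threshold"])
```

In CI the job would exit non-zero when `gate["passed"]` is false. After deployment, a sustained gap between the production and validation flag rates at the frozen threshold is a cheap, label-free signal to send to model owners.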

Question 6

How does Docker layer caching work, and how would you structure a Dockerfile for a PyTorch training image to maximize cache hits during iterative development?

Accepted Answer

Docker decides whether a layer can be reused from the instruction text and the parent layer; for COPY and ADD it also compares checksums of the copied files. A cache miss on any layer invalidates all subsequent layers. For an ML training image, the order should be: base image (PyTorch CUDA), system dependencies (apt-get), Python dependencies (copy requirements.txt on its own, then pip install), then the training code. Changing training code only rebuilds the final layer; changing requirements.txt rebuilds from step 3. The candidate should mention separating stable, heavy dependencies (torch) from frequently changing ones into different layers, with project code copied last.
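The invalidation rule can be illustrated with a toy model in which each layer's cache key chains the parent key, the instruction text, and a checksum of any copied content. This is a simplification for intuition only, not Docker's actual implementation; the base image name and content labels are placeholders.

```python
import hashlib


def layer_keys(steps):
    """Chain each step's key from its parent key, instruction, and content checksum."""
    keys, parent = [], ""
    for instruction, content in steps:
        digest = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def rebuilt_layers(old_steps, new_steps):
    """Return the instructions whose cache key changed between two builds."""
    old, new = layer_keys(old_steps), layer_keys(new_steps)
    return [step[0] for step, o, n in zip(new_steps, old, new) if o != n]


def dockerfile(requirements_version, code_version):
    # (instruction, checksum of copied files); RUN and FROM copy nothing.
    return [
        ("FROM <pytorch-cuda-base-image>", ""),
        ("RUN apt-get update && apt-get install -y <system-packages>", ""),
        ("COPY requirements.txt .", requirements_version),
        ("RUN pip install -r requirements.txt", ""),
        ("COPY src/ ./src", code_version),
    ]


baseline = dockerfile("reqs-v1", "code-v1")
after_code_change = rebuilt_layers(baseline, dockerfile("reqs-v1", "code-v2"))
after_reqs_change = rebuilt_layers(baseline, dockerfile("reqs-v2", "code-v1"))
```

In this model a code change rebuilds only the final COPY, while a requirements change rebuilds the requirements COPY, the pip install, and the code COPY after it, even though the code itself did not change.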

Question 7

Your team is deploying a transformer-based ranking model where the self-attention layer is the inference latency bottleneck at sequence lengths above 512. Explain how self-attention works at the matrix operation level, what the O(n^2) complexity means for your serving SLA as sequence length grows, and which inference optimisations you would evaluate to address it.

Accepted Answer

A strong answer describes projecting the input into query, key, and value matrices, computing attention weights as softmax(QK^T / sqrt(d_k)), and multiplying those weights by V to get context-weighted representations. Time complexity is O(n^2 * d), where n is sequence length and d is the head dimension (d_k above), because every token attends to every other token. Memory is O(n^2) for the attention matrix, which is the primary bottleneck for long sequences. It should mention that this is why Flash Attention and sparse attention patterns were developed.
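The matrix operations can be written out directly. The sketch below is a single-head, unbatched, pure-Python version meant to make the n-by-n weight matrix visible; it takes Q, K, and V as already-projected lists of rows, and it is far too slow for real use.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    # n x n score matrix: one dot product per (query, key) pair.
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value rows.
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


Q = [[0.0, 0.0], [0.0, 0.0]]  # zero queries give uniform attention
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(Q, K, V)


def attention_matrix_entries(n):
    """Number of entries in the n x n attention matrix for one head."""
    return n * n
```

Going from sequence length 512 to 2048 multiplies the attention matrix from 262,144 to 4,194,304 entries per head, a 16x increase for a 4x longer input, which is the quadratic growth the serving SLA has to absorb.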

Question 8

You are designing a Kafka topic that carries ML inference request logs at 200,000 messages per second. How do you choose the number of partitions and design the consumer group for a drift monitoring application?

Accepted Answer

Partition count should be at least as large as the maximum desired consumer parallelism, and should also cover producer throughput per partition (a common rule of thumb is around 10MB/s per partition for fast producers; measure it on your own cluster). At 200K messages/sec with a moderate message size of, say, 1KB, ingress is about 200MB/s, so roughly 20 partitions is the floor and 20 to 50 partitions is reasonable once headroom is included. The drift monitoring consumer group should have one consumer per partition for maximum parallelism, use a stateless design that accumulates statistics in a shared Redis or writes to a time-series DB, and manage offsets carefully to allow replay when backfilling reference distributions.
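The sizing arithmetic can be sketched as a small calculator that takes the larger of the throughput-driven and consumer-driven partition counts. Every default below (1KB messages, 10MB/s per partition, 1.5x headroom, per-consumer processing rate) is an assumption for illustration and should be replaced with measured values.

```python
import math


def partitions_needed(
    msgs_per_sec,
    avg_msg_bytes,
    per_partition_bytes_per_sec=10_000_000,  # assumed rule-of-thumb throughput
    consumer_msgs_per_sec=None,              # assumed per-consumer processing rate
    headroom=1.5,                            # assumed safety factor for bursts
):
    """Return the partition count implied by throughput and consumer parallelism."""
    ingress_bytes = msgs_per_sec * avg_msg_bytes
    by_throughput = math.ceil(ingress_bytes * headroom / per_partition_bytes_per_sec)
    by_consumers = 1
    if consumer_msgs_per_sec:
        by_consumers = math.ceil(msgs_per_sec * headroom / consumer_msgs_per_sec)
    return max(by_throughput, by_consumers)


throughput_bound = partitions_needed(200_000, 1_000)
consumer_bound = partitions_needed(200_000, 1_000, consumer_msgs_per_sec=5_000)
```

With these assumptions throughput alone asks for 30 partitions, but if each drift-monitoring consumer only handles 5,000 messages per second the consumer side asks for 60, so the slower of the two stages sets the partition count.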

Question 9

Write a SQL query that computes, for each customer transaction, the running 30-day sum of spending and the 7-day average transaction value as features, ensuring no future data leaks into each row's feature values.

Accepted Answer

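A strong answer uses SUM(amount) OVER (PARTITION BY customer_id ORDER BY transaction_date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) and AVG(amount) OVER (PARTITION BY customer_id ORDER BY transaction_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), and recognizes the limitation: a ROWS frame counts rows, not days, so it matches a 30-day or 7-day window only when each customer has exactly one transaction per day. It should explain that RANGE BETWEEN with date arithmetic (for example RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW in PostgreSQL) is more correct when there are multiple transactions per day or gaps between transactions, and note that the window definition is point-in-time correct in that the frame ends at the current row and never reads later rows. With RANGE, order by a full timestamp rather than a date, because rows that tie on the ordering value are treated as peers of the current row and are included in its frame.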
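The difference between the two frame types shows up as soon as transactions have gaps. The sketch below runs both against an in-memory SQLite database from Python; it assumes a SQLite build of 3.28 or newer (needed for RANGE frames with numeric offsets), uses julianday() to turn dates into day numbers, and uses a hypothetical transactions table with made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, transaction_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 100.0),
        (1, "2024-01-20", 30.0),
        (1, "2024-02-15", 20.0),
        (1, "2024-03-30", 10.0),
    ],
)

query = """
SELECT
    transaction_date,
    -- Calendar window: the current day and the 29 days before it.
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY julianday(transaction_date)
        RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS spend_30d,
    -- Calendar window: the current day and the 6 days before it.
    AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY julianday(transaction_date)
        RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS avg_amount_7d,
    -- Row-count window, shown for contrast: the last 30 rows, whatever their dates.
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY transaction_date
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS spend_last_30_rows
FROM transactions
ORDER BY customer_id, transaction_date
"""
rows = conn.execute(query).fetchall()
conn.close()
```

For the 2024-02-15 row the RANGE frame sums only the transactions from 2024-01-17 onward (50.0), while the ROWS frame still includes the 2024-01-01 purchase (150.0), so the two features diverge whenever a customer is not active every day.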

Question 10

Compare gRPC and REST HTTP/JSON for exposing a model serving endpoint that will handle 50,000 requests per second with binary tensor payloads. What are the key differences in performance and operational overhead?

Accepted Answer

A strong answer notes that gRPC uses HTTP/2 for multiplexed persistent connections and Protocol Buffers for compact binary serialization, reducing per-request overhead significantly for high-throughput binary payloads (avoiding JSON parsing and base64-encoding of tensors). REST with JSON over HTTP/1.1 can reuse connections through keep-alive but cannot multiplex requests on one connection, and it pays JSON parsing and text-encoding overhead on every request. gRPC adds operational complexity (protobufs, service definitions, less tooling support) but at 50K RPS with tensor payloads the performance gains typically justify it. NVIDIA Triton Inference Server exposes both protocols.
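The payload-size part of the argument can be checked with the standard library alone. The sketch below compares raw little-endian float32 bytes (a stand-in for what a binary protocol would carry, not an actual protobuf message) against the same tensor sent as base64 inside JSON and as a JSON number list; the field names are hypothetical.

```python
import base64
import json
import struct


def encode_binary(values):
    """Pack values as little-endian float32: 4 bytes per element."""
    return struct.pack(f"<{len(values)}f", *values)


def encode_json_base64(values):
    """JSON body carrying the float32 bytes as a base64 string."""
    raw = encode_binary(values)
    body = {"tensor_b64": base64.b64encode(raw).decode("ascii")}
    return json.dumps(body).encode("utf-8")


def encode_json_list(values):
    """JSON body carrying the values as decimal text."""
    return json.dumps({"tensor": values}).encode("utf-8")


def decode_json_base64(payload):
    """Server-side decode: parse JSON, decode base64, unpack float32."""
    raw = base64.b64decode(json.loads(payload)["tensor_b64"])
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


tensor = [0.123456789 * i for i in range(1024)]

binary_size = len(encode_binary(tensor))
b64_size = len(encode_json_base64(tensor))
list_size = len(encode_json_list(tensor))
round_trip = decode_json_base64(encode_json_base64(tensor))
```

Base64 alone adds about a third to the 4096-byte binary payload before any JSON parsing cost, and the decimal-text form is larger still; at 50,000 requests per second that difference is paid on every request in both bandwidth and CPU.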

Questions

How does gradient boosting differ from random forests (Domain knowledge, easy, Very common)

Normalization vs standardization in ML preprocessing (Domain knowledge, easy, Very common)

Cross-validation strategy for time-series data (Domain knowledge, medium, Very common)

Techniques for handling severe class imbalance (Domain knowledge, medium, Very common)

Precision-recall tradeoff in model deployment decisions (Domain knowledge, easy, Very common)

Docker layer caching for ML training images (Domain knowledge, easy, Very common)

Self-attention mechanism in transformer models (Domain knowledge, medium, Common)

Kafka partition count and consumer group design for ML pipelines (Domain knowledge, medium, Common)

Window functions for training dataset construction in SQL (Domain knowledge, medium, Common)

gRPC vs REST for high-throughput model serving (Domain knowledge, medium, Common)

