Machine Learning Engineer Interview Prep

The four rounds you are preparing for

A machine learning engineer loop usually spans four kinds of round, and treating them as one blurry "ML interview" is how people underprepare. There is an ML fundamentals round on concepts and modelling judgement, a coding round much like a standard software interview, an ML system design round on building a model-serving system end to end, and often a round on deployment, monitoring, and the operational side. Strong candidates know which signal each round is after and prepare for them separately.

The role sits between data science and software engineering, so interviewers want both. You should be able to reason about bias and variance and also write clean, testable code and talk about latency budgets. Lean too far toward theory and you look like you have never shipped a model. Lean too far toward engineering and you look like you do not understand what you are deploying.

Here is how the four rounds usually map to the signal a panel is trying to extract.

Round	What it really tests	The failure mode it screens for
ML fundamentals	Modelling judgement, knowing tradeoffs	Memorised definitions, no instinct for when to use what
Coding	Clean code under time pressure	"I only do notebooks" engineers who cannot ship
ML system design	End-to-end thinking, data first	Architecture astronauts who skip data and metrics
Deployment and monitoring	Operational maturity	People who think the job ends at offline accuracy

A useful frame: data scientists are hired to find an answer, ML engineers are hired to keep that answer working in production at 3am. Most loops are calibrated to find the second kind of person.

Which version of the ML role you are actually interviewing for

Three job titles share this loop and the same four rounds, but the centre of gravity moves between them, and preparing for the wrong centre is a quiet way strong candidates underperform. Read the job description to work out which round will carry the most weight before you decide where to spend your reps. Chip Huyen's Machine Learning Interviews book makes a related point: almost every company asks a systems design question, and it is the round candidates find hardest, so it rewards deliberate practice more than any other.

Title	Where the loop's weight sits	What to over-prepare
ML engineer	ML system design and serving, balanced with modelling	End-to-end design: metric, data, baseline, serving, monitoring
MLOps engineer	Deployment, pipelines, reproducibility, infrastructure	Drift detection, retraining triggers, feature-store consistency
AI engineer	LLMs, retrieval, and evaluating generative systems	Prompt and context design, offline evals, retrieval quality

If the posting is for an AI engineer, the modelling questions bend toward retrieval and evaluating generative systems rather than training from scratch; the AI engineer interview prep guide covers that flavour in depth.

One more axis changes the bar without changing the questions: how much you are expected to drive the room. Earlier in a career it is acceptable to be led through the design round when prompted. The more senior the role, the more the panel expects you to surface the metric, the data problems, and the training and serving skew yourself before anyone asks. At the staff bar the conversation moves from "build the model" to "should we use machine learning here at all, and what does being wrong cost the organisation." That is a difference in who drives, not in what gets asked.

ML fundamentals, judgement over recall

The fundamentals round is not a vocabulary quiz. It checks whether you can choose the right approach for a problem and explain the tradeoffs. Expect questions on the bias-variance tradeoff, overfitting and regularisation, how you would handle imbalanced classes, and when a simpler model beats a neural network.

The strongest answers connect a choice to consequences. If asked how you would handle a fraud dataset where positives are one in a thousand, do not just say "use SMOTE." Talk about why accuracy is a useless metric here, why you would look at precision, recall, and the precision-recall curve, how class weighting or resampling each have costs, and how the right operating point depends on the business cost of a false negative versus a false positive.

Here is what the gap between a weak and a strong answer looks like on that exact prompt.

Weak answer: "The data is imbalanced so I would oversample the minority class with SMOTE, then train a random forest and check accuracy."

Strong answer: "First I would not trust accuracy, because predicting 'not fraud' every time scores 99.9 percent and catches nothing. I would optimise for recall at a fixed precision, or look at the area under the precision-recall curve, because that is what survives class imbalance. SMOTE can help but it fabricates points near the decision boundary and can leak if I resample before splitting, so I would split first, then try class weighting as the cheaper baseline. The real lever is the operating threshold: a missed fraud might cost the business a hundred pounds while a false alarm costs a customer ten seconds of friction, so I would tune the threshold against that cost ratio, not against a default of 0.5."

The second answer is not longer because it is padded. It is longer because every sentence names a consequence. That is the texture interviewers grade on.

Two of those moves rest on facts worth citing if a follow-up pushes you. On heavily imbalanced data a model that always predicts the majority class can score 99 percent accuracy while catching nothing, which is exactly why Google's classification crash course steers you to precision and recall over accuracy. And resampling before you split leaks test information into training, so scikit-learn's own pitfalls guide tells you to split first and fit any transform on the training subset alone.
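To make the accuracy trap concrete, here is a minimal sketch on made-up data: 1,000 transactions with a single fraud, matching the one-in-a-thousand rate in the prompt. The helper name `accuracy_and_recall` and both "models" are hypothetical and exist only for illustration.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 is the positive class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    recall = tp / positives if positives else 0.0
    return correct / len(y_true), recall

# Synthetic, illustrative data: 1 fraud in 1,000 transactions.
y_true = [1] + [0] * 999

# A "model" that always predicts the majority class.
always_negative = [0] * 1000
acc_majority, recall_majority = accuracy_and_recall(y_true, always_negative)
print(acc_majority, recall_majority)  # 0.999 0.0

# A "model" that catches the fraud but raises 9 false alarms.
cautious = [1] * 10 + [0] * 990
acc_cautious, recall_cautious = accuracy_and_recall(y_true, cautious)
print(acc_cautious, recall_cautious)  # 0.991 1.0
```

The second model has lower accuracy yet is the only one that catches the fraud, which is the argument for judging it on precision and recall instead.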

Be ready for "when would you not use a neural network." A good answer covers small datasets, the need for interpretability, tight latency or cost budgets, and cases where a gradient boosted tree on tabular data simply performs better with far less effort. Knowing when not to reach for the complex tool is a senior signal.

A few fundamentals that come up constantly and reward a precise answer:

Regularisation: L1 can drive some weights to exactly zero, so it doubles as feature selection, while L2 shrinks all weights smoothly towards zero without eliminating them. Dropout is regularisation for neural nets. Be able to say which one and why.
Evaluation splits: why a random split leaks for time-series or user-level data, and when you need a temporal or grouped split instead.
Calibration: a model can rank well (good AUC) yet output probabilities that are meaningless. If a downstream system multiplies the probability by a cost, you need calibration, not just ranking.
Cross-validation: when k-fold is worth the cost and when a single held-out set is enough.
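The evaluation-split point is easier to defend with something concrete. Below is a minimal sketch of a grouped split, assuming each row carries a `user_id` field; the function name `grouped_split` and the toy rows are hypothetical. It keeps every user entirely on one side of the split, so the model is never tested on a user it has already seen.

```python
import random

def grouped_split(rows, group_key, test_fraction=0.2, seed=0):
    """Split rows so that every group lands entirely in train or entirely in test."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)  # seeded so the split is reproducible
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Toy data: 10 users with 5 events each.
rows = [{"user_id": u, "event": e} for u in range(10) for e in range(5)]
train, test = grouped_split(rows, "user_id")

train_users = {r["user_id"] for r in train}
test_users = {r["user_id"] for r in test}
print(train_users & test_users)  # set()
```

In practice you would probably reach for a library splitter such as scikit-learn's `GroupKFold`, or `TimeSeriesSplit` for temporal data; the hand-rolled version is only there to show the idea.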

The coding round still matters

Many candidates neglect coding because they think the ML rounds carry the offer. They do not. A sloppy coding round can sink an otherwise strong loop. The format is usually standard data structures and algorithms, sometimes with a numerical or array-heavy flavour.

You may also be asked to implement a small piece of ML from scratch, which tests that you understand the maths underneath the libraries. Gradient descent is a common one.

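def gradient_descent(x, y, lr=0.01, steps=1000):
    """Fit y = w * x + b by minimising mean squared error."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        preds = [w * xi + b for xi in x]
        # Gradients of the mean squared error with respect to w and b.
        dw = (2 / n) * sum((preds[i] - y[i]) * x[i] for i in range(n))
        db = (2 / n) * sum(preds[i] - y[i] for i in range(n))
        # Step against the gradient, scaled by the learning rate.
        w -= lr * dw
        b -= lr * db
    return w, b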

w, b = gradient_descent([1, 2, 3, 4], [2, 4, 6, 8])
print(round(w, 2), round(b, 2))  # 1.99 0.03  (recovers y = 2x: w near 2, b near 0)

Be ready to explain each line, what the learning rate controls, and what happens if it is too large or too small. If the rate is too large the loss oscillates or diverges, if it is too small training crawls and may stall in a flat region. Treat this like any coding round: state your approach, talk through complexity, and check edge cases.
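The learning-rate behaviour described above can be checked directly. This is a small sketch on the same toy data (y = 2x); the helper name `final_loss` and the three rates are chosen for illustration only. On this dataset the update turns unstable once the rate passes roughly 0.12, so 0.2 is comfortably in the diverging regime.

```python
def final_loss(lr, steps=50):
    """Run gradient descent on y = 2x and return the final mean squared error."""
    x, y = [1, 2, 3, 4], [2, 4, 6, 8]
    w, b, n = 0.0, 0.0, len(x)
    for _ in range(steps):
        preds = [w * xi + b for xi in x]
        dw = (2 / n) * sum((p - yi) * xi for p, yi, xi in zip(preds, y, x))
        db = (2 / n) * sum(p - yi for p, yi in zip(preds, y))
        w -= lr * dw
        b -= lr * db
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / n

start_loss = final_loss(lr=0.0, steps=0)  # loss at w = 0, b = 0
print("start", start_loss)                # start 30.0

for lr in (0.0001, 0.01, 0.2):
    # Too small: barely moves. Reasonable: loss drops sharply. Too large: loss explodes.
    print(lr, final_loss(lr))
```

After 50 steps the tiny rate has hardly improved on the starting loss, the middle rate has cut it by orders of magnitude, and the large rate has blown up far beyond where it started.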

Interviewers also like to push one level further. Expect a follow-up such as "now vectorise this with NumPy" or "add an evaluation metric." Being able to implement a metric cleanly signals that you understand what you are optimising rather than calling sklearn on autopilot.

def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # No true positives means F1 is zero; returning early also avoids dividing by zero.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 2 of 3 real positives caught, 1 false alarm: precision 0.667, recall 0.667
print(round(f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]), 4))  # 0.6667

The guard for tp == 0 is the kind of edge case that quietly separates a careful candidate from a hasty one. Name it out loud when you write it.

ML system design, the round that decides offers

This is where ML engineer offers are usually won or lost. You will be given a prompt like "design a system to recommend products" or "design fraud detection for checkout," and you need to reason from the data and the product backwards, not jump to a model.

Use a repeatable structure so you never freeze on a blank prompt:

Clarify the goal and the metric. Recommendations optimised for clicks behave very differently from recommendations optimised for long-term retention. Pin down what success means before anything else.
Establish scale and latency. Requests per second, acceptable p99 latency, batch versus real-time. These constraints decide most architecture choices for you.
Data and labels. Where do the labels come from, how delayed are they, how noisy. This is usually the hard part and the part candidates skip.
Features and a feature store. What features, computed when, and how you keep them consistent between training and serving.
Model and a baseline. Always name a simple baseline first.
Training and evaluation. Offline metric, the split strategy, and how it connects to the online metric.
Serving. Online inference, batch precompute, or a hybrid.
Monitoring and iteration. How you know it is still working and how you retrain.

A worked example: design fraud detection for checkout

Here is roughly how a strong forty-five minute answer flows. It is compressed, but it shows the shape.

Clarify. "What is the cost asymmetry? Blocking a real customer at checkout is expensive in lost revenue and trust, missing a fraud is a chargeback. I will assume a missed fraud costs roughly ten times a false block, so I will tune for high recall but cap the false-positive rate." Then scale: "Say 2,000 checkouts per second at peak, and the decision must return inside 100ms because it sits on the critical purchase path."
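The two assumptions in that opening can be turned into numbers on the whiteboard. The sketch below is one way to do it, not the only one: it assumes the model outputs calibrated fraud probabilities, uses the ten-to-one cost ratio from the answer, and applies Little's law to the stated peak load. The function name `cost_optimal_threshold` is hypothetical.

```python
def cost_optimal_threshold(cost_false_negative, cost_false_positive):
    """Probability above which blocking is cheaper than allowing.

    Allowing costs p * cost_false_negative in expectation, blocking costs
    (1 - p) * cost_false_positive, so block when p exceeds the value returned.
    Assumes the model's probabilities are calibrated.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# A missed fraud is assumed to cost ten times a false block.
threshold = cost_optimal_threshold(cost_false_negative=10, cost_false_positive=1)
print(round(threshold, 4))  # 0.0909

# Little's law: average requests in flight = arrival rate x time in system.
peak_rps = 2000
latency_budget_ms = 100
in_flight = peak_rps * latency_budget_ms / 1000
print(in_flight)  # 200.0
```

A threshold near 0.09 rather than 0.5 is what the cost asymmetry implies, and the false-positive cap mentioned in the answer would then act as a constraint on top of it. The 200 in-flight requests is a worst case in which every request uses the full budget.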

Data and labels. "Labels arrive late. A transaction is only confirmed fraudulent when a chargeback lands, which can be 30 to 90 days later. That label delay shapes everything: my training data is always describing an old world, so drift monitoring is not optional." Mention that some labels are noisy because not all fraud is reported.
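One practical consequence of the label delay, sketched here as an assumption rather than something the answer above spells out: a recent transaction with no chargeback is not yet a trustworthy negative, so a training set might only include rows older than the chargeback window. The constant `LABEL_MATURITY_DAYS`, the function `mature_rows`, and the sample rows are all hypothetical.

```python
from datetime import date, timedelta

LABEL_MATURITY_DAYS = 90  # assumption: upper end of the 30 to 90 day chargeback window

def mature_rows(rows, today, maturity_days=LABEL_MATURITY_DAYS):
    """Keep transactions old enough that a missing chargeback likely means 'not fraud'."""
    cutoff = today - timedelta(days=maturity_days)
    return [r for r in rows if r["txn_date"] <= cutoff]

today = date(2024, 6, 30)
rows = [
    {"txn_date": date(2024, 1, 15), "chargeback": False},  # mature, usable negative
    {"txn_date": date(2024, 3, 1), "chargeback": True},    # mature, confirmed fraud
    {"txn_date": date(2024, 6, 20), "chargeback": False},  # too recent to trust
]
usable = mature_rows(rows, today)
print(len(usable))  # 2
```

The cost of this filter is exactly the staleness the answer describes: the freshest three months of behaviour never reach training, which is why drift monitoring carries so much weight.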

Features. "Velocity features matter most here: transactions per card in the last minute, hour, day; distance between this purchase location and the last; device and IP reputation. The trap is that the one-minute velocity feature must be computed identically online and offline, so I would compute it once in a feature store and read it from both paths."
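A minimal sketch of the "compute it once" idea, assuming timestamps are in seconds and arrive in non-decreasing order per card. The class name `VelocityCounter` is hypothetical and stands in for whatever the feature store provides; the point is that online scoring and offline replay call the same code.

```python
from collections import deque

class VelocityCounter:
    """Counts transactions per card inside a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = {}  # card_id -> deque of timestamps still inside the window

    def add_and_count(self, card_id, timestamp):
        """Record one transaction and return how many fall in the window, including it."""
        q = self._events.setdefault(card_id, deque())
        q.append(timestamp)
        # Evict timestamps that have aged out of the window.
        while q and q[0] <= timestamp - self.window_seconds:
            q.popleft()
        return len(q)

# Offline: replay a historical log, sorted by time, through the same class
# that the online path would call one event at a time.
log = [("card_a", 0), ("card_a", 20), ("card_b", 30), ("card_a", 50), ("card_a", 75)]
counter = VelocityCounter(window_seconds=60)
features = [counter.add_and_count(card, ts) for card, ts in log]
print(features)  # [1, 2, 1, 3, 3]
```

If training instead used a batch SQL aggregate with slightly different window boundaries, the two paths would disagree on the last event here, which is the skew the answer warns about.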

Baseline then model. "I would ship logistic regression or a gradient boosted tree first. Both train in minutes and are far easier to explain than a deep model, which matters because fraud teams need to explain decisions for compliance. A deep model comes later only if the simple one plateaus."

Serving. "Real-time scoring behind the feature store, with a hard latency budget and a fail-open or fail-closed policy I would decide with the business: if the model service times out, do we let the transaction through or hold it?"
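The timeout policy in that answer can be sketched in a few lines. This is an illustration under stated assumptions, not a production pattern: `decide`, `fast_model`, and `slow_model` are hypothetical names, the 0.5 threshold is arbitrary, and a real service would enforce the deadline at the RPC layer rather than with a thread pool.

```python
import concurrent.futures
import time

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def decide(score_fn, txn, timeout_s=0.1, threshold=0.5, fail_open=True):
    """Return 'allow', 'block', or 'hold'; fall back to the agreed policy on timeout."""
    future = _executor.submit(score_fn, txn)
    try:
        score = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Fail open lets the purchase through; fail closed holds it for review.
        return "allow" if fail_open else "hold"
    return "block" if score >= threshold else "allow"

def fast_model(txn):
    return 0.9  # scores well inside the budget

def slow_model(txn):
    time.sleep(0.3)  # simulates a model service that misses the 100ms budget
    return 0.9

print(decide(fast_model, {}))                   # block
print(decide(slow_model, {}, fail_open=True))   # allow
print(decide(slow_model, {}, fail_open=False))  # hold
```

The same high-risk transaction gets three different outcomes depending on latency and policy, which is why the choice belongs to the business and not to whoever wrote the client.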

Monitoring. "Because labels are delayed, I cannot watch accuracy in real time. So I watch input feature drift and the score distribution as leading indicators, and the confirmed chargeback rate as the lagging ground truth. A sudden shift in the score distribution triggers an investigation before the chargebacks even arrive."

Two points separate strong answers from average ones in any design prompt. First, always propose a simple baseline before the fancy model, because interviewers want to see that you would not reach for a deep model when logistic regression would ship faster and be easier to debug. Second, be explicit about training and serving skew. If a feature is computed one way in training and another way at serving time, the model silently degrades, and naming that risk shows real experience. Both instincts trace to Google's Rules of Machine Learning: its very first rule is that you should not be afraid to launch without machine learning when a simple heuristic gets you most of the way, and it devotes a whole section to training-serving skew, the gap between how features are built in each pipeline.

Deployment, monitoring, and the operational side

A model that scores well offline can still fail in production, and interviewers increasingly probe this. Be ready to talk about how you would deploy a model safely and watch it once it is live.

Roll out behind a shadow or canary so you can compare the new model against the current one on real traffic before switching over.
Monitor input feature distributions for drift, because the world changes and a model trained last quarter may see different data today.
Track the prediction distribution and the downstream business metric, not just request latency, since a model can keep serving fast while quietly getting worse.
Have a rollback plan and a retraining cadence, and know what would trigger an emergency retrain.
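For the shadow rollout in the first point, a rough sketch of the mechanics might look like this. The function names, the in-memory `log`, and the two toy models are hypothetical; a real system would log asynchronously and compare over far more traffic.

```python
def serve_with_shadow(txn, live_model, shadow_model, log):
    """Serve the live model's score; score the shadow model only for comparison."""
    live_score = live_model(txn)
    try:
        shadow_score = shadow_model(txn)
    except Exception:
        shadow_score = None  # a broken candidate must never affect live traffic
    log.append({"live": live_score, "shadow": shadow_score})
    return live_score

def disagreement_rate(log, threshold=0.5):
    """Fraction of requests where the two models land on different sides of the threshold."""
    scored = [r for r in log if r["shadow"] is not None]
    if not scored:
        return 0.0
    differ = sum((r["live"] >= threshold) != (r["shadow"] >= threshold) for r in scored)
    return differ / len(scored)

# Toy models: the candidate flags transactions at a lower amount than the live one.
def live(txn):
    return 0.9 if txn["amount"] > 100 else 0.1

def candidate(txn):
    return 0.9 if txn["amount"] > 50 else 0.1

log = []
for amount in (10, 60, 120, 200):
    serve_with_shadow({"amount": amount}, live, candidate, log)

print(disagreement_rate(log))  # 0.25
```

The disagreements are the interesting rows: reviewing them by hand before switching over tells you whether the candidate is catching more fraud or just blocking more customers.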

It helps to distinguish the kinds of drift, because interviewers probe whether you actually understand what fails. The line monitoring tools such as Evidently draw is between data drift, where the input distribution moves, and concept drift, where the relationship between the inputs and the label moves; naming which one you suspect tells the panel you have watched a model decay rather than only trained one.

Type of drift	What changes	A concrete example
Data drift	The input distribution moves	A new device type appears that the model never saw in training
Concept drift	The relationship between inputs and label moves	Fraudsters change tactics, so the same signals now mean something different
Label delay	Ground truth arrives late or never	Chargebacks confirm fraud months after the prediction

If you can describe how you would detect that a live model has degraded, and what you would do about it, you stand out from candidates who only discuss offline accuracy. The senior move is to say what you would monitor as a leading indicator when the true label is delayed, rather than waiting for accuracy to drop.
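One way to turn "watch the score distribution" into a number is the Population Stability Index. PSI is a swapped-in choice here, since the text does not name a specific statistic, and the bin edges and sample scores below are made up for illustration.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples over fixed bin edges.

    Assumes both samples are non-empty and fall inside the outer edges.
    """
    def fractions(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                in_bin = bin_edges[i] <= v < bin_edges[i + 1]
                if in_bin or (is_last and v == bin_edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        # Floor each fraction so an empty bin does not produce log(0).
        return [max(c / total, eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [0.0, 0.33, 0.66, 1.0]
baseline = [0.05] * 80 + [0.5] * 15 + [0.95] * 5   # scores at training time
shifted = [0.05] * 50 + [0.5] * 20 + [0.95] * 30   # scores today, more high-risk

print(psi(baseline, baseline, edges))  # 0.0
print(psi(baseline, shifted, edges))   # well above 0.25
```

A commonly quoted rule of thumb treats a PSI above about 0.25 as a significant shift, but the alert threshold is something to tune per feature rather than take on faith. The same function works on input features for data drift as well as on model scores.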

Where ML loops actually go wrong

Jumping to a complex model before stating the metric or proposing a baseline.
Ignoring data and labels, which are usually the hard part, and over-indexing on architecture.
Forgetting training and serving skew, then being unable to explain why the live model underperforms.
Neglecting the coding round and assuming ML knowledge alone carries the offer.
Quoting metrics with no link to the business, for example optimising AUC when the product cares about revenue per session.
Designing only the happy path and never mentioning what happens when the model service is down or the feature store is stale.

How to practise

Run three or four full ML system design prompts out loud on a timer: a recommender, a fraud system, a search ranker, and a content moderation pipeline cover most of the patterns. Separately, drill standard coding problems and implement a few small ML pieces like gradient descent and a basic evaluation metric from scratch. After each design run, check that you stated the metric, proposed a baseline, named the training and serving skew, and described monitoring. That balance of theory and engineering is exactly what the role demands.

A simple self-scoring checklist after each mock design:

Did I clarify the metric before naming a model?
Did I talk about where the labels come from and how delayed they are?
Did I propose a baseline first?
Did I name training and serving skew explicitly?
Did I describe what I would monitor and what would trigger a retrain?

If you can tick all five without being prompted, you are interviewing at a senior bar.

Frequently asked questions

How much maths do I need? Enough to explain the intuition and implement the basics from scratch. You should be comfortable with gradients, loss functions, and probability, but you will rarely be asked to derive backpropagation by hand. Judgement about tradeoffs matters more than proofs.

Do I need deep learning to pass? Often no. Many production ML problems, especially on tabular data, are best served by gradient boosted trees and logistic regression. Knowing when a deep model is and is not justified is itself a strong signal.

What if I have not deployed a model in production? Be honest, then lean on the parts you can reason about: the framework, the tradeoffs, and how you would detect and handle failure. Walk through the worked example structure above and you will cover most of what a deployment round is checking for.

How long should I prepare? For most engineers with some ML exposure, three to four weeks of focused practice across the four rounds is realistic. The system design round usually needs the most reps because it is the hardest to fake.

Where to take your ML prep next

Because the design round decides most ML engineer loops, the highest-leverage practice is more reps on end-to-end system design.

Machine learning engineer interview questions to drill the fundamentals and coding rounds against real prompts.
System design interview prep for senior roles for the drive-the-room habits the senior bar expects.
Backend system design deep dive for the serving, storage, and scaling reasoning your model has to live inside.
Design a rate limiter for a worked, numbers-first design walk-through to model your own answers on.

Sources

Google, Rules of Machine Learning: baseline before ML (Rule 1) and the training-serving skew section.
scikit-learn, Common pitfalls and recommended practices: data leakage from preprocessing before the split.
Google, Classification: Accuracy, recall, precision: why accuracy misleads on imbalanced data.
Evidently AI, What is data drift: data drift versus concept drift.
Chip Huyen, Machine Learning Interviews book: the ML interview rounds and why system design is the hardest.
