Data Scientist Case Study Interviews

What the case study round is really for

The data scientist case study is where companies find out whether you can do the actual job, not just pass a stats quiz. A typical prompt is loose on purpose: "our subscription product has rising churn, what would you do," or "design an experiment for the new onboarding flow." The interviewer watches whether you can turn a business problem into a precise question, choose the right method, reason about what could mislead you, and explain it to a non-statistician.

These rounds reward structure and judgement far more than a perfect model. Many strong candidates stumble because they reach for a method before defining the problem. The fix is a repeatable approach for when the prompt is vague.

It helps to know what the interviewer is scoring. Most rubrics weigh four things, in roughly this order.

What they score	What it looks like in the room
Problem framing	You restate the goal as a decision before discussing data or models
Statistical judgement	You spot bias and leakage, and you pick metrics that match the cost of being wrong
Method fit	You choose the simplest tool that answers the question, and justify it
Communication	You finish with a caveated recommendation a product manager could act on

Modelling skill is folded into "method fit" and is rarely the deciding factor. A candidate who frames a clean question and lands a clear recommendation usually beats one with a better model who never connects it to a decision.

Frame the problem before touching methods

Spend the first few minutes converting the prompt into a sharp question. For "churn is rising," that means asking what churn means here, over what window, for which users, and what decision the analysis supports. An analysis that informs a retention campaign looks different from one that forecasts revenue.

Restate the goal out loud and check it: "so we want to understand which users are likely to churn in the next thirty days and what is driving it, so the product team can intervene." That single sentence shows you start from the decision, not the dataset. Then state your assumptions explicitly, because real problems have gaps and naming them is a strength.

A short clarifying volley early on buys the rest of the case. The questions that pay off:

What decision will this analysis support, and who owns it?
How is the metric defined today, and is there a definition the business trusts?
What is the time horizon: explaining the past, predicting next month, or forecasting a quarter?
What is the cost of a wrong call in each direction?
What data realistically exists, and how clean is it?

Strong candidates treat the first five minutes as scoping, not stalling. Weak candidates either dive into modelling immediately or ask clarifying questions with no purpose behind them. The test is whether each question changes what you would do next. If the answer would not move your plan, do not ask it.

What good versus weak framing looks like

A weak opening skips the decision: "I would build a churn model and look at the important features." A strong opening pins down the question first: "Are we counting voluntary cancellations, failed payments, or both? Is the goal to forecast revenue, or to hand the product team a list of at-risk users this week?" The second answer has not touched a technique yet, and it is already ahead.

Reason about the data you would need

Before any modelling, talk through the data: what you would pull, the grain of each table, and what could be missing or biased. For churn you would want usage events, subscription status over time, support contacts, and billing history. Watch for the traps that quietly break analyses:

Survivorship bias, where you only look at users who stayed and miss the signal in those who left.
Leakage, where a feature secretly encodes the outcome, such as "account closed" predicting churn.
Selection effects, where your sample is not representative of the population you care about.
Look-ahead bias, where a feature uses information not available at prediction time.
Definition drift, where the way churn was logged changed mid-window, so a trend is an artefact of instrumentation.

Saying you would run a small data audit first lands well: check for duplicate events, plot the metric over time to catch logging changes, and confirm join keys are unique before trusting any aggregate. That habit separates real analysis from tutorials.

Leakage deserves a worked moment, because it is the most common reason a model looks brilliant in validation and useless in production. If the churn label is "subscription cancelled" and a feature is "support ticket category equals account closure request," that feature has leaked the outcome into the input. The tell is a single feature with suspiciously high importance and near-perfect separation. "I would be suspicious of any feature that alone predicts the outcome almost perfectly, and check it is available before the event we are predicting" is the judgement these rounds reward. The scikit-learn guide to common pitfalls frames leakage as using information at training time that would not be available at prediction time, and shows how even fitting a feature selector on the full dataset before splitting inflates a score on random data from chance to well above it.

Choose a method that fits the question

Only after framing the problem and understanding the data should you pick an approach, and justify it. To understand drivers of churn, an interpretable model like logistic regression or a decision tree may beat a black box that scores slightly better, because the team needs to know which levers to pull. For pure prediction to prioritise outreach, a gradient boosted model is reasonable, but say how you would still surface feature importance so the result is actionable. The judgement signal is matching the method to the decision, and choosing the simpler tool when it serves.

A compact way to reason about method choice out loud:

Decision the business needs	Sensible first method	Why
Understand drivers, explain to a team	Logistic regression, single decision tree	Coefficients and splits map to levers
Rank users for outreach	Gradient boosted trees	Strong ranking, importances still recoverable
Estimate causal effect of a change	Experiment, or difference in differences	Correlation alone will not answer "did it cause it"
Forecast a metric forward in time	Time series model, or a baseline plus seasonality	Respects temporal structure and trend

Be clear about how you would evaluate. For an imbalanced outcome like churn, accuracy is misleading: a model that predicts "nobody churns" can be ninety five percent accurate and worthless. Google's machine learning crash course makes the same point, noting a classifier that always predicts the majority class scores ninety nine percent accuracy on a one percent positive rate while being useless, and that precision and recall usually move in opposite directions as you shift the threshold. Talk about precision, recall, and the operating point that matches the cost of contacting a user who would have stayed versus missing one who leaves. If outreach is cheap and losing a customer is expensive, tune for recall; if outreach is a personal call, precision matters more. Tying the threshold to the cost is worth more than quoting an AUC.

Always anchor against a baseline. Before any model, state the naive answer: the base rate of churn, last month carried forward, or a rule like "users with no logins in fourteen days." A method only earns its place if it beats that baseline by enough to justify the complexity. Interviewers notice when a candidate reaches for XGBoost without asking what a simple heuristic would achieve.

Experiment design questions

Many data scientist cases are really about experimentation. For "design a test for the new onboarding flow," structure your answer around the parts of a sound experiment. Start with the hypothesis and primary metric, stated precisely: "the new flow increases seven day activation by two percentage points." Then name the unit of randomisation, usually the user, and why randomising at the wrong level, such as by session, leaks the treatment.

SQL

-- sanity check that assignment is balanced before trusting results
select variant, count(distinct user_id) as users
from experiment_assignment
where experiment = 'onboarding_v2'
group by variant;

Cover sample size and run length, the guardrail metrics that catch harm, and the risks: novelty effects where users react to anything new, the multiple comparisons problem if you check many metrics, and peeking at results early and stopping when they look good. Evan Miller's write-up on how not to run an A/B test shows why: repeatedly checking and stopping the moment a result crosses significance can turn a nominal five percent false positive rate into something closer to twenty six percent. Naming the peeking problem, and fixing the duration in advance, is a strong signal.

Power and sample size are where many candidates wave their hands. The three levers are the baseline rate, the minimum effect you care about, and the variance: a smaller effect or a noisier metric needs a larger sample. You need not recite the formula, but translating it into a calendar, "tens of thousands per arm is roughly a two week run at our traffic," signals real experience.

Know when a clean A/B test is not available, because interviewers push on the edge cases:

Network effects, where treating one user changes another's experience, so randomising by user breaks independence. Cluster randomisation by region or team is the usual answer.
Low traffic, where you cannot reach power, so you lean on a longer run, a larger effect threshold, or a quasi experimental design.
Things you cannot randomise, like a pricing change rolled out to everyone, where difference in differences or a synthetic control becomes the honest tool. Difference in differences leans on a parallel trends assumption: it treats the untreated group's change over the same window as the counterfactual for the treated group, so you should be ready to defend that the two groups moved together before the change.

A short worked experiment answer

A strong answer, compressed:

"Seven day activation sits near forty percent today, so my hypothesis is that the new flow raises it, with a minimum detectable effect of two percentage points, since anything smaller will not justify the engineering cost. I would randomise at the user level on first session, because session level assignment would let a returning user see both flows. Primary metric is seven day activation; guardrails are day one crash rate and support ticket volume, so we catch a flow that activates people but frustrates them. Detecting a two point lift off a forty percent base needs on the order of ten thousand users per arm, which is about two weeks at our traffic, and I would fix the duration in advance to avoid peeking. Before reading results I would check assignment is balanced and the arms looked similar in the pre period. If activation is up, guardrails are flat, and the effect holds into the second week rather than spiking on day one, I would recommend rollout."

That answer hits hypothesis, effect size, randomisation unit, primary and guardrail metrics, duration, a balance check, a novelty check, and a recommendation. The structure, not the cleverness, scores.

Communicate like the analysis has a reader

A data scientist who cannot explain results to a product manager is only half doing the job, and interviewers test this directly. As you work, narrate your reasoning in plain language, and state conclusions as a recommendation with a confidence level and a caveat. "The new flow lifted activation by about two points, and we are fairly confident it is real, though we should watch whether the effect holds past the first week" communicates more than a recitation of p-values.

A reliable closing structure is recommendation, then evidence, then caveat, then next step. Leading with the recommendation respects a busy reader, and translating analysis into a decision is often what tips a case round into an offer.

Reading which case type you drew

Case prompts fall into a few recognisable shapes, and each shape is graded on a different move. Naming the shape in the first minute tells you where your time should go, and stops you from running an experiment-design playbook on a diagnosis prompt.

Case type	The prompt sounds like	The move that scores	Where candidates over invest
Diagnosis	"Churn is rising, what would you do"	Segment the metric by cohort before theorising, then rank hypotheses by the evidence you would gather	Reaching for a model before splitting the number apart
Metric investigation	"Signups fell last week, why"	Separate a real change from an instrumentation or seasonality artefact before naming a cause	Guessing causes before confirming the drop is even real
Experiment design	"Design a test for the new flow"	A precise hypothesis, the right randomisation unit, and a stopping rule fixed in advance	Listing metrics with no unit of randomisation and no duration
Measurement	"How would you measure success"	Turn a fuzzy goal into one primary metric plus guardrails that catch harm	Proposing twenty metrics with nothing named as primary

The interviewer's title shifts the emphasis too. An analytics engineer case leans on data modelling, grain, and metric definitions the business can trust more than on prediction. An ML engineer case pushes on how a model reaches production: monitoring, retraining, and latency. A core data scientist case sits between them, weighting experimentation and inference most heavily. Read the title before deciding where your minutes go.

Where candidates lose a case they could have won

These are the habits that quietly cost the offer even when the statistics are sound.

Reaching for a model before defining the question and the decision it supports.
Ignoring data quality, leakage, and bias, then trusting a result that is built on sand.
Optimising a metric like accuracy that does not fit an imbalanced problem.
Skipping the baseline, so you cannot say whether the method actually helped.
Designing an experiment without stating the unit of randomisation or the stopping rule.
Quoting a sample size with no link to the effect you want to detect.
Presenting findings as raw numbers rather than a clear, caveated recommendation.
Going silent while you think, so the interviewer cannot follow or help you.

Frequently asked questions

Do I need to write working code, or is talking through it enough? It depends on the format. A live case is usually about reasoning out loud, so clean pseudocode or a SQL sketch is plenty. A take home expects runnable code and a short written conclusion. Either way, never let the code crowd out the framing and the recommendation.

What if I do not know the exact statistical formula? Say what it depends on and reason qualitatively. "I would need a power calculation, which depends on the baseline rate, the effect I want to detect, and the variance" is confident even without the closed form. Interviewers care that you know the levers, not that you memorised it.

The prompt is too vague and I am stuck. That is the test, not a flaw in the question. Pick the most reasonable interpretation, state it as an assumption, and proceed: "I will assume voluntary churn over thirty days, tell me if that is off."

How long should I spend framing before solving? Roughly the first fifth of the time, the opening eight to ten minutes of a forty five minute case. Rushing it is the more common failure by far.

How to practise

Run several full cases out loud on a timer, mixing the types: a churn diagnosis, a metric drop investigation, an experiment design, and a "how would you measure success" prompt. After each, check that you framed the question, reasoned about the data and its biases, justified your method against a baseline, and ended with a recommendation a non-technical reader could act on. Record yourself once and listen back; the gap between what you think you said and what a listener heard is usually larger than you expect. Structure, statistical judgement, and clear communication are what these rounds find.

Sources

Evan Miller, How Not to Run an A/B Test, on how peeking and early stopping inflate the false positive rate.
scikit-learn, Common pitfalls and recommended practices, on data leakage and why you split before you fit.
Google, Accuracy, precision, and recall, on why accuracy misleads on imbalanced data.
Matheus Facure, Difference in Differences, on estimating a causal effect when you cannot randomise.

Where to take this next

What the case study round is really for

It helps to know what the interviewer is scoring. Most rubrics weigh four things, in roughly this order.

What they score	What it looks like in the room
Problem framing	You restate the goal as a decision before discussing data or models
Statistical judgement	You spot bias and leakage, and you pick metrics that match the cost of being wrong
Method fit	You choose the simplest tool that answers the question, and justify it
Communication	You finish with a caveated recommendation a product manager could act on

Frame the problem before touching methods

A short clarifying volley early on buys the rest of the case. The questions that pay off:

What decision will this analysis support, and who owns it?
How is the metric defined today, and is there a definition the business trusts?
What is the time horizon: explaining the past, predicting next month, or forecasting a quarter?
What is the cost of a wrong call in each direction?
What data realistically exists, and how clean is it?

Strong candidates treat the first five minutes as scoping, not stalling. Weak candidates either dive into modelling immediately or ask clarifying questions with no purpose behind them. The test is whether each question changes what you would do next. If the answer would not move your plan, do not ask it.

What good versus weak framing looks like

Reason about the data you would need

Survivorship bias, where you only look at users who stayed and miss the signal in those who left.
Leakage, where a feature secretly encodes the outcome, such as "account closed" predicting churn.
Selection effects, where your sample is not representative of the population you care about.
Look-ahead bias, where a feature uses information not available at prediction time.
Definition drift, where the way churn was logged changed mid-window, so a trend is an artefact of instrumentation.

Choose a method that fits the question

A compact way to reason about method choice out loud:

Decision the business needs	Sensible first method	Why
Understand drivers, explain to a team	Logistic regression, single decision tree	Coefficients and splits map to levers
Rank users for outreach	Gradient boosted trees	Strong ranking, importances still recoverable
Estimate causal effect of a change	Experiment, or difference in differences	Correlation alone will not answer "did it cause it"
Forecast a metric forward in time	Time series model, or a baseline plus seasonality	Respects temporal structure and trend

Experiment design questions

SQL

-- sanity check that assignment is balanced before trusting results
select variant, count(distinct user_id) as users
from experiment_assignment
where experiment = 'onboarding_v2'
group by variant;

Know when a clean A/B test is not available, because interviewers push on the edge cases:

Network effects, where treating one user changes another's experience, so randomising by user breaks independence. Cluster randomisation by region or team is the usual answer.
Low traffic, where you cannot reach power, so you lean on a longer run, a larger effect threshold, or a quasi experimental design.
Things you cannot randomise, like a pricing change rolled out to everyone, where difference in differences or a synthetic control becomes the honest tool. Difference in differences leans on a parallel trends assumption: it treats the untreated group's change over the same window as the counterfactual for the treated group, so you should be ready to defend that the two groups moved together before the change.

A short worked experiment answer

A strong answer, compressed:

"Seven day activation sits near forty percent today, so my hypothesis is that the new flow raises it, with a minimum detectable effect of two percentage points, since anything smaller will not justify the engineering cost. I would randomise at the user level on first session, because session level assignment would let a returning user see both flows. Primary metric is seven day activation; guardrails are day one crash rate and support ticket volume, so we catch a flow that activates people but frustrates them. Detecting a two point lift off a forty percent base needs on the order of ten thousand users per arm, which is about two weeks at our traffic, and I would fix the duration in advance to avoid peeking. Before reading results I would check assignment is balanced and the arms looked similar in the pre period. If activation is up, guardrails are flat, and the effect holds into the second week rather than spiking on day one, I would recommend rollout."

Communicate like the analysis has a reader

Reading which case type you drew

Case type	The prompt sounds like	The move that scores	Where candidates over invest
Diagnosis	"Churn is rising, what would you do"	Segment the metric by cohort before theorising, then rank hypotheses by the evidence you would gather	Reaching for a model before splitting the number apart
Metric investigation	"Signups fell last week, why"	Separate a real change from an instrumentation or seasonality artefact before naming a cause	Guessing causes before confirming the drop is even real
Experiment design	"Design a test for the new flow"	A precise hypothesis, the right randomisation unit, and a stopping rule fixed in advance	Listing metrics with no unit of randomisation and no duration
Measurement	"How would you measure success"	Turn a fuzzy goal into one primary metric plus guardrails that catch harm	Proposing twenty metrics with nothing named as primary

Where candidates lose a case they could have won

These are the habits that quietly cost the offer even when the statistics are sound.

Reaching for a model before defining the question and the decision it supports.
Ignoring data quality, leakage, and bias, then trusting a result that is built on sand.
Optimising a metric like accuracy that does not fit an imbalanced problem.
Skipping the baseline, so you cannot say whether the method actually helped.
Designing an experiment without stating the unit of randomisation or the stopping rule.
Quoting a sample size with no link to the effect you want to detect.
Presenting findings as raw numbers rather than a clear, caveated recommendation.
Going silent while you think, so the interviewer cannot follow or help you.

Frequently asked questions

How long should I spend framing before solving? Roughly the first fifth of the time, the opening eight to ten minutes of a forty five minute case. Rushing it is the more common failure by far.

How to practise

Sources

Evan Miller, How Not to Run an A/B Test, on how peeking and early stopping inflate the false positive rate.
scikit-learn, Common pitfalls and recommended practices, on data leakage and why you split before you fit.
Google, Accuracy, precision, and recall, on why accuracy misleads on imbalanced data.
Matheus Facure, Difference in Differences, on estimating a causal effect when you cannot randomise.

Data Scientist Case Study Interviews

What the case study round is really for

Frame the problem before touching methods

What good versus weak framing looks like

Reason about the data you would need

Choose a method that fits the question

Experiment design questions

A short worked experiment answer

Communicate like the analysis has a reader

Reading which case type you drew

Where candidates lose a case they could have won

Frequently asked questions

How to practise

Sources

Where to take this next

Continue your prep

Data scientist interview questions

Analytics engineer interview questions

ML engineer interview questions

Data Scientist Case Study Interviews

What the case study round is really for

Frame the problem before touching methods

What good versus weak framing looks like

Reason about the data you would need

Choose a method that fits the question

Experiment design questions

A short worked experiment answer

Communicate like the analysis has a reader

Reading which case type you drew

Where candidates lose a case they could have won

Frequently asked questions

How to practise

Sources

Where to take this next

Continue your prep

Data scientist interview questions

Analytics engineer interview questions

ML engineer interview questions