Data scientist - Coding

Data scientist coding interview questions

22 questions on coding for data scientist candidates. Each entry has the question as asked, a sample answer outline, common follow-ups, and a reference implementation where applicable.

As asked

Given a DataFrame with columns 'user_id', 'product_category', and 'spend', write Python code to add a column 'spend_percentile' that is each user's percentile rank of spend within their product_category group. Do not use a loop.

Sample answer outline

The candidate should use groupby with transform and pandas rank(pct=True), or scipy.stats.percentileofscore. The key is using transform so the result has one value per row and aligns back with the original DataFrame index. They should set tie handling explicitly through rank's method argument (for example 'average' or 'min') and explain why apply versus transform matters: an aggregation returns one scalar per group, whereas transform returns a Series the same length as the input.

Reference implementation (python)

Python

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'product_category': ['A', 'A', 'B', 'B', 'A'],
    'spend': [100, 200, 150, 300, 50]
})

# Fill in: add 'spend_percentile' column
df['spend_percentile'] = df.groupby('product_category')['spend'].transform(
    lambda x: x.rank(pct=True)
)
print(df)
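
As a possible alternative to the lambda, the grouped column can be ranked directly. The sketch below reuses the same toy data; method="average" is only an illustrative tie rule, not something the question prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "product_category": ["A", "A", "B", "B", "A"],
    "spend": [100, 200, 150, 300, 50],
})

# GroupBy.rank returns a Series aligned to the original index, so no lambda is needed.
# method="average" gives tied spends the mean of their ranks (illustrative choice).
df["spend_percentile"] = (
    df.groupby("product_category")["spend"].rank(pct=True, method="average")
)
print(df)
```

Skipping the Python-level lambda generally helps on large frames, though whether it meets a given time budget depends on the data and hardware.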

Expect these follow-ups

How would you do this if there were 10 million rows and you needed it to run in under 5 seconds?

pandas, python, groupby, transform, coding

As asked

You have a table 'orders' with columns order_id, product_category, and revenue. Write a SQL query to compute the median revenue for each product_category. Assume you are using PostgreSQL.

Sample answer outline

The candidate should use PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue), the standard SQL ordered-set aggregate available in PostgreSQL. They should know that this differs from AVG, that a median requires sorting, and that PERCENTILE_CONT interpolates between the two middle values when a group has an even number of rows (PERCENTILE_DISC returns an actual row value instead). An alternative using ROW_NUMBER and COUNT for environments without ordered-set aggregates demonstrates deeper SQL knowledge.

Reference implementation (sql)

SQL

SELECT
  product_category,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS median_revenue
FROM orders
GROUP BY product_category
ORDER BY product_category;
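
The ROW_NUMBER and COUNT alternative mentioned in the outline can be sketched as follows. This is a minimal illustration that runs against an in-memory SQLite database from Python rather than PostgreSQL. It assumes SQLite 3.25 or newer for window functions, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, product_category TEXT, revenue REAL)"
)
# Hypothetical sample data: category A has 3 rows, category B has 4 rows.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "A", 10), (2, "A", 30), (3, "A", 20),
     (4, "B", 10), (5, "B", 20), (6, "B", 30), (7, "B", 40)],
)

query = """
WITH ranked AS (
  SELECT
    product_category,
    revenue,
    ROW_NUMBER() OVER (PARTITION BY product_category ORDER BY revenue) AS rn,
    COUNT(*) OVER (PARTITION BY product_category) AS cnt
  FROM orders
)
SELECT product_category, AVG(revenue) AS median_revenue
FROM ranked
-- integer division selects the middle row (odd cnt) or the two middle rows (even cnt)
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
GROUP BY product_category
ORDER BY product_category;
"""
medians = conn.execute(query).fetchall()
print(medians)
```

Averaging the one or two middle rows matches the interpolated median that PERCENTILE_CONT(0.5) returns.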

Expect these follow-ups

How would you compute the median without a built-in percentile function, using only standard window functions?

sql, median, window-functions, postgresql, analytics

As asked

Write a Python function that takes a 1D NumPy array of observations and returns a 95% bootstrap confidence interval for the mean. Use 10,000 resamples and no external bootstrap libraries.

Sample answer outline

The candidate should draw n samples with replacement 10,000 times using np.random.choice or numpy indexing with np.random.randint, compute the mean of each resample, and take the 2.5th and 97.5th percentiles of the resulting distribution. They should vectorize the operation rather than looping over resamples one by one, and note that this is the percentile method (not the BCa method).

Reference implementation (python)

Python

import numpy as np

def bootstrap_ci(data: np.ndarray, n_resamples: int = 10_000, ci: float = 0.95) -> tuple:
    rng = np.random.default_rng(seed=42)
    n = len(data)
    # draw all resamples at once for speed
    indices = rng.integers(0, n, size=(n_resamples, n))
    resample_means = data[indices].mean(axis=1)
    alpha = (1 - ci) / 2
    lower = np.quantile(resample_means, alpha)
    upper = np.quantile(resample_means, 1 - alpha)
    return lower, upper

Expect these follow-ups

What is the difference between the percentile method and the BCa bootstrap, and when does it matter?

bootstrap, statistics, python, numpy, confidence-intervals

As asked

Implement a logistic regression classifier in Python using only NumPy. It should have a fit method that runs gradient descent and a predict_proba method. Show the gradient derivation in a comment.

Sample answer outline

The candidate should derive the binary cross-entropy gradient: dL/dw = X.T @ (sigmoid(X @ w) - y) / n. The sigmoid is 1/(1+exp(-z)) and should be numerically stabilized, for example by clipping z before exponentiating. The fit loop should update weights with w -= lr * gradient, optionally with L2 regularization, which adds lam * w to the gradient. The candidate should discuss convergence criteria and learning rate selection.

Reference implementation (python)

Python

import numpy as np

class LogisticRegression:
    def __init__(self, lr=0.01, n_iters=1000, lam=0.0):
        self.lr = lr
        self.n_iters = n_iters
        self.lam = lam
        self.w = None

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        n, p = X.shape
        self.w = np.zeros(p)
        for _ in range(self.n_iters):
            preds = self._sigmoid(X @ self.w)
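            # Derivation: L(w) = -(1/n) * sum(y*log(p) + (1-y)*log(1-p)), with p = sigmoid(X @ w).
            # Because dp/dz = p*(1-p), dL/dz = (p - y)/n, so dL/dw = X.T @ (p - y) / n.
            # An L2 penalty of (lam/2)*||w||^2 adds lam*w to the gradient.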
            grad = X.T @ (preds - y) / n + self.lam * self.w
            self.w -= self.lr * grad

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w)

Expect these follow-ups

How would you add L1 regularization and why does the gradient become non-smooth?

logistic-regression, numpy, gradient-descent, machine-learning, coding

As asked

You have a DataFrame indexed by date with a column 'sales'. Dates are not contiguous (weekends are missing). Write code to compute a 7-calendar-day rolling average that counts only trading days in the window, then compare it to a 7-row rolling average.

Sample answer outline

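A 7-row rolling(7).mean() averages the last 7 rows regardless of time gaps, so with weekends missing it spans more than 7 calendar days. To get a 7-calendar-day window, use rolling('7D') after ensuring the index is a sorted DatetimeIndex. The candidate should explain that rolling('7D') is an offset-based window that respects the actual dates, not the row count, so on business-day data it averages at most 5 observations. They should also note that offset windows default to min_periods=1, so the first rows are not NaN as they are with rolling(7), and demonstrate both in code.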

Reference implementation (python)

Python

import pandas as pd
import numpy as np

dates = pd.bdate_range('2024-01-01', '2024-03-31')  # business days only
df = pd.DataFrame({'sales': np.random.rand(len(dates)) * 100}, index=dates)

# Row-based (ignores date gaps)
df['rolling_7row'] = df['sales'].rolling(7).mean()

# Calendar-based (7-day window by actual date)
df['rolling_7day'] = df['sales'].rolling('7D').mean()
print(df.head(10))
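
To make the difference concrete, here is a small deterministic sketch. The two-week date range and the values 1 to 10 are illustrative assumptions chosen so the results can be checked by hand. It also shows one way to treat missing dates as zero, by reindexing to daily frequency before a row-based window.

```python
import numpy as np
import pandas as pd

# Two business weeks (Mon 2024-01-01 to Fri 2024-01-12), sales = 1..10.
dates = pd.bdate_range("2024-01-01", "2024-01-12")
sales = pd.Series(np.arange(1.0, 11.0), index=dates, name="sales")

row7 = sales.rolling(7).mean()        # last 7 rows; NaN until 7 rows exist
cal7 = sales.rolling("7D").mean()     # rows whose date falls in (t - 7 days, t]
cal7_n = sales.rolling("7D").count()  # how many trading days each window holds

# Missing dates as zero: insert the weekend rows, then use a 7-row window.
daily = sales.asfreq("D", fill_value=0.0)
zero_filled_sum = daily.rolling(7).sum()

print(pd.DataFrame({"row7": row7, "cal7": cal7, "n_in_window": cal7_n}))
print(zero_filled_sum.tail(3))
```

On the last date the calendar window holds 5 trading days, while the row-based window always averages exactly 7 rows. Zero-filling leaves the 7-day sum unchanged but would lower a 7-day mean, because the divisor becomes 7 instead of 5.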

Expect these follow-ups

How would you compute a rolling 7-day sum that treats missing dates as zero rather than skipping them?

pandas, rolling-window, time-series, python, coding

As asked

Given a DataFrame with columns 'user_id', 'signup_date', and 'activity_date', write Python code to produce a cohort retention matrix where rows are signup month cohorts and columns are months since signup (0, 1, 2, ...).

Sample answer outline

The candidate should compute the cohort month from signup_date, compute months since signup as the period difference between activity_date and cohort month, groupby cohort month and months-since-signup to count distinct active users, and divide by the cohort size (month 0 count). The final pivot table should handle missing values gracefully with fillna.

Reference implementation (python)

Python

import pandas as pd

def cohort_retention(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['cohort'] = df['signup_date'].dt.to_period('M')
    df['activity_period'] = df['activity_date'].dt.to_period('M')
    df['months_since_signup'] = (
        df['activity_period'] - df['cohort']
    ).apply(lambda x: x.n)
    cohort_size = df.groupby('cohort')['user_id'].nunique().rename('cohort_size')
    retention = df.groupby(['cohort', 'months_since_signup'])['user_id'].nunique().reset_index()
    retention = retention.pivot(index='cohort', columns='months_since_signup', values='user_id')
    # fill cohort/month cells with no recorded activity, as the outline suggests
    return retention.divide(cohort_size, axis=0).fillna(0)
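
A small worked example may help when checking the output shape. The sketch below is a compact variant rather than the reference function: it computes the month offset with vectorized year and month arithmetic instead of apply, and divides by the month-0 column. The five rows of data are made up for illustration.

```python
import pandas as pd

# Hypothetical activity log: users 1 and 2 signed up in January, user 3 in February.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "signup_date": pd.to_datetime(
        ["2024-01-05", "2024-01-05", "2024-01-20", "2024-01-20", "2024-02-03"]
    ),
    "activity_date": pd.to_datetime(
        ["2024-01-06", "2024-02-10", "2024-01-21", "2024-01-25", "2024-02-04"]
    ),
})

cohort = df["signup_date"].dt.to_period("M")
activity = df["activity_date"].dt.to_period("M")
# Whole months between the activity month and the signup month.
months = (activity.dt.year - cohort.dt.year) * 12 + (activity.dt.month - cohort.dt.month)

counts = (
    df.assign(cohort=cohort, months_since_signup=months)
      .groupby(["cohort", "months_since_signup"])["user_id"]
      .nunique()
      .unstack(fill_value=0)  # rows: cohort, columns: months since signup
)
retention = counts.div(counts[0], axis=0)  # divide by the month-0 cohort size
print(retention)
```

Here the January cohort retains 1 of its 2 users in month 1, so that cell is 0.5.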

Expect these follow-ups

How would you visualize this matrix as a heatmap and what color scale is most interpretable?

pandas, cohort-analysis, retention, python, coding

As asked

Write Python code to test whether 'device_type' (mobile vs desktop) and 'converted' (0 or 1) are independent in a dataset. Use scipy and interpret the output.

Sample answer outline

The candidate should use pd.crosstab to build the contingency table, then pass it to scipy.stats.chi2_contingency. They should retrieve chi2, p-value, degrees of freedom, and expected frequencies. Interpretation: a p-value below alpha rejects independence. They should also check that expected cell counts are at least 5, a common rule of thumb for the chi-square approximation to be valid, and know that chi2_contingency applies Yates' continuity correction by default for 2x2 tables.

Reference implementation (python)

Python

import pandas as pd
from scipy.stats import chi2_contingency

# Assume df has 'device_type' and 'converted' columns
contingency = pd.crosstab(df['device_type'], df['converted'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f'chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}')
print('Min expected frequency:', expected.min())
if p_value < 0.05:
    print('Reject independence at 5% level')

Expect these follow-ups

What test would you use instead if any expected cell count is below 5?

statistics, chi-square, scipy, python, hypothesis-testing

As asked

Without using pandas, write a pure NumPy function that computes a centered moving average of window size k on a 1D array. Handle edges by returning NaN where a full window is not available.

Sample answer outline

The candidate should use np.convolve with np.ones(k)/k and mode='full', then trim and pad with NaN, or use np.lib.stride_tricks.sliding_window_view, which is available from NumPy 1.20. The stride_tricks approach creates the windows as a view without copying, although taking the mean of each window still costs O(n*k); a cumulative-sum approach brings this down to O(n). Edge handling: floor(k/2) NaNs on each side for a centered window.

Reference implementation (python)

Python

import numpy as np

def centered_moving_avg(arr: np.ndarray, k: int) -> np.ndarray:
    if k % 2 == 0:
        raise ValueError('Window size must be odd for a centered average')
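    half = k // 2
    result = np.full(len(arr), np.nan)
    if len(arr) < k:
        return result  # no position has a full window
    # windows is a read-only view of shape (len(arr) - k + 1, k)
    windows = np.lib.stride_tricks.sliding_window_view(arr, k)
    result[half: len(arr) - half] = windows.mean(axis=1)
    return result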
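
For large arrays, one option is a cumulative-sum formulation that does O(n) work regardless of k. The function name below is hypothetical, and the sketch assumes finite floating-point input; very long arrays can accumulate rounding error in the running sum.

```python
import numpy as np

def centered_moving_avg_cumsum(arr: np.ndarray, k: int) -> np.ndarray:
    """Centered moving average in O(n) using differences of a running sum."""
    if k % 2 == 0:
        raise ValueError("Window size must be odd for a centered average")
    arr = np.asarray(arr, dtype=float)
    n = len(arr)
    result = np.full(n, np.nan)
    if n < k:
        return result  # no position has a full window
    half = k // 2
    # csum[i] holds the sum of arr[:i], so csum[i + k] - csum[i] is one window sum.
    csum = np.concatenate(([0.0], np.cumsum(arr)))
    result[half: n - half] = (csum[k:] - csum[:-k]) / k
    return result

smoothed = centered_moving_avg_cumsum(np.array([1, 2, 3, 4, 5]), 3)
print(smoothed)
```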

Expect these follow-ups

What is the time complexity of your implementation and how would you handle a very large array?

numpy, python, signal-processing, time-series, coding

As asked

You have a table 'events' with user_id and event_time. Define a new session as starting when a user has a gap of more than 30 minutes since their previous event. Write SQL to assign a session_id to each event.

Sample answer outline

The candidate should use a LAG window function to get the previous event time per user, compute the time difference, flag rows where the gap exceeds 30 minutes as session starts, then use a cumulative sum of the session-start flag to create an incrementing session counter per user. The final session_id can be user_id concatenated with the session number.

Reference implementation (sql)

SQL

WITH lagged AS (
  SELECT
    user_id,
    event_time,
    LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
  FROM events
),
session_starts AS (
  SELECT
    user_id,
    event_time,
    CASE
      WHEN prev_time IS NULL
        OR event_time - prev_time > INTERVAL '30 minutes'
      THEN 1 ELSE 0
    END AS is_new_session
  FROM lagged
)
SELECT
  user_id,
  event_time,
  SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM session_starts;
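
The same lag, flag, and cumulative-sum logic can be sketched in pandas, which is also a convenient way to explore session length. The five events below are made-up sample data, and session length is taken here as last event minus first event, which is one possible definition.

```python
import pandas as pd

# Hypothetical events for two users.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:10", "2024-01-01 10:00",
        "2024-01-01 09:00", "2024-01-01 09:31",
    ]),
})
events = events.sort_values(["user_id", "event_time"]).reset_index(drop=True)

# Equivalent of LAG: time since the same user's previous event (NaT for the first).
gap = events.groupby("user_id")["event_time"].diff()
is_new_session = gap.isna() | (gap > pd.Timedelta(minutes=30))

# Running count of session starts per user gives the session number.
events["session_id"] = is_new_session.astype(int).groupby(events["user_id"]).cumsum()

# Session length = last event minus first event within each (user, session).
grouped = events.groupby(["user_id", "session_id"])["event_time"]
session_length = grouped.max() - grouped.min()
avg_minutes = session_length.dt.total_seconds().mean() / 60
print(events)
print(avg_minutes)
```

Single-event sessions have length zero under this definition, which pulls the average down. Whether that is desirable is a modelling choice worth raising with the interviewer.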

Expect these follow-ups

How would you compute the average session length in minutes from this result?

sql, window-functions, session-analysis, analytics, coding

As asked

Implement a Welch's two-sample t-test in Python using only NumPy. The function should return the t-statistic and two-tailed p-value given two 1D arrays.

Sample answer outline

Welch's t-stat = (mean1 - mean2) / sqrt(var1/n1 + var2/n2), with sample variances computed using ddof=1. Degrees of freedom use the Welch-Satterthwaite equation. NumPy has no t-distribution, so the two-tailed p-value is computed as 2 * scipy.stats.t.sf(|t|, df) (or the candidate can describe using the CDF). The candidate should avoid assuming equal variances (which is the Student's t-test assumption) and explain the difference.

Reference implementation (python)

Python

import numpy as np
from scipy.stats import t as t_dist

def welch_t_test(a: np.ndarray, b: np.ndarray) -> tuple:
    n1, n2 = len(a), len(b)
    v1, v2 = a.var(ddof=1), b.var(ddof=1)
    t_stat = (a.mean() - b.mean()) / np.sqrt(v1/n1 + v2/n2)
    # Welch-Satterthwaite degrees of freedom
    df = (v1/n1 + v2/n2)**2 / ((v1/n1)**2/(n1-1) + (v2/n2)**2/(n2-1))
    p_value = 2 * t_dist.sf(abs(t_stat), df)
    return t_stat, p_value

Expect these follow-ups

When would you use a permutation test instead of Welch's t-test?

statistics, t-test, numpy, python, coding

As asked

Write a Python implementation of k-means clustering using only NumPy. It should include the assignment step, the update step, and a stopping criterion. No sklearn allowed.

Sample answer outline

Assignment step: for each point compute Euclidean distance to all k centroids and assign to the nearest. Update step: compute the mean of all points assigned to each centroid. Stopping: iterate until centroid positions change by less than a tolerance or max iterations reached. The candidate should handle empty clusters (reassign randomly) and mention k-means++ initialization to avoid bad random starts.

Reference implementation (python)

Python

import numpy as np

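def kmeans(X: np.ndarray, k: int, max_iter: int = 100, tol: float = 1e-4) -> tuple:
    rng = np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)  # (n, k)
        labels = dists.argmin(axis=1)
        # update step: cluster means; an empty cluster is reseeded at a random point
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # stopping criterion: total centroid movement below tol
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids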

Expect these follow-ups

How do you choose k in practice, and what is the elbow method's limitation?

clustering, kmeans, numpy, machine-learning, coding

Data scientist - Coding

Data scientist coding interview questions

22 questions on coding for data scientist candidates. Each entry has the question as asked, a sample answer outline, common follow-ups, and a reference implementation where applicable.

As asked

Sample answer outline

Reference implementation (python)

Python

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'product_category': ['A', 'A', 'B', 'B', 'A'],
    'spend': [100, 200, 150, 300, 50]
})

# Fill in: add 'spend_percentile' column
df['spend_percentile'] = df.groupby('product_category')['spend'].transform(
    lambda x: x.rank(pct=True)
)
print(df)

Expect these follow-ups

How would you do this if there were 10 million rows and you needed it to run in under 5 seconds?

pandaspythongroupbytransformcoding

As asked

You have a table 'orders' with columns order_id, product_category, and revenue. Write a SQL query to compute the median revenue for each product_category. Assume you are using PostgreSQL.

Sample answer outline

Reference implementation (sql)

SQL

SELECT
  product_category,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS median_revenue
FROM orders
GROUP BY product_category
ORDER BY product_category;

Expect these follow-ups

How would you compute the median without a built-in percentile function, using only standard window functions?

sqlmedianwindow-functionspostgresqlanalytics

As asked

Write a Python function that takes a 1D NumPy array of observations and returns a 95% bootstrap confidence interval for the mean. Use 10,000 resamples and no external bootstrap libraries.

Sample answer outline

Reference implementation (python)

Python

import numpy as np

def bootstrap_ci(data: np.ndarray, n_resamples: int = 10_000, ci: float = 0.95) -> tuple:
    rng = np.random.default_rng(seed=42)
    n = len(data)
    # draw all resamples at once for speed
    indices = rng.integers(0, n, size=(n_resamples, n))
    resample_means = data[indices].mean(axis=1)
    alpha = (1 - ci) / 2
    lower = np.quantile(resample_means, alpha)
    upper = np.quantile(resample_means, 1 - alpha)
    return lower, upper

Expect these follow-ups

What is the difference between the percentile method and the BCa bootstrap, and when does it matter?

bootstrapstatisticspythonnumpyconfidence-intervals

As asked

Implement a logistic regression classifier in Python using only NumPy. It should have a fit method that runs gradient descent and a predict_proba method. Show the gradient derivation in a comment.

Sample answer outline

Reference implementation (python)

Python

import numpy as np

class LogisticRegression:
    def __init__(self, lr=0.01, n_iters=1000, lam=0.0):
        self.lr = lr
        self.n_iters = n_iters
        self.lam = lam
        self.w = None

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        n, p = X.shape
        self.w = np.zeros(p)
        for _ in range(self.n_iters):
            preds = self._sigmoid(X @ self.w)
            # gradient of BCE + L2: X.T(preds - y)/n + lam*w
            grad = X.T @ (preds - y) / n + self.lam * self.w
            self.w -= self.lr * grad

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w)

Expect these follow-ups

How would you add L1 regularization and why does the gradient become non-smooth?

logistic-regressionnumpygradient-descentmachine-learningcoding

As asked

Sample answer outline

Reference implementation (python)

Python

import pandas as pd
import numpy as np

dates = pd.bdate_range('2024-01-01', '2024-03-31')  # business days only
df = pd.DataFrame({'sales': np.random.rand(len(dates)) * 100}, index=dates)

# Row-based (ignores date gaps)
df['rolling_7row'] = df['sales'].rolling(7).mean()

# Calendar-based (7-day window by actual date)
df['rolling_7day'] = df['sales'].rolling('7D').mean()
print(df.head(10))

Expect these follow-ups

How would you compute a rolling 7-day sum that treats missing dates as zero rather than skipping them?

pandasrolling-windowtime-seriespythoncoding

As asked

Sample answer outline

Reference implementation (python)

Python

import pandas as pd

def cohort_retention(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['cohort'] = df['signup_date'].dt.to_period('M')
    df['activity_period'] = df['activity_date'].dt.to_period('M')
    df['months_since_signup'] = (
        df['activity_period'] - df['cohort']
    ).apply(lambda x: x.n)
    cohort_size = df.groupby('cohort')['user_id'].nunique().rename('cohort_size')
    retention = df.groupby(['cohort', 'months_since_signup'])['user_id'].nunique().reset_index()
    retention = retention.pivot(index='cohort', columns='months_since_signup', values='user_id')
    return retention.divide(cohort_size, axis=0)

Expect these follow-ups

How would you visualize this matrix as a heatmap and what color scale is most interpretable?

pandascohort-analysisretentionpythoncoding

As asked

Write Python code to test whether 'device_type' (mobile vs desktop) and 'converted' (0 or 1) are independent in a dataset. Use scipy and interpret the output.

Sample answer outline

Reference implementation (python)

Python

import pandas as pd
from scipy.stats import chi2_contingency

# Assume df has 'device_type' and 'converted' columns
contingency = pd.crosstab(df['device_type'], df['converted'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f'chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}')
print('Min expected frequency:', expected.min())
if p_value < 0.05:
    print('Reject independence at 5% level')

Expect these follow-ups

What test would you use instead if any expected cell count is below 5?

statisticschi-squarescipypythonhypothesis-testing

As asked

Without using pandas, write a pure NumPy function that computes a centered moving average of window size k on a 1D array. Handle edges by returning NaN where a full window is not available.

Sample answer outline

Reference implementation (python)

Python

import numpy as np

def centered_moving_avg(arr: np.ndarray, k: int) -> np.ndarray:
    if k % 2 == 0:
        raise ValueError('Window size must be odd for a centered average')
    half = k // 2
    windows = np.lib.stride_tricks.sliding_window_view(arr, k)
    result = np.full(len(arr), np.nan)
    result[half: len(arr) - half] = windows.mean(axis=1)
    return result

Expect these follow-ups

What is the time complexity of your implementation and how would you handle a very large array?

numpypythonsignal-processingtime-seriescoding

As asked

Sample answer outline

Reference implementation (sql)

SQL

WITH lagged AS (
  SELECT
    user_id,
    event_time,
    LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
  FROM events
),
session_starts AS (
  SELECT
    user_id,
    event_time,
    CASE
      WHEN prev_time IS NULL
        OR event_time - prev_time > INTERVAL '30 minutes'
      THEN 1 ELSE 0
    END AS is_new_session
  FROM lagged
)
SELECT
  user_id,
  event_time,
  SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM session_starts;

Expect these follow-ups

How would you compute the average session length in minutes from this result?

sqlwindow-functionssession-analysisanalyticscoding

As asked

Implement a Welch's two-sample t-test in Python using only NumPy. The function should return the t-statistic and two-tailed p-value given two 1D arrays.

Sample answer outline

Reference implementation (python)

Python

import numpy as np
from scipy.stats import t as t_dist

def welch_t_test(a: np.ndarray, b: np.ndarray) -> tuple:
    n1, n2 = len(a), len(b)
    v1, v2 = a.var(ddof=1), b.var(ddof=1)
    t_stat = (a.mean() - b.mean()) / np.sqrt(v1/n1 + v2/n2)
    # Welch-Satterthwaite degrees of freedom
    df = (v1/n1 + v2/n2)**2 / ((v1/n1)**2/(n1-1) + (v2/n2)**2/(n2-1))
    p_value = 2 * t_dist.sf(abs(t_stat), df)
    return t_stat, p_value

Expect these follow-ups

When would you use a permutation test instead of Welch's t-test?

statisticst-testnumpypythoncoding

As asked

Write a Python implementation of k-means clustering using only NumPy. It should include the assignment step, the update step, and a stopping criterion. No sklearn allowed.

Sample answer outline

Reference implementation (python)

Python

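import numpy as np

def kmeans(X: np.ndarray, k: int, max_iter: int = 100, tol: float = 1e-4) -> tuple:
    rng = np.random.default_rng(0)
    # initialise centroids as k distinct random rows of X
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(max_iter):
        # assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)  # (n, k)
        labels = dists.argmin(axis=1)
        # update step: mean of assigned points; an empty cluster keeps its old centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # stopping criterion: total centroid movement below tol
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids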

Expect these follow-ups

How do you choose k in practice, and what is the elbow method's limitation?

Tags: clustering, kmeans, numpy, machine-learning, coding

As asked

Write a Python function that takes a 1D NumPy array of observations and returns a 95% bootstrap confidence interval for the mean. Use 10,000 resamples and no external bootstrap libraries.

Reference implementation (python)

Python

import numpy as np

def bootstrap_ci(data: np.ndarray, n_resamples: int = 10_000, ci: float = 0.95) -> tuple:
    data = np.asarray(data)
    rng = np.random.default_rng(seed=42)  # fixed seed so results are reproducible
    n = len(data)
    # draw all resamples at once for speed: each row holds n indices sampled with replacement
    indices = rng.integers(0, n, size=(n_resamples, n))
    resample_means = data[indices].mean(axis=1)
    # percentile method: cut (1 - ci) / 2 of the bootstrap distribution from each tail
    alpha = (1 - ci) / 2
    lower = np.quantile(resample_means, alpha)
    upper = np.quantile(resample_means, 1 - alpha)
    return lower, upper

Expect these follow-ups

What is the difference between the percentile method and the BCa bootstrap, and when does it matter?

Tags: bootstrap, statistics, python, numpy, confidence-intervals

As asked

Implement a logistic regression classifier in Python using only NumPy. It should have a fit method that runs gradient descent and a predict_proba method. Show the gradient derivation in a comment.

Reference implementation (python)

Python

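import numpy as np

class LogisticRegression:
    def __init__(self, lr=0.01, n_iters=1000, lam=0.0):
        self.lr = lr            # learning rate
        self.n_iters = n_iters  # number of gradient descent steps
        self.lam = lam          # L2 penalty strength
        self.w = None

    def _sigmoid(self, z):
        # clip to avoid overflow in exp for extreme logits
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        # no separate intercept: include a column of ones in X if a bias term is needed
        n, p = X.shape
        self.w = np.zeros(p)
        for _ in range(self.n_iters):
            preds = self._sigmoid(X @ self.w)
            # Loss: L(w) = -(1/n) * sum(y*log(p) + (1-y)*log(1-p)) + (lam/2)*||w||^2, p = sigmoid(X @ w)
            # Since sigmoid'(z) = p*(1-p), dL/dz_i = (p_i - y_i)/n, and the chain rule gives
            # grad = X.T @ (p - y) / n + lam * w
            grad = X.T @ (preds - y) / n + self.lam * self.w
            self.w -= self.lr * grad
        return self

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w)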

Expect these follow-ups

How would you add L1 regularization and why does the gradient become non-smooth?
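
As context for this follow-up: the L1 penalty lam * sum(|w_j|) has a kink at w_j = 0, so its gradient is undefined there and plain gradient descent does not apply directly. One common workaround is proximal gradient descent, where each gradient step on the smooth loss is followed by soft-thresholding. The sketch below assumes that approach; the names soft_threshold and fit_l1_logistic are hypothetical and the synthetic data is for illustration only.

```python
import numpy as np

def soft_threshold(w: np.ndarray, thresh: float) -> np.ndarray:
    """Proximal operator of thresh * ||w||_1: shrink towards zero, clamp small values to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)

def fit_l1_logistic(X, y, lam=0.1, lr=0.1, n_iters=2000):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        preds = 1 / (1 + np.exp(-np.clip(X @ w, -500, 500)))
        grad = X.T @ (preds - y) / n          # gradient of the smooth BCE term only
        w = soft_threshold(w - lr * grad, lr * lam)  # proximal step handles the L1 term
    return w

# illustrative synthetic data: only the first feature carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)
w_sparse = fit_l1_logistic(X, y, lam=0.05)
```

The soft-thresholding step is what produces exact zeros in the weights, which a subgradient step alone generally does not.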

Tags: logistic-regression, numpy, gradient-descent, machine-learning, coding

Pandas: compute a 7-day rolling average ignoring gaps

Reference implementation (python)

Python

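import pandas as pd
import numpy as np

dates = pd.bdate_range('2024-01-01', '2024-03-31')  # business days only, so weekends are gaps
rng = np.random.default_rng(0)  # seeded so the example is reproducible
df = pd.DataFrame({'sales': rng.random(len(dates)) * 100}, index=dates)

# Row-based: averages the last 7 rows regardless of the dates they cover (NaN until 7 rows exist)
df['rolling_7row'] = df['sales'].rolling(7).mean()

# Calendar-based: averages rows whose dates fall within the trailing 7 days (needs a sorted DatetimeIndex)
df['rolling_7day'] = df['sales'].rolling('7D').mean()
print(df.head(10))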

Expect these follow-ups

How would you compute a rolling 7-day sum that treats missing dates as zero rather than skipping them?
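
One way to approach this follow-up is to reindex the series to a full daily calendar with zeros before rolling. A minimal sketch, assuming a business-day index like the one in the reference implementation; the sales values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# illustrative data: business days only, so weekends are missing
dates = pd.bdate_range('2024-01-01', '2024-01-31')
sales = pd.Series(np.arange(1.0, len(dates) + 1), index=dates, name='sales')

# upsample to every calendar day; dates absent from the original become 0.0
daily = sales.asfreq('D', fill_value=0.0)

# with a complete daily index, a 7-row window is exactly a 7-day window
rolling_7day_sum = daily.rolling(7, min_periods=1).sum()
```

Note that fill_value only fills dates introduced by the upsampling; NaN values already present in the data would still need a separate fillna(0).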

Tags: pandas, rolling-window, time-series, python, coding

Build a cohort retention matrix with pandas

Reference implementation (python)

Python

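import pandas as pd

def cohort_retention(df: pd.DataFrame) -> pd.DataFrame:
    # assumes columns 'user_id', 'signup_date' and 'activity_date', with both dates as datetime64
    df = df.copy()
    # cohort = signup month; activity_period = month in which the activity happened
    df['cohort'] = df['signup_date'].dt.to_period('M')
    df['activity_period'] = df['activity_date'].dt.to_period('M')
    # subtracting monthly periods yields an offset whose .n is the whole-month difference
    df['months_since_signup'] = (
        df['activity_period'] - df['cohort']
    ).apply(lambda x: x.n)
    # cohort size counts users with at least one activity row
    cohort_size = df.groupby('cohort')['user_id'].nunique().rename('cohort_size')
    retention = df.groupby(['cohort', 'months_since_signup'])['user_id'].nunique().reset_index()
    retention = retention.pivot(index='cohort', columns='months_since_signup', values='user_id')
    # divide each row by its cohort size to get the retained fraction
    return retention.divide(cohort_size, axis=0)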
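
To make the shape of the output concrete, here is a small self-contained variant with a toy input. It is a sketch under assumptions rather than the reference answer: the month difference is computed with integer arithmetic instead of Period subtraction, empty cells are filled with 0 rather than left as NaN, and the function name cohort_retention_vectorized and the sample events are hypothetical.

```python
import pandas as pd

def cohort_retention_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['cohort'] = df['signup_date'].dt.to_period('M')
    # whole-month difference computed from year and month components
    df['months_since_signup'] = (
        (df['activity_date'].dt.year - df['signup_date'].dt.year) * 12
        + (df['activity_date'].dt.month - df['signup_date'].dt.month)
    )
    # distinct active users per (cohort, month offset), one column per offset
    active = (
        df.groupby(['cohort', 'months_since_signup'])['user_id']
          .nunique()
          .unstack(fill_value=0)
    )
    cohort_size = df.groupby('cohort')['user_id'].nunique()
    return active.divide(cohort_size, axis=0)

# made-up events: users 1 and 2 sign up in January, user 3 in February
events = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'signup_date': pd.to_datetime(
        ['2024-01-05', '2024-01-05', '2024-01-20', '2024-01-20', '2024-02-03']),
    'activity_date': pd.to_datetime(
        ['2024-01-06', '2024-02-10', '2024-01-21', '2024-01-25', '2024-02-04']),
})
matrix = cohort_retention_vectorized(events)
```

In this toy input, half of the January cohort is active one month after signup, so the January row reads 1.0 at offset 0 and 0.5 at offset 1.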

Expect these follow-ups

How would you visualize this matrix as a heatmap and what color scale is most interpretable?

Tags: pandas, cohort-analysis, retention, python, coding

As asked

Write Python code to test whether 'device_type' (mobile vs desktop) and 'converted' (0 or 1) are independent in a dataset. Use scipy and interpret the output.

Reference implementation (python)

Python

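import pandas as pd
from scipy.stats import chi2_contingency

# Assume df has 'device_type' and 'converted' columns
contingency = pd.crosstab(df['device_type'], df['converted'])  # 2x2 table of counts
# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f'chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}')
# The chi-square approximation is usually considered reliable when every expected count is at least 5
print('Min expected frequency:', expected.min())
if p_value < 0.05:
    print('Reject independence at 5% level: conversion rate differs by device type')
else:
    print('Fail to reject independence at 5% level')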

Expect these follow-ups

What test would you use instead if any expected cell count is below 5?
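
For a 2x2 table with small expected counts, Fisher's exact test is the usual alternative, and scipy exposes it as scipy.stats.fisher_exact. A minimal sketch; the counts in the table are invented for illustration and the row and column meanings are assumptions.

```python
import numpy as np
from scipy.stats import fisher_exact

# hypothetical counts: rows = device_type (mobile, desktop), columns = converted (1, 0)
table = np.array([[3, 1],
                  [1, 3]])

# exact test based on the hypergeometric distribution, so no minimum expected count is needed
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
print(f'odds ratio={odds_ratio:.2f}, p={p_value:.4f}')
```

With counts this small the test has little power, so a non-significant result here says more about the sample size than about independence.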

Tags: statistics, chi-square, scipy, python, hypothesis-testing

As asked

Without using pandas, write a pure NumPy function that computes a centered moving average of window size k on a 1D array. Handle edges by returning NaN where a full window is not available.

Reference implementation (python)

Python

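import numpy as np

def centered_moving_avg(arr: np.ndarray, k: int) -> np.ndarray:
    if k % 2 == 0:
        raise ValueError('Window size must be odd for a centered average')
    result = np.full(len(arr), np.nan)
    if k > len(arr):
        return result  # no position has a full window
    half = k // 2
    # zero-copy view of every length-k window, shape (len(arr) - k + 1, k)
    windows = np.lib.stride_tricks.sliding_window_view(arr, k)
    # the first and last `half` positions lack a full window and stay NaN
    result[half: len(arr) - half] = windows.mean(axis=1)
    return result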

Expect these follow-ups

What is the time complexity of your implementation and how would you handle a very large array?
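
As context for this follow-up: averaging every window of a sliding_window_view touches about n * k elements, so the reference implementation does O(n * k) work. A prefix-sum formulation brings this down to O(n). The sketch below assumes that approach; the function name centered_moving_avg_cumsum is hypothetical.

```python
import numpy as np

def centered_moving_avg_cumsum(arr, k: int) -> np.ndarray:
    if k % 2 == 0:
        raise ValueError('Window size must be odd for a centered average')
    arr = np.asarray(arr, dtype=float)
    n = len(arr)
    result = np.full(n, np.nan)
    if k > n:
        return result  # no position has a full window
    # prefix sums with a leading zero: csum[i] = sum(arr[:i])
    csum = np.concatenate(([0.0], np.cumsum(arr)))
    # sum of each length-k window as a difference of two prefix sums
    window_sums = csum[k:] - csum[:-k]
    half = k // 2
    result[half: n - half] = window_sums / k
    return result

out = centered_moving_avg_cumsum([1, 2, 3, 4, 5], 3)
```

The trade-off is numerical: on long arrays with large values, prefix sums can accumulate floating-point error that the direct window mean avoids. For arrays too large for memory, the same idea can be applied chunk by chunk with an overlap of k - 1 elements.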

Tags: numpy, python, signal-processing, time-series, coding

Questions

Pandas: compute within-group percentile rank (Coding, medium, very common)

SQL: compute median revenue per product category (Coding, medium, common)

Implement a bootstrap confidence interval in Python (Coding, medium, common)

Implement logistic regression gradient descent from scratch (Coding, hard, common)

Pandas: compute a 7-day rolling average ignoring gaps (Coding, medium, common)

Build a cohort retention matrix with pandas (Coding, hard, common)

Chi-square test for independence between two categorical features (Coding, easy, common)

Implement a vectorized moving average with NumPy (Coding, medium, occasional)

SQL: identify user sessions from raw event timestamps (Coding, hard, common)

Two-sample t-test without scipy (Coding, medium, occasional)

Implement k-means clustering with NumPy (Coding, hard, occasional)
