MLOps engineer coding interview questions

11 coding questions for MLOps engineer candidates. Expect prompts such as “Topological sort for ML pipeline DAG execution” and “Implement Population Stability Index for drift detection”, each with a worked answer outline and the follow-ups interviewers push on.

Topological sort for ML pipeline DAG execution

As asked

Given a dictionary representing an ML pipeline DAG where keys are step names and values are lists of dependencies, implement a function that returns a valid execution order or raises an error if a cycle exists.

Sample answer outline

A correct answer implements Kahn's algorithm (BFS with in-degree tracking) or DFS with three-color marking. It handles the case where len(result) != len(graph) by raising a cycle-detected error. It should discuss time complexity O(V+E) and mention that this underpins pipeline schedulers like Airflow and Kubeflow.

Reference implementation (python)

def execution_order(pipeline: dict[str, list[str]]) -> list[str]:
    """
    pipeline = {
        'train': ['preprocess', 'validate'],
        'preprocess': ['ingest'],
        'validate': ['ingest'],
        'ingest': [],
        'evaluate': ['train'],
    }
    one valid output ('preprocess' and 'validate' may appear in either order):
    ['ingest', 'preprocess', 'validate', 'train', 'evaluate']
    """
    # your implementation here
    pass
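
One way to fill in the stub, following the Kahn's algorithm outline above. This is a sketch under a few assumptions that the question does not state: steps that appear only as dependencies are treated as root steps, repeated dependencies are counted once, and newly ready steps are sorted so the output is deterministic.

```python
from collections import deque


def execution_order(pipeline: dict[str, list[str]]) -> list[str]:
    # Collect every step, including ones that appear only as a dependency.
    steps = set(pipeline)
    for deps in pipeline.values():
        steps.update(deps)

    # Use sets so a dependency listed twice is counted once.
    deps_of = {step: set(pipeline.get(step, ())) for step in steps}
    in_degree = {step: len(deps) for step, deps in deps_of.items()}

    # Reverse edges: for each step, which steps are waiting on it.
    dependents: dict[str, list[str]] = {step: [] for step in steps}
    for step, deps in deps_of.items():
        for dep in deps:
            dependents[dep].append(step)

    # Start with the steps that have no dependencies.
    ready = deque(sorted(s for s, d in in_degree.items() if d == 0))
    order: list[str] = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in sorted(dependents[step]):
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Any step left with a positive in-degree sits on a cycle.
    if len(order) != len(steps):
        raise ValueError("cycle detected in pipeline")
    return order
```

Each step and each dependency edge is visited once, which gives the O(V+E) bound. For the parallel-execution follow-up, the steps that are in the ready queue at the same time are the ones that could run concurrently.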

Expect these follow-ups

How would you extend this to support parallel execution of steps with no dependency on each other?
What happens to your algorithm if the same step appears as a dependency multiple times?

algorithms, dag, topological-sort, ml-pipelines, graph

Implement Population Stability Index for drift detection

As asked

Implement a function that computes PSI between a reference distribution and a current distribution for a single numeric feature, using a fixed number of bins derived from the reference dataset.

Sample answer outline

A correct answer bins the reference data into n equal-frequency buckets (quantile-based, not equal-width, to avoid empty buckets), maps current data into the same bin edges, computes proportions for each bin in both distributions, clips zero proportions to a small epsilon to avoid log(0), and returns sum((actual - expected) * log(actual / expected)). Edge cases: empty current bins, all-same-value features.

Reference implementation (python)

import numpy as np

def compute_psi(
    reference: np.ndarray,
    current: np.ndarray,
    n_bins: int = 10,
    eps: float = 1e-4,
) -> float:
    """
    Returns the PSI value between reference and current distributions.
    reference: 1-D array of values from the training / baseline period
    current:   1-D array of values from the production / monitoring period
    """
    # your implementation here
    pass
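
A possible way to complete the stub along the lines of the outline, offered as a sketch rather than a reference answer. It assumes non-empty inputs without NaNs, treats the outermost bins as open-ended so that out-of-range production values still land in a bin, and collapses duplicate quantile edges with np.unique; none of these choices are specified by the question.

```python
import numpy as np


def compute_psi(
    reference: np.ndarray,
    current: np.ndarray,
    n_bins: int = 10,
    eps: float = 1e-4,
) -> float:
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    if reference.size == 0 or current.size == 0:
        raise ValueError("reference and current must be non-empty")

    # Equal-frequency edges from the reference data only.
    # np.unique drops duplicate edges caused by heavily tied values.
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.unique(np.quantile(reference, quantiles))

    # Keep interior edges so the first and last bins are open-ended.
    interior = edges[1:-1]
    n_effective = len(interior) + 1

    # Map both samples onto the same bins.
    ref_idx = np.searchsorted(interior, reference, side="right")
    cur_idx = np.searchsorted(interior, current, side="right")
    expected = np.bincount(ref_idx, minlength=n_effective) / reference.size
    actual = np.bincount(cur_idx, minlength=n_effective) / current.size

    # Clip zero proportions to eps to avoid log(0) and division by zero.
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
```

With this binning, a constant or very low-cardinality feature can collapse to a single bin and always report a PSI of 0, so such features would need separate handling (for example, per-category proportions).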

Expect these follow-ups

Why do you use equal-frequency binning rather than equal-width binning for PSI?
How would you vectorize this to compute PSI across 200 features simultaneously?

drift-detection, psi, statistics, monitoring, numpy

Compute rolling accuracy on a prediction log stream

As asked

You receive a stream of (prediction, label) tuples. Implement a class that maintains a rolling window of the last N predictions and exposes the current accuracy. Then implement the same using a deque instead of a list and explain the complexity difference.

Sample answer outline

A correct list implementation recomputes the correct count by iterating the window on each update, and list.pop(0) shifts every remaining element, so each update is O(N). The deque implementation evicts the oldest item with popleft() when at capacity and increments or decrements a running correct count, making updates O(1). The candidate should explain that at 1000 RPS with a large window the list-based approach becomes a bottleneck, while the deque stays constant time per update.

Reference implementation (python)

from collections import deque

class RollingAccuracy:
    def __init__(self, window_size: int):
        self.window_size = window_size
        # your state here

    def update(self, prediction: int, label: int) -> None:
        """Record a new (prediction, label) pair."""
        pass

    def accuracy(self) -> float:
        """Return accuracy over the current window. Return 0.0 if empty."""
        pass
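
A sketch of the deque variant described in the outline. The attribute names `_window` and `_correct` are illustrative, and rejecting a non-positive window size is an added assumption.

```python
from collections import deque


class RollingAccuracy:
    def __init__(self, window_size: int):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self._window: deque = deque()  # 1 for a correct prediction, 0 otherwise
        self._correct = 0              # running count of 1s in the window

    def update(self, prediction: int, label: int) -> None:
        """Record a new (prediction, label) pair in O(1)."""
        hit = int(prediction == label)
        if len(self._window) == self.window_size:
            # Evict the oldest outcome and remove it from the running count.
            self._correct -= self._window.popleft()
        self._window.append(hit)
        self._correct += hit

    def accuracy(self) -> float:
        """Return accuracy over the current window, or 0.0 if empty."""
        if not self._window:
            return 0.0
        return self._correct / len(self._window)
```

Keeping the running count is what makes accuracy() O(1) as well; without it, the deque would still need an O(N) scan on every read.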

Expect these follow-ups

How would you extend this to support a time-based rolling window (last 5 minutes) instead of count-based?
How would you make this thread-safe if multiple prediction threads are writing concurrently?

data-structures, deque, monitoring, streaming, complexity

LRU cache for in-memory model registry client

As asked

Implement an LRU cache that stores up to K loaded model objects, evicts the least recently used when at capacity, and is thread-safe for concurrent inference requests.

Sample answer outline

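A correct answer uses OrderedDict, which provides move_to_end() and popitem(last=False) for O(1) reordering and eviction (plain dicts preserve insertion order since Python 3.7 but lack move_to_end), and holds a threading.Lock around every access, including get(), because a read also reorders the cache. get() moves the accessed key to the end; put() evicts the first item (the least recently used) when over capacity. The candidate should discuss why this matters for a multi-model server that loads models lazily from a registry and should not reload them on every request.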

Reference implementation (python)

import threading
from collections import OrderedDict
from typing import Any

class ModelCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache: OrderedDict[str, Any] = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: str) -> Any | None:
        """Return the model for key, or None if not cached."""
        pass

    def put(self, key: str, model: Any) -> None:
        """Insert or update key, evicting LRU if at capacity."""
        pass
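
A minimal sketch of how the stub might be completed, assuming a single coarse lock is acceptable for the expected request rate. Rejecting a non-positive capacity is an added assumption.

```python
import threading
from collections import OrderedDict
from typing import Any, Optional


class ModelCache:
    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._cache: "OrderedDict[str, Any]" = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[Any]:
        """Return the model for key, or None if not cached."""
        with self._lock:
            if key not in self._cache:
                return None
            # A read counts as a use, so mark the key most recently used.
            self._cache.move_to_end(key)
            return self._cache[key]

    def put(self, key: str, model: Any) -> None:
        """Insert or update key, evicting the LRU entry if over capacity."""
        with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)
            self._cache[key] = model
            if len(self._cache) > self.capacity:
                # The first item is the least recently used one.
                self._cache.popitem(last=False)
```

The lock only protects the cache structure. Loading a model on a miss happens outside this class, so two threads that miss on the same key could both load it unless the caller adds per-key coordination.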

Expect these follow-ups

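functools.lru_cache keeps its internal state coherent under concurrent calls, but it may call the wrapped function more than once for the same key when several threads miss at the same time. When would you prefer rolling your own?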
How would you add a TTL (time-to-live) to automatically expire stale model versions?

data-structures, lru-cache, concurrency, threading, model-serving

Point-in-time correct feature join for training data

As asked

Given a predictions table and a slowly changing features table (both as pandas DataFrames with timestamps), implement a point-in-time correct join that returns, for each prediction, the most recent feature row that existed at or before the prediction timestamp.

Sample answer outline

A correct answer uses pd.merge_asof with direction='backward' and by set to the entity key, after sorting both DataFrames by their timestamp columns. It should discuss why a naive inner join on date partitions is wrong (it can use future feature values), how this relates to training-serving skew, and the time complexity of the sort-merge approach versus a cross join. The candidate should handle the edge case where no prior feature row exists for an entity, in which case merge_asof leaves the feature columns as NaN.

Reference implementation (python)

import pandas as pd

predictions = pd.DataFrame({
    'entity_id': [1, 1, 2, 2],
    'prediction_ts': pd.to_datetime(['2024-01-05', '2024-01-10', '2024-01-03', '2024-01-08']),
    'score': [0.8, 0.6, 0.9, 0.4],
})

features = pd.DataFrame({
    'entity_id': [1, 1, 2],
    'feature_ts': pd.to_datetime(['2024-01-01', '2024-01-07', '2024-01-01']),
    'age': [25, 26, 30],
})

def point_in_time_join(predictions: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Return predictions with the correct feature values at prediction time."""
    pass
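
A sketch of the merge_asof approach from the outline, using the column names from the sample frames. Restoring an (entity_id, prediction_ts) ordering at the end is a presentation choice made here, not something the question requires.

```python
import pandas as pd


def point_in_time_join(predictions: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Return predictions with the feature values known at prediction time."""
    # merge_asof requires both frames to be sorted by their time keys.
    preds = predictions.sort_values("prediction_ts")
    feats = features.sort_values("feature_ts")

    joined = pd.merge_asof(
        preds,
        feats,
        left_on="prediction_ts",
        right_on="feature_ts",
        by="entity_id",          # match rows only within the same entity
        direction="backward",    # latest feature row at or before the prediction
    )
    # Predictions with no earlier feature row keep NaN/NaT feature columns.
    return joined.sort_values(["entity_id", "prediction_ts"]).reset_index(drop=True)


predictions = pd.DataFrame({
    "entity_id": [1, 1, 2, 2],
    "prediction_ts": pd.to_datetime(["2024-01-05", "2024-01-10", "2024-01-03", "2024-01-08"]),
    "score": [0.8, 0.6, 0.9, 0.4],
})
features = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-01-07", "2024-01-01"]),
    "age": [25, 26, 30],
})
result = point_in_time_join(predictions, features)
```

On the sample data, entity 1's prediction on 2024-01-10 picks up the 2024-01-07 feature row, while its 2024-01-05 prediction still sees the 2024-01-01 row.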

Expect these follow-ups

How does Feast implement point-in-time joins internally?
At what scale would you move this join from pandas to Spark, and what Spark function replicates merge_asof?

feature-store, point-in-time-join, training-serving-skew, pandas, data-engineering

Partition a large dataset for parallel batch inference

As asked

Write a generator that takes a list of record IDs and a batch_size, yields batches of IDs, and also accepts a shard parameter so that multiple workers can each process a non-overlapping subset of the data in parallel.

Sample answer outline

A correct answer yields slices of length batch_size from the IDs that belong to this shard (those whose index i satisfies i % num_shards == shard_id), ensuring no overlap between workers. It should use a generator (yield) rather than building a list in memory, handle the case where the last batch is smaller than batch_size, and discuss how this maps to distributed frameworks (Spark partitions, Ray Data shards).

Reference implementation (python)

from typing import Generator

def batch_shard(
    record_ids: list[int],
    batch_size: int,
    shard_id: int,
    num_shards: int,
) -> Generator[list[int], None, None]:
    """
    Yields batches of record_ids belonging to shard_id.
    Each record_ids[i] belongs to shard i % num_shards.
    """
    pass
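
One possible implementation of the stub, sketched with itertools.islice so that neither the shard nor the batches are materialized up front. The argument validation is an added assumption; because this is a generator, those checks run when the first batch is requested, not when the function is called.

```python
from itertools import islice
from typing import Iterator


def batch_shard(
    record_ids: list[int],
    batch_size: int,
    shard_id: int,
    num_shards: int,
) -> Iterator[list[int]]:
    """Yield batches of the record_ids whose index i has i % num_shards == shard_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")

    # Lazily select every num_shards-th record, starting at index shard_id.
    shard_iter = islice(record_ids, shard_id, None, num_shards)
    while True:
        batch = list(islice(shard_iter, batch_size))
        if not batch:
            return
        # The final batch may be shorter than batch_size.
        yield batch
```

Sharding by position rather than by ID value keeps shard sizes within one record of each other regardless of how the IDs themselves are distributed.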

Expect these follow-ups

How do you guarantee every record is processed exactly once if workers can fail and restart?
How would you change the sharding strategy if records are not uniformly distributed by ID?

batch-inference, parallelism, data-structures, generators, distributed-computing

Statistical significance test for A/B model comparison

As asked

Given two arrays of binary prediction outcomes (correct=1, wrong=0) from model A and model B on the same holdout set, implement a paired test to determine whether the difference in accuracy is statistically significant.

Sample answer outline

A correct answer implements McNemar's test: builds a 2x2 contingency table and counts the discordant pairs, where A is correct and B is wrong (b) versus A is wrong and B is correct (c), and computes the continuity-corrected chi-squared statistic (|b-c| - 1)^2 / (b+c) with one degree of freedom, or uses statsmodels.stats.contingency_tables.mcnemar (SciPy does not ship a McNemar function). It should explain why McNemar's is appropriate for paired predictions on the same samples versus unpaired t-tests, and how to interpret the p-value in the context of deploying a new model version.

Reference implementation (python)

import numpy as np
from scipy.stats import chi2

def mcnemar_test(
    a_correct: np.ndarray,
    b_correct: np.ndarray,
) -> tuple[float, float]:
    """
    a_correct: 1-D binary array, 1 where model A is correct on sample i
    b_correct: 1-D binary array, 1 where model B is correct on sample i
    Both arrays must be the same length (same evaluation samples).
    Returns (test_statistic, p_value).
    """
    pass
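
A sketch of the continuity-corrected statistic from the outline. Returning (0.0, 1.0) when there are no discordant pairs is a convention assumed here to avoid dividing by zero; the question does not specify that case.

```python
import numpy as np
from scipy.stats import chi2


def mcnemar_test(
    a_correct: np.ndarray,
    b_correct: np.ndarray,
) -> tuple[float, float]:
    a = np.asarray(a_correct).astype(bool)
    b = np.asarray(b_correct).astype(bool)
    if a.shape != b.shape:
        raise ValueError("arrays must cover the same evaluation samples")

    # Only discordant pairs carry information about the difference.
    n_a_only = int(np.sum(a & ~b))   # A correct, B wrong
    n_b_only = int(np.sum(~a & b))   # A wrong, B correct
    n_discordant = n_a_only + n_b_only
    if n_discordant == 0:
        # Assumed convention: no disagreement means no evidence of a difference.
        return 0.0, 1.0

    # Continuity-corrected chi-squared statistic with 1 degree of freedom.
    statistic = (abs(n_a_only - n_b_only) - 1) ** 2 / n_discordant
    p_value = float(chi2.sf(statistic, df=1))
    return float(statistic), p_value
```

The chi-squared approximation is usually considered unreliable when the number of discordant pairs is small; an exact binomial test on the discordant counts is the common alternative in that case.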

Expect these follow-ups

When would you use a bootstrap confidence interval instead of McNemar's test?
How does the choice of significance threshold (0.05 vs 0.01) affect your decision to deploy?

statistics, a-b-testing, hypothesis-testing, model-evaluation, scipy

Parse and validate MLflow run metadata from JSON

As asked

Write a function that reads an MLflow run info JSON file, validates that required keys exist (run_id, experiment_id, status, metrics containing val_auc above 0.75), and returns a typed dataclass or raises a descriptive ValueError.

Sample answer outline

A correct answer defines a dataclass or TypedDict, uses .get() with key-existence checks and raises ValueError with the missing field name, validates the val_auc threshold, and avoids bare except blocks. Bonus for using pydantic with a validator for the metric threshold. The candidate should discuss how this pattern is used in CI gates that block model promotion if evaluation metrics are below threshold.

Reference implementation (python)

import json
from dataclasses import dataclass

@dataclass
class RunMetadata:
    run_id: str
    experiment_id: str
    status: str
    val_auc: float

def parse_run_metadata(json_path: str, min_val_auc: float = 0.75) -> RunMetadata:
    """
    Reads the MLflow run info JSON at json_path.
    Raises ValueError with a descriptive message if any required field
    is missing or val_auc is below min_val_auc.
    """
    with open(json_path) as f:
        data = json.load(f)
    # your implementation here
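
A sketch of one way to finish the stub. The file layout is an assumption: run_id, experiment_id, and status are taken to be top-level keys, and val_auc is taken to live in a nested "metrics" object. A value exactly equal to min_val_auc is accepted here, since the docstring only asks to reject values below it.

```python
import json
from dataclasses import dataclass


@dataclass
class RunMetadata:
    run_id: str
    experiment_id: str
    status: str
    val_auc: float


def parse_run_metadata(json_path: str, min_val_auc: float = 0.75) -> RunMetadata:
    with open(json_path) as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError("run info must be a JSON object")

    # Name the first missing field so the failure is actionable.
    for key in ("run_id", "experiment_id", "status", "metrics"):
        if key not in data:
            raise ValueError(f"missing required field: {key}")

    metrics = data["metrics"]
    if not isinstance(metrics, dict) or "val_auc" not in metrics:
        raise ValueError("missing required metric: metrics.val_auc")

    try:
        val_auc = float(metrics["val_auc"])
    except (TypeError, ValueError) as exc:
        raise ValueError(f"metrics.val_auc is not numeric: {metrics['val_auc']!r}") from exc

    if val_auc < min_val_auc:
        raise ValueError(f"val_auc {val_auc} is below the threshold {min_val_auc}")

    return RunMetadata(
        run_id=str(data["run_id"]),
        experiment_id=str(data["experiment_id"]),
        status=str(data["status"]),
        val_auc=val_auc,
    )
```

In a CI gate, the ValueError message is what the data scientist sees in the job log, which is why each branch names the field or the threshold involved.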

Expect these follow-ups

How would you make this validation reusable across different model types with different required metrics?
How do you surface these validation failures in a CI pipeline so they are visible to the data scientist?

mlflow, data-validation, python, ci-cd, model-registry

Generate a Kubernetes Job manifest for a training run

As asked

Write a Python function that generates a Kubernetes Job manifest as a dictionary (ready for yaml.dump) for a PyTorch training job, parameterized by image, GPU count, environment variables from a config dict, and a unique run ID.

Sample answer outline

A correct answer produces the correct apiVersion (batch/v1), kind (Job), metadata with labels including the run_id, spec with backoffLimit and completions set to 1, a container with resources.limits of nvidia.com/gpu and the image, and env from the config dict. It should set restartPolicy to Never and include a ttlSecondsAfterFinished to avoid stale job accumulation. Bonus for adding a node selector for GPU node pools.

Reference implementation (python)

def generate_training_job(
    run_id: str,
    image: str,
    gpu_count: int,
    env_vars: dict[str, str],
    namespace: str = "ml-training",
) -> dict:
    """
    Returns a Kubernetes Job manifest dict.
    The job should:
    - Request gpu_count GPUs via nvidia.com/gpu resource limit
    - Set env_vars as container environment variables
    - Use run_id as a unique name suffix
    - Set restartPolicy to Never
    """
    pass
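
A sketch of a manifest builder that covers the fields named in the outline. Several values are illustrative assumptions rather than requirements: the "train-" name prefix, the "app" and "run-id" label keys, the "trainer" container name, a backoffLimit of 0, and a one-hour ttlSecondsAfterFinished. It also assumes run_id is already a valid lowercase DNS label, since Kubernetes object names are restricted.

```python
def generate_training_job(
    run_id: str,
    image: str,
    gpu_count: int,
    env_vars: dict[str, str],
    namespace: str = "ml-training",
) -> dict:
    """Return a Kubernetes Job manifest dict suitable for yaml.dump."""
    labels = {"app": "training", "run-id": run_id}  # assumed label keys
    container = {
        "name": "trainer",  # assumed container name
        "image": image,
        # Kubernetes expects env values as strings.
        "env": [{"name": k, "value": str(v)} for k, v in env_vars.items()],
        "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"train-{run_id}",  # run_id used as a unique suffix
            "namespace": namespace,
            "labels": labels,
        },
        "spec": {
            "backoffLimit": 0,                # assumed: no automatic pod retries
            "completions": 1,
            "ttlSecondsAfterFinished": 3600,  # assumed: clean up after one hour
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [container],
                },
            },
        },
    }
```

Building the manifest as a plain dict keeps the function easy to unit test; the node selector mentioned as a bonus would be one more key under the pod spec.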

Expect these follow-ups

Why is restartPolicy: Never preferred over OnFailure for training jobs?
How would you extend this to a multi-node distributed training job using MPI or PyTorch's elastic training?

kubernetes, training-jobs, python, infrastructure, gpu

Implement the hashing trick for high-cardinality categoricals

As asked

Implement a feature hasher that maps arbitrary string category values to a fixed-size integer index vector using the hashing trick, matching the behavior needed for online inference without a prebuilt vocabulary.

Sample answer outline

A correct answer hashes each (feature_name, value) pair to a bucket in [0, n_buckets) using a deterministic hash (e.g., MurmurHash3 or FNV) and uses a hash-derived sign (+1 or -1) so that colliding features tend to cancel rather than accumulate. It should discuss how the hashing trick avoids storing a vocabulary at serving time, the tradeoff of collision rate versus n_buckets, and why the same hash function and seed must be used at training and serving time to avoid skew.

Reference implementation (python)

import hashlib
import numpy as np

def hash_features(
    features: dict[str, str],
    n_buckets: int = 2**18,
) -> np.ndarray:
    """
    features: dict of feature_name -> string_value
    Returns a dense float32 vector of shape (n_buckets,)
    with +1 or -1 at the hashed bucket indices.
    """
    vec = np.zeros(n_buckets, dtype=np.float32)
    # your implementation here
    return vec
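
A sketch that completes the stub using hashlib, which the stub already imports. SHA-256 stands in here for the MurmurHash3 or FNV hashes named in the outline, purely because it ships with the standard library; it is slower but equally deterministic. The "name=value" key format and the choice of which digest bytes drive the index and the sign are assumptions.

```python
import hashlib
import numpy as np


def hash_features(
    features: dict[str, str],
    n_buckets: int = 2**18,
) -> np.ndarray:
    vec = np.zeros(n_buckets, dtype=np.float32)
    for name, value in features.items():
        # Hash the (feature_name, value) pair so equal values under
        # different features land in different buckets.
        digest = hashlib.sha256(f"{name}={value}".encode("utf-8")).digest()
        # First 8 bytes choose the bucket.
        index = int.from_bytes(digest[:8], "little") % n_buckets
        # A separate byte chooses the sign, independent of the bucket.
        sign = 1.0 if digest[8] & 1 else -1.0
        # Accumulate so that collisions within one row add or cancel.
        vec[index] += sign
    return vec
```

Python's built-in hash() is not a substitute here: string hashes are salted per process by default, so training and serving would disagree on bucket assignments.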

Expect these follow-ups

What collision rate do you expect for 1 million distinct values hashed into 2^18 buckets?
How does sklearn's HashingVectorizer handle negative hash values (see its alternate_sign option)?

feature-engineering, hashing-trick, categoricals, inference, online-serving

Checkpoint-aware retry loop for distributed training steps

As asked

Write a Python decorator that wraps a training step function, saves the result to a checkpoint file on success, and on failure retries up to max_retries times with exponential backoff, loading from checkpoint if one exists at the start.

Sample answer outline

A correct answer implements a decorator that checks for an existing checkpoint at startup and returns it immediately if found (idempotency), retries the wrapped function with 2^attempt * base_delay seconds of sleep between attempts, writes the result to the checkpoint path on success using atomic rename (write to a temp file then rename), and re-raises the exception once max_retries are exhausted. Serializing with json rather than pickle means loading a checkpoint cannot execute arbitrary code, at the cost of supporting only JSON-serializable results.

Reference implementation (python)

import time
import json
import os
from functools import wraps

def with_checkpoint(checkpoint_path: str, max_retries: int = 3, base_delay: float = 1.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            # Load checkpoint if it exists (use json.load for safe deserialization)
            # Retry with exponential backoff on failure
            # Save checkpoint on success (write to temp path, then rename atomically)
            pass
        return wrapper
    return decorator
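
A sketch of the decorator following the three comments in the stub. It reads "retries up to max_retries times" as one initial attempt plus max_retries retries, which is an interpretation, and it assumes the wrapped function returns something JSON-serializable.

```python
import json
import os
import tempfile
import time
from functools import wraps


def with_checkpoint(checkpoint_path: str, max_retries: int = 3, base_delay: float = 1.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            # Idempotency: reuse a finished result instead of recomputing.
            if os.path.exists(checkpoint_path):
                with open(checkpoint_path) as f:
                    return json.load(f)

            # One initial attempt plus up to max_retries retries.
            for attempt in range(max_retries + 1):
                try:
                    result = fn(*args, **kwargs)
                    break
                except Exception:
                    if attempt == max_retries:
                        raise
                    time.sleep(base_delay * 2 ** attempt)

            # Write to a temp file in the same directory, then rename, so
            # readers never observe a partially written checkpoint.
            directory = os.path.dirname(os.path.abspath(checkpoint_path))
            fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
            try:
                with os.fdopen(fd, "w") as f:
                    json.dump(result, f)
                    f.flush()
                    os.fsync(f.fileno())
                os.replace(tmp_path, checkpoint_path)
            except BaseException:
                if os.path.exists(tmp_path):
                    os.remove(tmp_path)
                raise
            return result
        return wrapper
    return decorator
```

A result loaded from the checkpoint has been through a JSON round trip, so tuples come back as lists and non-string dict keys come back as strings. The checkpoint path is also fixed per decorated function, so calls with different arguments would share one checkpoint unless the path is derived from the arguments.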

Expect these follow-ups

Why is an atomic rename important here, and how do you implement it on a shared network filesystem?
How would you adapt this for a long-running training step where mid-step checkpointing is also needed?

fault-tolerance, checkpointing, distributed-training, python, retry

Questions

Topological sort for ML pipeline DAG execution (Coding, medium, Common)
Implement Population Stability Index for drift detection (Coding, medium, Common)
Compute rolling accuracy on a prediction log stream (Coding, easy, Common)
LRU cache for in-memory model registry client (Coding, medium, Common)
Point-in-time correct feature join for training data (Coding, hard, Common)
Partition a large dataset for parallel batch inference (Coding, easy, Common)
Statistical significance test for A/B model comparison (Coding, medium, Common)
Parse and validate MLflow run metadata from JSON (Coding, easy, Common)
Generate a Kubernetes Job manifest for a training run (Coding, medium, Common)
Implement the hashing trick for high-cardinality categoricals (Coding, medium, Occasional)
Checkpoint-aware retry loop for distributed training steps (Coding, medium, Occasional)
