ML engineer coding interview questions

16 questions on coding for ml engineer candidates. Each entry has the question as asked, a sample answer outline, common follow-ups, and a reference implementation where applicable.

Showing 1 to 16 of 16 coding questions.

As asked

Implement cross-entropy loss from raw logits for a multiclass problem using the log-sum-exp trick. Explain why computing softmax then log then cross-entropy in three steps is numerically unstable.

Sample answer outline

Computing softmax naively (exp(x) / sum(exp(x))) overflows for large logits and underflows for very negative logits. The stable version subtracts the max logit before exponentiation: log_softmax(x) = x - max(x) - log(sum(exp(x - max(x)))). Cross-entropy is then -x[label] + max(x) + log(sum(exp(x - max(x)))). This equals the logsumexp formulation. The key insight is that log and exp are inverses so computing softmax then taking log is wasteful and unstable.

Reference implementation (python)

Python

import numpy as np

def cross_entropy(logits: np.ndarray, label: int) -> float:
    """logits: (vocab,), label: int index"""
    shifted = logits - logits.max()
    log_sum_exp = np.log(np.exp(shifted).sum())
    return -shifted[label] + log_sum_exp

def cross_entropy_batch(
    logits: np.ndarray, labels: np.ndarray
) -> float:
    """logits: (N, C), labels: (N,)"""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))
    correct = shifted[np.arange(len(labels)), labels]
    return (-correct + log_sum_exp).mean()

Expect these follow-ups

How does PyTorch's F.cross_entropy implement this under the hood?
What changes when implementing label smoothing on top of this loss?

loss-functionsnumerical-stabilitycross-entropyimplementationsoftmax

As asked

OpenAI's inference stack needs to cache KV (key-value) states and decoded outputs to serve repeated or similar prompts efficiently. As a warm-up, implement a general LRU cache that supports two operations: get(key) returns the value if the key exists, otherwise -1; put(key, value) inserts or updates the key, evicting the least-recently-used key when the cache is at capacity. Both operations must run in O(1) time.

Sample answer outline

The candidate should reach for a doubly-linked list paired with a hash map. They should explain why a doubly-linked list allows O(1) removal (you need the prev pointer) and why the hash map stores node references, not just keys. A strong answer handles edge cases: capacity of 1, overwriting an existing key without duplicating the node, and moving the accessed node to the head correctly.

Reference implementation (python)

Python

class LRUCache:
    def __init__(self, capacity: int):
        self.cap = capacity
        self.cache = {}  # key -> node
        # sentinel head and tail
        self.head = Node(0, 0)
        self.tail = Node(0, 0)
        self.head.next = self.tail
        self.tail.prev = self.head

    def get(self, key: int) -> int:
        ...

    def put(self, key: int, value: int) -> None:
        ...

Expect these follow-ups

How would you make this thread-safe for concurrent access?
What changes if you need to support a TTL on each entry in addition to LRU eviction?

company:openailruhash-maplinked-listdata-structurescaching

As asked

Implement a binary focal loss function in PyTorch that is numerically stable and supports per-sample weights. The focal loss is: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) where p_t is the model probability for the true class.

Sample answer outline

Use torch.nn.functional.binary_cross_entropy_with_logits with reduction='none' to get per-sample BCE in a numerically stable way (avoids computing sigmoid then log separately). Compute p_t from logits using sigmoid, then compute (1 - p_t)^gamma as the modulating factor. Apply alpha weighting per sample and optionally multiply by sample weights. Sum or mean at the end. Key pitfall is computing log(sigmoid(x)) naively versus using the logsumexp-stable path already in BCEWithLogitsLoss.

Reference implementation (python)

Python

import torch
import torch.nn.functional as F

def focal_loss(
    logits: torch.Tensor,
    targets: torch.Tensor,
    alpha: float = 0.25,
    gamma: float = 2.0,
    weights: torch.Tensor | None = None,
) -> torch.Tensor:
    # logits: (N,), targets: (N,) binary float
    bce = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    )
    p_t = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p_t, 1 - p_t)
    focal_factor = (1 - p_t) ** gamma
    alpha_t = torch.where(targets == 1, alpha, 1 - alpha)
    loss = alpha_t * focal_factor * bce
    if weights is not None:
        loss = loss * weights
    return loss.mean()

Expect these follow-ups

How would you verify your focal loss implementation is correct numerically at extreme logit values?
How do you extend this to multiclass focal loss?

pytorchloss-functionsnumerical-stabilityclassificationimplementation

As asked

Given a model that returns logits of shape (batch, vocab_size), implement both greedy decoding and top-k sampling for a single generation step. Handle the temperature parameter.

Sample answer outline

Greedy decoding takes argmax over the vocab dimension. Top-k sampling filters all tokens outside the top-k by setting their logits to negative infinity, then divides by temperature and applies softmax, then samples using torch.multinomial. Temperature below 1 sharpens the distribution and above 1 flattens it. The key implementation pitfall is applying temperature before filtering versus after, which changes the relative weights of the retained tokens.

Reference implementation (python)

Python

import torch

def greedy_step(logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch, vocab)
    return logits.argmax(dim=-1)

def top_k_step(
    logits: torch.Tensor,
    k: int = 50,
    temperature: float = 1.0,
) -> torch.Tensor:
    logits = logits / temperature
    top_k_vals, _ = torch.topk(logits, k, dim=-1)
    threshold = top_k_vals[..., -1:]
    logits = logits.masked_fill(logits < threshold, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

Expect these follow-ups

How would you extend this to nucleus (top-p) sampling?
What is the difference between temperature scaling and top-p in terms of what they control about the output distribution?

decodingsamplinglanguage-modelpytorchimplementation

As asked

Write a learning rate scheduler function that linearly warms up from 0 to peak_lr over warmup_steps, then cosine anneals to min_lr over the remaining total_steps. No library calls to CosineAnnealingLR.

Sample answer outline

During warmup, lr = peak_lr * (step / warmup_steps). After warmup, use the cosine formula: lr = min_lr + 0.5 * (peak_lr - min_lr) * (1 + cos(pi * progress)) where progress = (step - warmup_steps) / (total_steps - warmup_steps). The key correctness check is that at step 0 lr equals 0, at warmup_steps it equals peak_lr, and at total_steps it equals min_lr.

Reference implementation (python)

Python

import math

def get_lr(
    step: int,
    warmup_steps: int,
    total_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
) -> float:
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

Expect these follow-ups

How would you modify this for a cyclical cosine schedule that restarts?
Why do practitioners often use a small nonzero min_lr rather than annealing all the way to 0?

learning-rateschedulingtrainingpytorchoptimization

As asked

You need to compute the mean and variance of a feature stream without storing all values. Implement an online (Welford) algorithm and explain where you would use this in a feature pipeline.

Sample answer outline

Welford's algorithm updates mean and M2 (sum of squared deviations) in one pass with no numerical instability. The update is: delta = x - mean; mean += delta / n; delta2 = x - mean; M2 += delta * delta2; variance = M2 / (n - 1). This is used in online feature normalization, streaming z-score computation, and monitoring feature distribution statistics without storing the full dataset.

Reference implementation (python)

Python

class OnlineNormalizer:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean
        self.M2 += delta * delta2

    @property
    def variance(self) -> float:
        if self.n < 2:
            return 0.0
        return self.M2 / (self.n - 1)

    def normalize(self, x: float) -> float:
        return (x - self.mean) / (self.variance ** 0.5 + 1e-8)

Expect these follow-ups

How do you parallelize Welford's algorithm across shards of data?
What numerical problems arise if you compute variance as E[X^2] - E[X]^2 on large datasets?

statisticsfeature-normalizationstreamingnumerical-stabilityimplementation

As asked

Implement a function that takes a padded token ID tensor of shape (batch, seq_len) and an embedding matrix, performs the lookup, and returns embeddings with zero vectors at padding positions. Padding token ID is 0.

Sample answer outline

Index the embedding matrix with the token IDs, then create a boolean mask where token_ids equals 0, expand the mask to the embedding dimension, and zero out the padding positions. Using torch.nn.Embedding with padding_idx=0 handles this automatically but it is worth knowing the manual path for custom layers. The key is to zero after lookup rather than skipping the lookup, since scatter-style indexing is not efficient in PyTorch for variable-length masking.

Reference implementation (python)

Python

import torch

def batched_embed(
    token_ids: torch.Tensor,   # (batch, seq_len)
    embed_matrix: torch.Tensor, # (vocab, d_model)
    pad_id: int = 0,
) -> torch.Tensor:             # (batch, seq_len, d_model)
    embeds = embed_matrix[token_ids]          # simple index
    pad_mask = (token_ids == pad_id).unsqueeze(-1)  # (batch, seq, 1)
    embeds = embeds.masked_fill(pad_mask, 0.0)
    return embeds

Expect these follow-ups

How does padding_idx in nn.Embedding affect the gradient for that embedding vector during training?
How would you handle a variable-length batch without padding using PyTorch's nested tensors?

embeddingspytorchnlpbatchingimplementation

As asked

Without using sklearn, write a function that takes lists of true labels and predicted labels and returns precision, recall, F1, and support for each class in a multiclass problem.

Sample answer outline

Build a confusion matrix by iterating predictions and accumulating TP, FP, FN per class. Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 * P * R / (P + R). Support is the total number of actual occurrences of each class (TP + FN), not just the true positives. Handle the zero-division case when a class has no predictions. Return a dict keyed by class label. This is a common warm-up question that also tests attention to edge cases like unseen classes in predictions.

Reference implementation (python)

Python

from collections import defaultdict
from typing import Dict, List, Any

def class_metrics(
    y_true: List[Any], y_pred: List[Any]
) -> Dict[Any, Dict[str, float]]:
    classes = set(y_true) | set(y_pred)
    tp = defaultdict(int); fp = defaultdict(int); fn = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    results = {}
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec  = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1   = 2*prec*rec/(prec+rec) if (prec+rec) else 0.0
        results[c] = {"precision": prec, "recall": rec,
                      "f1": f1, "support": tp[c] + fn[c]}
    return results

Expect these follow-ups

How do you compute macro versus weighted F1 from per-class F1 scores?
How would you efficiently compute this for one million examples without a loop?

evaluationclassificationmetricsimplementationnumpy

As asked

Write a Python generator-based batching function that accumulates items up to batch_size or max_wait_ms milliseconds, then yields the batch. This is the core of a low-latency serving batching layer.

Sample answer outline

Use a queue with a non-blocking get in a loop, tracking elapsed time since the first item arrived. When either the batch fills or the deadline passes, yield the batch and reset. The critical correctness issue is measuring wall-clock time correctly and not blocking indefinitely when the queue is empty. In production this runs in a dedicated thread, and the caller enqueues items while the consumer yields batches to the model.

Reference implementation (python)

Python

import queue, time
from typing import Generator, List, TypeVar
T = TypeVar("T")

def dynamic_batcher(
    q: queue.Queue,
    batch_size: int,
    max_wait_ms: float,
) -> Generator[List, None, None]:
    while True:
        batch, deadline = [], None
        while len(batch) < batch_size:
            timeout = None if deadline is None else max(
                0, deadline - time.monotonic()
            )
            try:
                item = q.get(timeout=timeout)
                batch.append(item)
                if deadline is None:
                    deadline = time.monotonic() + max_wait_ms / 1000
            except queue.Empty:
                break
        if batch:
            yield batch

Expect these follow-ups

How would you implement backpressure so the queue does not grow unboundedly under load?
What changes when this is asyncio-based versus thread-based?

batchingservinglatencypythonstreaming

As asked

Implement intersection-over-union for axis-aligned bounding boxes and a simple greedy NMS function that takes boxes and scores and returns the kept indices.

Sample answer outline

IoU is computed by finding the intersection rectangle coordinates with max/min operations, computing its area, and dividing by the union area (area_a + area_b - intersection). Greedy NMS sorts by score descending, keeps the highest-score box, then removes all remaining boxes with IoU above the threshold against the kept box, and repeats. The vectorized version computes IoU of the top box against all remaining boxes in one operation. Edge cases include zero-area boxes and the threshold boundary condition.

Reference implementation (python)

Python

import numpy as np
from numpy import ndarray

def iou(a: ndarray, b: ndarray) -> float:
    """a, b: [x1, y1, x2, y2]"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes: ndarray, scores: ndarray, thresh: float) -> list:
    order = scores.argsort()[::-1]
    keep = []
    while len(order):
        i = order[0]; keep.append(i)
        ious = np.array([iou(boxes[i], boxes[j]) for j in order[1:]])
        order = order[1:][ious <= thresh]
    return keep

Expect these follow-ups

How does Soft-NMS differ from greedy NMS and when does it help?
How would you vectorize this to run on GPU with PyTorch?

object-detectioniounmscomputer-visionimplementation

As asked

Implement one step of the Adam optimizer given parameters, gradients, first and second moment estimates, step count, and hyperparameters. Include bias correction.

Sample answer outline

m = beta1 * m + (1 - beta1) * g updates the first moment. v = beta2 * v + (1 - beta2) * g^2 updates the second moment. Bias-corrected estimates are m_hat = m / (1 - beta1^t) and v_hat = v / (1 - beta2^t). The parameter update is theta -= lr * m_hat / (sqrt(v_hat) + epsilon). The epsilon is added inside the denominator to prevent division by zero, not outside. AdamW adds a weight decay term directly to the parameter before the Adam update, not to the gradient.

Reference implementation (python)

Python

import numpy as np

def adam_step(
    theta: np.ndarray,
    grad: np.ndarray,
    m: np.ndarray,
    v: np.ndarray,
    t: int,
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> tuple:
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

Expect these follow-ups

Why does AdamW decouple weight decay from the gradient update and what problem does it solve?
What is the Lion optimizer and how does its memory footprint compare to Adam?

optimizationadamtrainingimplementationgradient-descent

As asked

Implement a tiled matrix multiplication in C or Python that improves cache utilization over the naive triple-loop version. Explain why tiling helps and what tile size you would choose.

Sample answer outline

A strong answer implements the three-loop GEMM with a blocking factor B, loading B x B tiles into local arrays before the inner product. The key insight is reuse: each element of A is read B times instead of N times, and same for B. Tile size is chosen so two tiles fit in L1 cache (typically 64 to 128 KB), so for float32 with 4 bytes each: sqrt(64KB/4/2) is roughly 90, so a tile of 64 or 128 is common. On AMD GPUs the equivalent is LDS tiling in a HIP kernel, where tile size matches the wavefront width.

Reference implementation (python)

Python

def tiled_matmul(A, B, tile=64):
    N = len(A)
    C = [[0.0]*N for _ in range(N)]
    for ii in range(0, N, tile):
        for jj in range(0, N, tile):
            for kk in range(0, N, tile):
                for i in range(ii, min(ii+tile, N)):
                    for k in range(kk, min(kk+tile, N)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj+tile, N)):
                            C[i][j] += a_ik * B[k][j]
    return C

Expect these follow-ups

How does the memory layout (row-major vs column-major) affect your tiling strategy?
How would you add vectorization hints or use SIMD intrinsics to speed up the inner loop on a CPU?

company:amdmatrix-multiplytilingcacheperformancealgorithms

As asked

Write a Python function that calls a hypothetical `claude_complete` function, parses tool calls from the response, executes them, feeds results back, and retries the loop up to three times if a tool execution raises an exception. Stop when the model returns a final text response with no more tool calls.

Sample answer outline

The loop calls claude_complete with the current messages list, checks for tool_use blocks in the response content, executes each tool, appends the tool result as a user message, and iterates. On exception from a tool, it appends an error result and increments a retry counter. If retries exceed the limit, it raises or returns a failure. When the response contains no tool blocks, the text is the final answer. A strong solution handles the message format correctly and resets the retry counter on a successful tool call.

Reference implementation (python)

Python

def run_agent(initial_messages: list[dict], tools: dict) -> str:
    messages = list(initial_messages)
    retries = 0
    max_retries = 3
    while True:
        response = claude_complete(messages=messages, tools=list(tools.values()))
        # check response.content for tool_use blocks
        # execute tools, append results
        # if no tool calls, return text
        # if tool error, increment retries or raise
        pass

Expect these follow-ups

How would you add a timeout to individual tool calls without blocking the event loop?
How would you log each turn of the agent loop for debugging without storing sensitive tool results?

company:anthropiccodingagentictool-useclaude-api

As asked

Implement a CUDA kernel that computes the sum of an array of N floats in parallel. Walk me through your approach: how do you handle the reduction within a block, how do you handle multiple blocks, and what optimizations matter most?

Sample answer outline

A strong answer describes a two-phase approach: first, each block reduces its chunk of the array into a single partial sum using shared memory and a tree reduction pattern (halving the active threads each step with __syncthreads between steps), writing the partial sum to a global output array. A second kernel or atomic operation then sums the partial results. Key optimizations: use warp shuffle instructions (__shfl_down_sync) for the last 32 threads to avoid shared memory entirely; ensure the initial load uses coalesced global memory accesses; and avoid thread divergence in the reduction loop by using stride halving from the top rather than sequential stride. The candidate should mention that for production use, CUB's DeviceReduce is preferred.

Reference implementation (cuda)

cuda

__global__ void reduceSum(float* in, float* out, int n) {
  extern __shared__ float sdata[];
  int tid = threadIdx.x, i = blockIdx.x * blockDim.x + tid;
  sdata[tid] = (i < n) ? in[i] : 0.0f;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
  }
  if (tid == 0) atomicAdd(out, sdata[0]);
}

Expect these follow-ups

How does __shfl_down_sync replace shared memory in the final warp?
At what point does using atomicAdd on a single global variable become a bottleneck and how do you avoid it?

company:nvidiacudareductionparallelshared-memorynvidia

As asked

OpenAI's prompt-caching layer stores structured prompt trees that must be serialized for transport and reconstructed exactly on the receiving end. Design an algorithm to serialize a binary tree to a string and deserialize that string back to the original tree structure. Your algorithm must handle null nodes and work correctly for any binary tree, not just BSTs.

Sample answer outline

A strong answer uses preorder traversal with explicit null markers. Serialization: DFS, emitting node values and a sentinel like 'N' for nulls, comma-separated. Deserialization: convert to a queue, recursively consume values, returning null when you see the sentinel. The candidate should explain why preorder works (you can reconstruct the tree uniquely with null markers) and handle integers with potential leading zeros or negative values correctly.

Reference implementation (python)

Python

class Codec:
    def serialize(self, root):
        res = []
        def dfs(node):
            if not node:
                res.append('N')
                return
            res.append(str(node.val))
            dfs(node.left)
            dfs(node.right)
        dfs(root)
        return ','.join(res)

    def deserialize(self, data):
        vals = iter(data.split(','))
        def dfs():
            val = next(vals)
            if val == 'N':
                return None
            node = TreeNode(int(val))
            node.left = dfs()
            node.right = dfs()
            return node
        return dfs()

Expect these follow-ups

How would your approach change if you needed the serialized format to be human-readable JSON?
What is the space complexity of your serialized representation compared to storing the tree level by level?

company:openaibinary-treedfsserializationrecursiondata-structures

As asked

Implement brute-force k-nearest neighbor search using cosine similarity for a query matrix Q of shape (q, d) against a corpus matrix C of shape (n, d), returning indices and scores for top-k neighbors.

Sample answer outline

Normalize both Q and C to unit norm along the embedding dimension, then compute the dot product matrix Q @ C.T of shape (q, n). Each row is the cosine similarity of one query against all corpus items. Use np.argpartition to get the top-k indices efficiently without a full sort, then sort only those k values. Return indices and similarity scores. This is the baseline before switching to FAISS or HNSW for large corpora.

Reference implementation (python)

Python

import numpy as np

def cosine_knn(
    Q: np.ndarray,  # (q, d)
    C: np.ndarray,  # (n, d)
    k: int,
) -> tuple[np.ndarray, np.ndarray]:
    Q_n = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-10)
    C_n = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-10)
    sims = Q_n @ C_n.T          # (q, n)
    part = np.argpartition(sims, -k, axis=1)[:, -k:]
    # sort within top-k
    rows = np.arange(len(Q))[:, None]
    top_sims = sims[rows, part]
    order = np.argsort(-top_sims, axis=1)
    indices = part[rows, order]
    scores  = top_sims[rows, order]
    return indices, scores

Expect these follow-ups

What is the complexity of this brute-force search and at what corpus size would you switch to an approximate method like FAISS?
How does L2 distance relate to cosine similarity for unit-norm vectors?

similarity-searchembeddingsnumpyknnretrieval

Practise these patterns on AlgoExpert

Recommended

200+ video-explained coding interview questions organised by the patterns covered on this page, with timed practice and solution walkthroughs.

Start practising

An external resource we recommend. AlgoExpert is not affiliated with us and we earn nothing from this link.

Tools to sharpen your prep

All tools

import numpy as np def cross_entropy(logits: np.ndarray, label: int) -> float: """logits: (vocab,), label: int index""" shifted = logits - logits.max() log_sum_exp = np.log(np.exp(shifted).sum()) return -shifted[label] + log_sum_exp def cross_entropy_batch( logits: np.ndarray, labels: np.ndarray ) -> float: """logits: (N, C), labels: (N,)""" shifted = logits - logits.max(axis=1, keepdims=True) log_sum_exp = np.log(np.exp(shifted).sum(axis=1)) correct = shifted[np.arange(len(labels)), labels] return (-correct + log_sum_exp).mean()

class LRUCache: def __init__(self, capacity: int): self.cap = capacity self.cache = {} # key -> node # sentinel head and tail self.head = Node(0, 0) self.tail = Node(0, 0) self.head.next = self.tail self.tail.prev = self.head def get(self, key: int) -> int: ... def put(self, key: int, value: int) -> None: ...

import torch import torch.nn.functional as F def focal_loss( logits: torch.Tensor, targets: torch.Tensor, alpha: float = 0.25, gamma: float = 2.0, weights: torch.Tensor | None = None, ) -> torch.Tensor: # logits: (N,), targets: (N,) binary float bce = F.binary_cross_entropy_with_logits( logits, targets, reduction="none" ) p_t = torch.sigmoid(logits) p_t = torch.where(targets == 1, p_t, 1 - p_t) focal_factor = (1 - p_t) ** gamma alpha_t = torch.where(targets == 1, alpha, 1 - alpha) loss = alpha_t * focal_factor * bce if weights is not None: loss = loss * weights return loss.mean()

import torch def greedy_step(logits: torch.Tensor) -> torch.Tensor: # logits: (batch, vocab) return logits.argmax(dim=-1) def top_k_step( logits: torch.Tensor, k: int = 50, temperature: float = 1.0, ) -> torch.Tensor: logits = logits / temperature top_k_vals, _ = torch.topk(logits, k, dim=-1) threshold = top_k_vals[..., -1:] logits = logits.masked_fill(logits < threshold, float("-inf")) probs = torch.softmax(logits, dim=-1) return torch.multinomial(probs, num_samples=1).squeeze(-1)

import math def get_lr( step: int, warmup_steps: int, total_steps: int, peak_lr: float, min_lr: float = 0.0, ) -> float: if step < warmup_steps: return peak_lr * step / max(warmup_steps, 1) progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1) cosine = 0.5 * (1 + math.cos(math.pi * progress)) return min_lr + (peak_lr - min_lr) * cosine

class OnlineNormalizer: def __init__(self): self.n = 0 self.mean = 0.0 self.M2 = 0.0 def update(self, x: float) -> None: self.n += 1 delta = x - self.mean self.mean += delta / self.n delta2 = x - self.mean self.M2 += delta * delta2 @property def variance(self) -> float: if self.n < 2: return 0.0 return self.M2 / (self.n - 1) def normalize(self, x: float) -> float: return (x - self.mean) / (self.variance ** 0.5 + 1e-8)

import torch def batched_embed( token_ids: torch.Tensor, # (batch, seq_len) embed_matrix: torch.Tensor, # (vocab, d_model) pad_id: int = 0, ) -> torch.Tensor: # (batch, seq_len, d_model) embeds = embed_matrix[token_ids] # simple index pad_mask = (token_ids == pad_id).unsqueeze(-1) # (batch, seq, 1) embeds = embeds.masked_fill(pad_mask, 0.0) return embeds

from collections import defaultdict from typing import Dict, List, Any def class_metrics( y_true: List[Any], y_pred: List[Any] ) -> Dict[Any, Dict[str, float]]: classes = set(y_true) | set(y_pred) tp = defaultdict(int); fp = defaultdict(int); fn = defaultdict(int) for t, p in zip(y_true, y_pred): if t == p: tp[t] += 1 else: fp[p] += 1 fn[t] += 1 results = {} for c in classes: prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0 f1 = 2*prec*rec/(prec+rec) if (prec+rec) else 0.0 results[c] = {"precision": prec, "recall": rec, "f1": f1, "support": tp[c] + fn[c]} return results

import queue, time from typing import Generator, List, TypeVar T = TypeVar("T") def dynamic_batcher( q: queue.Queue, batch_size: int, max_wait_ms: float, ) -> Generator[List, None, None]: while True: batch, deadline = [], None while len(batch) < batch_size: timeout = None if deadline is None else max( 0, deadline - time.monotonic() ) try: item = q.get(timeout=timeout) batch.append(item) if deadline is None: deadline = time.monotonic() + max_wait_ms / 1000 except queue.Empty: break if batch: yield batch

import numpy as np from numpy import ndarray def iou(a: ndarray, b: ndarray) -> float: """a, b: [x1, y1, x2, y2]""" ix1, iy1 = max(a[0], b[0]), max(a[1], b[1]) ix2, iy2 = min(a[2], b[2]), min(a[3], b[3]) inter = max(0, ix2 - ix1) * max(0, iy2 - iy1) union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter return inter / union if union else 0.0 def nms(boxes: ndarray, scores: ndarray, thresh: float) -> list: order = scores.argsort()[::-1] keep = [] while len(order): i = order[0]; keep.append(i) ious = np.array([iou(boxes[i], boxes[j]) for j in order[1:]]) order = order[1:][ious <= thresh] return keep

import numpy as np def adam_step( theta: np.ndarray, grad: np.ndarray, m: np.ndarray, v: np.ndarray, t: int, lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8, ) -> tuple: m = beta1 * m + (1 - beta1) * grad v = beta2 * v + (1 - beta2) * grad ** 2 m_hat = m / (1 - beta1 ** t) v_hat = v / (1 - beta2 ** t) theta -= lr * m_hat / (np.sqrt(v_hat) + eps) return theta, m, v

def tiled_matmul(A, B, tile=64): N = len(A) C = [[0.0]*N for _ in range(N)] for ii in range(0, N, tile): for jj in range(0, N, tile): for kk in range(0, N, tile): for i in range(ii, min(ii+tile, N)): for k in range(kk, min(kk+tile, N)): a_ik = A[i][k] for j in range(jj, min(jj+tile, N)): C[i][j] += a_ik * B[k][j] return C

def run_agent(initial_messages: list[dict], tools: dict) -> str: messages = list(initial_messages) retries = 0 max_retries = 3 while True: response = claude_complete(messages=messages, tools=list(tools.values())) # check response.content for tool_use blocks # execute tools, append results # if no tool calls, return text # if tool error, increment retries or raise pass

__global__ void reduceSum(float* in, float* out, int n) { extern __shared__ float sdata[]; int tid = threadIdx.x, i = blockIdx.x * blockDim.x + tid; sdata[tid] = (i < n) ? in[i] : 0.0f; __syncthreads(); for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) sdata[tid] += sdata[tid + s]; __syncthreads(); } if (tid == 0) atomicAdd(out, sdata[0]); }

class Codec: def serialize(self, root): res = [] def dfs(node): if not node: res.append('N') return res.append(str(node.val)) dfs(node.left) dfs(node.right) dfs(root) return ','.join(res) def deserialize(self, data): vals = iter(data.split(',')) def dfs(): val = next(vals) if val == 'N': return None node = TreeNode(int(val)) node.left = dfs() node.right = dfs() return node return dfs()

import numpy as np def cosine_knn( Q: np.ndarray, # (q, d) C: np.ndarray, # (n, d) k: int, ) -> tuple[np.ndarray, np.ndarray]: Q_n = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-10) C_n = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-10) sims = Q_n @ C_n.T # (q, n) part = np.argpartition(sims, -k, axis=1)[:, -k:] # sort within top-k rows = np.arange(len(Q))[:, None] top_sims = sims[rows, part] order = np.argsort(-top_sims, axis=1) indices = part[rows, order] scores = top_sims[rows, order] return indices, scores

Questions

Numerically stable cross-entropy loss from logitsCodingeasyVery common

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Implement an LRU cache with O(1) get and putCodingmediumVery common

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Implement a numerically stable focal loss in PyTorchCodingmediumCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Implement greedy decoding and top-k sampling for a language modelCodingmediumCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Implement a cosine annealing LR schedule with warmupCodingeasyCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Online computation of mean and variance for feature normalizationCodingeasyCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Efficient batched embedding lookup with padding maskCodingeasyCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Compute precision, recall, F1, and support from predictionsCodingeasyCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Build a streaming batch inference loop with max latency guaranteeCodingmediumCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Implement IoU and non-maximum suppression for object detectionCodingmediumCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Implement the Adam optimizer update step from scratchCodingmediumCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Tiled matrix multiplication for cache efficiencyCodinghardCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Implement a simple agent loop that retries on tool call errorsCodingmediumCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Implement a parallel reduction sum in CUDACodinghardCommon

As asked

Sample answer outline

Reference implementation (cuda)

Expect these follow-ups

Serialize and deserialize a binary treeCodingmediumCommon

As asked

Sample answer outline

Reference implementation (python)

Expect these follow-ups

Brute-force k-nearest neighbors with cosine similarity in NumPyCodingmediumOccasional

As asked

Sample answer outline

Reference implementation (python)