Question 1

Design an internal LLM gateway used by ten product teams. It must support provider fallback, prompt versioning, cost controls, and audit logging.

Accepted Answer

The gateway should be a policy and observability layer, not just an HTTP proxy. It needs typed request contracts, prompt and model version identifiers, tenant-level budgets, retries with idempotency, and provider fallback rules that are explicit about quality and compliance tradeoffs. Audit logs should capture inputs, outputs, metadata, and redaction state according to data policy. The answer should include latency budgets and failure modes, such as a fallback model producing a different JSON shape or different safety behaviour. Strong candidates avoid automatic fallback for high-risk flows unless the evals prove equivalence.
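As a minimal sketch of the fallback rule described above, the following filters the fallback chain for high-risk flows so that only routes marked as eval-equivalent remain. The `Route` type, the `eval_equivalent` flag, and the provider and model names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str          # hypothetical provider name
    model: str             # hypothetical model identifier
    eval_equivalent: bool  # True only if evals showed parity with the primary


def fallback_chain(primary, fallbacks, high_risk):
    """Return the ordered routes the gateway may try for one request."""
    if high_risk:
        # High-risk flows only fall back to routes proven equivalent by evals.
        allowed = [r for r in fallbacks if r.eval_equivalent]
    else:
        allowed = list(fallbacks)
    return [primary, *allowed]


primary = Route("provider-a", "large-v1", True)
fallbacks = [
    Route("provider-b", "large-v1", True),
    Route("provider-c", "small-v1", False),
]
print([r.provider for r in fallback_chain(primary, fallbacks, high_risk=True)])
```

Keeping the rule as data on the route, rather than as branching inside the retry loop, makes the policy easy to audit alongside the prompt and model version identifiers.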

Question 2

Design a shared GPU cluster for research training jobs, batch fine-tunes, and low-latency inference. How do you schedule work fairly and efficiently?

Accepted Answer

Separate workload classes because inference and long training jobs have different latency and preemption tolerance. Use quotas, priority classes, gang scheduling for distributed training, and preemptible queues for exploratory work. Strong answers discuss GPU fragmentation, topology awareness, checkpointing, fair-share policies, and visibility into queue time by team. Inference should usually live in reserved pools with autoscaling and admission control rather than competing directly with training. The common failure is maximising utilisation while making urgent production workloads wait behind low-priority experiments.
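Gang scheduling and GPU fragmentation can be illustrated with a small all-or-nothing placement check. This is only a sketch: the function name, the node names, and the "most free GPUs first" ordering are assumptions, and a real scheduler would also weigh topology and priority.

```python
from typing import Optional


def gang_place(free_gpus: dict, workers: int, gpus_per_worker: int) -> Optional[dict]:
    """All-or-nothing placement: workers per node, or None if the gang does not fit."""
    placement = {}
    remaining = workers
    # Visit nodes with the most free GPUs first (a simple heuristic, not a policy).
    for node, free in sorted(free_gpus.items(), key=lambda kv: kv[1], reverse=True):
        fit = min(free // gpus_per_worker, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement
    # Partial placements are discarded: a distributed job needs every worker.
    return None


free = {"n1": 8, "n2": 4, "n3": 4}
print(gang_place(free, workers=3, gpus_per_worker=4))
print(gang_place(free, workers=2, gpus_per_worker=8))
```

The second call returns `None` even though 16 GPUs are free in total, because only one node can host a whole 8-GPU worker. That is the fragmentation problem in miniature.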

Question 3

Design the networking, storage, and scheduling infrastructure for a 1000-GPU A100 cluster intended to run multi-week pre-training jobs on 100B+ parameter models. Walk through your choices for inter-node fabric, intra-node connectivity, shared storage, job scheduling, and failure handling.

Accepted Answer

Inter-node: fat-tree or Dragonfly InfiniBand HDR/NDR at 200/400 Gb/s per port, non-blocking, i.e. a 1:1 ratio with no oversubscription at the spine. Intra-node: NVLink 3.0 via NVSwitch for 8-GPU nodes (third-generation NVLink is what A100 ships with; NVLink 4.0 belongs to the H100 generation). Storage: Lustre or GPFS parallel filesystem for checkpoints (hundreds of GB/s aggregate write), S3/object storage for training data (streaming reads via WebDataset). Scheduling: Slurm or Kubernetes+Volcano with gang scheduling, preemption tiers, and GPU health checks pre-launch. Failure handling: async checkpointing every N steps, elastic restart via torchrun, node health monitoring with DCGM, and automated node drain on ECC errors.
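One way to pick the "every N steps" checkpoint interval is Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). The sketch below applies it under stated assumptions: failures are independent across nodes, and the per-node MTBF (5,000 hours) and checkpoint stall (60 seconds) are hypothetical figures, not measurements.

```python
import math


def cluster_mtbf_s(node_mtbf_hours: float, nodes: int) -> float:
    """Cluster MTBF in seconds, assuming independent node failures."""
    return node_mtbf_hours * 3600.0 / nodes


def young_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval that minimises lost work."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


nodes = 1000 // 8                                   # 8-GPU nodes
mtbf = cluster_mtbf_s(node_mtbf_hours=5000.0, nodes=nodes)
interval = young_interval_s(checkpoint_cost_s=60.0, mtbf_s=mtbf)
print(f"cluster MTBF: {mtbf / 3600:.0f} h, checkpoint every ~{interval / 60:.0f} min")
```

With asynchronous checkpointing, the relevant cost is the time training is actually stalled rather than the full write time, which pushes the interval shorter.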

Question 4

Design a serving system for a 70B parameter LLM that needs to handle 10,000 requests per minute with p95 time-to-first-token under 2 seconds and p95 per-token latency under 100 ms. Describe the architecture from load balancer to GPU fleet.

Accepted Answer

Use a load balancer that routes to a pool of serving replicas, each running vLLM or TGI with tensor parallelism across 4 or 8 GPUs. Separate the prefill (compute-bound) and decode (memory-bandwidth-bound) phases onto different GPU pools (disaggregated serving). Queue incoming requests through Redis or a message broker, and apply admission control to prevent KV cache exhaustion. Use prefix caching for shared system prompts. Monitor p50/p95 time to first token (TTFT) and inter-token latency (ITL) with Prometheus, and alert on drops in the prefix-cache hit rate. Autoscale replicas based on queue depth and GPU utilization.
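Admission control against KV cache exhaustion needs an estimate of cache bytes per token. The sketch below assumes a model shape matching the published Llama 2 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) with a 2-byte cache dtype; the 40 GiB budget and the `admit` function are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """KV cache bytes for one token; the factor of 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def admit(used_tokens: int, request_tokens: int, budget_bytes: int, per_token: int) -> bool:
    """Admit only if prompt plus max new tokens still fits in the KV budget."""
    return (used_tokens + request_tokens) * per_token <= budget_bytes


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2)
budget = 40 * 2**30            # assumed KV budget per replica: 40 GiB
capacity_tokens = budget // per_token
print(per_token, capacity_tokens)
print(admit(used_tokens=120_000, request_tokens=8_192, budget_bytes=budget, per_token=per_token))
```

Requests that fail the check would stay queued rather than being started and then evicted mid-decode.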

Question 5

Design a checkpoint management system for a team running multiple concurrent 100B model training experiments. The system needs to handle multi-terabyte checkpoints, fast restore, experiment versioning, and storage cost management.

Accepted Answer

Write checkpoints asynchronously to a fast tier (NFS/Lustre) and asynchronously replicate to object storage (S3). Keep the last 3 checkpoints on fast storage and archive older ones to S3 Glacier. Tag each checkpoint with experiment ID, step, and git commit hash. Implement a sidecar process per training run that uploads and garbage-collects old checkpoints. For fast restore, pre-stage the most recent checkpoint to local SSD before the training job starts. Track checkpoint metadata in a lightweight database (SQLite or Postgres) for querying by experiment.
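The "keep the last 3 on fast storage" rule could be expressed as a pure function that the sidecar calls on each pass. This is a sketch; the `Checkpoint` record and `plan_retention` name are hypothetical, and the actual upload and delete calls are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    experiment_id: str
    step: int
    git_commit: str


def plan_retention(checkpoints, keep_fast=3):
    """Split one experiment's checkpoints into (keep on fast tier, archive to cold tier)."""
    ordered = sorted(checkpoints, key=lambda c: c.step, reverse=True)
    return ordered[:keep_fast], ordered[keep_fast:]


ckpts = [Checkpoint("exp-1", step, "abc123") for step in (1000, 2000, 3000, 4000, 5000)]
keep, archive = plan_retention(ckpts)
print([c.step for c in keep], [c.step for c in archive])
```

Separating the decision from the side effects makes the garbage-collection policy easy to test before it is allowed to delete multi-terabyte files.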

Question 6

Design a system that continuously monitors GPU health across a 1000-node cluster, detects failing GPUs before they crash training runs, and automatically drains and replaces affected nodes without human intervention.

Accepted Answer

Deploy DCGM (Data Center GPU Manager) on each node as a DaemonSet, exporting metrics to Prometheus. Define alerting rules for ECC double-bit errors, Xid errors above threshold, PCIe bandwidth drop, and temperature excursions. On alert, trigger a node drain workflow: cordon the node in Kubernetes, checkpoint the running job if it supports elastic restart, evict pods, and label the node for hardware inspection. Maintain a pool of spare nodes that can be added to replace drained nodes. Log all GPU events to a central time-series store for failure pattern analysis.
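The alert-to-drain logic above can be sketched as two small functions. The metric keys, thresholds, and step names below are hypothetical placeholders rather than DCGM field names or Kubernetes API calls.

```python
DRAIN_STEPS = ["cordon", "checkpoint", "evict", "label_for_inspection"]


def drain_reasons(m: dict, xid_threshold: int = 3,
                  max_temp_c: float = 90.0, min_pcie_fraction: float = 0.5) -> list:
    """Return the rule names that fired for one GPU's metrics (hypothetical keys)."""
    reasons = []
    if m.get("ecc_double_bit_errors", 0) > 0:
        reasons.append("ecc_double_bit")
    if m.get("xid_errors_last_hour", 0) > xid_threshold:
        reasons.append("xid_rate")
    if m.get("gpu_temp_c", 0.0) > max_temp_c:
        reasons.append("temperature")
    if m.get("pcie_bandwidth_fraction", 1.0) < min_pcie_fraction:
        reasons.append("pcie_bandwidth")
    return reasons


def drain_plan(reasons: list, supports_elastic_restart: bool) -> list:
    """Ordered drain steps; checkpoint only if the job can restart elastically."""
    if not reasons:
        return []
    return [s for s in DRAIN_STEPS if s != "checkpoint" or supports_elastic_restart]


sample = {"ecc_double_bit_errors": 1, "gpu_temp_c": 71.0}
print(drain_plan(drain_reasons(sample), supports_elastic_restart=False))
```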

Question 7

Your organization has 2000 GPUs shared across 10 teams with different priorities and quotas. Design a scheduling system that enforces quotas, provides fair share for idle capacity, supports preemption, and gives teams visibility into their usage and wait times.

Accepted Answer

Use Kubernetes with Kueue or Volcano to implement hierarchical queues per team, with hard quota enforcement and borrowing policies for idle capacity. Preemption policy: lower-priority jobs yield to higher-priority ones when a team is above quota, and borrowed capacity is always preemptible. Expose a dashboard (Grafana) showing per-team GPU hours used, queue depth, estimated wait time, and utilization. Implement a fair-share algorithm (dominant resource fairness) that accounts for multi-dimensional resources (GPU, CPU, memory). Alert teams when their jobs are about to be preempted.
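Dominant resource fairness reduces to one rule: compute each team's largest usage fraction across resources, then offer the next allocation to the team with the smallest such fraction. A minimal sketch, with hypothetical cluster capacities and team usages:

```python
def dominant_share(usage: dict, capacity: dict) -> float:
    """Largest fraction of any single resource that this team holds."""
    return max(usage.get(r, 0) / capacity[r] for r in capacity)


def next_team(usages: dict, capacity: dict) -> str:
    """DRF offers the next allocation to the team with the lowest dominant share."""
    return min(usages, key=lambda team: dominant_share(usages[team], capacity))


capacity = {"gpu": 2000, "cpu": 32000, "mem_gib": 256000}   # assumed cluster totals
usages = {
    "team-a": {"gpu": 400, "cpu": 3200, "mem_gib": 25600},  # GPU-dominant
    "team-b": {"gpu": 100, "cpu": 9600, "mem_gib": 12800},  # CPU-dominant
}
print({t: round(dominant_share(u, capacity), 2) for t, u in usages.items()})
print(next_team(usages, capacity))
```

Here team-b holds fewer GPUs but a larger dominant share (30% of CPU against team-a's 20% of GPUs), so team-a is served next.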

Question 8

Design a data pipeline that processes raw web crawl data (100 TB) into tokenized training shards that can feed a 1000-GPU training run at 1M tokens per second without I/O bottlenecking the GPUs.

Accepted Answer

Offline pipeline: deduplicate with MinHash/LSH, filter with quality classifiers, tokenize with a fast tokenizer (tiktoken/sentencepiece), and write to Parquet or binary WebDataset shards of 500 MB each on object storage. Online pipeline: workers read shards in parallel, shuffle at the shard and sample level, and feed a DataLoader with num_workers set for CPU-GPU overlap. At 1M tokens/s with 2 bytes per token, the tokenized stream is about 2 MB/s in aggregate (2-byte IDs imply a vocabulary of at most 65,536; 4-byte IDs double the rate); ensure aggregate I/O from storage exceeds this with headroom by running enough reader workers. Use a prefetch buffer to absorb I/O variance.
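A quick check of the bandwidth arithmetic, as a sketch. It takes 1M tokens/s, 2 bytes per token, 500 MB shards, and 1000 GPUs from the question and answer, which multiply out to roughly 2 MB/s for the pre-tokenized stream; the function names are illustrative.

```python
def stream_bytes_per_s(tokens_per_s: float, bytes_per_token: int) -> float:
    """Aggregate bytes per second of pre-tokenized data the trainers consume."""
    return tokens_per_s * bytes_per_token


def seconds_per_shard(shard_bytes: float, bytes_per_s: float) -> float:
    """How long one shard lasts at the aggregate consumption rate."""
    return shard_bytes / bytes_per_s


rate = stream_bytes_per_s(tokens_per_s=1_000_000, bytes_per_token=2)
per_gpu = 1_000_000 / 1000                      # tokens per second per GPU
shard_life = seconds_per_shard(500e6, rate)
print(f"{rate / 1e6:.1f} MB/s aggregate, {per_gpu:.0f} tokens/s per GPU, "
      f"one 500 MB shard every {shard_life:.0f} s")
```

At this rate the concern is less raw bandwidth than variance: a slow shard fetch from object storage stalls a rank unless the prefetch buffer already holds the next shard.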

Question 9

Your LLM serving cluster sees 10x traffic variation between peak and off-peak hours. Design an autoscaling system that minimizes cost while maintaining SLA targets for latency and availability. How do you handle the slow startup time of large model replicas?

Accepted Answer

Use a custom scaling metric (queue depth per replica plus p95 TTFT) rather than CPU, memory, or raw GPU utilization, none of which is a good proxy for LLM load. Keep a warm minimum fleet for baseline traffic; scale out preemptively using traffic forecasts (time-of-day patterns). For slow startup, keep standby replicas with model weights already loaded in memory but not accepting traffic, and gate traffic on a readiness check that passes only once loading is complete. Use KEDA (Kubernetes Event-driven Autoscaling) with its Prometheus scaler for the custom metric. Tear down excess replicas slowly to avoid thrashing.
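The scaling decision can be sketched with the same proportional rule the Kubernetes Horizontal Pod Autoscaler uses, desired = ceil(current × metric / target), plus a cap on how fast replicas are removed. The metric values, bounds, and the one-replica scale-down step are assumptions for illustration.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_r: int, max_r: int) -> int:
    """Proportional rule: scale so the per-replica metric returns to target."""
    raw = math.ceil(current * metric / target)
    return max(min_r, min(max_r, raw))


def limit_scale_down(current: int, desired: int, max_step: int = 1) -> int:
    """Scale out immediately, but remove at most max_step replicas per decision."""
    if desired >= current:
        return desired
    return max(desired, current - max_step)


# Queue depth per replica is 12 against a target of 8: scale 10 -> 15.
up = desired_replicas(current=10, metric=12.0, target=8.0, min_r=4, max_r=40)
# Load drops sharply: the rule asks for 4, but only one replica is removed now.
down = limit_scale_down(current=15, desired=desired_replicas(15, 2.0, 8.0, 4, 40))
print(up, down)
```

The asymmetry is deliberate: adding a large-model replica takes minutes, so removing one too eagerly is far more expensive than keeping it briefly idle.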

Question 10

Design the infrastructure for tracking training experiments at a team of 50 ML engineers running hundreds of concurrent training jobs. Cover metrics collection, artifact storage, hyperparameter management, and reproducibility guarantees.

Accepted Answer

Collect per-step metrics (loss, learning rate, gradient norm, MFU) via a lightweight metrics library (W&B, MLflow, or a custom Prometheus exporter). Store model checkpoints and training artifacts in versioned object storage keyed by experiment ID and run ID. Link each run to the exact git commit, dataset version (DVC or dataset hashes), and hyperparameters. Provide a comparison UI for runs. Reproducibility: seed all RNG sources (Python, NumPy, PyTorch) and log the seeds; record the environment (Docker image, CUDA version, dependency hashes). Alert on loss divergence or NaN via metric threshold checks.
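A run manifest that ties together commit, dataset version, hyperparameters, and seed might look like the sketch below. It uses only the standard library; the field names are hypothetical, and in a real training job NumPy and PyTorch would be seeded at the same point (for example with `numpy.random.seed` and `torch.manual_seed`).

```python
import hashlib
import json
import platform
import random


def config_hash(hparams: dict) -> str:
    """Stable hash of hyperparameters, independent of key order."""
    canonical = json.dumps(hparams, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def start_run(hparams: dict, git_commit: str, dataset_version: str, seed: int) -> dict:
    """Seed the RNG and return the manifest to store next to the run's artifacts."""
    random.seed(seed)  # NumPy / PyTorch seeding would go here as well
    return {
        "git_commit": git_commit,
        "dataset_version": dataset_version,
        "hparams": hparams,
        "hparams_sha256": config_hash(hparams),
        "seed": seed,
        "python_version": platform.python_version(),
    }


manifest = start_run({"lr": 3e-4, "batch_size": 1024}, "abc123", "crawl-v7", seed=1234)
print(json.dumps(manifest, indent=2))
```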

Question 11

A multimodal generation feature is too slow for interactive use. How would you reduce latency without destroying output quality?

Accepted Answer

Break latency into queueing, pre-processing, model inference, post-processing, and network time before changing the model. Candidate optimisations include smaller distilled models, quantisation, caching repeated inputs, batching where it does not hurt interactivity, progressive previews, and asynchronous high-quality refinement. Strong answers evaluate quality with task-specific tests after every optimisation rather than assuming faster is acceptable. They also mention capacity planning because queueing delay can dominate model time under bursty demand. The common mistake is jumping straight to a smaller model without measuring the real bottleneck.
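The "measure before changing the model" step can be as simple as turning per-stage timings into shares of the total. The stage timings below are invented for illustration, not measurements.

```python
def latency_shares(stage_ms: dict) -> dict:
    """Fraction of end-to-end latency spent in each stage."""
    total = sum(stage_ms.values())
    return {stage: ms / total for stage, ms in stage_ms.items()}


def bottleneck(stage_ms: dict) -> str:
    """Stage contributing the most latency."""
    return max(stage_ms, key=stage_ms.get)


# Hypothetical p95 timings for one request under bursty load.
p95_ms = {
    "queueing": 900.0,
    "pre_processing": 50.0,
    "model_inference": 600.0,
    "post_processing": 30.0,
    "network": 20.0,
}
shares = latency_shares(p95_ms)
print(bottleneck(p95_ms), f"{shares[bottleneck(p95_ms)]:.0%}")
```

In this made-up profile queueing outweighs inference, so a smaller model alone would leave most of the latency in place, which is the common mistake the answer warns about.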

Question 12

Design a training framework that can shrink or grow the number of workers mid-run in response to node failures or preemptions, without losing more than a few minutes of progress. What components does this require, and what are the consistency challenges?

Accepted Answer

This requires frequent checkpointing (every few hundred steps), a coordinator that monitors worker health and triggers rendezvous on membership change, and a training loop that re-initializes FSDP/DDP after each membership change using the latest checkpoint. PyTorch Elastic (torchelastic, launched via torchrun) provides the rendezvous and worker-restart machinery, with c10d or etcd as the rendezvous backend; saving and loading checkpoints remains the training script's responsibility. Consistency challenges: optimizer state sharding must be rebalanced when world_size changes, the learning rate schedule must account for the effective batch size change, and the data sampler must resume from the exact position to avoid re-reading samples.
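The batch-size and sampler consistency points can be sketched as follows. Two options are shown for a world_size change: hold the global batch fixed by adjusting gradient accumulation, or let it change and rescale the learning rate linearly (one common heuristic, not the only one). The sampler helper assumes a fixed, non-reshuffled sample order; all names are hypothetical.

```python
def grad_accum_steps(global_batch: int, world_size: int, micro_batch: int) -> int:
    """Accumulation steps that keep the global batch constant for a new world size."""
    per_step = world_size * micro_batch
    if global_batch % per_step != 0:
        raise ValueError("global batch is not divisible by world_size * micro_batch")
    return global_batch // per_step


def scaled_lr(base_lr: float, base_global_batch: int, new_global_batch: int) -> float:
    """Linear scaling heuristic when the global batch is allowed to change."""
    return base_lr * new_global_batch / base_global_batch


def resume_indices(samples_consumed: int, dataset_size: int, rank: int, world_size: int) -> range:
    """Indices this rank reads next, continuing from the global position."""
    return range(samples_consumed + rank, dataset_size, world_size)


print(grad_accum_steps(1024, world_size=64, micro_batch=4))   # before shrink
print(grad_accum_steps(1024, world_size=32, micro_batch=4))   # after losing half the workers
print(list(resume_indices(100, 110, rank=1, world_size=4)))
```

Storing `samples_consumed` in the checkpoint, rather than a per-rank offset, is what lets the position survive a change in world size.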

Questions

Design an LLM gateway with fallbacks (System design, hard, very common)

Design scheduling for a shared GPU cluster (System design, hard, very common)

Design a training cluster for 1000-GPU LLM runs (System design, hard, very common)

Design a high-throughput LLM inference serving system (System design, hard, very common)

Design a checkpoint storage and versioning system for LLM training (System design, hard, common)

Design a GPU health monitoring and automated remediation system (System design, hard, common)

Design multi-tenant GPU cluster scheduling with fairness (System design, hard, common)

Design a scalable training data pipeline for pre-training (System design, hard, common)

Design autoscaling for a variable-load LLM inference service (System design, hard, common)

Design experiment tracking infrastructure for large-scale training (System design, medium, common)

Trade off generation quality and latency (System design, medium, occasional)

Design an elastic distributed training system (System design, hard, occasional)

