ML engineer backend interview questions

4 questions on backend for ml engineer candidates. Each entry has the question as asked, a sample answer outline, common follow-ups, and a reference implementation where applicable.

Showing 1 to 4 of 4 backend questions.

As asked

You have a Spark job on Databricks that is taking three times longer than expected. The Spark UI shows most of the time is in the shuffle read stage. Walk me through how you diagnose what is causing the slow shuffle, and what changes you would make to the job or cluster configuration to fix it.

Sample answer outline

A strong answer checks for data skew first by looking at task duration variance in the Spark UI stage view. If one or a few tasks take 100x longer, skew is the likely cause and salting the join key or using skew hints in Databricks SQL is the fix. If skew is not the problem, the candidate should consider shuffle partition count (spark.sql.shuffle.partitions defaults to 200, which can be too low or too high for the data volume), disk spill to executor local storage, and network bandwidth. They should also mention Adaptive Query Execution in Spark 3 which can coalesce partitions automatically.

Expect these follow-ups

How would you use Adaptive Query Execution in Databricks Runtime to reduce shuffle overhead automatically?
Explain the difference between a sort-merge join and a broadcast hash join and when each is appropriate.

company:databrickssparkshuffleperformance-tuningdatabricks-runtimedistributed-systems

As asked

You are building an internal LLM-powered feature that calls the OpenAI API as part of a larger request pipeline. What observability instrumentation would you add, and how would you use it to debug a latency regression or an increase in bad outputs?

Sample answer outline

A strong answer covers structured logging with trace IDs that span from the user request through every LLM call, capturing prompt tokens and completion tokens per call, latency per stage, model version, and sampling parameters. They should mention async log export to avoid adding latency, human-readable traces for debugging bad outputs (storing prompt and response, not just metadata), and dashboards for p50/p95/p99 TTFT and total latency. For quality regressions: eval datasets with golden outputs and automated diffing after deploys.

Expect these follow-ups

How do you balance storing full prompts and responses for debugging against user privacy obligations?
What would you add to detect prompt injection attempts in your traces?

company:openaiobservabilitytracingllm-opsdebuggingmonitoring

As asked

OpenAI's API supports tool calling, where the model can emit a structured function call that your code then executes and returns results for. Walk me through how you would implement this end-to-end: the prompt design, parsing the model's output, executing the tool, handling errors, and returning results back to the model.

Sample answer outline

A strong answer covers: encoding the tool schema (name, description, parameters as JSON Schema) in the API request, detecting when the model emits a tool_call finish reason, parsing the structured JSON arguments safely (validating against the schema before executing), executing the tool in a sandbox or with appropriate guards, and returning the tool result as a tool message in the next turn. They should address error handling: what to pass back if the tool fails, infinite loop prevention, and token budget concerns when results are large.

Expect these follow-ups

How would you prevent a model from calling a tool in an infinite loop?
How do you handle a tool call where the model passes an argument that fails JSON schema validation?

company:openaitool-callingfunction-callingllmapi-designopenai-products

As asked

You notice that the first request to a freshly started API server instance has 10 times higher latency than subsequent requests. Walk me through the sources of this cold-start problem and the strategies you would use to reduce it.

Sample answer outline

A strong answer covers model weight loading from disk to GPU memory (the dominant cost for LLMs), JIT compilation of CUDA kernels, connection pool warming, and dependency initialization. Solutions: pre-loading weights into a memory-mapped file or GPU memory before the first request, using persistent processes rather than spawning a new process per request, warming up with synthetic requests during deployment, and keeping a minimum number of warm instances to absorb traffic spikes.

Expect these follow-ups

How would you handle the cost of keeping warm instances when traffic is very low overnight?
What metrics would you track to know whether your cold-start optimizations are working?

company:openailatencycold-startinferenceperformancegpu

Practise these patterns on AlgoExpert

Recommended

200+ video-explained coding interview questions organised by the patterns covered on this page, with timed practice and solution walkthroughs.

Start practising

An external resource we recommend. AlgoExpert is not affiliated with us and we earn nothing from this link.

Tools to sharpen your prep

All tools

ML engineer backend interview questions

4 questions on backend for ml engineer candidates. Each entry has the question as asked, a sample answer outline, common follow-ups, and a reference implementation where applicable.

Showing 1 to 4 of 4 backend questions.

As asked

Sample answer outline

Expect these follow-ups

How would you use Adaptive Query Execution in Databricks Runtime to reduce shuffle overhead automatically?
Explain the difference between a sort-merge join and a broadcast hash join and when each is appropriate.

company:databrickssparkshuffleperformance-tuningdatabricks-runtimedistributed-systems

As asked

Sample answer outline

Expect these follow-ups

How do you balance storing full prompts and responses for debugging against user privacy obligations?
What would you add to detect prompt injection attempts in your traces?

company:openaiobservabilitytracingllm-opsdebuggingmonitoring

As asked

Sample answer outline

Expect these follow-ups

How would you prevent a model from calling a tool in an infinite loop?
How do you handle a tool call where the model passes an argument that fails JSON schema validation?

company:openaitool-callingfunction-callingllmapi-designopenai-products

As asked

Sample answer outline

Expect these follow-ups

How would you handle the cost of keeping warm instances when traffic is very low overnight?
What metrics would you track to know whether your cold-start optimizations are working?

company:openailatencycold-startinferenceperformancegpu

Practise these patterns on AlgoExpert

Recommended

200+ video-explained coding interview questions organised by the patterns covered on this page, with timed practice and solution walkthroughs.

Start practising

An external resource we recommend. AlgoExpert is not affiliated with us and we earn nothing from this link.

Tools to sharpen your prep

All tools

ML engineer backend interview questions

As asked

Sample answer outline

Expect these follow-ups

As asked

Sample answer outline

Expect these follow-ups

As asked

Sample answer outline

Expect these follow-ups

As asked

Sample answer outline

Expect these follow-ups

Related questions

How would you add observability to an LLM-powered service?

Explain how you would implement tool calling in an LLM system

How would you reduce cold-start latency for an LLM API endpoint?

Tell me about the biggest production incident you've handled

More ml engineer topics

Tools to sharpen your prep

ML engineer backend interview questions

As asked

Sample answer outline

Expect these follow-ups

As asked

Sample answer outline

Expect these follow-ups

As asked

Sample answer outline

Expect these follow-ups

As asked

Sample answer outline

Expect these follow-ups

Related questions

How would you add observability to an LLM-powered service?

Explain how you would implement tool calling in an LLM system

How would you reduce cold-start latency for an LLM API endpoint?

Tell me about the biggest production incident you've handled

More ml engineer topics

Tools to sharpen your prep

Questions

A Spark job is bottlenecked on shuffle. Walk me through your diagnosis.BackendmediumVery common

As asked

Sample answer outline

Expect these follow-ups

How would you add observability to an LLM-powered service?BackendmediumCommon

As asked

Sample answer outline

Expect these follow-ups

Explain how you would implement tool calling in an LLM systemBackendmediumCommon

As asked

Sample answer outline

Expect these follow-ups

How would you reduce cold-start latency for an LLM API endpoint?BackendmediumOccasional

As asked

Sample answer outline

Expect these follow-ups

Related questions

How would you add observability to an LLM-powered service?

Explain how you would implement tool calling in an LLM system

How would you reduce cold-start latency for an LLM API endpoint?

Tell me about the biggest production incident you've handled

More ml engineer topics

Tools to sharpen your prep

Questions

A Spark job is bottlenecked on shuffle. Walk me through your diagnosis.BackendmediumVery common

As asked

Sample answer outline

Expect these follow-ups

How would you add observability to an LLM-powered service?BackendmediumCommon

As asked

Sample answer outline

Expect these follow-ups

Explain how you would implement tool calling in an LLM systemBackendmediumCommon

As asked

Sample answer outline

Expect these follow-ups

How would you reduce cold-start latency for an LLM API endpoint?BackendmediumOccasional

As asked

Sample answer outline

Expect these follow-ups

Related questions

How would you add observability to an LLM-powered service?

Explain how you would implement tool calling in an LLM system

How would you reduce cold-start latency for an LLM API endpoint?

Tell me about the biggest production incident you've handled

More ml engineer topics

Tools to sharpen your prep