Question 1

You own a daily ETL job that loads orders from an upstream API into the warehouse. Yesterday's job failed halfway through and will be retried. How do you make the pipeline idempotent?

Accepted Answer

A strong answer defines the target invariant first: re-running the job for the same logical window should produce the same final table state, with no duplicate rows or partial totals. Use a stable business key such as order_id, write into a staging table, validate counts and required columns, then merge into the target inside a transaction or via a partition swap. Advance the high-water mark only when the merge commits, ideally in the same transaction, so a failed run does not move the cursor. Candidates often trip up by relying only on append-only loads or by deleting a partition before they know the replacement data is complete.
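
The stage, validate, merge sequence can be sketched with Python's built-in sqlite3 module standing in for the warehouse. This is a minimal illustration under stated assumptions, not a production loader: the table names (orders, orders_staging, load_state), the column set, and the helper names make_db and load_window are all hypothetical, and INSERT OR REPLACE stands in for whatever MERGE statement the real warehouse offers.

```python
import sqlite3


def make_db():
    """Create an in-memory database with hypothetical target, staging, and state tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL NOT NULL, order_date TEXT NOT NULL);
        CREATE TABLE orders_staging (order_id TEXT, amount REAL, order_date TEXT);
        CREATE TABLE load_state (job TEXT PRIMARY KEY, high_water_mark TEXT NOT NULL);
        """
    )
    return conn


def load_window(conn, window, rows):
    """Load one logical window; safe to re-run with the same input."""
    # "with conn" commits on success and rolls back if any step raises,
    # so the merge and the high-water mark move together or not at all.
    with conn:
        conn.execute("DELETE FROM orders_staging")
        conn.executemany("INSERT INTO orders_staging VALUES (?, ?, ?)", rows)

        # Validate before touching the target: row count and required columns.
        staged = conn.execute("SELECT COUNT(*) FROM orders_staging").fetchone()[0]
        missing = conn.execute(
            "SELECT COUNT(*) FROM orders_staging WHERE order_id IS NULL OR amount IS NULL"
        ).fetchone()[0]
        if staged != len(rows) or missing:
            raise ValueError(f"validation failed for window {window}")

        # Merge on the business key: a retry overwrites rows instead of duplicating them.
        conn.execute("INSERT OR REPLACE INTO orders SELECT * FROM orders_staging")
        conn.execute(
            "INSERT OR REPLACE INTO load_state VALUES ('orders', ?)", (window,)
        )
```

Calling load_window twice with the same rows leaves orders unchanged after the second call, and a run that fails validation leaves both orders and load_state exactly as they were.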

Question 2

Your company has 40 data teams writing Airflow or Dagster jobs in different styles. How would you build a shared orchestration platform without becoming a ticket queue?

Accepted Answer

The platform should provide paved roads: project templates, deployment automation, secrets access, standard retries, alerting, and lineage out of the box. Teams should own their DAG code while the platform owns runtime reliability, conventions, and guardrails. Multi-tenancy needs isolation for dependencies, quotas, credentials, and noisy workloads. Adoption is won by making the easiest path the best path, not by banning all custom jobs on day one. Candidates often miss the product-management side of platform work: documentation, migration support, and measuring whether teams are actually faster.
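
One way to picture "paved roads with guardrails" is a policy object that ships sensible defaults and lets teams override them only within platform limits. The sketch below is purely illustrative: JobPolicy, PLATFORM_LIMITS, resolve_policy, and every default value are hypothetical names and numbers, not part of Airflow or Dagster.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class JobPolicy:
    """Platform defaults a team gets without writing any configuration."""
    retries: int = 3
    retry_delay_seconds: int = 300
    alert_channel: str = "#data-alerts"
    max_cpu_cores: int = 4


# Hypothetical upper bounds the platform enforces for every tenant.
PLATFORM_LIMITS = {"retries": 10, "max_cpu_cores": 16}


def resolve_policy(team_overrides):
    """Apply team overrides on top of the defaults, then check guardrails."""
    # replace() raises TypeError for unknown fields, which catches typos early.
    policy = replace(JobPolicy(), **team_overrides)
    for field_name, limit in PLATFORM_LIMITS.items():
        value = getattr(policy, field_name)
        if value > limit:
            raise ValueError(f"{field_name}={value} exceeds platform limit {limit}")
    return policy
```

A team that passes no overrides gets the default policy, which is the point of the design: the easiest path is also the supported one, and custom settings remain possible inside the limits.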

Question 3

Delta Lake maintains a JSON transaction log that grows with every operation. How does checkpoint compaction work, when does it trigger, and what does VACUUM actually delete versus what it leaves behind?

Accepted Answer

Strong answers explain that Delta checkpoints consolidate the JSON log into a Parquet state file every 10 commits by default, so readers load the latest checkpoint and replay only the commits after it. VACUUM removes data files that the current table state no longer references and that are older than the retention threshold (default 7 days); it does not delete transaction log files, which are cleaned up separately after checkpoints according to the log retention setting (delta.logRetentionDuration, default 30 days). Candidates should note that running VACUUM with a short retention window breaks time-travel queries to versions that need the deleted files and can cause concurrent readers to fail on missing files.
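
The effect of checkpoints on readers can be modeled in a few lines. The sketch below assumes a single-file checkpoint is written at every version that is a positive multiple of the interval, which simplifies real behavior (multi-part checkpoints and skipped checkpoints are ignored); log_file and files_to_read are hypothetical helper names, while the 20-digit zero-padded file naming follows the Delta log layout.

```python
def log_file(version, checkpoint=False):
    """Return the _delta_log file name for a commit or a checkpoint."""
    suffix = "checkpoint.parquet" if checkpoint else "json"
    return f"{version:020d}.{suffix}"


def files_to_read(latest_version, checkpoint_interval=10):
    """List the log files a reader needs to reconstruct the latest table state."""
    # Assumption: a checkpoint exists at every positive multiple of the interval.
    last_checkpoint = (latest_version // checkpoint_interval) * checkpoint_interval
    files = []
    if last_checkpoint > 0:
        files.append(log_file(last_checkpoint, checkpoint=True))
        first_commit = last_checkpoint + 1
    else:
        first_commit = 0  # no checkpoint yet: replay every JSON commit
    files.extend(log_file(v) for v in range(first_commit, latest_version + 1))
    return files
```

Under these assumptions a table at version 23 needs one checkpoint plus three JSON commits, instead of replaying all 24 commit files.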

Question 4

Explain how Kafka achieves exactly-once semantics from producer through consumer. What are idempotent producers, transactions, and the read-committed isolation level, and where can each break down?

Accepted Answer

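A complete answer covers idempotent producers, which attach a producer ID and per-partition sequence numbers so the broker can deduplicate retried sends within a single producer session; transactional producers, which wrap multi-partition writes (and, in a consume-transform-produce loop, the consumed offsets) in atomic transactions; and consumers setting isolation.level=read_committed so they skip records from aborted or still-open transactions. Failure modes include a producer restarting without a stable transactional.id and receiving a new producer ID, which resets sequence state so earlier sends can be duplicated; zombie producers that keep writing until an epoch bump fences them; and exactly-once breaking down at the sink if the sink itself is not transactional or idempotent.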
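
The sequence-number mechanism, and the way it stops helping across producer sessions, can be shown with a toy model. This is not the Kafka client or broker API: ToyPartitionLog is a hypothetical class that keeps only the last accepted sequence number per producer ID for one partition, which is a deliberate simplification of what a real broker tracks.

```python
class ToyPartitionLog:
    """Toy model of one partition that deduplicates by (producer_id, sequence)."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id, seq, value):
        """Return True if the record was appended, False if it was a duplicate."""
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return False  # retry of a send that was already written: drop it
        if seq != last + 1:
            raise ValueError(f"out-of-order sequence {seq}, expected {last + 1}")
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


log = ToyPartitionLog()
log.append(producer_id=1, seq=0, value="a")
log.append(producer_id=1, seq=0, value="a")  # network retry: deduplicated
log.append(producer_id=1, seq=1, value="b")
# The producer restarts without a transactional.id and is given a new ID,
# so re-sending "b" is no longer recognized as a duplicate.
log.append(producer_id=2, seq=0, value="b")
```

After these calls the toy log holds "a", "b", "b": the retry within one session was dropped, but the re-send from the new session was not, which is why idempotence alone does not give end-to-end exactly-once.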

Question 5

A Spark job performing a large groupBy is spilling heavily to disk and taking twice as long as expected. Walk me through how you diagnose the root cause and what levers you would pull to fix it.

Accepted Answer

Strong answers start with reading the Spark UI: checking shuffle read/write sizes, spill (memory) and spill (disk) metrics per stage, and the spread of per-task durations and input sizes, since a few tasks far above the median point to skew. Root causes include data skew, insufficient executor memory, or too few shuffle partitions. Fixes include increasing spark.executor.memory or spark.memory.fraction, raising spark.sql.shuffle.partitions or repartitioning to reduce per-task data, salting keys to address skew, or using a partial aggregation (combiner) to reduce shuffle volume.
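
Key salting is easier to reason about with a small pure-Python model than with a full Spark job. The sketch below assumes a sum aggregation and uses a round-robin salt so the result is deterministic (a Spark job would more typically use a random salt); largest_group and salted_sum are hypothetical helper names, and group size stands in for the amount of data a single task must hold.

```python
from collections import Counter


def largest_group(keys):
    """Size of the biggest group, a proxy for the heaviest single task."""
    return max(Counter(keys).values())


def salted_sum(rows, num_salts):
    """Two-stage sum: aggregate per (key, salt), then combine per key."""
    # Stage 1: the hot key is split across num_salts smaller groups.
    partial = Counter()
    for i, (key, value) in enumerate(rows):
        partial[(key, i % num_salts)] += value
    # Stage 2: drop the salt; each key now has at most num_salts partial rows.
    final = Counter()
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)


rows = [("hot", 1)] * 1000 + [("cold", 1)] * 10
unsalted_max = largest_group(key for key, _ in rows)
salted_max = largest_group((key, i % 8) for i, (key, _) in enumerate(rows))
totals = salted_sum(rows, num_salts=8)
```

With eight salts the largest group shrinks from 1000 rows to 125 while the totals per key stay the same. The trade-off is a second aggregation stage, so salting pays off only when skew, rather than overall volume, is the cause of the spill.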

Question 6

A high-frequency Kafka-to-Iceberg ingestion pipeline produces thousands of small Parquet files per hour. Describe the impact on query performance and the strategies you would use to compact files without impacting concurrent readers or writers.

Accepted Answer

Small files add per-file open and read requests against object storage on every query, inflate the manifests that Iceberg scans during planning, and increase snapshot metadata overhead. Compaction rewrites small files into target-size files (typically 128 to 512 MB). Iceberg's RewriteDataFiles action can run alongside active writers because it commits atomically under optimistic concurrency: it adds the compacted files and removes the old ones in a single snapshot, so readers always see a consistent state, though a rewrite can fail at commit time and need a retry if it conflicts with a concurrent write. The candidate should know bin-packing versus sort-based rewrite strategies and how to schedule compaction (Airflow, a Spark job, or Iceberg maintenance procedures).
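
The bin-packing idea can be illustrated with a generic first-fit-decreasing planner. This is a sketch of the general technique, not Iceberg's actual implementation or API: plan_compaction is a hypothetical function, sizes are plain megabyte numbers, and files already at or above the target are simply left alone.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Group small files into rewrite batches of at most target_mb each."""
    groups = []  # each entry: {"total": combined size, "files": [sizes]}
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough; rewriting it gains nothing
        for group in groups:
            # First fit: put the file in the first group with room for it.
            if group["total"] + size <= target_mb:
                group["files"].append(size)
                group["total"] += size
                break
        else:
            groups.append({"total": size, "files": [size]})
    # A group holding a single file would be rewritten into itself, so skip it.
    return [group["files"] for group in groups if len(group["files"]) > 1]
```

For example, plan_compaction([600, 300, 200, 100, 50, 10]) returns [[300, 200, 10], [100, 50]]: five small files become two rewrite batches and the 600 MB file is untouched. A sort-based strategy would instead reorder rows across files to improve pruning, at a higher rewrite cost.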

Top data platform engineer interview questions

Design an idempotent daily ETL load (Role-specific, medium, very common)

Design a multi-team orchestration platform (System design, hard, very common)

Delta Lake transaction log compaction and VACUUM semantics (Role-specific, medium, very common)

Kafka exactly-once delivery end to end (Role-specific, hard, very common)

Spark shuffle spill diagnosis and tuning (Role-specific, medium, very common)

Small file problem in a lakehouse and compaction strategies (Role-specific, medium, very common)
