As asked
You own a daily ETL job that loads orders from an upstream API into the warehouse. Yesterday's job failed halfway through and will be retried. How do you make the pipeline idempotent?
Sample answer outline
A strong answer defines the target invariant first: re-running the job for the same logical window should produce the same final table state, not duplicate rows or partial totals. Use a stable business key such as order_id, write into a staging table, validate counts and required columns, then merge into the target inside a transaction or partition swap. Track high-water marks separately from successful commits so a failed run does not advance the cursor. Candidates often trip up by relying only on append-only loads or by deleting a partition before they know the replacement data is complete.
Expect these follow-ups
- How would your answer change if the source sends updates to old orders?
- What metadata would you store for each pipeline run?
- How do you recover if the warehouse supports no multi-statement transaction?