As asked
You are building a pipeline that ingests records from an external source, transforms them, and writes results to a PostgreSQL database. The pipeline can fail and restart at any point. How do you ensure each source record is processed exactly once, without duplicating or dropping rows in the output table?
Sample answer outline
A strong answer covers three things: idempotent writes, implemented as an upsert keyed on a deterministic deduplication key; checkpointing the last-processed offset in a durable store; and the difference between at-least-once delivery with idempotent consumers and true exactly-once semantics. The candidate should address the dual-write problem that arises when writing to both a message broker and the database (the two cannot be committed atomically without a distributed protocol such as two-phase commit), and mention the transactional outbox pattern or Kafka transactions as options.
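One way a candidate might sketch the first two points is shown below. It is a minimal illustration, not a reference implementation. It assumes the source exposes monotonically increasing offsets and a stable per-record ID, and it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server (the INSERT ... ON CONFLICT ... DO UPDATE form used here is also valid PostgreSQL; with a PostgreSQL driver the parameter placeholders would differ). The table names, the function names, and the fail_before_commit flag are hypothetical. The key idea is that the output rows and the checkpoint are committed in the same transaction, so a crash leaves either both or neither.

```python
import sqlite3

# Hypothetical schema: source_id is the deterministic deduplication key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS output (
    source_id TEXT PRIMARY KEY,
    payload   TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS checkpoint (
    pipeline    TEXT PRIMARY KEY,
    last_offset INTEGER NOT NULL
);
"""

UPSERT_ROW = """
INSERT INTO output (source_id, payload) VALUES (?, ?)
ON CONFLICT (source_id) DO UPDATE SET payload = excluded.payload
"""

UPSERT_CHECKPOINT = """
INSERT INTO checkpoint (pipeline, last_offset) VALUES (?, ?)
ON CONFLICT (pipeline) DO UPDATE SET last_offset = excluded.last_offset
"""


def load_offset(conn, pipeline):
    """Return the last committed offset, or -1 if nothing was processed yet."""
    row = conn.execute(
        "SELECT last_offset FROM checkpoint WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    return row[0] if row else -1


def process_batch(conn, pipeline, records, fail_before_commit=False):
    """Write one batch of (offset, source_id, payload) tuples atomically.

    fail_before_commit is a test hook (an assumption of this sketch) that
    simulates a crash after the writes but before the commit.
    """
    if not records:
        return
    try:
        for _offset, source_id, payload in records:
            # Placeholder transform: upper-case the payload.
            conn.execute(UPSERT_ROW, (source_id, payload.upper()))
        # The checkpoint is written in the same transaction as the rows.
        conn.execute(UPSERT_CHECKPOINT, (pipeline, records[-1][0]))
        if fail_before_commit:
            raise RuntimeError("simulated crash")
        conn.commit()
    except Exception:
        conn.rollback()
        raise


def run(conn, pipeline, source, batch_size=2, crash_at_batch=None):
    """Resume from the checkpoint and process the remaining records in batches."""
    start = load_offset(conn, pipeline) + 1
    pending = [r for r in source if r[0] >= start]
    for n, i in enumerate(range(0, len(pending), batch_size)):
        process_batch(
            conn,
            pipeline,
            pending[i : i + batch_size],
            fail_before_commit=(n == crash_at_batch),
        )
```

If the process dies mid-batch, the rollback discards both the partial rows and the checkpoint update, and the restart replays that batch from the last committed offset. If a batch is redelivered after it was already committed, the upsert overwrites the existing rows instead of duplicating them. This gives at-least-once processing with idempotent writes, which is effectively exactly-once for the output table only; it does not by itself solve the case where a broker must also be updated.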
Expect these follow-ups
- What happens if two pipeline instances start simultaneously after a crash?
- How do you test that your idempotency logic actually works under concurrent writes?