As asked
Design a change-data-capture pipeline from a production Postgres database to Snowflake, with under 5 minutes of replication lag and exactly-once semantics for analytics.
Sample answer outline
Source: Postgres logical replication via a slot, consumed by Debezium or a managed connector like Fivetran or Estuary. Stream the change events into Kafka or directly into a staging schema. Land raw events in an append-only table per source table, with the LSN and op type. A scheduled merge job materialises the current state from the raw stream. Exactly-once analytics: idempotent merges keyed on primary key, deduplicate on LSN. Watch the replication slot lag, an abandoned slot fills disk fast.
Reference implementation (sql)
-- Postgres: set up logical replication for CDC
ALTER SYSTEM SET wal_level = 'logical';
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
CREATE PUBLICATION cdc_pub FOR TABLE customers, orders, line_items;
-- Snowflake: idempotent merge from the raw CDC stream
MERGE INTO analytics.customers AS tgt
USING (
SELECT customer_id, name, email, lsn, op
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY lsn DESC) AS rn
FROM raw.customers_cdc
)
WHERE rn = 1
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = src.name, email = src.email
WHEN NOT MATCHED AND src.op <> 'd' THEN
INSERT (customer_id, name, email) VALUES (src.customer_id, src.name, src.email);Expect these follow-ups
- What happens when a schema changes upstream?
- How do you handle a 200GB historical backfill without blocking ongoing CDC?
- What is the failure mode if the replication slot is deleted?