Question 1

Tell me about a time an ETL pipeline failure caused incorrect or missing data to reach end users or a downstream system. Walk me through what happened, how you discovered it, how you fixed it, and what you changed to prevent recurrence.

Accepted Answer

Strong answers follow the STAR structure (Situation, Task, Action, Result): describe a concrete failure (wrong aggregation logic, missing rows, duplicate records), explain how discovery happened (user report, monitoring alert, or downstream system failure), detail the remediation steps (rollback, re-run, data correction), and show genuine learning (added a specific data quality check, improved alerting thresholds, wrote a runbook). Vague answers that never name the actual data problem are weak.

Question 2

Your nightly ETL job is supposed to finish by 7 AM but it is 8:30 AM and it is still running, and users are already asking why the dashboard is empty. You have never seen it run this long before. What do you do right now?

Accepted Answer

First communicate: tell the dashboard users there is an active incident and give an ETA (even a rough one) to stop new questions from piling in. Then diagnose in parallel: check which task is stuck, look at resource utilization, check for locks on the target database, and check if source data volume spiked. Fix the immediate issue with the least-risk intervention. After resolution, add an alert for job duration exceeding 1.5x the p95 baseline so this gets caught an hour earlier next time.
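
The duration alert mentioned at the end of this answer can be sketched as follows. This is a minimal illustration, not a specific scheduler's API: the function names and the idea of passing in a list of past run durations are assumptions made for the example.

```python
from statistics import quantiles


def duration_alert_threshold(history_minutes, multiplier=1.5):
    """Return multiplier times the p95 of past run durations (in minutes)."""
    if len(history_minutes) < 2:
        raise ValueError("need at least two historical runs to estimate p95")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(history_minutes, n=20, method="inclusive")[-1]
    return multiplier * p95


def should_alert(elapsed_minutes, history_minutes, multiplier=1.5):
    """True when the current run has exceeded the duration threshold."""
    return elapsed_minutes > duration_alert_threshold(history_minutes, multiplier)
```

The check is meant to run while the job is still in progress, so a run that is far beyond its usual duration raises an alert before the 7 AM deadline rather than after users notice.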

Question 3

Tell me about a legacy ETL pipeline you inherited that was fragile or hard to maintain. What did you do to improve it, how did you justify the work to stakeholders, and what was the outcome?

Accepted Answer

The candidate should describe the specific problems with the legacy pipeline (no tests, hardcoded values, single points of failure, no observability), the decision process for how much to refactor versus rewrite, and how they built confidence in the new version through shadow running or parallel validation. They should quantify the improvement: reduced failures by X%, cut run time from Y to Z minutes, or reduced on-call incidents.

Question 4

Tell me about a time an upstream team changed the schema or format of data you depended on without warning. How did you handle the immediate incident and the ongoing relationship?

Accepted Answer

Strong answers describe the immediate triage (which pipelines broke, what data is missing, how to communicate to data consumers), then the resolution steps, and critically the longer-term changes: establishing a data contract or schema registry, adding a schema validation step at ingestion, and building a change notification process with the upstream team. The candidate should show they influenced organizational process, not just fixed their own pipeline.
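
A schema validation step at ingestion might look like the sketch below. The expected schema, column names, and function name are hypothetical; a real pipeline would more likely load the contract from a schema registry or a shared contract file.

```python
# Hypothetical contract agreed with the upstream team: column name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "created_at": str}


def validate_schema(record, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable schema problems for one record."""
    problems = []
    for column, expected_type in expected.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            actual = type(record[column]).__name__
            problems.append(
                f"wrong type for {column}: got {actual}, expected {expected_type.__name__}"
            )
    # Columns the contract does not know about often signal a rename upstream.
    for column in sorted(record.keys() - expected.keys()):
        problems.append(f"unexpected column: {column}")
    return problems
```

Failing the load, or at least alerting, when the returned list is non-empty turns an unannounced upstream change into a visible incident at ingestion instead of a silent downstream error.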

Question 5

Describe a specific ETL or data transformation job that was too slow. Walk me through how you diagnosed the bottleneck, what you changed, and the measurable improvement.

Accepted Answer

Strong answers describe a concrete scenario with numbers: job took 4 hours, needed to finish in 45 minutes before the business opened. The diagnosis method should be specific (profiling with Spark UI, checking query plans, measuring I/O wait). The fix should be precise (added a clustering key, switched from row-level MERGE to partition swap, increased shuffle partitions). The outcome should be quantified.

Question 6

Tell me about a time you were under pressure to ship a data pipeline quickly and cut corners on data quality checks or testing. What happened, and what did you learn?

Accepted Answer

Honest answers acknowledge the specific shortcut taken and the consequence. Strong candidates describe a genuine tradeoff, not a perfect story, and show learning by explaining exactly what minimum viable quality gate they now insist on even under tight deadlines. They should also show judgment: some shortcuts are acceptable (skipping edge-case docs) while others are not (skipping null checks on primary keys).
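
As one illustration of a minimum viable quality gate, the sketch below rejects a batch whose primary key contains nulls or duplicates. The function names and the representation of rows as dictionaries are assumptions for the example.

```python
from collections import Counter


def primary_key_violations(rows, key):
    """Return (null_count, duplicate_keys) for the given primary-key column."""
    values = [row.get(key) for row in rows]
    null_count = sum(1 for value in values if value is None)
    counts = Counter(value for value in values if value is not None)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    return null_count, duplicates


def quality_gate(rows, key):
    """Raise ValueError if the batch must not be loaded; otherwise return None."""
    null_count, duplicates = primary_key_violations(rows, key)
    if null_count or duplicates:
        raise ValueError(
            f"primary key check failed: {null_count} null(s), duplicates={duplicates}"
        )
```

A check this small costs minutes to add, which is the point of the question: it is the kind of gate that should survive even a tight deadline.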

Question 7

Tell me about a time you helped a junior engineer understand why a particular ETL pattern matters, such as idempotency, staging tables, or testing. How did you approach the teaching, and how did you know it stuck?

Accepted Answer

Strong answers describe a specific concept, a concrete teaching method (code review with explanation, pair programming, a worked example from a real failure), and evidence of knowledge transfer (the junior applied the pattern independently, caught a related bug in code review, or taught another colleague). Generic answers about 'being supportive' without specifics are weak.

Question 8

Tell me about a time a business stakeholder asked you to build a data pipeline or report that you believed was technically flawed, would produce misleading results, or was not feasible in the requested timeframe. How did you handle it?

Accepted Answer

Strong answers show the candidate understood the underlying business need behind the request, not just the literal ask. They should describe how they explained the technical limitation clearly without jargon, proposed an alternative that met the actual need, and got buy-in. Weak answers are those where the candidate simply built what the stakeholder asked for even though they knew it was wrong.

Question 9

Describe a time you were paged for an ETL or data pipeline incident outside business hours. Walk me through your triage process, how you communicated status, and how you resolved it.

Accepted Answer

Strong answers show a structured approach: check dashboards before SSHing anywhere, isolate the scope (one table, one pipeline, or everything), communicate early to stakeholders about the impact and ETA, resolve with the least-risk fix first (restart before rewriting), and document the incident. They should mention what they added to the runbook afterward.

Question 10

Tell me about a time you disagreed with a colleague or manager on how to build an ETL pipeline or data model. How did you make your case, and what happened?

Accepted Answer

Strong answers show the candidate argued on technical merit with data (benchmarks, precedent, cost estimates), listened to the other side and acknowledged valid points, and either changed their mind with a good reason or escalated with evidence when the stakes were high enough. Avoid answers where the candidate always wins or always defers.

Question 11

Tell me about a time you had to learn and use a new data tool or technology (such as Dagster, dbt, Spark, or a new warehouse) on a project with a real deadline. How did you get up to speed quickly, and what corners did you cut that you later went back to fix?

Accepted Answer

Strong answers show a deliberate ramp-up strategy (official docs, then a toy project, then a production PR with a code review from someone more experienced) and honest acknowledgment of what they skipped initially and why. They should name the specific tool and describe one concrete thing they implemented incorrectly on the first try and how they fixed it.

Question 12

Tell me about a time you had to choose between building new pipeline features requested by the business and fixing technical debt that was causing recurring incidents. How did you frame the decision and what did you do?

Accepted Answer

Strong candidates quantify the debt cost (engineer hours per week on incidents, frequency of SLA misses) and compare it against the feature value. They describe how they communicated this tradeoff to a non-technical stakeholder without using jargon, and show they negotiated a specific time allocation for debt work rather than letting it go undefined. The outcome should be concrete.

Question 13

It is 6 AM and a vendor's daily file has arrived but it is truncated at 40% of the expected row count, and your pipeline already loaded it and moved on. Finance needs the data for a 9 AM report. What are your next steps in order?

Accepted Answer

Immediately: quarantine the bad load by rolling back or flagging the loaded rows as invalid, alert the downstream consumers about the data gap well before 9 AM, and contact the vendor to request a resend and learn the cause. Simultaneously: check whether yesterday's data can serve as a safe fallback and whether the report can be delayed or annotated. After resolution: add a row count validation check at ingestion so a truncated file is caught before the load step next time.
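
A row count validation check at ingestion could be sketched like this. The 0.8 minimum ratio and the use of a median over recent daily counts are assumptions for illustration; the right baseline depends on how much the vendor's file size normally varies.

```python
from statistics import median


def row_count_ok(actual_rows, recent_counts, min_ratio=0.8):
    """True if today's file has at least min_ratio of the recent median row count."""
    if not recent_counts:
        raise ValueError("no baseline available for row count validation")
    baseline = median(recent_counts)
    return actual_rows >= min_ratio * baseline
```

With a baseline of one million rows per day, a file holding 40% of that volume fails the check, so the pipeline can stop and alert before anything is loaded.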

Question 14

Sales and Finance both have a 'monthly revenue' metric in their dashboards, built from different ETL pipelines. They now report different numbers for last month. Both teams believe their number is correct. What do you do?

Accepted Answer

This is a data governance and communication problem first. Bring both teams together to compare the definitions: what events count, what time zone is used for the period boundary, how refunds are treated, and so on. Once the definitions are documented and the right one agreed upon, build a single canonical metric in the warehouse that both dashboards use. Candidates should advocate for a metrics layer (dbt Semantic Layer, LookML, or similar) to prevent this from recurring.
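
The sketch below shows how two reasonable definitions can disagree on identical data. The events, the fixed UTC-5 reporting zone, and the two definitions are made up for illustration and are not attributed to either team.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
LOCAL = timezone(timedelta(hours=-5))  # hypothetical reporting zone, fixed offset

# (timestamp in UTC, amount, kind): illustrative events, not real data
EVENTS = [
    (datetime(2024, 3, 15, 12, 0, tzinfo=UTC), 100.0, "sale"),
    (datetime(2024, 4, 1, 2, 0, tzinfo=UTC), 50.0, "sale"),  # still March 31 in LOCAL
    (datetime(2024, 3, 20, 9, 0, tzinfo=UTC), -30.0, "refund"),
]


def monthly_revenue(events, year, month, tz, include_refunds):
    """Sum amounts whose timestamp falls in the given month of time zone tz."""
    total = 0.0
    for ts, amount, kind in events:
        local_ts = ts.astimezone(tz)
        if (local_ts.year, local_ts.month) != (year, month):
            continue
        if kind == "refund" and not include_refunds:
            continue
        total += amount
    return total


# Definition A: UTC month boundary, gross of refunds.
definition_a = monthly_revenue(EVENTS, 2024, 3, UTC, include_refunds=False)
# Definition B: local month boundary, net of refunds.
definition_b = monthly_revenue(EVENTS, 2024, 3, LOCAL, include_refunds=True)
```

Both numbers are correct under their own definition, which is why the first step is agreeing on the definition rather than debugging either pipeline.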

Question 15

A product manager asks you to integrate a third-party data vendor's API into the warehouse within two weeks. You have never worked with this vendor before. What questions do you ask and what do you evaluate before committing to the timeline?

Accepted Answer

Key questions: API rate limits and pagination, authentication method, schema stability and versioning policy, historical data availability, SLA for data freshness, support for change detection (updated_at field or CDC), terms of service around data storage and redistribution, and sample data quality. The timeline also depends on whether this is a one-time load or an ongoing pipeline. Candidates should mention requesting a sandbox or sample file before committing.
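
Pagination and rate limits shape the ingestion loop, so it helps to sketch that loop early. The example below assumes a cursor-paginated API and a caller-supplied `fetch_page` function; both are hypothetical, since the vendor's actual interface is exactly what these questions are meant to uncover.

```python
import time


def fetch_all(fetch_page, min_interval_s=0.0, sleep=time.sleep):
    """Collect every record from a cursor-paginated API.

    fetch_page(cursor) must return (records, next_cursor); a next_cursor of
    None marks the last page. min_interval_s spaces out requests to respect
    a rate limit.
    """
    records, cursor = [], None
    while True:
        page, cursor = fetch_page(cursor)
        records.extend(page)
        if cursor is None:
            return records
        sleep(min_interval_s)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; running it against a sandbox or sample file gives an early estimate of how long a full historical load would take.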

Question 16

A DAG that has been running daily for 6 months did not trigger yesterday or today. There are no failed tasks, just no runs at all. How do you diagnose this?

Accepted Answer

Check in this order: Is the DAG paused in the UI? Is the scheduler running, and has it been restarted recently? Is the DAG file still being parsed (check for import errors in the UI or in the scheduler logs)? Has the start_date or end_date changed? Is catchup=False causing runs to be skipped after scheduler downtime? If metrics are exported, check the scheduler heartbeat and the DAG-parsing metrics (such as import error counts) alongside the scheduler logs. An error in one DAG file typically affects only that file, but an error in a shared module that many DAG files import can stop all of them from parsing.
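
A quick way to rule out plain syntax errors is to parse every Python file in the DAGs folder outside the scheduler. The sketch below uses only the standard library; the function name is hypothetical, and it does not catch import-time failures such as a missing package, which the orchestrator's own import error reporting covers.

```python
import ast
from pathlib import Path


def find_unparseable_files(dags_folder):
    """Return {path: error message} for every .py file that fails to parse."""
    errors = {}
    for path in sorted(Path(dags_folder).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors
```

Running the same check in CI before deployment prevents a broken file from reaching the DAGs folder in the first place.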

Question 17

An upstream engineering team tells you they need to rename a column from 'user_id' to 'account_id' in a Postgres table your pipeline depends on, and they plan to do it in two weeks with no transition period. You have 15 downstream ETL jobs using this column. What do you negotiate and how do you handle the migration?

Accepted Answer

Negotiate a transition period where both column names exist (a view or a generated column alias) for at least 30 days. Audit all 15 jobs and their downstream dependencies to estimate migration effort. Migrate and test each job in a staging environment before cutover. Add a schema validation step at the ingestion layer of each job so future renames trigger an alert rather than a silent failure. After cutover, confirm the upstream team drops the old column name only after all jobs are verified.
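
The audit step can start with a plain text search over each job's SQL. The sketch below assumes the SQL is available as a mapping from job name to query text, which is a hypothetical layout. It will miss jobs that use SELECT * or build SQL dynamically, so it gives a first pass rather than a complete inventory.

```python
import re


def jobs_referencing_column(job_sql, column):
    """Return sorted names of jobs whose SQL mentions column as a whole identifier."""
    # \b keeps 'user_id' from matching inside longer names such as 'super_user_id2'.
    pattern = re.compile(rf"\b{re.escape(column)}\b", re.IGNORECASE)
    return sorted(name for name, sql in job_sql.items() if pattern.search(sql))
```

The resulting list gives a concrete number to bring to the negotiation and a checklist to work through in staging.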

Question 18

Tell me about a time you joined or worked with a team that had no consistent standards for data naming, documentation, or pipeline testing. What did you do to improve the situation, and how much resistance did you face?

Accepted Answer

Strong answers show pragmatic incrementalism: pick one high-pain problem (column naming inconsistency, no dbt tests on critical models), propose a lightweight standard, get buy-in from one influential team member, and demonstrate value before pushing for broader adoption. They should acknowledge that governance feels like overhead until a high-profile incident makes the value obvious.

Question 19

A junior engineer ran a DELETE without a WHERE clause on a 50-million-row production fact table in Snowflake during working hours. Active dashboards are now showing zeros. What do you do?

Accepted Answer

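Snowflake's Time Travel can restore the rows as long as the DELETE falls within the table's data retention period (one day by default). Rebuild the table with CREATE TABLE ... AS SELECT * FROM table AT (OFFSET => -600), where the offset is in seconds (here, 10 minutes ago), or use BEFORE (STATEMENT => '<query_id>') to get the state just before the offending DELETE. UNDROP applies only if the table was dropped, not when rows were deleted. The first action is to restore the data, ideally within minutes. Simultaneously communicate to stakeholders that a recovery is in progress. After recovery: remove DELETE privileges on production tables from the role the engineer was using, add a pre-production review step for destructive DDL/DML, and document the incident. Avoid blame in the post-mortem; fix the access model.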
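
One possible recovery sequence is to clone the table as of just before the DELETE and then swap the clone into place. The helper below only builds the SQL text; the function name, the `_restored` suffix, and the input checks are assumptions for this sketch, and the statements should be reviewed before anyone runs them against production.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$.]*$")
_QUERY_ID = re.compile(r"^[0-9a-fA-F-]+$")


def time_travel_restore_sql(table, delete_query_id):
    """Build statements that restore table to its state before a given query."""
    # Reject anything that is not a plain identifier or query ID, because the
    # values are interpolated directly into SQL text.
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not _QUERY_ID.match(delete_query_id):
        raise ValueError(f"unexpected query id: {delete_query_id!r}")
    restored = f"{table}_restored"
    return [
        f"CREATE TABLE {restored} CLONE {table} BEFORE (STATEMENT => '{delete_query_id}');",
        f"ALTER TABLE {table} SWAP WITH {restored};",
    ]
```

After the swap, the `_restored` name points at the emptied data, which can be kept for the post-mortem and dropped later. The query ID of the DELETE is available from the account's query history.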

Question 20

Your daily ingestion job usually processes 5 million rows but this morning it started processing 50 million rows and is failing due to memory and timeout limits. The source team says it is a legitimate data spike, not a bug. What do you do today, and what do you change for the future?

Accepted Answer

Today: increase memory allocation or warehouse size temporarily to process the load, split the batch into smaller chunks if the current approach loads everything into memory, and communicate the delay to stakeholders. Future: design the pipeline to process in fixed-size partitions rather than relying on total volume being small, add row count monitoring to detect spikes early, and set up auto-scaling or a fallback warehouse size for large runs.
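
Processing in fixed-size chunks keeps peak memory bounded by the chunk size rather than by the total volume. A minimal sketch, assuming the source can be read as an iterator of rows; the function name and chunk size are illustrative.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows without materializing the whole input."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

A loader built on this pattern handles 5 million and 50 million rows with the same memory footprint; only the number of chunks, and therefore the run time, grows.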

Questions

Tell me about a time a pipeline failure impacted users (Behavioural, medium, very common)

Pipeline misses its SLA on a critical morning (Behavioural, medium, very common)

Describe a legacy ETL pipeline you modernized (Behavioural, medium, common)

Navigating a schema change from an upstream team (Behavioural, medium, common)

Tell me about a pipeline performance problem you solved (Behavioural, medium, common)

When you shipped fast and paid the price in data quality (Behavioural, medium, common)

Mentoring a junior engineer on pipeline best practices (Behavioural, easy, common)

Handling a data request that could not be fulfilled as asked (Behavioural, medium, common)

Handling an on-call ETL incident under time pressure (Behavioural, medium, common)

Disagreeing with a colleague on a data architecture decision (Behavioural, medium, common)

Adopting a new data tool under a tight deadline (Behavioural, easy, common)

Balancing feature work and ETL technical debt (Behavioural, medium, common)

What do you do when a source file arrives corrupted? (Behavioural, medium, common)

Two teams define the same metric differently (Behavioural, medium, common)

Evaluating a new data source before integrating it (Behavioural, medium, common)

A DAG that was running daily suddenly stops triggering (Behavioural, medium, common)

Upstream team wants to rename a critical column (Behavioural, medium, common)

Introducing data governance practices to a chaotic team (Behavioural, medium, occasional)

Accidental DELETE on a production warehouse table (Behavioural, hard, occasional)

Handling an unexpected 10x data volume spike (Behavioural, hard, occasional)