ETL engineer coding interview questions

11 coding questions for ETL engineer candidates. Each entry has the question as asked, a sample answer outline, common follow-ups, and a reference implementation where applicable.


As asked

Write a Python function that takes a list of event dictionaries, each with an 'id' and a 'ts' (Unix timestamp), and returns a list containing only the most recent record for each id. Optimize for the case where the list has millions of records.

Sample answer outline

A single-pass O(n) solution using a dict keyed by id and comparing ts values is the right answer. Candidates should avoid sorting the entire list first, which is O(n log n). Bonus for noting that the dict needs O(k) extra space for k unique ids rather than a second copy of the full list, and for stating how ties are handled: with a strict > comparison, the first record seen for a given ts wins.

Reference implementation (python)

Python

def deduplicate(records: list[dict]) -> list[dict]:
    # records = [{'id': 'a', 'ts': 1000}, {'id': 'a', 'ts': 2000}, ...]
    latest = {}
    for rec in records:
        rid = rec['id']
        if rid not in latest or rec['ts'] > latest[rid]['ts']:
            latest[rid] = rec
    return list(latest.values())

Expect these follow-ups

How would you parallelize this if the records were spread across 100 files on S3?
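
One way to approach this follow-up, sketched under assumptions: each file is deduplicated on its own (a map step), and the per-file winners are then merged with the same "keep the larger ts" rule (a reduce step). This works because taking the maximum is associative, so the order in which partial results are combined does not change the winner when timestamps differ. Reading from S3 is left out; `partitions` stands in for the already-parsed contents of each file, and `latest_per_id`, `merge_latest`, and `deduplicate_partitions` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def latest_per_id(records: list[dict]) -> dict:
    """Map step: keep the newest record per id within one partition."""
    latest = {}
    for rec in records:
        rid = rec['id']
        if rid not in latest or rec['ts'] > latest[rid]['ts']:
            latest[rid] = rec
    return latest


def merge_latest(left: dict, right: dict) -> dict:
    """Reduce step: combine two partial results with the same rule."""
    merged = dict(left)
    for rid, rec in right.items():
        if rid not in merged or rec['ts'] > merged[rid]['ts']:
            merged[rid] = rec
    return merged


def deduplicate_partitions(partitions: list[list[dict]]) -> list[dict]:
    # Threads suit I/O-bound work such as downloading files; here each
    # partition is already in memory, so this only illustrates the shape.
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(latest_per_id, partitions))
    return list(reduce(merge_latest, partials, {}).values())
```

In a real pipeline the map step would more likely run in Spark or as one task per file, with only the much smaller per-file results shipped to the reducer.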

Tags: python, deduplication, dictionaries, etl, algorithms

As asked

Write a Python generator that takes an iterable of arbitrary size and a chunk_size, and yields lists of up to chunk_size items. Use this to batch insert rows into Postgres without loading all rows into memory at once.

Sample answer outline

The generator should use islice or a simple accumulator loop and yield each batch. Strong answers note that generators are memory-efficient for large inputs and that the caller controls back-pressure by consuming one batch at a time. They should also mention that psycopg2's execute_values is more efficient than looping execute for Postgres inserts.

Reference implementation (python)

Python

from itertools import islice

def chunked(iterable, chunk_size: int):
    it = iter(iterable)
    while True:
        batch = list(islice(it, chunk_size))
        if not batch:
            break
        yield batch

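# Usage (psycopg2; sql must contain a single %s placeholder after VALUES):
# from psycopg2.extras import execute_values
# for batch in chunked(rows, 1000):
#     execute_values(cursor, sql, batch, page_size=len(batch))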

Expect these follow-ups

How does your solution behave if the iterable raises an exception halfway through?
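
A small experiment can make the answer to this follow-up concrete. The sketch below repeats the islice-based generator and feeds it a hypothetical source, `flaky_rows`, that fails after producing five rows. Batches that were already yielded have been handed to the caller; the partially filled batch is discarded when the exception propagates.

```python
from itertools import islice


def chunked(iterable, chunk_size: int):
    it = iter(iterable)
    while True:
        batch = list(islice(it, chunk_size))
        if not batch:
            break
        yield batch


def flaky_rows():
    """Hypothetical source that fails after producing five rows."""
    for i in range(5):
        yield i
    raise ConnectionError('source dropped')


received = []
error = None
try:
    for batch in chunked(flaky_rows(), 2):
        received.append(batch)  # stands in for a database insert
except ConnectionError as exc:
    error = exc

print(received)  # the two full batches; row 4 was read but never yielded
print(error)
```

Whether the rows in the completed batches stay in Postgres depends on the caller: committing per batch keeps them, while wrapping the whole loop in one transaction rolls everything back.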

Tags: python, generators, postgres, batch-insert, memory-efficiency

As asked

Given a table called daily_sales with columns (sale_date DATE, product_id INT, revenue DECIMAL), write a SQL query that returns each row with an additional column showing the cumulative revenue for that product_id from the earliest date up to and including that row's sale_date.

Sample answer outline

The answer uses SUM(revenue) OVER (PARTITION BY product_id ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). Candidates should know that the default frame for an ordered window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats rows sharing the same sale_date as peers and gives all of them the same total. ROWS accumulates one row at a time instead; if duplicate dates are possible, the order among tied rows is arbitrary unless a tiebreaker column is added to ORDER BY.

Reference implementation (sql)

SQL

SELECT
  sale_date,
  product_id,
  revenue,
  SUM(revenue) OVER (
    PARTITION BY product_id
    ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_revenue
FROM daily_sales
ORDER BY product_id, sale_date;

Expect these follow-ups

What changes if you want the running total to reset at the start of each calendar month?
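
The usual answer to this follow-up is to add the month to PARTITION BY. The sketch below demonstrates the idea with Python's built-in sqlite3 module so it can be run without a warehouse; it assumes an SQLite build with window function support (3.25 or later) and stores dates as ISO text. In Postgres or Snowflake the partition expression would typically be DATE_TRUNC('month', sale_date) rather than strftime. The sample rows and the month_to_date_revenue column name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE daily_sales (sale_date TEXT, product_id INT, revenue REAL)')
conn.executemany(
    'INSERT INTO daily_sales VALUES (?, ?, ?)',
    [
        ('2024-01-30', 1, 10.0),
        ('2024-01-31', 1, 20.0),
        ('2024-02-01', 1, 5.0),
        ('2024-02-02', 1, 7.0),
    ],
)

# The month is part of PARTITION BY, so the running total restarts
# whenever the calendar month changes for a product.
query = """
SELECT
  sale_date,
  product_id,
  revenue,
  SUM(revenue) OVER (
    PARTITION BY product_id, strftime('%Y-%m', sale_date)
    ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS month_to_date_revenue
FROM daily_sales
ORDER BY product_id, sale_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```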

Tags: sql, window-functions, running-total, databases

As asked

Write a Python function that takes a nested dictionary of arbitrary depth and returns a flat dictionary where nested keys are joined with a dot. For example, {'a': {'b': {'c': 1}}} should become {'a.b.c': 1}.

Sample answer outline

A recursive or iterative DFS approach works. Strong candidates write a helper that carries a prefix and handles lists by either converting the index to a key or skipping them depending on the use case. They should mention that this is useful in ETL when normalizing event payloads before inserting into a column-per-field warehouse table.

Reference implementation (python)

Python

def flatten(d: dict, prefix: str = '', sep: str = '.') -> dict:
    result = {}
    for key, value in d.items():
        full_key = f'{prefix}{sep}{key}' if prefix else key
        if isinstance(value, dict):
            result.update(flatten(value, full_key, sep))
        else:
            result[full_key] = value
    return result

Expect these follow-ups

How do you handle a list value at a nested key, for example {'a': [1, 2, 3]}?
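
One possible answer, sketched here as a variant rather than the only option, is to treat list positions as key segments so that {'a': [1, 2, 3]} becomes {'a.0': 1, 'a.1': 2, 'a.2': 3}. The function name `flatten_with_lists` is hypothetical. Note that in this sketch empty dicts and empty lists produce no keys at all, which may or may not suit the target table.

```python
def flatten_with_lists(obj, prefix: str = '', sep: str = '.') -> dict:
    """Flatten nested dicts and lists; list positions become key segments."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        # Scalar leaf: emit it under the accumulated key.
        return {prefix: obj}

    result = {}
    for key, value in items:
        full_key = f'{prefix}{sep}{key}' if prefix else str(key)
        result.update(flatten_with_lists(value, full_key, sep))
    return result
```

Index-based keys give a stable column set only when list lengths are bounded. For variable-length lists, a common alternative is to keep the list as a JSON string or explode it into a child table.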

Tags: python, json, recursion, etl, data-transformation

As asked

A pipeline writes one row per day per client_id into a table called daily_metrics (client_id, metric_date, value). Write a SQL query that finds all (client_id, metric_date) combinations where a date is missing from the continuous sequence between that client's first and last recorded date.

Sample answer outline

The approach generates the expected dates for each client with a date series generator (generate_series in Postgres, or a recursive CTE elsewhere), bounded by that client's first and last recorded date, then LEFT JOINs to the actual table and keeps the rows where the join finds no match. Strong candidates note that this is a classic data quality check in ETL and that it can be wrapped into a dbt test or Great Expectations check.

Reference implementation (sql)

SQL

WITH date_range AS (
  SELECT client_id,
         generate_series(MIN(metric_date), MAX(metric_date), '1 day'::interval)::date AS expected_date
  FROM daily_metrics
  GROUP BY client_id
)
SELECT dr.client_id, dr.expected_date AS missing_date
FROM date_range dr
LEFT JOIN daily_metrics dm
  ON dr.client_id = dm.client_id AND dr.expected_date = dm.metric_date
WHERE dm.metric_date IS NULL
ORDER BY dr.client_id, dr.expected_date;

Expect these follow-ups

How would you write this in BigQuery, which does not have generate_series?

Tags: sql, date-gaps, data-quality, postgres, etl

As asked

You have a PySpark DataFrame with columns (user_id, event_type, event_time, payload). Multiple rows can share the same user_id and event_type. Write PySpark code to keep only the row with the latest event_time for each (user_id, event_type) combination.

Sample answer outline

The standard approach uses Window.partitionBy('user_id', 'event_type').orderBy(F.desc('event_time')) with row_number() to number the rows in each partition, then filters to row number 1. An alternative is groupBy + agg(F.max('event_time')) joined back to the original DataFrame to recover the full row. Candidates should note that dropDuplicates(['user_id', 'event_type']) keeps one row per key but does not guarantee which one, so it cannot be relied on to keep the latest version.

Reference implementation (python)

Python

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('user_id', 'event_type').orderBy(F.desc('event_time'))

result = (
    df
    .withColumn('rn', F.row_number().over(w))
    .filter(F.col('rn') == 1)
    .drop('rn')
)

Expect these follow-ups

How does your solution handle ties where two rows have the exact same event_time?

Tags: pyspark, deduplication, window-functions, data-engineering

As asked

Write a Python function that wraps an HTTP GET call to an external API and retries on transient failures (status 429 and 5xx) with exponential backoff and jitter, up to a maximum number of attempts.

Sample answer outline

Strong answers implement a loop that tracks attempts, sleeps for base_delay * 2**attempt plus random jitter between tries, and raises after max_attempts. They should distinguish between retryable errors (429, 500, 502, 503, 504) and non-retryable ones (400, 401, 404), which should fail immediately, and mention that production code would often use a library such as tenacity rather than a hand-rolled loop. Jitter prevents a thundering herd when many pipeline tasks retry simultaneously.

Reference implementation (python)

Python

import random
import time

import requests

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_retry(url: str, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code not in RETRYABLE:
            # Non-retryable 4xx/5xx responses fail immediately;
            # any other response is treated as success.
            resp.raise_for_status()
            return resp.json()
        if attempt < max_attempts - 1:
            # Exponential backoff plus up to 1 second of random jitter.
            sleep_time = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(sleep_time)
    raise RuntimeError(f'Failed after {max_attempts} attempts')

Expect these follow-ups

How does tenacity's retry decorator handle state between retries, and how do you log each attempt?

Tags: python, http, retry, resilience, etl

As asked

A table called user_events has columns (user_id INT, event_name VARCHAR, event_count INT). Event names are: 'signup', 'purchase', 'refund'. Write SQL to produce one row per user_id with columns signup_count, purchase_count, and refund_count.

Sample answer outline

The standard approach uses conditional aggregation: SUM(CASE WHEN event_name = 'signup' THEN event_count ELSE 0 END) AS signup_count, and so on for each event name. Candidates should also mention that Snowflake, BigQuery, and Redshift have PIVOT syntax as a cleaner alternative. They should note that the CASE WHEN approach is portable across SQL dialects and fits best when the set of event names is fixed and known when the query is written.

Reference implementation (sql)

SQL

SELECT
  user_id,
  SUM(CASE WHEN event_name = 'signup'   THEN event_count ELSE 0 END) AS signup_count,
  SUM(CASE WHEN event_name = 'purchase' THEN event_count ELSE 0 END) AS purchase_count,
  SUM(CASE WHEN event_name = 'refund'   THEN event_count ELSE 0 END) AS refund_count
FROM user_events
GROUP BY user_id;

Expect these follow-ups

How do you handle this if the event names are not known at query time and could be added dynamically by the product team?
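
One common answer is to generate the conditional-aggregation query from the current list of event names, for example the result of SELECT DISTINCT event_name FROM user_events. The sketch below assumes the names are plain lowercase identifiers and rejects anything else, because they are interpolated into SQL text as both literals and column aliases. `build_pivot_sql` is a hypothetical helper, not part of any library.

```python
import re

# Only plain lowercase identifiers are accepted as event names.
_SAFE_NAME = re.compile(r'[a-z_][a-z0-9_]*')


def build_pivot_sql(event_names: list[str]) -> str:
    """Generate the conditional-aggregation query for the given event names."""
    if not event_names:
        raise ValueError('At least one event name is required')
    columns = []
    for name in event_names:
        # Names end up inside the SQL text, so validate before interpolating.
        if not _SAFE_NAME.fullmatch(name):
            raise ValueError(f'Unsafe event name: {name!r}')
        columns.append(
            f"  SUM(CASE WHEN event_name = '{name}' "
            f"THEN event_count ELSE 0 END) AS {name}_count"
        )
    select_list = ',\n'.join(columns)
    return (
        'SELECT\n  user_id,\n'
        f'{select_list}\n'
        'FROM user_events\nGROUP BY user_id;'
    )


print(build_pivot_sql(['signup', 'purchase', 'refund']))
```

The trade-off worth raising in an interview is that the output schema now changes whenever a new event name appears, so downstream consumers may prefer the long (user_id, event_name, event_count) shape instead.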

Tags: sql, pivot, aggregation, etl, databases

As asked

Using Snowflake SQL, write a MERGE statement that upserts rows from a staging table into a Type 2 dimension table. New rows should be inserted; changed rows should have their current record closed with an effective_to date and a new record opened. Unchanged rows should be left alone.

Sample answer outline

Strong answers use a two-step approach: first MERGE to close changed records by setting effective_to = CURRENT_DATE and is_current = FALSE, then INSERT the new versions. A MERGE applies at most one action to each matched target row, and a matched row can be updated or deleted but not also inserted as a new version. A single MERGE therefore cannot both close the current record and open its replacement unless the deltas are staged first so that each changed key appears in the source twice. Candidates should also mention the hash comparison trick to detect changes efficiently.

Reference implementation (sql)

SQL

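-- Run both steps in one transaction so readers never see a closed record without its successor.
-- Step 1: close changed records
-- COALESCE and a delimiter keep NULLs and shifted values from hiding a change.
MERGE INTO dim_customer tgt
USING staging_customer src
  ON tgt.customer_id = src.customer_id AND tgt.is_current = TRUE
WHEN MATCHED AND MD5(COALESCE(tgt.email, '') || '|' || COALESCE(tgt.name, ''))
              <> MD5(COALESCE(src.email, '') || '|' || COALESCE(src.name, '')) THEN
  UPDATE SET tgt.effective_to = CURRENT_DATE, tgt.is_current = FALSE;

-- Step 2: insert new/changed records
-- After step 1, changed customers have no current row and match the IS NULL branch.
INSERT INTO dim_customer (customer_id, email, name, effective_from, effective_to, is_current)
SELECT src.customer_id, src.email, src.name, CURRENT_DATE, '9999-12-31', TRUE
FROM staging_customer src
LEFT JOIN dim_customer tgt ON src.customer_id = tgt.customer_id AND tgt.is_current = TRUE
WHERE tgt.customer_id IS NULL
   OR MD5(COALESCE(tgt.email, '') || '|' || COALESCE(tgt.name, ''))
      <> MD5(COALESCE(src.email, '') || '|' || COALESCE(src.name, ''));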

Expect these follow-ups

How would you handle a row that was deleted from the source entirely?

Tags: sql, snowflake, scd, merge, data-warehouse

As asked

You have a list of 20 table names stored in a config file. Write an Airflow DAG that dynamically creates one extraction task per table, all running in parallel, followed by a single aggregation task that waits for all of them.

Sample answer outline

Using Dynamic Task Mapping (expand), available from Airflow 2.3, is the cleanest approach. Alternatively, a loop over the table list that creates PythonOperator instances and sets dependencies with >> is the classic approach. Candidates should know that task_ids must be unique within a DAG, so they should include the table name in the id. The aggregation task is wired with aggregate.set_upstream(extract_tasks) or by placing the list of extract operators on the left of >>.

Reference implementation (python)

Python

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

TABLES = ['orders', 'customers', 'products']  # ...20 total

with DAG('dynamic_extract', start_date=datetime(2024, 1, 1), schedule='@daily') as dag:
    extract_tasks = []
    for table in TABLES:
        t = PythonOperator(
            task_id=f'extract_{table}',
            python_callable=lambda t=table: print(f'Extracting {t}')
        )
        extract_tasks.append(t)

    aggregate = PythonOperator(
        task_id='aggregate_all',
        python_callable=lambda: print('Done')
    )
    extract_tasks >> aggregate

Expect these follow-ups

How does Airflow 2.3's Dynamic Task Mapping differ from the loop approach, and what problem does it solve?

Tags: airflow, dag, dynamic-tasks, orchestration, python

As asked

Write a Python function that takes a list of recent daily row counts for a table and raises an alert if today's count deviates by more than 2 standard deviations from the trailing 7-day mean. This function will run as an Airflow task after each load.

Sample answer outline

The function computes mean and standard deviation of the last 7 values, calculates the Z-score for today's count, and raises a ValueError if abs(z) > 2. Candidates should handle edge cases: fewer than 7 historical data points, standard deviation of zero (all counts identical), and NaN inputs. They should also mention that in production this would query a pipeline metadata table rather than take a list parameter.

Reference implementation (python)

Python

import statistics

def check_row_count(history: list[int], today_count: int, z_threshold: float = 2.0):
    # history is assumed to be ordered oldest to newest and to exclude today.
    window = history[-7:]  # trailing 7-day window
    if len(window) < 3:
        return  # not enough history to compute a meaningful stdev
    mean = statistics.mean(window)
    stdev = statistics.stdev(window)
    if stdev == 0:
        return  # all counts identical, so a z-score is undefined
    z = (today_count - mean) / stdev
    if abs(z) > z_threshold:
        raise ValueError(
            f'Row count anomaly: today={today_count}, mean={mean:.0f}, '
            f'stdev={stdev:.0f}, z={z:.2f}'
        )

Expect these follow-ups

How would you tune the threshold for a table that legitimately has high variance on weekends?
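
One option for this follow-up, offered as a sketch rather than a recommendation, is to leave the threshold alone and change the baseline: compare today's count only with earlier counts from the same weekday, so a quiet Saturday is judged against other Saturdays. The function name `check_row_count_same_weekday`, the (date, count) input shape, and the `min_points` default are all assumptions made for illustration.

```python
import statistics
from datetime import date
from typing import Optional


def check_row_count_same_weekday(
    history: list[tuple[date, int]],
    today: date,
    today_count: int,
    z_threshold: float = 2.0,
    min_points: int = 4,
) -> Optional[float]:
    """Compare today's count only with earlier counts from the same weekday.

    Returns the z-score, or None when there is too little history or no
    variance to compute one. Raises ValueError on an anomaly.
    """
    same_day = [count for day, count in history if day.weekday() == today.weekday()]
    if len(same_day) < min_points:
        return None
    mean = statistics.mean(same_day)
    stdev = statistics.stdev(same_day)
    if stdev == 0:
        return None
    z = (today_count - mean) / stdev
    if abs(z) > z_threshold:
        raise ValueError(
            f'Row count anomaly for weekday {today.weekday()}: '
            f'today={today_count}, mean={mean:.0f}, z={z:.2f}'
        )
    return z
```

This needs several weeks of history per weekday before it can alert at all. Other answers a candidate might give include a per-table threshold stored in config, or a median and MAD based score that is less sensitive to a few extreme days.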

Tags: python, data-quality, statistics, monitoring, etl

Questions

Deduplicate a list of records keeping latest by timestamp (Coding, easy, very common)

Chunk a large iterable into fixed-size batches for DB inserts (Coding, easy, very common)

SQL running total with window function (Coding, easy, very common)

Flatten arbitrarily nested JSON into a flat dict (Coding, easy, common)

Find gaps in a date series from a Postgres table (Coding, medium, common)

Deduplicate a PySpark DataFrame keeping the latest row (Coding, medium, common)

Retry a flaky HTTP extraction call with exponential backoff (Coding, medium, common)

Pivot rows to columns using conditional aggregation (Coding, medium, common)

Write a MERGE statement for SCD Type 2 in Snowflake (Coding, hard, common)

Dynamically generate Airflow tasks from a config list (Coding, medium, common)

Write a reusable row count anomaly detector in Python (Coding, medium, common)
