Question 1

How does the CAP theorem apply to a lakehouse built on object storage? When you have concurrent readers and writers on an Iceberg table, what guarantees do you actually get, and what trade-offs does the table format's atomic commit mechanism represent?

Accepted Answer

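Object storage is not a distributed database in the CAP sense, so CAP applies only by analogy. The table format layers snapshot isolation on top of it: a reader pins the snapshot that was current when it planned its scan and sees a consistent point-in-time view for the whole query, while writers use optimistic concurrency and commit by atomically swapping the table's metadata pointer in the catalog. A writer that loses the race must retry against the new snapshot, and its commit can be rejected if the winning commit changed the same files. The trade-off favors consistent committed state over guaranteed writer progress: a commit is never partially visible, but under contention a writer succeeds only eventually, after retries, and a long-running reader may be working from a snapshot that is no longer the latest. The candidate should note that the catalog backend (Hive Metastore, Nessie, or a REST catalog) is the actual consistency boundary, because the atomic swap happens there.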
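The commit protocol described in the answer can be sketched in a few lines. This is a minimal in-memory illustration, not Iceberg's implementation: `InMemoryCatalog`, `CommitConflict`, and `append_files` are hypothetical names, and snapshot ids are plain integers for simplicity.

```python
import threading


class CommitConflict(Exception):
    """Raised when the catalog pointer moved since the writer read it."""


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog; tracks one table's current snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0                  # id of the current snapshot
        self.snapshots = {0: frozenset()}  # snapshot id -> immutable set of data files

    def current(self):
        with self._lock:
            return self._current

    def compare_and_swap(self, expected, new_id, files):
        # The atomic step: succeeds only if nobody committed since `expected` was read.
        with self._lock:
            if self._current != expected:
                raise CommitConflict(f"expected {expected}, found {self._current}")
            self.snapshots[new_id] = frozenset(files)
            self._current = new_id


def append_files(catalog, new_files, max_retries=5):
    """Optimistic append: read the base snapshot, build a new one, swap, retry on conflict."""
    for _ in range(max_retries):
        base = catalog.current()
        files = catalog.snapshots[base] | set(new_files)
        try:
            catalog.compare_and_swap(base, base + 1, files)
            return base + 1
        except CommitConflict:
            continue  # another writer won; rebase on the new snapshot and try again
    raise CommitConflict("gave up after retries")


catalog = InMemoryCatalog()
pinned = catalog.current()              # a reader pins snapshot 0
append_files(catalog, ["a.parquet"])
append_files(catalog, ["b.parquet"])
reader_view = catalog.snapshots[pinned]  # still the empty snapshot the reader started with
```

The reader's view does not change while writers commit, and a writer holding a stale base id is turned away by `compare_and_swap`, which is the behaviour the catalog has to provide for the table format's guarantees to hold.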

Question 2

RocksDB uses an LSM tree and Parquet is a columnar format. For a write-heavy streaming use case, what are the fundamental read and write amplification tradeoffs of each, and how does this inform your choice of state store backend in Flink?

Accepted Answer

LSM trees buffer writes in a sorted in-memory structure (the memtable) and flush it to immutable SSTables, so the write path is cheap and sequential. The costs show up later: background compaction rewrites data and adds write amplification, and reads pay higher read amplification because a lookup may have to check the memtable and SSTables on several levels (Bloom filters avoid most of the wasted probes). Parquet is optimized for reads: its columnar layout allows vectorized scans, but files are immutable, so an update means rewriting the whole file. For Flink state, RocksDB's LSM is a natural fit because state operations are key-value lookups and point updates, not full column scans. Parquet would be unsuitable as a Flink state store because random key lookups on a columnar file are expensive.
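A toy model makes the read-amplification point concrete. The sketch below is not RocksDB: `ToyLSM` is a hypothetical class that uses plain dictionaries for the memtable and SSTables, has no levels or Bloom filters, and counts how many structures a lookup has to probe.

```python
class ToyLSM:
    """Toy LSM store: one memtable plus a list of immutable tables, newest first."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch memory until the memtable fills up.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted, immutable table.
        if self.memtable:
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Return (value, probes), where probes counts the structures checked."""
        probes = 1
        if key in self.memtable:
            return self.memtable[key], probes
        for table in self.sstables:
            probes += 1
            if key in table:
                return table[key], probes
        return None, probes

    def compact(self):
        # Merge every table into one; this rewrite is the source of write amplification.
        merged = {}
        for table in reversed(self.sstables):  # oldest first so newer values win
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))] if merged else []


db = ToyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    db.put(key, value)

before = db.get("b")  # found only in the oldest table, after two misses
db.compact()
after = db.get("b")   # one table left, so fewer probes
```

Compaction trades extra writes now for cheaper reads later, which is the balance a state backend tunes.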

Question 3

A streaming pipeline aggregates real-time sales totals and serves them on a dashboard. A business user asks: is the number on the dashboard exactly correct right now? How do you explain the consistency guarantees and what engineering choices determine how stale the number can be?

Accepted Answer

The number is accurate as of the latest processed watermark, not wall-clock time. Staleness is bounded by the watermark lag (typically the out-of-orderness allowance plus processing lag) plus the sink write interval; for a windowed aggregate that emits only when the window closes, the window length adds to that bound. The candidate should explain the difference between event-time and processing-time aggregations, and how choosing processing time gives a smaller latency bound at the cost of counting late events in whichever window is open when they arrive rather than the window they belong to. They should be able to give a realistic expected staleness for a 5-minute Flink aggregation job.
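A back-of-the-envelope calculation is usually enough to answer the business user. The sketch below assumes a tumbling event-time window that emits only on close, with no early-firing triggers, and event timestamps close to wall-clock time at ingestion; `staleness_bounds` is a hypothetical helper and every number is illustrative, not measured.

```python
from datetime import timedelta


def staleness_bounds(window, out_of_orderness, processing_lag, sink_interval):
    """Return (best, worst) staleness of the displayed total.

    best:  just after a result lands, the figure trails wall-clock time by the watermark lag.
    worst: just before the next result lands, a full window plus the sink interval has
           been added on top of that lag.
    """
    watermark_lag = out_of_orderness + processing_lag
    best = watermark_lag
    worst = window + watermark_lag + sink_interval
    return best, worst


# Illustrative numbers for a 5-minute aggregation job (all assumed).
best, worst = staleness_bounds(
    window=timedelta(minutes=5),
    out_of_orderness=timedelta(seconds=30),
    processing_lag=timedelta(seconds=10),
    sink_interval=timedelta(seconds=15),
)
print(f"dashboard is between {best} and {worst} behind")
```

With these assumed inputs the dashboard trails reality by somewhere between 40 seconds and just under 6 minutes, so "exactly correct right now" is the wrong framing; "correct as of a few minutes ago" is the honest one.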

Question 4

Explain how an analytical query engine like Trino or Spark uses Parquet column statistics to skip files and row groups. What types of predicates benefit from this and which do not?

Accepted Answer

Each Parquet file stores column-level min/max statistics per row group in the footer. The query engine reads the footer before reading any data and skips row groups where the predicate cannot match based on the min/max range. Range predicates (ts > '2024-01-01') and equality predicates on sorted or low-cardinality columns benefit most, because their row groups cover narrow, mostly non-overlapping ranges. Statistics help little or not at all for LIKE with a leading wildcard, for most NOT conditions, for OR conditions where any branch cannot be pruned, and for unsorted high-cardinality columns, where nearly every row group's min/max range covers the searched value. Bloom filters complement statistics for equality predicates on high-cardinality columns.
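The pruning decision itself is a simple interval test. The sketch below is a pure-Python illustration rather than any engine's code: `RowGroupStats`, `may_match`, and `prune` are hypothetical names, and the statistics are made-up integers standing in for what a footer would hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max for one column in one row group, as recorded in the footer."""
    min_value: int
    max_value: int


def may_match(stats, op, literal):
    """Return False only when the min/max range proves no row can satisfy the predicate."""
    if op == "=":
        return stats.min_value <= literal <= stats.max_value
    if op == ">":
        return stats.max_value > literal
    if op == ">=":
        return stats.max_value >= literal
    if op == "<":
        return stats.min_value < literal
    if op == "<=":
        return stats.min_value <= literal
    return True  # unsupported predicate: nothing can be proven, so the group must be read


def prune(row_groups, op, literal):
    """Return the indexes of the row groups that still have to be read."""
    return [i for i, stats in enumerate(row_groups) if may_match(stats, op, literal)]


# Data written in sorted order: ranges are narrow and disjoint.
sorted_groups = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299)]
# Same values written unsorted: every range spans almost the whole domain.
unsorted_groups = [RowGroupStats(0, 299), RowGroupStats(3, 290), RowGroupStats(1, 295)]

print(prune(sorted_groups, "=", 150))    # only the middle group survives
print(prune(unsorted_groups, "=", 150))  # nothing can be skipped
```

The same predicate skips two thirds of the sorted data and none of the unsorted data, which is why sort order at write time matters as much as the predicate type.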

Question 5

Kafka's default partitioner uses a murmur2 hash of the message key to assign partitions. Explain how this works, what causes partition skew, and what alternatives exist when your key space is not evenly distributed.

Accepted Answer

The default partitioner hashes the serialized key bytes with murmur2 and takes the result modulo the number of partitions. This is a simple modulo hash, not consistent hashing (a ring-based technique that minimizes key redistribution when cluster topology changes), so changing the partition count remaps most keys. Skew occurs when a small number of keys are extremely frequent (a single high-volume user) or when the key space is small relative to the partition count. Alternatives include salting the key with a random suffix and reassembling the results downstream, or using a custom partitioner that routes hot keys to a dedicated partition. For messages with a null key, sticky partitioning spreads load across partitions while reducing small-batch overhead.
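The effect of a hot key, and of salting it, can be shown with a small simulation. The `murmur2` function below is a Python port written to follow the 32-bit MurmurHash2 variant and seed used by Kafka's Java client; it has not been verified against the client here, so treat exact partition numbers as illustrative. `partition_for`, `salted_key`, and `partition_counts` are hypothetical helpers, and the salt is taken round-robin instead of at random so the run is reproducible.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def murmur2(data: bytes) -> int:
    """32-bit MurmurHash2 (port following Kafka's Java client; unverified here)."""
    m, r = 0x5BD1E995, 24
    length = len(data)
    h = (0x9747B28C ^ length) & MASK32
    body_end = length - length % 4
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & MASK32
        k ^= k >> r
        k = (k * m) & MASK32
        h = (h * m) & MASK32
        h ^= k
    tail = data[body_end:]  # 0 to 3 leftover bytes
    if len(tail) >= 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & MASK32
    h ^= h >> 13
    h = (h * m) & MASK32
    h ^= h >> 15
    return h


def partition_for(key: bytes, num_partitions: int) -> int:
    """Clear the sign bit, then take the hash modulo the partition count."""
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions


def salted_key(key: str, salt: int) -> bytes:
    """Append a salt so one logical key spreads over several partitions."""
    return f"{key}#{salt}".encode("utf-8")


def partition_counts(keys, num_partitions):
    return Counter(partition_for(k, num_partitions) for k in keys)


# 100 events for one hot key: unsalted they all land on a single partition.
unsalted = partition_counts([b"hot-user"] * 100, 8)
# Salting with 16 round-robin suffixes spreads the same events out.
salted = partition_counts([salted_key("hot-user", i % 16) for i in range(100)], 8)
print(len(unsalted), "partition(s) unsalted;", len(salted), "partition(s) salted")
```

Salting breaks per-key ordering and co-location, so consumers need a second aggregation step to recombine the partial results for the original key.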

Questions

CAP theorem applied to a lakehouse with concurrent readers and writers (Technical fundamentals, medium, Common)

LSM trees versus columnar storage for write-heavy workloads (Technical fundamentals, hard, Occasional)

Eventual consistency guarantees in a streaming aggregation pipeline (Technical fundamentals, easy, Very common)

How Parquet column statistics enable file and row group pruning (Technical fundamentals, medium, Common)

Consistent hashing and key distribution in Kafka partitions (Technical fundamentals, medium, Common)
