As asked
You have a 20GB CSV that you are loading into a pandas DataFrame. A groupby-agg operation takes 40 minutes. Walk me through the steps you would take to speed it up, from quick wins to structural changes.
Sample answer outline
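Quick wins: profile first to confirm where the time and memory go (memory_profiler for memory, cProfile or a simple timer for runtime); read only the columns the aggregation needs and set dtypes at load time with read_csv(usecols=, dtype=); downcast numeric dtypes where the value range and precision allow it (float64 to float32, int64 to int32 or int16); and use the categorical dtype for low-cardinality string columns, passing observed=True to groupby so that unused category combinations are not materialized. Structural changes: switch to chunked processing with read_csv(chunksize=), aggregating each chunk and then combining the partial results (this works directly for sum, count, min, and max; a mean has to be rebuilt from per-chunk sums and counts); use Dask or Polars for out-of-core computation; or load the data into a database and push the aggregation down as SQL. The candidate should also mention Parquet as a better storage format than CSV: it is columnar, typed, and compressed, so only the needed columns are read and no text parsing is required.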
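The chunked approach combined with the dtype quick wins can be sketched as follows. This is a minimal illustration, not a reference answer: the function name chunked_group_mean, the demo helper, and the column names region and amount are all hypothetical, and the small generated CSV stands in for the real 20GB file.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def chunked_group_mean(path, key, value, chunksize=100_000):
    """Per-group sum, count, and mean of `value`, read one chunk at a time."""
    reader = pd.read_csv(
        path,
        usecols=[key, value],  # skip columns the aggregation does not need
        dtype={key: "category", value: "float32"},  # smaller dtypes at load time
        chunksize=chunksize,
    )
    partials = []
    for chunk in reader:
        # observed=True keeps only the categories present in this chunk.
        part = chunk.groupby(key, observed=True)[value].agg(["sum", "count"])
        # Each chunk has its own category set, so use plain labels for combining,
        # and widen the partial sums so the final accumulation is in float64.
        part.index = part.index.astype(object)
        partials.append(part.astype({"sum": "float64"}))
    # Sums and counts are additive across chunks; the mean is derived afterwards.
    combined = pd.concat(partials).groupby(level=0).sum()
    combined["mean"] = combined["sum"] / combined["count"]
    return combined


def demo():
    """Build a small hypothetical CSV and aggregate it in chunks."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame(
        {
            "region": rng.choice(["north", "south", "east", "west"], size=1_000),
            "amount": rng.integers(0, 100, size=1_000).astype("float32"),
            "unused": rng.normal(size=1_000),
        }
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv")
        df.to_csv(path, index=False)
        return df, chunked_group_mean(path, "region", "amount", chunksize=128)
```

The point a candidate should be able to articulate is that memory is bounded by the chunk size rather than the file size, and that only aggregations that decompose into combinable partial results fit this pattern without extra work.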
Expect these follow-ups
- How does Polars' lazy evaluation differ from pandas' eager execution, and why is it faster for this workload?