As asked
Given a DataFrame with columns 'user_id', 'product_category', and 'spend', write Python code to add a column 'spend_percentile' that is each user's percentile rank of spend within their product_category group. Do not use a loop.
Sample answer outline
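The candidate should use groupby with transform and either pandas rank(pct=True) or scipy.stats.percentileofscore. The key is transform: it returns a result with the same index as the original DataFrame, so it can be assigned directly as a new column. They should state how ties are handled by passing an explicit argument (method for rank, kind for percentileofscore). They should also explain why transform rather than apply matters here: the shape of apply's output depends on what the function returns (a scalar yields one value per group), whereas transform always returns a result aligned to the original rows.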
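To make the tie-handling point concrete, here is a minimal sketch using a small made-up Series with one duplicated value (the data is purely illustrative). It compares three of the method options accepted by pandas rank when pct=True.

```python
import pandas as pd

# Hypothetical spend values for a single category; 100 appears twice.
spend = pd.Series([50, 100, 100, 200])

# 'average' (the pandas default): tied values share the mean of their ranks.
avg = spend.rank(pct=True, method="average")
# 'min': tied values all take the lowest rank in the tie.
low = spend.rank(pct=True, method="min")
# 'max': tied values all take the highest rank in the tie.
high = spend.rank(pct=True, method="max")

print(pd.DataFrame({"spend": spend, "average": avg, "min": low, "max": high}))
```

The two tied rows come out at 0.625 under 'average', 0.5 under 'min', and 0.75 under 'max', so a candidate who leaves method unspecified is implicitly choosing 'average'.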
Reference implementation (python)
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'product_category': ['A', 'A', 'B', 'B', 'A'],
    'spend': [100, 200, 150, 300, 50]
})

# Add 'spend_percentile': percentile rank of spend within each product_category.
# transform keeps the original index, so the result assigns back row for row.
df['spend_percentile'] = df.groupby('product_category')['spend'].transform(
    lambda x: x.rank(pct=True)
)
print(df)
Expect these follow-ups
- How would you do this if there were 10 million rows and you needed it to run in under 5 seconds?
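One direction a strong answer might take for the scale follow-up, sketched under assumptions: the data below is synthetic and sized down from 10 million rows, and the name big is made up for illustration. The idea is to call rank on the GroupBy object directly rather than through a Python lambda, and to store the grouping key as a categorical. Whether this meets a 5-second budget depends on hardware and data, so it should be timed rather than assumed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the large table (assumption: sized down for a quick run).
rng = np.random.default_rng(0)
n = 100_000
big = pd.DataFrame({
    "user_id": np.arange(n),
    "product_category": rng.choice(["A", "B", "C", "D"], size=n),
    "spend": rng.integers(1, 1000, size=n),
})

# A categorical key can be cheaper to group on than plain strings.
big["product_category"] = big["product_category"].astype("category")

# Built-in GroupBy.rank: no per-group Python function call.
big["spend_percentile"] = (
    big.groupby("product_category", observed=True)["spend"].rank(pct=True)
)

# Cross-check against the lambda-based version from the reference implementation.
slow = big.groupby("product_category", observed=True)["spend"].transform(
    lambda x: x.rank(pct=True)
)
print(np.allclose(big["spend_percentile"], slow))
```

The cross-check is there only to show that the two forms agree; in a real run it would be dropped and the fast path timed on the full data.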