As asked
Write a Python function that takes a list of event dictionaries, each with an 'id' and a 'ts' (Unix timestamp), and returns a list containing only the most recent record for each id. Optimize for the case where the list has millions of records.
Sample answer outline
A single-pass O(n) solution using a dict keyed by id and comparing ts values is the right answer. Candidates should avoid sorting the entire list first which is O(n log n). Bonus for noting that the dict approach works in constant space relative to unique keys and handles ties consistently.
Reference implementation (python)
def deduplicate(records: list[dict]) -> list[dict]:
# records = [{'id': 'a', 'ts': 1000}, {'id': 'a', 'ts': 2000}, ...]
latest = {}
for rec in records:
rid = rec['id']
if rid not in latest or rec['ts'] > latest[rid]['ts']:
latest[rid] = rec
return list(latest.values())Expect these follow-ups
- How would you parallelize this if the records were spread across 100 files on S3?