As asked
A researcher says yesterday's experiment improved benchmark accuracy by 3 percent, but nobody can reproduce it. What do you check first?
Sample answer outline
Start with the experiment record: exact code commit, model config, data snapshot, preprocessing version, random seeds, hardware, and library versions. Then check whether the reported metric came from the intended split and whether the evaluation script changed between runs. Strong answers discuss determinism limits on GPUs while still insisting on reproducible artefacts and comparable results. The candidate should propose a minimal rerun and then a full rerun under tracked configuration. A common failure is focusing only on the seed while ignoring data leakage or an untracked preprocessing change.
Expect these follow-ups
- Which parts of a GPU training run may remain non-deterministic?
- How would you store data snapshots without copying terabytes every time?
- What would make you distrust the original result?