As asked
Given a dictionary representing an ML pipeline DAG where keys are step names and values are lists of dependencies, implement a function that returns a valid execution order or raises an error if a cycle exists.
Sample answer outline
A correct answer implements Kahn's algorithm (BFS with in-degree tracking) or DFS with three-color marking. It handles the case where len(result) != len(graph) by raising a cycle-detected error. It should discuss time complexity O(V+E) and mention that this underpins pipeline schedulers like Airflow and Kubeflow.
Reference implementation (python)
def execution_order(pipeline: dict[str, list[str]]) -> list[str]:
"""
pipeline = {
'train': ['preprocess', 'validate'],
'preprocess': ['ingest'],
'validate': ['ingest'],
'ingest': [],
'evaluate': ['train'],
}
expected output: ['ingest', 'preprocess', 'validate', 'train', 'evaluate']
"""
# your implementation here
passExpect these follow-ups
- How would you extend this to support parallel execution of steps with no dependency on each other?
- What happens to your algorithm if the same step appears as a dependency multiple times?