Dear Hiring Manager,
The best compliment my last platform team paid me was that on call got boring. At Vantage Media I turned a service that paged engineers most nights into one that met a 99.95 percent availability target with budget to spare. Making reliability a measured, intentional thing rather than a constant scramble is the work I want to do for you.
Site reliability is about treating operations as a software problem, and that mindset runs through everything I do. I define service level objectives with the product teams so reliability has a clear target, I automate the toil that would otherwise eat a week, and I run blameless postmortems that end in concrete fixes. Error budgets give me a shared language with engineering about when to slow down and when to ship.
At Vantage Media I introduced SLO-based alerting that replaced 60 noisy threshold alarms with a handful of symptom-based pages tied to user impact. Alert volume fell by 78 percent, out-of-hours on-call interruptions dropped by two thirds, and we held availability at 99.95 percent across a year of steady traffic growth. The reduced noise let the team finally pay down its automation backlog.
Your advert mentions standing up an SLO practice across several teams, which is precisely the rollout I have done before. I would be glad to share how I sequenced it to win engineering buy-in rather than impose it. Would a short call suit you?
Yours sincerely, Helena Voss