Deimos
Description
Requirements
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 5+ years of experience in Software Engineering, SRE, DevOps, or Platform Engineering, with demonstrable ownership of reliability standards at a team or company level.
- Strong coding fluency: Proficiency in Python (or similar) with the ability to read, understand, reason about, and write production-grade automation code.
- Cloud & IaC: Hands-on experience with AWS, and a solid understanding of Infrastructure as Code (Terraform or CloudFormation).
- Deep Observability Knowledge: Demonstrable experience with monitoring tools (Datadog, Prometheus, ELK stack). Strong understanding of SRE concepts including Golden Signals, high-cardinality data handling, and error budget mathematics.
- Systems Thinking: Strong grasp of designing for scale and resilience, including graceful failure, circuit breaking, connection pooling, and multi-AZ deployments.
- Proven ability to define and drive reliability standards across multiple teams and to foster a blameless post-mortem culture.
Responsibilities
- Enablement & RelOps Culture
- Implement the Observability Ladder: Guide teams from basic monitoring to high-signal metric tracking. Work with product teams to define SLAs, SLIs, and SLOs, and build dashboards that track specific error budgets.
- Empower Product Teams: Build frameworks and deployment tooling (e.g., CI/CD, internal tooling integrations) that allow teams to make data-driven decisions on deployment safety and automate rollbacks when error budgets are depleted.
- Champion Reliability: Drive a blameless post-mortem culture focused on actionable takeaways, system improvements, and measurable metrics (MTBF, MTTR).
- Standardised Alerting & On-Call: Continuously improve company-wide alerting and on-call frameworks to reduce alert fatigue, ensuring alerts are highly actionable and symptom-based.
- Disaster Recovery: Drive the evolution of DR strategies from manual processes into fully automated runbooks-as-code, allowing teams to prove and improve service recoverability through autonomous, evidence-based testing.
- Eliminate Toil: Develop systems, automations, and tooling in Python (or similar) for pre- and post-deployment verification, making our hands-off reliability vision a production reality.
- Reliability-as-Code: Lead the drive to manage our entire reliability suite through IaC. Use Terraform to architect, deploy, and configure our observability stack, including ELK, Grafana, Loki, Prometheus, and tracing.