Site Reliability Engineer / Senior Site Reliability Engineer, Reliability | MLOps

Closing: Dec 31, 2022

This position has expired

Published: Dec 27, 2022

Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other GitLab production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our environments and the GitLab codebase. We specialize in systems, whether that means networking, the Linux kernel, or a more specific interest in scaling, algorithms, or distributed systems.


You may be a fit for this role if you:

  • Are able to reason about large systems: how they work at scale, their edge cases, failure modes, and behaviors.
  • Know your way around Linux and the Unix shell.
  • Have experience collaborating and communicating asynchronously.
  • Have significant professional experience with Python backend infrastructure.
  • Have experience working with PyTorch and TensorFlow (TF) infrastructure, or similar frameworks.
  • Can help improve the scalability and maintainability of ML features.
  • Can collaborate with other ML engineers and advise on the MLOps architecture from an infrastructure perspective.
  • Can aid in integrating every ML feature with GitLab.
  • Have a strong interest in defining infrastructure for large-scale ML recommendation engines (experience with this, however, is a nice-to-have).
  • Are comfortable working in the earlier stages of product development.
  • Have a genuine passion for learning.
  • Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, or similar technologies.
  • Are able to use GitLab as your day-to-day go-to tool.

 

Nice-to-have attributes:

  • Research or industry experience in ML engineering
  • Experience with Kubernetes and MLflow, Kubeflow, or a similar MLOps stack
  • Experience with cloud architecture optimization (GDF, Pub/Sub, GCP)
  • Experience with continuous training of models


Responsibilities
  1. Automate every operational task, which is a core requirement for this role: for example, package updates, configuration changes across all environments, and tools for automatic provisioning of user-facing services.
  2. Respond to platform emergencies, alerts, and escalations from Customer Support.
  3. Ensure systems exist to manage software life cycles (e.g. operating systems) with a minimum of manual effort.
  4. Develop a fully automated multi-environment observability stack based on the existing SaaS system, and extend it to predict capacity needs based on usage patterns.
  5. Plan for new service roll-outs and for the expansion and capacity management of existing services, and work with users to optimise their resource consumption.

