Site Reliability Engineer Intern

Closed for applications

Required Knowledge, Qualification and Experience

Bachelor's Degree in Computer Science, Information Technology, or a related field.
Some exposure to Kubernetes and cloud networking.
Some experience with monitoring and observability tools.
Good exposure to managing production systems in cloud environments.
Some exposure to implementing and managing CI/CD pipelines with tools such as Jenkins, GitLab CI/CD, or equivalent.
Some exposure to cloud platforms (AWS, Azure, Google Cloud) and containerization tools such as Docker and Kubernetes.
Basic hands-on exposure to monitoring and metrics systems such as Prometheus.
Basic familiarity with dashboarding and visualization tools such as Grafana.
Foundational understanding of log aggregation systems such as Loki.
Familiarity with Linux environments and basic system commands.
Exposure to scripting concepts using Python, Bash, or similar languages.
Foundational knowledge of Artificial Intelligence (AI) and good exposure to AI agents; relevant certifications in AI or related disciplines will be an added advantage.

Send your resume and portfolio with the subject SITE RELIABILITY ENGINEER INTERN to the email provided.

Assist in designing, implementing, and continuously improving system reliability, availability, and performance by helping to define and monitor SLIs, SLOs, and error budgets across all assigned platforms.
Help build and manage a robust monitoring and observability framework using Prometheus, Grafana, and Loki to track latency, traffic, errors, system health, and user impact.
Assist in automating infrastructure provisioning, scaling, and configuration management using Infrastructure as Code principles with Terraform and Kubernetes to ensure consistency, scalability, and disaster recovery readiness.
Participate in incident response processes, including detection, escalation, resolution, communication, and blameless postmortems to prevent recurrence.
Assist in reducing manual operational workload through automation, scripting, and process optimization to improve efficiency and release velocity.
Support efforts to ensure high availability and performance of business-critical systems.
Collaborate with Engineering, Product, and DevOps teams to assist in improving deployment safety, capacity planning, cost optimization, and system scalability.
Assist in establishing alerting strategies and reliability standards that minimize alert fatigue while ensuring rapid detection and resolution of production issues.