
Equity Bank Kenya
Qualifications
KEY TECHNICAL SKILLS & COMPETENCIES
- Elasticsearch, Logstash, Kibana (ELK Stack)
- Microsoft Azure
- Unix / Linux and Shell Scripting
- SQL and database concepts
- Monitoring and observability tools
- Strong analytical, problem‑solving, and documentation skills
EXPERIENCE REQUIREMENTS
- Minimum 2 years’ experience in a Site Reliability Engineering, DevOps, or Production Support role
- Mandatory hands‑on experience with ELK Stack
- Experience supporting banking or enterprise‑scale applications
ACADEMIC QUALIFICATIONS & CERTIFICATIONS
- Bachelor’s degree in Science, Engineering, Information Technology, or a related field
- Nice to have: ELK, Azure, or other relevant cloud/observability certifications
Responsibilities
1. ELK Engineering and Log Analytics
- Install, configure, and maintain ELK stack components (Elasticsearch, Logstash, Kibana, Beats) across environments.
- Design efficient dashboards, graphs, and visualizations that translate application logs into business‑readable insights.
- Analyze application logs to identify trends, risks, and incidents affecting system performance and availability.
- Develop customized reports, bar charts, and pie charts to support operational and business decision‑making.
- Implement ELK‑triggered auto‑healing and remediation scripts to detect and resolve incidents proactively.
2. Toil Reduction and Automation
- Identify repetitive, manual, and reactive operational tasks and eliminate them through automation.
- Develop scripts and tools using languages such as Python, Bash, or Go to automate system maintenance and operational workflows.
- Implement Infrastructure as Code (IaC) using tools such as Terraform or Ansible to ensure consistent, repeatable infrastructure provisioning.
- Design and implement self‑healing systems capable of automatic recovery from common failures without human intervention.
3. Monitoring, Alerting, and Observability
- Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) in collaboration with business and development teams.
- Build and maintain robust monitoring, logging, and observability solutions using tools such as ELK, Prometheus, Grafana, or equivalent platforms.
- Configure intelligent, actionable alerts that minimize noise and false positives while ensuring rapid incident detection.
- Continuously improve monitoring coverage and system visibility to support proactive operations.
4. Incident Response and Management
- Participate in on‑call rotations to respond to critical system alerts and production incidents.
- Diagnose, mitigate, and resolve incidents to restore services within agreed SLAs.
- Conduct blameless post‑incident reviews to identify root causes and define preventative actions.
- Develop and maintain runbooks and playbooks for common incident scenarios to improve response time and consistency.
5. Capacity Planning and Performance Optimization
- Analyze historical system usage and trends to forecast future capacity requirements.
- Perform system and database performance tuning in collaboration with development teams.
- Conduct load and stress testing to identify bottlenecks before they impact production systems.
- Ensure systems are cost‑efficient, scalable, and capable of supporting business growth.
6. Cross‑Functional Collaboration
- Work closely with software development teams during solution design to ensure reliability, scalability, and operational readiness.
- Promote a DevOps and SRE culture through shared ownership of system reliability (“You Build It, You Run It”).
- Share knowledge, best practices, and documentation to uplift operational maturity across teams.