Site Reliability Engineer
Experience: 1-3 years
Key Responsibilities:
Develop software systems for observability platforms and enhance product monitoring, reliability, and development efficiency.
Take on-call shifts to monitor and respond to incidents and keep services healthy. The on-call rotation runs in 8-hour shifts and covers workdays, weekends, and holidays.
Provide L2 support for user requests, assisting customers with troubleshooting product usage issues.
Learn from incidents through blameless post-mortems and address service reliability issues through hands-on coding.
Establish SRE best practices within product teams, including capacity planning, chaos testing, and disaster recovery drills.
Improve the efficiency of development and operations teams by reducing toil through automation.
Skills Needed:
2+ years of experience in roles such as SRE, DevOps, cloud engineering, observability engineering, or Zabbix monitoring.
Intermediate to advanced skills in Python, Linux, Terraform, CloudWatch, Grafana, Prometheus, log management, and networking.
Intermediate to advanced expertise in AWS, Kubernetes, Infrastructure as Code, Docker, Jenkins, GitHub Actions, and GitLab CI.
Proficient in production on-call, troubleshooting, and incident management.
Business-level English skills.
Nice to Have:
Hands-on experience with SRE best practices, including SLO monitoring, disaster recovery planning, chaos testing, capacity planning, automation, and toil reduction.
Experience with APM solutions and monitoring systems such as Prometheus, Grafana, and GCP monitoring.
Previous experience as a system engineer or Linux administrator.
AWS, GCP, or Kubernetes certifications.