Site Reliability Engineer
Role: Site Reliability Engineer
Skills (Mandatory): Azure Cloud Infrastructure
Skills (Primary): Azure Cloud Infrastructure Support
Total Experience: 3 to 8 years
Work Location: Cochin/TVM/Remote
Job Purpose (both Onsite / Offshore)
We are seeking a Site Reliability Engineer (SRE) to support and maintain a 24×7 Azure cloud environment, ensuring high availability, reliability, and performance of infrastructure and hosted services. This role requires the engineer to operate across L1 and L2 support responsibilities, combining proactive monitoring with advanced troubleshooting and root cause analysis.
The ideal candidate is an IT generalist with strong networking, system administration, and customer service skills, capable of owning customer issues end-to-end in dynamic and evolving environments. The role demands the ability to troubleshoot complex technical problems and communicate solutions clearly to both technical and non-technical stakeholders.
Job Description / Duties and Responsibilities
Azure Cloud Infrastructure Support (L1 & L2)
• Provide 24×7 monitoring, support, and maintenance of Azure cloud infrastructure to ensure high availability, performance, security, and reliability.
• Perform real-time monitoring and alert response using Azure Monitor, Log Analytics, Application Insights, and third-party monitoring tools.
• Manage and support Azure Virtual Machines (Windows and Linux) including provisioning, scaling, start/stop, patching, backup, restore, and performance troubleshooting.
• Support Azure networking components including Virtual Networks (VNets), Subnets, Network Security Groups (NSGs), User Defined Routes (UDRs), Load Balancers, Application Gateways, Azure Firewall, VPN Gateways, and ExpressRoute connectivity.
• Administer and support Microsoft Entra ID (formerly Azure Active Directory), including user and group management, role-based access control (RBAC), conditional access, and identity troubleshooting.
• Support Azure Storage services (Blob, File, Disk, Queue, Table) including access control, performance tuning, capacity management, and issue resolution.
• Provide L1/L2 support for Azure PaaS services such as App Service, Azure SQL Database, Azure SQL Managed Instance, and Azure Kubernetes Service (AKS), focusing on availability, connectivity, and configuration-related issues.
• Perform capacity planning and performance analysis, proactively identifying resource constraints and recommending scaling or optimization actions.
• Manage cost monitoring and optimization activities by identifying underutilized resources, supporting right-sizing efforts, and providing usage insights.
System Administration & Remote Support
• Manage users in Windows Terminal Server / Remote Desktop Services (RDS) environments, both on-premises and hosted.
• Act as a remote administrator for customer Windows servers, including user account management, shared file access, print services, and print queue configuration.
• Support Windows print services, server backup, restore, and recovery operations.
• Perform routine OS patching, system maintenance, and health checks across Windows and Linux environments.
Networking & Firewall Administration
• Support and administer Fortinet / FortiGate firewalls, including policy management, site-to-site and dial-up VPN configuration, VLAN setup, and basic network troubleshooting.
• Troubleshoot customer WAN, LAN, and VPN connectivity issues, including routing and firewall policy problems.
• Diagnose and resolve local LAN and network-related issues impacting application and infrastructure availability.
• Collaborate with internal and customer network teams to ensure secure and reliable connectivity.
Customer Support & Incident Ownership
• Troubleshoot and resolve technical issues remotely, by phone and with remote support tools, taking full ownership from initial contact to resolution.
• Communicate complex technical issues and solutions clearly to non-technical or less tech-savvy users.
• Provide timely updates to customers during incidents, escalations, and service restoration activities.
• Handle after-hours and emergency support requests on a rotational on-call basis.
• Perform incident, problem, and change management activities in line with ITIL processes, including triage, escalation, RCA, and post-incident reviews.
• Conduct root cause analysis (RCA) for recurring or critical incidents and implement preventive measures to reduce future outages.
Documentation & Collaboration
• Document incidents, resolutions, and troubleshooting steps in the call tracking / ticketing system in accordance with defined documentation standards.
• Develop and maintain runbooks, SOPs, and knowledge articles for recurring issues.
• Build strong working relationships with customers and internal teams to improve service quality and operational efficiency.
Job Specification / Skills and Competencies
• Minimum 2 years of experience in a technical customer support, infrastructure support, or systems administration role.
• Hands-on experience supporting Windows Server environments, networking, and firewalls.
• Strong knowledge of Microsoft Azure infrastructure and services.
• Solid Windows system administration skills (Active Directory, RDS, file services, print services).
• Strong networking fundamentals (TCP/IP, DNS, routing, VPNs, firewalls).
• Hands-on experience with Fortinet / FortiGate firewalls or similar enterprise firewalls.
• Experience with monitoring, alerting, and incident management tools.
• Excellent troubleshooting and analytical problem-solving skills.
• Strong customer service mindset with the ability to remain calm under pressure.
• Ability to communicate complex technical concepts clearly to novice or non-technical users.
• Ability to work independently while effectively leveraging available tools, documentation, and team support.
• Strong documentation, time management, and multitasking skills.
Certifications Desired
• Cloud certifications, ITIL certification, CompTIA Network+ certification, or equivalent.
• Experience administering and troubleshooting enterprise firewalls such as Cisco, Fortinet, Palo Alto, or similar.
Work Model
• 24×7 shift-based operations on a rotation basis, including weekends and holidays.
• On-call and after-hours support on a rotational basis.
• Role includes both L1 and L2 responsibilities with full ownership of incidents.
• Adhere to the Information Security Management policies and procedures.