Site Reliability Engineer

<p>We are looking for a <strong>Site Reliability Engineer (SRE)</strong> with a strong <strong>application engineering background</strong> to improve <strong>application reliability, observability, and incident resolution</strong> across a complex enterprise landscape.</p><p>This role will focus on <strong>understanding application behavior, diagnosing performance issues, and reducing Mean Time to Resolution (MTTR)</strong>, rather than solely managing infrastructure or CI/CD pipelines.</p><p><strong>Key Responsibilities</strong></p><p><strong>Application Reliability & Issue Resolution</strong></p><ul><li>Analyze and troubleshoot <strong>application failures, latency issues, and degraded performance</strong> across distributed systems</li><li>Perform <strong>deep-dive root cause analysis (RCA)</strong> to identify underlying application-level issues</li><li>Work with engineering teams to <strong>quickly isolate failing components and dependencies</strong></li><li>Reduce <strong>MTTR (Mean Time to Resolution)</strong> through improved diagnostics and runbooks</li></ul><p><strong>Application Observability & Diagnostics</strong></p><ul><li>Assess current application landscape and identify <strong>gaps in logging, tracing, and monitoring</strong></li><li>Implement and enhance <strong>application-level observability</strong> (logs, metrics, traces)</li><li>Enable faster issue identification by improving <strong>service visibility and dependency mapping</strong></li><li>Define and standardize <strong>health checks and alerting strategies</strong> for applications</li></ul><p><strong>System Understanding & Mapping</strong></p><ul><li>Develop a clear understanding of <strong>application architecture, data flows, and service dependencies</strong></li><li>Build and maintain <strong>application topology and dependency maps</strong></li><li>Identify <strong>single points of failure and performance bottlenecks</strong></li></ul><p><strong>Performance 
Engineering</strong></p><ul><li>Analyze application performance and recommend improvements for <strong>scalability and responsiveness</strong></li><li>Identify issues related to <strong>threading, memory, database interactions, and API latency</strong></li><li>Work with developers to optimize <strong>code paths, queries, and service interactions</strong></li></ul><p><strong>Incident Management & Process Improvement</strong></p><ul><li>Lead or support <strong>incident triage and war-room calls</strong></li><li>Improve <strong>incident response processes and escalation paths</strong></li><li>Create and maintain <strong>runbooks, playbooks, and troubleshooting guides</strong></li><li>Identify recurring issues and drive <strong>permanent fixes rather than temporary patches</strong></li></ul><p><strong>Collaboration & Engineering Enablement</strong></p><ul><li>Partner with <strong>application development teams</strong> to embed reliability best practices</li><li>Provide guidance on <strong>error handling, resiliency patterns, and fault tolerance</strong></li><li>Equip teams with tools and practices for <strong>self-service diagnostics</strong></li></ul><p><strong>Required Skills & Experience</strong></p><ul><li><strong>5–10 years of experience</strong> in application engineering, production support, or SRE roles</li><li>Strong experience in <strong>application troubleshooting and debugging</strong> (Java/.NET/Node.js preferred)</li><li>Solid understanding of <strong>distributed systems and microservices architectures</strong></li><li>Experience with <strong>application logs, debugging tools, and performance profiling</strong></li><li>Familiarity with <strong>observability tools</strong> (Splunk, Dynatrace, AppDynamics, Datadog, etc.)</li><li>Strong understanding of <strong>API behavior, database interactions, and system integrations</strong></li><li>Experience working in <strong>production support / incident management environments</strong></li></ul><p><strong>Preferred 
Skills</strong></p><ul><li>Experience implementing <strong>distributed tracing (OpenTelemetry, Jaeger, Zipkin)</strong></li><li>Knowledge of <strong>cloud environments (AWS/Azure/GCP)</strong></li><li>Exposure to <strong>resiliency patterns (circuit breakers, retries, fallbacks)</strong></li><li>Experience with <strong>performance tuning and load analysis</strong></li></ul>

