SRE Lead
About the role
Position: SRE Lead
Location: Toronto, ON (Onsite)
Employment Type: Full-Time
Total Experience: 10+ years
Job Summary: We are seeking an experienced SRE Lead to drive reliability, observability, automation, and operational excellence across complex enterprise platforms. The ideal candidate will possess deep expertise in cloud-native and on-prem ecosystems, advanced observability practices, and large-scale transaction processing environments.
Roles & Responsibilities:
· Lead and execute SRE roadmap initiatives, capability assessments, and reliability improvement programs.
· Design, implement, and optimize observability solutions across applications, infrastructure, platforms, and networks.
· Serve as the SME for Dynatrace, including DQL, Grail, Gen3 Dashboards, ActiveGate, SRG Workflows, and Business Events.
· Drive end-to-end troubleshooting and root cause analysis across distributed enterprise systems.
· Build and enhance monitoring frameworks leveraging Metrics, Events, Logs, and Traces (MELT).
· Implement SRE best practices, platform engineering capabilities, self-service tooling, and policy-as-code frameworks.
· Develop automation solutions using Python, Node.js, AWS Lambda, ECS, and backend integrations.
· Establish cloud observability standards across AWS services including CloudWatch, API Gateway, Lambda, and Application Signals.
· Design monitoring strategies for highly integrated enterprise and financial systems, including middleware and AI-driven platforms.
Required Skills & Qualifications:
· 10+ years of experience in Site Reliability Engineering, Production Support, Platform Engineering, or Observability Engineering.
· Strong expertise in Dynatrace and enterprise observability platforms.
· Hands-on experience with AWS cloud services and monitoring ecosystems.
· Proficiency in Python and/or Node.js for automation and operational tooling.
· Deep understanding of distributed systems, performance engineering, and reliability practices.
· Experience supporting large-scale financial services or other mission-critical enterprise environments.
· Strong leadership, stakeholder management, and strategic planning capabilities.
Preferred Qualifications:
· Experience with IBM DataPower, API platforms, and enterprise integration technologies.
· Knowledge of Google SRE principles and modern platform engineering practices.
· Experience monitoring AI/ML-driven applications and services.