Manager of Platform & Quality Engineering (TELUS Consumer Health)
Toronto, ON, CA, M5J 2V5; Vancouver, BC, CA; Edmonton, AB, CA; Calgary, AB, CA
Description
About TELUS Consumer Health
At TELUS, we are at the forefront of transforming healthcare through innovative technology. Our Consumer Health team is on a mission to empower individuals to live healthier lives, driven by the goal to become the leading provider of integrated digital health solutions in Canada.
We are currently undergoing a fundamental transformation, shifting our focus from sustainability toward aggressive growth. Our diverse portfolio includes virtual primary care (TELUS Health MyCare), remote patient monitoring (Home Health Monitoring), senior care solutions (TELUS Health Medical Alert), and even pet health (TELUS Health MyPet). We are looking for leaders who are passionate about building scalable, resilient systems that have a tangible, positive impact on people's lives and who are excited to solve these large-scale technical challenges.
About the Role
The Manager of Platform & Quality Engineering will lead a high-performing team of Platform and Quality Engineers responsible for the design, development, and operation of our end-to-end cloud-native platform across AWS and GCP. This strategic leadership role requires balancing technical innovation with operational stability, driving a DevSecOps culture, and ensuring the highest standards of service reliability and quality engineering (QE) across all services. The successful candidate will transition from individual output to maximizing impact through team growth, architectural guidance, and process optimization, specifically leveraging AI/ML techniques to drive proactive, predictive, and efficient platform operations.
Key Responsibilities
Platform Strategy & Architecture
- Vision & Roadmap: Define the strategic vision, roadmap, and technical direction for the Internal Developer Platform (IDP), ensuring alignment with organizational goals for speed, reliability, and cost-efficiency
- Cloud Governance: Own and manage the entire AWS/GCP infrastructure footprint, ensuring best practices in networking, security, account structure, and FinOps (cloud cost optimization), including leveraging AI-powered cost optimization tools for predictive spending and anomaly detection
- System Design: Oversee the architectural design and implementation of core platform components, including Kubernetes clusters, service mesh (Istio), and API gateways (Kong), leveraging expertise in Infrastructure as Code (Terraform, Terragrunt) and containerization
- Technology Evaluation: Actively stay current with new technologies in the Cloud Native (L4 DevOps) and SRE space, including AIOps, to drive continuous improvement and future-proof the platform
Service Reliability Engineering (SRE) & Operations
- Reliability Ownership: Lead the implementation of SRE practices, including defining and monitoring key performance indicators (KPIs), Service Level Indicators (SLIs), and Service Level Objectives (SLOs) for critical platform services
- Incident Management: Oversee the on-call rotation, drive efficiency in incident response processes, and ensure effective post-mortem analysis and preventative action implementation to improve system health (L3/L4 SRE), with a focus on leveraging AIOps for proactive failure prediction and intelligent alert correlation to reduce Mean Time To Detect (MTTD) and Mean Time To Respond (MTTR)
- Observability: Drive the strategy for comprehensive observability (logging, metrics, tracing) using tools like Datadog/Prometheus/Grafana, integrating AI-driven insights to predict performance bottlenecks and detect operational anomalies before they impact users
Quality Engineering (QE) & DevSecOps
- Quality Integration: Integrate Quality Engineering principles ("shift-left testing") into the platform, ensuring automated testing and quality gates are built directly into the CI/CD pipelines (GitHub Actions, FluxCD), exploring the use of AI-driven test case generation and optimization for enhanced coverage
- Security & Compliance: Partner with the security team to enforce robust security practices and compliance requirements across the platform, pipelines, and application services (e.g., OWASP, Vault management), including the adoption of AI-enhanced security monitoring (SecOps) tools for threat pattern recognition and automated remediation
- Automation: Champion automation initiatives across configuration, deployment, testing, and operational tasks to scale systems sustainably and eliminate toil (L3/L4 DevOps), using AI/ML to identify and automatically address common sources of toil
People Leadership and Team Development
- Team Building and Growth: Build, lead, and scale a high-performing team of Platform and Quality Engineers, including hiring, onboarding, and retention. Conduct regular 1:1s focused on development, feedback, and removing blockers. Foster a culture of ownership, accountability, and continuous learning
- Performance Management: Set measurable goals using OKRs or similar frameworks aligned with company objectives. Conduct performance reviews with actionable feedback and development plans. Address and coach through performance issues with supportive improvement plans
- Team Operations & Culture: Lead sprint planning, backlog grooming, and sprint reviews. Track deliverables, identify risks, and proactively address blockers. Facilitate technical discussions and ensure proper delegation of agile ceremonies. Build a blameless culture focused on learning from incidents
Cross-Functional Collaboration & Stakeholder Management
- Stakeholders: Align the platform roadmap with business goals by working with Engineering Directors and Software Engineering leaders and teams. Collaborate with Security and Compliance on regulatory needs, coordinate with Finance on cloud budgeting, support Product Engineering to resolve developer challenges, and partner with Talent Acquisition to develop hiring and evaluation strategies
- Communication: Translate technical concepts into business value for executives and non-technical stakeholders. Maintain transparent communication across all channels. Build team consensus on architecture, technology, and process decisions
Qualifications
Required
- Experience: 7+ years of hands-on experience in Platform/DevOps/SRE, including 5+ years in a technical leadership or people management role with a focus on Agile methodologies (Scrum, Kanban)
- Demonstrated cross-functional leadership, project/program management, and financial management skills
- Technical Expertise: Expert-level knowledge of Kubernetes administration, architecture, and running applications at scale
- Cloud Mastery: Extensive experience with AWS and/or GCP, including networking, security, and cloud-native services
- IaC: Deep proficiency with Terraform (and ideally Terragrunt) for building, managing, and securing infrastructure
- Automation: Proven track record in designing, building, and maintaining robust CI/CD pipelines using tools like GitHub Actions
- SRE Fundamentals: Demonstrated experience implementing and utilizing SLOs, SLIs, and Error Budgets to manage service reliability
- Leadership: Excellent problem-solving, decision-making, and communication skills, with proven ability to lead technical initiatives and mentor senior engineers
- Stakeholder & Vendor Management: Proven ability to build strong relationships with stakeholders and effectively manage external vendors
Preferred
- Master's degree in Computer Science or related field
- Certified Kubernetes Administrator (CKA) or equivalent cloud certification (e.g., AWS Solutions Architect / GCP Professional Cloud DevOps Engineer)
- Experience in configuring and maintaining GitOps tools (e.g., FluxCD)
- Experience managing platforms for microservices and event-driven architectures
- Experience with advanced observability platforms and security tooling (e.g., HashiCorp Vault)
- Direct experience implementing or operating an AIOps solution for SRE/Cloud cost management
- AIOps Acumen: Understanding of AIOps principles and experience evaluating or implementing AI/ML-driven tooling for log analysis, predictive alerting, or performance optimization
- Experience with Mobile Device Management (MDM) tools and processes