Site Reliability Engineer Program Manager

  • £50,000 - £55,000 per annum
  • Maidenhead
  • Posted: 31/07/2018
  • Permanent
  • Job Ref: 216102609

Job Details

Site Reliability Engineer (SRE) – Program Manager

Location: Maidenhead, Berkshire


• Own end-to-end availability for a product service

• Work with product service teams to establish SLIs and error budgets, and nurture an environment that appreciates the value they add

• Identify opportunities to improve monitoring capabilities (white-box and black-box)

• Identify long-term trends for product services (How is our traffic growing over time? How big is the database getting? What do our resource usage patterns look like over time?)

• Ensure that short-term hacks are replaced with long-term solutions

• Co-ordinate incident response as part of an on-call rotation, ensure that SREs aren't overloaded by on-call, and continually refine the processes and tools that enable us to respond to incidents successfully

• Ensure that root cause analyses (RCAs) are carried out effectively and in a blame-free manner

• Attend portfolio management team meetings to flag reliability considerations for upcoming work and to discuss reliability concerns raised by other stakeholders

• Populate the SRE backlog

• Identify requirements surrounding load testing, security testing, availability and disaster recovery

• Help mature the delivery process for teams: defining Jenkins pipelines, designing canary release deploys, building in automated fallbacks, optimizing the build chain, etc.

• Optimize product service code to ensure that it's secure, scalable and performant

• Optimize release engineering code to ensure that it's stable, repeatable and fast

• Improve fault detection for our services

• Create dashboards that help communicate the metrics for a given product service

• Work with product owners and product engineering teams to perform capacity planning

• Work with product engineering teams to understand performance and behavior patterns

• Help carry out root cause analysis for incidents, and design solutions (both software and human processes) that help ensure the same problem doesn't happen in the same way again

Critical Skills / Competencies:

• Comfortable writing code in one or more of the following languages: Python / Go / Java / C# / C / C++

• Experience working with product owners and product development to prioritize work, flag risk and identify potential production engineering issues (e.g. scalability, resiliency, performance)

• A positive attitude and willingness to learn

• Experience managing services in AWS

• Experience with IaaS and serverless services from a cloud provider

• An understanding of TCP/IP and DNS, and experience designing networks

• Linux system administration experience

• Strong conflict resolution skills

• Excellent written and verbal communication skills

• Experience implementing fault detection and automating fixes

• An understanding of SQL databases

• Experience designing scalable services

• Experience designing distributed, fault-tolerant systems

• Detail-oriented. The ideal candidate naturally digs as deep as needed to understand the why

