Manager, Site Reliability Engineering
MINDBODY | Products & Engineering
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that MINDBODY's services—both our internally critical and our externally-visible systems—have reliability and uptime appropriate to users' needs and a fast rate of improvement while keeping an ever-watchful eye on capacity and performance.
SRE is also a mindset and a set of engineering approaches to running better production systems. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting and dynamic day-to-day work.
SRE's culture of diversity, intellectual curiosity, problem solving, and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.
PRINCIPAL DUTIES AND RESPONSIBILITIES:
- Lead and manage a team responsible for incident management, detection, change execution and approvals, and maintenance for all integrated properties, as well as root cause analysis, remediation, and other proactive measures to improve stability and performance and minimize the risk of impact to customers.
- Review existing processes and recommend changes or institute new processes as necessary, including in areas such as monitoring, upgrades, and tuning.
- Own end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence; automate response to all non-exceptional service conditions.
- Lead by example, care for your team, and establish credibility through the quality of your own and your team's technical execution.
- Manage on-call rotations across continents, using a follow-the-sun model.
- Work collaboratively with internal teams to drive resiliency improvements and reduce our MTTD and MTTR.
- Participate in After Action Reviews and facilitate discovery of root cause. Conduct root cause analysis and drive repair of the problem.
- Ensure adherence to standards, policies, and procedures.
- All other duties as assigned
QUALIFICATIONS:
- Bachelor's degree in Computer Science or related technical field, or equivalent practical experience.
- Experience in software development in one or more of the following: C#/.NET, C, C++, Java, Go, Perl, Python, or Ruby.
- Azure DevOps (VSTS), ASP.NET experience desired
- Experience managing an engineering team on large-scale projects with technical deep-dives into code, networking, operating systems and/or storage.
- Experience managing IT systems teams in a virtualized or cloud environment
- Proficiency working with algorithms, data structures and production troubleshooting.
- Expertise in problem solving and analyzing global scale distributed systems.
- Familiarity with Agile methodology
- Strong sense of ownership
- Mastery of scripting languages (PowerShell, Python, Perl)
- Mastery of container orchestration tools (Kubernetes, Docker Swarm)
- Mastery of cloud infrastructure services (AWS, GCP, Azure)
- Detail-oriented and professional, with a positive work attitude