In January 2018, digital experience monitoring firm Catchpoint conducted a survey of 416 professionals with the title or responsibilities of a Site Reliability Engineer (SRE). The goal of the survey was to find out what it really means to be an SRE, examining the types of organizations, skills, and culture that exist where site reliability engineers work.
The survey revealed that, while the definition of an SRE varies tremendously across and even within organizations, there are also similarities. Knowing what to expect when you take on an SRE role can save a lot of headaches down the road. As one respondent put it:
"People often 'label' SREs and try to define a specific operational role that is external to the development process, instead of recognizing that they are as intrinsic and important to the development process as having DBAs, back-end developers, front-end developers, UI devs, etc., who—all together—assume responsibility for building and running a scalable, well-functioning service."—Gary Colman, Senior Engineering Manager, LinkedIn
If you are looking to build out an SRE team in your organization, here are some key considerations.
Focus on diversity of thought and skills
Read through a typical SRE job description, and one of the first qualifications you'll see is a degree in computer science. However, our survey found that 39% of SREs do not meet this criterion. In fact, 20% of respondents do not hold a degree, and 19% studied something other than computer science. Areas of study included philosophy, political science, theater, zoology, and business. Restricting your search to only candidates with a computer science degree will result in a less diverse team and potentially exclude a large number of skilled SREs.
“A good SRE has an ability to critically examine a system and use that to guide them when asking questions of the system. Technical skill allows you to ask the question, but it doesn't help you ask the right ones to find the cause of a problem.”—Jamie Wilkinson, SRE at Google
A July 2017 article in the Harvard Business Review cited the need for a well-rounded learning experience to help people develop the ability to ask the right questions and understand and respond to human needs. If you’re not asking the right questions, you won’t be able to solve the right problems.
SREs need to be able to solve problems effectively, learn on the fly, and make decisions quickly. These skills are enhanced by the top 5 non-technical skills for an SRE:
- Composure under pressure
- Written communication
- Verbal communication
Being able to solve problems effectively requires the ability to work well with others. SREs should not be expected to know all the answers; instead, they should be able to know who on the team or within the organization to ask for help and how to communicate with them. Look for SREs from a variety of backgrounds and majors.
Be clear about expectations and responsibilities
No two SREs or SRE teams are the same; the role encompasses a variety of skills and responsibilities. Google's SRE team may have written a book on site reliability engineering, but each company has its own unique needs. The role's importance needs to be communicated throughout the organization, not just in engineering and operations. Fewer than 50% of SREs feel their role is well communicated within engineering, and that number drops to 44% when the larger organization is considered. This leaves SREs feeling undervalued and disrespected.
Smaller organizations may require SREs to take on additional responsibilities that fall to other departments in larger organizations, while SREs at larger organizations are more likely to contribute to the product roadmap and develop new product features.
Be upfront about the pace of work. Most SREs work at organizations that do multiple code deployments every day. This fast pace does not mean that technical documentation and maintenance of operational runbooks can be overlooked. SREs need to take the time to document processes as they go, which requires strong written communication skills.
Whether you require SREs to build their own tools or provide a toolbox of open-source and commercial solutions depends on the size of your company. Smaller organizations often focus more on open-source and vendor solutions, while many larger enterprises build tools internally to meet their needs.
SRE != DevOps
Building an SRE team does not automatically make you DevOps. DevOps is about more than toolsets or employee titles. Site reliability engineers work on automating and increasing the reliability of supported services, which is one way to employ DevOps. DevOps refers more to the culture of an organization and the processes used. Gene Kim defines the DevOps way as including the processes of flow, feedback, and continuous improvement and learning. Even if you are automating and measuring tasks and sharing the information, if the culture does not include collecting feedback and learning, it is not DevOps. The SRE role should not be seen as a cure for broken culture; the culture must be fixed first.
Based on the findings of the SRE survey, here are some key elements to include in the SRE job description, regardless of the size of your organization:
- Bachelor's degree
- 2+ years in an operations or software engineering role
- Excellent verbal and written communication skills
- Strong problem-solving skills
- Passion for technology as well as helping customers and team members
- Expertise with cloud- and continuous-deployment-based software development lifecycles
- Mastery of infrastructure automation technologies
- Eagerness to learn