
An SRE's Most Important Skill? Communication



I wish someone had told me that I shouldn’t hop between frameworks. Just like learning four programming languages in your first year, context switching as a beginner is, in my experience, wasted effort. If I’d spent a solid year learning how to deploy services on AWS, then when it was time to learn Azure, I’d have seen more similarities than differences and found it a lot easier to pick up a second public cloud. The same is true for the time I spent learning Selenium AND Cypress for end-to-end testing. I wish I’d just stuck with one or, even better, waited ‘til Playwright was released 😉.

Last week I asked the people of the r/SRE subreddit the question: ‘What’s one piece of advice you wish you’d gotten sooner?’ and got a great variety of replies. Interestingly, some commenters even disagreed with my one piece of advice, saying that being a broad generalist was preferable to focusing on one framework!

Here’s some of the advice I got, along with some general lessons I think all SREs could benefit from.

People and communication matter

Technical proficiency is only one part of the puzzle. Effective communication is the glue that holds teams, concepts, and advancements together, and we ignore it at our peril. User u/wugiewugiewugie described how long it took them to give the human side of technology the priority it deserves:

It took me half a decade to appropriately prioritize the human aspect of technology in what made me do a better job — today I tend to quote Deming's "System of profound knowledge"

This theme of translation and collaboration ran throughout the thread. Finding common ground between different stakeholders matters, as u/bigvalen points out:

Sometimes leadership does not speak the same language. Find a translator and work with them.

User u/bigvalen follows up with a fantastic anecdote about speaking a language that others understand:

I had a giant fight many years ago, trying to buy $200 of hardware that would make a single point of failure more reliable. Because I couldn't show what the reliability improvement would be in terms that mattered to an accountant. Until a sales guy suggested leading with how much revenue we booked through the box. I was more interested in company reputation. Once I pointed out that one box made us €50k/month, the accountant countered with "how about I give you $5k, and you upgrade everything".

While it might seem like you’re fighting needless bureaucracy and top-down control, finding a ‘translator’ can make working in these systems quite a bit easier. As technical people, we all know that single points of failure are bad and that high-traffic services should have redundancy. We often don’t know or care how much revenue each bit of traffic represents. In the anecdote above, stating the business impact in revenue terms is what finally got the upgrade approved.
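If you want to try this kind of translation yourself, here is a rough Python sketch of the arithmetic. It borrows the 50k/month revenue and 200 hardware cost from the anecdote and treats them as one currency for simplicity. The availability figures (99% before, 99.9% after) and the function names are my own assumptions for illustration, not numbers from the thread.

```python
def revenue_at_risk(monthly_revenue: float, availability: float) -> float:
    """Expected monthly revenue lost to downtime at a given availability (0 to 1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return monthly_revenue * (1.0 - availability)


def payback_months(upgrade_cost: float, monthly_revenue: float,
                   availability_before: float, availability_after: float) -> float:
    """Months until the upgrade pays for itself through avoided downtime losses."""
    saved = (revenue_at_risk(monthly_revenue, availability_before)
             - revenue_at_risk(monthly_revenue, availability_after))
    if saved <= 0:
        return float("inf")  # the upgrade never pays back
    return upgrade_cost / saved


if __name__ == "__main__":
    monthly_revenue = 50_000  # revenue booked through the box (from the anecdote)
    upgrade_cost = 200        # hardware cost (from the anecdote)
    before, after = 0.99, 0.999  # hypothetical availability figures

    print(f"At risk before: {revenue_at_risk(monthly_revenue, before):.0f} per month")
    print(f"At risk after:  {revenue_at_risk(monthly_revenue, after):.0f} per month")
    print(f"Payback: {payback_months(upgrade_cost, monthly_revenue, before, after):.2f} months")
```

The model is deliberately crude: it assumes revenue is lost in direct proportion to downtime. Even so, "this box pays for itself in under a month" is a sentence an accountant can act on, while "this removes a single point of failure" is not.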

Failure is always possible

The conversation also touched on the inevitability of failure in the tech world. User u/devoopseng emphasized the value of embracing failure as a catalyst for growth, echoing the ideas behind quality management pioneer W. Edwards Deming's "System of profound knowledge." In an industry driven by innovation, setbacks serve as invaluable learning opportunities.

Every incident, outage, or performance degradation presents a chance to analyze, understand, and mitigate potential future issues. Instead of fearing failure, see it as a chance to innovate, iterate, and ultimately improve system reliability. That can be hard to remember when you’re early in your career and driven by perfection.

When things fail, when the PR we worked so hard on is the one that takes down production, and when it feels like no one has made mistakes as big as we have, it’s easy to get discouraged. Many commenters noted the importance of adaptability and resilience. User u/durden0 cautioned against succumbing to analysis paralysis, advocating for a pragmatic approach to learning and implementation. In a field where change is constant, flexibility and agility are key to success.

Don’t waste too much time trying to research and learn everything about a new technology to decide which one to use. Try stuff out…then implement and design things in a way that makes it easy to evolve or change course. In other words, avoid analysis paralysis by making change as easy as possible.

User u/CapitanFlama gave some more basic advice that I think every SRE needs to hear, which is to own your mistakes and move forward:

If you screw up, if you break something, quickly raise your hand/give notice. It's easier to get help to fix something that just broke or was misconfigured than untangle something that broke and was quickly tried and failed to get fixed.

This one spoke to me personally. Very early in my career, every time I broke something I’d spend a solid 20 minutes trying to fix it on my own. Eventually I had the experience of not just wasting time doing this but actively making the problem worse, and I finally learned the lesson: no one expects you to be perfect, but they do expect you to let someone know when you’ve made a mistake.

Conclusions: the SRE Journey

I wasn’t surprised that a theme that came up more than once was the human side of SRE. Getting others to prioritize the fixes and maintenance that we know are necessary is a huge part of the job, as is encouraging cultural buy-in on improved practices for better performance.

The Reddit thread offered hard-earned wisdom for the SRE community. From prioritizing depth over breadth in learning to fostering effective communication and embracing failure, the journey to becoming an experienced SRE is more about personal growth than about growth in your technical skills.
