How to monitor your data center like a Google SRE
Back in the early ’00s, when Google was beginning to expand its portfolio of services beyond search, it encountered a combination of challenges. Some of these emerged from familiar, classic disconnects between developers and operations folks, or between IT services and line-of-business owners. Others were brand new, never-before-seen failure modes that arose from providing services on novel cloud platforms, and doing so at planetary scale.
To confront these challenges, Google began evolving a discipline called Site Reliability Engineering (SRE), about which it published a very useful and fascinating book in 2016. SRE and DevOps (at least the contemporary version of DevOps that’s expanded into a vision for how IT operations should work in the era of cloud) share a lot of conceptual DNA and an increasing amount of practical DNA. This is particularly true now that cloud software and tooling have evolved to let ambitious folks begin emulating parts of Google’s infrastructure using open source software like Kubernetes.