How to monitor your data center like a Google SRE

Taking a leaf from Google’s book on how to run production systems
May 04, 2018

Datacenter Dynamics May 2018

Back in the early ’00s, when Google was beginning to expand its portfolio of services beyond search, it encountered a combination of challenges. Some emerged from the familiar, classic disconnects between developers and operations folks, or between IT services and line-of-business owners. Others were brand-new failure modes that arose from providing services on novel cloud platforms, and doing so at planetary scale.

To confront these challenges, Google began evolving a discipline called Site Reliability Engineering (SRE), about which it published a very useful and fascinating book in 2016. SRE and DevOps (at least the contemporary version of DevOps that has expanded into a vision for how IT operations should work in the era of cloud) share a lot of conceptual DNA and an increasing amount of practical DNA. This is particularly true now that cloud software and tooling have evolved to let ambitious folks begin emulating parts of Google’s infrastructure using open source software like Kubernetes.

