IoT, DNS, Feature Flags, and Chaos-as-a-Service
The O’Reilly Velocity 2018 Distributed Systems Conference was held October 1-3 at New York’s midtown Hilton. As sponsors, Team Opsview arrived with a song in our hearts and a lot of purple Opsview-branded swag to give away, including a mystery-flavor lip balm that proved irresistibly magnetic to attendees (or perhaps it was the Nintendo Switch we were giving away to someone who correctly guessed the flavor).
Velocity is a bigger, broader, more sprawling event than some of the others we’ve attended recently. Pre-show training days, half-day bootcamps, multiple presentation tracks, a show floor, and a sponsor pavilion meant no single person could see everything. So our choice of highlights, this time around, is sparser, more varied, and more subjective. You can dig deeper by visiting the Velocity NYC 2018 website and spelunking the schedule for links to video highlights and slides. Full videos are available through O’Reilly’s online learning platform, which offers a free ten-day trial with no credit card required.
SaaS Services for Dev and Ops
Samsara runs a platform for monitoring huge flocks of distributed IoT sensors in fleet, industrial, and other applications. As you can imagine, they understand scale (the company’s name, drawn from Sanskrit, means something like “wandering through the infinite cycles of the world”). In her talk, titled Practical Performance Theory, Samsara’s Kavya Joshi walks through (mostly fairly simple) performance models of applications, explains their limitations, and shows how to reason with them and derive useful predictions from them. She moves back and forth between basic queueing theory and spookier, more complex, multi-regime scenarios, and provides valuable insights that anyone working on parallel transactions at high volumes (e.g., serverless apps) can put to work today.
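To give a flavor of what "basic queueing theory" can predict, here is a minimal sketch of the textbook M/M/1 single-server queue. It is our own illustration, not a model taken from the talk; the function names and the example rates are assumptions. The point it makes is the classic one: mean response time stays modest at moderate utilization, then climbs steeply as the server approaches saturation.

```python
import math


def utilization(arrival_rate: float, service_rate: float) -> float:
    """Fraction of time the single server is busy (rho = lambda / mu)."""
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("arrival_rate must be >= 0 and service_rate must be > 0")
    return arrival_rate / service_rate


def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends in an M/M/1 system (queueing plus service).

    Uses the standard result R = 1 / (mu - lambda), valid only while
    lambda < mu. At or beyond saturation the queue grows without bound,
    so we return infinity.
    """
    if utilization(arrival_rate, service_rate) >= 1.0:
        return math.inf
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    # Hypothetical server that can complete 100 requests per second.
    service_rate = 100.0
    for arrival_rate in (50.0, 80.0, 90.0, 99.0):
        rho = utilization(arrival_rate, service_rate)
        latency_ms = mm1_response_time(arrival_rate, service_rate) * 1000.0
        print(f"utilization {rho:.0%}: mean response time {latency_ms:.0f} ms")
```

Real systems rarely match the M/M/1 assumptions (random arrivals, exponential service times, one server) exactly, which is presumably where the more complex, multi-regime scenarios come in.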
NS1 provides DNS and traffic-shaping services to some of the world’s biggest internet companies. Their CEO/founder, Kris Beevers, gave a keynote, “Balancing Good Enough and Perfect,” on how to find the sweet spot between (as he puts it) “not getting to make big mistakes” and delivering new products and features at speed. He opens by discussing a pathfinding/optimization problem on a topo map -- a physical example that turns out to have deep roots in explorations of chaos theory and in the problem of avoiding premature convergence on local maxima.
Imagine being able to continuously deliver features without hassles -- turning them on and off on production systems without rollbacks or drama. Imagine being able to A/B or canary-test new features: presenting them to fine-tuned subsets of customers -- even selecting those subsets by behavior or by user data such as PII -- while presenting the rest of your customers with tried-and-true functionality. LaunchDarkly provides a SaaS or on-premises platform that enables all this: it serves feature flags to your users in real time, giving you fine-grained manual or automated control over user experience.
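The core idea behind a percentage rollout can be sketched in a few lines. The snippet below is a generic illustration, not LaunchDarkly's SDK or its actual bucketing algorithm; `bucket`, `is_enabled`, and the flag and user keys are all hypothetical names. Hashing the flag and user together gives each user a stable position between 0 and 1, so the same user keeps seeing the same variation while you widen the rollout.

```python
import hashlib


def bucket(flag_key: str, user_key: str) -> float:
    """Map a (flag, user) pair to a stable number in the range [0, 1)."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).hexdigest()
    # The first 8 hex digits form a 32-bit integer; dividing by 2**32 keeps the result below 1.
    return int(digest[:8], 16) / 2**32


def is_enabled(flag_key: str, user_key: str, rollout_percent: float,
               enabled: bool = True) -> bool:
    """Decide whether a user sees the new feature.

    `enabled` acts as a kill switch: turning it off hides the feature
    from everyone without a rollback or redeploy.
    """
    if not enabled:
        return False
    return bucket(flag_key, user_key) * 100.0 < rollout_percent


if __name__ == "__main__":
    # Hypothetical canary: expose "new-checkout" to roughly 10% of users.
    users = [f"user-{n}" for n in range(1000)]
    exposed = sum(is_enabled("new-checkout", user, 10.0) for user in users)
    print(f"{exposed} of {len(users)} users see the new checkout")
```

Because each user's bucket is fixed, raising the percentage from 10 to 50 only adds users; nobody who already had the feature loses it.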
Developers love the idea (sometimes not so much the practice) of Chaos Engineering -- challenging the resilience of production systems (and testing the systems and people monitoring and managing them) by automatically disabling components and servers at intervals and seeing what happens. In practice, this has mostly meant using a relatively hard-to-use tool called Chaos Monkey, invented and later open-sourced by Netflix. Now Gremlin, who we met at Velocity (and who were handing out the best gremlin-shaped swag mints ever!), has introduced what they call a Failure-as-a-Service platform: a hosted service that lets you run selective experiments in automated system breakage (and/or simulated load) with the goal of identifying areas of weakness and improving resilience. Clear web displays let you target server resources (e.g., CPU and memory) or network traffic, or simulate various kinds of bad behavior, from unexpected restarts to process failures to time drift, a common cause of “split brain” failures in large-scale cloud frameworks. It isn’t cheap, but if highly controllable chaos is what your team needs, it is probably well worth the price.
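For readers new to the idea, the principle of a chaos experiment can be shown at toy scale. This sketch is our own and has nothing to do with Gremlin's or Chaos Monkey's actual APIs; `InjectedFault`, `with_fault_injection`, and `call_with_retries` are hypothetical names. It wraps a call so that it fails at a chosen rate, then checks whether the caller's retry logic copes.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised in place of the real call to simulate a component failure."""


def with_fault_injection(call: Callable[..., T], failure_rate: float,
                         rng: random.Random) -> Callable[..., T]:
    """Wrap `call` so that it fails with probability `failure_rate`."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return call(*args, **kwargs)
    return wrapped


def call_with_retries(call: Callable[[], T], attempts: int) -> T:
    """Try `call` up to `attempts` times, re-raising the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except InjectedFault as error:
            last_error = error
    raise last_error


if __name__ == "__main__":
    rng = random.Random(2018)  # Seeded so the experiment is repeatable.
    flaky_lookup = with_fault_injection(lambda: "ok", failure_rate=0.3, rng=rng)
    survived = 0
    for _ in range(1000):
        try:
            call_with_retries(flaky_lookup, attempts=3)
            survived += 1
        except InjectedFault:
            pass
    print(f"{survived} of 1000 requests survived a 30% injected failure rate")
```

A hosted platform does the same thing against real infrastructure (hosts, networks, clocks) rather than a function call, with controls for scope and for stopping the experiment.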