Event Diary: InfluxDays NYC
InfluxDB is the pre-eminent open source platform for ingesting, pre-processing, storing, querying, and analyzing metric time series. It’s the ‘I’ of the so-called TICK stack (also comprising Telegraf, a pluggable, bidirectional agent; Chronograf, a UI and Query system; and Kapacitor, a data- and stream-processing engine), and part of a growing ecosystem of complementary open source tools like Grafana, for time-series analytics, graphing, and dashboard creation.
InfluxDB was developed to handle the fast-growing volumes of time-series data involved in monitoring premise/public hybrid IT infrastructure and applications. Today, it’s most often found at the core of sophisticated, home-grown telemetry systems used by DevOps teams at large enterprises and service providers to monitor big hybrid IT estates (hybrid clouds, container farms, PaaS), large-scale, cloud-native applications, and distributed, mission-critical industrial processes (e.g., IoT, renewable power-grid, etc.).
Sounds pretty rarefied? In fact, as I learned this past Tuesday (February 13) at InfluxDays NYC, one in a series of two-day conference/workshops hosted by InfluxData, the best practice that organizations like Tesla and Comcast are evolving around InfluxDB and time-series analytics is being driven by powerful technical and economic trends, shaped by hugely innovative (and often counter-intuitive) thinking, informed by (really) big data, and verified empirically at scale. This movement will soon influence (read: is already influencing) not only how everybody does monitoring, but how business itself works.
Crises of Observability, Understanding, and Scale
What are the trends? Some are pretty obvious. Hybrid cloud use is exploding. Developers are adopting new and better ways of building, deploying, managing and updating applications (heavy use of open source components, continuous delivery/integration, DevOps, infrastructure-as-code, the ‘automate everything’ mandate), and running those apps on ever-more-abstract platforms: container orchestration frameworks, PaaS, serverless computing. All this is creating more and more complexity, speed, and dynamism, e.g., auto-deployed, microservice-based, scale-out workloads with lifespans measured in minutes. And it’s also hiding a lot of what’s going on.
For users, hiding complexity is great news: why worry about what the platform is doing when you can just package up the container or write the function and let the platform scale it out based on traffic? For operators, however, it’s creating what you might term a crisis of observability and a concomitant crisis of understanding. How can you manage (and cost-optimize) what you can’t see? And (if you can overcome the observability problem) how can you monitor all the complicated things intelligently?
Pro Tip: Don’t Start by Rolling Your Own TSDB
Observability is easier. You make things observable by instrumenting and monitoring them — using powerful tools, like InfluxDB, optimized to work well at scale. But as Tesla’s Colin Breck wryly noted in his InfluxDays talk titled From a Time-Series Database to a Key Operational Technology for the Enterprise, many folks, faced with these problems, start by trying to write their own time-series database.
Breck, who himself wrote a widely-used time-series database (part of the PI System from OSIsoft), explained why this is a very bad idea. And this theme was echoed in several other presentations, including a very funny (and also somewhat sad) talk by David Cromberge, a senior back-end dev at Outlyer, a SaaS monitoring service for microservice workloads, titled Why you Definitely Don’t Want to Build Your Own Time Series Database. Cromberge’s team spent two years trying to scale their TSDB, and eventually (and happily) ended up going in a different direction.
Time-series storage (who knew?) turns out to be very hard to do. Tesla’s Breck, who develops sophisticated distributed systems for managing elements in renewable power grids, offered a daunting synopsis of the challenges that TSDBs need to meet in transitioning from small-scale applications to becoming critical ganglia in the enterprise nervous system. He noted the need to manage intermittent failures of connectivity with data sources, the challenge of processing data to preserve utility while dealing with massive inbound flows, the need to use in-memory and on-disk caching to prevent data loss, and the problem of managing highly granular security in monitoring systems that look at assets owned by many different organizations. He also touched on the complex issues entailed in providing query services against databases designed to ensure only eventual consistency.
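To make the first of those challenges concrete, here is a minimal sketch of a store-and-forward buffer that holds points in memory while the link to the database is down. It is an illustration only, not anything Breck showed: the class name StoreAndForwardBuffer, the send callback, and the max_points limit are all hypothetical, and the sample strings merely imitate InfluxDB line protocol. A real system would also need the on-disk cache he mentioned.

```python
from collections import deque
from typing import Callable, Deque, List


class StoreAndForwardBuffer:
    """Hypothetical helper: keeps points in memory until a send succeeds."""

    def __init__(self, send: Callable[[List[str]], None], max_points: int = 10_000):
        self._send = send
        # Bounded buffer: once full, the oldest points are dropped first.
        self._pending: Deque[str] = deque(maxlen=max_points)

    def write(self, point: str) -> None:
        """Queue a point, then try to deliver everything that is pending."""
        self._pending.append(point)
        self.flush()

    def flush(self) -> bool:
        """Send all pending points as one batch; keep them if the send fails."""
        if not self._pending:
            return True
        batch = list(self._pending)
        try:
            self._send(batch)
        except ConnectionError:
            return False  # nothing is discarded; retry on the next write
        self._pending.clear()
        return True

    def __len__(self) -> int:
        return len(self._pending)


# Simulated flaky link (assumption: the sender raises ConnectionError when down).
link = {"up": False}
delivered: List[str] = []


def flaky_send(batch: List[str]) -> None:
    if not link["up"]:
        raise ConnectionError("link down")
    delivered.extend(batch)


buffer = StoreAndForwardBuffer(flaky_send)
buffer.write("cpu,host=a usage=0.5 1")  # held: link is down
buffer.write("cpu,host=a usage=0.7 2")  # held: link is down
link["up"] = True
buffer.write("cpu,host=a usage=0.6 3")  # all three points go out in order
```

The bounded deque is a deliberate trade-off in this sketch: it caps memory use at the cost of losing the oldest points during a long outage, which is exactly the kind of loss an on-disk cache is meant to prevent.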
FYI, Colin Breck blogs at blog.colinbreck.com — if you’re into stuff like Akka (concurrent, distributed, fault-tolerant application framework for Java) among other things, he is your guy.
Hands-on with the TICK Stack
InfluxDays (the conference, as opposed to the subsequent one-day workshops) wasn’t explicitly aimed at developers — or at least, not exclusively so. Even when talks pivoted (necessarily) on complex technical arguments, programming arcana, and fast-moving live demos, speakers all took remarkable care (and were, I think, largely successful) in providing a comprehensible high-level “why should you care?” narrative, and valuable take-aways for business folks. Short (35-minute) presentations, well-timed and practiced, with proactively-managed, five-minute Q&A sessions after each ensured that few eyes glazed over, and that Death By PowerPoint never became an issue.
Technologists not especially familiar with the TICK stack were in maybe the sweetest spot — I’m not a TICK stack wizard by any means, but I had no trouble following the day’s most technical presentations.
Of these, there were several stand-outs. InfluxData CTO and founder Paul Dix, the father of InfluxDB, opened the show with a review of InfluxDB’s somewhat painful history of deep refactoring and total rewrites (Scala to Go) as InfluxData’s and the community’s understanding of the problem set grew more nuanced over time. This related to his core topic: IFQL, the new, functional-programming-oriented query language (and architecture) for InfluxDB. Dix closed with news that InfluxData would be contributing its Go implementation to the Apache Arrow project (Arrow is a columnar, in-memory data format designed to work well with processor SIMD vector instructions), and later in the day tweeted the happy news that InfluxData had secured a $35 million investment round.
Dix also offered what I thought was simultaneously the funniest and wisest throwaway line of the day: a powerful reminder that time-series can appear (and even be) conceptually simple, while also being deep. “You can understand everything that’s most important about how time-series databases work,” he said, “by watching Ryder Carroll’s five-minute intro video about how to keep a bullet journal.” I later watched that video, and he’s absolutely right. Bullet journals frame a method for turning the simplest physical data structures (lists, flags, labels) into a powerful tool for self-mastery, self-improvement, and ultimately, for prediction. And that’s what all this time-series stuff, with its deep history in information theory, signal processing, feature extraction, gisting, and relevance to machine learning, is all about.
Ryan Betts, InfluxData’s Director of Engineering, talked about InfluxDB internals, expanding on CTO Paul Dix’s earlier, introductory remarks on progress and exploring how TSI is helping InfluxDB deal gracefully with crazy-high numbers of time series. (TSI, or Time-Series Indexing, is a method for indexing high-cardinality time-series on disk, avoiding the limitations of in-memory indexing and improving performance with ephemeral time-series, such as those generated by short-lifespan container workloads.) He also waxed philosophical on the eternal tension between storage and query: if you want to read something back easily, you have to write it in very clever and laborious ways, whereas if you want easy writes, searches are hard. (Meanwhile, Paul Dix has posted his IFQL and InfluxData roadmap slides online, along with the video of his opening presentation.)
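A little arithmetic shows why short-lived container workloads push series counts so high. For a single measurement, the number of distinct series is bounded above by the product of the number of distinct values each tag can take. The sketch below is my own illustration rather than anything from the talk: the function name max_series_cardinality, the tag names, and the counts are all made up.

```python
from math import prod
from typing import Dict


def max_series_cardinality(tag_value_counts: Dict[str, int]) -> int:
    """Upper bound on series for one measurement.

    Each series is one unique combination of tag values, so the worst case
    is the product of the distinct-value counts across all tags.
    """
    return prod(tag_value_counts.values())


# Hypothetical estate: a few slow-changing tags.
static_tags = {"host": 50, "service": 20, "region": 3}
print(max_series_cardinality(static_tags))  # 3000

# Add a tag that changes every time a container is replaced.
with_containers = dict(static_tags, container_id=10_000)
print(max_series_cardinality(with_containers))  # 30000000
```

This is only an upper bound, since most tag combinations never occur in practice, but it shows how a single ephemeral tag multiplies the worst case.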
Former Googler Tom Wilkie, the founder of Kausal, a new venture focused on Prometheus monitoring, proposed a method for instrumenting microservices that he terms RED, for Requests per second, Errors, and Duration (the time required to process a request), plus Saturation, a measure of how close the service is being driven to its maximum capacity. His method is intuitively simple and (I think) hugely valuable. I’m going to apply it in research we’re starting for a white paper, slated for next quarter, on sensing impending failure in complex systems. Joab Jackson, a first-class writer for TheNewStack.io, wrote up Wilkie’s talk.
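As a rough sketch of what RED-style instrumentation might compute for one service over one time window, consider the following. This is my own illustration, not Wilkie’s code: the class name RedWindow, the use of a 95th-percentile duration, and the definition of saturation as peak in-flight requests divided by a configured max_concurrency are all assumptions made for the example.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RedWindow:
    """Hypothetical collector for one service over a fixed time window."""

    window_seconds: float
    max_concurrency: int  # assumed capacity limit, used only for saturation
    durations: List[float] = field(default_factory=list)
    errors: int = 0
    peak_in_flight: int = 0

    def observe(self, duration_s: float, failed: bool, in_flight: int) -> None:
        """Record one finished request and the concurrency seen at that time."""
        self.durations.append(duration_s)
        if failed:
            self.errors += 1
        self.peak_in_flight = max(self.peak_in_flight, in_flight)

    def summary(self) -> Dict[str, float]:
        total = len(self.durations)
        ordered = sorted(self.durations)
        # Nearest-rank 95th percentile of request duration.
        p95 = ordered[max(0, math.ceil(0.95 * total) - 1)] if total else 0.0
        return {
            "requests_per_second": total / self.window_seconds,
            "error_ratio": self.errors / total if total else 0.0,
            "duration_p95_s": p95,
            "saturation": self.peak_in_flight / self.max_concurrency,
        }


# Four requests seen in a 10-second window; one of them failed.
window = RedWindow(window_seconds=10.0, max_concurrency=8)
for duration, failed, in_flight in [
    (0.1, False, 2),
    (0.2, False, 4),
    (0.4, True, 6),
    (0.3, False, 3),
]:
    window.observe(duration, failed, in_flight)

red = window.summary()
print(red)
```

In practice these numbers would more likely come from counters and histograms scraped by a monitoring system than from an in-process list, but the four outputs are the same ones the method asks you to watch for every service.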
Another valuable talk came from Rob Frohnapfel, Director (of something - his posted title is intriguingly vague) at Comcast, who related his team’s experience using InfluxDB and automation to provide disciplined and cost-effective monitoring of Cloud Foundry and Kubernetes. Jacob Lisi, of Grafana Labs, taught us all a thing or two about how to monitor Kubernetes, including some golden metrics that are important proxies of cluster health. Lisi wrote a Grafana plugin for PromQL.
The Conversation Continues
I found attending InfluxDays to be hugely valuable: an opportunity to deepen my familiarity with an important toolchain, learn lots of immediately-applicable best practice, and connect with a robust, smart, and welcoming community. It was also, of course, an opportunity to compare strategic notes between InfluxDB’s model of the universe, and Opsview’s own innovation agenda. In so doing, I was gratified to learn that our own strategists, notably our Innovation Lead, Bill Bauman, his colleagues in Engineering, and Opsview’s leadership, are very much on the same page as many influential voices in the time-series and data science communities about how monitoring should work, and how it should evolve.
All these folks acknowledge the paired crises of observability and understanding that threaten IT operations. They advocate meeting these challenges by consuming mature technologies (Opsview’s upcoming releases will incorporate InfluxDB for time-series databasing) and working closely with open source communities of practice to package and share operational best-practice (Opsview has recently open sourced all its Opspacks and integrations). They also, it seems, agree that the most important goals of monitoring center on actionable insight, including anomaly detection and issue prediction, driven by powerful analytics workflows (and potentially, by machine intelligence).
We probably disagree about some things, too. I was struck, for example, by how thought-leaders at InfluxDays consistently discouraged end-users from attempting to write their own time-series databases, but simultaneously endorsed these same users building very large-scale, home-grown monitoring solutions using the TICK toolchain. Most organizations will find it far more practical to seek out finished monitoring solutions like Opsview Monitor that already incorporate operational intelligence about the systems they use, plus growing abilities to auto-discover IT assets, and act intelligently in the face of unknowns.
This is an important conversation, and people in the monitoring community need to keep it going. For people on the US East Coast, the next opportunity to do so may be IT OpShare, on March 15, at Harvard Law School, in Cambridge, MA (USA). The afternoon event will bring together Opsview customers from the IT departments of Ivy League and other nearby institutions, speakers from Mesosphere, HAProxy, VictorOps, and TeleComputing (among others), plus a closing technical address by Opsview’s Bill Bauman, on the value of machine learning and predictive analytics to DevOps. Join us!