Can you afford an IT outage?
You don’t have to look hard to find out how disruptive an IT outage can be. British Airways, Delta, TSB, Facebook, Google – all of these companies have experienced devastating outages of late and have all felt the effects afterwards. Yet still, despite these high-profile examples, companies are falling into the trap of failing to act before it’s too late.
But what can firms do to avoid following in the footsteps of even the most digitally advanced companies, stop their systems from going down, and so escape the financial consequences?
Go back to IT infrastructure basics
The first step in managing IT infrastructure more effectively is to go back to basics and do a rudimentary health check. It’s amazing how small, hygiene-factor actions can be missed, which can lead to devastating effects that have the potential to snowball. A benchmark report by the Ponemon Institute this year revealed that an astonishing 71% of IT professionals believe that their organization does not know exactly how many keys and certificates it has. In terms of basic security and administration, this is a worrying number for firms.
The effects were stark. In the same report, 74% of respondents said digital certificates have caused, and worse, still cause, unanticipated downtime or outages, at an average cost per organization of more than $11 million. The cumulative cost of an IT outage and the consequent downtime averaged out at $67.2 million per company over just two years. System administration and support time, revenue loss, lost productivity and eroded brand reputation were all consequences. And to think, this could be avoided by keeping certificates in check.
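Keeping certificates in check can start with something as simple as an expiry report. The following is a minimal Python sketch, not a feature of any particular monitoring product: the function names, the 30-day threshold and the idea of passing in a plain list of hosts are assumptions made for illustration. It uses only the standard library `ssl` and `socket` modules.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # assumed threshold; pick one that suits your renewal process


def days_until_expiry(not_after, now=None):
    """Return whole days (rounded down) from `now` to a certificate's notAfter date.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". A negative result means
    the certificate has already expired.
    """
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiry_report(hosts):
    """Return (host, days_left) pairs for certificates expiring within WARN_DAYS."""
    report = []
    for host in hosts:
        days_left = days_until_expiry(fetch_not_after(host))
        if days_left <= WARN_DAYS:
            report.append((host, days_left))
    return report
```

Calling `expiry_report(["example.com"])` on a schedule would list the hosts that need attention. Note that with the default context an already-expired certificate fails the TLS handshake, so `fetch_not_after` raises `ssl.SSLCertVerificationError` instead of returning a date, which is itself worth alerting on.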
Get your head out of the clouds
In the rush to move operations into the Cloud, often as part of a wider digital transformation, some companies have lost sight of their IT strategy and of how they monitor system health, given how many components are now involved. The result is a mélange of decentralized and disconnected monitoring tools, which heightens the risk of disruption because no one is quite sure how the system as a whole operates.
Operating in the Cloud also adds another element of risk – the supply chain. Whilst the supply chain has always been a risk (such as O2 seeking up to £100 million from Ericsson for its one-day data outage), a cloud provider going down would be a massive issue for firms to overcome. In fact, Lloyd’s Insurance and risk-modeler AIR Worldwide have calculated that a single incident taking a top cloud provider (such as Google or AWS) offline in the US for three to six days would result in $15 billion in losses. Even a fraction of these numbers would significantly hurt most companies, so it’s abundantly clear that appropriate measures must be taken to mitigate risk.
Don’t just throw away legacy IT systems
To prevent an IT outage, firms also need to have a better appreciation of legacy IT systems, and their importance within the wider infrastructure. As IT systems have become more sophisticated and the consumerization of IT has led to more innovative, customer-facing systems, infrastructure has been patched together. The problem many firms face is that legacy systems cannot simply be ripped out and rebuilt from scratch – they underpin key functions, leaving the IT department with difficult decisions to make about how to integrate older systems effectively with new, modernized applications. A key to avoiding an IT outage is understanding and appreciating how these systems work – keeping them healthy, and therefore keeping the business healthy.
Think lifetime, not one off
A common issue businesses face is that they fail to see the bigger picture when it comes to costs. However, it must be said that the total cost of ownership is significantly less than the cost of an IT outage. Gartner has previously estimated that IT downtime costs $300,000 per hour, rising to over $500,000 for the biggest brands (a four-hour outage famously cost the New York Stock Exchange $2.5 million per hour). Therefore, for companies saying they don’t want to cover software costs, the line of argument is a dangerous one.
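To put those figures side by side, here is the arithmetic as a small Python sketch. The function name is hypothetical and the hourly rates are the ones quoted above. It assumes a flat hourly rate and covers direct costs only.

```python
def outage_cost(hourly_cost, hours):
    """Direct cost of an outage, assuming a flat cost per hour of downtime."""
    return hourly_cost * hours


# Figures quoted in the text (US dollars per hour of downtime).
GARTNER_AVERAGE = 300_000
NYSE_RATE = 2_500_000

# A four-hour outage at each rate.
average_four_hours = outage_cost(GARTNER_AVERAGE, 4)  # 1,200,000
nyse_four_hours = outage_cost(NYSE_RATE, 4)           # 10,000,000
```

On these numbers, a single four-hour outage at the average rate costs $1.2 million, and the New York Stock Exchange outage works out to $10 million.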
The Gartner estimate also did not take into account the reputational damage companies suffer. Whilst reputation is largely an intangible asset, the fallout from the TSB outage is clear to see. Alongside the initial financial hit it took (£330 million), it also lost 12,500 customers. In today’s competitive marketplace, brands can’t afford to take these risks as people are all too willing to jump ship. This was again highlighted when WhatsApp went down. Although it is a popular platform with a loyal fanbase, users’ frustration and desire to talk to their friends saw rival Telegram gain over three million new users in just a few hours.
Firms should think about the internal costs too. Decentralized, disconnected systems lead to lost man-hours as people wade through convoluted IT systems. This results in people taking matters into their own hands, with Enterprise Management Associates reporting that many firms use up to 10 monitoring tools at once, just to see what is going on. This creates data islands, dilutes data and means it takes an average of three to six hours to find the source of a performance issue.
There’s also a hidden cost in employee satisfaction. Oxford Economics has estimated it costs £30,000 to replace an employee. That is a problem for companies whose IT professionals are stuck in the weeds, fixing recurring issues instead of consulting on matters that could have real business impact, with no growth opportunities on offer.
Get your digital house in order
The vast array of IT outage examples demonstrates that companies commonly have unfit solutions in place: tools designed for static on-site systems rather than today’s cloud and virtual-based digital systems. Therefore, to get a true picture of system health, the only answer is to unify digital operations and monitoring under a single pane of glass. This provides a holistic view of what is happening, and one version of the truth – meaning no more silos or duplication of effort.
Outages occur abruptly and without warning. When one does happen, it’s vital to detect the failure quickly and efficiently – identifying affected areas and mitigating the issue. This reduces downtime, poor user experience and lost revenue. Learn from the mistakes of others and prepare for failure – otherwise prepare to fail.