How to Avoid the IT Outage War Room with IT Ops
An IT outage is like a heart attack – starving businesses of critical information and rendering them paralyzed. It forces them onto the back foot, taking on multiple combatants, with attacks coming in from all sides. This can make an IT war room situation the only viable option to regain operational control, but it’s not to be taken lightly – it requires a major investment of time, people and money in its own right.
That makes prevention by far the best option. But with sprawling IT systems, poor visibility, siloed teams and multiple tools, IT operations monitoring is not always fit for purpose. To stay out of the war room, IT teams need to get back to basics, with a single, unified version of the truth.
When things go wrong
Modern enterprises are rushing in droves to embrace digital platforms. According to IDC, nearly three-quarters (73%) of organizations had at least one application or some of their infrastructure in the cloud as of last year, and another 17% planned to do so in the next 12 months. That’s not to mention the take-up of IoT, mobile, software-defined datacenters and more. These transformative technologies are helping businesses differentiate on innovation and customer-centricity, but crucially they also add complexity. Many modern digital platforms are built on or linked in some way to legacy solutions. When something goes wrong, it can be difficult to know where to look for the problem.
There’s a common misconception that day-to-day outages, whether a disk failure, a crashed application or a power outage, just “happen”. In reality, there’s always a reason. It could be a lack of effective oversight of mission-critical operational systems, a misconfiguration or other human error, or even dilapidated hardware. Or it could be some or all of the above, coming together in a perfect storm.
On the rise
The bad news is that such incidents are occurring with increasing frequency, with serious repercussions for the bottom line and corporate reputation.
This summer, we saw a slew of outages affecting some of the biggest names on the web. In early June, a routine configuration change took out Google Cloud on the east coast of the US, hitting countless third-party sites including Vimeo and Snapchat. Similarly, Cloudflare dropped 15% of its global traffic after a BGP problem at Verizon misrouted a large chunk of the internet, and a week later the company’s customers were again left without service after an internal code update went wrong. That’s not to mention serious issues with Facebook, WhatsApp and Instagram that were triggered during routine maintenance.
It’s not just digital platforms that are at risk. Legacy systems have also been found wanting. A CAST report from 2018 revealed that nearly half (47%) of financial services firms run between 26% and 50% of their business on legacy systems. That certainly seems to have had an impact in the UK, where the likes of TSB and HSBC have both suffered notable outages. A report from the regulator, the FCA, found that the UK’s five biggest high street lenders suffered 64 payment outages in Q2 2018 alone. The FCA has now mandated that lenders publish details of major incidents and is proposing a “maximum outage time” of two days. Even the Bank of England has been criticized by MPs for the slow progress of IT modernization efforts. Nor is this just a UK problem. It is a global issue: banks ranging from Wells Fargo in the US to National Australia Bank have reported nationwide outages over the past year.
An expensive business
The impact of such outages on organizations is difficult to overstate, given how much they rely on IT systems to maintain day-to-day operations and drive business growth. Staff downtime, reputational damage, lost business, and the hit to IT productivity all have serious financial knock-on effects. Gartner claims the average cost of network downtime alone is around $5,600 per minute. Separate research from YouGov claims IT failures cost UK firms £35bn each year, with employees losing a whole working day each month as a result. These costs extend to the IT war room. A major outage may require a serious response of this kind to get things up and running again. But bringing together the right stakeholders to manage the technical, PR, legal and other implications of a serious incident requires significant resources and diverts key personnel away from their day-to-day jobs supporting business innovation and growth.
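To put the per-minute figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the Gartner average of $5,600 per minute applies uniformly for the whole outage, which is a simplification; the outage durations and the `downtime_cost` helper are hypothetical and chosen purely for illustration.

```python
# Illustrative only: the per-minute figure is the Gartner average cited in
# the article; the outage durations below are hypothetical examples.
GARTNER_COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes, cost_per_minute=GARTNER_COST_PER_MINUTE_USD):
    """Return a simple linear estimate of outage cost in US dollars."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


for label, minutes in [("30 minutes", 30), ("1 hour", 60), ("3 hours", 180)]:
    print(f"{label}: ${downtime_cost(minutes):,.0f}")
```

On these assumptions, a single hour of network downtime comes to $336,000, and that is before counting the cost of staffing a war room.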
It’s not just the direct financial impact of outages that organizations need to bear in mind, but the potential degradation of customer trust. In sectors where loyalty is hard won but easily lost, like online retail, outages could have a serious long-term impact on a brand. And the longer one lasts, the more unhappy customers there will be.
How IT Ops can help
There’s no silver bullet that will miraculously solve all IT outages. But the approach all IT leaders should be aiming for is one of proactive prevention – which in turn requires enhanced visibility. IT operations management tools are a great way to achieve this, but there are challenges. Many are not designed to provide visibility into both static, on-premises legacy systems and dynamic, decentralized digital platforms – with some of the latter existing outside the control of the IT department.
A related challenge is that many IT teams are laboring with multiple IT Ops tools. One report from Enterprise Management Associates suggests that 65% of organizations have more than 10 monitoring solutions in place. Given this tool sprawl, it’s perhaps not surprising that efforts to identify and contain outages are being hampered. Two-thirds of organizations say that it takes more than three hours to determine the root cause of an app-related issue, rising to over six hours for a third of them.
There’s only one way to sort out this mess: centralize IT Ops onto a single platform capable of providing visibility into legacy and modern IT systems. A single version of the truth removes inefficiency, helps to break down silos and accelerates response times, minimizing downtime and any impact on corporate reputation. Even better, it could help identify issues before they’ve turned into full-blown outages.
That’s the best way to stay out of the war room, and put IT back where it belongs: supporting business growth.