Instead of Dreading a System Crash, Schedule One and Learn to Avoid Them

Entrepreneur

10 October 2014 at 2:30 pm

A survey by CA Technologies found that companies in North America and Europe lost more than $26.5 billion in revenue to downtime, and that figure dates from 2010.

There are various ways to calculate the monetary cost of system outages, but the damage to a company’s reputation is immeasurable. When Microsoft’s Azure cloud-computing service experienced a major outage recently, experts speculated that it could be a serious blow to the software giant’s attempt to compete against rivals Google and Amazon.

Good CEOs and CIOs refuse to accept excuses for even small levels of downtime, but it’s not easy to hit five nines of reliability (99.999 percent uptime). Nonetheless, no matter how complex a company’s systems and business, there are always ways to engineer and deliver higher reliability and quality of service. Below are the actions that CEOs need to take to boost their company’s reliability:

1. Stop waiting for an outage. Create one.

If you wait for a customer to do something that causes a failure, you’re too late. Netflix, for example, has tackled unexpected outages with its “Simian Army,” a set of automated tools that test applications for failure resilience. For most companies, however, the best way to handle this is to keep it simple.

Encourage your ops and dev teams to schedule a recurring meeting and create outages manually. Injecting failure reveals implementation issues that reduce resiliency while proactively uncovering deficiencies that would otherwise be the root cause of an outage.

Scheduled outages build a strong collaborative culture simply by bringing teams together on a regular basis. Working together to fix artificial failures will combat the idea that an actual failure can be ignored or justified with explanations.
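For teams that want a starting point, a scheduled failure can be as small as a wrapper that makes one dependency fail on purpose so everyone can watch how the calling code copes. The sketch below is illustrative only: `flaky`, `fetch_price` and `fetch_price_with_fallback` are hypothetical names, not part of Netflix's tooling or any real library, and a real exercise would target an actual service in a controlled environment.

```python
import random


class InjectedFailure(Exception):
    """Raised by the injector to simulate a dependency outage."""


def flaky(func, failure_rate, rng):
    """Wrap func so it fails artificially with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Hypothetical stand-in for a call to a real pricing service.
    return {"widget": 9.99}[item]


def fetch_price_with_fallback(fetch, item, cached):
    """Return the live price, or the cached one if the dependency is down."""
    try:
        return fetch(item), "live"
    except InjectedFailure:
        return cached[item], "cache"


if __name__ == "__main__":
    rng = random.Random(42)
    # failure_rate=1.0 forces the outage so the fallback path is exercised.
    broken = flaky(fetch_price, failure_rate=1.0, rng=rng)
    print(fetch_price_with_fallback(broken, "widget", {"widget": 9.49}))
```

If the fallback path turns out to be missing or broken, the team finds out in the meeting rather than from a customer.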

2. Create (and protect) time for learning

No good engineer fixes the same problems without learning in the process. Make sure the teams responsible for resolving incidents have time to work through comprehensive postmortems.

Empower your teams to analyze what worked and what didn’t, without forcing them to determine a root cause. All too often, human error is the focus of these conversations, and that just isn’t healthy. Blameless retrospectives allow teams to uncover the real issues and make proactive adjustments.

Businesses want to move fast, but resist the temptation to move on to other issues as soon as systems resume running or everyone agrees on a “root cause.” Invest the time needed to understand how your systems and teams work. See it as an opportunity for the contextual learning needed to make real-time decisions that will improve your company’s mean-time-to-resolution.
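Mean-time-to-resolution is simple to track once incidents are logged with a start and an end. The following is a minimal sketch that assumes each incident is recorded as a pair of detection and resolution timestamps; the function name and the sample data are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average the time between detection and resolution.

    incidents: list of (detected_at, resolved_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    durations = (resolved - detected for detected, resolved in incidents)
    return sum(durations, timedelta()) / len(incidents)


if __name__ == "__main__":
    # Made-up sample incidents: one lasting 30 minutes, one lasting 90.
    sample = [
        (datetime(2014, 10, 1, 9, 0), datetime(2014, 10, 1, 9, 30)),
        (datetime(2014, 10, 3, 14, 0), datetime(2014, 10, 3, 15, 30)),
    ]
    print(mean_time_to_resolution(sample))
```

Tracking this figure from one postmortem to the next gives a rough sense of whether the learning is paying off.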

3. Treat your ops and dev teams like sales and marketing. They drive revenue.

If you didn’t support your sales teams with tools, training and incentives to hit their goals, people would think you were nuts. Despite their critical role in ensuring your customers are getting value from your company, ops and dev teams often get less attention than their customer-facing counterparts.

Give these employees the infrastructure and tools to achieve peak performance. That includes the latest operations management tools, time and resources for training, and goals with incentives to meet them. If you don’t provide them with the necessary support and recognition, how can you expect them to deliver a high-value product with high availability?

4. Set a high bar for uptime

Even short periods of downtime have a material impact on your bottom line and market perception. Once you’re committed to supporting your engineering teams, however, you’re in a much better position to set a higher bar for uptime. Build, buy or partner to get the technology and skill sets you need.
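It helps to translate an uptime target into the downtime it actually permits. The sketch below does that arithmetic under the assumption of a 365-day year; the function name is illustrative. Five nines, for instance, leaves a little over five minutes of downtime for the whole year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Three nines, four nines and five nines.
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime a year")
```

Each extra nine cuts the allowance by a factor of ten, which is why the jump from three nines to five is a matter of engineering investment rather than goodwill.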

Unfortunately, many companies still use homegrown operations management systems without redundancy, and still rely on disparate tools and manual processes to meander through the incident lifecycle. Focusing on reducing ops team costs instead of setting the right culture from the start simply doesn’t make sense. The time spent on fixes alone will quickly become a greater cost for your company, and your products and services will suffer as a result.

CEOs who understand the importance of reliability in today’s always-on world don’t wait until there’s an outage to improve operations. They don’t ignore the rich learning that comes from resolving incidents. They don’t treat operations and development teams like the “back office.” The CEOs of highly reliable companies invest in their operations infrastructure, processes and people because they care about the growth of their business and the loyalty of their customers.
