The Uptime Institute has stated that with more investment in management, process and training, outage frequency of data centres would almost certainly fall significantly. Sarah Parks, Marketing and Communications Director, CNet Training, considers some of the ways to reduce highly preventable downtime as human error accounts for a large number of outages.
What poses the single biggest risk in a data centre? Unfortunately, the answer is usually that your people are the single point of failure. Human error still accounts for a large proportion of outages and, quite simply, not enough is being done to address the issue. Findings from a recent Uptime Institute survey show that outages are becoming more frequent and more expensive. Now really is the time for the industry to wake up and start looking at ways to reduce this highly preventable downtime.
When humans are involved, mistakes are inevitable. However, simply accepting that risk is not recommended. Robust processes and rigorous professional development plans need to be the norm. Organisations need to learn from their previous outage experiences and mitigate the inevitable risk of human error by continually assessing, reviewing and enhancing the skill sets of their teams.
We are all aware that outages are hugely costly for organisations, but it’s not just the financial impact; it’s also the detrimental effect an outage can have on brand reputation, customer confidence and perceived compliance. Investing in staff education, training and personal development can pay significant dividends. Educated, officially certified individuals could potentially save their employer millions, either by doing things right and so reducing the likelihood of an outage, or by recognising and resolving an issue early enough to prevent one.
The 2020 Data Center Industry Survey Results report by Uptime Institute shows what IT and data centre managers around the world are thinking, doing and planning in the areas of efficiency, resiliency, workload placement, staffing and new technology adoption. A total of 78% of organisations stated that they have had an IT-related outage in the last three years, with 75% saying that their most recent outage could have been prevented with better management, meaning that a large proportion of outages were a result of human error. This figure has increased by 15% since 2019, when the survey asked the same question. The problem is getting worse rather than better.
There is no single quick fix when humans are involved. To some degree, problems will still occur, but organisations need to take a hard look at the steps that can make their teams as competent and skilled as they can be, and guard against complacency. Organisations can’t treat preventable outages as acceptable, especially when the survey also reveals that in 2020 a greater percentage of outages cost more than US$1 million (now nearly one in six rather than one in 10, as in 2019), and a greater percentage cost between US$100,000 and US$1 million. Surely, if organisations get better at spotting the knowledge, competency and skills gaps in their teams and invest to fill them, while keeping processes and procedures up to date, the outcome could be significantly different. Industry-supported education programmes award official certifications and qualifications, and individual and team analytical tools, backed by science and psychological methodology, can now identify exactly where knowledge, competency and even confidence levels are lacking. There are many opportunities for organisations to take important steps towards human risk mitigation.
Ultimately, it is industry best practice to regularly test and monitor mission-critical equipment throughout its life cycle. As an industry, we service our technical equipment to check that it is still functioning as expected, plan its future lifespan, and renew or restore it to prevent outages. The same thinking needs to be applied to the teams working in data centres.
The individuals responsible for outages are not looking to sabotage anything. They are usually experienced members of the technical team who, for one reason or another, are not following processes or have gaps in knowledge, competence or confidence. When people have been doing the same job for an extended period, their confidence can take over, causing them to overlook details and specific processes, which in turn can cause catastrophic failures. They could be confidently doing things wrong.
One of the big challenges organisations face is that continual professional development budgets are usually limited, or cut to boost other areas of the organisation. There is also a common misconception about how education and training should be allocated: these activities are often used as a reward for the most loyal or high-performing people, rather than being given to those who actually need them most. As a result, employees gain very little from the development activities, and the organisation sees little or no benefit or ROI. It’s crazy when you think of the massive risk data centre operators are taking by not investing in their people. An outage could cost them thousands per minute, and the statistics continue to show that a large portion of outages caused by human error are avoidable.
The Uptime Institute survey states that with more investment in management, process and training, outage frequency would almost certainly fall significantly. Hopefully this will ring alarm bells for the rest of the industry and turn its attention to these areas. The pandemic has highlighted the critical importance of the digital infrastructure industry, and demand is only going to increase. Combined with a growing skills shortage and an ageing workforce, this is a stark warning: if organisations don’t properly develop, train and invest in teams throughout the entire workforce, outages are likely to become bigger and more expensive (as the current figures suggest).
With an ageing workforce, many experienced industry professionals will soon be looking to retire, taking decades of industry and on-the-job experience with them. We must question whether the team members who will take their place are sufficiently trained, experienced and ready to handle any future issues that might arise. Organisations need to address the problem head-on instead of waiting for things to start going wrong.