What are the most effective strategies for ensuring uninterrupted mission-critical operations in a data centre, particularly in the face of unexpected disruptions?

Unprecedented pressure on data centres to reduce downtime has created situations that call for unique solutions. Although the digital infrastructure sector has advanced over the last few years, global events such as climate crises, invasions and COVID-19 have introduced uncertainty into supply chains, raised costs and increased the pressure to perform sustainably.

Operators are facing a continued state of flux that widens opportunities for growth but, if managed poorly, could lead to fragile performance and interrupted operations.

2023 was recorded as the warmest calendar year in global temperature data records going back to 1850, according to The Copernicus Climate Change Service (C3S). Billy Durie, Global Sector Head Data Centres, Aggreko, mentioned in Uptime on the Line that heatwaves are causing facilities to power down their servers to avoid irreparable damage, and countries such as France and the Netherlands are seeing rivers and reservoirs run dry, complicating temperature control where water storage is used.

The resilience of data centres is being challenged by these threats. While operators may use them to drive operational progress, such as more sustainable practices, the prospect of continual uptime is fading for those who ignore the necessity of reliable power supplies, Disaster Recovery efforts and centralised monitoring systems.

Data centre customers will gradually become more conscious of the management models that operators use and whether those models can cope with unprecedented events. This will inevitably apply more pressure to facilities and the talent inside them.

A recent trends report released by Uptime Institute discusses how Artificial Intelligence (AI) will help make operational decisions relating to resiliency and energy efficiency. While this prediction for 2024 highlights the tools available to assist in critical system management, it remains to be seen whether the infrastructure can balance the power requirements of new technology with its own set of on-premise challenges.

While extreme weather conditions are expected to continue, when and how they will affect each facility will remain unpredictable. The dependency on data for Business Continuity is growing and data centre organisations will need to prove their reliability. It’s important to take a dynamic and humble approach to adapting to external environments that lie outside operators’ control.

Offering their views on how to tackle service disruptions and preventative measures for ensuring uptime, are industry experts: Shane Kilfoil, President Mission Critical Environments, Subzero Engineering; Alexandre Silvestre, Data Centre Fabric Business Development Manager, Nokia; Matthew Farnell, Global Director, EkkoSense; and Nick Ewing, Managing Director, EfficiencyIT…

Shane Kilfoil, President Mission Critical Environments, Subzero Engineering

Today’s data centres are the backbone of numerous industries, supporting critical applications and services. To ensure uninterrupted operations, particularly in the face of unexpected disruptions, organisations must adopt strategic measures focusing on monitoring, redundancy and failover capabilities, in addition to implementing Disaster Recovery plans and operational continuity procedures.

Of course, it’s relatively easy to design an efficient and reliable data centre if you’re building one from scratch. With potential future uses, scalable capacities, data centre densities and specific cooling needs considered in the design phase, redundancy measures can be implemented from the outset, in the knowledge that power surges and cooling requirements have been allowed for.

Separation and zoning of mission-critical and high-density servers that require different cooling technologies, along with backup generators and monitoring infrastructure, can also be built into the data centre space at the construction stage. While these strategies can also be retrofitted into an existing data centre, additional considerations may be needed to maximise the utilisation of existing infrastructure and resources.

Monitoring systems are invaluable for expanding flexibly and meeting spikes in demand without compromising performance or reliability. They can proactively identify and address potential issues before unplanned incidents occur. Monitoring software that can produce compliance-ready ESG reporting to track sustainability efforts and support data-based decisions helps to prioritise energy efficiency measures and optimise resource utilisation. Monitoring also aids in right-sizing computing requirements and working out when to retire older or less efficient hardware to optimise energy consumption.

It’s all very well having systems and components such as backup generators in place to prevent single points of failure, but the data centre also needs to conduct regular drills and tests to prepare for unexpected disruptions. Maintaining a cross-trained workforce that can quickly respond and resolve issues across different systems is vital to ensure unforeseen outages can be assessed and resolved as swiftly as possible.

A robust supply chain should also be developed. As we have seen in the recent past, the pandemic rapidly exposed vulnerable supply chains, causing untold misery across many industries and markets. It’s important to harden the supply chain to ensure reliable access to necessary materials and parts in the face of unplanned incidents such as natural disasters, cyberattacks or unexpected interruptions of operation.

By implementing these strategies, organisations can ensure sustained operations in their data centres, in addition to optimising resource utilisation, enhancing sustainability and building resilient infrastructures capable of adapting to future challenges in the dynamic digital landscape.

Alexandre Silvestre, Data Centre Fabric Business Development Manager, Nokia

Data centre traffic is evolving and growing quickly as enterprises embrace innovation through Digital Transformation and distributed cloud services. Enterprises want to boost data centre performance, efficiency and scalability to meet the demands of cloud, 5G, AI, IoT and Industry 4.0 application workloads. But many fear that these changes could put their mission-critical operations at risk.

We believe that modern, open and automated data centre networking approaches offer the best way for enterprises to keep up with new demands and keep their mission-critical operations up and running. By implementing these approaches, enterprises can build next-generation data centre fabrics that scale easily and flexibly while providing much greater efficiency and resiliency. They will give enterprises freedom to innovate with confidence.

The cornerstone of a next-generation data centre fabric is a truly open Linux-based network operating system (NOS). To efficiently handle new application workloads and preserve mission-critical operations, the NOS must take data centre agility and flexibility to new heights through features such as:

A programmable cloud-native design
Robust and field-proven IP stack with recognised stability, interoperability and security
Open and scalable telemetry interface
Model-driven management to support complete openness and a fully customisable and programmable command line interface
An unmodified Linux kernel that can be leveraged by new, custom-designed applications
Microservices that support hitless upgrades and resilient networking
Standards-based architecture for network redundancy and dual-homing with vendor interoperability

An advanced network automation toolkit can help enterprises take full advantage of these capabilities to support innovation without risking critical operations. Tools that use open frameworks to enable programmability and intent-based automation can make it easy to meet new demands, minimise human errors, respond to disruption and increase efficiency at every phase of the data centre fabric lifecycle.

A solution that provides a Digital Twin of the data centre fabric reduces risk by allowing network teams to safely test potentially disruptive changes before applying them to the live network.

Enterprises also need high-performance hardware that can seamlessly handle the workloads created by 5G, Industry 4.0 and AI applications. An ideal solution will provide platforms that enable network teams to implement modern, massively scalable and highly reliable data centre switching architectures. Platforms that use the same hardware and software design and come in different form factors and configurations will provide maximum flexibility to support leaf/spine, spine and super-spine applications. They should offer high capacity, a variety of port speeds and robust switching, routing, QoS, model-driven management, telemetry and security capabilities.

With modern data centre fabrics built on a fully open NOS, advanced automation tools and high-performance hardware, enterprise network teams will be well equipped to ensure uninterrupted mission-critical operations as they work to unlock the benefits of Digital Transformation and distributed clouds.

Matthew Farnell, Global Director, EkkoSense

According to research from Uptime, data centre downtime costs – and their impact – continue to be problematic for many operators, with reported US$100,000+ incidents increasing 39% since 2019 and those over US$1,000,000 by 15%. Power problems and human error are cited as among the main downtime causes.

Implementing a strategy to mitigate downtime must encompass all aspects of data centre operations: systems, people and processes. These range from ensuring redundancy in the design of failover systems and UPS power backup systems; effective climate and environmental monitoring; data replication; Disaster Recovery; monitoring and alerting systems; planned maintenance scheduling and employee training; and documentation and process control.

For climate and environmental monitoring and for alerting, many data centre operators continue to rely on Building Management Systems (BMS). While the BMS is an essential and overarching system management platform for day-to-day data centre operations, event-based alerting tends to happen after the event and leads operations teams to be reactive rather than proactive.

As the industry comes to terms with hosting high-density AI systems, it will be interesting to see how capacity planning strategies adapt to high-density 60kW AI systems and the heat they generate in data halls that were originally designed to host the traditional 3-5kW per rack. Liquid cooling technologies will need to play a part, and hybrid cooling strategies will become commonplace. The other dynamic, which affects human error, is the shortage of skilled data centre staff across the industry and the pressure that places on today’s operations teams and management.
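To make the capacity planning point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that essentially all IT power ends up as heat, that the hall's cooling budget stays fixed, and that the 60kW figure applies per rack. The hall size (200 racks at 5kW) and the function name are hypothetical, chosen only for illustration.

```python
def racks_supported(hall_cooling_kw: float, rack_kw: float) -> int:
    """Number of racks of a given density that a fixed cooling budget can absorb."""
    if rack_kw <= 0:
        raise ValueError("rack_kw must be positive")
    return int(hall_cooling_kw // rack_kw)


# Hypothetical hall: 200 racks designed at 5 kW each, i.e. 1,000 kW of heat rejection.
design_racks = 200
design_kw_per_rack = 5.0
hall_cooling_kw = design_racks * design_kw_per_rack

# The same cooling budget spread across 60 kW AI racks.
ai_racks = racks_supported(hall_cooling_kw, 60.0)
print(f"{design_racks} racks at 5 kW, or {ai_racks} racks at 60 kW")
```

Under these assumptions the same hall could absorb the heat of only 16 such racks, which is why supplementary or liquid cooling enters the discussion.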

Ensuring uninterrupted mission-critical operations remains the highest priority for data centre teams, and throughout 2024 we’re busy developing solutions that will help operators to maintain uptime. Applying AI and Machine Learning proves effective in analysing the very large datasets produced by M&E systems – providing real-time visibility into what’s really happening in the data hall. A key factor here is the use of gaming technology to provide a 3D user experience, making it much easier for operations teams to visualise what’s going on without the need for intensive training.

Another key innovation is the cooling anomaly advisor that uses data analytics to highlight cooling trends. If cooling unit performance trends up or down, we notify the operator but also highlight the underlying causes. This helps operations teams to be much more proactive and get ahead of potential issues.
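As a generic illustration of trend-based alerting, and not a description of how any particular product works, the sketch below fits a least-squares line to recent readings from a cooling unit and flags a sustained rise or fall. The function names, the sample supply temperatures and the 0.1 degrees-per-sample threshold are all assumptions.

```python
from typing import Sequence


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the readings against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify_trend(values: Sequence[float], threshold: float = 0.1) -> str:
    """Label a series as rising, falling or stable (threshold is an assumed tuning value)."""
    slope = trend_slope(values)
    if slope > threshold:
        return "rising"
    if slope < -threshold:
        return "falling"
    return "stable"


# Hypothetical supply-air temperatures (deg C) from one cooling unit.
supply_temps_c = [18.0, 18.2, 18.1, 18.5, 18.9, 19.4, 19.8]
print(classify_trend(supply_temps_c))  # prints "rising"
```

Unlike a fixed alarm threshold, a check of this kind can fire while every individual reading is still within limits, which is the difference between reactive and proactive alerting described above.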

Nick Ewing, Managing Director, EfficiencyIT

Today there are a host of strategies that owners and operators can undertake to ensure mission-critical reliability, especially in the face of unexpected disruption. But just as with all data centre and distributed IT deployments, there’s no one size fits all – and at EfficiencyIT, we believe in taking a consultative approach to critical applications, addressing each customer’s requirement on a case-by-case basis.

When seeking to avoid downtime or improve the resiliency of your infrastructure systems, however, there are three key areas that we believe are vital for organisations to address and that are common to most, if not all, data centre applications.

The first is having complete visibility of your critical systems and infrastructure assets – where they’re located, what their health or operating status is currently and how they’ve performed not only during the last 24 hours but over the last three, six or 12 months.

Here, the three R’s – reliability, redundancy and resiliency – are vital, and that extends from the operators’ critical power systems all the way through to their cooling and generator equipment. A failure in just one of these places will often trigger a series of unanticipated events and if left undetected or without remedy, can have a catastrophic effect – loss of service, business-critical data and even revenue.

Often the devil is in the data, so leveraging a Data Centre Infrastructure Management (DCIM) software platform that offers the ability to aggregate all your systems and data into one platform, and thereby utilise AI to process and generate real-time, actionable information, can be the very difference between failure and success.

To that effect, the second key area is ensuring you have a regular and robust condition-based maintenance programme in place, and that you’re working with an expert engineering team to address potential issues proactively, before they have major implications. Through new DCIM platforms, customers can share insights with their engineering and services teams securely, allowing them to address said issues – a battery or cooling failure, for example – before they cause an outage.

Thirdly, designing your data centre for resiliency in an N+1 configuration and to recognised standards, such as BS EN 50600, will allow you to ensure greater redundancy in all your equipment – Uninterruptible Power Supplies (UPS), Power Distribution Units (PDUs), generators, switchgear and cooling. Doing so will provide an essential safeguard for all your critical systems, enabling you to future-proof and minimise the impact of downtime.
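The N+1 idea can be checked with simple arithmetic: the load must still be covered after any single unit is lost. The sketch below is a minimal illustration in Python; the function name and the UPS module ratings are hypothetical, and a real design review would also consider distribution paths, derating and maintenance states.

```python
from typing import Sequence


def survives_single_failure(unit_capacities_kw: Sequence[float], load_kw: float) -> bool:
    """True if the load is still covered after losing the largest single unit."""
    if not unit_capacities_kw:
        return False
    remaining = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining >= load_kw


# Hypothetical example: a 900 kW critical load on 500 kW UPS modules.
print(survives_single_failure([500.0, 500.0, 500.0], 900.0))  # three modules: True
print(survives_single_failure([500.0, 500.0], 900.0))         # two modules: False
```

Removing the largest unit is the worst single failure, so passing this check means the loss of any one unit is tolerated.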

Further, the additional redundancy will also allow you to plan outage scenarios in advance, and both test and turn off equipment in a controlled manner, ensuring everything works as expected in the face of unexpected disruption. Ultimately, when seeking to avoid an outage, prevention is far better than cure and, with Uptime Institute stating that human error plays a role in about two-thirds of all outages, it’s better to be safe than sorry.
