Durham University, one of the UK’s leading HPC research facilities, has chosen 232 nodes of its COSMA7 DiRAC system for a test deployment of a new Rockport switchless network architecture which, depending on the code being run, is showing a performance improvement over InfiniBand. The Arepo cosmology code saw up to a 28% performance improvement over a comparable EDR InfiniBand network when running on a small test cluster. By conquering congestion – a workload killer – the university hopes to speed up research results with more predictability, better resource utilisation and economic efficiency. Alastair Basden, DiRAC/Durham University, Technical Manager of COSMA HPC Cluster, tells us more and explains the necessary priorities in the design and build of a data centre and how this contributes to uptime.
Durham University’s Institute for Computational Cosmology (ICC) has selected the Rockport Switchless Network as part of the ExCALIBUR programme around new networking technologies for a test deployment on its COSMA7 supercomputer, part of the DiRAC HPC facility. For the first time, the COSMA7 cluster and its 232 Rockport nodes will provide Durham University, DiRAC and the ExCALIBUR programme with insight into the benefit of potentially reduced congestion as they model exascale workloads and use codes to run on future exascale systems.
Durham and ExCALIBUR want to explore whether Rockport’s switchless architecture can provide speed enhancements for performance-intensive workloads when deployed at scale, as well as assessing other potential benefits relative to switch-based networks such as enabling elastic scaling and simplifying operations. Rockport’s fully distributed network fabric has the potential to deliver better performance, resource utilisation and network economics by using smart congestion control without the need for external switches.
The ICC’s massively parallel, performance- and data-intensive research into dark matter and energy, black holes, planet formation and collisions requires tremendous computational power, where traditionally the interconnect, as well as memory bandwidth and latency, can be a limiting factor. The Durham system is the DiRAC memory-intensive service, which uses large RAM nodes to support workloads; the challenge is ensuring that the network can keep up. Interconnect-related performance and congestion challenges experienced when processing advanced computing workloads result in unpredictable completion times and under-utilisation of expensive compute and storage resources.
After an initial Rockport proof-of-concept deployment in the Durham Intelligent NIC Environment (DINE) supercomputer, the university quickly saw Rockport’s technology as a potential way to address congestion.
“Based on the results and our first experience with Rockport’s switchless architecture, we were confident that larger scale investigation was warranted, as part of our mission to improve our exascale modelling performance – all supported by the right economics,” said Dr Alastair Basden, DiRAC/Durham University, Technical Manager of COSMA HPC Cluster.
COSMA7 is helping scientists analyse space’s biggest mysteries including dark energy, black holes and the origins of the universe. By deploying the Rockport Switchless Network in COSMA7, ICC researchers and their collaborators around the world can gain first-hand experience of Rockport’s architecture and see for themselves whether it can benefit their work by reducing research delays and thus help them to create more complex simulations of the universe. Rockport’s monitoring tools also deliver deep insights into how codes are performing which can be used to further improve code performance.
The Rockport Switchless Network distributes the network switching function to its COSMA7 endpoint devices (nodes) which become the network. By eliminating layers of switches, the Rockport network ensures compute and storage resources are not left starved for data and researchers have more predictability regarding workload completion time.
Rethinking network switches creates an opportunity to leverage direct interconnect topologies that provide a connectivity mesh in which every network endpoint can efficiently forward traffic to every other endpoint. The Rockport Switchless Network is a distributed, highly reliable, high-performance interconnect providing pre-wired supercomputer topologies through a standard plug-and-play Ethernet interface.
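As a rough illustration of what a direct interconnect means, the sketch below models a small cluster in which each node is cabled straight to its neighbours and forwards traffic itself, with no switch in between. The 3D wrap-around grid (torus) and the 4 x 4 x 4 size are assumptions chosen for simplicity; they are not a description of Rockport's actual topology, and the function names are hypothetical.

```python
from collections import deque
from itertools import product


def torus_neighbours(node, dims):
    """Return the nodes directly cabled to `node` in a wrap-around grid."""
    neighbours = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % size
            neighbours.add(tuple(coords))
    neighbours.discard(node)  # a dimension of size 1 wraps onto itself
    return neighbours


def hop_counts(source, dims):
    """Breadth-first search: fewest node-to-node hops from `source` to each node."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in torus_neighbours(current, dims):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    return hops


dims = (4, 4, 4)  # hypothetical 64-node layout, not COSMA7's real configuration
nodes = list(product(*(range(d) for d in dims)))
hops = hop_counts((0, 0, 0), dims)
max_hops = max(hops.values())

print(f"nodes reachable without a switch: {len(hops)} of {len(nodes)}")
print(f"direct neighbours per node: {len(torus_neighbours((0, 0, 0), dims))}")
print(f"worst-case hops between two nodes: {max_hops}")
```

In this toy layout every node has six direct neighbours and the farthest node is six hops away, so all 64 endpoints can reach one another by forwarding through peers alone. A real fabric would also need routing and congestion control, which this sketch does not attempt to model.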
Alastair Basden, DiRAC/Durham University, Technical Manager of COSMA HPC Cluster, discusses the project in further detail and outlines the benefits Rockport has provided.
What does it mean to be the technical manager of COSMA HPC Cluster at Durham University – what does this require?
I manage an HPC system here. It’s part of the DiRAC Tier I national facility, operated by one of the UK research councils. DiRAC has four different HPC deployments around the country and the one that we have here in Durham is called COSMA. What it means is that I basically keep the system running on a day-to-day basis with the help of my team, and we do various repairs, etc. that might be required. We answer user queries, but we also plan for future extensions and consider where we need to take the system to meet the future needs of our researchers.
Can you tell us about some of the challenges the university was looking to overcome ahead of its work with Rockport?
An HPC system comprises three main elements – there’s the compute side of things, which is just lots of processors; there’s the storage side of things, which is where the data is saved: we do large simulations here – cosmology simulations – and we have to save all our data to storage; and then there’s the network fabric, which links the nodes together and couples the storage as well. One of the problems we find is that the network isn’t as fast as we’d like. Ideally, we would have an infinitely fast network but we’re never going to get that. Another problem is that if there’s congestion on the network, things can slow down. So, if there are other jobs or simulations running that use lots of network bandwidth, then it can affect our jobs. So we were trying to look at ways of reducing the impact of congestion and the Rockport network looked like it was a good way of achieving that.
How do data centres play a part in the network operations of Durham University?
The university itself has several data centres – COSMA is hosted in one of those. COSMA is a self-contained unit used by the university researchers but also researchers from all over the UK, in fact all over the world.
What would you prioritise in the design and build of a data centre – how does this contribute to uptime?
When we’re planning and building a new data centre, the two key requirements are the input power – there has to be enough power coming in from the grid; but also how to get rid of the excess heat – so the cooling facilities. So that’s two key things you have to think about when designing a data centre, as well as physical floor space.
Once you’ve got your facility, your building and the infrastructure in place, you have to think about how you design the kit inside it – so, the type of compute, type of network and type of storage.
We are in a semi-fortunate position in that we don’t need 100% uptime (we’re not a mission-critical service). Our researchers know that sometimes the system will go down and they can plan their research around the regular maintenance periods, three times a year. This means we can get away with less redundancy than a system that a cloud provider might run, which requires 100% uptime. This in turn means we have cost savings we can invest in more compute nodes.
How do you maintain and operate the data centre once up and running – where does networking come into this?
There’s a team of us that keep the system running. We do preventative maintenance, we do active maintenance – if something has failed we go in and replace parts. We are also always on the lookout for security issues so we also do a lot of software patching. And networking generally is one of those things that, until something goes wrong, we can leave it be. We regularly look at the status of the network and if there are problems such as links that have died etc. – we can go in and replace components, whether that be network cards or cables etc.
Do you have colocated data centres throughout the campus and if so, what benefits does this bring?
Around the university there are a small number of data centres and each has its own remit, its own job to do. There are two main data centres for doing high performance computing – we have bits of kit in both of those, but we’re fairly well isolated within a single data centre.
How do you expect your work with Rockport to evolve moving forward?
We’ve just installed the new Rockport fabric, which we finalised recently. We’re now starting to run codes on it and look at performance, how well it scales etc. Moving forward, we’ll be looking at the cost competitiveness of the Rockport fabric – it is certainly in the running for being installed in these new systems.