Summary/Story at a Glance
The SynxFlow hydrodynamic modeling code was re-engineered using Intel's oneAPI implementation of SYCL, a cross-vendor heterogeneous programming model. This transformation aimed to address limitations in GPU compatibility and enhance scalability for next-generation exascale High-Performance Computing (HPC) systems. The new implementation delivers improved performance, interoperability, and scalability across diverse hardware platforms, enabling ground-breaking large-scale probabilistic forecasting at high spatial and temporal resolution.
The SYCL implementation and performance tests used components of the Intel® oneAPI Base Toolkit and Dawn, one of the UK’s newest and fastest artificial intelligence (AI) supercomputers.[1]
Introduction
SynxFlow[2] is an open-source GPU-based hydrodynamic flood modeling software developed in CUDA*, C++, and Python* by Dr. Xilin Xia (University of Birmingham) and his colleagues. The CUDA code runs the simulations, while the Python code handles data pre-processing and visualisation. Running on multiple GPUs, SynxFlow can run flood simulations faster than real time with hundreds of millions of computational cells at metre-level resolution. As open-source software with a user-friendly Python interface, it can be easily integrated into data science workflows for disaster risk assessment. The model is therefore widely used in research and industry, for example to support flood early warning systems and to generate flood maps for (re)insurance companies.
The SynxFlow software is capable of simulating flooding scenarios and related hazards, including landslide runout and debris flow. Such simulations are vital in the advance planning and management of emergency services. Detailed prediction of natural hazards can help mitigate their adverse social and economic impacts. Beyond risk assessment and disaster preparedness, hazard simulation with SynxFlow can also assist in urban planning, environmental protection, climate change adaptation, insurance and financial planning, infrastructure design and engineering, public awareness, and education.
In a collaborative project supported by the Natural Environment Research Council between the University of Birmingham and the UK Centre for Ecology and Hydrology, Dr Xia and his colleagues are developing a new generation of probabilistic flood forecasting systems.
In this study, the team aims to develop a system capable of simulating river flow and generating probabilistic high-resolution flood maps, built by coupling the SynxFlow model with UKCEH’s G2G hydrological model. The probabilistic flood maps are created by deriving likelihoods of flooding from an ensemble of high-resolution flood maps (Figure 1). They provide critical insights into the likelihood and severity of potential flooding events, helping decision-makers and emergency responders better prepare for and mitigate risks.
This poses a significant computational challenge, as it demands an ensemble of high-resolution flood simulations running much faster than real time. To implement such a system, the team turned to the UK’s latest supercomputer, DAWN[1], with over 1000 Intel® Data Center GPU Max 1550 GPUs.
Figure 1: An example of a high-resolution flood map created by SynxFlow for a city
Challenge/Objective
The primary challenge was overcoming the hardware limitations imposed by the CUDA programming model used by SynxFlow, which only supports NVIDIA* GPUs. The first step in building the new flood forecasting system was therefore to port the original CUDA code to a language that supports the Intel® Data Center GPU Max Series on DAWN. The team also hoped that the ported code would support GPUs from other vendors, or even CPUs. After weighing various options, the SynxFlow development team decided to leverage the Intel oneAPI Base Toolkit, an implementation of the oneAPI specification backed by the Unified Acceleration (UXL) Foundation and built on SYCL, a multiarchitecture, multi-vendor programming framework. With support for Intel, NVIDIA, and AMD* GPUs, it includes the Intel® DPC++ Compatibility Tool for easy, largely automated code migration from CUDA to SYCL. After about six weeks of effort, the team successfully made the SynxFlow model run on DAWN, utilising as many as 64 GPUs simultaneously, while still being able to run it on an NVIDIA GPU supercomputer with similar performance.
Performance Results & Benefits
To test the performance, Dr Xia initially chose a case study of the West Midlands region, which includes the city of Birmingham. The total area is 1050 km² and the simulation resolution is 2 m, giving a total simulation size of 262.5 million cells.
Two different supercomputers were used to test the model:
- EPSRC Tier-2 Baskerville Supercomputer: 228 NVIDIA* A100 GPUs
- DAWN Supercomputer: 1024 Intel Data Center GPU Max 1550 GPUs
Three different scaling settings were used: 16 GPUs across 4 nodes, 32 GPUs across 8 nodes, and 64 GPUs across 16 nodes. For a 3-hour simulation, the running times are shown in Figure 2. As the results show, the performance of the original CUDA code and the new SYCL code is comparable.
The SYCL code is even faster than the original CUDA code on the same NVIDIA-based supercomputer.
As seen in Figure 2, the same SYCL code's performance on different platforms is also comparable.
Figure 2: Running time for a 3-hour simulation on various supercomputers with different numbers of GPUs
The scaling efficiency is also encouraging. As shown in Figure 3, both the SYCL and CUDA codes achieved satisfactory strong-scaling efficiency, with the SYCL code’s efficiency higher than the CUDA code’s.
The efficiency of the SYCL code on 64 GPUs is 89%, which can be considered very high. Due to the limits of the simulation size and the maximum number of GPUs per job, the team could not run the simulation on a smaller or larger number of GPUs. For the CUDA code, the team could only run the simulation with up to 32 GPUs, the largest number of GPUs that could be used per job.
Figure 3: Strong scaling efficiencies of the CUDA and SYCL code
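For clarity, strong-scaling efficiency here is measured relative to the smallest run: E(n) = (T_base × n_base) / (T_n × n), so perfect scaling gives 1.0. The sketch below shows the arithmetic with hypothetical running times; the actual measurements are those plotted in Figure 2.

```python
# Strong-scaling efficiency relative to a baseline run:
#   E(n) = (T_base * n_base) / (T_n * n)
# The running times below are hypothetical placeholders, not the
# measured SynxFlow results shown in Figure 2.

def strong_scaling_efficiency(times: dict, base: int) -> dict:
    """Map GPU count -> efficiency relative to the `base` GPU count."""
    t_base = times[base]
    return {n: (t_base * base) / (t * n) for n, t in times.items()}

# Hypothetical times (seconds) for 16, 32, and 64 GPUs.
times = {16: 1000.0, 32: 550.0, 64: 310.0}
eff = strong_scaling_efficiency(times, base=16)
# Perfect scaling gives 1.0; a value near 0.9 at 64 GPUs would match
# the ~89% efficiency reported for the SYCL code.
```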
With the new code proven performant on both CUDA and SYCL platforms, the team is now developing the new probabilistic flood forecasting system by coupling the SynxFlow model with UKCEH’s G2G hydrological model[3]. The effort of developing the new SYCL-based SynxFlow code was also shortlisted for the prestigious HPCwire Readers’ Choice Awards in 2024.
The Path to a Performant SYCL Solution
When porting the code to SYCL, the team followed these steps:
- Using the Intel DPC++ Compatibility Tool to Translate CUDA Kernels into SYCL:
The original SynxFlow code is organised using CMake. Although the latest version of the Intel DPC++ Compatibility Tool supports migrating CMake build environments, the team chose to port the build files manually for greater control, since the migration was carried out on a Windows* machine, where this feature is not supported. To run the tool, a new Microsoft* Visual Studio project was set up, which allowed the porting tool to run successfully. Most kernels and API calls were automatically translated into SYCL (see Figure 4 for an example). Compiling the code produced some errors, but they were easy to fix by following the compiler’s error messages. At this stage, the NCCL library calls were not translated, so only the single-GPU code compiled successfully.
(a) Original CUDA kernel
(b) Migrated SYCL kernel
Figure 4: Comparison between the original CUDA kernel (a) and translated SYCL kernel (b)
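For reference, a typical Compatibility Tool invocation on a source tree looks like the sketch below. The directory and file names are hypothetical placeholders rather than SynxFlow’s actual layout, and the available options depend on the installed oneAPI version.

```shell
# Migrate a CUDA source tree to SYCL with the Intel DPC++ Compatibility Tool.
# --in-root/--out-root keep the migrated sources in a separate directory.
# The paths and the file name below are illustrative placeholders.
dpct --in-root=./src --out-root=./src_sycl ./src/flood_kernels.cu
```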
- Performance Profiling:
Once the code was successfully compiled, the next step was to evaluate its performance. To run the code on NVIDIA GPUs, the team used Codeplay’s oneAPI for NVIDIA* GPUs plugin. Initial testing on a small test case revealed a noticeable performance loss, so performance profiling tools were employed to investigate. The profiling results indicated that the function ‘cuEventCreate’ was being invoked after each kernel run, which raised concerns about its impact on overall efficiency. However, further analysis revealed minimal performance impact for larger test cases, so no further optimisation was done; it was later found that the ‘cuEventCreate’ calls are a normal part of Intel’s SYCL implementation.
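Either vendor’s profiler can surface this kind of runtime-API overhead; the article does not name the specific tools used, so the invocations below are a hedged sketch, and the binary and input names are placeholders.

```shell
# On the NVIDIA platform, Nsight Systems traces CUDA driver/runtime API
# calls (such as cuEventCreate) issued by the SYCL runtime:
nsys profile --stats=true -o synxflow_trace ./synxflow_sim input.json

# On Intel GPUs, VTune's gpu-offload analysis provides a similar view:
vtune -collect gpu-offload -result-dir r_offload -- ./synxflow_sim input.json
vtune -report summary -result-dir r_offload
```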
- Implementing Inter-Node Communication using GPU-enabled MPI:
The final challenge was enabling the inter-node communication that is critical for scaling the application across multiple GPUs. The initial attempt used the Intel® oneAPI Collective Communications Library (oneCCL), Intel’s equivalent of the NVIDIA* Collective Communications Library (NCCL). However, code produced by this approach was not backward-compatible with NVIDIA-based HPC systems. As an alternative, the team replaced the NCCL-based inter-GPU communication with GPU-direct-enabled MPI: OpenMPI on the CUDA platform and the Intel® MPI Library on the Intel platform.
This adjustment proved effective, allowing the application to scale successfully on both Baskerville and DAWN. During large-scale runs, the team observed minor discrepancies in results between the CUDA and SYCL platforms. Investigation revealed that the Intel compiler’s default settings use speed-optimised mathematical operations rather than the high-precision model the team preferred. The discrepancies were resolved by adjusting the compiler settings to prioritise math precision using ‘-fp-model=precise’, ensuring consistent results across both platforms.
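As a sketch, the precision fix amounts to one extra flag on the SYCL compile line; the source and output names here are illustrative, not SynxFlow’s actual build.

```shell
# Compile the SYCL code with value-safe floating-point semantics so that
# results match the CUDA build; '-fp-model=precise' disables the
# speed-oriented math optimisations that caused the discrepancies.
icpx -fsycl -fp-model=precise -O2 -o synxflow_sim main.cpp
```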
Configurations/Solution Ingredients:
For porting code from CUDA to SYCL:
• Windows 10 Enterprise
• Microsoft Visual Studio 2019
• Intel® DPC++ Compatibility Tool as part of oneAPI 2024.0.1
For testing on HPC hardware:
• On the NVIDIA machine (Baskerville HPC): 16 or 32 NVIDIA A100 GPUs, Intel® Xeon® Platinum 8360Y CPUs (2 CPUs per GPU during the job), CUDA Toolkit v11.1, OpenMPI 4.1.1, Intel oneAPI Base Toolkit v2024.0.1, Codeplay’s oneAPI for NVIDIA* GPUs plugin, Red Hat Enterprise Linux 8.10
• On the Intel machine (DAWN HPC): 16, 32, or 64 Intel Data Center GPU Max 1550 GPUs, Intel® Xeon® Platinum 8468 CPUs (2 CPUs per GPU during the job), Intel oneAPI Base Toolkit v2024.2.0, Rocky Linux 8
What’s Next?
Get started with the Intel DPC++ Compatibility Tool and its open-source counterpart SYCLomatic to easily achieve automated, efficient code portability from CUDA to SYCL for accelerated heterogeneous computing across hardware from diverse vendors.
We encourage you to check out practical application examples of code migration available in the CUDA to SYCL catalogue. Also explore AI and HPC tools in Intel’s oneAPI-powered software portfolio.
Get the Software
Download the standalone version of Intel DPC++ Compatibility Tool. The migration tool is available as a part of the Intel oneAPI Base Toolkit.
Resources About The Research
[1] https://www.hpc.cam.ac.uk/d-w-n
[2] https://synxflow.readthedocs.io/en/latest/about.html
[3] https://catalogue.ceh.ac.uk/documents/2269c155-2f7d-4c4e-830a-851b2c2dc1fd