Introduction
The message passing interface (MPI) is a programming model for running a multi-process program in a distributed computing environment. With the introduction of the Intel® oneAPI DPC++/C++ Compiler, developers can write a single source code that can be run on a wide variety of platforms including CPU, GPU, and FPGA. By combining MPI and the SYCL* language, developers can take advantage of scaling across diverse platforms while running the application in a distributed computing environment. This article shows developers an example of this combination, how to compile the MPI application with the DPC++ compiler, and how to run it on a Linux* operating system.
Integrating MPI and DPC++
The code sample gives an example of combining MPI code and DPC++ code. The application is an MPI program that computes the number Pi (π) by dividing the work equally among all the MPI processes (or ranks). The number Pi can be computed by applying its integral representation:
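π = ∫₀¹ 4 / (1 + x²) dx

The code sample approximates this integral numerically by dividing the interval [0, 1] into a fixed number of equal steps and summing the integrand over them, with each MPI rank handling its own share of the steps.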
Each MPI rank computes a partial result of the number Pi according to the above formula. At the end of the computation, the MPI primary rank adds up the partial results from all the other ranks and prints the total. The source code is part of the Intel oneAPI Toolkit samples and can be downloaded from GitHub* (released under the MIT License). The code sample illustrates several different approaches, including MPI, to calculate the number Pi; this article focuses on the MPI implementation, the mpi_native function.
In the main function, DPC++ functionality starts when a device queue is created. The host code uses the queue myQueue to submit device code to a device for execution. This example uses a default selector so that the SYCL* runtime selects the best available device on the system. The SYCL runtime tracks and initiates the work.
Next, the program initializes MPI with MPI_Init. MPI_Comm_size reports the number of processes created. Each MPI process then obtains its process number and the name of the processor by calling MPI_Comm_rank and MPI_Get_processor_name, respectively.
To compute the partial result of the number Pi, each MPI rank calls the mpi_native function as shown in the following main function:
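The complete source is available in the oneAPI dpc_reduce sample on GitHub; the block below is a condensed, illustrative sketch of the main function rather than the verbatim sample code. The step count, header names, and the per-rank summation are assumptions noted in the comments.

```
#include <CL/sycl.hpp>  // newer oneAPI releases use <sycl/sycl.hpp>
#include <mpi.h>
#include <cstdio>
#include <vector>

constexpr long total_num_steps = 1000000;  // illustrative step count

// Defined later in the article; each rank fills "results" with its share
// of the integrand values.
void mpi_native(float* results, int rank_num, int num_procs,
                long total_num_steps, sycl::queue& q);

int main(int argc, char* argv[]) {
  // Create a device queue with the default selector; the SYCL runtime picks
  // the best available device. (Newer compilers use sycl::default_selector_v.)
  sycl::queue myQueue{sycl::default_selector{}};

  // Initialize MPI and query the number of ranks, this rank's number,
  // and the processor name.
  MPI_Init(&argc, &argv);
  int num_procs = 0, rank_num = 0, name_len = 0;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank_num);
  MPI_Get_processor_name(processor_name, &name_len);
  std::printf("Rank %d of %d runs on %s\n", rank_num, num_procs, processor_name);

  // Each rank computes its share of the steps on the device
  // (assumes total_num_steps divides evenly by num_procs).
  long steps_per_rank = total_num_steps / num_procs;
  std::vector<float> results(steps_per_rank, 0.0f);
  mpi_native(results.data(), rank_num, num_procs, total_num_steps, myQueue);

  // Sum this rank's partial results on the host.
  float local_sum = 0.0f;
  for (float r : results) local_sum += r;

  // The primary rank (rank 0) adds up the partial sums from all ranks.
  float total_sum = 0.0f;
  MPI_Reduce(&local_sum, &total_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank_num == 0)
    std::printf("pi is approximately %f\n", total_sum / total_num_steps);

  MPI_Finalize();
  return 0;
}
```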
The following mpi_native function is written in DPC++. This function takes five arguments:
- results (points to the array of results of each compute unit)
- rank_num (the MPI rank number)
- num_procs (the number of MPI processes)
- total_num_steps (the number of points)
- q (the queue)
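As above, this is an illustrative sketch rather than the verbatim sample code; it follows the structure described in the next paragraphs (buffer, command group, accessor, parallel_for), and the kernel name and exception handling are placeholders.

```
#include <CL/sycl.hpp>  // <sycl/sycl.hpp> on newer compiler releases
#include <cstdio>

// Sketch of mpi_native: each rank fills "results" with its share of the
// integrand values 4 / (1 + x^2); the host sums them afterwards.
void mpi_native(float* results, int rank_num, int num_procs,
                long total_num_steps, sycl::queue& q) {
  float dx = 1.0f / static_cast<float>(total_num_steps);
  // Steps handled by this rank (assumes an even division among the ranks).
  long num_items = total_num_steps / num_procs;

  try {
    // The buffer results_buf connects the host array "results" to the device.
    sycl::buffer<float, 1> results_buf(results, sycl::range<1>(num_items));

    // Submit a command group to the queue q; the handler h carries all the
    // requirements the kernel needs in order to execute.
    q.submit([&](sycl::handler& h) {
      // The accessor lets the kernel write into results_buf.
      auto results_accessor =
          results_buf.get_access<sycl::access::mode::write>(h);

      // Launch num_items instances in parallel; each instance evaluates the
      // integrand at the midpoint of one step owned by this rank.
      h.parallel_for<class CalculatePiSlice>(
          sycl::range<1>(num_items), [=](sycl::id<1> i) {
            float x =
                (static_cast<float>(rank_num * num_items + i[0]) + 0.5f) * dx;
            results_accessor[i] = 4.0f / (1.0f + x * x);
          });
    });
    // When results_buf goes out of scope, its contents are copied back to
    // the results array on the host.
  } catch (const sycl::exception& e) {
    std::printf("SYCL exception caught: %s\n", e.what());
  }
}
```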
The DPC++ host code, executed by each MPI rank, launches the SYCL application: it orchestrates data movement and compute offload to the devices.
In the try block, a buffer object is created to share data between the host and the device. Buffers provide an abstract view of memory. The buffer results_buf is created to store the data of the results array.
The command group is then submitted to the queue q that was created earlier. The command group takes a handler h that contains all the requirements for the kernel to execute.
Because the host and device cannot access the buffer directly, the results_accessor accessor is created to allow the device to write to the results_buf buffer.
The parallel_for function expresses a basic data-parallel kernel: it creates a number of instances that execute in parallel on the device. The function takes two arguments: num_items, which specifies the number of work-items (steps) to launch, and the kernel function to be executed at each index. Each instance computes a single value of the result and writes it to the results_buf buffer. The SYCL runtime then copies the results back to the results array on the host. This completes the computation of a partial result of the number Pi for each MPI rank.
Finally, the primary MPI rank sums all partial results from all the MPI ranks in the main() function.
Compile and Run MPI/DPC++ Program in Linux
This section describes how to compile and run the program on Linux. For the purpose of this test, two Intel® systems are used. Each system has an Intel® Core™ i7 processor with Intel® Iris® Pro Graphics 580 and runs Ubuntu* 18.04. Note that Intel Iris Pro Graphics 580 is a version of Intel® Processor Graphics Gen9, which is supported by Intel® oneAPI Toolkits. The host names of these systems are host1 and host2, with IP addresses 10.54.72.150 and 10.23.3.154, respectively.
The following steps explain how to compile and run an MPI program written with DPC++:
- Install the Intel® oneAPI Base Toolkit and the Intel® oneAPI HPC Toolkit, which together include the Intel® C++ Compiler and the Intel® MPI Library. For the purpose of this article, the Gold version is used for testing; any version later than the Gold version should also work with this code sample.
In this example, Intel oneAPI is installed on both machines in the default path /opt/intel/.
- Disable the firewall on the machine where the MPI program is started.
- Set up the password-less SSH login on these two machines.
- Set up the oneAPI environment variables.
To generate an executable from the code sample, you need to source the oneAPI setup script on the host where you run the program.
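Assuming the default installation path mentioned above, sourcing the setup script looks like this (the exact script location may differ between oneAPI releases):

```
source /opt/intel/oneapi/setvars.sh
```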
- Compile the MPI program.
After you set up the environment variables, use the mpiicpc script to compile and link MPI programs written in C++. The script provides the options and special libraries needed for MPI programs.
The mpiicpc command is the Intel® MPI Library compiler driver for the Intel® C++ Compiler and resides in the Intel oneAPI HPC Toolkit. The -show option displays how the underlying Intel® C++ Compiler is invoked, along with the required compiler flags and options, without actually compiling anything. Invoking the mpiicpc script with -show for a C++ program prints the command line in which the Intel® C++ Compiler icpc is used to compile and link the program.
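For example (the file names here are illustrative; the exact command line printed depends on your installation):

```
# Prints the underlying icpc command line, including the compiler flags
# and MPI include/library options, without compiling anything.
mpiicpc -show main.cpp -o dpc_reduce
```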
On the other hand, the Intel® oneAPI DPC++ compiler dpcpp must be used to compile DPC++ programs. DPC++ consists of C++ with SYCL and extensions. To compile and link MPI programs written in DPC++, you need to replace icpc with dpcpp in the command line printed above. For this particular program, there are other functions that use Intel® oneAPI Threading Building Blocks (oneTBB), so you also need to link with the oneTBB library (-ltbb). Alternately, you can compile the MPI program by instructing the mpiicpc wrapper to use dpcpp as the underlying compiler. Both approaches are sketched below for an MPI program written in SYCL called main.cpp; the -o flag specifies the name of the executable file, dpc_reduce.
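The commands below are a sketch rather than a verbatim recipe: the MPI include and library options must be taken from the mpiicpc -show output on your system, and the second approach assumes the Intel MPI Library's I_MPI_CXX variable for selecting the underlying compiler.

```
# Approach 1: reuse the command line printed by "mpiicpc -show",
# replacing icpc with dpcpp and adding -ltbb for oneTBB.
dpcpp main.cpp -o dpc_reduce -ltbb <MPI include and library options from mpiicpc -show>

# Approach 2: let the mpiicpc wrapper invoke dpcpp as the underlying compiler.
export I_MPI_CXX=dpcpp
mpiicpc main.cpp -o dpc_reduce -ltbb
```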
- Transfer the executable file to the other machine.
As you run the compile command on one host (host1 in this case), you need to transfer the executable file to the other host (host2, whose IP address is 10.23.3.154).
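For example, with scp (the destination directory is an assumption; copy the file to whatever path you will run it from on host2):

```
# Copies the executable to the home directory on host2; adjust the
# destination path as needed.
scp ./dpc_reduce 10.23.3.154:~/
```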
- Run the MPI executable in a two-node cluster using the mpirun command.
Now, you can run the executable on both hosts. The option -n specifies the number of MPI ranks per node. The option -host specifies the host where the MPI ranks run. The colon ":" separates the two nodes (host1 and host2). The following command runs one MPI rank on the first host host1 and one MPI rank on the second host host2. Each MPI rank calculates a partial result of the number Pi by using Data Parallel C++. The SYCL runtime chooses the best available device to offload the kernel; in this case, it chooses the GPU available on both systems to execute the kernels in parallel. You can also run the executable with other MPI environment variables. For example, to print debugging information, you can set the I_MPI_DEBUG=<level> environment variable.
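For example (one rank per host; the debug level 5 is only an example value):

```
# One MPI rank on each host; each rank offloads its kernel to the selected device.
mpirun -n 1 -host host1 ./dpc_reduce : -n 1 -host host2 ./dpc_reduce

# The same run with MPI debug information printed.
export I_MPI_DEBUG=5
mpirun -n 1 -host host1 ./dpc_reduce : -n 1 -host host2 ./dpc_reduce
```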
Summary
DPC++ allows you to reuse code across hardware targets such as CPU, GPU, and FPGA. To take advantage of this feature, MPI programs can incorporate DPC++. This article shows developers how to compile and run MPI/DPC++ programs using the Intel® oneAPI DPC++/C++ Compiler.