In this sample walkthrough, we use a vector_add sample to demonstrate oneAPI concepts and functionality. The sample employs hardware acceleration to add two arrays of integers together. Throughout this walkthrough, you will learn about:
- SYCL headers
- Asynchronous exceptions from kernels
- Device selectors for different accelerators
- Buffers and accessors
- Queues
- parallel_for kernel
Download the vector_add source from GitHub.
SYCL Headers
Intel is currently using SYCL from the Khronos Group*, which includes language extensions developed through an open source community process. The Intel® oneAPI DPC++/C++ Compiler provides the sycl.hpp header file; the fpga_extensions.hpp header file adds FPGA support.
The following code snippet, taken from vector_add, demonstrates the different headers you need to support various accelerators.
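A sketch of the header section, consistent with the published vector_add sample (the FPGA and FPGA_EMULATOR macros are defined by the sample's build configuration; exact details may differ between sample versions):

```cpp
// sycl.hpp declares the core SYCL classes: queue, buffer, accessor, and so on.
#include <sycl/sycl.hpp>

// fpga_extensions.hpp is only needed when targeting FPGA hardware or the
// FPGA emulator; the sample guards it with build-defined macros.
#if FPGA || FPGA_EMULATOR
#include <sycl/ext/intel/fpga_extensions.hpp>
#endif
```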
Catch Asynchronous Exceptions from SYCL Kernels
SYCL kernels run asynchronously on accelerators, in stack frames separate from the host code, so errors raised inside a kernel cannot propagate up the host call stack. To catch these asynchronous exceptions, the SYCL queue class accepts an error handler function at construction.
The following code snippet, from vector_add, shows you how to create an exception handler.
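A sketch of such a handler, consistent with the vector_add sample (the choice to print a message and terminate is the sample's policy, not a SYCL requirement):

```cpp
#include <sycl/sycl.hpp>
#include <exception>
#include <iostream>

// The SYCL runtime invokes this handler with the list of exceptions raised
// by kernels that have already returned control to the host.
static auto exception_handler = [](sycl::exception_list e_list) {
  for (std::exception_ptr const &e : e_list) {
    try {
      std::rethrow_exception(e);
    } catch (std::exception const &e) {
      std::cerr << "Failure: " << e.what() << "\n";
      std::terminate();  // abort on any asynchronous failure
    }
  }
};
```

The handler is passed to the queue constructor, so any kernel error surfaces through this single code path instead of being silently lost.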
Using a Default Selector for Accelerators
You can select an accelerator to offload kernels to in a straightforward manner. SYCL and oneAPI offer selectors that discover and provide access to the hardware available in your environment.
The default_selector_v selector enumerates all available accelerators and selects the most performant one. SYCL also provides additional selectors for the FPGA accelerator, including fpga_selector_v and fpga_emulator_selector_v, which are declared in fpga_extensions.hpp.
The following code snippet, taken from vector_add, demonstrates how to include FPGA selectors.
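A sketch of the selector choice, consistent with the vector_add sample (FPGA and FPGA_EMULATOR are the sample's build-defined macros):

```cpp
#if FPGA_EMULATOR
  // The FPGA emulator runs kernels on the CPU for fast functional testing.
  auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#elif FPGA
  // Targets actual FPGA hardware.
  auto selector = sycl::ext::intel::fpga_selector_v;
#else
  // Picks the most performant available device (CPU, GPU, ...).
  auto selector = sycl::default_selector_v;
#endif
```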
Data, Buffers, and Accessors
SYCL processes large pieces of data or computation using kernels that run on accelerators. The host declares data, which the SYCL runtime wraps in buffers and implicitly transfers to the accelerators. Accelerators read from or write to a buffer through an accessor. The runtime also determines kernel dependencies from the accessors used, then dispatches and runs the kernels in the most efficient order. Remember the following:
- a_vector, b_vector, and sum_parallel are array objects from the host.
- a_buf, b_buf, and sum_buf serve as buffer wrappers.
- a and b are read-only accessors, while sum is a write-only accessor.
The following code snippet, taken from vector_add, demonstrates how to use buffers and accessors.
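A sketch of the buffer and accessor declarations, consistent with the vector_add sample and the names listed above (num_items is the range size assumed by the sample; details may differ between sample versions):

```cpp
// Wrap the host arrays in buffers; the runtime manages all data transfers.
sycl::buffer a_buf(a_vector);
sycl::buffer b_buf(b_vector);
sycl::buffer sum_buf(sum_parallel.data(), num_items);

q.submit([&](sycl::handler &h) {
  // Accessors declare how the kernel uses each buffer; the runtime derives
  // kernel dependencies from these declarations.
  sycl::accessor a(a_buf, h, sycl::read_only);
  sycl::accessor b(b_buf, h, sycl::read_only);
  // no_init tells the runtime the old contents of sum_buf need not be
  // copied to the device before the kernel writes it.
  sycl::accessor sum(sum_buf, h, sycl::write_only, sycl::no_init);
  // ... kernel launch goes here ...
});
```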
Queue and parallel_for Kernels
A SYCL queue encapsulates all the context and state necessary for kernel execution. When constructed with no arguments, a queue is associated with an accelerator chosen by the default selector. A queue can also accept a specific device selector and an asynchronous exception handler, as used in vector_add.
You enqueue kernels to the queue for execution. Kernels come in several forms: single-task kernels, basic data-parallel kernels, hierarchical parallel kernels, and so on. vector_add uses the basic data-parallel parallel_for kernel, as shown in the following snippets.
The kernel body, captured in a lambda function, adds the two arrays element by element.
The first parameter of h.parallel_for, num_items, specifies the range of data the kernel processes: here, a 1-D range of size num_items. The runtime transfers the two read-only buffers, a_buf and b_buf, to the accelerator. After the kernel completes, the result held in sum_buf is copied back to the host when sum_buf goes out of scope.
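A sketch of the queue construction and kernel launch, consistent with the vector_add sample (selector stands for whichever device selector was chosen, and exception_handler is the asynchronous handler described earlier; exact code may differ between sample versions):

```cpp
// Associate the queue with a device and the asynchronous exception handler.
sycl::queue q(selector, exception_handler);

q.submit([&](sycl::handler &h) {
  sycl::accessor a(a_buf, h, sycl::read_only);
  sycl::accessor b(b_buf, h, sycl::read_only);
  sycl::accessor sum(sum_buf, h, sycl::write_only, sycl::no_init);

  // Basic data-parallel kernel: a 1-D range of num_items work-items,
  // each adding one pair of elements. The lambda captures by value.
  h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
});
// When sum_buf goes out of scope, its destructor waits for the kernel and
// copies the result back to the host array.
```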
Summary
Device selectors, buffers, accessors, queues, and kernels are the building blocks of oneAPI programming. SYCL and its community extensions simplify data-parallel programming. SYCL enables code reuse across hardware targets, delivering high productivity and performance across CPU, GPU, and FPGA architectures while permitting accelerator-specific tuning.