As heterogeneous platforms become increasingly common in the technology and software industry, they present a new challenge: developing software in a way that is both vendor- and platform-agnostic while still reaping performance benefits. To this end, the Unified Acceleration Foundation (UXL Foundation*), under the umbrella of The Linux Foundation*, is driving an open-standard software ecosystem for programming accelerators that includes compilers and performance libraries. This article discusses extending the UXL Foundation software ecosystem to Python*, bringing portability across configurations of heterogeneous platforms and vendor independence, while allowing users to compute on accelerators from different vendors in the same Python session. Additionally, the Python ecosystem from the UXL Foundation facilitates the creation of new portable data-parallel built-in Python extensions using the Intel® oneAPI DPC++ Compiler together with standard Python build tooling, such as scikit-build or Meson.
Data Parallel Extensions for Python center around one such portable extension, dpctl.tensor. The extension implements an array object, dpctl.tensor.usm_ndarray, based on Unified Shared Memory (USM) allocations, and a library of functions to manipulate array objects. In keeping with the goal of portability, the dpctl.tensor library was designed to conform to the Python array API standard (revision 2023.12), a standard for tensor frameworks that is being actively adopted by NumFOCUS*-sponsored projects, including NumPy* (as of version 2.0), CuPy*, and Dask*, as well as other community projects like JAX, PyTorch*, and TensorFlow*. Python packages that traditionally relied on NumPy to provide the array object and the library to manipulate it, such as scikit-learn*, SciPy, and others, continue to expand their support for array objects from array API-conforming libraries, which opens the door for dpctl.tensor.usm_ndarray to work out of the box with these packages.
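A minimal sketch of working with the array object and its array API namespace (outputs and the selected device depend on the available hardware):

```python
import dpctl.tensor as dpt

# Create a USM-based array; the allocation is placed on the default-selected device
x = dpt.ones((3, 4), dtype="float32")
print(type(x))        # dpctl.tensor.usm_ndarray
print(x.device)       # device the array is allocated on
print(x.usm_type)     # kind of USM allocation, e.g., 'device'

# Array API entry point: retrieve the namespace implementing the standard
xp = x.__array_namespace__()
y = xp.sum(x, axis=0)
```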
The data-parallel control Python package (dpctl), in addition to providing a SYCL*-based reference implementation of an array API library, also provides Python bindings to DPC++ runtime entities to facilitate platform enumeration, device selection, USM allocation, and execution placement from Python.
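For instance, a short sketch of these bindings (the "gpu" filter string is illustrative and assumes a GPU is present; substitute "cpu" otherwise):

```python
import dpctl
import dpctl.memory as dpmem

# Device selection and execution placement: pick a device and create a queue for it
dev = dpctl.SyclDevice("gpu")
q = dpctl.SyclQueue(dev)

# USM allocation bound to the queue's context
buf = dpmem.MemoryUSMDevice(1024, queue=q)   # 1024 bytes of device USM
print(dev.name, buf.nbytes)
```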
dpctl provides integration with popular Python extension generators, such as Cython and pybind11, permitting data-parallel built-in extensions to work with the Python object types it provides and map them to underlying C++ classes. For example, with pybind11, dpctl.SyclQueue is bidirectionally mapped to sycl::queue, and dpctl.tensor.usm_ndarray is mapped to the dpctl::tensor::usm_ndarray C++ class implemented by dpctl.
The dpctl.tensor package is implemented in pure SYCL and is built using the Intel oneAPI DPC++ Compiler. The package supports builds for multiple SYCL targets (using the oneAPI for CUDA* and oneAPI for AMD* solutions from Codeplay*), enabling users to offload computations to accelerators from different vendors in the same Python environment.
Intel GPUs and CPUs require SPIR-V* offload sections, NVIDIA* GPUs require NVPTX64 offload sections, and AMD GPUs require AMDGCN offload sections. By default, DPC++ only generates SPIR-V sections. See our presentation at SciPy 2024 for more details. Provided the project was compiled to generate offload sections appropriate for the driver stack servicing the accelerator of interest, the package enables Python users to target any device that DPC++ can target.
To build dpctl to target CUDA devices, follow the instructions provided in dpctl’s documentation.
Any device recognized by the DPC++ runtime for Intel oneAPI is also recognized by dpctl:
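For example, the platforms and devices visible to the runtime can be listed from Python (output depends on the installed drivers):

```python
import dpctl

# Print platforms and devices visible to the DPC++ runtime
dpctl.lsplatform(verbosity=2)

# Or inspect the device list programmatically
for d in dpctl.get_devices():
    print(d.backend, d.device_type, d.name)
```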
By default, dpctl.tensor targets the device selected by SYCL’s default selector. Selecting a particular device can be done using the filter selector from the SYCL extension for Intel oneAPI. The filter selector string is a triple backend:device_type:ordinal_id, where any element of the triple may be omitted provided at least one is specified. The following are examples of creating arrays populated with values of an arithmetic sequence on different devices.
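A sketch of such calls (the filter strings are illustrative; which ones work depends on the installed drivers and on the offload sections dpctl was built with):

```python
import dpctl.tensor as dpt

# Default-selected device
x = dpt.arange(0, 100, 2)

# First Intel GPU exposed by the Level Zero backend
y_gpu = dpt.arange(0, 100, 2, device="level_zero:gpu:0")

# Any CPU device exposed by the OpenCL backend
y_cpu = dpt.arange(0, 100, 2, device="opencl:cpu")

# NVIDIA GPU, provided dpctl was built with CUDA offload sections
y_cuda = dpt.arange(0, 100, 2, device="cuda:gpu")

print(x.device, y_gpu.device, y_cpu.device, y_cuda.device)
```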
Upon creation, an array is implicitly assigned a SYCL queue from a device-queue cache. This queue is used to submit tasks that manipulate array elements, such as item assignment (x[:] = 0). Users may also provide such a queue explicitly through the sycl_queue keyword argument, which is supported by every array creation function.
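For example, a queue created explicitly can be reused across several allocations (the "gpu" filter string is illustrative):

```python
import dpctl
import dpctl.tensor as dpt

# Create a queue explicitly and reuse it for several allocations
q = dpctl.SyclQueue("gpu")
a = dpt.empty((1024, 1024), dtype="float32", sycl_queue=q)
b = dpt.zeros(1024, dtype="float32", sycl_queue=q)

a[:] = 0                                  # item assignment is submitted to the array's queue
print(a.sycl_queue, a.device, a.usm_type)
```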
Compute-follows-data is the paradigm adopted by dpctl.tensor: users specify device placement only in array creation functions. The output of every other function is allocated on the same device and associated with the same queue as the input arrays, and all input arrays are expected to be associated with the same queue.
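A brief sketch of the paradigm (the "gpu" device string is illustrative):

```python
import dpctl.tensor as dpt

# Device placement is specified only at array creation time
x = dpt.linspace(0, 1, num=10**6, device="gpu")

# The output of a computation inherits x's device and queue
y = dpt.sin(x) + 1
print(x.device, y.device)   # same device
```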
If the need arises to combine arrays residing on different devices, the user is expected to perform data migration explicitly using the to_device(target_dev) method of usm_ndarray or the dpctl.tensor.asarray constructor.
Attempting to compute (y_cpu * y_cuda) results in an ExecutionPlacementError. The user resolves it by explicitly migrating data:
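A sketch of the failure and its resolution (the filter strings are illustrative; the exception lives in dpctl.utils):

```python
import dpctl.tensor as dpt
import dpctl.utils

# Arrays allocated on two different devices
y_cpu = dpt.arange(0, 100, 2, device="opencl:cpu")
y_cuda = dpt.arange(0, 100, 2, device="cuda:gpu")

try:
    z = y_cpu * y_cuda                        # inputs are bound to different queues
except dpctl.utils.ExecutionPlacementError:
    # Migrate one operand explicitly, then compute on a single device
    z = y_cpu.to_device(y_cuda.device) * y_cuda
    # Equivalently: z = dpt.asarray(y_cpu, device=y_cuda.device) * y_cuda
```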
This ambiguity may only arise if nondefault device placement is used when creating any of the input arrays. By default, all arrays are created on the same default-selected device and are associated with the same queue:
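For example:

```python
import dpctl.tensor as dpt

# No device keyword: both arrays live on the default-selected device
x = dpt.arange(10**6, dtype="float32")
y = dpt.ones(10**6, dtype="float32")

z = x + y                        # no placement ambiguity; a single queue is used
print(x.device, y.device, z.device)
```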
Since dpctl.tensor conforms to the Python array API specification, community packages supporting the array API can work with dpctl.tensor.usm_ndarray objects. To illustrate, we compute a fast Fourier transform (FFT) on a usm_ndarray using SciPy's FFT module (scipy.fft), which contains experimental support for the array API standard:
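A sketch of what such a call looks like, assuming SciPy's experimental array API dispatch is enabled through the SCIPY_ARRAY_API environment variable before SciPy is imported:

```python
import os
os.environ["SCIPY_ARRAY_API"] = "1"   # opt in to SciPy's experimental array API support

import dpctl.tensor as dpt
import scipy.fft

# Build a test signal as a usm_ndarray on the default-selected device
x = dpt.linspace(0, 2 * dpt.pi, num=1024, dtype="float32")
sig = dpt.sin(8 * x)

# Compute the FFT of the usm_ndarray through scipy.fft
spectrum = scipy.fft.fft(sig)
print(type(spectrum))
```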
dpctl.tensor can accelerate computations on integrated and discrete GPUs. For example, a k-nearest neighbors (k-NN) search benchmark run on an Intel® Arc™ GPU on Windows* and on an Intel® Data Center GPU Max 1100 discrete GPU on Linux* showed 2.5x and 31x speed-ups, respectively, over vanilla NumPy installed from conda-forge.
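The benchmark code itself is not reproduced here, but the core brute-force k-NN computation can be sketched entirely with dpctl.tensor; knn_indices below is a hypothetical helper, not the benchmarked implementation:

```python
import dpctl.tensor as dpt

def knn_indices(queries, points, k):
    """Return indices of the k nearest points for every query (brute force)."""
    # Pairwise squared Euclidean distances via broadcasting: (n_queries, n_points)
    diff = dpt.expand_dims(queries, axis=1) - dpt.expand_dims(points, axis=0)
    dist2 = dpt.sum(diff * diff, axis=-1)
    # Indices of the k smallest distances in each row
    order = dpt.argsort(dist2, axis=-1)
    return order[:, :k]

# Arrays are created on the default-selected device (e.g., a GPU if available)
points = dpt.reshape(dpt.arange(30_000, dtype="float32"), (10_000, 3))
queries = points[:16]
print(knn_indices(queries, points, k=5).shape)   # (16, 5)
```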