Speed Up Multiarchitecture Device Offload with SYCL* Default Context

Now also with the Intel® oneAPI DPC++/C++ Compiler for Windows*



July 11, 2024

Reusable SYCL Default Context

Some time ago, the default context concept in the SYCL standard was extended so that the context of a SYCL device queue can be preserved between accelerator offloads, with awareness of the full multiarchitecture platform and all of its compute devices. The term default context now refers to a context that remains available to device queue constructors throughout a program's execution flow and scope.

This makes the default context reusable. If no explicit context argument is given during queue creation, any queue constructor will simply reuse the existing default context. As a result, applications that frequently create and destroy device queues for offload run faster because they don’t have to pay the cost of having a new context instantiated for them each time.

If you create a queue without a context argument,

queue q;

queue q2(someDevice);

you can simply get and use its context as follows:

context ctx = q.get_context();

With the Intel® oneAPI DPC++/C++ Compiler version 2024.2, this is now the compiler's standard behavior not only on Linux* build hosts but also on Windows*.

Note: Setting the environment variable SYCL_ENABLE_DEFAULT_CONTEXT=1 is no longer required to use this feature on either Windows or Linux.

With this evolution of the default SYCL context definition, the developer can programmatically capture the available devices and device queues as part of the execution flow, helping to automate the offload dispatch from already cached context configurations when initializing a SYCL queue.

This feature allows direct use of a cached SYCL runtime configuration when initializing a SYCL queue, without the need to create a new SYCL context. The context will also retain cached kernels associated with the initialized SYCL queues under the same platform.

The SYCL default context feature is enabled for both Linux and Windows in the Intel® oneAPI DPC++/C++ Compiler 2024.2 release.

Faster Execution with SYCL Default Context

In addition to simplifying coding and queue management for the SYCL software developer, default context use can also directly impact application performance. The performance benefit of this default context or persistent device queue instantiation can be substantial.

One such example is a SYCL offload code sequence that involves a loop. The SYCL context associated with the queue (q_) is typically re-created in each iteration because q_ only lives as long as its enclosing C++ scope, the loop body. This results in increased latency, including repeated Just-In-Time (JIT) compilation of the offloaded kernel (HashTableKernel in the example below).

With the SYCL default context feature, each initialization of the SYCL queue will be able to directly use the cached SYCL context and associate with the previously cached kernel and program, leading to better performance.
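
To make the caching effect concrete, here is a minimal sketch in standard C++ that models it without using SYCL at all. ToyContext, ToyQueue, and count_builds are hypothetical names invented for this illustration, not part of the SYCL API or the DPC++ runtime, and the "build" is only a counter standing in for JIT compilation. The real runtime is more involved; the sketch assumes nothing beyond what is described above, namely that built kernels are cached in the context.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a context: it owns a cache of "built" kernels.
struct ToyContext {
    std::map<std::string, int> kernel_cache;  // kernel name -> fake binary id
    int builds = 0;                           // number of simulated JIT builds

    // Return the cached kernel, building it only on first use in this context.
    int get_or_build(const std::string& kernel_name) {
        auto it = kernel_cache.find(kernel_name);
        if (it != kernel_cache.end()) return it->second;
        ++builds;  // stands in for an expensive JIT compilation
        int id = static_cast<int>(kernel_cache.size()) + 1;
        kernel_cache.emplace(kernel_name, id);
        return id;
    }
};

// Hypothetical stand-in for a short-lived queue that refers to a context.
struct ToyQueue {
    std::shared_ptr<ToyContext> ctx;
    void submit(const std::string& kernel_name) { ctx->get_or_build(kernel_name); }
};

// Count simulated builds over `iterations` loop trips.
// reuse_default_context == false: every queue gets a brand-new context.
// reuse_default_context == true:  every queue shares one long-lived context.
int count_builds(int iterations, bool reuse_default_context) {
    auto default_ctx = std::make_shared<ToyContext>();
    int total_builds = 0;
    for (int i = 0; i < iterations; ++i) {
        auto ctx = reuse_default_context ? default_ctx
                                         : std::make_shared<ToyContext>();
        ToyQueue q{ctx};  // the queue itself is destroyed at the end of each trip
        int before = ctx->builds;
        q.submit("HashTableKernel");
        total_builds += ctx->builds - before;
    }
    return total_builds;
}
```

Under these assumptions, count_builds(5, false) performs five simulated builds, one per iteration, while count_builds(5, true) performs a single build that every later queue reuses.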

Let us, for instance, look at the following piece of C++ code:

for (int i = 0; i < 5; i++) {
    // A new queue is constructed, and later destroyed, in every iteration.
    sycl::queue q_;
    auto start = std::chrono::steady_clock::now();
    q_.submit([&](sycl::handler& cgh) {
        cgh.parallel_for<class HashTableKernel>(sycl::range<1>(p.size()), [=](sycl::id<1> idx) {
            …
        });
    });
    auto end = std::chrono::steady_clock::now();
    auto tt = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Loop = " << i + 1 << "\n";
    std::cout << " build hash table time=" << tt.count() / 1000 << "ms" << std::endl;
}

In this example, without the SYCL default context, the time it takes to build the hash table remains relatively high (around 120-140 ms) in every iteration after the first.

With the SYCL default context, however, the build time drops to around 1-3 ms in every iteration after the first, because only the first iteration pays for context creation and JIT compilation.

Test with default q_ (build hash table time per loop iteration):

Loop    Without enabling SYCL Default Context    With SYCL Default Context enabled
1       141 ms                                   151 ms
2       131 ms                                   3 ms
3       127 ms                                   2 ms
4       122 ms                                   1 ms
5       127 ms                                   1 ms
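
As a quick check on the figures above, the following standard C++ sketch totals the reported timings and averages iterations 2 through 5 for each case. The function and constant names are assumptions made for this illustration. The numbers are copied from the output listed above, which is reported in whole milliseconds, so the resulting ratios are only approximate.

```cpp
#include <array>
#include <numeric>

// Reported build times in milliseconds for loops 1..5, copied from the output above.
constexpr std::array<double, 5> kWithoutDefaultContext{141.0, 131.0, 127.0, 122.0, 127.0};
constexpr std::array<double, 5> kWithDefaultContext{151.0, 3.0, 2.0, 1.0, 1.0};

// Sum of all five iterations.
double total_ms(const std::array<double, 5>& t) {
    return std::accumulate(t.begin(), t.end(), 0.0);
}

// Mean of iterations 2..5, i.e. excluding the first (warm-up) iteration.
double steady_state_mean_ms(const std::array<double, 5>& t) {
    return std::accumulate(t.begin() + 1, t.end(), 0.0) /
           static_cast<double>(t.size() - 1);
}

// Ratio of two durations, e.g. "without" divided by "with".
double speedup(double baseline_ms, double improved_ms) {
    return baseline_ms / improved_ms;
}
```

With these inputs, the five iterations total 648 ms without the default context versus 158 ms with it (about 4.1x), and the average over iterations 2 through 5 drops from 126.75 ms to 1.75 ms (roughly 72x).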

This kind of saving can easily compound across larger workloads that create device queues repeatedly, resulting in noticeable overall application performance gains.

Use SYCL Default Context in Your Next Windows-Based Project

If you are new to SYCL, or you are about to update or write a new multiarchitecture accelerated compute program on Windows, make sure to do so with the latest fully SYCL 2020-conformant Intel® oneAPI DPC++/C++ Compiler for Windows 2024.2.

Join the SYCL open developer ecosystem and the movement towards flexible, optimized, standards-based multiarchitecture offload acceleration with the Unified Acceleration (UXL) Foundation. Intel’s compiler innovations contribute directly to the compiler projects and language standards discussions in this open source ecosystem.

Download the Compiler Now 

You can download the Intel® oneAPI DPC++/C++ Compiler from Intel’s oneAPI Developer Tools product page.

This version is also included in the Intel® oneAPI Base Toolkit, which provides an advanced set of foundational tools and libraries along with analysis, debug, and code migration tools.

You may also want to check out our contributions to the LLVM compiler project on GitHub.

Additional Resources