Overview
This article demonstrates getting started with the OpenCL™ Tools developer components in the Intel® System Studio 2019 initial release and Update 1 release on Linux* OS. This article also applies to the Intel® SDK for OpenCL™ Applications 2019 as a standalone product. The walkthrough assumes developer targets of both Intel® CPU and Intel® Graphics Technology. Training sample source code is attached for download via the orange button at the upper right and at the article footer.
The article walks through:
- Part 1:
  - Prerequisites and Installation
  - Build
  - Program execution
- Part 2:
  - IDE integration
  - Disk Provisions
  - Developer Tips
- Part 3:
  - Explanation of training sample applications and execution
  - Offline Compilation
  - More resources
The specific platform used in this example is Ubuntu* OS 16.04.4 (stock Linux* OS kernel 4.13) on an Intel® Core™ i7-6770HQ processor (system model NUC6i7KYK). The Intel® Core™ i7-6770HQ processor has Intel® Iris™ Pro Graphics 580. More information on this platform and other platforms can be viewed at ark.intel.com. The article walkthrough is *not* exclusive to this specific hardware model.
New developers are highly encouraged to develop on a system with both the intended OpenCL™ implementations and the target hardware available. However, in the general case, target hardware is not strictly required for building OpenCL™ applications. Intel® CPU and Intel® Graphics Technology hardware is required to use all of the developer tool features and to compile device kernels for those targets.
Part 1
Prerequisites & Installation
- Review the Intel® System Studio 2019 release notes.
- Review the release notes for the OpenCL™ developer tools. Observe supported platforms.
- Download and install OpenCL™ runtimes:
- *2019 Initial Release* Download the Intel® CPU Runtime for OpenCL™ Applications
- *2019 Update 1* users have Intel® CPU Runtime for OpenCL™ Applications installed automatically with the SDK.
- Review the appropriate release notes. Observe supported platforms.
- This CPU runtime does not require any Intel® Graphics Technology hardware. It serves as a production OpenCL™ implementation useful for:
- Backing production applications using OpenCL™.
- Reference development for targeting other types of devices, such as Intel® Graphics Technology or Intel® FPGA.
- Example Prerequisites Setup:
- Example Runtime install:
- The CPU Runtime sets up symbolic links to the ICD loader library libOpenCL.so. Advanced users: if the development system prefers a different libOpenCL.so, you may wish to ensure the alternate libOpenCL.so is first in the search order for your dynamic linker.
- Download the Intel® Graphics Compute Runtime for OpenCL™ Driver "NEO"
- This runtime enables OpenCL™ kernels to target Intel® Iris™ Pro Graphics 580 on the Intel® Core™ i7-6770HQ processor in this example, or supported Intel® Graphics Technology available on other Intel® platforms.
- Review the release notes on the GitHub portal. The README, FAQ, and LIMITATIONS documents are particularly useful. Observe supported platforms.
- If necessary, see the OpenCL™ runtimes overview article
- Install the appropriate runtime package(s). This graphics runtime is available as prebuilt install packages for the distribution used in this example from the GitHub portal releases page.
- Install OpenCL™ Tools prerequisites:
- Make sure to see the install notes section of the release notes for the OpenCL™ tools. That section is the official source for validated package prerequisites.
- The Eclipse* IDE plugin has a Java Runtime Environment requirement. A Java 8 (1.8) Runtime Environment or higher is expected to satisfy the plugin prerequisite. Intel® System Studio 2019: OpenCL™ Tools and the Eclipse* IDE should deploy the Java Runtime Environment automatically. In other cases, when a manual install is required, here is an example installation:
- libicu55 prerequisite install example:
- For mono use the latest guidance from https://www.mono-project.com. Here is an example used on the test system:
- dkms dependency example install:
- libwebkitgtk IDE rendering dependency:
- Add the user to the video group so that user programs are privileged to use the Intel® Graphics Compute Runtime for OpenCL™ Driver.
Install Intel® System Studio 2019 (or Intel® SDK for OpenCL™ Applications 2019 standalone)
- Click the Register & Download button from the Intel® System Studio 2019 portal downloads page.
- Select 'Linux*' for host OS. Select 'Linux* and Android*' for target OS.
- Click the Add button for the OpenCL™ Tools line item.
- The Eclipse* IDE is automatically added to the installer manifest when OpenCL™ Tools is added.
- Optional: Add the Intel® C++ Compiler and Intel® VTune™ Amplifier. These products have useful features for heterogeneous and general development. Intel® Threading Building Blocks is also added as a prerequisite. These other components are not required for this article.
- Click 'continue'
- Click 'download'
- Note any extraction instructions. Configured downloads may come with a .json file to be referenced by the installer.
- Extract the installation package. Execute the installer with sudo ./install.sh as sudoer user:
- Set proxy as necessary
- Install to 'this computer'
- Review and accept license terms
- Choose your preferred software improvement collection option. Consent greatly helps Intel® improve this product.
- (Optional) Select deployment folder
- (Optional) This package has been configured with the Intel® C++ Compiler and Intel® VTune™ Amplifier in addition to the OpenCL™ Tools (Intel® SDK for OpenCL™ Applications). The additional tools are very useful for OpenCL™ developers, but strictly speaking they are not required for the OpenCL™ Tools features to be used.
- Errata: If you install the Intel® System Studio 2019 initial release, ensure the OpenCL™ 2.1 Experimental runtime for Intel® CPU Device component is selected. This component is required for Eclipse* IDE plugin functionality. Starting with Update 1, the Experimental runtime is deprecated and removed; it is replaced by Intel® CPU Runtime for OpenCL™ Applications 18.1 and newer.
- The Wind River* and Android* screens are unrelated to this article walkthrough.
- After the components are installed, the 'Launch Intel(R) System Studio' checkbox can be deselected, as the next section of this walkthrough demonstrates building from the command line.
Build
Two example source sets are in the .tar.gz archive attached to this article. They are named GPUOpenCLProjectForLinux and CPUOpenCLProjectForLinux. These sources match the two implementations installed earlier.
These example build commands demonstrate the two main build requirements: inclusion of the OpenCL™ headers, and linking against libOpenCL.so (the ICD loader library). The first program executes its kernel on Intel® Graphics Technology. The second program executes its kernel on Intel® CPU, not Intel® Graphics Technology. Here are command line examples to build the host side applications with the Intel® System Studio 2019 redistributed headers and libraries:
The filesystem paths used in this example build reflect default Intel® System Studio 2019 initial release install locations. These may vary for the standalone SDK or custom tool installations.
Notice that -Wno-deprecated-declarations is used in the build. This allows the build to proceed while using deprecated function calls, so the program can run against OpenCL™ 1.2 API implementations in addition to the version 2.1 API implementations from Intel®. Developers may wish to maintain portability in a different manner than this example; the example source regions relevant to this toggle are discussed in the code walkthrough section of this article.
Compiling the device side OpenCL-C program offline is a helpful practice, although not the only option. In this article we use offline compilation only for developer feedback:
The ioc64 (or ioc32) offline compiler selects the OpenCL™ implementation to compile with via the -device=<device> toggle. The ‘gpu’ device switch sends the kernel source through the Intel® Graphics Compute Runtime for OpenCL™ Driver. The kernel is compiled for Intel® Graphics Technology.
The ‘cpu’ switch sends the kernel through the Intel® CPU Runtime for OpenCL™ Applications, compiling the kernel for Intel® x86_64. The output log file shows build feedback. Note that the contents of the cpu/TemplateCPU.cl and gfx/TemplateGFX.cl kernel sources are not the same.
In this article, both CPU and Graphics kernels will also ultimately compile and execute at runtime.
See /opt/intel/opencl/bin/ioc64 -help for more information about offline compiler capabilities. They are also pasted in the source walkthrough section of the article for convenience.
Note for 2019 Initial Release: For CPU runtime builds, ioc64 may show a segfault if both the Intel® CPU Runtime for OpenCL™ Applications and OpenCL™ 2.1 Experimental runtime for Intel® CPU Device are simultaneously referenced in ICD files. Developers are highly encouraged to target Intel® CPU Runtime for OpenCL™ Applications over the Experimental runtime. Programs can effectively ignore the Experimental Runtime by setting aside the .icd file associated with it. Consider moving /etc/OpenCL/vendors/intel_exp64.icd out of the vendors folder to a backup location as a workaround.
Run
Execute the test application targeting Intel® Graphics Technology and check the output:
Output:
Execute the test application on Intel® CPU and check the output:
Output:
Device selection error feedback is presented in the output here to show that, as expected, no CPU device is available underneath the 'Intel(R) OpenCL HD Graphics' platform.
Developer tips and a sample explanation are later on in the walkthrough. Newer developers in particular may gain insight into how OpenCL™ applications function.
Part 2
IDE facilities
The OpenCL™ tools IDE plugin allows a quick view of OpenCL™ capabilities for the target system. These capabilities help developers understand how to target OpenCL™ devices. Launch the Eclipse based developer environment from the launcher icon placed on the Ubuntu* desktop or from the iss_ide_eclipse-launcher.sh script in the Intel® System Studio 2019 install directory. After launching the Intel® System Studio 2019 developer environment, go to Tools -> Intel Code Builder for OpenCL API -> Platform Info...:
Three platforms, each exposing one device, are shown with drop-down dialogs on the example system here for completeness. They are 'Experimental Host Only OpenCL 2.1 Platform', 'Intel(R) CPU Runtime for OpenCL(TM) Applications', and 'Intel(R) OpenCL HD Graphics'.
In 2019 Update 1, the Experimental runtime is removed. It is replaced by the Intel® CPU Runtime for OpenCL™ Applications.
The first platform is the experimental platform. This platform has one device: the Experimental 2.1 OpenCL™ CPU device. This implementation is selected as an installed component through the Intel® System Studio 2019 Initial Release installer dialogs. This platform is for nonproduction purposes only. Developers should target Intel® CPU Runtime for OpenCL™ Applications version 18.1 or newer as a replacement for the Experimental runtime. The Experimental Runtime is required to run the IDE plugin in 2019 initial release and earlier.
The second platform is the Intel® CPU Runtime for OpenCL™ Applications. It exposes the CPU target device.
The third platform, Intel® OpenCL HD Graphics, exposes the Intel® Graphics Technology target device.
Platform and Device names exposed here are expected to match interrogated parameters in developer programs from clGetPlatformInfo(...) and clGetDeviceInfo(...) OpenCL™ API calls. These names are subject to change depending on the operating system, version of the Intel® implementation, and underlying hardware.
Note: Production OpenCL™ platforms interrogated from Intel® implementations are consistently expected to contain the string "Intel" when using CL_PLATFORM_NAME with the clGetPlatformInfo(...) OpenCL™ API call.
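As an illustration, here is a minimal sketch of that check (the helper name and buffer size are hypothetical, not taken from the attached sample):

```c
#include <CL/cl.h>
#include <string.h>

/* Returns nonzero when the platform's CL_PLATFORM_NAME contains the substring "Intel". */
static int is_intel_platform(cl_platform_id platform)
{
    char name[256] = {0};
    cl_int err = clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, NULL);
    return (err == CL_SUCCESS) && (strstr(name, "Intel") != NULL);
}
```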
Right-clicking on a device allows detailed device properties to be viewed.
For completeness, here are the device properties for the CPU device.
Here are the properties for the Intel® Graphics Technology Device.
These properties match the various property flags discernible with the clGetDeviceInfo(...) OpenCL™ API call. Some examples where properties are useful at runtime are: reading device OpenCL™ standard capabilities, sizing constraints for memory or OpenCL™ kernel work-items, and OpenCL™ device extensions.
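For example, a few of these properties can be queried directly with clGetDeviceInfo(...). The sketch below is a hypothetical helper, not part of the sample, and prints a handful of commonly consulted values:

```c
#include <CL/cl.h>
#include <stdio.h>

/* Print a few device properties commonly consulted when sizing kernel launches. */
static void print_device_caps(cl_device_id device)
{
    char name[256] = {0};
    char version[256] = {0};
    cl_uint compute_units = 0;
    size_t max_wg_size = 0;
    cl_ulong global_mem = 0;

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg_size), &max_wg_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);

    printf("%s (%s): %u compute units, max work-group size %zu, %llu bytes global memory\n",
           name, version, compute_units, max_wg_size, (unsigned long long)global_mem);
}
```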
The experimental runtime platform is shown here for completeness, but in the Intel® System Studio 2019 initial release it is highly recommended to target Intel® CPU Runtime for OpenCL™ Applications 18.1 rather than the Experimental Runtime. Using both the Experimental Runtime and Intel® CPU Runtime for OpenCL™ Applications in the same process is known to show issues leading to program segfaults. The Experimental Runtime was removed in 2019 Update 1.
Disk Provisions
Intel® System Studio 2019: OpenCL™ Tools come with standard headers and libraries from Khronos*. Key examples are the cl.h header and the OpenCL™ ICD interrogator loader library, libOpenCL.so. This header and library are needed to build and execute OpenCL™ programs. The libOpenCL.so library interrogates and routes OpenCL™ API function calls to a chosen OpenCL™ implementation. In our example programs, the routines create contexts and execute work on OpenCL™ devices. The main getting started guide diagrams the modular relationship of OpenCL™ implementations and the OpenCL™ ICD interrogator library.
Alternate versions of the OpenCL™ ICD interrogator library are available from third parties. One such example is the ocl-icd package available from various system package managers. The key to using any ICD library effectively is to ensure its OpenCL™ capability can support the desired OpenCL™ implementation capabilities. For example:
- Intel® Graphics Compute Runtime for OpenCL™ Driver offers an OpenCL™ 2.1 implementation for Intel® Iris™ Pro Graphics 580 on the Intel® Core™ i7-6770HQ Processor. The OpenCL™ ICD interrogator library included with Intel® System Studio 2019: OpenCL™ Tools is 2.1 capable.
- Current Intel® Atom™ Processors’ Graphics hardware may support only up to OpenCL™ version 1.2, but the OpenCL™ 2.1 ICD interrogator library should still be able to resolve the legacy specification’s features.
The OpenCL™ implementation installers put their ICD loader library references in /etc/OpenCL/vendors. At runtime the ICD loader library uses these ICD files to route OpenCL™ API calls through the intended OpenCL™ implementation. The contents of the /etc/OpenCL/vendors folder are useful for understanding the deployments available on a system. Developers may observe ICD reference text files from multiple OpenCL™ device vendors in this folder. Example:
In the above filesystem:
- intel64.icd maps to Intel® CPU Runtime for OpenCL™ Applications.
- intel.icd maps to Intel® Graphics Compute Runtime for OpenCL™ Driver.
- The intel_exp64.icd file is shown for completeness; it maps to the experimental 2.1 runtime included with the Intel® System Studio 2019 initial release. Again, programs are recommended to target either Intel® CPU Runtime for OpenCL™ Applications 18.1 or the Experimental Runtime, but not both. Targeting Intel® CPU Runtime for OpenCL™ Applications 18.1 or newer is strongly recommended where applicable. Using 2019 Update 1 or newer versions of the developer tools is highly recommended, as the Experimental Runtime is deprecated and removed.
Developer Tips
Intel® recommends keeping the two main bottlenecks in mind for heterogeneous development:
- Minimizing offload transfer.
- OpenCL™ devices may operate within different memory domains. Minimizing transfers in these cases is crucial for performance.
- OpenCL™ devices may share a memory domain with the host processor. The API typically exposes ways to avoid copying data altogether in these cases. Such a shared configuration is typical of current Intel® Graphics Technology capable processors.
- Target device topology. Devices differ in the number of compute units and other characteristics, so the best way to schedule work may change from device to device.
- For Intel® Graphics Technology Gen9 and newer, consider using the cl_intel_subgroups OpenCL™ standard extension. Examples are located at the compute-samples GitHub portal.
- For Intel® FPGA products consider usage of OpenCL™ pipes.
For information on using the Intel® System Studio 2019: OpenCL™ tools or the Intel® SDK for OpenCL™ Applications 2019 standalone, see the Developer Guide.
For more on getting the most out of OpenCL™ programming for the CPU-only implementation, see the OpenCL™ Developer Guide for Intel® Core™ and Intel® Xeon® processors.
For a video instructional guide on general programming considerations for Intel® Graphics Technology, check the video "What Intel® Processor Graphics GEN9 Unlocks in OpenCL*", located at techdecoded.intel.io. Searching techdecoded.intel.io for 'opencl' yields other related content.
For a comprehensive look at Intel® Graphics Technology hardware, see the compute architecture overview document for Gen9. Note that a manual attempt to trace how OpenCL™ kernels map to generated assembly is greatly obfuscated by the scheduling employed by the OpenCL™ runtime implementation.
Intel® FPGA products guidance can be accessed through fpgasoftware.intel.com.
Part 3
Explanation of sample source
The test application adds two input values together in an OpenCL™ kernel for each work-item; the result is available on the host after the kernel execution.
The Kernel - Intel® Graphics Technology
The texture sampler facility is used to read elements from the two input images. The elements are added and the result is written to the output image. The kernel uses the read_imageui(...) and write_imageui(...) function calls available from the OpenCL-C language to use the device's texture sampler facility. In the Intel® Graphics Technology implementation case there are multiple hardware texture samplers per device. This source file is stored on disk as TemplateGFX.cl for consumption at run time by the OpenCL™ host program.
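The general shape of such a kernel, shown here as a minimal sketch rather than the verbatim contents of TemplateGFX.cl, looks roughly like this:

```c
// OpenCL-C sketch of an image-based Add kernel (illustrative; see TemplateGFX.cl for the real source).
__kernel void Add(__read_only image2d_t inputA,
                  __read_only image2d_t inputB,
                  __write_only image2d_t outputC)
{
    // Non-normalized integer coordinates, nearest filtering: a plain element fetch.
    const sampler_t sampler =
        CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

    const int2 coord = (int2)(get_global_id(0), get_global_id(1));

    // Read one element from each input image through the texture sampler facility.
    uint4 a = read_imageui(inputA, sampler, coord);
    uint4 b = read_imageui(inputB, sampler, coord);

    // Add and write the result to the output image.
    write_imageui(outputC, coord, a + b);
}
```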
Note: There are more compute elements than texture samplers on current Intel® Graphics Technology hardware. Texture samplers are preferable for interpolation tasks. For simple reads or writes, operating on straightforward OpenCL™ buffers may be preferable. Memory staged for image2d_t objects may be reorganized by the OpenCL™ implementation for interpolation prior to kernel launch. Consider the cost of this opaque reorganization to be amortized by interpolation operations.
The Host Side Application - Intel® Graphics Technology
The host side application has all the OpenCL™ host API calls to set up a context, pick a target device, compile and execute the device kernel, and stage kernel input and output memory.
This example is not the only way to create an OpenCL™ host program, as production applications may wish to use different interrogation procedures, OpenCL™ API event facilities, or even different command queue methodology. C++ developers may wish to consider the OpenCL™ C++ Wrapper API for use in C++ programs. This example is written in C++ and uses the OpenCL™ API and some C99 library routines to focus the walkthrough on base API usage.
Starting in the main() function, notice the application picks a 1024 x 1024 array size for the image data. The sample doesn't read in and operate on a real image; random data is generated to fill the image data. However, the sizing of the input is critical later on, as it controls the number of OpenCL™ work-items launched for the kernel.
The FindOpenCLPlatform(...) routine finds an Intel® platform that has the device type we specify; in this case we're looking for a platform with a GPU device. The program uses the clGetPlatformIDs(...) API call to get the number of OpenCL™ platforms and an array of OpenCL™ platform IDs.
Each platform's devices are checked to see if the platform contains our preferred device type using the clGetDeviceIDs(...) API call:
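The sample's exact code is not reproduced here; a condensed sketch of the same step, with a hypothetical helper name and assuming the platform array was already obtained from clGetPlatformIDs(...), might look like:

```c
#include <CL/cl.h>

/* Given platforms returned by clGetPlatformIDs(...), pick the first one exposing
 * at least one device of the requested type (e.g. CL_DEVICE_TYPE_GPU). */
static cl_platform_id pick_platform_for_device_type(const cl_platform_id *platforms,
                                                    cl_uint num_platforms,
                                                    cl_device_type type)
{
    for (cl_uint i = 0; i < num_platforms; ++i) {
        cl_uint num_devices = 0;
        cl_int err = clGetDeviceIDs(platforms[i], type, 0, NULL, &num_devices);
        /* A CL_DEVICE_NOT_FOUND result simply means this platform has no such device. */
        if (err == CL_SUCCESS && num_devices > 0)
            return platforms[i];
    }
    return NULL;
}
```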
A new OpenCL™ context is created on the matching platform. The clCreateContextFromType(...) OpenCL™ API call creates the context associated with a device of our selected type. The clGetContextInfo(...) OpenCL™ API call allows the program to recover the selected device's id:
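A minimal sketch of those two calls, with error handling abbreviated and names chosen for illustration rather than taken from the sample:

```c
#include <CL/cl.h>

/* Create a context for the first device of `type` on `platform` and recover its device id. */
static cl_context create_context_for_type(cl_platform_id platform,
                                          cl_device_type type,
                                          cl_device_id *device_out)
{
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0
    };

    cl_int err = CL_SUCCESS;
    cl_context context = clCreateContextFromType(props, type, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;

    /* The context owns the matching device(s); ask it which device was selected. */
    err = clGetContextInfo(context, CL_CONTEXT_DEVICES, sizeof(cl_device_id), device_out, NULL);
    if (err != CL_SUCCESS) {
        clReleaseContext(context);
        return NULL;
    }
    return context;
}
```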
The GetPlatformAndDeviceVersion(...) function interrogates the given platform and device to understand OpenCL™ feature capability. The function allows for some flexibility in using OpenCL™ 1.2, OpenCL™ 2.0, or OpenCL™ 2.1 API calls as appropriate for target devices. The function also informs the program if OpenCL-C kernel language 2.0 features are supported.
Back in the SetupOpenCL(...) function, the program creates a command queue with the device context. Command queue creation changed between the OpenCL™ 1.2 API and the OpenCL™ 2.0 API, so there are helper macros to pick the correct API calls. A mismatch in command queue creation is a frequent build breaker for new developers. OpenCL™ API code that developers review in the wild may not be explicitly labeled with its OpenCL™ revision. The build example in the article uses -Wno-deprecated-declarations to easily leverage the older style command queue OpenCL™ API call:
Note: const cl_command_queue_properties properties[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0}; The CL_QUEUE_PROFILING_ENABLE flag is useful for debugging. It can help provide useful information in concert with OpenCL™ events. Refer to the Khronos* documentation on using this feature for more information: clCreateCommandQueueWithProperties(...) and clGetEventProfilingInfo(...). In production circumstances, enabling command queue profiling is often undesirable due to queue serialization.
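A sketch of handling both API generations follows. The sample uses helper macros; here a header-version check plus a runtime flag (such as the result of GetPlatformAndDeviceVersion(...)) stands in for them, and that structure is an assumption rather than the sample's exact approach:

```c
/* Allow the pre-2.0 clCreateCommandQueue(...) call without deprecation errors when
 * building against OpenCL 2.x headers (same intent as -Wno-deprecated-declarations). */
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS
#include <CL/cl.h>

/* Create a profiling-enabled queue, choosing the API call by the device's OpenCL version. */
static cl_command_queue create_queue(cl_context context, cl_device_id device,
                                     int device_supports_2_0, cl_int *err)
{
#ifdef CL_VERSION_2_0
    if (device_supports_2_0) {
        /* OpenCL 2.0+ style: zero-terminated property list. */
        const cl_queue_properties props[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };
        return clCreateCommandQueueWithProperties(context, device, props, err);
    }
#endif
    /* OpenCL 1.2 style: properties passed as a bitfield (deprecated in 2.0 headers). */
    return clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, err);
}
```

Selecting the call at runtime, rather than at compile time only, keeps one binary usable against both 1.2-only and 2.x-capable devices.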
Back in our main(...) function, the input and output buffers are created. The aligned allocations assume a 4K page size. This enables zero-copy behavior, eliminating unnecessary copying, by taking advantage of the unified host and device memory domain. See the Intel® article on zero copy for OpenCL™ for more information:
In int CreateBufferArguments(ocl_args_d_t *ocl, cl_int* inputA, cl_int* inputB, cl_int* outputC, cl_uint arrayWidth, cl_uint arrayHeight), the image objects are created. A description data structure is configured to denote the image object parameters. An image object is created for the two inputs and one output. The resulting images are single-channel with a bit depth of 32 bits, and data is stored as unsigned integer values. CL_MEM_USE_HOST_PTR allows the buffers to exhibit zero-copy behavior:
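A condensed sketch of creating one such image (CL_R channel order with CL_UNSIGNED_INT32 data, backed by a page-aligned host allocation for zero copy; the helper name is hypothetical and the sample's code is more complete):

```c
#include <CL/cl.h>
#include <string.h>

/* Wrap an existing 4K-aligned host allocation (width*height 32-bit values) in a
 * 2D, single-channel, 32-bit unsigned-integer image with zero-copy intent. */
static cl_mem create_uint_image2d(cl_context context, cl_mem_flags access,
                                  cl_uint width, cl_uint height,
                                  void *aligned_host_ptr, cl_int *err)
{
    cl_image_format format;
    format.image_channel_order     = CL_R;              /* one channel             */
    format.image_channel_data_type = CL_UNSIGNED_INT32; /* 32-bit unsigned values  */

    cl_image_desc desc;
    memset(&desc, 0, sizeof(desc));                     /* zero unused fields      */
    desc.image_type   = CL_MEM_OBJECT_IMAGE2D;
    desc.image_width  = width;
    desc.image_height = height;

    /* CL_MEM_USE_HOST_PTR lets the runtime use the caller's page-aligned allocation directly. */
    return clCreateImage(context, access | CL_MEM_USE_HOST_PTR, &format, &desc,
                         aligned_host_ptr, err);
}
```

In the spirit of the sample, such a helper would be called three times: with a read-only access flag for each input and a write-only flag for the output.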
Next, the program creates the kernel program object. The kernel program "Template.cl" is read in from source on disk. The source is then associated with the program object and context.
The program is built for the target context and device with clBuildProgram(...). Build feedback is recorded and is often very useful for scheduling and sizing kernel execution; it is also useful for debugging. The empty string is the argument where the OpenCL-C revision for the kernel program source or other preprocessor variables would be specified, for example "-cl-std=CL2.0". The OpenCL-C revision may or may not be the same as the OpenCL™ API revision. See the Khronos* official reference for more information:
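A minimal sketch of the create/build/report sequence (file reading omitted; the helper name is hypothetical and not the sample's):

```c
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Build an OpenCL-C source string for one device and print the build log on failure. */
static cl_program build_program(cl_context context, cl_device_id device,
                                const char *source, const char *options)
{
    cl_int err = CL_SUCCESS;
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;

    /* `options` is where "-cl-std=CL2.0" or preprocessor defines would be passed. */
    err = clBuildProgram(program, 1, &device, options, NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size + 1);
        if (log) {
            clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
            log[log_size] = '\0';
            fprintf(stderr, "Build log:\n%s\n", log);
            free(log);
        }
        clReleaseProgram(program);
        return NULL;
    }
    return program;
}
```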
A kernel object is created. This kernel object is associated with the desired function, Add(...), in the kernel source file via character string "Add" in the second argument:
Kernel arguments are bound. This binding maps image objects to parameters for the Add(...) kernel function. Now, the program can feed the proper host side data to the intended kernel parameters:
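Both steps, kernel creation and argument binding, condensed into one hypothetical helper (the sample performs them separately; the "Add" entry point and three image arguments follow the description above):

```c
#include <CL/cl.h>

/* Create the Add(...) kernel object and bind the three image arguments in parameter order. */
static cl_kernel create_add_kernel(cl_program program,
                                   cl_mem inputA, cl_mem inputB, cl_mem outputC)
{
    cl_int err = CL_SUCCESS;
    cl_kernel kernel = clCreateKernel(program, "Add", &err);
    if (err != CL_SUCCESS)
        return NULL;

    /* Argument indices map one-to-one to the kernel's parameter list. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &inputB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &outputC);
    return kernel;
}
```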
The built kernel program is now associated with the context for our Intel® Graphics Technology device, and the memory arguments are bound. The kernel program is enqueued on the command queue for execution. The enqueue specifies the two-dimensional hard coded size assigned at the beginning of the example. The clFinish(...) OpenCL™ API call is used to block host program execution until the enqueued kernel has finished executing. The main timer measuring kernel performance brackets these operations:
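The launch itself, sketched with the 1024 x 1024 global size from earlier (local size left to the runtime, error handling abbreviated, helper name hypothetical):

```c
#include <CL/cl.h>

/* Enqueue the Add kernel over a 2D range and block until it completes. */
static cl_int run_add_kernel(cl_command_queue queue, cl_kernel kernel,
                             size_t width, size_t height)
{
    const size_t global_size[2] = { width, height };   /* e.g. 1024 x 1024 work-items */

    /* NULL local size lets the implementation choose the work-group dimensions. */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS)
        return err;

    /* Block the host until everything enqueued so far has finished executing. */
    return clFinish(queue);
}
```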
Next, the output is mapped back to a host pointer for result verification. This mapping operation avoids a copy and exploits the shared memory domain between the host and Intel® Graphics Technology hardware on this platform.
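A sketch of mapping the output image for reading (hypothetical helper; the row pitch returned by the call should be respected when walking rows):

```c
#include <CL/cl.h>

/* Map the output image into host address space for verification, then unmap it. */
static cl_int verify_output_image(cl_command_queue queue, cl_mem outputC,
                                  size_t width, size_t height)
{
    const size_t origin[3] = { 0, 0, 0 };
    const size_t region[3] = { width, height, 1 };
    size_t row_pitch = 0;
    cl_int err = CL_SUCCESS;

    cl_uint *ptr = (cl_uint *)clEnqueueMapImage(queue, outputC, CL_TRUE, CL_MAP_READ,
                                                origin, region, &row_pitch, NULL,
                                                0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return err;

    /* ... compare ptr[...] against the host-computed reference here ... */

    return clEnqueueUnmapMemObject(queue, outputC, ptr, 0, NULL, NULL);
}
```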
The result is also computed on the host and compared to the result from the OpenCL™ device for verification. The source code that follows in GPUProjectforLinux.cpp is for tear down and clean-up purposes.
Intel® CPU
The Intel® CPU version of the source is mostly similar. Below we look at the key differences.
The Kernel - Intel® CPU
The CPU version does not show texture sampler usage in the sample. Hardware texture sampler access is only offered through the Intel® Graphics Technology implementation; the CPU runtime implements texture sampling functionality in software. Instead, the kernel in this example for the CPU target accesses kernel data through raw pointers. The data index desired for each work-item is calculated by a basic two-dimensional to one-dimensional transformation.
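The general shape of that buffer-based kernel, again as a sketch rather than the verbatim TemplateCPU.cl contents:

```c
// OpenCL-C sketch of a buffer-based Add kernel (illustrative; see TemplateCPU.cl for the real source).
__kernel void Add(__global const uint *inputA,
                  __global const uint *inputB,
                  __global uint *outputC)
{
    const size_t x = get_global_id(0);
    const size_t y = get_global_id(1);

    // 2D work-item coordinates collapsed to a 1D index into the flat buffers.
    const size_t index = y * get_global_size(0) + x;

    outputC[index] = inputA[index] + inputB[index];
}
```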
The Host Side Application - Intel® CPU
The sample uses OpenCL™ basic buffer objects and not OpenCL™ images for preparing kernel data:
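Creating those buffers is simpler than the image path; a sketch with the same zero-copy intent (hypothetical helper name):

```c
#include <CL/cl.h>

/* Wrap a page-aligned host allocation of `count` 32-bit values in an OpenCL buffer. */
static cl_mem create_uint_buffer(cl_context context, cl_mem_flags access,
                                 cl_uint *aligned_host_ptr, size_t count, cl_int *err)
{
    /* CL_MEM_USE_HOST_PTR again requests zero-copy use of the caller's allocation. */
    return clCreateBuffer(context, access | CL_MEM_USE_HOST_PTR,
                          count * sizeof(cl_uint), aligned_host_ptr, err);
}
```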
The validation step maps the OpenCL™ result buffer to a host program pointer at the end of the program. clEnqueueMapBuffer(...) OpenCL™ API call is used:
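A sketch of that mapping step (a blocking map for simplicity; the helper name is illustrative, not the sample's):

```c
#include <CL/cl.h>

/* Map the result buffer for reading on the host, verify, then unmap. */
static cl_int verify_output_buffer(cl_command_queue queue, cl_mem outputC, size_t count)
{
    cl_int err = CL_SUCCESS;
    cl_uint *ptr = (cl_uint *)clEnqueueMapBuffer(queue, outputC, CL_TRUE, CL_MAP_READ,
                                                 0, count * sizeof(cl_uint),
                                                 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return err;

    /* ... compare ptr[0..count-1] against the host-computed reference here ... */

    return clEnqueueUnmapMemObject(queue, outputC, ptr, 0, NULL, NULL);
}
```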
On error handling
Error handling, even with sandbox OpenCL™ development, triages many issues before they start. See the source examples for the TranslateOpenCLError(...) function definition to observe the various detailed error codes. These error codes are set by the OpenCL™ standard and are defined in the Khronos* standard headers. Always handle OpenCL™ API return values. Even for sandbox programs basic error handling is advised.
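A tiny checking pattern along those lines is sketched below; the string table here is abbreviated and the helper names are hypothetical, whereas the sample's TranslateOpenCLError(...) covers the full set of codes:

```c
#include <CL/cl.h>
#include <stdio.h>

/* Translate a few common status codes; the sample's TranslateOpenCLError(...) is exhaustive. */
static const char *cl_error_string(cl_int err)
{
    switch (err) {
    case CL_SUCCESS:                return "CL_SUCCESS";
    case CL_DEVICE_NOT_FOUND:       return "CL_DEVICE_NOT_FOUND";
    case CL_OUT_OF_RESOURCES:       return "CL_OUT_OF_RESOURCES";
    case CL_INVALID_KERNEL_ARGS:    return "CL_INVALID_KERNEL_ARGS";
    case CL_BUILD_PROGRAM_FAILURE:  return "CL_BUILD_PROGRAM_FAILURE";
    default:                        return "other CL error";
    }
}

/* Check every OpenCL API return value, even in sandbox programs. */
static void cl_check(cl_int err, const char *what)
{
    if (err != CL_SUCCESS)
        fprintf(stderr, "%s failed: %s (%d)\n", what, cl_error_string(err), err);
}
```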
Offline compilation
The ioc64 offline compiler front end has a few different output format options for compiled kernel intermediates. These can be stored on disk and consumed by the OpenCL™ API at a later time, and in some cases linked to other compiled kernel objects for future builds. SPIR and SPIR-V provide useful targeting for OpenCL™ as well as non-OpenCL™ kernels. SPIR provides a level of kernel obfuscation for developers who do not wish to distribute kernels in source text format, avoiding distribution as a text file (.cl) or as a constant string within the host binary. For developer reference, the -help usage output of ioc64 from the 2019 initial release is provided here:
Note: Future updates to ioc64 no longer support 'cpu_2_1' device toggle.
A build log may report the x86_64 instruction set architecture extensions used for the kernel. As of December 2018 and Intel® CPU Runtime for OpenCL™ Applications 18.1, AVX-512 kernels can be generated and executed on capable Intel® processors. See the -simd=<instruction_set_arch> toggle for more information.
More
For discussion on OpenCL™ developer components in Intel® System Studio 2019 and the Intel® SDK for OpenCL™ Applications 2019, join the community at the Intel® forums.
For more on CPU and Intel® Graphics Technology performance analysis for heterogeneous OpenCL™ programs, see Intel® VTune™ Amplifier, available as a standalone or as an Intel® System Studio 2019 component.
The OpenCL™ Intercept Layer instrumentation tool is useful and lightweight. It can track API calls and provide some straightforward performance metrics.
*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
"