SYCL*-based Discrete Cosine Transform (DCT) For JPEG Image Compression on GPU

Accelerate parallel implementation of image compression using the Intel® oneAPI DPC++/C++ Compiler

Image compression refers to the process of reducing the size of a digital image file while only minimally compromising its quality. This is achieved by removing redundant and non-essential data, enabling easier storage and faster image transmission over the internet or other networks.

This blog discusses the Discrete Cosine Transform code sample available in the oneAPI GitHub repository. It demonstrates implementing a lossy image compression technique based on the Discrete Cosine Transform (DCT) for JPEG images using SYCL* and the Intel® oneAPI DPC++/C++ Compiler.

Before diving into the details of the code sample, let us talk more about image compression. 

Real-world applications of image compression include: 
 

  • Digital photography – for efficient storage and sharing of high-resolution images captured using cameras; 

  • Consumer electronics – for minimizing the usage of data and storage space on mobile devices like smartphones and tablets; 

  • Medical imaging – for efficient storage and transmission of medical images, retaining the image quality for proper diagnosis; 

  • Video surveillance – for compressing images captured by surveillance systems to efficiently store and transmit them using cloud services; and 

  • Web development – for faster loading of images on websites to improve user experience and reduce bandwidth usage.
     

There are two types of image compression techniques: 
 

  1. Lossless compression: This method retains image quality and enables precise image reconstruction from the compressed data. PNG, GIF, and TIFF are some of the common lossless compression image formats.

  2. Lossy compression: This method permanently eliminates some information from the image data; hence, the original image cannot be restored. JPEG and WebP are commonly known lossy compression formats. 

Lossy compression techniques often transform the original image into a frequency domain using mathematical functions such as the Discrete Cosine Transform (DCT) and then quantize the frequency components.

Why Discrete Cosine Transform (DCT)?

The DCT image compression method is favorable because it tends to concentrate the image signal information in a few low-frequency components. This makes it easier to achieve high compression ratios while maintaining good visual quality.

With carefully applied quantization, the loss of image quality caused by the DCT compression process can be made imperceptible to the human eye while significantly reducing the file size.
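To make the energy-concentration idea concrete, here is a minimal sketch in plain C++ (no SYCL). It applies an orthonormal one-dimensional DCT-II to eight samples and measures how much of the signal energy lands in the first few coefficients. The function names `dct8` and `low_frequency_energy_share` are hypothetical and are not taken from the code sample.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Orthonormal 1D DCT-II of 8 samples (illustrative helper, not from the sample).
std::array<double, 8> dct8(const std::array<double, 8>& x) {
    const double pi = std::acos(-1.0);
    std::array<double, 8> X{};
    for (std::size_t k = 0; k < 8; ++k) {
        double sum = 0.0;
        for (std::size_t n = 0; n < 8; ++n) {
            sum += x[n] * std::cos((2.0 * n + 1.0) * k * pi / 16.0);
        }
        // Scale factors that make the transform orthonormal (energy preserving).
        const double scale = (k == 0) ? std::sqrt(1.0 / 8.0) : std::sqrt(2.0 / 8.0);
        X[k] = scale * sum;
    }
    return X;
}

// Fraction of the total energy held by the first `count` coefficients.
double low_frequency_energy_share(const std::array<double, 8>& X, std::size_t count) {
    assert(count <= 8);
    double low = 0.0;
    double total = 0.0;
    for (std::size_t k = 0; k < 8; ++k) {
        total += X[k] * X[k];
        if (k < count) low += X[k] * X[k];
    }
    return total > 0.0 ? low / total : 0.0;
}
```

For a smooth input such as the ramp 10, 20, ..., 80, the first two coefficients carry more than 99% of the energy, which is why the remaining ones can be stored coarsely or dropped with little visible effect.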

Let us now discuss the Discrete Cosine Transform code sample and how the Intel oneAPI DPC++/C++ Compiler can help achieve faster compression, accelerated using SYCL-based GPU offload. 

Intel® oneAPI DPC++/C++ Compiler: An Overview 

The Intel oneAPI DPC++/C++ Compiler is a high-performance, industry-standard compliant LLVM*-based compiler that helps compile ISO C/C++ and SYCL applications across diverse architectures. It is the world’s first compiler fully supporting the latest SYCL 2020 specification. In addition to SYCL, it also supports other accelerated parallel computing frameworks like OpenMP* or OpenCL*. It is designed to seamlessly interface with and leverage oneAPI libraries, such as oneDPL and oneTBB, for optimized parallel execution and offload compute acceleration. These design properties enable code reusability across heterogeneous hardware platforms, including CPUs, GPUs, and FPGAs.

About The Discrete Cosine Transform Code Sample

The code sample first performs Discrete Cosine Transform (DCT) and quantization on the input image. The intermediate image thus produced then undergoes inverse DCT and de-quantization to produce the output BMP image, which will be used to assess the loss of image information caused by the DCT compression technique. 

Let us briefly walk through the three execution steps of the compression routine: 

 

  1. DCT Step 
    The pixel representation of an image stores the color value of each pixel. The DCT instead represents the color pattern of each image subset as a weighted sum of cosine functions. The image is processed as 8x8 subsections (called ‘blocks’ in the code sample). An 8x8 block can be represented exactly using 8 cosine frequencies in each direction, that is, 64 two-dimensional cosine patterns. Reconstructing the block from the cosine representation only requires the coefficient associated with each pattern. The DCT process therefore transforms the 8x8 matrix of pixels of the input image into a corresponding 8x8 matrix of coefficients.
     

  2. Quantization Step
    The quantization process is what enables compression of the image data. A quantizing matrix is designed to prioritize the cosine functions most relevant to the image data. Dividing the matrix obtained after DCT element by element by the quantizing matrix, and rounding the results, yields a short series of nonzero numbers followed by a long run of zeroes when read diagonally (as stored in memory). This long run of zeroes is what allows the original image to be compressed.
     

  3. Inverse DCT and De-quantization Steps
    Before writing the output to a file, the code sample performs de-quantization followed by inverse DCT on the quantized data to reproduce raw image data. Because of these inverse operations, the resultant image is not a compressed version of the original one. However, it exposes the artifacts caused by a lossy compression method based on DCT and quantization.
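The three steps above can be sketched for a single 8x8 block in plain C++ (no SYCL). This is an illustration under stated assumptions, not the sample's implementation: the `Block` alias, all function names, and the `quant_step` table are hypothetical, and the transform is written as straightforward nested loops rather than an optimized form.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Block = std::array<std::array<double, 8>, 8>;

// Hypothetical quantizing matrix (not the sample's table): the step size
// grows with frequency, so high-frequency coefficients are stored coarsely.
double quant_step(int u, int v) { return 8.0 + 4.0 * (u + v); }

// One orthonormal 1D DCT-II basis value: sample n of the cosine with index k.
double basis(int n, int k) {
    static const double pi = std::acos(-1.0);
    const double scale = (k == 0) ? std::sqrt(1.0 / 8.0) : std::sqrt(2.0 / 8.0);
    return scale * std::cos((2 * n + 1) * k * pi / 16.0);
}

// Step 1, forward 2D DCT: 64 pixel values in, 64 coefficients out.
Block forward_dct(const Block& pixels) {
    Block coeff{};
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v)
            for (int y = 0; y < 8; ++y)
                for (int x = 0; x < 8; ++x)
                    coeff[u][v] += pixels[y][x] * basis(y, u) * basis(x, v);
    return coeff;
}

// Step 2, quantization: element-wise divide and round (the lossy part).
Block quantize(const Block& coeff) {
    Block q{};
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v)
            q[u][v] = std::round(coeff[u][v] / quant_step(u, v));
    return q;
}

// Step 3a, de-quantization: multiply back by the same step sizes.
Block dequantize(const Block& q) {
    Block coeff{};
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v)
            coeff[u][v] = q[u][v] * quant_step(u, v);
    return coeff;
}

// Step 3b, inverse 2D DCT: rebuild pixel values from coefficients.
Block inverse_dct(const Block& coeff) {
    Block pixels{};
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            for (int u = 0; u < 8; ++u)
                for (int v = 0; v < 8; ++v)
                    pixels[y][x] += coeff[u][v] * basis(y, u) * basis(x, v);
    return pixels;
}

// Read the block along its anti-diagonals (zigzag order) and count the
// zeroes that follow the last nonzero value.
int trailing_zeros_zigzag(const Block& q) {
    int last_nonzero = -1;
    int index = 0;
    for (int s = 0; s <= 14; ++s) {
        for (int i = 0; i <= s; ++i) {
            const int u = (s % 2 == 0) ? s - i : i;  // alternate direction
            const int v = s - u;
            if (u > 7 || v > 7) continue;
            if (q[u][v] != 0.0) last_nonzero = index;
            ++index;
        }
    }
    return 63 - last_nonzero;
}

// Largest absolute per-pixel difference between two blocks.
double max_abs_error(const Block& a, const Block& b) {
    double worst = 0.0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            worst = std::fmax(worst, std::fabs(a[y][x] - b[y][x]));
    return worst;
}
```

Applying `inverse_dct(forward_dct(block))` returns the original pixels up to floating-point error, so the transform on its own loses nothing. The loss appears only once `quantize` and `dequantize` sit in between: for a smooth block, only a few low-frequency coefficients survive, and the reconstructed pixels differ slightly from the input.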

Parallel Computations with SYCL* 

The individual 8x8 blocks of an image do not depend on one another, so they can be processed concurrently. The code sample implements this parallelization using SYCL, with only a few minor changes to the original serial implementation.
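The independence of the blocks can be illustrated with a short plain C++ sketch. It is deliberately serial and does not use SYCL: `process_block` is a hypothetical stand-in for the per-block DCT pipeline (here it only averages the block), and `process_image` visits the blocks in either forward or reverse order. Because each call reads and writes only its own 8x8 region, the visiting order does not change the result, which is the property that would let the loop body serve as the kernel of a SYCL `parallel_for` over the grid of blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-block pipeline: replaces every pixel of
// block (bx, by) with the mean of that block. Touches no other block.
void process_block(const std::vector<float>& in, std::vector<float>& out,
                   std::size_t width, std::size_t bx, std::size_t by) {
    float sum = 0.0f;
    for (std::size_t y = 0; y < 8; ++y)
        for (std::size_t x = 0; x < 8; ++x)
            sum += in[(by * 8 + y) * width + bx * 8 + x];
    const float mean = sum / 64.0f;
    for (std::size_t y = 0; y < 8; ++y)
        for (std::size_t x = 0; x < 8; ++x)
            out[(by * 8 + y) * width + bx * 8 + x] = mean;
}

// Serial driver over all blocks of a row-major, single-channel image.
// `reverse_order` only changes the order in which blocks are visited.
std::vector<float> process_image(const std::vector<float>& in, std::size_t width,
                                 std::size_t height, bool reverse_order) {
    assert(width % 8 == 0 && height % 8 == 0);
    assert(in.size() == width * height);
    std::vector<float> out(in.size());
    const std::size_t blocks_x = width / 8;
    const std::size_t total = blocks_x * (height / 8);
    for (std::size_t i = 0; i < total; ++i) {
        const std::size_t b = reverse_order ? total - 1 - i : i;
        // Map the flat block index to 2D block coordinates.
        process_block(in, out, width, b % blocks_x, b / blocks_x);
    }
    return out;
}
```

Calling `process_image` with `reverse_order` set to `false` and then `true` on the same input produces identical output, since no block depends on another block's result.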

→ Check out the key implementation details

Sample Output

Below is an example output of executing the code sample on a 6th Gen Intel® Core™ Processor with integrated Intel® Processor Graphics Gen 9 or newer and the Intel oneAPI DPC++/C++ Compiler. The code dispatches execution to a compatible GPU if one is found; otherwise, it runs on the CPU (host).

Filename: willyriver.bmp W: 5184 H: 3456
Start image processing with offloading to GPU...
Running on Intel(R) UHD Graphics 620
--The processing time is 6.27823 seconds
DCT successfully completed on the device.
The processed image has been written to willyriver_processed.bmp

What’s Next? 

Check out the Discrete Cosine Transform sample for SYCL-based parallel DCT image compression technique implementation. 

Try your hand at several other code samples available in the oneAPI GitHub repository.

Get started with the Intel oneAPI DPC++/C++ Compiler today to efficiently compile C/C++ and SYCL applications across heterogeneous architectures. 

Also, explore other AI, HPC, and Rendering tools in Intel’s oneAPI-powered software portfolio. 

Get The Software 

Install the Intel oneAPI DPC++/C++ Compiler as a part of the Intel® oneAPI Base Toolkit or Intel® HPC Toolkit. You can also download a standalone version of the compiler or test it across Intel® CPUs and GPUs on the Intel® Tiber™ Developer Cloud platform.   

Useful Resources