Improved Apache Spark* Performance on Google Dataproc Serverless*

Introduction

As companies have more data to analyze, increasing the efficiency of big data analysis becomes a critical business need. Apache Spark* is an open source analytics engine used for big data workloads. The Google* Dataproc-managed Spark service now includes Native Query Execution (NQE) powered by Apache Gluten*. Gluten is an incubating open source project under the Apache Software Foundation, designed to enhance the performance of Spark by offloading SQL execution to native processing engines.

This article gives an overview of Gluten and the performance improvements it can offer over Spark alone, along with an outline of how to run a Spark workload with NQE on Dataproc Serverless.

Apache Gluten* Overview

Spark is an important workload in the data services space for big data processing, and several add-ins are available to improve its performance. Intel’s initial approach to optimizing Spark was the Gazelle plug-in project, which uses a vectorized execution engine. Gazelle evolved into Gluten, which unlocks additional performance by offloading compute-intensive data processing to native accelerator libraries. The Gluten plug-in can interface with several popular native SQL engines, including ClickHouse*, Apache Arrow*, and Velox*. These native SQL engines benefit from community-driven support for accelerators in addition to traditional software optimizations. Velox, for example, supports several hardware features found in current and next-generation Intel® Xeon® processors, such as Intel® QuickAssist Technology, high-bandwidth memory (HBM), and Intel® In-Memory Analytics Accelerator, which can significantly improve Spark workload performance.

Gluten Architecture

The Gluten plug-in works in conjunction with Substrait*, a cross-language specification for data processing, and operates in multiple phases. First, Gluten transforms Spark SQL queries into Substrait plans. Then, it uses the Java* Native Interface (JNI) to call from the Java* virtual machine (JVM) into native libraries (for example, Velox, a C++ project). Finally, once native execution is complete, the data is returned to Gluten as a ColumnarBatch. Figure 1 provides additional context on the software stack that Spark and Gluten use.

While the Gluten plug-in supports a wide range of operators, some Spark operators are not yet supported. In these cases, or when native execution fails, Gluten falls back to non-native execution of the query. This happens transparently to the user, who can run workloads much faster without changing any queries or interfaces.

Substrait provides a well-defined, cross-language specification for data compute operations. The Spark physical plan is transformed into a Substrait plan, which is then passed to the native side through a JNI call. On the native side, the native operator chain is built and offloaded to the native engine. Gluten returns a ColumnarBatch to Spark, using the Spark columnar API (available since Spark 3.0) at runtime. Because Gluten uses a columnar format as its basic data format, the data returned to the Spark JVM is an ArrowColumnarBatch.
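To make this concrete, the following is a minimal sketch of enabling Gluten with the Velox backend on a self-managed Spark deployment (on Dataproc Serverless, NQE configures this for you). The jar path is a placeholder and the plug-in class name varies by Gluten release, so verify both against the version you deploy:

  # Sketch: enable the Gluten plug-in with the Velox backend in PySpark.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("gluten-sketch")
      # Placeholder path to the Gluten-with-Velox bundle jar.
      .config("spark.jars", "/opt/gluten/gluten-velox-bundle.jar")
      # Plug-in entry point; the class name depends on the Gluten release.
      .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
      # Shuffle manager that understands Gluten columnar batches.
      .config("spark.shuffle.manager",
              "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
      # The native engine allocates from off-heap memory.
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "6g")
      .getOrCreate()
  )

With this session, each SQL query follows the pipeline described above: plan conversion to Substrait, JNI offload to the native engine, and columnar batches returned to the JVM.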

""

Figure 1: Spark and Gluten software stack

 

There are several key components in Gluten, as depicted in Figure 1:

  • Query Plan Conversion: Converts the Spark physical plan into a Substrait plan.
  • Unified Memory Management: Controls native memory allocation.
  • Columnar Shuffle: Shuffles Gluten columnar data. The shuffle service reuses the one in Spark Core, and a columnar exchange operator is implemented to support the Gluten columnar data format.
  • Fallback Mechanism: Supports falling back to the Vanilla version of Spark for unsupported operators. Gluten ColumnarToRow (C2R) and RowToColumnar (R2C) operators convert between Gluten columnar data and Spark internal row data as needed. Both C2R and R2C are implemented in native code as well.
  • Metrics: Collected from Gluten native engine to help identify bugs and performance bottlenecks. The metrics are displayed in the Spark UI.
  • Shim Layer: Supports multiple Spark versions. The current commitment from the Gluten project is to support Spark's latest two to three releases. Currently, Spark 3.2, Spark 3.3, and Spark 3.4 (experimental) are supported.

Native Query Execution (NQE) on Google Cloud Platform* (GCP)

NQE on Google Cloud Platform* (GCP) brings the Gluten plug-in implementation to Dataproc Serverless. This managed service offers the following benefits:

  • Native Query Execution: Users experience significant performance gains with the new premium-tier NQE.
  • Seamless Monitoring with Spark UI: Users can track job progress in real time with a built-in Spark UI available by default for all Spark batches and sessions.
  • Streamlined Investigation: Troubleshoot batch jobs from a central Investigate tab, which displays all the essential metrics, highlights, and logs filtered by errors automatically.

NQE Process

Dataproc Serverless NQE is available only with batch workloads and interactive sessions running in the Dataproc Serverless premium pricing tier. While the premium tier costs more than the standard tier, there is no additional charge for NQE on top of premium tier pricing.

The user can enable premium tier resource allocation and pricing for the batch and interactive session resources by setting the resource allocation tier properties to premium before submitting a Spark batch workload.

NQE is enabled by setting NQE properties on a batch workload, interactive session, or session template and then submitting the workload or running the interactive session in a notebook.


An example of setting up NQE in the GCP console follows (a programmatic sketch appears after these steps):

  1. In the GCP console:
    1. Go to Dataproc Batches.
    2. To open the Create batch page, click Create.
  2. Select and fill in the following fields to configure the batch for NQE:
    • Container: Select a Dataproc Serverless runtime version that supports NQE.
    • Executor and Driver Tier Configuration:
      • Select Premium for both tiers (Driver Compute Tier, Executor Compute Tier).
    • Properties: Enter the following key and value pair for the NQE property:
      • Key: spark.dataproc.runtimeEngine
      • Value: native
  3. Fill in, select, or confirm other batch workload settings. See Submit a Spark Batch Workload.
  4. To run the Spark batch workload, click Submit.
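The same setup can be expressed programmatically. The following is a sketch using the Dataproc Python client (google-cloud-dataproc); the project, region, and file URI are placeholders, and the compute tier property names are assumptions that mirror the console fields above:

  # Sketch: submit a Dataproc Serverless batch with premium tiers and NQE.
  from google.cloud import dataproc_v1

  region = "us-central1"  # placeholder region
  client = dataproc_v1.BatchControllerClient(
      client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
  )

  batch = dataproc_v1.Batch(
      pyspark_batch=dataproc_v1.PySparkBatch(
          main_python_file_uri="gs://my-bucket/job.py"  # placeholder URI
      ),
      runtime_config=dataproc_v1.RuntimeConfig(
          properties={
              # Premium tier resource allocation, required for NQE.
              "spark.dataproc.driver.compute.tier": "premium",
              "spark.dataproc.executor.compute.tier": "premium",
              # Switch the runtime engine to native query execution.
              "spark.dataproc.runtimeEngine": "native",
          }
      ),
  )

  operation = client.create_batch(
      parent=f"projects/my-project/locations/{region}",  # placeholder project
      batch=batch,
  )
  print(operation.result().state)  # waits for batch creation to complete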

The user can run the Dataproc NQE qualification tool to identify workloads that can achieve faster runtimes with NQE. The qualification tool analyzes the Spark event files generated by batch workload applications and then estimates potential runtime savings that each workload application can obtain with NQE.

There are several options available for more advanced performance tuning. These include the following properties:

  • spark.driver.memory / spark.driver.memoryOverhead: To tune memory for the Spark driver
  • spark.executor.memory / spark.executor.memoryOverhead: To tune on-heap memory for executors
  • spark.memory.offHeap.size: To tune off-heap memory
  • spark.dataproc.driver.disk.size: To tune disk size for the Spark driver
  • spark.dataproc.executor.disk.size: To tune disk size for executors

By default, if no memory (either off-heap or on-heap) is configured, the execution engine uses 4 GB of memory, allocated between off-heap and on-heap at a ratio of 6:1. If these options are changed, Dataproc recommends keeping the same 6:1 off-heap to on-heap ratio, as in the sketch below.
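As a worked example, the following sketch raises total executor memory to 7 GB while preserving the 6:1 split (the values are arbitrary illustrations, not tuned recommendations):

  # Sketch: 7 GB total per executor, split 6:1 off-heap to on-heap.
  memory_properties = {
      "spark.memory.offHeap.size": "6g",  # off-heap, used by the native engine
      "spark.executor.memory": "1g",      # on-heap, used by the executor JVM
  }

These key and value pairs can be merged into the batch properties shown in the submission sketch earlier.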


For more information, refer to Accelerate Dataproc Serverless Batch Workloads and Interactive Sessions with Native Query Execution.

Gluten Performance on GCP

We use TPC-DS-like and TPC-H-like workloads as benchmarks to evaluate the performance of the GCP NQE feature powered by Gluten. The following chart shows that overall performance can reach up to a 2.92x speedup compared to GCP Dataproc Serverless without the NQE feature. NQE significantly elevates query performance while minimizing operational cost: it requires zero code changes and only a few parameters to select in GCP. Certain queries achieve even higher performance gains over running the Vanilla version of Spark execution.

Note: Performance results may vary depending on the experimental environment, including dataset size, instance type, and other factors.

""

Configuration: C3-highmem-22, 8 workers, Scale Factor: 1TB. Spark 3.3.1, Hadoop 3.2.4, JDK 1.8

Why the NQE Feature (Gluten) Can Run Faster

Gluten and Velox, as Spark plug-ins, provide several optimizations that significantly enhance Spark's performance. One major difference from the Vanilla version of Spark is the use of a columnar data format to run SQL queries. This approach not only saves memory during computation but also reduces the data transfer size during shuffles. Another noteworthy optimization, enabled by the columnar data format, is the ability to use Intel® Advanced Vector Extensions. In addition, Gluten and Velox rewrite every operator and function in native code, resulting in better performance than the JVM implementations in the Vanilla version of Spark.

When to Use NQE Feature (Gluten)

In TPC-DS-like and TPC-H-like benchmarks, we have ensured that almost all operators and functions are supported in Gluten, which is why the NQE feature can boost performance by at least 2x. In real queries, however, the user may encounter scenarios where performance does not meet the claimed improvements or even regresses. Several factors can contribute to this discrepancy.

Gluten supports various operators and data types, including hash aggregate, broadcast hash join, columnar shuffle, and more. To fully benefit from Gluten, consider the following points to ensure full support and optimal performance:

  • Focus on compute-intensive workloads rather than I/O-intensive ones.
  • Verify that the data types used are fully supported in Gluten.
  • Ensure that the operators and functions in the query are supported by Gluten.
  • Monitor the number of fallbacks (ColumnarToRow and RowToColumnar); excessive fallbacks can significantly impact performance.
  • Tune the fallback strategy to optimize performance based on user workloads.

To check the fallback status of a query, the user can use df.explain() (see the sketch below) or the Spark UI with the Gluten tab. Gluten supports three levels of fallback strategies to help tune the query: operator-level fallback, stage-level fallback, and whole-query fallback. For more information on setting up the parameters for fallback strategies, refer to the Gluten configuration documentation.
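The following is a small sketch of such a check; the table and query are illustrative, and exact plan node names vary by Gluten version:

  # Sketch: inspect a query plan for Gluten fallbacks.
  df = spark.sql("SELECT category, COUNT(*) FROM sales GROUP BY category")
  # Offloaded operators typically carry a "Transformer" suffix, while
  # ColumnarToRow (C2R) and RowToColumnar (R2C) nodes mark transitions
  # back to row-based Vanilla Spark execution.
  df.explain()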

Additionally, GCP provides a qualification tool to help evaluate if the workloads are suitable for the NQE feature.

Conclusion

The integration of Gluten with Spark on the Dataproc managed service offers significantly better performance than running Spark workloads without NQE. TPC-DS-like and TPC-H-like benchmark results demonstrate that running workloads on Dataproc Serverless using Intel Xeon technology is the optimal choice for Spark SQL workloads. This setup is not only easy to use but also capable of doubling workload performance.

References