Switch to the root user and install all required system packages. This ensures you have the correct Java version, build tools, and database dependencies for Spark, Hadoop, Hive, and Gluten on Arm64.
sudo -i
apt update
apt install -y \
openjdk-17-jdk wget tar git curl unzip build-essential \
python3-pip mysql-server maven cmake ninja-build pkg-config libssl-dev
A Java runtime is required by Spark and Hadoop, the C++ build tools are needed to compile Gluten, and MySQL backs the Hive metastore.
Hadoop requires a resolvable hostname for internal communication. Set the hostname to spark-master so that Hadoop and Spark services can reach each other reliably on this single-node cluster and avoid networking issues during startup.
hostnamectl set-hostname spark-master
exec bash
Append the hostname to /etc/hosts to ensure all Hadoop and Spark services resolve the local node correctly. This prevents connection errors.
echo "127.0.0.1 spark-master" >> /etc/hosts
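You can verify the mapping immediately; getent queries the same resolver configuration that Hadoop uses:

```shell
# Confirm spark-master resolves locally; prints the matching hosts entry on success.
getent hosts spark-master || echo "spark-master does not resolve; check /etc/hosts"
```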
Generate an SSH key pair for passwordless authentication. Hadoop daemons use SSH to manage services internally, so this step is required for smooth operation.
ssh-keygen -t rsa -P ""
When prompted for the file location, press Enter to accept the default:
Enter file in which to save the key (/root/.ssh/id_rsa):
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
Append the public key to authorized_keys to enable passwordless SSH for the root user:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
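Confirm that passwordless SSH works before starting Hadoop. BatchMode makes ssh fail fast instead of prompting for a password:

```shell
# A successful run prints "SSH OK" without asking for a password.
ssh -o StrictHostKeyChecking=no -o BatchMode=yes localhost 'echo SSH OK' \
  || echo "passwordless SSH is not working; re-check ~/.ssh/authorized_keys"
```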
Hadoop provides HDFS for storage and YARN for resource management.
Install Hadoop:
cd /opt
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1-aarch64.tar.gz
tar -xvf hadoop-3.3.1-aarch64.tar.gz
ln -s hadoop-3.3.1 hadoop
Spark is the main analytics engine for running SQL and DataFrame workloads. Download and extract Apache Spark 3.4.2 built for Hadoop 3:
wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz
tar -xvf spark-3.4.2-bin-hadoop3.tgz
ln -s spark-3.4.2-bin-hadoop3 spark
Download and extract Apache Hive 3.1.3. Hive provides the SQL metadata layer and metastore for Spark SQL.
wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -xvf apache-hive-3.1.3-bin.tar.gz
ln -s apache-hive-3.1.3-bin hive
Set up environment variables for Java, Hadoop, Spark, and Hive. This ensures all commands and scripts can find the correct binaries and configuration files.
cat >> ~/.bashrc <<EOF
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-arm64
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=\${HADOOP_HOME}/etc/hadoop
export YARN_CONF_DIR=\${HADOOP_HOME}/etc/hadoop
export SPARK_HOME=/opt/spark
export HIVE_HOME=/opt/hive
export PATH=\$JAVA_HOME/bin:\$HADOOP_HOME/bin:\$SPARK_HOME/bin:\$HIVE_HOME/bin:\$PATH
EOF
Apply the environment changes to your current shell:
source ~/.bashrc
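A minimal sanity check, which only inspects the exported environment, confirms the variables are set in the current shell:

```shell
# Report each required variable; an empty value means the shell profile was not sourced.
for v in JAVA_HOME HADOOP_HOME SPARK_HOME HIVE_HOME; do
  if [ -n "$(printenv "$v")" ]; then
    echo "$v=$(printenv "$v")"
  else
    echo "$v is not set"
  fi
done
```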
HDFS needs local directories to persist file system state. Create the NameNode and DataNode storage directories:
mkdir -p $HADOOP_HOME/dfs/name
mkdir -p $HADOOP_HOME/dfs/data
mkdir -p /opt/dfs/data
Create a minimal core-site.xml to define the default HDFS URI for a single-node cluster.
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<EOF
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://spark-master:9000</value>
</property>
</configuration>
EOF
Create a minimal hdfs-site.xml to configure HDFS for single-node operation.
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<EOF
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.hosts</name>
<value>/opt/hadoop/etc/hadoop/workers</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/dfs/data</value>
</property>
</configuration>
EOF
Create a minimal yarn-site.xml optimized for Arm64, specifying the resource manager hostname and available resources.
cat > $HADOOP_HOME/etc/hadoop/yarn-site.xml <<EOF
<configuration>
<property><name>yarn.resourcemanager.hostname</name><value>spark-master</value></property>
<property><name>yarn.nodemanager.resource.memory-mb</name><value>8192</value></property>
<property><name>yarn.nodemanager.resource.cpu-vcores</name><value>4</value></property>
</configuration>
EOF
Append settings to hadoop-env.sh to fix Java 17 reflection issues, run the daemons as root, and keep Hadoop, Spark, and Gluten stable:
cat >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh <<EOF
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-arm64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export HADOOP_OPTS="--add-opens java.base/java.lang=ALL-UNNAMED"
EOF
Format the NameNode (required only on first startup), then start HDFS and YARN:
hdfs namenode -format
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
The output from start-dfs.sh is similar to:
Starting namenodes on [spark-master]
Starting datanodes
Starting secondary namenodes [spark-master]
The output from start-yarn.sh is similar to:
Starting resourcemanager
Starting nodemanagers
Verify that all Hadoop daemons are running:
jps
The output is similar to:
53856 ResourceManager
54469 Jps
54245 NodeManager
53526 SecondaryNameNode
53036 NameNode
53276 DataNode
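If any daemon is missing, check the logs under $HADOOP_HOME/logs. As a scripted alternative to reading the jps output by eye (a sketch; the names below are the standard Hadoop daemon process names), you can scan it directly:

```shell
# Scan jps output for the five daemons a single-node cluster needs.
# A missing daemon is reported instead of silently ignored.
jps_out="$(jps 2>/dev/null || true)"
for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
  if printf '%s\n' "$jps_out" | grep -qw "$d"; then
    echo "OK: $d"
  else
    echo "MISSING: $d"
  fi
done
```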
Hive stores table metadata in a relational database. Create the metastore database and user in MySQL:
mysql -u root <<EOF
CREATE DATABASE hive_metastore;
CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hiveuser'@'localhost';
FLUSH PRIVILEGES;
EOF
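You can verify the database and user with a quick login test. The password 123456 matches the one created above; use a stronger password outside of test environments:

```shell
# Log in as hiveuser and confirm the metastore database is visible.
mysql -u hiveuser -p123456 -e "SHOW DATABASES;" 2>/dev/null | grep -w hive_metastore \
  || echo "hive_metastore not visible; re-check the grants above"
```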
By default, Hive uses the embedded Derby database for its metastore, so initializing the schema against MySQL fails until Hive is reconfigured. Download the MySQL Java Database Connectivity (JDBC) connector, then point Hive at the MySQL metastore you created in the previous step.
Download the MySQL JDBC connector:
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar -P $HIVE_HOME/lib/
Create hive-site.xml with the MySQL connection details:
cat > $HIVE_HOME/conf/hive-site.xml <<EOF
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
</configuration>
EOF
Initialize the Hive metastore schema in MySQL and start the Hive metastore service in the background. This step is required before Spark can use Hive tables for SQL analytics.
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
nohup hive --service metastore &
The output is similar to:
Initialization script completed
schemaTool completed
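The metastore service listens on port 9083 by default; confirm it is accepting connections before moving on:

```shell
# Look for a listener on the default Hive metastore port (9083).
ss -ltn 2>/dev/null | grep -q ':9083' \
  && echo "metastore listening on 9083" \
  || echo "metastore not up yet; check nohup.out for errors"
```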
Build the Gluten project with the Velox backend to enable native C++ query execution in Spark. This process downloads all dependencies, compiles the engine, and prepares the Java Archives (JARs) needed for Spark integration. The sed commands pin the build to Spark 3.4 and disable S3 support for simplicity.
cd /opt
git clone https://github.com/apache/incubator-gluten.git
cd incubator-gluten
git checkout v1.3.0
sed -i 's/SPARK_VERSION=ALL/SPARK_VERSION=3.4/' dev/builddeps-veloxbe.sh
sed -i 's/--enable_s3=ON//g' dev/package.sh
./dev/package.sh
mkdir -p /opt/gluten-jars
cp package/target/*.jar /opt/gluten-jars/
Configure Spark to use the Gluten plugin and Velox backend by creating a spark-defaults.conf file. This enables native execution, sets resource limits, and ensures the correct JARs are loaded for both the driver and executors.
cat > $SPARK_HOME/conf/spark-defaults.conf <<EOF
spark.master yarn
spark.executor.instances 2
spark.executor.cores 2
spark.executor.memory 3g
spark.driver.memory 3g
spark.sql.shuffle.partitions 50
spark.plugins org.apache.gluten.GlutenPlugin
spark.gluten.enabled true
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 2g
spark.gluten.sql.columnar.backend.lib velox
spark.driver.extraClassPath /opt/gluten-jars/*
spark.executor.extraClassPath /opt/gluten-jars/*
EOF
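A quick grep confirms the Gluten settings landed in the configuration file:

```shell
# Print the Gluten-related settings from spark-defaults.conf.
grep -E '^spark\.(plugins|gluten|memory\.offHeap)' \
  "${SPARK_HOME:-/opt/spark}/conf/spark-defaults.conf" 2>/dev/null \
  || echo "Gluten settings not found; re-run the step above"
```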
Start the Spark Thrift Server, which allows you to connect to Spark SQL using JDBC/Open Database Connectivity (ODBC) clients. This is the main entry point for running SQL queries and benchmarks.
$SPARK_HOME/sbin/start-thriftserver.sh
Verify that the Thrift Server processes are running:
jps
The output is similar to:
53856 ResourceManager
229942 Jps
54245 NodeManager
229910 ExecutorLauncher
53526 SecondaryNameNode
229637 SparkSubmit
55622 RunJar
53036 NameNode
53276 DataNode
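To confirm the Thrift Server answers SQL, connect with beeline (shipped with Spark) on its default JDBC port, 10000:

```shell
# Run a trivial query through the JDBC endpoint; a returned row confirms
# end-to-end operation of the Thrift Server.
"${SPARK_HOME:-/opt/spark}/bin/beeline" -u jdbc:hive2://localhost:10000 -e "SELECT 1;" \
  || echo "Thrift Server not reachable; check the logs under \$SPARK_HOME/logs"
```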
You’ve now created a fully operational Spark SQL cluster on Arm64 with native acceleration enabled. You installed and configured Hadoop, Spark, and Hive, built Gluten with the Velox backend, and enabled native query execution outside the JVM on your Azure Cobalt 100 VM.
Next, you’ll generate a TPC-DS dataset, load it into HDFS, create Spark SQL tables, and run analytical queries to measure the performance improvement over standard Spark.