You will need an AWS account. Create an account if you do not already have one.
You also need several tools installed on your local computer. Follow these links to install each tool:
To use the AWS CLI, you need to generate AWS access keys and configure the CLI.
Follow the AWS Credentials Install Guide for instructions on how to do this.
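For reference, if you already have an access key ID and secret access key, you can configure the CLI interactively with aws configure. The profile name below is only a placeholder; omit --profile to use the default profile:
# Enter the access key ID, secret access key, default region, and output format when prompted
aws configure --profile <Your_AWS_Profile>
# Confirm that the credentials work
aws sts get-caller-identity --profile <Your_AWS_Profile>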
Take a look at the GitHub repository, and then clone it to your local computer:
git clone https://github.com/koleini/spark-sentiment-analysis.git
cd spark-sentiment-analysis/eks
If you would like to change the default AWS region, you can do this by editing the file variables.tf.
As you will see, the default value is at the top of the file and is set to us-east-1:
variable "AWS_region" {
default = "us-east-1"
description = "AWS region"
}
In addition, if you are using a profile other than default, you need to update the following variable:
variable "AWS_profile" {
default = "Your_AWS_Profile"
description = "AWS authorization profile"
}
To create the Amazon EKS cluster, execute the following commands:
terraform init
terraform apply --auto-approve
Once the cluster is created, verify it in the AWS console.
If you want to use an AWS CLI profile that is not the default, make sure that you change the profile name before running the command to verify the cluster.
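You can also check the cluster from the command line. For example, using the Terraform outputs from this configuration (omit --profile if you are using the default profile):
# List the EKS clusters in the region used by the Terraform configuration
aws eks list-clusters --region $(terraform output -raw region) --profile <Your_AWS_Profile>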
Update the kubeconfig file to access the deployed EKS cluster with the following command:
aws eks --region $(terraform output -raw region) update-kubeconfig --name $(terraform output -raw cluster_name) --profile <Your_AWS_Profile>
Create a service account for Apache Spark:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
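Optionally, confirm that the service account and cluster role binding were created:
# Both commands should return the resources created above
kubectl get serviceaccount spark --namespace default
kubectl get clusterrolebinding spark-role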
Navigate to the sentiment_analysis folder to create a JAR file for the sentiment analyzer.
JAR is an acronym for Java ARchive. It is a compressed archive file format that contains Java-related files and metadata.
You will need sbt installed. If you are running Ubuntu, you can install it with:
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee /etc/apt/sources.list.d/sbt_old.list
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add -
sudo apt-get update
sudo apt-get install sbt
If you are using another operating system, refer to Installing sbt.
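After installation, you can confirm that sbt is available on your path:
# Prints the sbt version
sbt --version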
cd ../sentiment_analysis
sbt assembly
A JAR file is created at the following location:
sentiment_analysis/target/scala-2.13/bigdata-assembly-0.1.jar
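Optionally, you can inspect the contents of the assembly JAR with the jar tool that ships with the JDK:
# List the first entries in the archive
jar tf target/scala-2.13/bigdata-assembly-0.1.jar | head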
Create a repository in Amazon ECR to store the Docker images. You can also use Docker Hub.
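If you do not already have an ECR repository, one way to create it with the AWS CLI is shown below. The repository name is a placeholder; make sure it matches the image name you push in the build step, and your default region and profile are used unless you pass --region and --profile:
# Create a private ECR repository
aws ecr create-repository --repository-name <your-repository-name>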
The Spark repository contains a script to build the container image that you need to run inside the Kubernetes cluster.
Execute this script on your Arm-based computer to build the arm64 image.
Navigate back to the root of the cloned repository, then use the following commands to download the Apache Spark tarball and extract it before building the image:
wget https://archive.apache.org/dist/spark/spark-3.4.3/spark-3.4.3-bin-hadoop3-scala2.13.tgz
tar -xvzf spark-3.4.3-bin-hadoop3-scala2.13.tgz
cd spark-3.4.3-bin-hadoop3-scala2.13
Copy the JAR file generated in the previous step to the following location:
cp ../sentiment_analysis/target/scala-2.13/bigdata-assembly-0.1.jar jars/
To build and push the Docker container image, use the following commands, ensuring that you substitute the name of your container repository before executing them:
bin/docker-image-tool.sh -r <your-docker-repository> -t sentiment-analysis build
bin/docker-image-tool.sh -r <your-docker-repository> -t sentiment-analysis push
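If the push to Amazon ECR fails with an authentication error, log in to your registry first. The region and account ID below are placeholders:
# Authenticate Docker with your private ECR registry
aws ecr get-login-password --region <your-region> | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<your-region>.amazonaws.com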
Execute the spark-submit command within the Spark folder to deploy the application.
The following command runs the application with two executors, each with 12 cores, and allocates 24GB of memory to the driver pod and to each executor pod.
Before executing the spark-submit command, set the following variables (replacing the values in angle brackets with your own values):
export K8S_API_SERVER_ADDRESS=<K8S_API_SERVER_ENDPOINT>
export ES_ADDRESS=<IP_ADDRESS_OF_ELASTICSEARCH>
export CHECKPOINT_BUCKET=<S3_BUCKET_NAME>
export ECR_ADDRESS=<ECR_REGISTRY_ADDRESS>
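If you are unsure of the Kubernetes API server endpoint for the EKS cluster, one way to look it up is with the AWS CLI, run from the eks folder so that the Terraform outputs are available. The endpoint is returned as an https:// URL:
# Print the EKS cluster API server endpoint
aws eks describe-cluster --name $(terraform output -raw cluster_name) --query "cluster.endpoint" --output text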
Execute the spark-submit command:
bin/spark-submit \
--class bigdata.SentimentAnalysis \
--master k8s://$K8S_API_SERVER_ADDRESS:443 \
--deploy-mode cluster \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=$ECR_ADDRESS \
--conf spark.kubernetes.driver.pod.name="spark-twitter" \
--conf spark.kubernetes.namespace=default \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.driver.extraJavaOptions="-DES_NODES=$ES_ADDRESS -DCHECKPOINT_LOCATION=s3a://$CHECKPOINT_BUCKET/checkpoints/" \
--conf spark.executor.extraJavaOptions="-DES_NODES=$ES_ADDRESS -DCHECKPOINT_LOCATION=s3a://$CHECKPOINT_BUCKET/checkpoints/" \
--conf spark.executor.cores=12 \
--conf spark.driver.cores=12 \
--conf spark.driver.memory=24g \
--conf spark.executor.memory=24g \
--conf spark.memory.fraction=0.8 \
--name sparkTwitter \
local:///opt/spark/jars/bigdata-assembly-0.1.jar
Use kubectl get pods to check the status of the pods in the cluster:
NAME READY STATUS RESTARTS AGE
sentimentanalysis-346f22932b484903-exec-1 1/1 Running 0 10m
sentimentanalysis-346f22932b484903-exec-2 1/1 Running 0 10m
spark-twitter 1/1 Running 0 12m
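To monitor the application while it runs, you can follow the logs of the driver pod (named spark-twitter above):
# Stream the driver logs; press Ctrl+C to stop
kubectl logs -f spark-twitter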
Create an X (formerly Twitter) developer account and obtain a bearer token.
Use the following commands to set the bearer token and fetch the posts:
export BEARER_TOKEN=<BEARER_TOKEN_FROM_X>
python3 scripts/xapi_tweets.py
If you run into any dependency issues, you might need to install additional Python packages.
You can modify the script xapi_tweets.py and use your own keywords. Here is the code, which includes some sample keywords:
query_params = {'query': "(#onArm OR @Arm OR #Arm OR #GenAI) -is:retweet lang:en",
'tweet.fields': 'lang'}
Use the following command to send the collected tweets to the Kinesis stream, where the Spark application processes them and writes the results to Elasticsearch:
python3 csv_to_kinesis.py
Navigate to the Kibana dashboard using the following URL and analyze the tweets:
http://<IP_Address_of_ES_and_Kibana>:5601
Following this Learning Path deploys many resources in your cloud account. Remember to destroy them when you have finished. Use the following command, run from the eks folder, to clean up the resources:
terraform destroy