Spark Streaming, Kafka and Cassandra Tutorial

This tutorial builds on our basic “Getting Started with Instaclustr Spark and Cassandra” tutorial to demonstrate how to set up Apache Kafka and use it to send data to Spark Streaming where it is summarised before being saved in Cassandra. 

The high-level steps to be followed are:

  1. Set up your environment.
  2. Build the sample.
  3. Run the sample.

1. Set up your environment

To set up your environment, first follow the steps in sections 1 (Provision a cluster with Cassandra and Spark) and 2 (Set up a Spark client) of the tutorial here: https://www.instaclustr.com/support/documentation/apache-spark/getting-started-with-instaclustr-spark-cassandra/

(The one minor change to the configuration is to select “Ubuntu Server 16.04 LTS (HVM), SSD Volume Type” as the AMI.)

Once this is complete, open three tabs (tab 1, tab 2 and tab 3) in the terminal, then install and start Kafka:

  1. Download Kafka (in tab 1):
cd ~
wget http://www-us.apache.org/dist/kafka/0.9.0.1/kafka_2.10-0.9.0.1.tgz
  2. Unpack the files (in tab 1):
tar xvf kafka_2.10-0.9.0.1.tgz
  3. Build Kafka (in tab 1):
cd kafka_2.10-0.9.0.1
sbt update
sbt package

Note 1: If you get a warning that the sbt version is not set, create a new directory called “project” in /home/ubuntu/kafka_2.10-0.9.0.1. In that directory, create a file called “build.properties” containing the single line “sbt.version=1.0.2”.
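For example (using the Kafka directory from the steps above):
mkdir ~/kafka_2.10-0.9.0.1/project
echo "sbt.version=1.0.2" > ~/kafka_2.10-0.9.0.1/project/build.properties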

Note 2: If you get an error about a Java version mismatch, such as:
bc: command not found
The java installation you have is not up to date requires at least version 1.6+, you have version 1.8
the version check script is failing because the bc utility is missing; install it with “sudo apt-get install bc”.

  4. Start Kafka (in tab 1):
bin/zookeeper-server-start.sh config/zookeeper.properties&
bin/kafka-server-start.sh config/server.properties&
  5. Run the Kafka producer test harness to send some test messages (in tab 2):
cd kafka_2.10-0.9.0.1
bin/kafka-console-producer.sh --topic test --broker-list localhost:9092
  6. With the test harness running, type some random messages, then press Ctrl-D to finish (in tab 2). You will see a lot of log output when you type the first entry (related to the test harness setting up the channel); from then on, you should see no errors for subsequent entries.
  7. Run the consumer test harness to retrieve the messages (in tab 3):
cd kafka_2.10-0.9.0.1
bin/kafka-console-consumer.sh --topic test --zookeeper localhost:2181 --from-beginning
  8. You should see the messages you typed earlier played back to you (in tab 3).

2. Build the sample

We have loaded a sample project, including the build, source and configuration files, to GitHub. To build it:

  1. Clone the repository:
cd ~
git clone https://github.com/instaclustr/sample-KafkaSparkCassandra.git

The repository contains four active files:

  • build.sbt: the project file that specifies dependencies.
  • cassandra-count.conf: a configuration file with the cluster IPs, username and password.
  • src/main/scala/KafkaSparkCassandra.scala: the Scala file with the actual application. The code is heavily commented to explain what is going on.
  • project/assembly.sbt: sbt plugin configuration to package dependencies in the target jar (see the example below).
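
For reference, project/assembly.sbt typically contains a single plugin declaration along these lines (the version shown here is illustrative; check the file in the repository for the exact one):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")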

When executed, the application will:

  • Connect directly from the Spark driver to Cassandra and create a keyspace and table to store results, if required.
  • Start a Spark streaming session connected to Kafka, summarise the messages received in each 5-second period by counting words, and save the summary result in Cassandra.
  • Stop the streaming session after 30 seconds.
  • Use Spark SQL to connect to Cassandra and extract the summary results table data that has been saved.
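
To make that flow concrete, here is a minimal sketch of the core logic, assuming the spark-streaming-kafka-0-8 and DataStax spark-cassandra-connector libraries. The keyspace, table and consumer group names below are illustrative placeholders; the real, fully commented code is in KafkaSparkCassandra.scala:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._

object KafkaSparkCassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaSparkCassandra")

    // Stream from the local Kafka "test" topic in 5-second batches
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = KafkaUtils
      .createStream(ssc, "localhost:2181", "wordcount-group", Map("test" -> 1))
      .map { case (_, message) => message }

    // Count the words seen in each batch and save each batch's counts
    // to Cassandra ("sample_keyspace"/"wordcount" are placeholder names)
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveToCassandra("sample_keyspace", "wordcount")

    // Run the streaming session for 30 seconds, then stop it
    ssc.start()
    ssc.awaitTerminationOrTimeout(30 * 1000)
    ssc.stop(stopSparkContext = false)

    // Use Spark SQL to read back the summary table that was saved
    val spark = SparkSession.builder.config(conf).getOrCreate()
    spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "sample_keyspace", "table" -> "wordcount"))
      .load()
      .show()
  }
}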
  2. Build the project:
cd sample-KafkaSparkCassandra
sbt assembly
  3. Set your local configuration settings by either overwriting cassandra-count.conf with the one you created in the previous tutorial, or by editing the template from the repository to replace the values in <> brackets (see the example below).
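
For reference, cassandra-count.conf is a standard Spark properties file. A minimal sketch using the usual spark-cassandra-connector settings might look like the following (the exact keys are in the repository's template; the <> values are the ones you need to fill in):

spark.master                      spark://<spark_master_private_ip>:7077
spark.cassandra.connection.host   <cassandra_private_ip>
spark.cassandra.auth.username     <cassandra_username>
spark.cassandra.auth.password     <cassandra_password>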

3. Run the sample

At this stage, Kafka should still be running from section 1. We need to run the Kafka producer test harness and the Spark sample app at the same time, so it is easiest to have two console windows open. Once you have the two windows open and logged in, follow these steps:

  1. In your first console window, start the Kafka producer test harness:
cd ~/kafka_2.10-0.9.0.1/
bin/kafka-console-producer.sh --topic test --broker-list localhost:9092
  2. In the second console window, submit your Spark job:
cd ~/sample-KafkaSparkCassandra/
~/spark-2.1.1-bin-hadoop2.6/bin/spark-submit --properties-file cassandra-count.conf --class KafkaSparkCassandra target/scala-2.11/cassandra-kafka-streaming-assembly-1.0.jar
  3. Switch back to the Kafka producer console window and enter some test messages for 20 seconds or so.
  4. Switch back to the Spark console window. Amidst the streams of log messages, you should see something like the following, which is the summary from a single Spark streaming batch:

[Screenshot: word-count summary output from a single Spark streaming batch]

  5. After 30 seconds of streaming has passed, you should see output like the following, which is a dump of the Cassandra table:

[Screenshot: dump of the Cassandra word-count results table]
