Using Apache Zeppelin with Instaclustr Spark & Cassandra Tutorial

Zeppelin is a web-based notebook that facilitates interactive data analysis using Spark. Instaclustr is planning to add Zeppelin as a standard component of our Spark clusters shortly. In the meantime, this tutorial demonstrates how you can set up Zeppelin to interact with Instaclustr Spark and Cassandra. We assume that a cluster and a Spark client have already been provisioned and set up properly, as shown in our “Getting started with Instaclustr Spark and Cassandra” tutorial.

Note: Instaclustr now supports Apache Zeppelin as an add-on component to our managed clusters. Just click the Apache Zeppelin option when creating your cluster and skip straight to Step 5 below.

1. Install Zeppelin on Spark Client

The following steps demonstrate how to install Zeppelin from source. 

(1) Log in to your Spark client.

(2) Install npm (node package manager, required by the Zeppelin build process).

sudo apt-get install npm

(3) Create a symbolic link for node.

sudo ln -s /usr/local/bin/nodejs /usr/local/bin/node

(4) Download Maven (it needs to be installed manually to get the correct version).

wget http://www.eu.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz

(5) Unpack the files.

sudo tar -zxf apache-maven-3.3.3-bin.tar.gz -C /usr/local/

(6) Create a symbolic link for mvn.

sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/local/bin/mvn

(7) Install Git.

sudo apt-get install git

(8) Clone Zeppelin repository.

git clone https://github.com/apache/incubator-zeppelin.git

(9) Build Zeppelin with Spark Interpreter.

cd ~/incubator-zeppelin
mvn clean package -Pspark-1.4 -DskipTests

 

2. Configure Network Access

This tutorial assumes that a VPC peering connection has been set up properly, as shown in section 3 of the “Getting started with Instaclustr Spark and Cassandra” tutorial.

In addition, to browse the notebook from your local PC, you will need to allow your PC to connect to the Spark client on port 8080 in the AWS security group. (8080 is Zeppelin’s default port.) If you need more detailed instructions for this, see the earlier tutorial.
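As a rough sketch, the inbound rule you add to the security group would look something like the following (the source value is a placeholder; use your own PC’s public IP, or a wider range if your IP changes):

```
Type:        Custom TCP Rule
Protocol:    TCP
Port Range:  8080
Source:      <your PC public IP>/32
```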

 

3. Run Zeppelin

(1) Log in to your Spark client machine and change directory to incubator-zeppelin.

(2) Start Zeppelin using the following command.

./bin/zeppelin-daemon.sh start

If needed, you can also use the following commands to stop Zeppelin and check its status:

./bin/zeppelin-daemon.sh stop
./bin/zeppelin-daemon.sh status

(3) Browse to http://<Spark client public IP>:8080. You should see the following page.

(4) Walk through the built-in Notebook tutorial (by clicking on the Zeppelin Tutorial link). It will help you get familiar with Zeppelin.

 

4. Configure Zeppelin through the UI

Zeppelin provides a UI for configuring Spark and other interpreters. The Interpreter page shows all the interpreters and their properties, and lets you add, delete, and change both. To connect to the cluster, you must check that the red-framed properties in the following picture have been set correctly. Once you have set these properties, save them and restart the Spark interpreter; they will be automatically injected into the Spark context ‘sc’. If you have enabled authentication on your Instaclustr cluster, you need to add two more properties:

spark.cassandra.auth.username: iccassandra    

spark.cassandra.auth.password: <iccassandra password> 
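For reference, the connection-related Spark interpreter settings typically end up looking something like the following. The values are placeholders for your own cluster; `spark.cassandra.connection.host` is the standard Spark Cassandra Connector property, and `master` should point at your Spark master:

```
master                           spark://<Spark master private IP>:7077
spark.cassandra.connection.host  <Cassandra node private IP>
spark.cassandra.auth.username    iccassandra
spark.cassandra.auth.password    <iccassandra password>
```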

 

5. Basic interaction with Zeppelin Notebook

(1) Create a new Notebook.

(2) Load dependencies.

%dep
z.load("/home/ubuntu/spark-cassandra-connector-assembly-1.6.0-M1.jar")

Note that if you are using our Zeppelin offering, this should be:

%dep
z.load("/opt/zeppelin/interpreter/spark/spark-cassandra-connector-assembly-1.6.0-M1.jar")

Then you will see the following output:

Make sure you get the same output as shown in the above picture. If it throws an error, go to the Interpreter page and restart the Spark interpreter, then go back to the Notebook and re-run the code.

(3) Run the following code.

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
val rdd = sc.cassandraTable("system","schema_keyspaces")
println("Row count:" + rdd.count)

You should then get a result like the following:

 

6. Using Spark SQL from Zeppelin Notebook

(1) In the same Notebook, add a new paragraph, write and run the following code.

import org.apache.spark.sql.cassandra.CassandraSQLContext
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql
val csc=new CassandraSQLContext(sc)
val rdd1=csc.sql("SELECT count(*) from system.schema_keyspaces")
println("Row count:"+rdd1.first()(0))

You should then get a result like the following:

If you want to run the above code in a new Notebook, you will have to load the dependencies in that Notebook first.

 

7. Using CQL from Zeppelin Notebook

Zeppelin can also be used to connect directly to Cassandra to execute CQL commands.

(1) Create a new Notebook.

(2) Go to the Interpreter page and change the value of the “cassandra.hosts” property to the private IP of one of your Cassandra nodes. If you have enabled authentication on your Instaclustr cluster, you need to add two more properties:

cassandra.credentials.username: iccassandra

cassandra.credentials.password: <iccassandra password>

Note that if you are using our Zeppelin offering, the Cassandra interpreter should already be pre-configured.

(3) Put the following code into your Notebook and run the code.

%cassandra
USE "system";
SELECT * FROM schema_keyspaces;

You should then get a result like the following:

 

 

If you have questions regarding this article, feel free to add them in the comments below.
