Using Apache Zeppelin with Instaclustr Spark & Cassandra Tutorial

Apache Zeppelin is a web-based notebook that facilitates interactive data analysis using Spark. Instaclustr now supports Apache Zeppelin as an add-on component to our managed clusters. In this tutorial, we will walk you through the basic steps of using Apache Zeppelin with Instaclustr Spark and Cassandra.

1. Provision a cluster with Cassandra, Spark and Zeppelin

(1) If you haven’t already signed up for an Instaclustr account, refer to our support article on signing up and creating an account.

(2) Once you have signed up for Instaclustr and verified your email, log into the Instaclustr console and click the Create Cassandra Cluster button.

creating_cluster_01_final.png

(3) On the Create Cassandra Cluster page, enter an appropriate name and network address block for your cluster. Refer to our support article on Network Address Allocation to understand how we divide up the specified network range to determine the node IP addresses. Under the Applications section, select:

  • Apache Cassandra 3.11
  • Apache Spark as an Add-on (Apache Spark 2.1.1 - Hadoop 2.6)
  • Apache Zeppelin as an Add-on (Apache Zeppelin 0.7.1 with Scala 2.11/Spark 2.1.1)

application_section.png

(4) Under the Data Centre section, select:

data_center_section.png

(5) Under the Cassandra Options section, select:

  • Use Private IP Addresses for node discovery

cassandra_options_section.png

(6) Leave the other options at their defaults. Accept the terms and conditions and click the Create Cluster button. The cluster will be provisioned automatically and will be available for use once all nodes are in the running state.

create_cluster.png

2. Getting Started with Zeppelin

(1) Once all nodes in the cluster are in the running state, click on the Zeppelin tab on the cluster’s page.

zeppelin_tab.png

(2) Go to the listed URL and enter the given credentials to access Zeppelin.

zeppelin_url_crentials.png

(3) You should then see the following page.

welcome_zeppelin.png

3. Basic Interaction with Zeppelin Notebook

(1) Create a new Notebook by clicking on the Create new note link. Give your note a name, leave Spark as the Default Interpreter, and click the Create Note button.

create_new_note.png

create_new_note_button.png

(2) The notebook is preconfigured to use the Spark interpreter. Click the gear button at the top right of the notebook to see the enabled interpreters.

gear_button.png

(3) Make sure the Spark interpreter is at the top of the list and the Cassandra interpreter is enabled. Click the Save button to save the settings.

interpreters.png

(4) Load the Spark Cassandra Connector dependency using the following code.

%dep
z.load("/opt/zeppelin/interpreter/spark/spark-cassandra-connector-assembly-2.0.2.jar")

You should then see the following output:

output.png 

Make sure you get the same output as shown in the picture above. If the paragraph throws an error, click the gear button at the top right, go to the Interpreter menu and restart the Spark interpreter. Then go back to the Notebook and re-run the code.

rerun_interpreter.png

rerun_interpreter_2.png
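
If the connector assembly jar is not present at that exact path on your cluster, the Zeppelin dependency loader can usually fetch it from Maven coordinates instead. The following is only a sketch, assuming the nodes have internet access and that connector 2.0.2 built for Scala 2.11 matches the Spark 2.1.1 add-on:

%dep
// Alternative sketch (assumed setup): load the Spark Cassandra Connector from Maven Central
// instead of the locally installed assembly jar. Adjust the version to match your Spark build.
z.load("com.datastax.spark:spark-cassandra-connector_2.11:2.0.2")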

(5) Run the following code:

%spark
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
val rdd = sc.cassandraTable("system_schema","keyspaces")
println("Row count:" + rdd.count)

You should then get a result like the following:

result_2.png
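
The connector’s RDD API can also push column selection down to Cassandra. As an additional sketch against the same system table (system_schema.keyspaces exposes a keyspace_name column in Cassandra 3.x), you could run a paragraph like:

%spark
// Sketch: fetch only the keyspace_name column from system_schema.keyspaces,
// letting the connector push the projection down to Cassandra.
val names = sc.cassandraTable("system_schema", "keyspaces").select("keyspace_name")
names.collect.foreach(println)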

4. Using Spark SQL from Zeppelin Notebook 

(1) In the same Notebook, add a new paragraph and run the following code.

%spark
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql._
val createDDL = """CREATE TEMPORARY VIEW keyspaces
USING org.apache.spark.sql.cassandra
OPTIONS (
table "keyspaces",
keyspace "system_schema",
pushdown "true")"""
spark.sql(createDDL)
spark.sql("SELECT * FROM keyspaces").show
val rdd1 = spark.sql("SELECT count(*) from keyspaces")
println("Row count: " + rdd1.first()(0))

You should then get a result like the following:

output_sparkSQL_edited.png

Note that if you run the above code in a new Notebook, you will have to load the dependencies in that Notebook first.
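
As an alternative to creating a temporary view with DDL, the org.apache.spark.sql.cassandra._ import also provides a cassandraFormat helper for the DataFrame reader in connector 2.0.x. A minimal sketch that reads the same table:

%spark
// Sketch: read the same Cassandra table directly into a DataFrame
// using the cassandraFormat helper added by org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
val keyspacesDF = spark.read.cassandraFormat("keyspaces", "system_schema").load()
keyspacesDF.select("keyspace_name").show()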

5. Using CQL from Zeppelin Notebook

Zeppelin can also be used to connect directly to Cassandra to execute CQL commands.

(1) Create a new Notebook.

(2) Put the following code into your Notebook and run it.

%cassandra
USE "system_schema";
SELECT * FROM keyspaces;

You should then get a result like the following:

output_CQL.png
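
The Cassandra interpreter accepts any CQL statement, so you can also create and populate your own tables from a paragraph. The snippet below is only an illustrative sketch: the zeppelin_demo keyspace and users table are hypothetical names, and the replication settings should be adjusted for your cluster.

%cassandra
// Hypothetical sketch: create a demo keyspace and table, insert a row, and read it back.
// SimpleStrategy with replication_factor 1 is for illustration only; adjust for your cluster.
CREATE KEYSPACE IF NOT EXISTS zeppelin_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS zeppelin_demo.users (id int PRIMARY KEY, name text);
INSERT INTO zeppelin_demo.users (id, name) VALUES (1, 'Alice');
SELECT * FROM zeppelin_demo.users;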

 
