Getting started with Spark Jobserver and Instaclustr

Spark Jobserver is an open source project available on GitHub (originally created by Ooyala). You can submit jobs, contexts and JARs to the Jobserver using a RESTful interface. This tutorial demonstrates how to use the Jobserver to submit jobs to an Instaclustr Cassandra+Spark cluster. We will interact with the Jobserver using curl.

Spark Jobserver provides a simple, secure method of submitting jobs to Spark without many of the complex setup requirements of connecting to the Spark master directly.

If you’ve enabled encryption, you will need to have downloaded the cluster’s CA certificate, available in the zip file on the connection details page (this is the same CA certificate that is used for connecting to Cassandra).

If you’ve enabled authentication, you will need to supply a username and password when making HTTP requests to the Jobserver. You can do this in curl by using the flag -u username:password. These are also available on the connection details page of your cluster on the console.

The high level steps to follow are:

  1. Set up your environment.
  2. Build the sample.
  3. Run the sample.

1. Set up your environment

First, if you don’t already have one, create a Cassandra cluster with Spark enabled. All Instaclustr clusters with Spark enabled include Jobserver so you can use whatever settings make the most sense for your scenario.

Secondly, ensure that you have the necessary software installed on your machine to build the Spark jobs.

If you have already done one of our other Spark tutorials, the Spark client machine that you set up for those tutorials can be used for this tutorial. However, one of the advantages of using Jobserver is that less setup and network configuration is required to use it.

The software that you will need installed is:

  1. A Java 1.8 JDK
  2. sbt
  3. git (to retrieve the samples)

These are readily available and easily installed for most systems. Some examples of how to install are:

  • Ubuntu:
    • sudo apt-get install default-jdk
    • echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
    • sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
    • sudo apt-get update
    • sudo apt-get install sbt
    • sudo apt-get install git
  • Mac:
    • Install brew if you don’t already have it (http://brew.sh/)
    • brew update
    • brew tap caskroom/cask
    • brew install Caskroom/cask/java
    • brew install sbt
    • brew install git

Finally, you will need to ensure the machine you will be working on has access to Spark Jobserver through the firewall. Log in to the Instaclustr console, navigate to the cluster settings page and add the IP address of your workstation to the Spark Jobserver allowed addresses. (If you’re unsure of the IP address of your workstation, go to Google and search for “what is my ip”.)

2. Build the sample

We have loaded a sample project including the build, source and configuration files to Github. To build this:

  1. Clone the repository:
git clone https://github.com/instaclustr/sample-SparkJobserverCassandra
  2. Build the project:
cd sample-SparkJobserverCassandra
sbt assembly

The repository contains 3 source files:

  • build.sbt: the project file that specifies dependencies.
  • src/main/scala/cassandraCount.scala: the scala file with the actual application. The code is brief (~10 lines) and heavily commented to explain what is going on.
  • project/assembly.sbt: sbt plugin config to package dependencies in the target jar.

When executed, the application will use the Cassandra connector to create an RDD based on a Cassandra table, count the number of rows in the RDD and return the result.
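For reference, the sketch below shows roughly what a Jobserver-compatible job of this kind looks like. It is illustrative only, not a copy of the repository’s cassandraCount.scala, and the table being counted here is an assumption:

import com.datastax.spark.connector._            // adds cassandraTable() to SparkContext
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

// Illustrative sketch only - the real cassandraCount.scala in the sample repository may differ.
object cassandraCount extends SparkJob {

  // Jobserver calls validate() before runJob(); accept every invocation here.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  // The SparkContext (with its Cassandra connection settings) is supplied by the
  // Jobserver-managed context, so the job never creates a context of its own.
  override def runJob(sc: SparkContext, config: Config): Any = {
    // Build an RDD over a Cassandra table and return its row count.
    // The table used here is an assumption chosen for illustration.
    sc.cassandraTable("system", "schema_keyspaces").count()
  }
}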

3. Run the sample

Upload the jar

As the first step we will upload the JAR to the Jobserver, which will allow us to make future calls to it. The examples below include the --cacert and -u options, which assume that authentication and SSL are enabled. If they aren't enabled, just proceed without those flags (and use http rather than https).

curl --cacert cluster-ca-certificate.pem -u icspark:<password> --data-binary @target/scala-2.10/cassandra-count-assembly-1.0.jar https://<sparkJobServerIP>:8090/jars/cassandra-count

curl will return with ‘OK’ to indicate success. You can also make a GET request to the Jobserver to verify that the JAR has indeed been uploaded:

curl --cacert cluster-ca-certificate.pem -u icspark:<password> https://<sparkJobServerIP>:8090/jars
{
    "cassandra-count": "2015-11-16T22:44:31.775Z"
}

Uploading Contexts to the Jobserver

Contexts are created on the Jobserver separately from jobs, and each job specifies which of the available contexts it should run in. Jobserver manages the context on your behalf, so the job does not initialize a new context in its main function. The Jobserver uses the following context settings by default:

  • num-cpu-cores = 2
  • memory-per-node = 512m

We will create a new context, as the cassandra-count job requires connecting to Cassandra:

curl --cacert cluster-ca-certificate.pem -d "" -u icspark:<password> 'https://<sparkJobserverIP>:8090/contexts/test-context?spark.cassandra.auth.username=iccassandra&spark.cassandra.auth.password=<password>&spark.cassandra.connection.host=<PRIVATE_IP_OF_CASSANDRA_NODE>'

curl will return OK once the context has been created. We can now use this context when running the job. You can specify other context parameters when creating a context, and they will override the defaults where applicable (an example follows below). Context names are unique: there can only be one context named test-context. You can delete an existing context by making a DELETE request to it:

curl --cacert cluster-ca-certificate.pem -u icspark:<password> --request DELETE https://<sparkJobserverIP>:8090/contexts/test-context

This will stop all jobs running in that context, so be careful! 
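If you delete and recreate a context, or want different resource settings in the first place, the default settings listed above can be overridden in the same query string used to create the context. The values below are purely illustrative:

curl --cacert cluster-ca-certificate.pem -d "" -u icspark:<password> 'https://<sparkJobserverIP>:8090/contexts/test-context?num-cpu-cores=4&memory-per-node=1g&spark.cassandra.auth.username=iccassandra&spark.cassandra.auth.password=<password>&spark.cassandra.connection.host=<PRIVATE_IP_OF_CASSANDRA_NODE>'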

Running our cassandra-count job

We are now ready to run the job. We do so by making a POST request to the Jobserver's /jobs endpoint, specifying the application name, the class to run and the context to use:

curl --cacert cluster-ca-certificate.pem -u icspark:<password> -d "" 'https://<sparkJobserverIP>:8090/jobs?appName=cassandra-count&classPath=cassandraCount&context=test-context'

Here we've told the Jobserver to run our job using the context we created earlier. Jobserver will return something that looks like this:

{
    "status": "STARTED",
    "result": {
        "jobId": "6d6350d6-7c67-4cd7-8129-c36d4985ca80",
        "context": "test-context"
    }
}

Alternatively, for small jobs you can append &sync=true to the submission URL, which will cause curl to wait for the result of the job (an example is shown at the end of this section). We can query for the result or status of the job by making a GET request to /jobs/<uuid>:

curl --cacert cluster-ca-certificate.pem -u icspark:<password> https://<sparkJobserverIP>:8090/jobs/6d6350d6-7c67-4cd7-8129-c36d4985ca80

If the job has finished, you will see the following result:

{
    "status": "FINISHED",
    "result": 5
}

This tells us that the job completed successfully and that the number of tables in the system keyspace is 5.
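As mentioned above, for small jobs you can avoid polling by submitting them with &sync=true, in which case curl blocks until the result comes back. An illustrative example (synchronous requests are subject to a timeout, so this is only suitable for short-running jobs):

curl --cacert cluster-ca-certificate.pem -u icspark:<password> -d "" 'https://<sparkJobserverIP>:8090/jobs?appName=cassandra-count&classPath=cassandraCount&context=test-context&sync=true'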

The Spark Jobserver UI

The Jobserver UI is available on port 8090 of your Spark Jobserver instance’s IP. It shows all your currently uploaded JARs, contexts, and failed/running/successful jobs.

[Screenshot: Spark Jobserver UI]

Further Reading

The Spark Jobserver GitHub page contains a lot of useful information about using the Spark Jobserver.
