Scala Code to connect to Spark and Cassandra - scala

I have Scala (IntelliJ) running on my laptop. I also have Spark and Cassandra running on machines A, B, and C (a 3-node cluster using DataStax, running in Analytics mode).
I tried running Scala programs directly on the cluster and they run fine.
Now I need to write the code in IntelliJ on my laptop and run it against the cluster. How do I connect and run it? I know I am making a mistake in the code; I used generic values and need help writing the specific configuration. For example, localhost is incorrect.
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._ // needed for sc.cassandraTable

object HelloWorld {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark:master", "localhost")
    val sc = new SparkContext(conf)
    val data = sc.cassandraTable("my_keyspace", "my_table")
  }
}

val conf = new SparkConf().setAppName("APP_NAME")
  .setMaster("local")
  .set("spark.cassandra.connection.host", "localhost")
  .set("spark.cassandra.auth.username", "")
  .set("spark.cassandra.auth.password", "")
Use the code above to connect to a local Spark and Cassandra. If your Cassandra cluster has authentication enabled, fill in the username and password.
If you want to connect to a remote Spark and Cassandra cluster, replace localhost with the Cassandra host, and in setMaster use spark://SPARK_HOST:7077 (the standalone master URL).
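For example, a minimal sketch of the remote configuration, assuming placeholder addresses SPARK_HOST and CASSANDRA_HOST and the default standalone master port 7077:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // provides sc.cassandraTable

object HelloWorldRemote {
  def main(args: Array[String]): Unit = {
    // SPARK_HOST and CASSANDRA_HOST are placeholders for your cluster addresses
    val conf = new SparkConf()
      .setAppName("HelloWorldRemote")
      .setMaster("spark://SPARK_HOST:7077") // standalone master URL
      .set("spark.cassandra.connection.host", "CASSANDRA_HOST") // any Cassandra contact point
      .set("spark.cassandra.auth.username", "cassandra") // only if authentication is enabled
      .set("spark.cassandra.auth.password", "cassandra")
    val sc = new SparkContext(conf)
    val data = sc.cassandraTable("my_keyspace", "my_table")
    println(data.count())
    sc.stop()
  }
}

Note that the laptop must be able to reach both the Spark master port and the Cassandra native port (9042) for this to work.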

Related

Spark program taking hadoop configurations from an unspecified location

I have a few test cases, such as reading and writing a file on HDFS, that I want to automate using Scala and run with Maven. I have put the Hadoop configuration files of the test environment in the resources directory of my Maven project. The project also runs fine on the desired cluster from whichever cluster I run it from.
The one thing I don't understand is how Spark picks up the Hadoop configuration from the resources directory even though I have not specified it anywhere in the project. Below is a code snippet from the project.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

def getSparkContext(hadoopConfiguration: Configuration): SparkContext = {
  val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
  val hdfsCoreSitePath = new Path("/etc/hadoop/conf/core-site.xml", "core-site.xml")
  val hdfsHDFSSitePath = new Path("/etc/hadoop/conf/hdfs-site.xml", "hdfs-site.xml")
  val hdfsYarnSitePath = new Path("/etc/hadoop/conf/yarn-site.xml", "yarn-site.xml")
  val hdfsMapredSitePath = new Path("/etc/hadoop/conf/mapred-site.xml", "mapred-site.xml")
  hadoopConfiguration.addResource(hdfsCoreSitePath)
  hadoopConfiguration.addResource(hdfsHDFSSitePath)
  hadoopConfiguration.addResource(hdfsYarnSitePath)
  hadoopConfiguration.addResource(hdfsMapredSitePath)
  hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
  UserGroupInformation.setConfiguration(hadoopConfiguration)
  UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
  println("-----------------Logged-in via keytab---------------------")
  FileSystem.get(hadoopConfiguration)
  val sc = new SparkContext(conf)
  sc
}
@Test
def testCase(): Unit = {
  val hadoopConfiguration: Configuration = new Configuration()
  val sc = getSparkContext(hadoopConfiguration)
  // rest of the code
  // ...
  // ...
}
Here I have created a hadoopConfiguration object, but I am not passing it to the SparkContext anywhere, since this should run the tests on the cluster I am running the project from and not on some remote test environment.
If this is not the correct way, can anyone explain how I should achieve my goal of running Spark test cases against the test environment from some other remote cluster?
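One way to make the target environment explicit, rather than relying on whatever *-site.xml files happen to be on the classpath, is to pass Hadoop settings through the SparkConf with the spark.hadoop. prefix, which Spark copies into the Hadoop configuration it creates. A minimal sketch, with a placeholder namenode address:

import org.apache.spark.{SparkConf, SparkContext}

// "namenode-test" is a placeholder for the test environment's HDFS namenode.
// Any property prefixed with "spark.hadoop." is copied by Spark into its
// Hadoop configuration and overrides the value picked up from the classpath.
val conf = new SparkConf()
  .setAppName("SparkTest")
  .setMaster("local")
  .set("spark.hadoop.fs.defaultFS", "hdfs://namenode-test:8020")

val sc = new SparkContext(conf)
println(sc.hadoopConfiguration.get("fs.defaultFS")) // hdfs://namenode-test:8020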

Execute job on distributed cassandra DSE spark cluster

I have a three-node Cassandra DSE cluster and a db schema with RF=3. Now I'm creating a Scala application to be executed on DSE Spark. The Scala code is as follows:
package com.spark

import com.datastax.spark.connector._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext

object sample {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("testing")
      .set("spark.cassandra.connection.host", "192.168.0.40")
      .set("spark.driver.allowMultipleContexts", "true")
      .set("spark.executor.memory", "1g")
      .set("spark.driver.memory", "1g")
      .set("spark.driver.maxResultSize", "500M")
      .set("spark.executor.heartbeatInterval", "30s")
      .set("spark.submit.deployMode", "cluster")
    val sc = new SparkContext(conf)
    val lRDD = sc.cassandraTable("dbname", "tablename")
    lRDD.collect.foreach(println)
  }
}
I'm running the script using:
dse> bin/dse spark-submit --class com.spark.sample --total-executor-cores 4 /home/db-svr/sample.jar
So now I want to execute my Spark application from one node, but the system should do the processing on all 3 nodes internally, and I want to monitor that so I can use the RAM and processors of the 3 nodes collectively. How can I do that?
Also, the current script takes a lot of time to return the result (the table has 1 million rows of 128 bytes each). Are there any performance tuning parameters that I'm missing?
There are a few things you probably want to change. The main thing stopping you from running on multiple machines is
.setMaster("local")
which instructs the application not to use a distributed resource manager and instead to run everything locally in the application process. With DSE you should follow the relevant documentation or start with the Spark Build Examples.
In addition, you most likely never want to set
.set("spark.driver.allowMultipleContexts", "true")
Having multiple Spark contexts in one JVM is fraught with problems and usually means things are not set up correctly.
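As a sketch of the first point (the host, keyspace and table values are just the placeholders from the question), the conf can omit setMaster entirely and let dse spark-submit supply the master for the Analytics cluster:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object sample {
  def main(args: Array[String]): Unit = {
    // No setMaster and no allowMultipleContexts: dse spark-submit
    // supplies the master URL of the DSE Analytics cluster.
    val conf = new SparkConf()
      .setAppName("testing")
      .set("spark.cassandra.connection.host", "192.168.0.40")
    val sc = new SparkContext(conf)
    val lRDD = sc.cassandraTable("dbname", "tablename")
    println(lRDD.count()) // avoid collecting a million rows onto the driver
  }
}

It would still be submitted with dse spark-submit --class com.spark.sample as before, and the work is then distributed across the three Analytics nodes.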

Clarification about running spark jobs on a cluster (AWS)

I have a HortonWorks cluster running on AWS EC2 machines on which I would like to run a Spark Streaming job that will swallow tweets concerning Game of Thrones.
Before trying to run it on my cluster I ran it locally.
The code works; here it is:
import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.spark.streaming.twitter._
import org.apache.spark.{SparkConf, SparkContext}

object Twitter_Stream extends App {

  val consumerKey = "hidden"
  val consumerSecret = "hidden"
  val accessToken = "hidden"
  val accessTokenSecret = "hidden"

  val sparkConf = new SparkConf().setAppName("GotTweets").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(1))

  val myStream = TwitterUtils.createStream(ssc, None, Array("#GoT", "#WinterIsHere", "#GameOfThrones"))

  val rddTweets = myStream.foreachRDD(rdd => {
    rdd.take(10).foreach(println)
  })

  ssc.start()
  ssc.awaitTermination()
}
My question is more precisely about this specific line of code:
val sparkConf = new SparkConf().setAppName("GotTweets").setMaster("local[2]")
I replaced "local[2]" with "spark://ip-address-EC2:7077", which corresponds to one of my EC2 machines, but I get a connection failure.
I'm sure that port 7077 is open on that machine.
Also, when I run this code with this configuration (setMaster("local[2]")) on one of my EC2 machines, will Spark use all the machines of the cluster or will it run only on a single machine?
Here is the exception:
17/07/24 11:53:42 INFO AppClient$ClientEndpoint: Connecting to master spark://ip-adress:7077...
17/07/24 11:53:44 WARN AppClient$ClientEndpoint: Failed to connect to master ip-adress:7077
java.io.IOException: Failed to connect to spark://ip-adress:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
To run a Spark application on YARN, you should submit it with spark-submit --master yarn. There is no need to call setMaster inside the Scala source code.
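A minimal sketch of that suggestion, leaving the master to the command line:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The master is intentionally not set here; spark-submit --master yarn supplies it.
val sparkConf = new SparkConf().setAppName("GotTweets")
val ssc = new StreamingContext(sparkConf, Seconds(1))

The job would then be launched with something like spark-submit --master yarn --class Twitter_Stream got-tweets-assembly.jar (the jar name here is a placeholder), assuming the spark-streaming-twitter dependency is packaged into the jar.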

Spark-Scala with Cassandra

I am a beginner with Spark, Scala and Cassandra, and I am working on ETL programming.
My ETL POC requires Spark, Scala and Cassandra. I configured Cassandra on my Ubuntu system under /usr/local/Cassandra/* and after that I installed Spark and Scala. Now I am using a Scala editor to start my work. I managed to load a file into the landing location, but when I try to connect to Cassandra from Scala I cannot find any help on how to connect and process the data into the destination database.
Can anyone tell me whether this is the correct way, or where I am going wrong? Please help me understand how to achieve this process with the above combination.
Thanks in advance!
Add spark-cassandra-connector to your pom or sbt by following its instructions, then work this way.
Import this in your file:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
Spark Scala file:
object SparkCassandraConnector {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .setAppName("UpdateCassandra")
      .setMaster("spark://spark:7077") // Spark master
      .set("spark.cassandra.input.split.size_in_mb", "64") // value is in MB, not bytes
      .set("spark.cassandra.connection.host", "192.168.3.167") // Cassandra host
      .set("spark.cassandra.auth.username", "cassandra")
      .set("spark.cassandra.auth.password", "cassandra")

    // SparkSession configured for both Spark SQL and Cassandra
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    // Load data from the source table
    val df = spark
      .read
      .cassandraFormat("table_name", "keyspace_name")
      .load()
  }
}
This works with Spark 2.2 and Cassandra 2.x; you can do all of this easily with spark-cassandra-connector.
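As a follow-up sketch (the destination table and keyspace names are placeholders, and the table must already exist in Cassandra), writing processed data back to the destination database uses the same implicits on the write path:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.cassandra._

// Append the DataFrame loaded above into a hypothetical destination table.
df.write
  .cassandraFormat("destination_table", "keyspace_name")
  .mode(SaveMode.Append)
  .save()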

Why does Spark Cassandra Connector fail with NoHostAvailableException?

I am having problems getting Spark Cassandra Connector working in Scala.
I'm using these versions:
Scala 2.10.4
spark-core 1.0.2
cassandra-thrift 2.1.0 (my installed cassandra is v2.1.0)
cassandra-clientutil 2.1.0
cassandra-driver-core 2.0.4 (recommended for connector?)
spark-cassandra-connector 1.0.0
I can connect and talk to Cassandra (w/o spark) and I can talk to Spark (w/o Cassandra) but the connector gives me:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.0.194:9042 (com.datastax.driver.core.TransportException: [/10.0.0.194:9042] Cannot connect))
What am I missing? Cassandra is a default install (port 9042 for cql according to cassandra.yaml). I'm trying to connect locally ("local").
My code:
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext("local","test",conf)
val rdd = sc.cassandraTable("myks","users")
val rr = rdd.first
println(s"Result: $rr")
"local" in this context specifies the Spark master (telling it to run in local mode), not the Cassandra connection host.
To set the Cassandra connection host you have to set a different property in the Spark config:
import org.apache.spark._

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "IP Cassandra Is Listening On")
  .set("spark.cassandra.auth.username", "cassandra") // Optional
  .set("spark.cassandra.auth.password", "cassandra") // Optional

val sc = new SparkContext("spark://Spark Master IP:7077", "test", conf)
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
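Putting it together for the local setup in the question (the keyspace myks and table users come from the question; 127.0.0.1 assumes Cassandra is listening on the loopback address), a minimal sketch would be:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf(true)
  .setAppName("Simple Application")
  .set("spark.cassandra.connection.host", "127.0.0.1") // where Cassandra listens for CQL (port 9042)
val sc = new SparkContext("local", "test", conf)

val rdd = sc.cassandraTable("myks", "users")
println(s"Result: ${rdd.first}")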