Spark-Scala with Cassandra - scala

I am beginner with Spark, Scala and Cassandra. I am working with ETL programming.
Now my project ETL POCs required Spark, Scala and Cassandra. I configured Cassandra with my ubuntu system in /usr/local/Cassandra/* and after that I installed Spark and Scala. Now I am using Scala editor to start my work, I created simply load a file in landing location, but after that I am trying to connect with cassandra in scala but I am not getting an help how we can connect and process the data in destination database?.
Any one help me Is this correct way? or some where I am wrong? please help me to how we can achieve this process with above combination.
Thanks in advance!

Add spark-cassandra-connector to your pom or sbt by reading instruction, then work this way
Import this in your file
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
spark scala file
object SparkCassandraConnector {
def main(args: Array[String]) {
val conf = new SparkConf(true)
.setAppName("UpdateCassandra")
.setMaster("spark://spark:7077") // spark server
.set("spark.cassandra.input.split.size_in_mb","67108864")
.set("spark.cassandra.connection.host", "192.168.3.167") // cassandra host
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
// connecting with cassandra for spark and sql query
val spark = SparkSession.builder()
.config(conf)
.getOrCreate()
// Load data from node publish table
val df = spark
.read
.cassandraFormat( "table_nmae", "keyspace_name")
.load()
}
}
This will work for spark 2.2 and cassandra 2

you can perform this easly with spark-cassandra-connector

Related

Spark StreamingContext

I have a problem running Spark Streaming. Can someone please help me below?
Since you are using /Filestore, I believe you are using databricks.
Below code would help you to start a spark streaming context.
If you are using databricks, clear all the states and run the below code.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 1)
dstream = ssc.textFileStream("<Folder/File location")
dstream.saveAsTextFiles("<Destination folder/file location")
ssc.start()
ssc.awaitTermination()
I would suggest you to start using spark structured streaming, instead of using standard streaming option.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Spark Local Session with custom maven library

In my scala code, which I run thru sbt run command I am creating local spark session and I need to make use of following library: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
My code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.eventhubs._
...
val spark = SparkSession.builder
.master("local")
.appName("RandomForestClassifierExample")
.getOrCreate()
...
val connectionString = ConnectionStringBuilder("<connectionstring>")
.setEventHubName("energinet")
.build
val eventHubsConf = EventHubsConf(connectionString)
.setStartingPosition(EventPosition.fromEndOfStream)
.setConsumerGroup("$default")
val eventhubs = spark.readStream
.format("eventhubs")
.options(eventHubsConf.toMap)
.load()
Of course it fails, because of missing event hubs library. I know I can run spark-submit and pull the library by setting --packages parameter, however I want to run my app using sbt run command. Please is there a way, how to make the library available for local spark sessions I create from scala code?

Execute job on distributed cassandra DSE spark cluster

I have three node Cassandra DSE cluster and db schema with RF=3. Now I'm creating a scala application to be executed on DSE spark. Scala code is as follow :-
package com.spark
import com.datastax.spark.connector._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
object sample {
def main(args: Array[String]) {
val conf = new SparkConf()
.setMaster("local")
.setAppName("testing")
.set("spark.cassandra.connection.host", "192.168.0.40")
.set("spark.driver.allowMultipleContexts", "true")
.set("spark.executor.memory", "1g")
.set("spark.driver.memory", "1g")
.set("spark.driver.maxResultSize", "500M")
.set("spark.executor.heartbeatInterval", "30s")
.set("spark.submit.deployMode", "cluster")
val sc = new SparkContext(conf)
val lRDD = sc.cassandraTable("dbname", "tablename")
lRDD.collect.foreach(println)
}}
I'm running script using
dse> bin/dse spark-submit --class com.spark.sample --total-executor-cores 4 /home/db-svr/sample.jar
So, now I want to execute my spark application from 1 node but system should do processing on 3 nodes internally and I want to monitor the same so that I can utilize RAM and processor collectively of 3 nodes. How can I do that ?
Also, this current script is taking lot of time to bring result (table size 1 million rows with 128 byte each). Is there any performance tuning parameters that I'm missing?
There a few things you probably want to change. The main thing stopping you from running on multiple machines is
.setMaster("local")
Which instructs the application that it shouldn't use a distributed Resource Manager and instead should run everything locally in the application process. With DSE you should follow the relevant documentation or start with the Spark Build Examples.
In addition you most likely never want to set
.set("spark.driver.allowMultipleContexts", "true")
having multiple Spark Contexts in one JVM is frought with problems and usually means things are not set up correctly.

How spark-shell or Zepellin notebook set HiveContext to SparkSession?

Does anyone know why I can access to an existing hive table from spark-shell or zepelling notebook doing this
val df = spark.sql("select * from hive_table")
But when I submit a spark jar with a spark object created this way,
val spark = SparkSession
.builder()
.appName("Yet another spark app")
.config("spark.sql.shuffle.partitions", 18)
.config("spark.executor.memory", "2g")
.config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
.getOrCreate()
I got this
Table or view not found
What I really want is to learn, understand, what the shell and the notebooks are doing for us in order to provide hive context to the SparkSession.
When working with Hive, one must instantiate SparkSession with Hive support
You need to call enableHiveSupport() on the session builder

Not able to load hive table into Spark

I am trying to load data from hive table using spark-sql. However, it doesn't return me anything. I tried to execute the same query in hive and it prints out the result. Below is my code which I am trying to execute in scala.
sc.setLogLevel("ERROR")
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
val sqlContext = new HiveContext(sc)
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
val data = sqlContext.sql("select `websitename` from db1.table1 limit 10").toDF
Kindly let me know what could be the possible reason.
Spark- version : 1.6.2
Scala - 2.10
Depends how the table was created in the first place. If it was created by an external application and you have hive running as separate service make sure that the settings in SPARK_HOME/conf/hive-site.xml are correct.
If it's an internal spark-sql table, it sets up the metastore in a folder on the master node, which in your case might have been deleted or moved.