Spark Cassandra tuning - Scala

How do I set the following Cassandra write parameters in Spark Scala code?
DataStax Spark Cassandra Connector version: 1.6.3
Spark version: 1.6.2
spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes
spark.cassandra.output.batch.size.bytes
spark.cassandra.output.batch.grouping.key
Thanks,
Chandra

In DataStax Spark Cassandra Connector 1.6.x, you can pass these parameters as part of your SparkConf:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "192.168.123.10")
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
  .set("spark.cassandra.output.batch.size.rows", "100")
  .set("spark.cassandra.output.concurrent.writes", "100")
  .set("spark.cassandra.output.batch.size.bytes", "100")
  .set("spark.cassandra.output.batch.grouping.key", "partition")

val sc = new SparkContext("spark://192.168.123.10:7077", "test", conf)
You can refer to the connector's configuration reference (reference.md in the spark-cassandra-connector repository) for more information.
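With the configuration above in place, any write through the connector honours these settings. A minimal sketch of such a write (the keyspace, table and column names below are placeholders, not part of the original question):
import com.datastax.spark.connector._

// Assumes a table my_ks.kv (key text PRIMARY KEY, value int) already exists
val rows = sc.parallelize(Seq(("k1", 1), ("k2", 2)))

// saveToCassandra groups and batches rows according to the output.* settings above
rows.saveToCassandra("my_ks", "kv", SomeColumns("key", "value"))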

The most flexible way is to put those properties in a file, such as spark.conf:
spark.cassandra.output.concurrent.writes 10
etc...
and then create your spark context in your app with something like:
val conf = new SparkConf()
val sc = new SparkContext(conf)
and finally, when you submit your app, you can specify your properties file with:
spark-submit --properties-file spark.conf ...
Spark will automatically read your configuration from spark.conf when creating the SparkContext.
That way, you can modify the properties in spark.conf without needing to recompile your code each time.
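For example, a spark.conf carrying the four write settings from the question could look like this (the values are illustrative, not tuning recommendations):
spark.cassandra.output.batch.size.rows     100
spark.cassandra.output.concurrent.writes   10
spark.cassandra.output.batch.size.bytes    16384
spark.cassandra.output.batch.grouping.key  partition
submitted with:
spark-submit --properties-file spark.conf --class com.example.MyApp my-app.jar
where com.example.MyApp and my-app.jar stand in for your own main class and packaged jar.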

Related

How to use TestHiveContext using Spark 2.2

I am trying to upgrade to Spark 2.2 from Spark 1.6. The existing unit tests depend on a HiveContext that was initialised using TestHiveContext.
val conf = new SparkConf().set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext("local", "sc", conf)
sc.setLogLevel("WARN")
val sqlContext = new TestHiveContext(sc)
In Spark 2.2, HiveContext is deprecated and SparkSession.builder.enableHiveSupport is recommended instead. I tried to create a new SparkSession using SparkSession.builder but I couldn't find a way to initialise a SparkSession that uses TestHiveContext.
Is it possible to do that or should I change my approach ?
HiveContext and SQLContext have been replaced by SparkSession, as stated in the migration guide:
SparkSession is now the new entry point of Spark that replaces the old SQLContext and HiveContext. Note that the old SQLContext and HiveContext are kept for backward compatibility. A new catalog interface is accessible from SparkSession - existing API on databases and tables access such as listTables, createExternalTable, dropTempView, cacheTable are moved here.
https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-16-to-20
So create a SparkSession instance with your test configuration and use it instead of HiveContext.
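A minimal sketch of what the Spark 2.2 test setup could look like (the local master, app name and warehouse directory are assumptions, not part of the original code):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("unit-tests")
  .config("spark.sql.warehouse.dir", "target/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()

// The legacy entry points remain reachable for old test code:
val sqlContext = spark.sqlContext
val sc = spark.sparkContext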

How to use SparkSession and StreamingContext together?

I'm trying to stream CSV files from a folder on my local machine (OSX). I'm using SparkSession and StreamingContext together like so:
val sc: SparkContext = createSparkContext(sparkContextName)
val sparkSess = SparkSession.builder().config(sc.getConf).getOrCreate()
val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(time))
val csvSchema = new StructType().add("field_name",StringType)
val inputDF = sparkSess.readStream.format("org.apache.spark.csv").schema(csvSchema).csv("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
If I run ssc.start() after this, I get this error:
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Instead if I try to start the SparkSession like this:
inputDF.writeStream.format("console").start()
I get:
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Clearly I'm not understanding how SparkSession and StreamingContext should work together. If I get rid of SparkSession, StreamingContext only has textFileStream on which I need to impose a CSV schema. Would appreciate any clarifications on how to get this working.
You should not create a separate SparkContext alongside a SparkSession. With the release of Spark 2.0.0 there is a new abstraction available to developers - the SparkSession - which wraps the SparkContext and can be instantiated and called upon just like the SparkContext that was previously available.
You can still access the underlying SparkContext from the SparkSession:
val sparkSess = SparkSession.builder().appName("My App").getOrCreate()
val sc = sparkSess.sparkContext
val ssc = new StreamingContext(sc, Seconds(time))
One more thing that is causing your job to fail is that you only define transformations and never register any output. The first error comes from starting the StreamingContext without any DStream output operations; on the Structured Streaming side, the writeStream query has to be started and kept running with awaitTermination(), whereas a batch DataFrame would need an action such as inputDF.show().
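For reference, a sketch of the Structured-Streaming-only route, which drops the StreamingContext entirely (the folder path and schema are taken from the question; the app name and local master are assumptions):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("CsvStream").master("local[*]").getOrCreate()

val csvSchema = new StructType().add("field_name", StringType)

// Streaming source: every new CSV file dropped into the folder becomes a micro-batch
val inputDF = spark.readStream
  .schema(csvSchema)
  .csv("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")

// Output sink: without a started query there is nothing to execute
val query = inputDF.writeStream
  .format("console")
  .start()

query.awaitTermination()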

SparkSession not picking up Runtime Configuration

In my application I'm creating a SparkSession object and then trying to read my properties file and set the properties at runtime. But it is not picking up the properties that I pass at runtime.
I am submitting my app in YARN cluster mode.
This is my initial SparkSession object, which I am creating in a trait:
val spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate()
Then my main function lives in an object that extends this trait, so the SparkSession is initialized in the trait, and in the object containing main I set the following:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.shuffle.service.enabled", "true")
spark.conf.set("spark.dynamicAllocation.minExecutors", "40")
So ideally my app should start with 40 executors, but it starts and then runs entirely with the default 2 executors.
There is nothing unexpected here. Only a certain subset of Spark SQL properties (prefixed with spark.sql) can be set at runtime (see the SparkConf documentation):
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
The remaining options have to be set before the SparkContext is initialized. That means either creating the SparkContext first and building the SparkSession on top of it:
val conf: SparkConf = ... // Set options here
val sc = new SparkContext(conf)
val spark = SparkSession.builder.getOrCreate() // reuses the existing SparkContext
or using the config method of SparkSession.Builder with a SparkConf:
val conf: SparkConf = ... // Set options here
val spark = SparkSession.builder.config(conf).getOrCreate
or key-value pairs:
val spark = SparkSession.builder.config("spark.some.key", "some_value").getOrCreate
This applies in particular to spark.dynamicAllocation.enabled,
spark.shuffle.service.enabled and spark.dynamicAllocation.minExecutors.
mapreduce.input.fileinputformat.input.dir.recursive, on the other hand, is a Hadoop configuration property, not a Spark one, and should be set there:
spark.sparkContext.hadoopConfiguration.set("some.hadoop.property", "some_value")
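Putting the pieces together, a sketch of a driver that sets the non-runtime options up front (the app name is taken from the question; the rest mirrors the properties discussed above):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("MyApp")
  // These must be set before the SparkContext exists:
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "40")

val spark = SparkSession.builder()
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()

// Spark SQL properties can still be changed at runtime:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// Hadoop input-format options belong on the Hadoop configuration:
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")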

Spark-Scala with Cassandra

I am a beginner with Spark, Scala and Cassandra, working on ETL programming.
My ETL POC requires Spark, Scala and Cassandra. I configured Cassandra on my Ubuntu system under /usr/local/Cassandra/* and then installed Spark and Scala. I am using the Scala editor to start my work; I can load a file into a landing location, but after that I am trying to connect to Cassandra from Scala and cannot find any guidance on how to connect and process the data into the destination database.
Can anyone tell me whether this is the correct approach, or where I am going wrong? Please help me understand how to achieve this with the above combination.
Thanks in advance!
Add the spark-cassandra-connector dependency to your pom.xml or build.sbt (following the connector's installation instructions), then work this way.
Import these in your file:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
Spark Scala file:
object SparkCassandraConnector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("UpdateCassandra")
      .setMaster("spark://spark:7077")                          // Spark master URL
      .set("spark.cassandra.input.split.size_in_mb", "64")      // split size is specified in MB
      .set("spark.cassandra.connection.host", "192.168.3.167")  // Cassandra host
      .set("spark.cassandra.auth.username", "cassandra")
      .set("spark.cassandra.auth.password", "cassandra")

    // SparkSession shared by the connector and Spark SQL
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    // Load data from the source table
    val df = spark
      .read
      .cassandraFormat("table_name", "keyspace_name")
      .load()
  }
}
This will work for Spark 2.2 and Cassandra 2; you can do this easily with the spark-cassandra-connector.
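Since the question also asks about getting data into a destination database, here is a minimal write-back sketch using the same connector and imports (the target table and keyspace names are placeholders, and the table must already exist in Cassandra):
// After transforming df, append the result to a Cassandra table
df.write
  .cassandraFormat("target_table", "keyspace_name")
  .mode("append")
  .save()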

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream with Spark on local mode

I have used Spark before in yarn-cluster mode and it's been good so far.
However, I wanted to run it in "local" mode, so I created a simple Scala app, added Spark as a dependency via Maven and then tried to run the app like a normal application.
However, I get the above exception in the very first line where I try to create a SparkConf object.
I don't understand why I need Hadoop to run a standalone Spark app. Could someone point out what's going on here?
My two line app:
val sparkConf = new SparkConf()
  .setMaster("local")
  .setAppName("MLPipeline.AutomatedBinner")
  //.set("spark.default.parallelism", "300")
  //.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  //.set("spark.kryoserializer.buffer.mb", "256")
  //.set("spark.akka.frameSize", "256")
  //.set("spark.akka.timeout", "1000")
  //.set("spark.akka.threads", "300")
val sc = new SparkContext(sparkConf)