Spark Cassandra integration not using C* optimizations - scala

I am running the code from the IntelliJ IDE. My Spark/Cassandra cluster has 3 nodes, and the Cassandra nodes and Spark workers are on the same machines.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector.CassandraSparkExtensions

val sparkConf = new SparkConf()
  .set("spark.sql.catalog.mycatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
  .set("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .set("spark.sql.catalog.casscatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
val sc = SparkSession.builder()
  .config(sparkConf)
  .master("spark://master")
  .withExtensions(new CassandraSparkExtensions)
  .getOrCreate()
val table = sc.sql("select * from table where primarykeyA = 1")
table.show(10)
When I run this query directly it returns in milliseconds, since I filter on the partition key.
Expectation: the query should hit only the one worker node that holds the partition's data and return.
Instead, when this runs it ends up going to all worker nodes, which indicates the DataStax optimizations (partition-key pushdown) are not in place.
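For reference, one way to check whether the connector's pushdown is actually applied is to look at the physical plan; the keyspace and table names below are illustrative placeholders, assuming the table is reached through the catalog configured above:
// Illustrative check: "ks" and "tbl" are placeholder names under the catalog
// configured above; explain() prints the physical plan.
val df = sc.sql("SELECT * FROM mycatalog.ks.tbl WHERE primarykeyA = 1")
// With the Cassandra extensions active, the partition-key predicate should
// show up as a pushed filter in the Cassandra scan rather than as a plain
// Spark-side Filter over a full table scan.
df.explain(true)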
Is there a way I can pass the parameter --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta via code instead of on the command line?
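One option (a sketch, not a verified fix for the pushdown issue) is to set spark.jars.packages, the configuration key behind --packages, before the session is created; when running from an IDE it is often simpler to add the connector as a regular sbt/Maven dependency instead:
// Sketch: spark.jars.packages is the config equivalent of --packages and must
// be set before the SparkSession is created.
val sparkConf = new SparkConf()
  .set("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta")
  .set("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .set("spark.sql.catalog.mycatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")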

Related

Spark sc.binaryFiles() partitioning small files and YARN

Using the sc.binaryFiles() function in Spark 2.3.0 on a Hortonworks 2.6.5 server, I noticed behavior I cannot explain regarding the default partitioning in a YARN-managed cluster. Please see the sample code below:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
object ReadTestYarn extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val sc = new SparkContext("yarn", "ReadTestYarn")
  val inputRDD1 = sc.textFile("hdfs:/user/maria_dev/readtest/input/*")
  val inputRDD2 = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*")
  println("Num of RDD1 partitions: " + inputRDD1.getNumPartitions)
  println("Num of RDD2 partitions: " + inputRDD2.getNumPartitions)
}
[maria_dev@sandbox-hdp readtest]$ spark-submit --master yarn --deploy-mode client --class ReadTestYarn ReadTest.jar
Num of RDD1 partitions: 10
Num of RDD2 partitions: 1
The data I use is small: 10 CSV files, each about 4-5 MB in size, 43 MB in total. In the case of RDD1, the number of resulting partitions is understandable, and the calculation method is well explained in the following post and article:
Spark RDD default number of partitions
https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7
But with RDD2, the binaryFiles() function and the master URL passed to Spark as "yarn", only one partition is created, which I don't quite understand.
@Mark Rajcok gave some explanation in the post below, but the link to the commit changes there no longer works. Could someone please explain in detail why only one partition is created in this case?
PySpark: Partitioning while reading a binary file using binaryFiles() function
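For what it's worth, here is a sketch of the knobs that can influence this (not an explanation of the observed behaviour): binaryFiles() accepts a minPartitions hint, and in Spark 2.1+ the split size is also bounded by spark.files.maxPartitionBytes and spark.files.openCostInBytes, so many small files tend to be packed into very few partitions by default. The values below are illustrative only:
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the minPartitions hint may be honoured only loosely in newer
// Spark versions, where the split size is bounded by the two settings below.
val conf = new SparkConf()
  .setMaster("yarn")
  .setAppName("ReadTestYarnHint")
  .set("spark.files.maxPartitionBytes", (8 * 1024 * 1024).toString) // cap at ~8MB per partition
  .set("spark.files.openCostInBytes", (1024 * 1024).toString)       // assumed cost per opened file
val sc = new SparkContext(conf)
val hinted = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*", minPartitions = 10)
println("Num of RDD2 partitions with hint: " + hinted.getNumPartitions)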

Execute job on distributed cassandra DSE spark cluster

I have a three-node Cassandra DSE cluster and a DB schema with RF=3. Now I'm creating a Scala application to be executed on DSE Spark. The Scala code is as follows:
package com.spark
import com.datastax.spark.connector._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
object sample {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("testing")
      .set("spark.cassandra.connection.host", "192.168.0.40")
      .set("spark.driver.allowMultipleContexts", "true")
      .set("spark.executor.memory", "1g")
      .set("spark.driver.memory", "1g")
      .set("spark.driver.maxResultSize", "500M")
      .set("spark.executor.heartbeatInterval", "30s")
      .set("spark.submit.deployMode", "cluster")
    val sc = new SparkContext(conf)
    val lRDD = sc.cassandraTable("dbname", "tablename")
    lRDD.collect.foreach(println)
  }
}
I'm running the script using:
dse> bin/dse spark-submit --class com.spark.sample --total-executor-cores 4 /home/db-svr/sample.jar
Now I want to submit my Spark application from one node, but the system should do the processing on all 3 nodes internally, and I want to monitor this so I can use the RAM and processors of the 3 nodes collectively. How can I do that?
Also, the current script takes a lot of time to return the result (the table has 1 million rows of about 128 bytes each). Are there any performance tuning parameters I'm missing?
There are a few things you probably want to change. The main thing stopping you from running on multiple machines is
.setMaster("local")
which instructs the application not to use a distributed resource manager and instead to run everything locally in the application process. With DSE you should follow the relevant documentation or start with the Spark Build Examples.
In addition, you most likely never want to set
.set("spark.driver.allowMultipleContexts", "true")
Having multiple Spark contexts in one JVM is fraught with problems and usually means things are not set up correctly.
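For illustration, a minimal reworking of the question's program along those lines (same keyspace, table and host as above; the master and deploy mode come from dse spark-submit rather than the code):
package com.spark

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object sample {
  def main(args: Array[String]): Unit = {
    // No setMaster("local") and no allowMultipleContexts: master and deploy
    // mode are supplied by `dse spark-submit` on the command line.
    val conf = new SparkConf()
      .setAppName("testing")
      .set("spark.cassandra.connection.host", "192.168.0.40")
    val sc = new SparkContext(conf)
    val lRDD = sc.cassandraTable("dbname", "tablename")
    // collect() pulls every row back to the driver; keeping the work on the
    // executors (e.g. count, or a saveTo... action) is usually much faster.
    println(lRDD.count())
  }
}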

spark Cassandra tuning

How do I set the following Cassandra write parameters in Spark Scala code?
DataStax Spark Cassandra Connector version - 1.6.3
Spark version - 1.6.2
spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes
spark.cassandra.output.batch.size.bytes
spark.cassandra.output.batch.grouping.key
Thanks,
Chandra
In DataStax Spark Cassandra Connector 1.6.X, you can pass these parameters as part of your SparkConf.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "192.168.123.10")
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
  .set("spark.cassandra.output.batch.size.rows", "100")
  .set("spark.cassandra.output.concurrent.writes", "100")
  .set("spark.cassandra.output.batch.size.bytes", "100")
  .set("spark.cassandra.output.batch.grouping.key", "partition")
val sc = new SparkContext("spark://192.168.123.10:7077", "test", conf)
You can refer to this readme for more information.
The most flexible way is to add those variables in a file, such as spark.conf:
spark.cassandra.output.concurrent.writes 10
etc...
and then create your spark context in your app with something like:
val conf = new SparkConf()
val sc = new SparkContext(conf)
and finally, when you submit your app, you can specify your properties file with:
spark-submit --properties-file spark.conf ...
Spark will automatically read your configuration from spark.conf when creating the Spark context.
That way, you can modify the properties in spark.conf without needing to recompile your code each time.
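For example, a hypothetical spark.conf combining the settings asked about (the values are illustrative placeholders, not recommendations):
# hypothetical spark.conf; values are illustrative only
spark.cassandra.connection.host           192.168.123.10
spark.cassandra.output.batch.size.rows    100
spark.cassandra.output.concurrent.writes  10
spark.cassandra.output.batch.size.bytes   16384
spark.cassandra.output.batch.grouping.key partition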

Simple spark job fail due to GC overhead limit

I've created a standalone Spark (2.1.1) cluster on my local machines,
with 9 cores / 80G each (27 cores / 240G RAM in total).
I've got a sample Spark job that sums all the numbers from 1 to x.
This is the code:
package com.example
import org.apache.spark.sql.SparkSession
object ExampleMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("spark://192.168.1.2:7077")
      .config("spark.driver.maxResultSize", "3g")
      .appName("ExampleApp")
      .getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.parallelize(List.range(1, 1000))
    val sum = rdd.reduce((a, b) => a + b)
    println(sum)
    done
  }
  def done = {
    println("\n\n")
    println("-------- DONE --------")
  }
}
When running the above code I get results after a few seconds.
So I cranked the code up to sum all the numbers from 1 to 1B (1,000,000,000), and then I get a GC overhead limit reached error.
I read that Spark should spill to disk if there isn't enough memory. I've tried to play with my cluster configuration, but that didn't help.
Driver memory = 6G
Number of workers = 24
Cores per worker = 1
Memory per worker = 10G
I'm not a developer and have no knowledge of Scala, but I would like to find a way to run this code without GC issues.
Per @philantrovert's request I'm adding my spark-submit command:
/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
/mnt/spark-share/example_2.11-1.0.jar
In addition, my spark/conf files are as follows:
The slaves file contains the 3 IP addresses of my nodes (including the master).
spark-defaults contains:
spark.master spark://192.168.1.2:7077
spark.driver.memory 10g
spark-env.sh contains:
SPARK_LOCAL_DIRS= shared folder among all nodes
SPARK_EXECUTOR_MEMORY=10G
SPARK_DRIVER_MEMORY=10G
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=10G
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_DIR= shared folder among all nodes
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
Thanks
I suppose the problem is that you create a List with 1 billion entries on the driver, which is a huge data structure (about 4 GB). There is a more efficient way to create a Dataset/RDD programmatically:
val rdd = spark.range(1000000000L).rdd
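Putting that together, a sketch of the full job (same master URL as in the question): spark.range generates the numbers lazily on the executors instead of building a 1-billion-element List on the driver.
package com.example

import org.apache.spark.sql.SparkSession

object ExampleMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("spark://192.168.1.2:7077")
      .appName("ExampleApp")
      .getOrCreate()
    // The range is generated in parallel on the executors, not on the driver.
    val rdd = spark.range(1L, 1000000001L).rdd.map(_.longValue)
    val sum = rdd.reduce(_ + _) // Long arithmetic, so the sum of 1..1e9 does not overflow
    println(sum)
    spark.stop()
  }
}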

Scala Code to connect to Spark and Cassandra

I have Scala (IntelliJ) running on my laptop. I also have Spark and Cassandra running on machines A, B and C (a 3-node cluster using DataStax, running in Analytics mode).
I tried running Scala programs on the cluster and they run fine.
I need to write code and run it from IntelliJ on my laptop. How do I connect and run it? I know I am making a mistake in the code; I only used generic placeholder values and need help writing the specific code. Example: localhost is incorrect.
import org.apache.spark.{SparkContext, SparkConf}
object HelloWorld {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark:master", "localhost")
    val sc = new SparkContext(conf)
    val data = sc.cassandraTable("my_keyspace", "my_table")
  }
}
val conf = new SparkConf().setAppName("APP_NAME")
  .setMaster("local")
  .set("spark.cassandra.connection.host", "localhost")
  .set("spark.cassandra.auth.username", "")
  .set("spark.cassandra.auth.password", "")
Use the above code to connect to local Spark and Cassandra. If your Cassandra cluster has authentication enabled, set the username and password.
If you want to connect to a remote Spark and Cassandra cluster, replace localhost with the Cassandra host, and in setMaster use spark://SPARK_HOST:7077.
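For example, a hypothetical remote variant (SPARK_MASTER_HOST and CASSANDRA_HOST are placeholders; 7077 is the default standalone master port):
// Placeholders only; substitute the real master and Cassandra node addresses
// and real credentials if authentication is enabled.
val conf = new SparkConf().setAppName("APP_NAME")
  .setMaster("spark://SPARK_MASTER_HOST:7077")
  .set("spark.cassandra.connection.host", "CASSANDRA_HOST")
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext(conf)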