Cassandra insert performance using spark-cassandra connector - scala

I am a newbie to Spark and Cassandra. I am trying to insert into a Cassandra table using the spark-cassandra connector as below:
import java.util.UUID
import org.apache.spark.{SparkContext, SparkConf}
import org.joda.time.DateTime
import com.datastax.spark.connector._
case class TestEntity(id: UUID, category: String, name: String, value: Double, createDate: DateTime, tag: Long)
object SparkConnectorContext {
  val conf = new SparkConf(true).setMaster("local")
    .set("spark.cassandra.connection.host", "192.168.xxx.xxx")
  val sc = new SparkContext(conf)
}
object TestRepo {
  def insertList(list: List[TestEntity]) = {
    SparkConnectorContext.sc.parallelize(list).saveToCassandra("testKeySpace", "testColumnFamily")
  }
}
object TestApp extends App {
  val start = System.currentTimeMillis()
  TestRepo.insertList(Utility.generateRandomData())
  val end = System.currentTimeMillis()
  val timeDiff = end - start
  println("Difference (in millis)= " + timeDiff)
}
When I insert using the above method (list with 100 entities), it takes 300-1100 milliseconds.
I tried inserting the same data using the phantom library, and it takes only 20-40 milliseconds.
Can anyone tell me why the spark connector is taking this much time for inserts? Am I doing anything wrong in my code, or is it not advisable to use the spark-cassandra connector for insert operations?

It looks like you are including the parallelize operation in your timing. Also since you have your spark worker running on a different machine than Cassandra, the saveToCassandra operation will be a write over the network.
Try configuring your system to run the spark workers on the Cassandra nodes. Then create an RDD in a separate step and invoke an action like count() on it to load the data into memory. Also you might want to persist() or cache() the RDD to make sure it stays in memory for the test.
Then time just the saveToCassandra of that cached RDD.
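A minimal sketch of that timing approach, reusing the names from the question (the point is the cache/count step and where the timer starts, not the exact values):
val rdd = SparkConnectorContext.sc.parallelize(Utility.generateRandomData()).cache()
rdd.count()   // force the RDD to be computed and cached before timing
val start = System.currentTimeMillis()
rdd.saveToCassandra("testKeySpace", "testColumnFamily")   // time only the write
println("saveToCassandra took " + (System.currentTimeMillis() - start) + " ms")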
You might also want to look at the repartitionByCassandraReplica method offered by the Cassandra connector. That would partition the data in the RDD based on which Cassandra node the writes need to go to. In that way you exploit data locality and often avoid doing writes and shuffles over the network.
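A hedged sketch of how repartitionByCassandraReplica is typically invoked before the save, reusing the rdd from the sketch above; it assumes the RDD element type carries the table's partition key, as TestEntity presumably does here:
val byReplica = rdd.repartitionByCassandraReplica("testKeySpace", "testColumnFamily")
byReplica.saveToCassandra("testKeySpace", "testColumnFamily")   // most writes stay node-local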

There are some serious problems with your "benchmark":
Your data set is so small that you're measuring mostly only the job setup time. Saving 100 entities should take on the order of single milliseconds on a single node, not seconds. Also, saving only 100 entities gives the JVM no chance to compile the code you run into optimized machine code.
You included Spark context initialization in your measurement. The JVM loads classes lazily and the SparkConnectorContext object is only initialized on first use, so the Spark initialization code actually runs after the measurement has started. This is an extremely costly element, typically performed only once per whole Spark application, not even once per job.
You're performing the measurement only once per launch. This means you're not even measuring the Spark context and job setup time correctly, because the JVM has to load all the classes for the first time and HotSpot probably has no chance to kick in.
To summarize, you're very likely measuring mostly class loading time, which is dependent on the size and number of classes loaded. Spark is quite a large thing to load and a few hundred milliseconds are not surprising at all.
To measure insert performance correctly:
use a larger data set
exclude one-time setup from the measurement
do multiple runs sharing the same Spark context, and discard a few initial ones until you reach steady-state performance.
BTW, if you enable the debug logging level, the connector logs the insert time for every partition in the executor logs.
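A minimal sketch of such a measurement loop, assuming a helper that can generate an arbitrary number of TestEntity rows (the randomData helper, the row counts and the number of runs are illustrative, not from the original post):
def randomData(n: Int): List[TestEntity] =
  List.fill(n)(TestEntity(UUID.randomUUID(), "cat", "name", math.random, DateTime.now, 1L))
def timedInsert(n: Int): Long = {
  val rdd = SparkConnectorContext.sc.parallelize(randomData(n)).cache()
  rdd.count()                                 // materialize the RDD before timing
  val start = System.currentTimeMillis()
  rdd.saveToCassandra("testKeySpace", "testColumnFamily")
  System.currentTimeMillis() - start
}
(1 to 5).foreach(_ => timedInsert(100000))    // warm-up runs, discarded
val timings = (1 to 10).map(_ => timedInsert(100000)).sorted
println("median insert time: " + timings(timings.size / 2) + " ms")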

In Spark, how objects and variables are kept in memory and across different executors?
I am using:
Spark 3.0.0
Scala 2.12
I am working on writing a Spark Structured Streaming job with a custom stream source. Before the execution of the Spark query, I create a bunch of metadata which is used by my Spark Streaming job.
I am trying to understand how this metadata is kept in memory across the different executors.
Example Code:
case class JobConfig(fieldName: String, displayName: String, castTo: String)
val jobConfigs: List[JobConfig] = build() // build the job configs
val query = spark
  .readStream
  .format("custom-streaming")
  .load
query
  .writeStream
  .trigger(Trigger.ProcessingTime(2, TimeUnit.MINUTES))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    CustomJobExecutor.start(jobConfigs) // CustomJobExecutor does data frame transformations and saves the data in PostgreSQL.
  }
  .outputMode(OutputMode.Append())
  .start()
  .awaitTermination()
I need help understanding the following:
In the sample code, how will Spark keep "jobConfigs" in memory across different executors?
Is there any added advantage of broadcasting?
What is the efficient way of keeping the variables which can't be deserialized?
Local variables are copied for each task, whereas broadcast variables are copied only once per executor. From the docs:
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This means that if your jobConfigs is large enough, and the number of tasks and stages where the variable is used is significantly larger than the number of executors, or deserialization is time-consuming, then broadcast variables can make a difference. In other cases, they don't.
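A minimal sketch of explicitly broadcasting the configs, reusing JobConfig and spark from the question (someRdd and applyConfigs are hypothetical placeholders for work that actually runs on executors):
import org.apache.spark.broadcast.Broadcast
// shipped and deserialized once per executor instead of once per task
val jobConfigsBc: Broadcast[List[JobConfig]] = spark.sparkContext.broadcast(jobConfigs)
val result = someRdd.map { record =>
  val configs = jobConfigsBc.value   // read the broadcast value inside the task
  applyConfigs(record, configs)      // hypothetical helper that uses the configs
}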

Spark Streaming - Refresh Static Data

I have a Spark Streaming job, which when it starts, queries Hive and creates a Map[Int, String] object, which is then used for parts of the calculations the job performs.
The problem I have is that the data in Hive has the potential to change every 2 hours. I would like to have the ability to refresh the static data on a schedule, without having to restart the Spark job every time.
The initial load of the Map object takes around 1 minute.
Any help is very welcome.
You can use a listener, which will be triggered every time a job is started for any stream within the Spark context. Since your DB is updated every two hours, there is no harm in updating it every time, AFAIK.
sc.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // load the data into the map that will be sent to the executors
  }
})
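A fuller sketch of the idea, assuming a loadFromHive() helper that re-runs the Hive query and a 2-hour refresh interval (both are illustrative, not part of the original answer):
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
var staticMap: Broadcast[Map[Int, String]] = sc.broadcast(loadFromHive())   // driver-side state
var lastRefresh: Long = System.currentTimeMillis()
sc.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    if (System.currentTimeMillis() - lastRefresh > 2 * 60 * 60 * 1000L) {
      staticMap.unpersist()                     // drop the stale copy from the executors
      staticMap = sc.broadcast(loadFromHive())  // re-broadcast the refreshed data
      lastRefresh = System.currentTimeMillis()
    }
  }
})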

Spark configurations for Out of memory error [duplicate]

Cluster setup -
Driver has 28gb
Workers have 56gb each (8 workers)
Configuration -
spark.memory.offHeap.enabled true
spark.driver.memory 20g
spark.memory.offHeap.size 16gb
spark.executor.memory 40g
My job -
//myFunc just takes a string s and does some transformations on it, they are very small strings, but there's about 10million to process.
//Out of memory failure
data.map(s => myFunc(s)).saveAsTextFile(outFile)
//works fine
data.map(s => myFunc(s))
Also, I de-clustered / removed Spark from my program and it completed just fine (successfully saved to a file) on a single server with 56gb of RAM. This shows that it is just a Spark configuration issue. I reviewed https://spark.apache.org/docs/latest/configuration.html#memory-management and the configurations I currently have seem to be all that should need to change for my job to work. What else should I be changing?
Update -
Data -
val fis: FileInputStream = new FileInputStream(new File(inputFile))
val bis: BufferedInputStream = new BufferedInputStream(fis)
val input: CompressorInputStream = new CompressorStreamFactory().createCompressorInputStream(bis)
val br = new BufferedReader(new InputStreamReader(input))
val stringArray = br.lines().toArray()
val data = sc.parallelize(stringArray)
Note - this does not cause any memory issues, even though it is incredibly inefficient. I can't read from it using spark because it throws some EOF errors.
As for myFunc, I can't really post the code for it because it's complex. But basically, the input string is a delimited string; it does some delimiter replacement, date/time normalizing and things like that. The output string will be roughly the same size as the input string.
Also, it works fine for smaller data sizes, and the output is correct and roughly the same size as input data file, as it should be.
Your current solution does not take advantage of Spark. You are loading the entire file into an array in memory, then using sc.parallelize to distribute it into an RDD. This is hugely wasteful of memory (even without Spark) and will of course cause out-of-memory problems for large files.
Instead, use sc.textFile(filePath) to create your RDD. Then Spark is able to read and process the file in chunks, so only a small portion of it needs to be in memory at a time. You are also able to take advantage of parallelism this way, as Spark will read and process the file in parallel with however many executors and cores you have, instead of needing to read the entire file on a single thread on a single machine.
Assuming that myFunc can look at only a single line at a time, then this program should have a very small memory footprint.
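A minimal sketch of that approach, reusing the names from the question (Spark reads common compression formats such as gzip transparently with textFile; whether it can handle this particular compressed file is an assumption, given the EOF errors mentioned above):
val data = sc.textFile(inputFile)                  // read and partition the file lazily
data.map(s => myFunc(s)).saveAsTextFile(outFile)   // process and write without materializing everything in memory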
It would help if you put more details of what is going on in your program before and after the map.
The second command (only the map) does not do anything unless an action is triggered. Your file is probably not partitioned and the driver is doing the work. The code below should force the data to be spread evenly across the workers and protect against OOM on a single node. It will cause shuffling of the data, though.
Update after looking at your code: it will be better if you do this:
val data = sc.parallelize(stringArray).repartition(8)
data.map(s => myFunc(s)).saveAsTextFile(outFile)

Spark Streaming states to be persisted to disk in addition to in memory

I have written a program using Spark Streaming that uses the map-with-state function to detect and skip repetitive records. The function is similar to the one below:
val trackStateFunc1 = (batchTime: Time,
                       key: String,
                       value: Option[(String, String)],
                       state: State[Long]) => {
  if (state.isTimingOut()) {
    None
  }
  else if (state.exists()) None
  else {
    state.update(1L)
    Some(value.get)
  }
}
val stateSpec1 = StateSpec.function(trackStateFunc1)
  //.initialState(initialRDD)
  .numPartitions(100)
  .timeout(Minutes(30*24*60))
My number of records could be high and I kept the time-out at about one month, so the number of records and keys could be large. I wanted to know if I can save these states on disk in addition to memory, something like:
"RDD.persist(StorageLevel.MEMORY_AND_DISK_SER)"
I wanted to know if I can save these states on Disk in addition to the Memory
Stateful streaming in Spark automatically gets serialized to persistent storage; this is called checkpointing. When you run your stateful DStream, you must provide a checkpoint directory, otherwise the graph won't be able to execute at runtime.
You can set the checkpointing interval via DStream.checkpoint. For example, if you want to set it to every 30 seconds:
inputDStream
.mapWithState(trackStateFunc)
.checkpoint(Seconds(30))
According to the MapWithState sources, you can try:
mapWithStateDS.dependencies.head.persist(StorageLevel.MEMORY_AND_DISK)
This is valid as of Spark 3.0.1.

Spark streaming multiple sources, reload dataframe

I have a Spark Streaming context reading event data from Kafka at 10-second intervals. I would like to complement this event data with the existing data in a Postgres table.
I can load the Postgres table with something like:
val sqlContext = new SQLContext(sc)
val data = sqlContext.load("jdbc", Map(
"url" -> url,
"dbtable" -> query))
...
val broadcasted = sc.broadcast(data.collect())
And later I can cross it like this:
val db = sc.parallelize(broadcasted.value)
val dataset = stream_data.transform{ rdd => rdd.leftOuterJoin(db)}
I would like to keep my current data stream running and still reload this table every 6 hours. Since Apache Spark at the moment doesn't support multiple running contexts, how can I accomplish this? Is there any workaround? Or will I need to restart the server each time I want to reload the data? This seems like such a simple use case... :/
In my humble opinion, reloading another data source during the transformations on DStreams is not recommended by design.
Compared to traditional stateful streaming processing models, D-Streams is designed to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals.
The transformations on DStreams are deterministic, and this design enables quick recovery from faults by recomputation. Refreshing the data would introduce side effects into that recovery/recomputation.
One workaround is to postpone the query to output operations, for example foreachRDD(func).
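A sketch of that workaround, assuming a loadPostgresTable() helper that re-runs the JDBC load from the question and returns key/value pairs suitable for the join (the helper name and the 6-hour check are illustrative):
var cachedTable = sc.broadcast(loadPostgresTable())
var lastLoad = System.currentTimeMillis()
stream_data.foreachRDD { rdd =>
  if (System.currentTimeMillis() - lastLoad > 6 * 60 * 60 * 1000L) {
    cachedTable.unpersist()
    cachedTable = sc.broadcast(loadPostgresTable())  // refresh roughly every 6 hours
    lastLoad = System.currentTimeMillis()
  }
  val db = sc.parallelize(cachedTable.value)
  val joined = rdd.leftOuterJoin(db)
  // ... continue with the usual output operations on joined ...
}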