Does anyone know what LocalTableScan corresponds to in Spark Structured Streaming?
I'm trying to understand a strange behavior that I observed in my Spark Structured Streaming application, which is running in local[*] mode.
I have 8 cores on my machine. While the majority of my batches have 8 partitions, every once in a while I get 16, 32, 56 and so on partitions/tasks. I notice that it is always a multiple of 8. Looking at the Stages tab, I have noticed that when this happens, it is because there are multiple LocalTableScan operators.
That is, if I have 2 LocalTableScan operators, then the mini-batch job will have 16 tasks/partitions, and so on.
To give a bit of context, because I suspect it might be the cause: I am using a MemoryStream.
val rows = MemoryStream[Map[String,String]]
val df = rows.toDF()
val rdf = df.mapPartitions{ it => {.....}}(RowEncoder.apply(StructType(List(StructField("blob", StringType, false)))))
I have a future that feeds my memory stream as such right after:
Future {
  blocking {
    for (i <- 1 to 100000) {
      rows.addData(maps)
      Thread.sleep(3000)
    }
  }
}
and then my query:
rdf.writeStream.
trigger(Trigger.ProcessingTime("1 seconds"))
.format("console").outputMode("append")
.queryName("SourceConvertor1").start().awaitTermination()
Any suggestions or hints, please?
It indicates a scan of data that is held in memory on the driver, as your code shows.
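As a rough illustration of that answer, the sketch below builds a DataFrame from a collection that lives on the driver and returns its physical plan as text. It assumes Spark SQL is on the classpath; the object name, the method name, the app name and the local[2] master are made up for the example and are not from the question.

```scala
import org.apache.spark.sql.SparkSession

object LocalTableScanSketch {
  // Returns the physical plan (as text) of a DataFrame built from a driver-side Seq.
  def planOfLocalData(): String = {
    val spark = SparkSession.builder()
      .master("local[2]")               // assumption: small local run
      .appName("LocalTableScanSketch")  // hypothetical app name
      .getOrCreate()
    import spark.implicits._
    try {
      // The rows exist only in driver memory; there is no external source to scan.
      val df = Seq("a", "b", "c").toDF("blob")
      df.queryExecution.executedPlan.toString
    } finally {
      spark.stop()
    }
  }
}
```

Printing the returned string should show a LocalTableScan node over the blob column, which is the same operator name that appears in the UI.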
Related
I've set up a pipeline for incoming events from a stream in Apache Kafka.
Spark connects to Kafka, gets the stream from a topic and processes some "simple" aggregation tasks.
As I'm trying to build a service that should have a low-latency refresh (below 1 second), I've built a simple Spark Streaming app in Scala.
val windowing = events.window(Seconds(30), Seconds(1))
val spark = SparkSession
.builder()
.appName("Main Processor")
.getOrCreate()
import spark.implicits._
// Go into RDD of DStream
windowing.foreachRDD(rdd => {
  // Convert RDD of JSON into DataFrame
  val df = spark.read.json(rdd)
  // Process only if received DataFrame is not empty
  if (!df.head(1).isEmpty) {
    // Create a view for Spark SQL
    val rdf = df.select("user_id", "page_url")
    rdf.createOrReplaceTempView("currentView")
    val countDF = spark.sql("select count(distinct user_id) as sessions from currentView")
    countDF.show()
  }
})
It works as expected. My concerns at this point are about performance. Spark is running on a 4-CPU Ubuntu server for testing purposes.
The CPU usage is about 35% all the time. I'm wondering how the CPU usage would evolve if the incoming data from the stream reached, let's say, 500 msg/s. Would it grow exponentially or linearly?
If you can share your experience with Apache Spark in that kind of situation, I'd appreciate it.
The last open question is whether this will blow up if I set the sliding window interval to 500 ms (as I'd like). I mean, it seems that the Spark Streaming features are fresh, and the batch processing architecture may be a limitation for truly real-time data processing, isn't it?
I developed a Spark Streaming job (receiver approach) which reads data from Kafka, processes it and writes it into Elasticsearch.
The same code was first developed in Java (we are now writing it in Spark with Scala), and when we compare it with the Java performance, Spark is not doing well.
What I have observed is that the time is spent when we are writing to ES.
Here is my high-level code:
val kafkaStreams: util.List[DStream[String]] = new util.ArrayList[DStream[String]]
for (i <- 0 until topic_threads) {
  val data = KafkaUtils.createStream(ssc, kafkaConf, topic).map(line => line._2)
  kafkaStreams.add(data)
}
// below union improves the performance as per spark 1.6.2 documentation
val unifiedStream = ssc.union(kafkaStreams)
unifiedStream.persist(StorageLevel.MEMORY_ONLY)
if (flagY) {
  // each element of the stream is one message read from Kafka
  val dataES = unifiedStream.map(line => processData(line))
  dataES.foreachRDD(rdd => {
    ElasticUtils.saveToEs(rdd, index_Name, index_Type)
  })
}
In the processData method, I am just parsing the data which we have read from Kafka.
Could anyone please share their experience or suggestions for improving the performance of the Spark Streaming (Scala) receiver approach?
Due to this low performance, batches are piling up and the batch scheduling delay keeps increasing.
I am running an application with the following code. I don't understand why only 1 executor is in use even though I have 3. When I try to increase the range, my job fails because the task manager loses an executor.
In the summary, I see a value for shuffle writes but shuffle reads are 0 (maybe cause all the data is on one node and no shuffle read needs to happen to complete the job).
val rdd: RDD[(Int, Int)] = sc.parallelize((1 to 10000000).map(k => (k -> 1)).toSeq)
val rdd2= rdd.sortByKeyWithPartition(partitioner = partitioner)
val sorted = rdd2.map((_._1))
val count_sorted = sorted.collect()
Edit: I increased the executor and driver memory and cores. I also changed the number of executors to 1 from 4. That seems to have helped. I now see shuffle read/writes on each node.
It looks like your code is ending up with only one partition for the RDD. You should increase the number of partitions to at least 3 to utilize all 3 executors.
..maybe cause all the data is on one node
That should make you think that your RDD has only one partition, rather than the 3 or more that would eventually utilize all the executors.
So, extending on Hokam's answer, here's what I would do:
rdd.getNumPartitions
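Now if that is 1, then repartition your RDD. Because rdd is declared as a val in your code, assign the result to a new value and use that in the rest of the job:
val repartitioned = rdd.repartition(3)
which will split your RDD into 3 partitions.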
Try executing your code again now.
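A minimal sketch of that check, assuming Spark is on the classpath and a local[3] master stands in for the 3 executors (the object and method names are invented for the example). It forces a single-partition RDD, repartitions it, and returns both partition counts:

```scala
import org.apache.spark.sql.SparkSession

object RepartitionSketch {
  // Returns (partitions before, partitions after) for a small key/value RDD.
  def partitionCounts(): (Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[3]")            // assumption: 3 local threads instead of 3 executors
      .appName("RepartitionSketch")  // hypothetical app name
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // numSlices = 1 deliberately reproduces the "everything in one partition" case.
      val single = sc.parallelize(1 to 1000, numSlices = 1).map(k => (k, 1))
      val spread = single.repartition(3)
      (single.getNumPartitions, spread.getNumPartitions)
    } finally {
      spark.stop()
    }
  }
}
```

Passing a slice count directly to sc.parallelize is an alternative that avoids the extra shuffle that repartition performs.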
all,
I have a collection of about 1TB in MongoDB. I tried to load it into Spark using the Mongo connector, but I keep getting a stack overflow after 18 minutes of execution.
java.lang.StackOverflowError:
at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
....
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
16/06/29 08:42:22 INFO YarnAllocator: Driver requested a total number of 54692 executor(s).
16/06/29 08:42:22 INFO YarnAllocator: Will request 46501 executor containers, each with 4 cores and 5068 MB memory including 460 MB overhead
Is it because I didn't provide enough memory? Or should I provide more storage?
I have tried adding a checkpoint, but it doesn't help.
I have changed some values in my code because they relate to my company database, but the code as a whole is still valid for this question.
val sqlContext = new SQLContext(sc)
val builder = MongodbConfigBuilder(Map(Host -> List("mymongodurl:mymongoport"), Database -> "mymongoddb", Collection ->"mymongocollection", SamplingRatio -> 0.01, WriteConcern -> "normal"))
val readConfig = builder.build()
val mongoRDD = sqlContext.fromMongoDB(readConfig)
mongoRDD.registerTempTable("mytable")
val dataFrame = sqlContext.sql("SELECT u_at, c_at FROM mytable")
val deltaCollect = dataFrame.filter("u_at is not null and c_at is not null and u_at != c_at").rdd
val mapDelta = deltaCollect.map {
  case Row(u_at: Date, c_at: Date) => {
    if (u_at.getTime == c_at.getTime) {
      (0.toString, 0L)
    }
    else {
      val delta = (u_at.getTime - c_at.getTime) / 1000 / 60 / 60 / 24
      (delta.toString, 1L)
    }
  }
}
val reduceRet = mapDelta.reduceByKey(_+_)
val OUTPUT_PATH = s"./dump"
reduceRet.saveAsTextFile(OUTPUT_PATH)
As you know, Apache Spark does in-memory processing while executing a job, i.e. it loads the data to be worked on into memory. Here, as per your question and comments, you have a dataset as large as 1TB, and the memory available to Spark is around 8GB per core. Hence your Spark executor will always be out of memory in this scenario.
To avoid this you can follow either of the two options below:
Change your RDD storage level to MEMORY_AND_DISK. This way Spark will not load the full data into memory; rather, it will try to spill the extra data to disk. However, performance will decrease because of the transfers between memory and disk. Check out the RDD persistence documentation.
Increase your memory per core so that Spark can load even 1TB of data fully into memory. This way performance will be good, but infrastructure cost will increase.
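The first option could look roughly like the sketch below. It assumes Spark is on the classpath; the tiny generated dataset, the object name and the local[2] master are placeholders for the example and have nothing to do with the MongoDB job in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Persists an RDD with MEMORY_AND_DISK, runs one action, and reports
  // the storage level in effect together with the number of distinct keys.
  def run(): (StorageLevel, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")         // assumption: small local run
      .appName("PersistSketch")   // hypothetical app name
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      val pairs = sc.parallelize(1 to 100000).map(k => (k % 10, 1L))
      // Partitions that do not fit in memory are written to disk instead of being recomputed.
      pairs.persist(StorageLevel.MEMORY_AND_DISK)
      val distinctKeys = pairs.reduceByKey(_ + _).count()
      (pairs.getStorageLevel, distinctKeys)
    } finally {
      spark.stop()
    }
  }
}
```

The storage level only matters for data that is persisted and reused; a job that reads, transforms and writes once streams through its partitions either way.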
I added another Java option, "-Xss32m", to the Spark driver to raise the stack size of every thread, and the exception is no longer thrown. How stupid of me, I should have tried it earlier. Another problem has now shown up, though, so I will have to investigate further. Still, many thanks for your help.
I have gone through some videos on YouTube regarding the Spark architecture.
Even though lazy evaluation, resilient re-creation of data in case of failures, and good functional programming concepts are reasons for the success of Resilient Distributed Datasets, one worrying factor is the memory overhead of multiple transformations, which follows from data immutability.
If I understand the concept correctly, every transformation creates a new dataset, and hence the memory requirements will grow by that many times. If I use 10 transformations in my code, 10 datasets will be created and my memory consumption will increase 10-fold.
e.g.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
The above example has three transformations: flatMap, map and reduceByKey. Does it imply that I need 3X memory for data of size X?
Is my understanding correct? Is caching the RDD the only solution to this issue?
Once I start caching, it may spill over to disk due to the large size, and performance would be impacted by disk I/O operations. In that case, is the performance of Hadoop and Spark comparable?
EDIT:
From the answer and comments, I have understood lazy evaluation and the pipelined processing. My assumption of 3X memory, where X is the initial RDD size, is not accurate.
But is it possible to cache a 1X RDD in memory and update it over the pipeline? How does cache() work?
First off, the lazy execution means that functional composition can occur:
scala> val rdd = sc.makeRDD(List("This is a test", "This is another test",
"And yet another test"), 1)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at makeRDD at <console>:27
scala> val counts = rdd.flatMap(line => {println(line);line.split(" ")}).
| map(word => {println(word);(word,1)}).
| reduceByKey((x,y) => {println(s"$x+$y");x+y}).
| collect
This is a test
This
is
a
test
This is another test
This
1+1
is
1+1
another
test
1+1
And yet another test
And
yet
another
1+1
test
2+1
counts: Array[(String, Int)] = Array((And,1), (is,2), (another,2), (a,1), (This,2), (yet,1), (test,3))
First note that I force the parallelism down to 1 so that we can see how this looks on a single worker. Then I add a println to each of the transformations so that we can see how the workflow moves. You see that it processes the line, then it processes the output of that line, followed by the reduction. So, there are not separate states stored for each transformation as you suggested. Instead, each piece of data is looped through the entire transformation up until a shuffle is needed, as can be seen by the DAG visualization from the UI:
That is the win from the laziness. As to Spark v Hadoop, there is already a lot out there (just google it), but the gist is that Spark tends to utilize network bandwidth out of the box, giving it a boost right there. Then, there are a number of performance improvements gained by laziness, especially if a schema is known and you can utilize the DataFrames API.
So, overall, Spark beats MR hands down in just about every regard.
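The element-by-element flow seen in the println output above can be mimicked without Spark at all. The sketch below is only an analogy using plain Scala iterators (the object and method names are invented); it records the order in which each step touches each element, and no intermediate collection is built between flatMap and map.

```scala
object PipelineSketch {
  // Runs flatMap and map lazily over the input and returns the order of calls.
  def run(lines: Seq[String]): Vector[String] = {
    val log = Vector.newBuilder[String]
    val pairs = lines.iterator
      .flatMap { line =>
        log += s"flatMap:$line"
        line.split(" ").iterator
      }
      .map { word =>
        log += s"map:$word"
        (word, 1)
      }
    // Nothing has run so far; consuming the iterator drives both steps together.
    pairs.foreach(_ => ())
    log.result()
  }
}
```

For the input Seq("a b", "c") the log reads flatMap:a b, map:a, map:b, flatMap:c, map:c: the first line is fully pushed through map before the second line is even split, which is the same interleaving as in the Spark shell session.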
The memory requirements of Spark are not 10 times higher if you have 10 transformations in your Spark job. When you specify the transformation steps of a job, Spark builds a DAG which allows it to execute all the steps in the job. After that it breaks the job down into stages. A stage is a sequence of transformations which Spark can execute on a dataset without shuffling.
When an action is triggered on the RDD, Spark evaluates the DAG. It simply applies all the transformations in a stage together until it hits the end of the stage, so the memory pressure is unlikely to be 10 times higher unless each transformation leads to a shuffle (in which case it is probably a badly written job).
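One way to see where a stage ends is to print an RDD's lineage. The sketch below, which assumes Spark is on the classpath and uses made-up object, method and app names, builds the word count from the question on two sample lines and returns toDebugString without running any action.

```scala
import org.apache.spark.sql.SparkSession

object StageBoundarySketch {
  // Returns the lineage description of a small word-count RDD.
  def lineage(): String = {
    val spark = SparkSession.builder()
      .master("local[1]")              // assumption: single local thread
      .appName("StageBoundarySketch")  // hypothetical app name
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      val counts = sc.parallelize(Seq("This is a test", "This is another test"), 1)
        .flatMap(_.split(" "))   // narrow: stays in the same stage
        .map(word => (word, 1))  // narrow: stays in the same stage
        .reduceByKey(_ + _)      // needs a shuffle: a new stage starts here
      counts.toDebugString
    } finally {
      spark.stop()
    }
  }
}
```

In the returned text, the ShuffledRDD produced by reduceByKey sits above the indented MapPartitionsRDD entries for flatMap and map; that indentation step marks the shuffle, and therefore the stage boundary.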
I would recommend watching this talk and going through the slides.