Large task size for simplest program - Scala

I am trying to run the simplest program with Spark:
import org.apache.spark.{SparkContext, SparkConf}

object LargeTaskTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val dat = (1 to 10000000).toList
    val data = sc.parallelize(dat).cache()
    for (i <- 1 to 100) {
      println(data.reduce(_ + _))
    }
  }
}
I get the following error message after each iteration:
WARN TaskSetManager: Stage 0 contains a task of very large size (9767
KB). The maximum recommended task size is 100 KB.
Increasing the data size increases said task size. This suggests to me that the driver is shipping the "dat" object to all executors, but I can't for the life of me see why, as the only operation on my RDD is reduce, which basically has no closure. Any ideas?

Because you create the very large list locally first, the Spark parallelize method is trying to ship this list to the Spark workers as a single unit, as part of a task. Hence the warning message you receive. As an alternative, you could parallelize a much smaller list, then use flatMap to explode it into the larger list. This also has the benefit of creating the larger set of numbers in parallel. For example:
import org.apache.spark.{SparkContext, SparkConf}

object LargeTaskTest extends App {
  val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val dat = (0 to 99).toList
  val data = sc.parallelize(dat).cache().flatMap(i => (1 to 1000000).map(j => j * 100 + i))
  println(data.count()) //100000000
  println(data.reduce(_ + _))
  sc.stop()
}
EDIT:
Ultimately the local collection being parallelized has to be pushed to the executors. The parallelize method creates an instance of ParallelCollectionRDD:
def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L730
ParallelCollectionRDD creates a number of partitions equal to numSlices:
override def getPartitions: Array[Partition] = {
  val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
  slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L96
numSlices defaults to sc.defaultParallelism, which on my machine is 4. So even when split, each partition still contains a very large list that needs to be pushed to an executor.
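To illustrate the mechanism (a sketch only, not a real fix, since the full list still leaves the driver), you could pass numSlices explicitly so that each individual task carries a smaller slice; the value 1000 below is an arbitrary choice:
val data = sc.parallelize(dat, 1000).cache() // smaller per-task payload, but the same total data is still shipped from the driver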
SparkContext.parallelize carries the scaladoc note @note Parallelize acts lazily, and ParallelCollectionRDD contains the comment:
// TODO: Right now, each split sends along its full data, even if
// later down the RDD chain it gets cached. It might be worthwhile
// to write the data to a file in the DFS and read it in the split
// instead.
So it appears that the problem happens when you call reduce, because that is the point at which the partitions are sent to the executors, but the root cause is that you are calling parallelize on a very big list. Generating the large list within the executors is a better approach, IMHO.
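If the data really is just a range of numbers, as in the question, one way to do that (a sketch, not part of the original answer) is to let Spark build the RDD directly on the executors with SparkContext.range, so no large local collection is ever shipped:
val data = sc.range(1, 10000001).cache() // RDD[Long] generated on the executors, nothing big leaves the driver
println(data.reduce(_ + _))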

The reduce function sends all the data to one single node. When you run sc.parallelize, the data is split into a number of partitions (spark.default.parallelism unless you specify otherwise). To make use of the already distributed data, use something like this:
data.map(el => el % 100 -> el).reduceByKey(_ + _)
or you can do the reduce at partition level.
data.mapPartitions(p => Iterator(p.reduce(_ + _))).reduce(_ + _)
or just use sum :)
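For completeness, that last suggestion is a one-liner; sum on an RDD[Int] goes through DoubleRDDFunctions and returns a Double:
println(data.sum()) // returns a Double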

Related

How to improve Kudu reads with Spark?

I have a process that, given a new input, retrieves related information from our Kudu database and then does some computation.
The problem lies in the data retrieval: we have 1,201,524,092 rows, and for any computation it takes forever to start processing the needed ones, because the reader needs to hand all of them to Spark.
To read from Kudu we do:
def read(tableName: String): Try[DataFrame] = {
  val kuduOptions: Map[String, String] = Map(
    "kudu.table" -> tableName,
    "kudu.master" -> kuduContext.kuduMaster)
  SQLContext.read.options(kuduOptions).format("kudu").load
}
And then:
val newInputs = ??? // Dataframe with the new inputs
val currentInputs = read("inputsTable") // This takes too much time!!!!
val relatedCurrent = currentInputs.join(newInputs.select("commonId"), Seq("commonId"), "inner")
doThings(newInputs, relatedCurrent)
For example, when we only want to introduce a single new input, it still has to scan the full table to find the currentInputs, which produces a shuffle write of 81.6 GB / 1,201,524,092 rows.
How can I improve this?
Thanks,
You can collect the new inputs and then use them in a where clause.
Doing it this way you can easily hit an OOM if there are many new inputs, but it can make your query very fast because it benefits from predicate pushdown:
// collect the ids to the driver, then push them down as an IN filter
val collectedIds = newInputs.select("commonId").collect().map(_.get(0))
val filteredCurrentInputs = currentInputs.where($"commonId".isin(collectedIds: _*))

How do I determine the optimal number of threads in Spark application?

In my Scala/Spark application, I am trying to use multithreading correctly. As you can see from the code below, the number of threads is equal to the number of elements in the storage array. I tested the current code and it works, but there are only 2 elements in the storage array. It seems to me that if there were a large number of elements in the array, problems would occur. In my case, I don't know how many elements there will be in the array in the future. Perhaps I should limit the number of threads and start new threads only when the previous ones have finished.
Question: How do I determine the optimal number of threads?
MainApp.scala:
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import utils.CustomThread

object MainApp {
  def main(args: Array[String]): Unit = {
    // Create the main DataFrame with all information.
    var baseDF: DataFrame = spark.read.option("delimiter", "|").csv("/path_to_the_files/")

    // Cache the main DataFrame.
    baseDF.persist(StorageLevel.MEMORY_AND_DISK)
    // The first time the DataFrame is computed in an action, it will be kept in memory on the nodes.
    baseDF.count()

    // Create arrays with the different identifiers.
    var array1 = Array("6fefc487-bd57-4fa2-808a-3845703b83d0", "9baba76b-07c2-48ec-a153-6cfb8b138ecf")
    var array2 = Array("ab654369-77f5-478c-94e5-ee2755ae8571", "3b43e0a6-deba-4919-a2cc-9d450e28e0fe")
    var storage = Array(array1, array2)

    // Check if the main DataFrame is empty or not.
    if (baseDF.head(1).nonEmpty) {
      for (item <- storage) {
        val thread = new Thread(new CustomThread(baseDF, item))
        thread.start()
      }
    }
  }
}
CustomThread.scala:
package utils

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

class CustomThread(baseDF: DataFrame, item: Array[String]) extends Runnable {
  override def run(): Unit = {
    val df = baseDF.filter(col("col1").isin(item: _*))
    println("Count: " + df.count())
  }
}
I use the following configuration:
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max.mb: 1024
spark.executor.memory: 2g
spark.sql.autoBroadcastJoinThreshold: -1
spark.sql.files.ignoreCorruptFiles: true
spark.driver.memory: 30g
spark.driver.maxResultSize: 20g
spark.executor.cores: 1
spark.cores.max: 48
spark.scheduler.mode: FAIR
What do you want to achieve with your multithreading? In a Spark application you should not need to worry about the number of threads: what your code does is launch parallel jobs (the multithreading happens only on the driver); for the executors there is no difference.
In my experience, I only launch jobs in parallel if I have several jobs which are small or skewed, such that the cluster resources are not fully utilized. If I do that, I use Scala's parallel collections (sketched below).
To answer your question: the optimal number of threads is probably 1.
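A minimal sketch of the parallel-collections approach, reusing baseDF and storage from the question (it assumes the Scala parallel collections are available, which they are in the standard library up to Scala 2.12; the filter simply mirrors what CustomThread does):
import org.apache.spark.sql.functions.col

// Each element of storage becomes one concurrently submitted Spark job;
// the parallelism here is only on the driver side.
val counts = storage.par.map { item =>
  baseDF.filter(col("col1").isin(item: _*)).count()
}
counts.foreach(println)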
EDIT: I would suggest rewriting your code completely, with the goal of having all results in a new DataFrame; this is better than implementing complicated multithreading:
// test case
// assumes a SparkSession named spark and spark.implicits._ in scope (as in spark-shell) for toDF and $
import org.apache.spark.sql.functions.{array_contains, broadcast, count}

val baseDf = Seq(
  "6fefc487-bd57-4fa2-808a-3845703b83d0",
  "9baba76b-07c2-48ec-a153-6cfb8b138ecf",
  "ab654369-77f5-478c-94e5-ee2755ae8571",
  "dummy"
).toDF("col1")

var array1 = Seq("6fefc487-bd57-4fa2-808a-3845703b83d0", "9baba76b-07c2-48ec-a153-6cfb8b138ecf")
var array2 = Seq("ab654369-77f5-478c-94e5-ee2755ae8571", "3b43e0a6-deba-4919-a2cc-9d450e28e0fe")
var storage = Seq(array1, array2)

broadcast(storage.toDF("storage"))
  .join(baseDf, array_contains($"storage", $"col1"), "left")
  .groupBy($"storage").agg(count($"col1").as("count"))
  .show()
gives:
+--------------------+-----+
| storage|count|
+--------------------+-----+
|[ab654369-77f5-47...| 1|
|[6fefc487-bd57-4f...| 2|
+--------------------+-----+

This Akka stream sometimes doesn't finish

I have a graph that reads lines from multiple gzipped files and writes those lines to another set of gzipped files, mapped according to some value in each line.
It works correctly against small data sets, but fails to terminate on larger data. (It may not be the size of the data that's to blame, as I have not run it enough times to be sure - it takes a while).
def files: Source[File, NotUsed] =
  Source.fromIterator(
    () =>
      Files
        .fileTraverser()
        .breadthFirst(inDir)
        .asScala
        .filter(_.getName.endsWith(".gz"))
        .toIterator)

def extract =
  Flow[File]
    .mapConcat[String](unzip)
    .mapConcat(s =>
      (JsonMethods.parse(s) \ "tk").extract[Array[String]].map(_ -> s).to[collection.immutable.Iterable])
    .groupBy(1 << 16, _._1)
    .groupedWithin(1000, 1.second)
    .map { lines =>
      val w = writer(lines.head._1)
      w.println(lines.map(_._2).mkString("\n"))
      w.close()
      Done
    }
    .mergeSubstreams

def unzip(f: File) = {
  scala.io.Source
    .fromInputStream(new GZIPInputStream(new FileInputStream(f)))
    .getLines
    .toIterable
    .to[collection.immutable.Iterable]
}

def writer(tk: String): PrintWriter =
  new PrintWriter(
    new OutputStreamWriter(
      new GZIPOutputStream(
        new FileOutputStream(new File(outDir, s"$tk.json.gz"), true)
      ))
  )

val process = files.via(extract).toMat(Sink.ignore)(Keep.right).run()
Await.result(process, Duration.Inf)
The thread dump shows that the process is WAITING at Await.result(process, Duration.Inf) and nothing else is happening.
OpenJDK v11 with Akka v2.5.15
Most likely it's stuck in groupBy because it ran out of available threads in the dispatcher to gather items into 2^16 groups for all sources.
So if I were you, I'd probably implement the grouping in extract semi-manually, using statefulMapConcat with a mutable Map[KeyType, List[String]]. Or buffer lines with groupedWithin first and split them into groups that you write to different files in Sink.foreach, as sketched below.
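A minimal sketch of that second option, reusing the files, unzip and writer helpers (and the imports) from the question; the batch size and interval are arbitrary, and the per-key files are appended to because writer opens its FileOutputStream in append mode:
def extractBatches =
  Flow[File]
    .mapConcat[String](unzip)
    .mapConcat(s =>
      (JsonMethods.parse(s) \ "tk").extract[Array[String]].map(_ -> s).to[collection.immutable.Iterable])
    .groupedWithin(1000, 1.second)

val process = files
  .via(extractBatches)
  .toMat(Sink.foreach[Seq[(String, String)]] { batch =>
    // split each batch by key in plain Scala and append each group to its own file
    batch.groupBy(_._1).foreach { case (tk, lines) =>
      val w = writer(tk)
      w.println(lines.map(_._2).mkString("\n"))
      w.close()
    }
  })(Keep.right)
  .run()

Await.result(process, Duration.Inf)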

How to use COGROUP for large datasets

I have two RDDs, namely val tab_a: RDD[(String, String)] and val tab_b: RDD[(String, String)]. I'm using cogroup for those datasets like this:
val tab_c = tab_a.cogroup(tab_b).collect.toArray

val updated = tab_c.map { x =>
  //somecode
}
I'm using the tab_c cogrouped values for the map function and it works fine for small datasets, but for huge datasets it throws an Out Of Memory exception.
I have tried converting the final value to an RDD, but no luck, same error:
val newcos = spark.sparkContext.parallelize(tab_c)
1. How can I use cogroup for large datasets?
2. Can we persist the cogrouped value?
Code
val source_primary_key = source.map(rec => (rec.split(",")(0), rec))
source_primary_key.persist(StorageLevel.DISK_ONLY)

val destination_primary_key = destination.map(rec => (rec.split(",")(0), rec))
destination_primary_key.persist(StorageLevel.DISK_ONLY)

val cos = source_primary_key.cogroup(destination_primary_key).repartition(10).collect()

var srcmis: Array[String] = new Array[String](0)
var destmis: Array[String] = new Array[String](0)
var extrainsrc: Array[String] = new Array[String](0)
var extraindest: Array[String] = new Array[String](0)
var srcs: String = Seq("")(0)
var destt: String = Seq("")(0)

val updated = cos.map { x =>
  val key = x._1
  val value = x._2

  srcs = value._1.mkString(",")
  destt = value._2.mkString(",")

  if (srcs.equalsIgnoreCase(destt) == false && destt != "") {
    srcmis :+= srcs
    destmis :+= destt
  }

  if (srcs == "") {
    extraindest :+= destt.mkString("")
  }

  if (destt == "") {
    extrainsrc :+= srcs.mkString("")
  }
}
Code Updated:
val tab_c = tab_a.cogroup(tab_b).filter(x => x._2._1 != x._2._2)
// tab_c = {1,Compactbuffer(1,john,US),Compactbuffer(1,john,UK)},
//         {2,Compactbuffer(2,john,US),Compactbuffer(2,johnson,UK)} ...
ERROR:
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(4,3,ResultTask,FetchFailed(null,0,-1,27,org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
ERROR YarnScheduler: Lost executor 8 on datanode1: Container killed by YARN for exceeding memory limits. 1.0 GB of 1020 MB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Thank you
When you use collect() you are basically telling Spark to move all the resulting data back to the driver node, which can easily become a bottleneck. You are no longer using Spark at that point, just a plain array on a single machine.
To trigger computation, use something that requires the data at every node; that's why executors live on top of a distributed file system. For instance saveAsTextFile().
Here are some basic examples.
Remember, the entire objective here (that is, if you have big data) is to move the code to your data and compute there, not to bring all the data to the computation.
TL;DR Don't collect.
To run this code safely, without additional assumptions (on average the requirements for worker nodes might be significantly smaller), every node (the driver and each executor) would require memory significantly exceeding the total memory requirements of all the data.
If you were to run it outside Spark, you would need only one node, so Spark provides no benefit here.
However, if you skip collect.toArray and make some assumptions about the data distribution, you might run it just fine, as sketched below.
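A minimal sketch of that idea, assuming tab_a and tab_b are RDD[(String, String)] as in the question and that writing the mismatching keys out to storage is acceptable instead of building local arrays (the output path is hypothetical):
val mismatches = tab_a.cogroup(tab_b)
  .filter { case (_, (src, dest)) => src.mkString(",") != dest.mkString(",") }
  .map { case (key, (src, dest)) => s"$key|${src.mkString(",")}|${dest.mkString(",")}" }

// an action that writes from the executors instead of collecting to the driver
mismatches.saveAsTextFile("hdfs:///tmp/cogroup_mismatches") // hypothetical path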

Counting records of my RDDs in a large DStream

I am trying to work with a large RDD as read by a file DStream.
The code looks as follows:
val creatingFunc = { () =>
  val conf = new SparkConf()
    .setMaster("local[10]")
    .setAppName("FileStreaming")
    .set("spark.streaming.fileStream.minRememberDuration", "2000000h")
    .registerKryoClasses(Array(classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text], classOf[GGSN]))

  val sc = new SparkContext(conf)

  // Create a StreamingContext
  val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))

  val appFile = httpFileLines
    .map(x => (x._1, x._2.toString()))
    .filter(!_._2.contains("ggsnIPAddress"))
    .map(x => (x._1, x._2.split(",")))

  var count = 0
  appFile.foreachRDD(s => {
    // s.collect() throws an exception due to an insufficient amount of memory
    // s.count() throws an exception due to an insufficient amount of memory
    s.foreach(x => count = count + 1)
  })
  println(count)

  newContextCreated = true
  ssc
}
What I am trying to do is to get the count of my RDD. However, since it is large, it throws an exception, so I need to do a foreach instead to avoid collecting the data into memory.
I want to get the count the way my code does it, but it always gives 0.
Is there a way to do this?
There's no need to use foreachRDD and call count. You can use the count method defined on DStream:
val appFile = httpFileLines
  .map(x => (x._1, x._2.toString()))
  .filter(!_._2.contains("ggsnIPAddress"))
  .map(x => (x._1, x._2.split(",")))

val count = appFile.count()
If that still yields an insufficient-memory exception, you either need to process smaller batches of data each time, or enlarge your worker nodes to handle the load.
Regarding your solution, you should avoid the collect and sum the count of each RDD of the DStream.
var count = 0L
appFile.foreachRDD { rdd =>
  count = count + rdd.count()
}
But I find this solution very ugly (the use of a var in Scala).
I prefer the following solution:
val countStream = appFile.count()
Notice that count on a DStream returns a DStream[Long] (one count per batch interval) and not a plain Long, so you still need an output operation to actually consume the counts, as sketched below.
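A minimal sketch of consuming those per-batch counts, using the countStream value above (rdd.collect() is safe here because each RDD contains a single element):
countStream.foreachRDD { rdd =>
  rdd.collect().foreach(c => println(s"Records in this batch: $c"))
}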