How to minimize memory usage in akka streams - scala

I have stream that at some point will group objects to create files.
I think I can squeeze some bytes by serializing the object early in the stream.
But my biggest question is about how to optimize the memory footprint for a stream like this:
val sourceOfCustomer = Source.repeat(Customer(name = "test"))
def serializeCustomer(customer: Customer) = customer.toString
sourceOfCustomers
.via(serializeCustomer) // 1KB
.grouped(1000000) // 1GB
.via(processFile) // 1GB
.via(moreProcessing) // 1GB
.via(evenMoreProcessing) // 1GB
.to(fileSink) // 1GB
This gives me a memory usage at steady state of at least 5GB. Is this correct?
What strategy can I use to only limit it to 1 or 2GB? In principle it should be possible by collapsing the operators.
Note: I know a solution is to make the group smaller but let’s consider the size of the group a constraint of the problem.

Sorry, maybe I'm missing something, but I did not find group operation in latest Akka Stream documentation, I guess you mean grouped operation: https://doc.akka.io/docs/akka/current/stream/operators/Source-or-Flow/grouped.html .
If so, then it means that at .grouped(1000000) // 1GB you create group of elements in the stream , which can be handled simultaneously, hence one more then one group of 1GB size can present in memory in one moment of time. So in order to limit memory footprint in your stream up to 1GB, you can go with one of two ways:
1) Reduce number of large groups handled simultaneously.
This can be achieved with throttle operation: https://doc.akka.io/docs/akka/current/stream/operators/Source-or-Flow/throttle.html#throttle
Please, see code snippet for example
import scala.concurrent.duration._
...
.group(1000000) // 1GB
.throttle(1, 1 minute)
2) Reducing large group size
val parallelismLevel = Runtime.getRuntime.availableProcessors() // or another custom level which represents stream processing parallelism
val baseGroupSize = 1000000 // 1GB
val groupSize = baseGroupSize / parallelismLevel
sourceOfCustomers
.via(serializeCustomer) // 1KB
.group(groupSize)
Hope this helps!

Related

Memory efficient way to repartition a large dataset by key and applying a function separately for each group batch-by-batch

I have a large spark scala Dataset with a "groupName" column. Data records are spread along different partitions. I want to group records together by "groupName", collect batch-by-batch and apply a function on entire batch.
By "batch" I mean a predefined number of records (let's call it maxBatchCount) of the same group. By "batch-by-batch" I mean I want to use memory efficiently and not collect all partition to memory.
To be more specific, the batch function includes serialization, compression and encryption of the entire batch. This is later transformed into another dataset to be written to hdfs using partitionBy("groupName"). Therefore I can't avoid a full shuffling.
Is there a simple way for doing this? I made some attempt described below but TL/DR it seemed a bit over complicated and it eventually failed on Java memory issues.
Details
I tried to use a combination of repartition("groupName"), mapPartitions and Iterator's grouped(maxBatchCount) method which seemed very fit to the task. However, the repartitioning only makes sure records of the same groupName will be in the same partition, but a single partition might have records from several different groupName (if #groups > #partitions) and they can be scattered around inside the partition. So now I still need to do some grouping inside each partition first. The problem is that from mapPartition I get an Iterator which doesn't seem to have such API and I don't want to collect all data to memory.
Then I tried to enhance the above solution with Iterator's partition method. The idea is to first iterate the complete partition for building a Set of all the present groups and then use Iterator.partition to build a separate iterator for each of the present groups. And then use grouped as before.
It goes something like this - for illustration I used a simple case class of two Ints, and groupName is actually mod3 column, created by applying modulo 3 function for each number in the Range:
case class Mod3(number: Int, mod3: Int)
val maxBatchCount = 5
val df = spark.sparkContext.parallelize(Range(1,21))
.toDF("number").withColumn("mod3", col("number") % 3)
// here I choose #partitions < #groups for illustration
val dff = df.repartition(1, col("mod3"))
val dsArr = dff.as[Mod3].mapPartitions(partitionIt => {
// we'll need 2 iterations
val (it1, it2) = partitionIt.duplicate
// first iterate to create a Set of all present groups
val mod3set = it1.map(_.mod3).toSet
// build partitioned iterators map (one for each group present)
var it: Iterator[Mod3] = it2 // init var
val itMap = mod3set.map(mod3val => {
val (filteredIt, residueIt) = it.partition(_.mod3 == mod3val)
val pair = (mod3val -> filteredIt)
it = residueIt
pair
}).toMap
mod3set.flatMap(mod3val => {
itMap(mod3val).grouped(maxBatchCount).map(grp => {
val batch = grp.toList
batch.map(_.number).toArray[Int] // imagine some other batch function
})
}).toIterator
}).as[Array[Int]]
val dsArrCollect = dsArr.collect
dsArrCollect.map(_.toList).foreach(println)
This seemed to work nicely when testing with small data, but when running with actual data (on an actual spark cluster with 20 executors, 2 cores each) I received java.lang.OutOfMemoryError: GC overhead limit exceeded
Note in my actual data groups sizes are highly skewed and one of the groups is about the size of all the rest of the groups combined (I guess the GC memory issue is related to that group). Because of this I also tried to combine a secondary neutral column in repartition but it didn't help.
Will appreciate any pointers here,
Thanks!
I think you have the right approach with the repartition + map partitions.
The problem is that your map partition function ends up loading the entire partitions in memory.
First solution could be to increase the number of partitions and thus reduce the number of groups/ data in a partitions.
Another solution would be to use partitionIt.flatMap and process 1 record at time , accumulating only at most 1 group data
Use sortWithinPartitions so that records from the same group are consecutive
in the flatMap function, accumulate your data and keep track of group changes.

Spark streaming slow down when using large broadcast objects in UDF

I do stream processing from Event Hub using Spark and faced the following problem. For each incoming message I need to do some calculations (stateless). The calculation algorithm is written using Scala and is extremely efficient but needs some data structures constructed in advance. The object size is about 50MB, but in future could be larger. In order not to send the object to the workers each time, I do broadcasting. Then register an UDF. But it doesn't help, batch duration is growing significantly beyond the latency we could dwell. I figured out that batch duration depends solely on the object size. For testing purpose I tried to make the object smaller keeping computation complexity the same, and the batch duration decreased. Also, when the object is large, Spark UI marks GC red (more than 10% of work is due to garbage collection). It contradicts my understanding of broadcasting that when an object is broadcasted, that object should be downloaded into the workers' memory and persisted there without additional overhead.
I managed to write business domain agnostic example. Here when n is small, batch duration is about 0.3 second, but when n = 6000 (144MB), the batch duration becomes 1.5 (x5), and 4 seconds when n=10000. But computation complexity doesn't depend on the size of the object. So, it means, that using broadcast object has huge overhead. Please, help me to find the solution.
// emulate large precalculated object
val n = 10000
val obj = (1 to n).map(i => (1 to n).toArray).toArray
// broadcast it to the workers (should reduce overhead during execution)
val objBd = sc.broadcast(obj)
// register UDF
val myUdf = spark.udf.register("myUdf", (num: Int) => {
// emulate very efficient algorithm that requires large data structure
var i = (num+1)/(num+1)
objBd.value(i)(i)
})
// do stream processing
spark.readStream
.format("rate")
.option("rowsPerSecond", 300)
.load()
.withColumn("result", myUdf($"value"))
.writeStream
.format("memory")
.queryName("locations")
.start()

Parallelism in Cassandra read using Scala

I am trying to invoke parallel reading from Cassandra table using spark. But I am not able to invoke parallelism as only one reads is happening any given time. What approach should be followed to achieve the same?
I'd recommend you go with below approach source Russell Spitzer's Blog
Manually dividing our partitions using a Union of partial scans :
Pushing the task to the end-user is also a possibility (and the current workaround.) Most end users already understand why they have long partitions and know in general the domain their column values fall in. This makes it possible for them to manually divide up a request so that it chops up large partitions.
For example, assuming the user knows clustering column c spans from 1 to 1000000. They could write code like
val minRange = 0
val maxRange = 1000000
val numSplits = 10
val subSize = (maxRange - minRange) / numSplits
sc.union(
(minRange to maxRange by subSize)
.map(start =>
sc.cassandraTable("ks", "tab")
.where("c > $start and c < ${start + subSize}"))
)
Each RDD would contain a unique set of tasks drawing only portions of full partitions. The union operation joins all those disparate tasks into a single RDD. The maximum number of rows any single Spark Partition would draw from a single Cassandra partition would be limited to maxRange/ numSplits. This approach, while requiring user intervention, would preserve locality and would still minimize the jumps between disk sectors.
Also read-tuning-parameters

Spark: stage X contains task of a very large size when running sc.binaryFiles()

I'm trying to load ~1M file set stored on S3. When running sc.binaryFiles("s3a://BUCKETNAME/*").count()
I'm getting WARN TaskSetManager: Stage 0 contains a task of very large size (177 KB). The maximum recommended task size is 100 KB. This is followed by failed tasks
I see that it infers 128 partitions for this stage, which is too low, note that when running the same command on 400K files bucket, number of partitions will be much higher (~2K partitions) and the action will succeed.
setting a higher minPartitions didn't help;
setting a higher spark.default.parallelism didn't help as well.
the only thing that worked was to create multiple smaller RDDs of 1000 files each, and running sc.union on them, Problem with this approach is that it's too slow.
How can this issue be mitigated?
UPDATE:
went on to see how number of partitions is settled in BinaryFileRDD.getPartitions() which got me to this piece of code:
def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
val defaultParallelism = sc.defaultParallelism
val files = listStatus(context).asScala
val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
super.setMaxSplitSize(maxSplitSize)
}
I followed the computation and it still didn't make sense, I should get a much larger number.
So I tried to reduce the config.FILES_MAX_PARTITION_BYTES config (spark.files.maxPartitionBytes) - this did increase the number of partitions, and made the job finish, however I'm still getting the original warning (with a somewhat smaller task size), and still, the munber of partitions is way smaller than when running on a 400K file set.
The problem was rooted in the sizes of the files: To my surprise, the files in s3 were not uploaded properly, and their size was 100 times smaller than they should've been. This caused setMinPartitions to calculate splits that contained large amount of small files. Each split is essentially a comma separated string of file paths, since we have many files per split, we got a very long instruction string that should be communicated to all workers. This burdened the network and caused the entire flow to fail. Setting spark.files.maxPartitionBytes to a lower value solved the issue.

YARN killing containers for exceeding memory limits

I am running into an issue where YARN is killing my containers for exceeding memory limits:
Container killed by YARN for exceeding memory limits. physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I have 20 nodes that are of m3.2xlarge so they have:
cores: 8
memory: 30
storage: 200 gb ebs
The gist of my application is that I have a couple 100k assets for which I have historical data generated for each hour of the last year, with a total dataset size of 2TB uncompressed. I need to use this historical data to generate a forecast for each asset. My setup is that I first use s3distcp to move the data stored as indexed lzo files to hdfs. I then pull the data in and pass it to sparkSql to handle the json:
val files = sc.newAPIHadoopFile("hdfs:///local/*",
classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],
classOf[org.apache.hadoop.io.Text],conf)
val lzoRDD = files.map(_._2.toString)
val data = sqlContext.read.json(lzoRDD)
I then use a groupBy to group the historical data by asset, creating a tuple of (assetId,timestamp,sparkSqlRow). I figured this data structure would allow for better in memory operations when generating the forecasts per asset.
val p = data.map(asset => (asset.getAs[String]("assetId"),asset.getAs[Long]("timestamp"),asset)).groupBy(_._1)
I then use a foreach to iterate over each row, calculate the forecast, and finally write the forecast back out as a json file to s3.
p.foreach{ asset =>
(1 to dateTimeRange.toStandardHours.getHours).foreach { hour =>
// determine the hour from the previous year
val hourFromPreviousYear = (currentHour + hour.hour) - timeRange
// convert to seconds
val timeToCompare = hourFromPreviousYear.getMillis
val al = asset._2.toList
println(s"Working on asset ${asset._1} for hour $hour with time-to-compare: $timeToCompare")
// calculate the year over year average for the asset
val yoy = calculateYOYforAsset2(al, currentHour, asset._1)
// get the historical data for the asset from the previous year
val pa = asset._2.filter(_._2 == timeToCompare)
.map(row => calculateForecast(yoy, row._3, asset._1, (currentHour + hour.hour).getMillis))
.foreach(json => writeToS3(json, asset._1, (currentHour + hour.hour).getMillis))
}
}
Is there a better way to accomplish this so that I don't hit the memory issue with YARN?
Is there a way to chunk the assets so that the foreach only operates on about 10k at a time vs all 200k of the assets?
Any advice/help appreciated!
Its not your code. And don't worry foreach does not run all those lambdas concurrently. The problem is that Spark's default value of spark.yarn.executor.memoryOverhead (or recently renamed in 2.3+ to spark.executor.memoryOverhead) is overly conservative which causes your executors to be killed when under load.
The solution is (as suggested by the error message) to increase that value. I would start with setting it to 1GB (set to 1024) or more if you are requesting lots of memory for each executor. The goal is to get jobs running without any executors being killed.
Alternatively if you control the cluster, you could disable YARN memory enforcement via by setting the configs yarn.nodemanager.pmem-check-enabled and yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml