Spark: stage X contains task of a very large size when running sc.binaryFiles() - scala

I'm trying to load a set of ~1M files stored on S3. When running sc.binaryFiles("s3a://BUCKETNAME/*").count()
I'm getting WARN TaskSetManager: Stage 0 contains a task of very large size (177 KB). The maximum recommended task size is 100 KB. This is followed by failed tasks.
I see that it infers 128 partitions for this stage, which is too low. Note that when running the same command on a 400K-file bucket, the number of partitions is much higher (~2K partitions) and the action succeeds.
Setting a higher minPartitions didn't help;
setting a higher spark.default.parallelism didn't help either.
The only thing that worked was to create multiple smaller RDDs of 1,000 files each and run sc.union on them; the problem with this approach is that it's too slow.
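For reference, a rough sketch of that batching workaround (the batch size of 1,000 is from above; allPaths is a hypothetical Seq[String] of the S3 object paths listed by some external means, and passing a comma-separated path list to binaryFiles is an assumption):
val batches = allPaths.grouped(1000).toSeq                            // hypothetical list of s3a:// paths, 1000 per batch
val rdds = batches.map(batch => sc.binaryFiles(batch.mkString(",")))  // one small RDD per batch
val all = sc.union(rdds)                                              // union them back into a single RDD
all.count()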
How can this issue be mitigated?
UPDATE:
I went on to look at how the number of partitions is determined in BinaryFileRDD.getPartitions(), which led me to this piece of code:
def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
  val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
  val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
  val defaultParallelism = sc.defaultParallelism
  val files = listStatus(context).asScala
  val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  super.setMaxSplitSize(maxSplitSize)
}
I followed the computation and it still didn't make sense; I should be getting a much larger number.
So I tried reducing the config.FILES_MAX_PARTITION_BYTES setting (spark.files.maxPartitionBytes). This did increase the number of partitions and made the job finish; however, I'm still getting the original warning (with a somewhat smaller task size), and the number of partitions is still much smaller than when running on a 400K file set.

The problem was rooted in the sizes of the files: to my surprise, the files in S3 had not been uploaded properly and were 100 times smaller than they should have been. This caused setMinPartitions to compute splits containing a large number of small files. Each split is essentially a comma-separated string of file paths; since we had many files per split, we got a very long instruction string that had to be communicated to all the workers. This burdened the network and caused the entire flow to fail. Setting spark.files.maxPartitionBytes to a lower value solved the issue.
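For anyone hitting the same symptom, a minimal sketch of that fix (the 32 MB value is an assumption; pick something that keeps the number of files per split reasonable):
import org.apache.spark.{SparkConf, SparkContext}

// Lower spark.files.maxPartitionBytes (default 128 MB) so that each split holds fewer
// of the tiny files, keeping the serialized list of file paths per task small.
val conf = new SparkConf()
  .setAppName("binaryFiles-small-files")
  .set("spark.files.maxPartitionBytes", (32 * 1024 * 1024).toString)
val sc = new SparkContext(conf)

sc.binaryFiles("s3a://BUCKETNAME/*").count()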

Related

dask dataframe groupby resulting in one partition memory issue

I am reading 64 compressed CSV files (probably 70-80 GB) into one Dask dataframe, then running a groupby with aggregations.
The job never completes because apparently the groupby creates a dataframe with only one partition.
This post and this post already address this issue, but they focus on the computational graph and not on the memory issue you run into when your resulting dataframe is too large.
I tried a workaround with repartitioning, but the job still won't complete.
What am I doing wrong? Will I have to use map_partitions? This is very confusing, as I expected Dask to take care of partitioning everything even after aggregation operations.
import dask
import dask.dataframe as dd
from dask.distributed import Client, progress
from dask.diagnostics import ProgressBar

client = Client(n_workers=4, threads_per_worker=1, memory_limit='8GB', diagnostics_port=5000)
client
dask.config.set(scheduler='processes')

dB3 = dd.read_csv("boden/expansion*.csv",  # read in parallel
                  blocksize=None,          # 64 files
                  sep=',',
                  compression='gzip')

aggs = {
    'boden': ['count', 'min']
}

dBSelect = dB3.groupby(['lng', 'lat']).agg(aggs).repartition(npartitions=64)
dBSelect = dBSelect.reset_index()
dBSelect.columns = ['lng', 'lat', 'bodenCount', 'boden']
dBSelect = dBSelect.drop('bodenCount', axis=1)

with ProgressBar(dt=30):
    dBSelect.compute().to_parquet('boden/final/boden_final.parq', compression=None)
Most groupby aggregation outputs are small and fit easily in one partition. Clearly this is not the case in your situation.
To resolve this you should use the split_out= parameter to your groupby aggregation to request a certain number of output partitions.
df.groupby(['x', 'y', 'z']).mean(split_out=10)
Note that using split_out= will significantly increase the size of the task graph (it has to mildly shuffle/sort your data ahead of time) and so may increase scheduling overhead.

Spark configurations for Out of memory error [duplicate]

Cluster setup -
Driver has 28gb
Workers have 56gb each (8 workers)
Configuration -
spark.memory.offHeap.enabled true
spark.driver.memory 20g
spark.memory.offHeap.size 16gb
spark.executor.memory 40g
My job -
// myFunc just takes a string s and does some transformations on it; the strings are very small, but there are about 10 million of them to process.
// Out-of-memory failure
data.map(s => myFunc(s)).saveAsTextFile(outFile)
// works fine
data.map(s => myFunc(s))
Also, I de-clustered / removed Spark from my program and it completed just fine (successfully saved to a file) on a single server with 56 GB of RAM. This suggests it is just a Spark configuration issue. I reviewed https://spark.apache.org/docs/latest/configuration.html#memory-management and the configurations I currently have seem to be all that should need to change for my job to work. What else should I be changing?
Update -
Data -
val fis: FileInputStream = new FileInputStream(new File(inputFile))
val bis: BufferedInputStream = new BufferedInputStream(fis);
val input: CompressorInputStream = new CompressorStreamFactory().createCompressorInputStream(bis);
br = new BufferedReader(new InputStreamReader(input))
val stringArray = br.lines().toArray()
val data = sc.parallelize(stringArray)
Note - this does not cause any memory issues, even though it is incredibly inefficient. I can't read the file with Spark directly because it throws some EOF errors.
As for myFunc, I can't really post the code because it's complex. But basically, the input string is a delimited string; it does some delimiter replacement, date/time normalization, and things like that. The output string will be roughly the same size as the input string.
Also, it works fine for smaller data sizes, and the output is correct and roughly the same size as the input data file, as it should be.
Your current solution does not take advantage of Spark. You are loading the entire file into an array in memory, then using sc.parallelize to distribute it into an RDD. This is hugely wasteful of memory (even without Spark) and will of course cause out-of-memory problems for large files.
Instead, use sc.textFile(filePath) to create your RDD. Then Spark is able to read and process the file in chunks, so only a small portion of it needs to be in memory at a time. You also get parallelism this way, as Spark will read and process the file in parallel with however many executors and cores you have, instead of needing to read the entire file on a single thread on a single machine.
Assuming that myFunc only needs to look at a single line at a time, this program should have a very small memory footprint.
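A minimal sketch of that approach, reusing the names from the question (note that if the file is gzip-compressed it will be read as a single partition, since gzip is not splittable, so a repartition after reading may still help):
// Sketch: let Spark read and partition the input itself instead of materializing
// the whole file on the driver. inputFile, outFile and myFunc come from the question.
val data = sc.textFile(inputFile)   // lazily reads the file, one line per record
data.map(s => myFunc(s))            // transform line by line
  .saveAsTextFile(outFile)          // write the results in parallel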
It would help if you put more details of what is going on in your program before and after the map.
The second command (only the map) does not do anything unless an action is triggered. Your file is probably not partitioned and the driver is doing the work. The code below should force the data onto the workers evenly and protect against OOM on a single node; it will cause shuffling of the data, though.
Updating the solution after looking at your code; it will be better if you do this:
val data = sc.parallelize(stringArray).repartition(8)
data.map(s => myFunc(s)).saveAsTextFile(outFile)

Spark shuffle read takes significant time for small data

We are running the following stage DAG and experiencing long shuffle read times for relatively small shuffle data sizes (about 19 MB per task).
One interesting aspect is that waiting tasks within each executor/server have identical shuffle read times. Here is an example of what this means: on the following server, one group of tasks waits about 7.7 minutes and another group waits about 26 s.
Here is another example from the same stage run. The figure shows 3 executors / servers, each having uniform groups of tasks with equal shuffle read times. The blue group represents tasks killed due to speculative execution:
Not all executors are like that. There are some that finish all their tasks within seconds, pretty much uniformly, and the size of the remote read data for these tasks is the same as for the ones that wait a long time on other servers.
Besides, this type of stage runs 2 times within our application runtime. The servers/executors that produce these groups of tasks with large shuffle read times are different in each stage run.
Here is an example of the task stats table for one of the servers / hosts:
It looks like the code responsible for this DAG is the following:
output.write.parquet("output.parquet")
comparison.write.parquet("comparison.parquet")
output.union(comparison).write.parquet("output_comparison.parquet")
val comparison = data.union(output).except(data.intersect(output)).cache()
comparison.filter(_.abc != "M").count()
We would highly appreciate your thoughts on this.
Apparently the problem was JVM garbage collection (GC). The tasks had to wait until GC finished on the remote executors. The identical shuffle read times resulted from the fact that several tasks were waiting on a single remote host that was performing GC. We followed the advice posted here and the problem decreased by an order of magnitude. There is still a small correlation between GC time on remote hosts and local shuffle read time. In the future we plan to try the external shuffle service.
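For completeness, a minimal sketch of the shuffle-service idea plus illustrative GC flags (these are assumptions on our part, not the exact settings from the linked advice):
import org.apache.spark.SparkConf

// Let the external shuffle service serve shuffle blocks so a remote executor that is
// busy with GC does not block shuffle reads (the service must also be running on each node).
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  // Illustrative executor GC tuning; G1 typically gives shorter pauses than the default collector.
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc")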
Since Google brought me here with the same problem but I needed another solution...
Another possible reason for a small shuffle taking a long time to read is that the data is split over many partitions. For example (apologies, this is PySpark, as it is all I have used):
(my_df_with_many_partitions       # say this has 1000 partitions
    .filter(very_specific_filter) # only very few rows pass
    .groupby('blah')
    .count())
The shuffle write from the filter above will be very small, so the following stage will have a very small amount of data to read. But to read it, you have to check a lot of (mostly empty) partitions. One way to address this would be:
(my_df_with_many_partitions
    .filter(very_specific_filter)
    .repartition(1)
    .groupby('blah')
    .count())

YARN killing containers for exceeding memory limits

I am running into an issue where YARN is killing my containers for exceeding memory limits:
Container killed by YARN for exceeding memory limits. physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I have 20 nodes of type m3.2xlarge, so they have:
cores: 8
memory: 30 GB
storage: 200 GB EBS
The gist of my application is that I have a couple hundred thousand assets for which I have historical data generated for each hour of the last year, with a total dataset size of 2 TB uncompressed. I need to use this historical data to generate a forecast for each asset. My setup is that I first use s3distcp to move the data, stored as indexed LZO files, to HDFS. I then pull the data in and pass it to Spark SQL to handle the JSON:
val files = sc.newAPIHadoopFile("hdfs:///local/*",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text], conf)
val lzoRDD = files.map(_._2.toString)
val data = sqlContext.read.json(lzoRDD)
I then use a groupBy to group the historical data by asset, creating tuples of (assetId, timestamp, sparkSqlRow). I figured this data structure would allow for better in-memory operations when generating the forecasts per asset.
val p = data.map(asset => (asset.getAs[String]("assetId"), asset.getAs[Long]("timestamp"), asset)).groupBy(_._1)
I then use a foreach to iterate over each asset, calculate the forecast, and finally write the forecast back out as a JSON file to S3.
p.foreach { asset =>
  (1 to dateTimeRange.toStandardHours.getHours).foreach { hour =>
    // determine the hour from the previous year
    val hourFromPreviousYear = (currentHour + hour.hour) - timeRange
    // convert to seconds
    val timeToCompare = hourFromPreviousYear.getMillis
    val al = asset._2.toList
    println(s"Working on asset ${asset._1} for hour $hour with time-to-compare: $timeToCompare")
    // calculate the year over year average for the asset
    val yoy = calculateYOYforAsset2(al, currentHour, asset._1)
    // get the historical data for the asset from the previous year
    val pa = asset._2.filter(_._2 == timeToCompare)
      .map(row => calculateForecast(yoy, row._3, asset._1, (currentHour + hour.hour).getMillis))
      .foreach(json => writeToS3(json, asset._1, (currentHour + hour.hour).getMillis))
  }
}
Is there a better way to accomplish this so that I don't hit the memory issue with YARN?
Is there a way to chunk the assets so that the foreach only operates on about 10k at a time vs all 200k of the assets?
Any advice/help appreciated!
It's not your code, and don't worry: foreach does not run all those lambdas concurrently. The problem is that Spark's default value for spark.yarn.executor.memoryOverhead (renamed in 2.3+ to spark.executor.memoryOverhead) is overly conservative, which causes your executors to be killed when under load.
The solution is (as the error message suggests) to increase that value. I would start by setting it to 1 GB (i.e. 1024), or more if you are requesting lots of memory for each executor. The goal is to get the job running without any executors being killed.
Alternatively, if you control the cluster, you could disable YARN memory enforcement by setting yarn.nodemanager.pmem-check-enabled and yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml.
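A hedged sketch of the first option (the 2048 MB figure is illustrative; the same setting can be passed to spark-submit as --conf spark.yarn.executor.memoryOverhead=2048):
import org.apache.spark.SparkConf

// Increase the off-heap overhead that YARN accounts for per executor.
// This must be set before the SparkContext is created; on Spark 2.3+ use
// spark.executor.memoryOverhead instead.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "2048")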

Why does the task size keep growing when using updateStateByKey?

I wrote a simple function to use with updateStateByKey in order to see whether the problem was caused by my updateFunc. I think it must be due to something else. I am running this with --master local[4].
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  Some(1)
}
val state = test.updateStateByKey[Int](updateFunc)
After a while, there are warnings, and the task size keeps increasing.
WARN TaskSetManager: Stage x contains a task of very large size (129 KB). The maximum recommended task size is 100 KB.
WARN TaskSetManager: Stage x contains a task of very large size (131 KB). The maximum recommended task size is 100 KB.
You have more and more distinct keys coming in from your stream, each of which results in a new copy of 1 being added to your state.
Currently, updateStateByKey scans every key in every batch interval, even if there is no data for that key. This causes the batch processing time of updateStateByKey to grow with the number of keys in the state, even if the data rate stays fixed.
There is a proposal to solve this.
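For what it's worth, the direction that work took is the mapWithState API (available since Spark 1.6), which only processes keys that actually have data in the current batch. A minimal sketch mirroring the updateFunc above (assuming test is a DStream of (String, Int) pairs; the String key type is an assumption):
import org.apache.spark.streaming.{State, StateSpec}

// mapWithState only invokes the function for keys present in the current batch,
// so per-batch work no longer grows with the total number of keys held in state.
val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  state.update(1)   // mirrors the original updateFunc, which always stored Some(1)
  (key, 1)
}
val stateDstream = test.mapWithState(StateSpec.function(mappingFunc))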