increase task size spark [duplicate] - scala

This question already has answers here:
Spark using python: How to resolve Stage x contains a task of very large size (xxx KB). The maximum recommended task size is 100 KB
I ran into a problem when executing my code in spark-shell.
[Stage 1:> (0 + 0) / 16]
17/01/13 06:09:24 WARN TaskSetManager: Stage 1 contains a task of very large size (1057 KB). The maximum recommended task size is 100 KB.
[Stage 1:> (0 + 4) / 16]
After this warning, the execution blocks.
How can I solve it?
I tried this, but it doesn't solve the problem:
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[*]")
  .set("spark.driver.maxResultSize", "3g")
  .set("spark.executor.memory", "3g")
val sc = new SparkContext(conf)

I had a similar error:
scheduler.TaskSetManager: Stage 2 contains a task of very large size
(34564 KB). The maximum recommended task size is 100 KB
My input data was ~150 MB with 4 partitions (i.e., each partition was ~30 MB). That explains the 34564 KB task size mentioned in the error message above.
Reason:
A task is the smallest unit of work in Spark and acts on a partition of your input data. So if Spark warns that a task's size is larger than recommended, it means the partition it is handling holds far too much data.
Solution that worked for me:
reduce the task size => reduce the data it handles => increase
numPartitions to break the data down into smaller chunks
So I tried increasing the number of partitions, and that got rid of the error.
You can check the number of partitions of a dataframe via df.rdd.getNumPartitions
To increase the number of partitions: df.repartition(100)
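For instance, a minimal sketch (assuming a DataFrame named df already exists in the session) that checks the partition count and then splits the data into more, smaller partitions:
val currentPartitions = df.rdd.getNumPartitions // e.g. 16 in the question above
// 100 is just an example value; pick a count that keeps each partition comfortably small
val repartitioned = df.repartition(100)
println(s"before: $currentPartitions, after: ${repartitioned.rdd.getNumPartitions}")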

It is most likely because of large variables being captured by one of your tasks.
The accepted answer to this question should help you.

Related

Spark- write 128 MB size parquet files

I have a DataFrame (df) with more than 1 billion rows
df.coalesce(5)
.write
.partitionBy("Country", "Date")
.mode("append")
.parquet(datalake_output_path)
From the above command I understand that only 5 worker nodes in my 100-worker-node cluster (Spark 2.4.5) will be performing all the tasks. Using coalesce(5) makes the process take 7 hours to complete.
Should I try repartition instead of coalesce?
Is there a faster or more efficient way to write out 128 MB parquet files, or do I need to first calculate the size of my dataframe to determine how many partitions are required?
For example, if the size of my dataframe is 1 GB and spark.sql.files.maxPartitionBytes = 128 MB, should I first calculate the number of partitions required as 1 GB / 128 MB = approximately 8 and then do repartition(8) or coalesce(8)?
The idea is to maximize the size of the parquet files in the output at write time, and to be able to do so quickly.
You can get the size (dfSizeDiskMB) of your dataframe df by persisting it and then checking the Storage tab on the Web UI as in this answer. Armed with this information and an estimate of the expected Parquet compression ratio you can then estimate the number of partitions you need to achieve your desired output file partition size e.g.
val targetOutputPartitionSizeMB = 128
val parquetCompressionRatio = 0.1 // estimated Parquet compression ratio
val numOutputPartitions = (dfSizeDiskMB * parquetCompressionRatio / targetOutputPartitionSizeMB).ceil.toInt
df.coalesce(numOutputPartitions).write.parquet(path)
Note that spark.files.maxPartitionBytes is not relevant here as it is:
The maximum number of bytes to pack into a single partition when reading files.
(Unless df is the direct result of reading an input data source with no intermediate dataframes created. More likely, the number of partitions for df is dictated by spark.sql.shuffle.partitions, the number of partitions Spark uses for dataframes created from joins and aggregations.)
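As an illustration (not part of the original answer, and assuming spark is the active SparkSession), that setting can be read and changed at runtime:
// defaults to 200; joins and aggregations produce this many partitions
println(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "400") // example value only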
Should I try repartition instead of coalesce?
coalesce is usually better as it can avoid the shuffle associated with repartition, but note the warning in the docs about potentially losing parallelism in the upstream stages depending on your use case.
Coalesce is better if you are going from a higher number of partitions to a lower number. However, if your code does not do a shuffle before writing the df, the coalesce will be pushed down to the earliest possible point in the DAG.
What you can do is process your df in, say, 100 partitions (or whatever number seems appropriate) and then persist it before writing the df.
Then bring the number of partitions down to 5 using coalesce and write it. This should give you better performance.
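A rough sketch of that suggestion, reusing the names from the question above (the transformation step in the middle is a placeholder):
import org.apache.spark.storage.StorageLevel

val processed = df
  .repartition(100) // do the heavy work with plenty of parallelism
  // ... your transformations here ...
  .persist(StorageLevel.MEMORY_AND_DISK)

processed.count() // materialize the persisted data before the low-parallelism write

processed
  .coalesce(5)
  .write
  .partitionBy("Country", "Date")
  .mode("append")
  .parquet(datalake_output_path)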

What is the expected behavior for in-memory data-structures in Spark executors?

I want to understand whether I am expecting the following behavior correctly.
Let's say I have 100 executors, each with 4 cores (meaning 4 threads).
I am processing a very large RDD, and the rows inside contain a some_class that could be invalid; if it is, I don't want to process the given row.
I don't want to use a broadcast, since the invalid rows are determined to be invalid on the fly (during the RDD processing).
I thought of using an in-memory set, and in the worst-case scenario each executor would process a "bad" row once - I am OK with that.
Am I expecting the behavior correctly, or am I missing something?
import scala.collection.mutable

val some_set = mutable.HashSet[String]()

some_rdd
  .filterNot(r => some_set.contains(r.some_class.id))
  .map(some_row => {
    try {
      some_def(some_row)
    } catch {
      case e: Throwable =>
        // remember the bad id so later rows handled by this executor get filtered out
        some_set.add(some_row.some_class.id)
        log.info("some error")
    }
  })
In your example some_set will be serialized and sent to the executors together with the task code. If, say, some_set holds 10,000 ids, then the task size of your Spark program will be approximately 200 KB (10,000 x 20 chars). That is within the current maximum recommended task size of 1 MB. On the other hand, if the task size grows far beyond that, say to 1 GB, you should expect a warning similar to:
Stage 1 contains a task of very large size (1024 MB). The maximum
recommended task size is 1000 KB.
If for some reason the size of some_set grows beyond the 1 MB limit in the future, consider using a broadcast variable.
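For reference, a sketch of the broadcast approach (assuming sc is the SparkContext and knownInvalidIds is a plain Set[String] computed beforehand, e.g. in a separate pass):
// one read-only copy per executor instead of one copy per task
val invalidIds = sc.broadcast(knownInvalidIds)
val valid_rdd = some_rdd.filter(r => !invalidIds.value.contains(r.some_class.id))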

Spark Job stuck between stages after join

I have a Spark job which joins 2 datasets, performs some transformations, and reduces the data to give the output.
The input size for now is pretty small (200 MB per dataset), but after the join, as you can see in the DAG, the job is stuck and never proceeds with stage-4. I tried waiting for hours; it gave an OOM and showed failed tasks for stage-4.
Why doesn't Spark show stage-4 (the data transformation stage) as active after stage-3 (the join stage)? Is it stuck in the shuffle between stages 3 and 4?
What can I do to improve the performance of my Spark job? I tried increasing the shuffle partitions and still got the same result.
Job code:
joinedDataset.groupBy("group_field")
  .agg(collect_set("name").as("names")).select("names").as[List[String]]
  .rdd // converting to RDD since I need to use reduceByKey
  .flatMap(entry => generatePairs(entry)) // generates pairs of words out of the input text, so data size increases here
  .map(pair => ((pair._1, pair._2), 1))
  .reduceByKey(_ + _)
  .sortBy(entry => entry._2, ascending = false)
  .coalesce(1)
FYI, my cluster has 3 worker nodes with 16 cores and 100 GB RAM each, and 3 executors with 16 cores (a 1:1 ratio with the machines for simplicity) and 64 GB memory allocated.
UPDATE:
Turns out the data generated in my job is pretty huge. I did some optimisations (strategically reduced the input data and removed some duplicate strings from processing), and now the job finishes within 3 hours. The input for stage-4 is 200 MB and the output is 200 GB. It uses parallelism properly, but it struggles in the shuffle. My shuffle spill during this job was 1825 GB (memory) and 181 GB (disk). Can someone help me reduce the shuffle spill and the duration of the job? Thanks.
Try an initial sort on the executors and then reduce + sort:
joinedDataset.groupBy("group_field")
  .agg(collect_set("name").as("names")).select("names").as[List[String]]
  .rdd // converting to RDD since I need to use reduceByKey
  .flatMap(entry => generatePairs(entry)) // generates pairs of words out of the input text, so data size increases here
  .map(pair => ((pair._1, pair._2), 1))
  .sortBy(entry => entry._2, ascending = false) // do an initial sort on the executors
  .reduceByKey(_ + _)
  .sortBy(entry => entry._2, ascending = false)
  .coalesce(1)
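As a separate, hedged tweak (not part of the answer above): reduceByKey also accepts an explicit partition count, so spreading the reduce over more partitions can shrink the per-task shuffle blocks that spill to disk. In the sketch below, pairRdd stands for the ((word1, word2), 1) RDD produced by the .map step in the snippet:
pairRdd
  .reduceByKey(_ + _, 400) // 400 is a placeholder; more reduce partitions => smaller per-task shuffle blocks
  .sortBy(entry => entry._2, ascending = false)
  .coalesce(1) // single output, but only after the reduce has shrunk the data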

Why Spark repartition leads to MemoryOverhead?

So the question is in the subject. I think I don't correctly understand how repartition works. In my mind, when I say somedataset.repartition(600), I expect all the data to be partitioned into equal-sized partitions across the workers (let's say 60 workers).
So, for example, I have a big chunk of data to load from unbalanced files, let's say 400 files, where 20% are about 2 GB in size and the other 80% are about 1 MB. I have this code to load the data:
val source = sparkSession.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter","\t")
.load(mypath)
Then I want to convert the raw data to my intermediate object, filter out irrelevant records, convert to the final object (with additional attributes), and then partition by some columns and write to parquet. In my mind it seems reasonable to balance the data (40000 partitions) across the workers and then do the work like this:
val ds: Dataset[FinalObject] = source.repartition(600)
.map(parse)
.filter(filter.IsValid(_))
.map(convert)
.persist(StorageLevel.DISK_ONLY)
val count = ds.count
log(count)
val partitionColumns = List("region", "year", "month", "day")
ds.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
.write.partitionBy(partitionColumns:_*)
.format("parquet")
.mode(SaveMode.Append)
.save(destUrl)
But it fails with
ExecutorLostFailure (executor 7 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
34.6 GB of 34.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When I do not do the repartition, everything is fine. What am I not understanding about repartition?
Your logic is correct for repartition as well as partitionBy, but before using repartition you need to keep this in mind (from several sources):
Keep in mind that repartitioning your data is a fairly expensive
operation. Spark also has an optimized version of repartition() called
coalesce() that allows avoiding data movement, but only if you are
decreasing the number of RDD partitions.
If you want the job to complete, then increase the driver and executor memory.
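For instance, a hedged sketch of bumping those settings (the values are placeholders, not recommendations). Executor settings can go on the SparkConf before the context is created; driver memory usually has to be passed to spark-submit via --driver-memory rather than set in code:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.executor.memory", "16g")
  // the property named in the YARN error above; value is in MB, and it is called
  // spark.executor.memoryOverhead on newer Spark versions
  .set("spark.yarn.executor.memoryOverhead", "4096")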

knowing the size of a broadcasted variable in spark

I have broadcast a variable in Spark (Scala), but because of the size of the data it gives output like this:
WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, 10.240.0.33): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
When run on a smaller database it works fine. I want to know the size of this broadcast variable (in MB/GB). Is there a way to find this?
Assuming you are trying to broadcast obj, you can find its size as follows:
import org.apache.spark.util.SizeEstimator
val objSize = SizeEstimator.estimate(obj)
Note that this is an estimate, which means it is not 100% exact.
This is because the driver runs out of memory. By default driver memory is 1g; it can be increased using --driver-memory 4g. By default Spark will broadcast a dataframe in a join when it is smaller than about 10 MB, although I found that broadcasting bigger dataframes is also not a problem. This can significantly speed up the join, but when the dataframe becomes too big it can even slow the join down because of the overhead of broadcasting all the data to the different executors.
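As an illustration of that behaviour (identifiers are placeholders): the automatic cut-off is controlled by spark.sql.autoBroadcastJoinThreshold (about 10 MB by default), and a broadcast join can also be requested explicitly with the broadcast() hint:
import org.apache.spark.sql.functions.broadcast

// raise the automatic broadcast threshold to ~50 MB (value is in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)
// or force a broadcast of the smaller side regardless of its size
val joined = bigDf.join(broadcast(smallDf), Seq("id"))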
What is your data source? When the table is read into Spark, the SQL tab of the Web UI (open the DAG diagram of the query you are executing) should give some metadata about the number of rows and the size. Otherwise you can also check the actual size on HDFS using hdfs dfs -du /path/to/table/.
Hope this helps.