Why Spark repartition leads to MemoryOverhead? - scala

So question is in the subject. I think I dont understand correctly the work of repartition. In my mind when I say somedataset.repartition(600) I expect all data would be partioned by equal size across the workers (let say 60 workers).
So for example. I would have a big chunk of data to load in unbalanced files, lets say 400 files, where 20 % are 2Gb size and others 80% are about 1 Mb. I have the code to load this data:
val source = sparkSession.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter","\t")
.load(mypath)
Than I want convert raw data to my intermediate object, filter irrelevvant records, convert to final object (with additional attributes) and than partition by some columns and write to parquet. In my mind it seems reasonable to balance data (40000 partitions) across workers and than do the work like that:
val ds: Dataset[FinalObject] = source.repartition(600)
.map(parse)
.filter(filter.IsValid(_))
.map(convert)
.persist(StorageLevel.DISK_ONLY)
val count = ds.count
log(count)
val partitionColumns = List("region", "year", "month", "day")
ds.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
.write.partitionBy(partitionColumns:_*)
.format("parquet")
.mode(SaveMode.Append)
.save(destUrl)
But it fails with
ExecutorLostFailure (executor 7 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
34.6 GB of 34.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When I do not do repartition everything is fine. Where I do not understand repartition correct?

Your logic is correct for repartition as well as partitionBy but before using repartition you need to keep in mind this thing from several sources.
Keep in mind that repartitioning your data is a fairly expensive
operation. Spark also has an optimized version of repartition() called
coalesce() that allows avoiding data movement, but only if you are
decreasing the number of RDD partitions.
If you want that your task must be done then please increase drivers and executors memory

Related

Spark- write 128 MB size parquet files

I have a DataFrame (df) with more than 1 billion rows
df.coalesce(5)
.write
.partitionBy("Country", "Date")
.mode("append")
.parquet(datalake_output_path)
From the above command I understand only 5 worker nodes in my 100 worker node cluster (spark 2.4.5) will be performing all the tasks. Using coalesce(5) takes the process 7 hours to complete.
Should I try repartition instead of coalesce?
Is there a more faster/ efficient way to write out 128 MB size parquet files or do I need to first calculate the size of my dataframe to determine how many partitions are required.
For example if the size of my dataframe is 1 GB and spark.sql.files.maxPartitionBytes = 128MB should I first calculate No. of partitions required as 1 GB/ 128 MB = approx(8) and then do repartition(8) or coalesce(8) ?
The idea is to maximize the size of parquet files in the output at the time of writing and be able to do so quickly (faster).
You can get the size (dfSizeDiskMB) of your dataframe df by persisting it and then checking the Storage tab on the Web UI as in this answer. Armed with this information and an estimate of the expected Parquet compression ratio you can then estimate the number of partitions you need to achieve your desired output file partition size e.g.
val targetOutputPartitionSizeMB = 128
val parquetCompressionRation = 0.1
val numOutputPartitions = dfSizeDiskMB * parquetCompressionRatio / targetOutputPartitionSizeMB
df.coalesce(numOutputPartitions).write.parquet(path)
Note that spark.files.maxPartitionBytes is not relevant here as it is:
The maximum number of bytes to pack into a single partition when reading files.
(Unless df is the direct result of reading an input data source with no intermediate dataframes created. More likely the number of partitions for df is dictated by spark.sql.shuffle.partitions, being the number of partitions for Spark to use for dataframes created from joins and aggregations).
Should I try repartition instead of coalesce?
coalesce is usually better as it can avoid the shuffle associated with repartition, but note the warning in the docs about potentially losing parallelism in the upstream stages depending on your use case.
Coalesce is better if you are coming from higher no of partitions to lower no. However, if before writing the df, your code isn't doing shuffle , then coalesce will be pushed down to the earliest point possible in DAG.
What you can do is process your df in say 100 partitions or whatever number you seem appropriate and then persist it before writing your df.
Then bring your partitions down to 5 using coalesce and write it. This should probably give you a better performance

Performance tuning in spark

I am running a spark job which processes about 2 TB of data. The processing involves:
Read data (avrò files)
Explode on a column which is a map type
OrderBy key from the exploded column
Filter the DataFrame (I have a very small(7) set of keys (call it keyset) that I want to filter the df for). I do a df.filter(col("key").isin(keyset: _*) )
I write this df to a parquet (this dataframe is very small)
Then I filter the original dataframe again for all the key which are not in the keyset
df.filter(!col("key").isin(keyset: _*) ) and write this to a parquet. This is the larger dataset.
The original avro data is about 2TB. The processing takes about 1 hr. I would like to optimize it. I am caching the dataframe after step 3, using shuffle partition size of 6000. min executors = 1000, max = 2000, executor memory = 20 G, executor core = 2. Any other suggestions for optimization ? Would a left join be better performant than filter ?
All look right to me.
If you have small dataset then isin is okay.
1) Ensure that you can increase the number of cores. executor core=5
More than 5 cores not recommended for each executor. This is based on a study where any application with more than 5 concurrent threads would start hampering the performance.
2) Ensure that you have good/uniform partition strucutre.
Example (only for debug purpose not for production):
import org.apache.spark.sql.functions.spark_partition_id
yourcacheddataframe.groupBy(spark_partition_id).count.show()
This is will print spark partition number and how many records
exists in each partition. based on that you can repartition, if you wanot more parllelism.
3) spark.dynamicAllocation.enabled could be another option.
For Example :
spark-submit --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=100 --conf spark.shuffle.service.enabled=true
along with all other required props ..... thats for that job. If you give these props in spark-default.conf it would be applied for all jobs.
With all these aforementioned options your processing time might lower.
On top of what has been mentioned, a few suggestions depending on your requirements and cluster:
If the job can run at 20g executor memory and 5 cores, you may be able to fit more workers by decreasing the executor memory and keeping 5 cores
Is the orderBy actually required? Spark ensures that rows are ordered within partitions, but not between partitions which usually isn't terribly useful.
Are the files required to be in specific locations? If not, adding a
df.withColumn("in_keyset", when( col('key').isin(keyset), lit(1)).otherwise(lit(0)). \
write.partitionBy("in_keyset").parquet(...)
may speed up the operation to prevent the data from being read in + exploded 2x. The partitionBy ensures that the items in the keyset are in a different directory than the other keys.
spark.dynamicAllocation.enabled is enabled
partition sizes are quite uneven (based on the size of output parquet part files) since I am doing an orderBy key and some keys are more frequent than others.
keyset is a really small set (7 elements)

How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
va df = (read from s3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s""" (query to dedup against prod data """);
sqlDFProdDedup.repartition($"partition_column")
.write.partitionBy("partition_column")
.mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also to increase the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as is and trying to improve the performance of the overall job. Here are couple of my observations:
Not sure what the coalesce(numPartitions) number actually is and why its being used before de-duplication process. Your spark-submit shows you are creating 1600 partitions and thats good enough to start with.
If you are going to repartition before write then the coalesce above may not be beneficial at all as re-partition will shuffle data.
Since you claim writing 10-20 parquet files it means you are only using 10-20 cores in writing in the last part of your job which is the main reason its slow. Based on 100 GB estimate the parquet file ranges from approx 5GB to 10 GB, which is really huge and I doubt one will be able to open them on their local laptop or EC2 machine unless they use EMR or similar (with huge executor memory if reading whole file or spill to disk) because the memory requirement will be too high. I will recommend creating parquet files of around 1GB to avoid any of those issues.
Also if you create 1GB parquet file, you will likely speed up the process 5 to 10 times as you will be using more executors/cores to write them in parallel. You can actually run an experiment by simply writing the dataframe with default partitions.
Which brings me to the point that you really don't need to use re-partition as you want to write.partitionBy("partition_date") call. Your repartition() call is actually forcing the dataframe to only have max 30-31 partitions depending upon the number of days in that month which is what is driving the number of files being written. The write.partitionBy("partition_date") is actually writing the data in S3 partition and if your dataframe has say 90 partitions it will write 3 times faster (3 *30). df.repartition() is forcing it to slow it down. Do you really need to have 5GB or larger files?
Another major point is that Spark lazy evaluation is sometimes too smart. In your case it will most likely only use the number of executors for the whole program based on the repartition(number). Instead you should try, df.cache() -> df.count() and then df.write(). What this does is that it forces spark to use all available executor cores. I am assuming you are reading files in parallel. In your current implementation you are likely using 20-30 cores. One point of caution, as you are using r4/r5 machines, feel free to up your executor memory to 48G with 8 cores. I have found 8cores to be faster for my task instead of standard 5 cores recommendation.
Another pointer is to try ParallelGC instead of G1GC. For the use case like this when you are reading 1000x of files, I have noticed it performs better or not any worse than G1Gc. Please give it a try.
In my workload, I use coalesce(n) based approach where 'n' gives me a 1GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part my cores are idle but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition() but I have found 1GB seems acceptable with pandas, Redshift spectrum, Athena etc.
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this is how Spark will read the part files out to the S3 bucket. Each operation is distinct and will be based upon
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description
This will write the files directly to part files instead or initially loading them to temp files and copying them over to their end-state part files.
(2) For file size you can derive it based upon getting the average number of bytes per record. Below I am figuring out the number of bytes per record to figure the number of records for 1024 MBs. I would try it first with 1024MBs per partition, then move upwards.
import org.apache.spark.util.SizeEstimator
val numberBytes : Long = SizeEstimator.estimate(inputDF.rdd)
val reduceBytesTo1024MB = numberBytes/123217728
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords/reduceBytesTo1024MB).toInt + 1
(3) [I haven't tried this] EMR Committer - if you are using EMR 5.19 or higher, since you are outputting Parquet. You can set the Parquet optimized writer to TRUE.
spark.sql.parquet.fs.optimized.committer.optimization-enabled true

Spark Structured Streaming Memory Bound

I am processing a stream of 100 Mb/s average load. I have six executors with each having 12 Gb of memory allocated. However, due to data load, I am getting Out of Memory errors (Error 52) in the spark executors in few minutes. It seems even though Spark dataframe is conceptually unbounded it is bounded by total executor memory?
My idea here was to save dataframe/stream as an in parquet in about every five minutes. However, it seems spark won't have a direct mechanism to purge the dataframe after that?
val out = df.
writeStream.
format("parquet").
option("path", "/applications/data/parquet/customer").
option("checkpointLocation", "/checkpoints/customer/checkpoint").
trigger(Trigger.ProcessingTime(300.seconds)).
outputMode(OutputMode.Append).
start
It seems that there is no direct way to do this. As this conflicts with the general Spark model that operations be rerunnable in case of failure.
However I would share the same sentiment of the comment at 08/Feb/18 13:21 on this issue.

Best practice for writing to hadoop from spark

I was reviewing some code written by a co-worker, and I found a method like this:
def writeFile(df: DataFrame,
partitionCols: List[String],
writePath: String): Unit {
val df2 = df.repartition(partitionCols.get.map(col): _*)
val dfWriter = df2.write.partitionBy(partitionCols.get.map(col): _*)
dfWriter
.format("parquet")
.mode(SaveMode.Overwrite)
.option("compression", "snappy")
.save(writePath)
}
Is it generally good practice to call repartition on a predefined set of columns like this, and then call partitionBy, and then save to disk?
Generally you call repartition with the same columns as the partitionBy to have a single parquet file in each partition. This is being achieved here. Now you could argue that this could mean the parquet file size becoming large or worse could cause memory overflow.
This problem generally handled by adding a row_number to the Dataframe and then specify the number of documents than each parquet file can have. Something like
val repartitionExpression =colNames.map(col) :+ floor(col(RowNumber) / docsPerPartition)
// now use this to repartition
To answer the next part as persist after partitionBy that is not needed here as after partition it is directly written to the disk.
To help you understand the differences between partitionBy() and repartition(), repartition on the dataframe uses a Hash based partitioner which takes COL as well as NumOfPartitions basing on which generates a hash value and buckets the data.
By default repartition() creates 200 partitions. Because of possibility of collisions there is good chance of partitioning multiple records with different keys into same buckets.
On the other hand the partitionBy() takes COL by which the partitions are purely based on the unique keys. The partitions are proportional to the no: of unique keys in the data.
In repartition case there is a good chance of writing empty files. But, in the case of partitionBy there will not be any empty files.
Is your job CPU-bound, memory-bound, network-IO bound, or disk-IO bound?
First 2 cases are significant if df2 is sufficiently large, and other answers correctly address those cases.
If your job is disk-IO bound (and you see yourself writing large files to HDFS frequently in future), many cloud providers will let you pick a faster SSD disk for an extra charge.
Also Sandy Ryza recommends keeping --executor-cores under 5:
I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.