Spark dataset write to parquet file takes forever - scala

My Spark Scala app gets stuck at the statement below and runs for more than 3 hours before being killed by the timeout settings. Any pointers on how to understand and interpret the job execution in the YARN UI and debug this issue are appreciated.
dataset
  .repartition(100, $"Id")
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy(dateColumn)
  .parquet(temppath)
I have a bunch of joins; the largest dataset is ~15 million rows and the smallest is < 100 rows. I have tried multiple options, like increasing the executor memory and the Spark driver memory, but no luck so far. Note that I have cached the datasets I use multiple times, and the final dataset's storage level is set to MEMORY_AND_DISK_SER.
Not sure whether the executors summary below helps or not:
Executors (summary)
Total tasks: 7749 | Input: 98 GB | Shuffle read: 77 GB | Shuffle write: 106 GB
I would appreciate any pointers on how to identify the bottleneck based on the query plan or any other information.
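A minimal sketch of how one might start inspecting the plan before the write (assuming `dataset` is the final Dataset built from the joins described above; this is a general debugging step, not a fix):

// Hypothetical inspection step on the question's `dataset`.
dataset.explain(true)   // prints the parsed, analyzed, optimized and physical plans

// The physical plan (also visible under the SQL tab in the Spark UI once the
// write starts) shows whether each join is a BroadcastHashJoin or a
// SortMergeJoin and where the large shuffle exchanges sit, which is usually
// where the time goes.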

Related

How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = (read from S3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s"""(query to dedup against prod data)""")
sqlDFProdDedup.repartition($"partition_column")
  .write.partitionBy("partition_column")
  .mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also increasing the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as is and trying to improve the performance of the overall job. Here are a couple of my observations:
I'm not sure what the coalesce(numPartitions) number actually is, or why it's being used before the de-duplication process. Your spark-submit shows you are creating 1600 partitions, and that's good enough to start with.
If you are going to repartition before the write, then the coalesce above may not be beneficial at all, as repartition will shuffle the data anyway.
Since you say you are writing 10-20 parquet files per partition, you are only using 10-20 cores for the last part of your job, which is the main reason it's slow. Based on the 100 GB estimate, each parquet file ranges from roughly 5 GB to 10 GB, which is really huge; I doubt anyone will be able to open them on a local laptop or an EC2 machine unless they use EMR or similar (with huge executor memory, or spilling to disk), because the memory requirement will be too high. I recommend creating parquet files of around 1 GB to avoid those issues.
Also, if you create 1 GB parquet files, you will likely speed up the process 5 to 10 times, as you will be using more executors/cores to write them in parallel. You can run a quick experiment by simply writing the dataframe with its default partitions.
Which brings me to the point that you really don't need repartition() just because you want to call write.partitionBy("partition_date"). Your repartition($"partition_column") call is forcing the dataframe to have at most 30-31 partitions, depending on the number of days in that month, and that is what is driving the number of files being written. write.partitionBy("partition_date") handles laying the data out into S3 partitions on its own, and if your dataframe had, say, 90 partitions it would write roughly 3 times faster (90 vs. 30). df.repartition() is what is slowing it down. Do you really need 5 GB or larger files?
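One way to get more write parallelism while still controlling how many files land in each date partition is to repartition on the partition column plus a random salt. This is a sketch, not Zack's actual code; the 8-way split per date (filesPerDatePartition) is just an illustrative number:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, floor, rand}

// Hypothetical: spread each date partition across 8 tasks so 8 cores can
// write it in parallel, instead of 1 task per date.
val filesPerDatePartition = 8

sqlDFProdDedup
  .repartition(col("partition_column"), floor(rand() * filesPerDatePartition))
  .write
  .partitionBy("partition_column")
  .mode(SaveMode.Append)
  .parquet(outputPath)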
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case it will most likely size the whole program's parallelism around the repartition(number). Instead, try df.cache() followed by df.count(), and only then df.write(). This forces Spark to use all available executor cores for the read, and only afterwards perform the narrower write. I am assuming you are reading files in parallel; in your current implementation you are likely using only 20-30 cores. One point of caution: since you are using r4/r5 machines, feel free to raise your executor memory to 48 GB with 8 cores. I have found 8 cores to be faster for my task than the standard 5-core recommendation.
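A minimal sketch of the cache/count/write pattern described above, reusing the question's dataframe names (the storage level is an assumption, not something the answer specifies):

import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel

// Materialize the deduped dataframe with full read parallelism first...
sqlDFProdDedup.persist(StorageLevel.MEMORY_AND_DISK_SER)
val rowCount = sqlDFProdDedup.count()   // action that triggers the wide read/dedup

// ...then perform the (narrower) partitioned write from the cached data.
sqlDFProdDedup.write
  .partitionBy("partition_column")
  .mode(SaveMode.Append)
  .parquet(outputPath)

sqlDFProdDedup.unpersist()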
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs better than, or at least no worse than, G1GC. Please give it a try.
In my workload, I use a coalesce(n)-based approach, where 'n' gives me a 1 GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part are some of my cores idle, but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found that 1 GB files work well with pandas, Redshift Spectrum, Athena, etc.
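For reference, a hedged sketch of how spark.sql.files.maxRecordsPerFile is typically set (the 20 million figure is only a placeholder; the right value depends on the average record size):

// Placeholder value: pick roughly (1 GB / average bytes per record).
spark.conf.set("spark.sql.files.maxRecordsPerFile", "20000000")

// Spark will then roll over to a new part file within each output
// partition once a writer task hits this record count.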
Hope it helps.
Charu
Here are some optimizations to speed this up.
(1) File committer - this controls how Spark writes the part files out to the S3 bucket. Each operation is distinct and will be based upon
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description
This writes the output directly to the final part files, instead of initially writing to temporary files and then copying them over to their end-state part files.
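A hedged sketch of the same setting applied when building the SparkSession (it can equally be passed with --conf on spark-submit; the app name is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-write-job")   // hypothetical app name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()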
(2) For file size, you can derive it from the average number of bytes per record. Below I am estimating the total size in bytes and then working out how many records fit in 1024 MB. I would try it first with 1024 MB per partition, then move upwards.
import org.apache.spark.util.SizeEstimator

val numberBytes: Long = SizeEstimator.estimate(inputDF.rdd)                 // rough in-memory size estimate
val chunksOf1024MB = math.max(1L, numberBytes / (1024L * 1024 * 1024))      // how many ~1 GB chunks the data spans
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords / chunksOf1024MB).toInt + 1           // approx. records per ~1 GB file
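A possible way to use that estimate (an assumption about the intent, since the answer does not show the final step) is to feed it into the per-file record cap before the write:

import org.apache.spark.sql.SaveMode

// Cap each output part file at roughly 1 GB worth of records.
spark.conf.set("spark.sql.files.maxRecordsPerFile", recordsFor1024MB.toString)

inputDF.write
  .partitionBy("partition_column")   // column name taken from the question
  .mode(SaveMode.Append)
  .parquet(outputPath)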
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher and outputting Parquet, you can set the Parquet-optimized committer to TRUE:
spark.sql.parquet.fs.optimized.committer.optimization-enabled true

Spark dataframe Join issue

The code snippet below works fine. (Read CSV, read Parquet, and join them.)
// Reading csv file -- getting three columns; number of records: 1
val df1 = spark.read.format("csv").load(filePath)
val df2 = spark.read.parquet(inputFilePath)
// Join with another table: number of records: 30 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
It's weird that the code snippet below doesn't work. (Read HBase, read Parquet, and join them; the only difference is reading from HBase.)
// Reading from HBase -- getting three columns; number of records: 1
val df1 = (read-from-HBase code)
// It reads from HBase properly and is able to show the one record.
df1.show()
val df2 = spark.read.parquet(inputFilePath)
// Join with another table: number of records: 50 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
Error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 56 tasks (1024.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Then I added spark.driver.maxResultSize=5g, and another error started occurring: a Java heap space error (run at ThreadPoolExecutor.java). If I observe memory usage in the manager, I see that usage just keeps going up until it reaches ~50 GB, at which point the OOM error occurs. So for whatever reason the amount of RAM being used to perform this operation is ~10x greater than the size of the RDD I'm trying to use.
If I persist df1 with memory and disk and do a count(), the program works fine. The code snippet is below:
// Reading from HBase -- getting three columns; number of records: 1
val df1 = (read-from-HBase code)
df1.persist(StorageLevel.MEMORY_AND_DISK)
val cnt = df1.count()
val df2 = spark.read.parquet(inputFilePath)
// Join with another table: number of records: 50 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
It works with the file even though it has the same data, but not with HBase. I'm running this on a 100-worker-node cluster with 125 GB of memory on each, so memory is not the problem.
My question: both the file and HBase contain the same data, and both can be read and shown with show(). Why does only the HBase version fail? I am struggling to understand what might be going wrong with this code. Any suggestions would be appreciated.
When the data is being extracted, Spark is unaware of the number of rows retrieved from HBase, so the strategy it opts for is a sort-merge join.
Thus it tries to sort and shuffle the data across the executors.
To avoid the problem we can use a broadcast join, so that we don't sort and shuffle the data from df2 on the key column; that is what the last statement in your code snippet is trying to do.
However, to bypass this entirely (since df1 is only one row), we can use a case expression to pad the columns onto the other dataframe.
example:
import org.apache.spark.sql.functions.{col, lit, when}

df.withColumn(
  "newCol",
  when(col("df2col1") === lit(hbaseKey), lit(hbaseValueCol1))
    .otherwise(lit(null))
)
I sometimes struggle with this error too. Often it occurs when Spark tries to broadcast a large table during a join (that happens when Spark's optimizer underestimates the size of the table, or the statistics are not correct). As there is no hint to force a sort-merge join (see "How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?"), the only option is to disable broadcast joins by setting spark.sql.autoBroadcastJoinThreshold = -1.
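A minimal sketch of turning broadcast joins off for a session (the same setting can also be passed on spark-submit with --conf):

// Disable automatic broadcast joins; Spark then falls back to sort-merge
// (or shuffled hash) joins regardless of the estimated table size.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val joined = df2.join(df1, col("df2col1") === col("df1col1"), "right")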
When I have memory problems during a join, it usually comes down to one of two reasons:
You have too few partitions in dataframes (partitions are too big)
There are many duplicates in the two dataframes on the key on which you join, and the join explodes your memory.
Ad 1. I think you should look at the number of partitions you have in each table before the join. When Spark reads a file it does not necessarily keep the same number of partitions as the original table (parquet, CSV or other). Reading from CSV vs. reading from HBase might create a different number of partitions, and that is why you see differences in behavior. Partitions that are too large become even larger after the join, and this creates the memory problem. Have a look at Peak Execution Memory per task in the Spark UI; it will give you some idea of your memory usage per task. I have found it best to keep it below 1 GB.
Solution: Repartition your tables before the join.
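A sketch of what that might look like with the dataframes from the question (the partition count of 200 is an arbitrary example, not a recommendation):

import org.apache.spark.sql.functions.col

// Check how many partitions each side actually has.
println(df1.rdd.getNumPartitions)
println(df2.rdd.getNumPartitions)

// Repartition the large side on the join key before joining.
val df2Repartitioned = df2.repartition(200, col("df2col1"))
val joined = df2Repartitioned.join(df1, col("df2col1") === col("df1col1"), "right")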
Ad 2. Maybe not the case here, but worth checking.

Why Spark repartition leads to MemoryOverhead?

So the question is in the subject. I think I don't correctly understand how repartition works. In my mind, when I say somedataset.repartition(600) I expect all the data to be partitioned into equal-sized partitions across the workers (let's say 60 workers).
For example, I have a big chunk of data to load from unbalanced files, let's say 400 files, where 20% are 2 GB in size and the other 80% are about 1 MB. I have this code to load the data:
val source = sparkSession.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter","\t")
.load(mypath)
Then I want to convert the raw data to my intermediate objects, filter irrelevant records, convert to the final objects (with additional attributes), and then partition by some columns and write to parquet. In my mind it seems reasonable to balance the data (40000 partitions) across the workers and then do the work like this:
val ds: Dataset[FinalObject] = source.repartition(600)
.map(parse)
.filter(filter.IsValid(_))
.map(convert)
.persist(StorageLevel.DISK_ONLY)
val count = ds.count
log(count)
val partitionColumns = List("region", "year", "month", "day")
ds.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
.write.partitionBy(partitionColumns:_*)
.format("parquet")
.mode(SaveMode.Append)
.save(destUrl)
But it fails with
ExecutorLostFailure (executor 7 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
34.6 GB of 34.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When I do not repartition, everything is fine. What am I not understanding correctly about repartition?
Your logic is correct for repartition as well as partitionBy, but before using repartition you need to keep in mind this point, made in several sources:
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
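A small sketch of that difference, reusing the question's `source` dataframe (the partition counts are purely illustrative):

val source600 = source.repartition(600)   // full shuffle: data is redistributed evenly

val narrowed = source600.coalesce(100)    // no full shuffle: merges existing partitions,
                                          // only valid when reducing the partition count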
If you want the job to complete as written, then increase the driver and executor memory.
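Since the error itself says to consider boosting spark.yarn.executor.memoryOverhead, here is a hedged example of how those settings are commonly passed on spark-submit (the class name, jar, and values are placeholders, not recommendations):

spark-submit \
  --executor-memory 16g \
  --driver-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --class com.example.MyJob \
  myjob.jar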

Spark Structured Streaming Memory Bound

I am processing a stream with an average load of 100 MB/s. I have six executors, each with 12 GB of memory allocated. However, due to the data load, I am getting Out of Memory errors (Error 52) in the Spark executors within a few minutes. It seems that even though a streaming DataFrame is conceptually unbounded, it is bounded by the total executor memory?
My idea here was to save the dataframe/stream to parquet roughly every five minutes. However, it seems Spark doesn't have a direct mechanism to purge the dataframe after that?
val out = df.
writeStream.
format("parquet").
option("path", "/applications/data/parquet/customer").
option("checkpointLocation", "/checkpoints/customer/checkpoint").
trigger(Trigger.ProcessingTime(300.seconds)).
outputMode(OutputMode.Append).
start
It seems there is no direct way to do this, as it conflicts with the general Spark model that operations must be rerunnable in case of failure.
However, I share the sentiment of the comment from 08/Feb/18 13:21 on this issue.

Spark: sc.WholeTextFiles takes a long time to execute

I have a cluster and I execute wholeTextFiles, which should pull about a million text files that sum up to approximately 10 GB total.
I have one NameNode and two DataNodes with 30 GB of RAM and 4 cores each. The data is stored in HDFS.
I don't pass any special parameters, and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?
I'm just starting and I've never had the need to optimize a job before
EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it was programmed.) I'm very interested in understanding the partition parameter, etc.
EDIT 2: benchmark assessment
So I tried repartitioning after the wholeTextFiles call, but the problem is the same, because the initial read still uses the pre-defined number of partitions, so there are no performance improvements. Once the data is loaded the cluster performs really well... I get the following warning message when dealing with the data (for 200k files) on the wholeTextFiles call:
15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.
Could that be a reason for the bad performance? How do I address it?
Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s...
It seems that by increasing the number of partitions in wholeTextFiles(path, minPartitions), I am getting better performance. But still only 8 tasks run at the same time (my number of CPUs). I'm benchmarking to observe the limit...
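For reference, a minimal sketch of that call with an explicit minimum partition count (the path and the value 200 are placeholders):

// Hint Spark to split the read into at least 200 partitions instead of
// the default, so more tasks can read files in parallel.
val files = sc.wholeTextFiles("hdfs:///path/to/input", 200)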
To summarize my recommendations from the comments:
HDFS is not a good fit for storing many small files. First of all, the NameNode stores metadata in memory, so the number of files and blocks you can have is limited (~100M blocks is the max for a typical server). Next, each time you read a file you first query the NameNode for block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really huge.
Default settings should always be reviewed. By default Spark starts on YARN with 2 executors (--num-executors), 1 thread each (--executor-cores), and 512 MB of RAM (--executor-memory), giving you only 2 threads with 512 MB of RAM each, which is really small for real-world tasks.
So my recommendations are:
Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4 which would give you more parallelism - 16 threads in this particular case, which means 16 tasks running in parallel
Use sc.wholeTextFiles to read the files and then dump them into a compressed sequence file (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration.
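A rough sketch of that idea (the paths, partition count, and codec choice are placeholders; the linked post shows the author's actual approach):

import org.apache.hadoop.io.compress.SnappyCodec

// Read (filename, content) pairs, then write them back out as a
// Snappy-compressed SequenceFile keyed by filename.
val files = sc.wholeTextFiles("hdfs:///path/to/small-files", 200)
files.saveAsSequenceFile("hdfs:///path/to/combined-seqfile", Some(classOf[SnappyCodec]))

// Later runs read the consolidated file instead of a million small ones:
val restored = sc.sequenceFile[String, String]("hdfs:///path/to/combined-seqfile")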