Spark repartition performance when converting an RDD to a DataFrame - Scala

I am new to the Scala/Spark world and have recently started working on a task that reads some data, processes it, and saves it to S3. I have read several topics/questions on Stack Overflow regarding repartition/coalesce performance and the optimal number of partitions (like this one). Assuming that I have the right number of partitions, my question is: would it be a good idea to repartition an RDD while converting it to a DataFrame? Here is what my code looks like at the moment:
val dataRdd = dataDf.rdd.repartition(partitions)
.map(x => ThreadedConcurrentContext.executeAsync(myFunction(x)))
.mapPartitions( it => ThreadedConcurrentContext.awaitSliding(it = it, batchSize = asyncThreadsPerTask, timeout = Duration(3600000, "millis")))
val finalDf = dataRdd
.filter(tpl => tpl._3 != "ERROR")
.toDF()
Here is what I'm planning to do (repartition data after filter):
val finalDf = dataRdd
.filter(tpl => tpl._3 != "ERROR")
.repartition(partitions)
.toDF()
My question is: would it be a good idea to do so? Is there a performance gain here?
Note 1: The filter usually removes 10% of the original data.
Note 2: Here is the first part of the spark-submit command that I use to run the above code:
spark-submit --master yarn --deploy-mode client --num-executors 4 --executor-cores 4 --executor-memory 2G --driver-cores 4 --driver-memory 2G

The answer to your problem depends on the size of your dataRdd, the number of partitions, the executor cores, and the processing power of your HDFS cluster.
With this in mind, you should run some tests on your cluster with different numbers of partitions, and with the repartition removed altogether, to fine-tune it and find an accurate answer.
Example: if you specify partitions=8 and executor-cores=4, then you will be fully utilizing all your cores; however, if the size of your dataRdd is only 1 GB, there is no advantage in repartitioning, because it triggers a shuffle, which has a performance cost. On top of that, if the processing power of your HDFS cluster is low or it is under heavy load, there is additional overhead from that as well.
If you do have sufficient resources available on your HDFS cluster and you have a big (say over 100 GB) dataRdd, then a repartition should help improve performance with the config values in the example above.
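To make that experiment concrete, here is a minimal sketch (assuming the dataRdd, partitions and spark values from the question are in scope, and with placeholder S3 paths) that prints the partition count after the filter and times both variants:
import spark.implicits._   // needed for .toDF() on an RDD of tuples

// Rough timing helper; just enough to compare the two variants, not a benchmark.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

val filtered = dataRdd.filter(tpl => tpl._3 != "ERROR")
println(s"partitions after filter: ${filtered.getNumPartitions}")

// Variant 1: no extra shuffle before the write
timed("no repartition") {
  filtered.toDF().write.mode("overwrite").parquet("s3://my-bucket/perf-test/no-repartition")
}

// Variant 2: extra shuffle to rebalance the filtered data before the write
timed("with repartition") {
  filtered.repartition(partitions).toDF().write.mode("overwrite").parquet("s3://my-bucket/perf-test/with-repartition")
}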

Related

Tune Elastic Write Performance

I am trying to save (and index) a 170 GB file (~915 million rows and 25 columns) in an Elasticsearch cluster. I get horrible performance on a 5-node Elasticsearch cluster. The task takes ~5 hours.
The Spark cluster has 150 cores, 10 x (15 CPUs, 64 GB RAM).
This is my current workflow:
Build a Spark DataFrame from multiple parquet files on S3.
Then save this DataFrame to an Elasticsearch index using the "org.elasticsearch.spark.sql" source from Spark. (I tried many sharding and replication configuration combinations without any performance gain.)
These are the cluster node characteristics:
5 nodes (16 CPUs, 64 GB RAM, 700 GB disk) each.
HEAP_SIZE is about 50% of the available RAM, i.e. 32 GB on each node, configured in /etc/elasticsearch/jvm.options.
This is the code which writes the DataFrame to Elasticsearch (written in Scala):
writeDFToEs(whole_df, "main-index")
The writeDFToEs function:
def writeDFToEs(df: DataFrame, index: String) = {
  df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "192.168.1.xxx")
    .option("es.http.timeout", 600000)
    .option("es.http.max_content_length", "2000mb")
    .option("es.port", 9200)
    .mode("overwrite")
    .save(s"$index")
}
Can you help me find out what I am not doing well and how to fix it?
Thanks in advance.
Answering my own question.
As suggested by #warkolm, I focused on _bulk.
I am using es-hadoop connector, so I had to tweak es.batch.size.entries parameter.
After running a bunch of tests with various values, I finally got better results (still not optimal though) with es.batch.size.entries set to 10000 and the following values in the ES index template:
{
  "index": {
    "number_of_shards": "10",
    "number_of_replicas": "0",
    "refresh_interval": "60s"
  }
}
Finally, my df.write looks like this:
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", es_nodes)
  .option("es.port", es_port)
  .option("es.http.timeout", 600000)
  .option("es.batch.size.entries", 10000)
  .option("es.http.max_content_length", "2000mb")
  .mode("overwrite")
  .save(s"$writeTo")
Now the process takes ~3h (2h 55 min) instead of 5 hours.
I am still improving the configs and code. I will update if I get better performance.

Performance tuning in Spark

I am running a spark job which processes about 2 TB of data. The processing involves:
Read data (avro files)
Explode on a column which is a map type
OrderBy key from the exploded column
Filter the DataFrame (I have a very small(7) set of keys (call it keyset) that I want to filter the df for). I do a df.filter(col("key").isin(keyset: _*) )
I write this df to a parquet (this dataframe is very small)
Then I filter the original dataframe again for all the keys which are not in the keyset
df.filter(!col("key").isin(keyset: _*) ) and write this to a parquet. This is the larger dataset.
The original avro data is about 2 TB. The processing takes about 1 hr. I would like to optimize it. I am caching the dataframe after step 3, using a shuffle partition size of 6000, min executors = 1000, max = 2000, executor memory = 20 G, executor cores = 2. Any other suggestions for optimization? Would a left join perform better than the filter?
It all looks right to me.
If you have a small dataset then isin is okay.
1) Ensure that you increase the number of cores per executor: executor-cores = 5.
More than 5 cores per executor is not recommended. This is based on a study showing that any application with more than 5 concurrent threads per executor starts hampering performance.
2) Ensure that you have a good/uniform partition structure.
Example (only for debugging purposes, not for production):
import org.apache.spark.sql.functions.spark_partition_id
yourcacheddataframe.groupBy(spark_partition_id()).count().show()
This will print the Spark partition id and how many records exist in each partition. Based on that you can repartition if you want more parallelism.
3) spark.dynamicAllocation.enabled could be another option.
For example:
spark-submit --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=100 --conf spark.shuffle.service.enabled=true
along with all the other required props; that applies only to that job. If you set these props in spark-defaults.conf, they are applied to all jobs.
With all these options, your processing time might come down.
On top of what has been mentioned, a few suggestions depending on your requirements and cluster:
If the job can run at 20g executor memory and 5 cores, you may be able to fit more workers by decreasing the executor memory and keeping 5 cores
Is the orderBy actually required? Spark only ensures that rows are ordered within partitions, not across partitions, which usually isn't terribly useful (see the sortWithinPartitions sketch after these suggestions).
Are the files required to be in specific locations? If not, adding a
df.withColumn("in_keyset", when(col("key").isin(keyset: _*), lit(1)).otherwise(lit(0)))
  .write.partitionBy("in_keyset").parquet(...)
may speed up the operation by preventing the data from being read in and exploded twice. The partitionBy ensures that the items in the keyset end up in a different directory than the other keys.
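On the orderBy point above, here is a hedged sketch of the per-partition alternative (explodedDf is a stand-in name for the exploded dataframe from the question; whether the global sort can be dropped depends on why it was added in the first place):
import org.apache.spark.sql.functions.col

// Global sort: shuffles the whole dataset into a total order across partitions.
val globallySorted = explodedDf.orderBy(col("key"))

// Per-partition sort: no extra shuffle; each partition is sorted independently,
// which is often enough when downstream steps only need locally ordered data.
val locallySorted = explodedDf.sortWithinPartitions(col("key"))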
spark.dynamicAllocation.enabled is enabled
partition sizes are quite uneven (based on the size of output parquet part files) since I am doing an orderBy key and some keys are more frequent than others.
keyset is a really small set (7 elements)

How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = (read from S3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s"""(query to dedup against prod data)""")
sqlDFProdDedup.repartition($"partition_column")
.write.partitionBy("partition_column")
.mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also increasing the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as-is and trying to improve the performance of the overall job. Here are a couple of my observations:
Not sure what the coalesce(numPartitions) number actually is and why it's being used before the de-duplication process. Your spark-submit shows you are creating 1600 partitions and that's good enough to start with.
If you are going to repartition before the write, then the coalesce above may not be beneficial at all, as the repartition will shuffle the data.
Since you say you are writing 10-20 parquet files, it means you are only using 10-20 cores for the write in the last part of your job, which is the main reason it is slow. Based on the 100 GB estimate, the parquet files range from approximately 5 GB to 10 GB, which is really huge, and I doubt one will be able to open them on a local laptop or an EC2 machine unless using EMR or similar (with huge executor memory, whether reading the whole file or spilling to disk), because the memory requirement will be too high. I recommend creating parquet files of around 1 GB to avoid any of those issues.
Also, if you create 1 GB parquet files, you will likely speed up the process 5 to 10 times, as you will be using more executors/cores to write them in parallel. You can actually run an experiment by simply writing the dataframe with its default partitions.
Which brings me to the point that you really don't need to use repartition(), since you already have the write.partitionBy("partition_date") call. Your repartition() call is actually forcing the dataframe to have at most 30-31 partitions, depending on the number of days in that month, which is what is driving the number of files being written. The write.partitionBy("partition_date") is what lays the data out into S3 partitions, and if your dataframe had, say, 90 partitions it would write 3 times faster (3 x 30). df.repartition() is forcing it to slow down. Do you really need 5 GB or larger files?
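A rough sketch of that suggestion (names taken from the question; the 300 is purely illustrative and would come from the data size and a ~1 GB file target):
import org.apache.spark.sql.SaveMode

// Keep many partitions going into the write so more cores write in parallel;
// write.partitionBy still lays the files out by partition value on S3.
sqlDFProdDedup
  .repartition(300)                        // round-robin; replaces repartition($"partition_column")
  .write
  .partitionBy("partition_column")
  .mode(SaveMode.Append)
  .parquet(outputPath)
Each task may then write one file per partition value it holds, so the file count goes up; the spark.sql.files.maxRecordsPerFile setting from the question still caps the size of each file.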
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case it will most likely only size the number of executors for the whole program based on the repartition(number). Instead you should try df.cache() -> df.count() and then df.write(). What this does is force Spark to use all the available executor cores. I am assuming you are reading the files in parallel; in your current implementation you are likely using 20-30 cores. One point of caution: as you are using r4/r5 machines, feel free to raise your executor memory to 48G with 8 cores. I have found 8 cores to be faster for my tasks than the standard 5-core recommendation.
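A minimal sketch of that cache-then-count idea (the helper name is mine; the count() is there purely as the action that forces the dedup to run with full parallelism before the write starts):
import org.apache.spark.sql.DataFrame

def materialize(df: DataFrame): DataFrame = {
  df.cache()                                   // keep the deduplicated rows around
  println(s"rows after dedup: ${df.count()}")  // forcing action
  df
}

val readyToWrite = materialize(sqlDFProdDedup) // then readyToWrite.write.partitionBy(...) as above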
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs as well as or better than G1GC. Please give it a try.
In my workload, I use a coalesce(n)-based approach where 'n' gives me 1 GB parquet files. I read files in parallel using ALL the cores available on the cluster. Only during the write part are my cores idle, but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found that 1 GB seems acceptable with pandas, Redshift Spectrum, Athena, etc.
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this is how Spark will write the part files out to the S3 bucket. Each operation is distinct and will be based upon:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description
This will write the files directly to part files instead of initially loading them to temp files and copying them over to their end-state part files.
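As an illustration (not from the original answer), the property above could be set when the SparkSession is built, or passed as --conf on the spark-submit line:
import org.apache.spark.sql.SparkSession

// Sketch: route the Hadoop committer setting through the spark.hadoop.* prefix.
val spark = SparkSession.builder()
  .appName("s3-parquet-writer")   // placeholder app name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()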
(2) For file size, you can derive it from the average number of bytes per record. Below I am working out the number of bytes per record to figure out the number of records per 1024 MB. I would try it first with 1024 MB per partition, then move upwards.
import org.apache.spark.util.SizeEstimator

// Estimate the in-memory size of the dataframe, then work out roughly how many
// records fit into 1024 MB (1024 MB = 1073741824 bytes).
val numberBytes: Long = SizeEstimator.estimate(inputDF.rdd)
val reduceBytesTo1024MB = numberBytes / 1073741824L
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords / reduceBytesTo1024MB).toInt + 1
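One way the estimate might be used (an illustration, not part of the original answer) is to feed it into the spark.sql.files.maxRecordsPerFile setting the question already mentions, so each part file lands near the ~1 GB target:
// recordsFor1024MB comes from the snippet above; the output path is a placeholder.
spark.conf.set("spark.sql.files.maxRecordsPerFile", recordsFor1024MB.toLong)

inputDF.write
  .mode("append")
  .parquet("s3://my-bucket/output/")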
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher and you are outputting Parquet, you can set the Parquet optimized writer to TRUE:
spark.sql.parquet.fs.optimized.committer.optimization-enabled true

Why does Spark repartition lead to MemoryOverhead?

So the question is in the subject. I think I don't correctly understand how repartition works. In my mind, when I say somedataset.repartition(600), I expect all the data to be partitioned into equal-sized partitions across the workers (let's say 60 workers).
So, for example, I have a big chunk of data to load from unbalanced files, let's say 400 files, where 20% are 2 GB in size and the other 80% are about 1 MB. I have this code to load the data:
val source = sparkSession.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter","\t")
.load(mypath)
Then I want to convert the raw data to my intermediate object, filter irrelevant records, convert to the final object (with additional attributes), and then partition by some columns and write to parquet. In my mind it seems reasonable to balance the data (40000 partitions) across the workers and then do the work like this:
val ds: Dataset[FinalObject] = source.repartition(600)
.map(parse)
.filter(filter.IsValid(_))
.map(convert)
.persist(StorageLevel.DISK_ONLY)
val count = ds.count
log(count)
val partitionColumns = List("region", "year", "month", "day")
ds.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
.write.partitionBy(partitionColumns:_*)
.format("parquet")
.mode(SaveMode.Append)
.save(destUrl)
But it fails with
ExecutorLostFailure (executor 7 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
34.6 GB of 34.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When I do not do the repartition everything is fine. What do I not understand correctly about repartition?
Your logic is correct for repartition as well as partitionBy, but before using repartition you need to keep in mind this point, made in several sources:
Keep in mind that repartitioning your data is a fairly expensive
operation. Spark also has an optimized version of repartition() called
coalesce() that allows avoiding data movement, but only if you are
decreasing the number of RDD partitions.
If you want your task to complete, then increase the driver and executor memory.
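For reference, a hedged sketch of the knob the error message points at (the value is a placeholder, not a recommendation; this YARN-specific property is typically given as a plain number of MiB):
import org.apache.spark.sql.SparkSession

// Sketch: raise the off-heap headroom YARN allows per executor before the
// SparkContext is created (equivalently: --conf spark.yarn.executor.memoryOverhead=4096).
val spark = SparkSession.builder()
  .appName("repartition-job")                            // placeholder
  .config("spark.yarn.executor.memoryOverhead", "4096")  // placeholder value
  .getOrCreate()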

Is CPU usage in Apache Spark limited?

I recently discovered that adding parallel computing (e.g. using parallel collections) inside UDFs increases performance considerably, even when running Spark in local[1] mode or using YARN with 1 executor and 1 core.
E.g. in local[1] mode, the Spark job consumes as much CPU as possible (i.e. 800% if I have 8 cores, measured using top).
This seems strange, because I thought Spark (or YARN) limits the CPU usage per Spark application?
So I wonder why that is, and whether it's recommended to use parallel processing/multi-threading in Spark, or whether I should stick to Spark's parallelization pattern.
Here is an example to play with (times measured in YARN client mode with 1 instance and 1 core):
case class MyRow(id:Int,data:Seq[Double])
// create dataFrame
val rows = 10
val points = 10000
import scala.util.Random.nextDouble
val data = {1 to rows}.map{i => MyRow(i, Stream.continually(nextDouble()).take(points))}
val df = sc.parallelize(data).toDF().repartition($"id").cache()
df.show() // trigger computation and caching
// some expensive dummy-computation for each array-element
val expensive = (d:Double) => (1 to 10000).foldLeft(0.0){case(a,b) => a*b}*d
val serialUDF = udf((in:Seq[Double]) => in.map{expensive}.sum)
val parallelUDF = udf((in:Seq[Double]) => in.par.map{expensive}.sum)
df.withColumn("sum",serialUDF($"data")).show() // takes ~ 10 seconds
df.withColumn("sum",parallelUDF($"data")).show() // takes ~ 2.5 seconds
Spark does not limit CPU directly; instead it defines the number of concurrent tasks Spark creates. So for local[1] it would basically run one task at a time. When you do in.par.map{expensive}, you are creating threads which Spark does not manage and which are therefore not handled by this limit. I.e. you told Spark to limit itself to a single thread and then created other threads without Spark knowing it.
In general, it is not a good idea to spawn parallel threads inside a Spark operation. Instead, it is better to tell Spark how many threads it can work with and make sure you have enough partitions for parallelism.
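As an illustration of "enough partitions for parallelism" (a sketch, not a drop-in replacement for the UDF approach above), the per-element work can be handed to Spark itself by exploding the array, so the parallelism is governed by partitions and cores rather than by threads Spark cannot see:
import org.apache.spark.sql.functions.{col, explode, sum, udf}

// Assumes df and expensive from the snippet above are in scope.
val expensiveUDF = udf((d: Double) => expensive(d))

val viaSpark = df
  .select(col("id"), explode(col("data")).as("d"))  // one row per array element
  .repartition(8)                                   // sized to the cores given to Spark, e.g. local[8]
  .withColumn("partial", expensiveUDF(col("d")))
  .groupBy("id")
  .agg(sum("partial").as("sum"))

viaSpark.show()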
Spark has configuration for CPU usage. For example:
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
Change it to local[*] and it will utilize all of your CPU cores.