I am trying to save (and index) a 170GB file (~915 million rows and 25 columns) in an Elasticsearch cluster. I am getting horrible performance on a 5-node Elasticsearch cluster; the task takes ~5 hours.
The Spark cluster has 150 cores: 10 nodes x (15 CPUs, 64GB RAM).
This is my current workflow:
Build a Spark Dataframe from multiple parquet files from S3.
Then save this dataframe to an Elasticsearch index using the "org.elasticsearch.spark.sql" source from Spark. (I have tried many combinations of sharding and replication settings without any gain in performance.)
These are the cluster node characteristics:
5 nodes (16 CPUs, 64GB RAM, 700GB disk) each.
HEAP_SIZE is about 50% of the available RAM, i.e. 32GB on each node, configured in /etc/elasticsearch/jvm.options.
This is the code which writes the dataframe to Elasticsearch (written in Scala):
writeDFToEs(whole_df, "main-index")
The writeDFToEs function:
def writeDFToEs(df: DataFrame, index: String) = {
  df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "192.168.1.xxx")
    .option("es.http.timeout", 600000)
    .option("es.http.max_content_length", "2000mb")
    .option("es.port", 9200)
    .mode("overwrite")
    .save(s"$index")
}
Can you help me find out what I am not doing well and how to fix it?
Thanks in advance.
Answering my own question.
As suggested by #warkolm, I focused on _bulk.
I am using the es-hadoop connector, so I had to tweak the es.batch.size.entries parameter.
After running a bunch of tests with various values, I finally got better results (still not optimal, though) with es.batch.size.entries set to 10000 and the following values in the ES index template:
{
  "index": {
    "number_of_shards": "10",
    "number_of_replicas": "0",
    "refresh_interval": "60s"
  }
}
Finally, my df.write looks like this:
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", es_nodes)
  .option("es.port", es_port)
  .option("es.http.timeout", 600000)
  .option("es.batch.size.entries", 10000)
  .option("es.http.max_content_length", "2000mb")
  .mode("overwrite")
  .save(s"$writeTo")
Now the process takes ~3h (2h 55 min) instead of 5 hours.
I am still improving the configs and code; I will update if I get better performance.
Related
I am trying to write a PySpark dataframe of ~3 million rows x 158 columns (~3GB) to TimescaleDB.
The write operation is executed from a Jupyter kernel with the following resources:
1 Driver, 2 vcpu, 2GB memory
2 Executors, 2 vcpu, 4GB memory
As one could expect, it is fairly slow.
I know of repartition and batchsize, so I am trying to play with those parameters to speed up the write operation, but I was wondering what would be the optimal parameters to be as performant as possible.
df.rdd.getNumPartitions() is 7, should I try to increase or decrease the number of partitions?
I've tried playing with it a bit but I did not get any conclusive result. Increasing the number of partitions does seem to slow the writing, but it might just be because of Spark performing repartition first.
I am more specifically wondering about batchsize. I guess the optimal batchsize depends on the TimescaleDB/Postgres config, but I haven't been able to find more information about this.
For the record, here is an example of what I've tried :
df.write \
    .mode("overwrite") \
    .format('jdbc') \
    .option('url', 'my_url') \
    .option('user', 'my_user') \
    .option('password', 'my_pwd') \
    .option('dbtable', 'my_table') \
    .option('numPartitions', '5') \
    .option('batchsize', '10000') \
    .save()
This took 26 minutes on a much smaller sample of the dataframe (~500K rows, 500MB).
We are aware our Jupyter kernel is lacking in resources and are trying to work on that too, but is there a way to optimize the write speed with Spark and TimescaleDB parameters?
[EDIT] I have also read this very helpful answer about using COPY, but we are specifically looking for ways to increase performance using Spark for now.
If it's using JDBC, the reWriteBatchedInserts=true parameter, which was introduced a while back (https://jdbc.postgresql.org/documentation/changelog.html#version_9.4.1209), will likely speed things up significantly. It can simply be added to the connection string, or there may be a way to specify it in the Spark connector.
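A minimal sketch of passing the flag through the JDBC URL (host, database, user, and table names are placeholders; shown in Scala, but the option strings are identical from PySpark):
// Sketch only: placeholder connection details. reWriteBatchedInserts tells the
// Postgres JDBC driver to rewrite batched INSERTs into multi-row statements,
// which usually helps bulk loads.
df.write
  .mode("overwrite")
  .format("jdbc")
  .option("url", "jdbc:postgresql://my_host:5432/my_db?reWriteBatchedInserts=true")
  .option("user", "my_user")
  .option("password", "my_pwd")
  .option("dbtable", "my_table")
  .option("numPartitions", "5")
  .option("batchsize", "10000")
  .save()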
I am new to the Scala/Spark world and have recently started working on a task that reads some data, processes it, and saves it to S3. I have read several topics/questions on Stack Overflow regarding repartition/coalesce performance and the optimal number of partitions (like this one). Assuming that I have the right number of partitions, my question is: would it be a good idea to repartition an RDD while converting it to a dataframe? Here is what my code looks like at the moment:
val dataRdd = dataDf.rdd.repartition(partitions)
  .map(x => ThreadedConcurrentContext.executeAsync(myFunction(x)))
  .mapPartitions(it => ThreadedConcurrentContext.awaitSliding(it = it, batchSize = asyncThreadsPerTask, timeout = Duration(3600000, "millis")))

val finalDf = dataRdd
  .filter(tpl => tpl._3 != "ERROR")
  .toDF()
Here is what I'm planning to do (repartition data after filter):
val finalDf = dataRdd
  .filter(tpl => tpl._3 != "ERROR")
  .repartition(partitions)
  .toDF()
My question is: would it be a good idea to do so? Is there a performance gain here?
Note 1: The filter usually removes 10% of the original data.
Note 2: Here is the first part of the spark-submit command I use to run the above code:
spark-submit --master yarn --deploy-mode client --num-executors 4 --executor-cores 4 --executor-memory 2G --driver-cores 4 --driver-memory 2G
The answer to your problem depends on the size of your dataRdd, the number of partitions, the executor cores, and the processing power of your HDFS cluster.
With this in mind, you should run some tests on your cluster with different partition values, as well as with the repartition removed altogether, to fine-tune it and find an accurate answer.
Example: if you specify partitions=8 and executor-cores=4, you will be fully utilizing all your cores; however, if the size of your dataRdd is only 1GB, there is no advantage in repartitioning, because it triggers a shuffle, which incurs a performance cost. On top of that, if the processing power of your HDFS cluster is low or it is under heavy load, there is additional overhead from that as well.
If you do have sufficient resources available on your HDFS cluster and you have a big (say, over 100GB) dataRdd, then a repartition should help improve performance with config values like those in the example above.
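As a rough illustration of that tuning loop (the multiplier and resulting partition count are assumptions to experiment with, not recommendations), one starting point is a small multiple of the task slots implied by the question's spark-submit (4 executors x 4 cores):
// Sketch only: derive a candidate partition count from the available task
// slots, then benchmark with and without the post-filter repartition.
val totalCores = 4 * 4            // num-executors x executor-cores from the spark-submit above
val numParts = totalCores * 2     // a small multiple of the slot count; tune empirically

val finalDf = dataRdd
  .filter(tpl => tpl._3 != "ERROR")
  .repartition(numParts)
  .toDF()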
I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = (read from s3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s"""(query to dedup against prod data)""")
sqlDFProdDedup.repartition($"partition_column")
  .write.partitionBy("partition_column")
  .mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also to increase the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case, with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as-is and trying to improve the performance of the overall job. Here are a couple of my observations:
I am not sure what the coalesce(numPartitions) number actually is or why it's being used before the de-duplication process. Your spark-submit shows you are creating 1600 partitions, and that's good enough to start with.
If you are going to repartition before the write, then the coalesce above may not be beneficial at all, since the repartition will shuffle the data anyway.
Since you say you are writing 10-20 parquet files, it means you are only using 10-20 cores for the write in the last part of your job, which is the main reason it's slow. Based on the 100 GB estimate, each parquet file ranges from roughly 5GB to 10GB, which is really huge; I doubt anyone will be able to open them on a local laptop or EC2 machine unless they use EMR or similar (with huge executor memory, if reading the whole file or spilling to disk), because the memory requirement will be too high. I recommend creating parquet files of around 1GB to avoid any of those issues.
Also, if you create 1GB parquet files, you will likely speed up the process 5 to 10 times, since you will be using more executors/cores to write them in parallel. You can actually run an experiment by simply writing the dataframe with the default partitions.
Which brings me to the point that you really don't need to repartition, since you have the write.partitionBy("partition_date") call. Your repartition() call is actually forcing the dataframe to have at most 30-31 partitions, depending on the number of days in that month, which is what drives the number of files being written. write.partitionBy("partition_date") already writes the data into S3 partitions, and if your dataframe has, say, 90 partitions, it will write roughly 3 times faster (3 x 30). df.repartition() is forcing it to slow down. Do you really need to have 5GB or larger files?
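To make that concrete, a hedged sketch of the alternative (90 is an illustrative partition count, not a measured value): either drop the repartition entirely, or spread each date across several partitions so more tasks write in parallel:
// Sketch only: 90 is an illustrative target so ~3 tasks write each date
// partition in parallel instead of one task per date.
sqlDFProdDedup
  .repartition(90, $"partition_column")
  .write.partitionBy("partition_column")
  .mode(SaveMode.Append)
  .parquet(outputPath)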
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case, it will most likely only use a number of executors for the whole program based on the repartition(number). Instead, you should try df.cache() -> df.count() and then df.write(). What this does is force Spark to use all the available executor cores. I am assuming you are reading the files in parallel; in your current implementation you are likely using only 20-30 cores. One point of caution: since you are using r4/r5 machines, feel free to raise your executor memory to 48G with 8 cores. I have found 8 cores to be faster for my tasks than the standard 5-core recommendation.
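A minimal sketch of that cache -> count -> write pattern, reusing the names from the question's code:
// Sketch only: count() materializes the cached dataframe using all available
// cores before the (narrower) partitioned write starts.
val deduped = sqlDFProdDedup.cache()
deduped.count()
deduped.write
  .partitionBy("partition_column")
  .mode(SaveMode.Append)
  .parquet(outputPath)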
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs as well as or better than G1GC. Please give it a try.
In my workload, I use a coalesce(n)-based approach where 'n' gives me 1GB parquet files. I read files in parallel using ALL the cores available on the cluster. Only during the write part are my cores idle, but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found that 1GB seems acceptable with pandas, Redshift Spectrum, Athena, etc.
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this is how Spark writes the part files out to the S3 bucket. Each operation is distinct and will be based upon:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description:
This writes the files directly to the final part files instead of initially writing them to temp files and copying them over to their end-state part files.
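A hedged sketch of one way to apply this setting on an existing SparkSession; it can equally be passed as a --conf flag on spark-submit:
// Sketch only: set the committer algorithm on the session's Hadoop
// configuration (the spark.hadoop. prefix is dropped when setting it directly).
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")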
(2) For file size, you can derive it from the average number of bytes per record. Below, I estimate the dataframe's size in bytes and use it to work out how many records correspond to roughly 1024 MB. I would try 1024 MB per partition first, then move upwards.
import org.apache.spark.util.SizeEstimator

// Estimate the in-memory size of the dataframe, then work out how many
// records correspond to roughly 1024 MB (1073741824 bytes).
val numberBytes: Long = SizeEstimator.estimate(inputDF.rdd)
val reduceBytesTo1024MB = numberBytes / 1073741824
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords / reduceBytesTo1024MB).toInt + 1
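As a hedged follow-up, one way to apply that estimate is through the spark.sql.files.maxRecordsPerFile setting the question already mentions:
// Sketch only: cap records per output file at the computed estimate so each
// part file lands near the 1024 MB target.
spark.conf.set("spark.sql.files.maxRecordsPerFile", recordsFor1024MB.toLong)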
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher and you are outputting Parquet, you can set the Parquet-optimized writer to TRUE.
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
So the question is in the subject. I think I don't correctly understand how repartition works. In my mind, when I say somedataset.repartition(600), I expect all the data to be partitioned into equal-sized chunks across the workers (let's say 60 workers).
So, for example, I have a big chunk of data to load from unbalanced files, let's say 400 files, where 20% are 2GB in size and the other 80% are about 1MB. I have this code to load the data:
val source = sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .load(mypath)
Then I want to convert the raw data to my intermediate object, filter irrelevant records, convert to the final object (with additional attributes), and then partition by some columns and write to parquet. In my mind it seems reasonable to balance the data (40000 partitions) across the workers and then do the work like this:
val ds: Dataset[FinalObject] = source.repartition(600)
  .map(parse)
  .filter(filter.IsValid(_))
  .map(convert)
  .persist(StorageLevel.DISK_ONLY)

val count = ds.count
log(count)

val partitionColumns = List("region", "year", "month", "day")

ds.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
  .write.partitionBy(partitionColumns:_*)
  .format("parquet")
  .mode(SaveMode.Append)
  .save(destUrl)
But it fails with:
ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 34.6 GB of 34.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When I do not repartition, everything is fine. What am I not understanding correctly about repartition?
Your logic is correct for repartition as well as partitionBy, but before using repartition you need to keep in mind this point, noted in several sources:
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
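For illustration (the partition counts are arbitrary), the difference looks like this in code:
// Sketch only: coalesce merges existing partitions without a full shuffle,
// which is cheaper when you only need to reduce the partition count;
// repartition always shuffles and produces evenly sized partitions.
val fewer = source.coalesce(200)          // narrow dependency, no shuffle
val rebalanced = source.repartition(600)  // full shuffle across the cluster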
If you want the task to complete, then increase the driver and executor memory.
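As a hedged example (the values are placeholders to tune, not recommendations), the error message's own suggestion can be applied at submit time along with larger executor memory:
spark-submit --conf spark.yarn.executor.memoryOverhead=4096 --executor-memory 8G ...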
I would like to know the relationship between Spark executors, cores, and Elasticsearch batch size, and how to tune a Spark job optimally to get better indexing throughput.
I have 3.5B records in Parquet format that I would like to ingest into Elasticsearch, and I'm not getting more than a 20K indexing rate. Sometimes I hit 60K-70K, but it comes down immediately, and the average I got was around 15K-25K documents indexed per second.
A bit more detail about my input:
Around 22,000 files in Parquet format
It contains around 3.2B records (around 3TB in size)
Currently running 18 executors (3 executors per node)
Details about my current ES setup:
8 nodes, 1 master and 7 data nodes
Index with 70 shards
Index contains 49 fields (none of them are analyzed)
No replication
"indices.store.throttle.type" : "none"
"refresh_interval" : "-1"
es.batch.size.bytes: 100M (I tried with 500M also)
I'm very new to Elasticsearch, so I'm not sure how to tune my Spark job to get better performance.
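For reference, a hedged sketch of where the es.batch.size.bytes setting above sits in the Spark write, mirroring the es-hadoop options tuned earlier in this thread (the dataframe name, node address, and index name are placeholders, and the batch sizes would still need benchmarking against this cluster):
// Sketch only: placeholder dataframe/nodes/index; batch sizes taken from the
// values discussed above, not measured recommendations.
parquetDF.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-data-node-1")
  .option("es.port", 9200)
  .option("es.batch.size.bytes", "100mb")
  .option("es.batch.size.entries", 10000)
  .mode("append")
  .save("my-index")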