I am trying to write a PySpark dataframe of ~3 million rows x 158 columns (~3 GB) to TimescaleDB.
The write operation is executed from a Jupyter kernel with the following resources:
1 Driver, 2 vcpu, 2GB memory
2 Executors, 2 vcpu, 4GB memory
As one could expect, it is fairly slow.
I know about repartition and batchsize, and I am playing with those parameters to speed up the write, but I am wondering what the optimal values would be to get the best performance.
df.rdd.getNumPartitions() is 7; should I try to increase or decrease the number of partitions?
I've tried playing with it a bit, but I did not get any conclusive result. Increasing the number of partitions does seem to slow the write, but that might just be because Spark performs the repartition first.
I am more specifically wondering about batchsize. I guess the optimal batchsize depends on the TimescaleDB/Postgres config, but I haven't been able to find more info about this.
For the record, here is an example of what I've tried :
df.write \
.mode("overwrite") \
.format('jdbc') \
.option('url', 'my_url') \
.option('user', 'my_user') \
.option('password', 'my_pwd') \
.option('dbtable', 'my_table') \
.option('numPartitions', '5') \
.option('batchsize', '10000') \
.save()
This took 26 minutes on a much smaller sample of the dataframe (~500K rows, 500MB).
We are aware our Jupyter kernel is lacking in resources and are trying to work on that too, but is there a way to optimize the write speed with Spark and TimescaleDB parameters?
[EDIT] I have also read this very helpful answer about using COPY, but we are specifically looking for ways to increase performance using Spark for now.
If you're going through JDBC, the reWriteBatchedInserts=true parameter, introduced a while back (https://jdbc.postgresql.org/documentation/changelog.html#version_9.4.1209), will likely speed things up significantly. It can simply be appended to the connection string, or there may be a way to specify it as an option in the Spark JDBC writer.
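For example, with the PostgreSQL JDBC driver the flag can be appended directly to the URL. A minimal sketch (host, port, database name, and credentials are placeholders, not from the original post):

df.write \
    .mode("overwrite") \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/my_db?reWriteBatchedInserts=true") \
    .option("user", "my_user") \
    .option("password", "my_pwd") \
    .option("dbtable", "my_table") \
    .option("batchsize", "10000") \
    .save()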
I am dealing with a huge amount of data that can't be processed within the available memory in PySpark, which results in an Out of Memory error. I need to use the MEMORY_AND_DISK option for this.
My question is: how can I enable this flag in a PySpark Jupyter notebook?
I am looking for something like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local[*]') \
.config("spark.driver.memory", "15g") \
.appName('voice-30') \
.getOrCreate()
This is how we are setting driver memory. Is there any similar way to set the MEMORY_AND_DISK flag for PySpark?
MEMORY_AND_DISK has been the default storage level since Spark 2.0 for persisting a DataFrame or RDD for use in multiple actions, so there is no need to set it explicitly. However, you are experiencing an OOM error, so setting storage options for persisting RDDs is not the answer to your problem.
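For reference, if you did want to request that level explicitly, here is a minimal sketch (df stands in for whatever dataframe you reuse across actions):

from pyspark import StorageLevel

# Explicitly request the (already default) MEMORY_AND_DISK level for a dataframe.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()       # an action materializes the persisted data
df.unpersist()   # release it once it is no longer needed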
Note from the Spark FAQs:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Hence, your OOM error is due to your cluster running out of storage (both memory and disk), so you need to increase the resources of your cluster (some permutation of memory, disk, and number of nodes).
I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = ... // read from S3 and add a few columns like timestamp and source file name
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s"""...query to dedup against prod data...""")
sqlDFProdDedup.repartition($"partition_column")
  .write.partitionBy("partition_column")
  .mode(SaveMode.Append).parquet(outputPath)
When I look at the Ganglia chart, I see a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data uses only a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also increasing the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as-is and trying to improve the performance of the overall job. Here are a couple of my observations:
Not sure what the coalesce(numPartitions) number actually is, or why it's being used before the de-duplication step. Your spark-submit shows you are creating 1600 partitions, and that's good enough to start with.
If you are going to repartition before writing, then the coalesce above may not be beneficial at all, as repartition will shuffle the data anyway.
Since you say you are writing 10-20 parquet files, it means you are only using 10-20 cores for the write in the last part of your job, which is the main reason it's slow. Based on the 100 GB estimate, the parquet files range from roughly 5 GB to 10 GB each, which is really huge; I doubt anyone will be able to open them on a local laptop or EC2 machine unless they use EMR or similar (with huge executor memory, whether reading the whole file or spilling to disk), because the memory requirement will be too high. I would recommend creating parquet files of around 1 GB to avoid those issues.
Also, if you create 1 GB parquet files, you will likely speed up the process 5 to 10 times, because you will be using more executors/cores to write them in parallel. You can run a quick experiment by simply writing the dataframe with the default partitions.
Which brings me to the point that you really don't need to repartition, since you already have the write.partitionBy("partition_date") call. Your repartition() call is actually forcing the dataframe to have at most 30-31 partitions, depending on the number of days in that month, and that is what drives the number of files being written. write.partitionBy("partition_date") already writes the data into S3 partitions, and if your dataframe has, say, 90 partitions, it will write roughly 3 times faster (3 x 30). df.repartition() is forcing it to slow down. Do you really need 5 GB or larger files?
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case, it will most likely size the executor usage for the whole program based on the repartition(number). Instead, you should try df.cache() -> df.count() and then df.write(). This forces Spark to use all available executor cores. I am assuming you are reading files in parallel; in your current implementation you are likely using only 20-30 cores. One point of caution: since you are using r4/r5 machines, feel free to raise your executor memory to 48G with 8 cores. I have found 8 cores to be faster for my tasks than the standard 5-core recommendation.
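A rough sketch of that cache -> count -> write pattern (shown in PySpark syntax for illustration; the dataframe, column, and path names mirror the ones in the question, and repartition() is dropped as suggested):

# Materialize the deduplicated dataframe across all executor cores first,
# then let the write reuse the cached partitions.
sqlDFProdDedup.cache()
sqlDFProdDedup.count()   # action that forces the full parallel read/dedup

(sqlDFProdDedup.write
    .partitionBy("partition_column")
    .mode("append")
    .parquet(outputPath))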
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs better than, or at least no worse than, G1GC. Please give it a try.
In my workload, I use a coalesce(n)-based approach where 'n' gives me a 1 GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part are my cores idle, but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found 1 GB files to be acceptable with pandas, Redshift Spectrum, Athena, etc.
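For what it's worth, a minimal illustration of setting that cap (PySpark syntax shown; the record count is just an example to tune against your own row size):

# Cap each output file at roughly N records; combined with partitionBy,
# this bounds the size of individual parquet files.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)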
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this is how Spark writes the part files out to the S3 bucket. Each operation is distinct and is controlled by
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
This writes the files directly to their final part files instead of first loading them into temp files and copying them over to their end-state part files.
(2) For file size, you can derive it from the average number of bytes per record. Below I am computing the number of bytes per record to figure out how many records fit in 1024 MB. I would try 1024 MB per partition first, then move upwards.
import org.apache.spark.util.SizeEstimator

val numberBytes: Long = SizeEstimator.estimate(inputDF.rdd)
val reduceBytesTo1024MB = numberBytes / 1073741824  // 1024 MB in bytes
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords / reduceBytesTo1024MB).toInt + 1
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher and are outputting Parquet, you can set the Parquet-optimized committer to TRUE:
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
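A hedged sketch of wiring both settings into the session at startup (PySpark shown for illustration; the app name is a placeholder, and you can equally pass these as --conf flags to spark-submit):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("parquet-write-tuning")
    # write part files directly instead of via a temporary location
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # EMR 5.19+ only: optimized committer for Parquet output
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    .getOrCreate())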
What would be the most efficient way to insert millions of records, say 50 million, from a Spark dataframe into Postgres tables?
I have done this from Spark to MSSQL in the past by making use of bulk copy and the batch size option, which was successful too.
Is there something similar that can be done here for Postgres?
Adding the code I have tried and the time it took to run the process:
from timeit import default_timer as timer  # assuming timer comes from timeit
from datetime import timedelta

def inserter():
    start = timer()
    sql_res.write.format("jdbc") \
        .option("numPartitions", "5").option("batchsize", "200000") \
        .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
        .option("dbtable", "public.full_load") \
        .option("user", "root").option("password", "password") \
        .save()
    end = timer()
    print(timedelta(seconds=end - start))

inserter()
So I ran the above approach for 10 million records, with 5 parallel connections as specified in numPartitions, and also tried a batch size of 200k.
The total time it took for the process was 0:14:05.760926 (fourteen minutes and five seconds).
Is there any other efficient approach which would reduce the time?
What would be an efficient or optimal batch size to use? Will increasing my batch size do the job quicker? Or would opening multiple connections (i.e. > 5) make the process quicker?
On average, 14 minutes for 10 million records is not bad, but I am looking for people out there who have done this before to help answer this question.
I actually did kind of the same work a while ago but using Apache Sqoop.
I would say that to answer this question we have to try to optimize the communication between Spark and PostgreSQL, specifically the data flowing from Spark to PostgreSQL.
But be careful, and do not forget the Spark side. It does not make sense to execute mapPartitions if the number of partitions is too high compared with the maximum number of connections PostgreSQL supports; if you have too many partitions and you open a connection for each one, you will probably get the following error: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already.
In order to tune the insertion process, I would approach the problem with the following steps:
Remember that the number of partitions is important. Check the number of partitions and then adjust it based on the number of parallel connections you want to have. You might want one connection per partition, so I would suggest looking at coalesce, as mentioned here.
Check the maximum number of connections your PostgreSQL instance supports, and consider whether you want to increase that number.
For inserting data into PostgreSQL, the COPY command is recommended. Here is also a more elaborate answer about how to speed up PostgreSQL insertion.
Finally, there is no silver bullet for this job. You can use all the tips I mentioned above, but it will really depend on your data and use cases.
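Putting the partition and batch-size points together, a minimal PySpark sketch (the connection count, batch size, and the reWriteBatchedInserts flag are illustrative, not prescriptive; the other names come from the code in the question):

# Match the number of write partitions to the parallel connections
# Postgres can comfortably accept (well below its max_connections).
parallel_connections = 8  # illustrative value

(sql_res.coalesce(parallel_connections)
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://xyz.com:5435/abc_db?reWriteBatchedInserts=true")
    .option("dbtable", "public.full_load")
    .option("user", "root")
    .option("password", "password")
    .option("batchsize", "100000")  # tune against your row width and server settings
    .save())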
I am trying to read data from an AWS RDS system and write to Snowflake using Spark.
My Spark job makes a JDBC connection to RDS and pulls the data into a dataframe; on the other hand, I write the same dataframe to Snowflake using the Snowflake connector.
Problem statement: when writing the data, even 30 GB takes a long time to write.
Solutions I tried:
1) repartition the dataframe before writing.
2) caching the dataframe.
3) taking a count of df before writing to reduce scan time at write.
It may have been a while since this question was asked. If you are preparing the dataframe, or using another tool to prepare your data to move to Snowflake, the Python connector integrates very nicely.
As for general recommendations for troubleshooting: the comments recommended above are great; were you able to resolve the JDBC connection with the recent updates?
Some other troubleshooting to consider:
Save time by going directly from Spark to Snowflake with the Spark connector (https://docs.snowflake.net/manuals/user-guide/spark-connector.html); see the sketch below.
For larger data sets, increasing the warehouse size for the session you are using, and loading data in smaller 10 MB to 100 MB files, will generally increase compute speed.
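If you go the Spark connector route, a minimal write sketch might look like the following (all connection options are placeholders; check the connector docs linked above for the exact option names in your version):

# Write the dataframe straight to Snowflake via the Spark connector.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "my_db",
    "sfSchema": "public",
    "sfWarehouse": "my_wh",
}

(df.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "my_table")
    .mode("append")
    .save())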
Let me know what you think, I would love to hear how you solved it.
I am trying to do sentiment analysis of comments. The program runs successfully on Spark, but the problem I am facing is that out of 70 partitions, 68 partitions produce their results in around 20% of the time taken by the last 2 partitions. I have checked that my data is equally distributed across all partitions, and I have even checked with different sample data.
I have also run the code using persist(StorageLevel.MEMORY_AND_DISK_SER) for all dataframes, and I unpersist these dataframes as soon as they are no longer required.
I also tried increasing and decreasing the number of partitions, but the last 2 tasks still take a huge amount of time. The following is the current config I am using:
--master yarn \
--deploy-mode client \
--num-executors 15 \
--executor-cores 5 \
--executor-memory 32g \
--driver-memory 8g \
--driver-cores 8 \
I have used KryoSerializer. The following is set in sparkConf:
sparkConf.set("spark.driver.allowMultipleContexts", "true")
sparkConf.set("spark.scheduler.mode", "FAIR")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryoserializer.buffer.max", "1024m")
How can I optimise this so that the last 2 partitions don't take so much time?
Thanks
Try using Kryo serialization; the default Java serialization is a bit slow.
When it is writing back to disk and reading the data back, it is transferring huge amounts of data, so you need a better serialization technique.
You are running in client mode; try running in cluster mode. In cluster mode the driver is launched within the cluster as another process.
Executor processing time is mostly uneven depending on how you partitioned your data and how busy the worker node is at that time. Try revisiting your partitioning logic (try increasing the partition number) and increasing the number of cores and executor memory. This kind of tuning is specific to your use case. In my case, I increased the partition count from 100 to 1000, with 10 executors and 2 cores each, to solve this problem. Again, it depends on how large your dataset is.
NLP tasks can take variable amounts of time depending on the data. For example, Stanford's CoreNLP takes a very long time to do NER on long sentences (time goes up with the square of the number of tokens in the sentence).
If it is a different partition each time, then I would guess that your cluster is sized in an inconvenient way: the number of executors does not divide evenly into the number of partitions. An easy way to check this is to repartition your 70 partitions into 700 and see whether the slowdown is still confined to 2 or 3 partitions, or whether the time evens out. If it is still two or three partitions, it is likely a data issue. If the time evens out, it was due to a partition/executor mismatch.
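A quick, hedged way to run that check and also see whether the rows really are evenly spread (df is assumed to be the dataframe feeding the sentiment analysis step):

# Repartition to 700 and count rows per partition without collecting the data itself.
df700 = df.repartition(700)
sizes = df700.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()
print(min(sizes), max(sizes), sum(sizes) / len(sizes))

If the per-partition counts are roughly equal but a couple of tasks still dominate the runtime, the skew is in processing cost per row (e.g. very long comments) rather than in row counts.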