Optimizing Spark AWS Glue jobs - PostgreSQL

I'm reading 8 tables from Aurora PostgreSQL using PySpark on AWS Glue and, after transformations and joins, writing to one table in Redshift of around 2-5 GB. The read table sizes are as below:
92 GB, 20 GB, 68 MB, 50 MB, 8 MB, 7 MB, 6 MB, 1.5 MB, 88 KB, 56 KB
Number of Standard worker nodes: 10, with concurrency between 1 and 3 (if that is in any way helpful).
I'm reading the 2 big tables with filtering applied while fetching from PostgreSQL. I'm also trying to apply the Kryo serializer to the Glue job - will this help, and if so, how can we apply and verify it?
billing_fee_df = glueContext.read.format("jdbc") \
    .option("driver", "org.postgresql.Driver") \
    .option("url", "jdbc:postgresql://prod-re.cbn6.us-west-2.rds.amazonaws.com:5432/db") \
    .option("dbtable", """(SELECT <specific columns>
                           FROM first_largest_table cc
                           LEFT JOIN second_largest_table cf ON cc.id = cf.id
                           LEFT JOIN thirdTable con ON cf.id = con.id
                           WHERE cc.maturity_date > ((current_date - 45)::timestamp without time zone AT TIME ZONE 'UTC')) AS bs""") \
    .option("user", "postgres") \
    .option("password", "pass") \
    .option("numPartitions", "100") \
    .option("partitionColumn", "date_col") \
    .option("lowerBound", "2020-07-30 00:00:00.000") \
    .option("upperBound", "2020-12-11 00:00:00.000") \
    .load()
Below are the optimizations I'm already implementing:
Broadcasting all the smaller tables (a sketch of this and the Kryo setting is shown below).
Column pruning.
My job currently finishes in 20 minutes. I'm looking for suggestions on how to improve performance and finish the job in less time while considering the cost aspects.
Any suggestions and questions are appreciated.
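For reference, here is a minimal sketch of how the Kryo serializer and a broadcast hint could be wired into a Glue script. The small_lookup_df name is only a placeholder, and whether Kryo helps at all is uncertain, since DataFrame joins mostly use Spark's internal Tungsten format rather than Java/Kryo serialization.

from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import broadcast
from awsglue.context import GlueContext

# Set Kryo before the SparkContext is created, then verify it took effect.
conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
print(spark.conf.get("spark.serializer"))  # should print the Kryo class name

# Broadcast hint for one of the small tables (small_lookup_df is illustrative).
joined_df = billing_fee_df.join(broadcast(small_lookup_df), "id", "left")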

You probably need to take a step back and understand where your job is spending most of its time. Is the initial read from Postgres the limiting factor? The joins and computation afterwards? The write to Redshift? The Spark history server is the go-to place to start getting this information: click on the SQL tab and look at the execution graph and how long each stage took to complete. While you're there, also check whether there is any skew. Then click on the details section, get the query plan, and paste it into the question.
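As a concrete way to pull that plan out of a Glue job, df.explain(True) on the final DataFrame prints the parsed, analyzed, optimized, and physical plans to the driver log. The Glue job parameters shown in the comments are the names as I recall them from the Glue documentation, so verify them before relying on them.

# final_df is a placeholder for the fully joined/transformed DataFrame
# that gets written to Redshift.
final_df.explain(True)   # prints the logical and physical plans

# Glue job parameters (set on the job definition, not in code), assumed names:
#   --enable-spark-ui        true
#   --spark-event-logs-path  s3://<your-bucket>/spark-event-logs/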

Related

Faster write to MySQL using Databricks write

I am currently working on an Azure Databricks notebook that reads files from a storage container into a DataFrame and then writes all the records to a table in MySQL. A file can have anywhere from 2 million to 10 million rows. I have the following write code in my notebook, after using read to populate my DataFrame.
newFile.repartition(16).write
    .format("jdbc")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .mode("append")
    .option("url", s"${jdbc_url}")
    .option("dbtable", pathToDBTable)
    .option("user", s"user")
    .option("password", s"pass")
    .save()
I have played around with the partitions and decided to go with 16 because my cluster will have 16 cores. Other than this, is there a faster way to insert all this data into my DB using write? Or any other suggested approaches to try within Azure Databricks notebooks? It currently takes 10-15 minutes for 2-million-row files.
Most probably the delay is caused by the MySQL side: you're writing using 16 cores, and each opens a separate connection to MySQL, so you can overload the database. Performance could also be affected if you have indexes on the columns, etc.
So it's recommended to check the MySQL side for reports of problems, look at the load on the database node, check how many cores the instance has, etc.
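If the writer side also needs tuning, JDBC batching is the usual first lever. A sketch in PySpark (the same options apply to the Scala writer above); the rewriteBatchedStatements flag belongs to MySQL Connector/J, and appending it this way assumes the URL has no existing query string:

(newFile
    .repartition(16)
    .write
    .format("jdbc")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    # Connector/J rewrites each batch into multi-row INSERT statements.
    .option("url", jdbc_url + "?rewriteBatchedStatements=true")
    .option("dbtable", pathToDBTable)
    .option("user", "user")
    .option("password", "pass")
    .option("batchsize", 10000)   # rows per JDBC batch; tune against the DB's capacity
    .mode("append")
    .save())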

Spark dataset write to parquet file takes forever

My Spark Scala app is getting stuck at the statement below and runs for more than 3 hours before timing out due to the timeout settings. Any pointers on how to understand and interpret the job execution in the YARN UI and debug this issue are appreciated.
dataset
    .repartition(100, $"Id")
    .write
    .mode(SaveMode.Overwrite)
    .partitionBy(dateColumn)
    .parquet(temppath)
I have a bunch of joins; the largest dataset is ~15 million rows and the smallest is < 100 rows. I tried multiple options like increasing the executor memory and Spark driver memory, but no luck so far. Note that I have cached the datasets I use multiple times, and the final dataset's storage level is set to MEMORY_AND_DISK_SER.
Not sure whether the executors summary below will help or not:
Executors (summary)
Total tasks: 7749 | Input: 98 GB | Shuffle read: 77 GB | Shuffle write: 106 GB
I'd appreciate any pointers on how to identify and understand the bottleneck based on the query plan or any other info.
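One quick check before digging into the plan: see whether the repartition key itself is skewed, since repartition(100, $"Id") funnels all rows with the same Id into a single task. A PySpark sketch using the names from the question (translate to the Scala Dataset API as needed):

from pyspark.sql import functions as F

# Rows per key: a handful of very large counts here usually explains a stuck write stage.
dataset.groupBy("Id").count().orderBy(F.desc("count")).show(20, truncate=False)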

Optimal parameters to speed up Spark df.write to PostgreSQL

I am trying to write a PySpark DataFrame of ~3 million rows x 158 columns (~3 GB) to TimescaleDB.
The write operation is executed from a Jupyter kernel with the following resources:
1 driver: 2 vCPU, 2 GB memory
2 executors: 2 vCPU, 4 GB memory
As one could expect, it is fairly slow.
I know of repartition and batchsize, so I am trying to play with those parameters to speed up the write operation, but I was wondering what would be the optimal parameters to be as performant as possible.
df.rdd.getNumPartitions() is 7; should I try to increase or decrease the number of partitions?
I've tried playing with it a bit, but I did not get any conclusive results. Increasing the number of partitions does seem to slow down the writing, but it might just be because Spark performs the repartition first.
I am more specifically wondering about batchsize. I guess the optimal batchsize depends on the TimescaleDB/PostgreSQL config, but I haven't been able to find more info about this.
For the record, here is an example of what I've tried :
df.write \
    .mode("overwrite") \
    .format("jdbc") \
    .option("url", "my_url") \
    .option("user", "my_user") \
    .option("password", "my_pwd") \
    .option("dbtable", "my_table") \
    .option("numPartitions", "5") \
    .option("batchsize", "10000") \
    .save()
This took 26 minutes on a much smaller sample of the dataframe (~500K rows, 500MB).
We are aware our Jupyter kernel is lacking in resources and are trying to work on that too, but is there a way to optimize the writing speed with Spark and TimescaleDB parameters?
[EDIT] I have also read this very helpful answer about using COPY, but we are specifically looking for ways to increase performance using Spark for now.
If it's using JDBC, the reWriteBatchedInserts=true parameter, which was introduced a while back (https://jdbc.postgresql.org/documentation/changelog.html#version_9.4.1209), will likely speed things up significantly. It can simply be added to the connection string, or there may be a way to specify it in the Spark connector.
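Applied to the write in the question, that suggestion would look roughly like this; the flag name comes from the PostgreSQL JDBC driver, and appending it assumes my_url has no existing query parameters:

(df.write
    .mode("overwrite")
    .format("jdbc")
    # Driver-side rewrite of each batch into multi-row INSERTs.
    .option("url", "my_url?reWriteBatchedInserts=true")
    .option("user", "my_user")
    .option("password", "my_pwd")
    .option("dbtable", "my_table")
    .option("numPartitions", "5")
    .option("batchsize", "10000")
    .save())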

How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = (read from S3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s"""(query to dedup against prod data)""")
sqlDFProdDedup.repartition($"partition_column")
    .write.partitionBy("partition_column")
    .mode(SaveMode.Append)
    .parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also to increase the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as is and trying to improve the performance of the overall job. Here are a couple of my observations:
Not sure what the coalesce(numPartitions) number actually is and why it's being used before the de-duplication process. Your spark-submit shows you are creating 1600 partitions, and that's good enough to start with.
If you are going to repartition before the write, then the coalesce above may not be beneficial at all, as repartition will shuffle the data.
Since you say you are writing 10-20 parquet files, it means you are only using 10-20 cores for writing in the last part of your job, which is the main reason it's slow. Based on the 100 GB estimate, the parquet files range from approximately 5 GB to 10 GB, which is really huge, and I doubt anyone will be able to open them on a local laptop or EC2 machine unless they use EMR or similar (with huge executor memory, if reading the whole file or spilling to disk), because the memory requirement will be too high. I would recommend creating parquet files of around 1 GB to avoid any of those issues.
Also, if you create 1 GB parquet files, you will likely speed up the process 5 to 10 times, as you will be using more executors/cores to write them in parallel. You can run an experiment by simply writing the dataframe with default partitions.
Which brings me to the point that you really don't need to repartition, since you want the write.partitionBy("partition_date") call. Your repartition() call is actually forcing the dataframe to have at most 30-31 partitions, depending on the number of days in that month, which is what drives the number of files being written. write.partitionBy("partition_date") writes the data into S3 partitions, and if your dataframe has, say, 90 partitions it will write 3 times faster (3 x 30). df.repartition() is forcing it to slow down. Do you really need to have 5 GB or larger files?
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case it will most likely only use the number of executors for the whole program based on the repartition(number). Instead you should try df.cache() -> df.count() and then df.write() (see the sketch after this answer). What this does is force Spark to use all available executor cores. I am assuming you are reading files in parallel; in your current implementation you are likely using 20-30 cores. One point of caution: as you are using r4/r5 machines, feel free to up your executor memory to 48G with 8 cores. I have found 8 cores to be faster for my tasks than the standard 5-core recommendation.
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs better than, or at least no worse than, G1GC. Please give it a try.
In my workload, I use a coalesce(n)-based approach where 'n' gives me a 1 GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part are my cores idle, but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found 1 GB seems acceptable with pandas, Redshift Spectrum, Athena, etc.
Hope it helps.
Charu
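A minimal PySpark sketch of the cache-then-count pattern suggested above, using the names from the question; treat it as an experiment to run against the Spark UI rather than a guaranteed fix:

# Materialize the deduplicated frame with all available cores first,
# then let partitionBy lay out the S3 directories without a prior repartition().
sqlDFProdDedup.cache()
sqlDFProdDedup.count()

(sqlDFProdDedup.write
    .partitionBy("partition_column")
    .mode("append")
    .parquet(outputPath))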
Here are some optimizations for faster running.
(1) File committer - this is how Spark will write the part files out to the S3 bucket. Each operation is distinct and will be based upon
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description: this writes the files directly to the final part files instead of initially loading them into temp files and copying them over to their end-state part files.
(2) For file size, you can derive it from the average number of bytes per record. Below I am figuring out the number of bytes per record to work out the number of records for 1024 MB. I would try it first with 1024 MB per partition, then move upwards.
import org.apache.spark.util.SizeEstimator

// Estimate the dataset size, then derive how many records fit in roughly 1024 MB
// (1024 MB = 1073741824 bytes).
val numberBytes: Long = SizeEstimator.estimate(inputDF.rdd)
val reduceBytesTo1024MB = numberBytes / 1073741824L
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords / reduceBytesTo1024MB).toInt + 1
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher and you are outputting Parquet, you can set the Parquet-optimized writer to TRUE; a sketch applying both settings follows.
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
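Both settings can be applied without touching the job code, for example at session construction (or equivalently via --conf on spark-submit). A sketch using exactly the two keys quoted in this answer:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # v2 committer: write part files directly instead of renaming from a temp dir.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # EMR's optimized Parquet committer (EMR 5.19+), as described above.
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    .getOrCreate())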

Writing more than 50 million rows from a PySpark df to PostgreSQL, most efficient approach

What would be the most efficient way to insert millions of records, say 50 million, from a Spark dataframe into Postgres tables?
I have done this from Spark to MSSQL in the past by making use of the bulk copy and batch size options, which was successful.
Is there something similar that can be done here for Postgres?
Adding the code I have tried and the time it took to run the process:
from datetime import timedelta
from timeit import default_timer as timer

def inserter():
    start = timer()
    sql_res.write.format("jdbc").option("numPartitions", "5").option("batchsize", "200000") \
        .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
        .option("dbtable", "public.full_load").option("user", "root").option("password", "password").save()
    end = timer()
    print(timedelta(seconds=end - start))

inserter()
So I ran the above approach for 10 million records, with 5 parallel connections as specified in numPartitions, and also tried a batch size of 200k.
The total time it took for the process was 0:14:05.760926 (fourteen minutes and five seconds).
Is there any other efficient approach which would reduce the time?
What would be an efficient or optimal batch size to use? Will increasing my batch size make the job quicker? Or would opening multiple connections, i.e. > 5, help me make the process quicker?
On average, 14 minutes for 10 million records is not bad, but I'm looking for people out there who have done this before and can help answer this question.
I actually did kind of the same work a while ago but using Apache Sqoop.
I would say that to answer this question we have to try to optimize the communication between Spark and PostgreSQL, specifically the data flowing from Spark to PostgreSQL.
But be careful: do not forget the Spark side. It does not make sense to execute mapPartitions if the number of partitions is too high compared with the maximum number of connections that PostgreSQL supports; if you have too many partitions and you open a connection for each one, you will probably get the following error: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already.
In order to tune the insertion process, I would approach the problem with the following steps:
Remember that the number of partitions is important. Check the number of partitions and then adjust it based on the number of parallel connections you want to have. You might want to have one connection per partition, so I would suggest checking coalesce, as mentioned here.
Check the maximum number of connections your PostgreSQL instance supports, and whether you want to increase that number.
For inserting data into PostgreSQL, using the COPY command is recommended. Here is also a more elaborate answer about how to speed up PostgreSQL insertion.
Finally, there is no silver bullet to do this job. You can use all the tips I mentioned above, but it will really depend on your data and use cases; a rough sketch combining the partitioning and batching points follows.
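A sketch of how the partition and batching advice above could be combined for the writer in the question: the coalesce(8) value, the append mode, and the reWriteBatchedInserts flag (mentioned in an earlier answer on this page) are all assumptions to tune against your max_connections and table setup:

(sql_res
    .coalesce(8)   # keep open connections well under PostgreSQL's max_connections
    .write
    .format("jdbc")
    .option("url", "jdbc:postgresql://xyz.com:5435/abc_db?reWriteBatchedInserts=true")
    .option("dbtable", "public.full_load")
    .option("user", "root")
    .option("password", "password")
    .option("batchsize", 100000)   # illustrative starting point; measure before settling
    .mode("append")                # assumes the target table already exists
    .save())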