Faster write to MySQL using Databricks write - Scala

I am currently working on an Azure Databricks notebook that reads files from a storage container into a DataFrame and then writes all the records to a table in MySQL. The file can have anywhere from 2 million to 10 million rows. I have the following write code in my notebook, after using read to populate my DataFrame.
newFile.repartition(16).write
  .format("jdbc")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .mode("append")
  .option("url", s"${jdbc_url}")
  .option("dbtable", pathToDBTable)
  .option("user", s"user")
  .option("password", s"pass")
  .save()
I have played around with the partitions and decided to go with 16 because my cluster will have 16 cores. Other than this, is there a faster way to insert all this data into my DB using write? Or any other approaches worth trying within Azure Databricks notebooks? It currently takes 10-15 minutes for 2-million-row files.

Most probably the delay is caused by the MySQL side: you're writing using 16 cores, and each is opening a separate connection to MySQL, so you can overload the database. Performance could also be affected by indexes on the columns, etc.
So it's recommended to check the MySQL side for reported problems, look at the load on the database node, check how many cores the instance has, etc.
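If the database side checks out, a couple of writer-side knobs are still worth trying: fewer concurrent connections, a larger batchsize, and rewriteBatchedStatements=true on the MySQL JDBC URL so Connector/J sends multi-row INSERTs. A hedged sketch (option names are standard Spark JDBC / Connector/J settings; the numbers are only starting points and need tuning against your instance):
// Sketch only: the URL suffix assumes jdbc_url has no query parameters yet.
val urlWithBatching = s"${jdbc_url}?rewriteBatchedStatements=true"  // Connector/J folds batches into multi-row INSERTs

newFile
  .repartition(8)                       // fewer concurrent connections if MySQL is getting overloaded
  .write
  .format("jdbc")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("url", urlWithBatching)
  .option("dbtable", pathToDBTable)
  .option("user", "user")
  .option("password", "pass")
  .option("batchsize", "10000")         // rows per JDBC batch; the default is 1000
  .mode("append")
  .save()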

Related

Optimizing Spark AWS Glue jobs

I'm reading 8 tables from Aurora Postgres using PySpark on AWS Glue and, after transformations and joins, writing to one table in Redshift of around 2-5 GB, with the read table sizes as below:
92 GB, 20 GB, 68 MB, 50 MB, 8 MB, 7 MB, 6 MB, 1.5 MB, 88 KB, 56 KB
Number of standard worker nodes: 10, with concurrency between 1-3 (if that is in any way helpful).
I'm reading the 2 big tables with filtering applied while fetching from Postgres, and trying to apply the Kryo serializer for the Glue job (will this help? If yes, how can we apply and verify it?).
billing_fee_df = glueContext.read.format("jdbc") \
    .option("driver", "org.postgresql.Driver") \
    .option("url", "jdbc:postgresql://prod-re.cbn6.us-west-2.rds.amazonaws.com:5432/db") \
    .option("dbtable", "(<sql query with specific column selection> from first_largest_table cc LEFT JOIN second_largest_table cf ON cc.id = cf.id LEFT JOIN thirdTable con ON cf.id = con.id where cc.maturity_date > ((current_date-45)::timestamp without time zone at time zone 'UTC')) as bs") \
    .option("user", "postgres") \
    .option("password", "pass") \
    .option("numPartitions", "100") \
    .option("partitionColumn", "date_col") \
    .option("lowerBound", "2020-07-30 00:00:00.000") \
    .option("upperBound", "2020-12-11 00:00:00.000") \
    .load()
Below are the optimizations I'm already implementing:
trying to apply broadcast joins on all the smaller tables;
doing column pruning.
My job currently finishes in 20 minutes. I'm looking for suggestions on how to improve performance and finish the job in less time while keeping the cost aspects in mind.
Any suggestions and questions are appreciated.
You probably need to take a step back and understand where your job is spending most of its time. Is the initial read from Postgres the limiting factor? The joins and computation afterwards? The write to Redshift? The Spark history server is the go-to place to start getting this information. Click on the SQL tab and look at the execution graph and how long each stage took to complete. While you're there, also check whether there is any skew. Then click on the details section, get the query plan, and paste it above.
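If digging through the UI is awkward, you can also get a rough split and the plan straight from the job itself; a small sketch in Scala (sourceDf and outputDf are placeholders for whatever you read from Postgres and finally write to Redshift; the same calls exist in PySpark):
// Isolate the read cost from the join/write cost by forcing the JDBC read first.
val t0 = System.nanoTime()
val readRows = sourceDf.cache().count()                  // action: materializes just the Postgres read
println(s"read $readRows rows in ${(System.nanoTime() - t0) / 1e9} s")

// Print the parsed/analyzed/optimized/physical plans for the dataframe written to Redshift.
outputDf.explain(true)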

How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = (read from s3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s"""(query to dedup against prod data)""")
sqlDFProdDedup.repartition($"partition_column")
.write.partitionBy("partition_column")
.mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also increasing the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as is and trying to improve the performance of the overall job. Here are a couple of my observations:
I'm not sure what the coalesce(numPartitions) number actually is, or why it's being used before the de-duplication step. Your spark-submit shows you are creating 1600 partitions, and that's good enough to start with.
If you are going to repartition before the write, then the coalesce above may not be beneficial at all, since repartition will shuffle the data anyway.
Since you say you are writing 10-20 parquet files, it means you are only using 10-20 cores for the write in the last part of your job, which is the main reason it's slow. Based on the 100 GB estimate, the parquet files range from approx 5 GB to 10 GB each, which is really huge, and I doubt anyone will be able to open them on a local laptop or EC2 machine unless they use EMR or similar (with huge executor memory if reading the whole file, or spilling to disk), because the memory requirement will be too high. I would recommend creating parquet files of around 1 GB to avoid any of those issues.
Also, if you create 1 GB parquet files, you will likely speed up the process 5 to 10 times, because you will be using more executors/cores to write them in parallel. You can actually run an experiment by simply writing the dataframe with its default partitions.
Which brings me to the point that you really don't need repartition() if you are calling write.partitionBy("partition_date"). Your repartition() call is forcing the dataframe to have at most 30-31 partitions, depending on the number of days in that month, and that is what drives the number of files being written. write.partitionBy("partition_date") already writes the data into S3 partitions, and if your dataframe had, say, 90 partitions it would write roughly 3 times faster (3 × 30). df.repartition() is forcing it to slow down. Do you really need 5 GB or larger files?
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case, it will most likely only use a number of executors based on the repartition(number) for the whole program. Instead you should try df.cache() -> df.count() and then df.write(). What this does is force Spark to use all available executor cores. I am assuming you are reading files in parallel; in your current implementation you are likely using only 20-30 cores. One point of caution: as you are using r4/r5 machines, feel free to raise your executor memory to 48G with 8 cores. I have found 8 cores to be faster for my tasks than the standard 5-core recommendation.
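A rough sketch of that cache-then-count experiment, reusing Zack's names from the question (sqlDFProdDedup, partition_column, outputPath):
import org.apache.spark.sql.SaveMode

// Cache and count first so the read + dedup run with all available executor cores,
// then write with partitionBy alone (no repartition) as the experiment above suggests.
val deduped = sqlDFProdDedup.cache()
deduped.count()                          // action: materializes the dedup in parallel

deduped.write
  .partitionBy("partition_column")       // S3 directory layout; does not require a prior repartition()
  .mode(SaveMode.Append)
  .parquet(outputPath)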
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs better than, or at least no worse than, G1GC. Please give it a try.
In my workload, I use a coalesce(n)-based approach, where 'n' gives me a 1 GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part are my cores idle, but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found 1 GB files to be acceptable with pandas, Redshift Spectrum, Athena, etc.
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this is how Spark writes the part files out to the S3 bucket. Each operation is distinct and will be based upon
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description
This writes the files directly to the final part files instead of initially writing them to temp files and copying them over to their end-state part files.
(2) For file size, you can derive it by getting the average number of bytes per record. Below I am figuring out the number of bytes per record in order to work out the number of records for 1024 MB. I would try it first with 1024 MB per partition, then move upwards.
import org.apache.spark.util.SizeEstimator

// Estimated in-memory size of the dataframe, in bytes.
val numberBytes: Long = SizeEstimator.estimate(inputDF.rdd)
// Number of ~1024 MB chunks in the data (1024 MB = 1073741824 bytes).
val reduceBytesTo1024MB = numberBytes / 1073741824L
val numberRecords = inputDF.count
// Approximate number of records that fit in 1024 MB.
val recordsFor1024MB = (numberRecords / reduceBytesTo1024MB).toInt + 1
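Presumably that count then gets fed into the spark.sql.files.maxRecordsPerFile setting Zack already mentions, along the lines of:
// Cap each output file at roughly the number of records estimated above.
spark.conf.set("spark.sql.files.maxRecordsPerFile", recordsFor1024MB.toLong)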
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher and are outputting Parquet, you can set the Parquet-optimized writer to TRUE.
spark.sql.parquet.fs.optimized.committer.optimization-enabled true

Writing more than 50 million rows from a PySpark df to PostgreSQL, most efficient approach

What would be the most efficient way to insert millions of records, say 50 million, from a Spark dataframe into Postgres tables?
I have done this from Spark to MSSQL in the past by making use of the bulk copy and batch size options, which was successful.
Is there something similar that can be done here for Postgres?
Adding the code I have tried and the time it took to run the process:
# assuming timer/timedelta come from the stdlib, as the usage suggests
from timeit import default_timer as timer
from datetime import timedelta

def inserter():
    start = timer()
    sql_res.write.format("jdbc").option("numPartitions", "5").option("batchsize", "200000") \
        .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
        .option("dbtable", "public.full_load").option("user", "root").option("password", "password") \
        .save()
    end = timer()
    print(timedelta(seconds=end - start))

inserter()
So I did the above approach for 10 million records, with 5 parallel connections as specified in numPartitions, and also tried a batch size of 200k.
The total time it took for the process was 0:14:05.760926 (fourteen minutes and five seconds).
Is there any other efficient approach which would reduce the time?
What would be an efficient or optimal batch size to use? Will increasing my batch size do the job quicker? Or would opening more connections, i.e. > 5, make the process quicker?
On average, 14 minutes for 10 million records is not bad, but I'm looking for people out there who have done this before and can help answer this question.
I actually did kind of the same work a while ago but using Apache Sqoop.
I would say that to answer this question we have to try to optimize the communication between Spark and PostgreSQL, specifically the data flowing from Spark to PostgreSQL.
But be careful not to forget the Spark side. It does not make sense to execute mapPartitions if the number of partitions is too high compared with the maximum number of connections PostgreSQL supports; if you have too many partitions and you are opening a connection for each one, you will probably get the following error: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already.
In order to tune the insertion process I would approach the problem with the following steps:
Remember that the number of partitions is important. Check the number of partitions and then adjust it based on the number of parallel connections you want to have. You probably want one connection per partition, so I would suggest looking at coalesce, as mentioned here.
Check the maximum number of connections your PostgreSQL instance supports, and increase it if you need to.
For inserting data into PostgreSQL, using the COPY command is recommended. Here is also a more elaborate answer about how to speed up PostgreSQL insertion.
Finally, there is no silver bullet for this job. You can use all the tips I mentioned above, but it will really depend on your data and use cases.
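As a concrete starting point on the Spark side, before reaching for COPY: match the number of partitions to the connections Postgres can absorb and let the pgjdbc driver rewrite batched inserts. A hedged sketch in Scala (the question uses PySpark, but the option strings are identical; reWriteBatchedInserts is a pgjdbc URL parameter, and the numbers are only illustrative):
// Sketch: one JDBC connection per partition; reWriteBatchedInserts lets pgjdbc
// turn the batched single-row INSERTs into multi-row INSERT statements.
sql_res
  .coalesce(8)                                   // ~ number of parallel connections you want to open
  .write
  .format("jdbc")
  .option("url", "jdbc:postgresql://xyz.com:5435/abc_db?reWriteBatchedInserts=true")
  .option("dbtable", "public.full_load")
  .option("user", "root")
  .option("password", "password")
  .option("batchsize", "200000")                 // rows per executeBatch call
  .mode("append")
  .save()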

Time consuming write process of Spark Dataset into the Oracle DB using JDBC driver

I am using Apache Spark for loading, processing, and outputting a dataset into an Oracle DB using the JDBC driver.
I am using the Spark JDBC write method for writing the Dataset into the database.
But while writing the Dataset into the DB, it takes the same time to write 10 rows as it does 10 million rows into the different tables of the database.
I want to know how to performance-tune this write method in Spark, so that we can make wise use of the Apache Spark compute engine. Otherwise, there is no benefit in using it for fast computation if it takes this long to write the dataset into the database.
The code to write the 10 rows and 10M rows is as follows:
with 10 rows to write:
finalpriceItemParamsGroupTable.distinct().write()
  .mode("append")
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", CI_PRICEITEM_PARM)
  .save();
with 10M rows to write:
finalPritmOutput.distinct().write()
  .mode("append")
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", CI_TXN_DTL)
  .save();
Attaching a screenshot of the Apache Spark dashboard: Spark Stages Screenshot
If someone can help out, that would be appreciated...
You can bulk insert the records at once, rather than inserting 1000 records (the default setting) at a time, by adding the batchsize option and increasing its value:
finalPritmOutput.distinct().write()
.mode("append")
.format("jdbc").option("url", connection)
.option("dbtable", CI_TXN_DTL)
.option("batchsize", "100000")
.save()
Refer to https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases on how to configure your jdbc for better performance.
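Beyond batchsize, the number of concurrent connections also matters; the writer's numPartitions option caps it (Spark coalesces the dataframe down to that number before writing). Extending the block above, in Scala syntax and with an illustrative value:
// batchsize controls rows per round trip; numPartitions caps the concurrent
// JDBC connections opened against Oracle (tune both against what the DB can handle).
finalPritmOutput.distinct().write
  .mode("append")
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", CI_TXN_DTL)
  .option("batchsize", "100000")
  .option("numPartitions", "8")        // illustrative value only
  .save()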

Performance issue with writing data to Snowflake using a Spark df

I am trying to read data from an AWS RDS system and write to Snowflake using Spark.
My Spark job makes a JDBC connection to RDS and pulls the data into a dataframe, and on the other hand I write that same dataframe to Snowflake using the Snowflake connector.
Problem statement: when I am trying to write the data, even 30 GB of data takes a long time to write.
Solutions I tried:
1) repartitioning the dataframe before writing;
2) caching the dataframe;
3) taking a count of the df before writing to reduce scan time at write.
It may have been a while since this question was asked. If you are preparing the dataframe, or using another tool to prepare your data to move to Snowflake, the Python connector integrates very nicely.
Some general recommendations for troubleshooting the query, in addition to the comments recommended above, which are great: were you able to resolve the JDBC connection with the recent updates?
Some other troubleshooting to consider:
Saving time by going directly from Spark to Snowflake with the Spark connector https://docs.snowflake.net/manuals/user-guide/spark-connector.html
For larger data sets, in general, increasing the warehouse size for the session you are using, and loading data in smaller 10 MB to 100 MB files, will increase compute speed.
Let me know what you think, I would love to hear how you solved it.
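For reference, a minimal write through the Spark connector mentioned above might look roughly like this (option names as in the Snowflake connector docs; the account, credentials, warehouse and table are placeholders):
// Sketch: write the dataframe straight to Snowflake via the spark-snowflake connector.
val sfOptions = Map(
  "sfURL"       -> "<account>.snowflakecomputing.com",
  "sfUser"      -> "<user>",
  "sfPassword"  -> "<password>",
  "sfDatabase"  -> "<database>",
  "sfSchema"    -> "<schema>",
  "sfWarehouse" -> "<warehouse>"        // a larger warehouse helps with big loads
)

df.write
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "<target_table>")
  .mode("append")
  .save()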