Time-consuming write of a Spark Dataset into Oracle DB using the JDBC driver - scala

I am using Apache Spark for loading and processing datasets and for writing the results into an Oracle DB using the JDBC driver.
I am using the Spark JDBC write method to write the Dataset into the database.
However, writing the Dataset to the DB takes roughly the same (long) time whether I am writing 10 rows or 10 million rows into different tables of the database.
I want to know how to tune the performance of this write method in Spark, so that we make good use of the Spark compute engine. Otherwise there is little benefit in using it for fast computation if writing the dataset into the database takes so long.
The code used to write the 10 rows and the 10 million rows is as follows.
With 10 rows to write:
finalpriceItemParamsGroupTable.distinct().write()
    .mode("append")
    .format("jdbc").option("url", connection)
    .option("dbtable", CI_PRICEITEM_PARM)
    .save();
With 10 million rows to write:
finalPritmOutput.distinct().write()
    .mode("append")
    .format("jdbc").option("url", connection)
    .option("dbtable", CI_TXN_DTL)
    .save();
Attaching a screenshot of the Apache Spark dashboard (Spark Stages screenshot).
Any help would be much appreciated.

You can bulk insert the records at once, rather than inserting 1,000 records (the default) at a time, by adding the batchsize option and increasing its value:
finalPritmOutput.distinct().write()
.mode("append")
.format("jdbc").option("url", connection)
.option("dbtable", CI_TXN_DTL)
.option("batchsize", "100000")
.save()
Refer to https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases for how to configure your JDBC data source for better performance.
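As a further illustration, a few other documented JDBC writer options (numPartitions, isolationLevel) can be combined with batchsize and a repartition. The sketch below is only illustrative; the partition count and isolation level are placeholder values, not tuned for the asker's Oracle instance:
finalPritmOutput.distinct()
    .repartition(32)                               // match parallelism to executor cores and what the DB can absorb
    .write()
    .mode("append")
    .format("jdbc")
    .option("url", connection)
    .option("dbtable", CI_TXN_DTL)
    .option("batchsize", "10000")                  // rows per JDBC batch (default 1000)
    .option("numPartitions", "32")                 // caps the number of concurrent JDBC connections used for the write
    .option("isolationLevel", "READ_COMMITTED")    // transaction isolation for the write (default READ_UNCOMMITTED)
    .save()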

Related

Faster write to MySQL using databricks write

I am currently working on an Azure Databricks notebook that reads files from a storage container into a dataframe and then writes all the records to a table in MySQL. A file can have anywhere from 2 million to 10 million rows. I have the following write code in my notebook, after using read to populate my dataframe:
newFile.repartition(16).write
.format("jdbc").option("driver", "com.mysql.cj.jdbc.Driver")
.mode("append")
.option("url", s"${jdbc_url}")
.option("dbtable", pathToDBTable)
.option("user", s"user")
.option("password", s"pass")
.save()
I have played around with the partitions and decided to go with 16 because my cluster will have 16 cores. Other than this, is there a faster way to insert all this data into my DB using write? Or any other suggested approaches to try within Azure Databricks notebooks? It currently takes 10-15 minutes for 2-million-row files.
Most probably the delay is caused by the MySQL side: you're writing with 16 cores, and each one opens a separate connection to MySQL, so you can overload the database. Performance can also be affected if you have indexes on the columns, etc.
So it's recommended to check the MySQL side for reported problems, look at the load on the database node, check how many cores the instance has, and so on.
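If the MySQL side does turn out to be the bottleneck, one sketch worth trying (assuming the Connector/J driver and that jdbc_url has no existing query string) is to cap the number of writers with numPartitions, raise batchsize, and enable Connector/J's rewriteBatchedStatements so batched INSERTs are collapsed into multi-row statements:
newFile.repartition(8)                                          // fewer writers = fewer concurrent MySQL connections
  .write
  .format("jdbc").option("driver", "com.mysql.cj.jdbc.Driver")
  .mode("append")
  .option("url", s"${jdbc_url}?rewriteBatchedStatements=true")  // assumes jdbc_url has no query parameters yet
  .option("dbtable", pathToDBTable)
  .option("user", s"user")
  .option("password", s"pass")
  .option("batchsize", "10000")                                 // rows per JDBC batch (default 1000)
  .option("numPartitions", "8")                                 // hard cap on concurrent JDBC connections
  .save()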

Spark JDBC Save to HDFS Performance

Below is my problem statement; I'm looking for suggestions.
1) I have 4-5 dataframes that are reading data from a Teradata source using the Spark JDBC read API.
2) These 4-5 dataframes are combined into a final dataframe FinalDF that uses a shuffle partition of 1000.
3) My data volume is really high; currently each of the tasks is processing > 2 GB of data.
4) Lastly, I am writing the FinalDF into an ORC file in HDFS.
5) For the queries I populate into the dataframes via JDBC, I am using predicates in the JDBC API for the date ranges (a rough sketch of this read pattern is included after my questions below).
My questions are as below:
1) While it writes the DF as ORC, does it internally work like a foreachPartition? That is, is the action invoked for each partition while it tries to fetch the data from the source via the JDBC call?
2) How can I improve the process performance-wise? Currently some of my tasks die due to the large data in the RDDs, and my stages are hitting memory spills.
3) I have a limitation on opening too many sessions to the Teradata source, as there is a limit set on the source database; this stops me from running many executors, since I can only stay within the limit of 300 concurrent sessions.
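For reference, here is a rough sketch of the predicate-based read from point 5 above; the URL, credentials, table, and date ranges are placeholders, not values from the question. Each predicate becomes one partition and one JDBC connection, so the length of the predicates array also bounds the number of concurrent Teradata sessions:
import java.util.Properties

val connProps = new Properties()
connProps.put("user", "td_user")           // placeholder credentials
connProps.put("password", "td_pass")

// One predicate per date range => one partition (and one Teradata session) each
val datePredicates = Array(
  "txn_date >= DATE '2019-01-01' AND txn_date < DATE '2019-02-01'",
  "txn_date >= DATE '2019-02-01' AND txn_date < DATE '2019-03-01'",
  "txn_date >= DATE '2019-03-01' AND txn_date < DATE '2019-04-01'"
)

val df1 = spark.read.jdbc(
  "jdbc:teradata://td-host/DATABASE=mydb",  // placeholder URL
  "mydb.txn_table",                         // placeholder table
  datePredicates,
  connProps
)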

spark bulk insert millions of records into sql table of 400 columns GC limit exceeded

I'm relatively new to Spark and Scala, and I am trying to bulk insert a dataframe containing millions of records into MS SQL. I am using azure-sqldb-spark to do the insertion, but Spark crashes (GC limit exceeded or heartbeat not responding) before the actual insertion is made.
I have tried increasing the memory, executors, timeout, etc., but am still not able to make it write to the database. Normalising the table schema of 400 columns is not an option for me.
I'd appreciate any advice on how I can approach this issue. Thanks in advance.
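As a generic illustration only (this uses Spark's built-in JDBC writer, not the azure-sqldb-spark bulk-copy API from the question), one way to keep individual writes small is to split the dataframe into slices and append each one with a modest batch size; wideDf and all connection values below are placeholders:
// Split into ten roughly equal slices so each write stays small
val slices = wideDf.randomSplit(Array.fill(10)(1.0))

slices.foreach { slice =>
  slice.write
    .mode("append")
    .format("jdbc")
    .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")  // placeholder
    .option("dbtable", "dbo.wide_table")                            // placeholder
    .option("user", "sql_user")
    .option("password", "sql_pass")
    .option("batchsize", "5000")                                    // rows per JDBC batch
    .save()
}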

Optimize fetch size on Spark Couchbase connector

I'm using Spark to join a table from another database with Couchbase, using Spark Datasets.
val couchbaseTable = session.read.couchbase(StructType(StructField("name", StringType) :: Nil))
On the Couchbase console I can see the ops rise to 500 (capped) and then drop to 0 after a few seconds. I did a load test using the Java API and reactivex and was able to reach 20k ops.
How can I increase the fetch size (batch/bulk size) so that all the docs are fetched by Spark at once for processing, as I can do with the Cassandra connector?
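For comparison, the Cassandra setting the question alludes to is (if I recall the property name correctly) spark.cassandra.input.fetch.size_in_rows, set on the Spark session; whether the Couchbase connector exposes an equivalent knob is not confirmed here and would have to come from its own documentation:
import org.apache.spark.sql.SparkSession

// Cassandra connector analogue only: controls how many rows each server-side fetch returns
val sparkSession = SparkSession.builder()
  .appName("couchbase-join")
  .config("spark.cassandra.input.fetch.size_in_rows", "10000")
  .getOrCreate()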

Performance Issue with writing data to snowflake using spark df

I am trying to read data from an AWS RDS system and write to Snowflake using Spark.
My Spark job makes a JDBC connection to RDS and pulls the data into a dataframe; the same dataframe I then write to Snowflake using the Snowflake connector.
Problem statement: when I am trying to write the data, even 30 GB of data takes a long time to write.
Solutions I tried:
1) repartition the dataframe before writing.
2) caching the dataframe.
3) taking a count of df before writing to reduce scan time at write.
It may have been a while since this question was asked. If you are preparing the dataframe, or using another tool to prepare your data to move to Snowflake, the Python connector integrates very nicely.
Some general recommendations for troubleshooting the query, in addition to the comments above, which are great: were you able to resolve the JDBC connection with the recent updates?
Some other troubleshooting to consider:
Save time by going directly from Spark to Snowflake with the Spark connector: https://docs.snowflake.net/manuals/user-guide/spark-connector.html
For larger data sets, increasing the warehouse size for the session you are using, and loading data in smaller 10 MB to 100 MB files, will generally increase compute speed.
Let me know what you think, I would love to hear how you solved it.
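As a rough sketch of the direct Spark-to-Snowflake path mentioned above (all connection values are placeholders, and rdsDf stands in for the dataframe pulled from RDS):
// Write straight to Snowflake with the spark-snowflake connector
val sfOptions = Map(
  "sfURL"       -> "account.snowflakecomputing.com",
  "sfUser"      -> "sf_user",
  "sfPassword"  -> "sf_pass",
  "sfDatabase"  -> "MY_DB",
  "sfSchema"    -> "PUBLIC",
  "sfWarehouse" -> "MY_WH"
)

rdsDf.write
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "TARGET_TABLE")   // placeholder target table
  .mode("append")
  .save()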