Optimize fetch size on the Spark Couchbase connector - Scala

I'm using Spark to join a table from another database with Couchbase, using Spark's Datasets.
val couchbaseTable = session.read.couchbase(StructType(StructField("name", StringType) :: Nil))
On the Couchbase console I can see the ops rising to 500, capped, and then dropping to 0 after a few seconds. I made a load test using the Java API with ReactiveX and was able to reach 20k ops.
How can I increase the fetch size (batch/bulk size) so that all the docs are fetched by Spark at once for processing, as I can do with the Cassandra connector?
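For reference, this is the kind of setting I mean from the Cassandra connector (spark.cassandra.input.fetch.size_in_rows is documented there; I'm looking for the Couchbase equivalent):
import org.apache.spark.sql.SparkSession

// The Cassandra-connector knob I'm referring to: rows fetched per round trip.
val spark = SparkSession.builder()
  .config("spark.cassandra.input.fetch.size_in_rows", "10000")
  .getOrCreate()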

Related

Spark JDBC Save to HDFS Performance

Below is my problem statement; I'm looking for suggestions.
1) I have 4-5 DataFrames that are reading data from a Teradata source using Spark's JDBC read API.
2) These 4-5 DataFrames are combined into a final DataFrame, FinalDF, which uses a shuffle partition count of 1000.
3) My data volume is really high; currently each task is processing > 2 GB of data.
4) Lastly, I am writing FinalDF into an ORC file in HDFS.
5) For the queries I populate into the DataFrames via JDBC, I am using predicates in the JDBC API for the date ranges (a sketch of this follows the questions below).
My questions are as follows:
1) While it writes the DF as ORC, does it internally work like a foreachPartition? I.e., is the action invoked per partition as it fetches the data from the source via the JDBC call?
2) How can I improve the performance of the process? Currently some of my tasks die due to the large data in the RDDs, and my stages are going into memory spills.
3) I have a limitation on opening too many sessions to the Teradata source, as there is a cap set on the source database; this stops me from running multiple executors, since I can only hold to the limit of 300 concurrent sessions.
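For point 5, a minimal sketch of the predicate-based read (connection details, table, and column names are placeholders). Each predicate becomes one partition, i.e. one concurrent JDBC session, so the size of the array is also the lever for question 3:
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("teradata-read").getOrCreate()

// One predicate per date range; Spark opens one JDBC session per predicate,
// so this array's length is what I tune against the 300-session cap.
val predicates = Array(
  "txn_date >= DATE '2019-01-01' AND txn_date < DATE '2019-02-01'",
  "txn_date >= DATE '2019-02-01' AND txn_date < DATE '2019-03-01'"
)

val props = new Properties()
props.setProperty("user", "...")        // placeholder credentials
props.setProperty("password", "...")
props.setProperty("fetchsize", "10000") // rows per network round trip

val df = spark.read.jdbc(
  "jdbc:teradata://host/DATABASE=mydb",  // placeholder URL
  "my_table",                            // placeholder table
  predicates,
  props
)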

Ideal strategy to maximise write throughput of an RDD in Cassandra

I have a 3-node cluster on the same DC and same rack. The keyspace has a replication factor of 2. I have a Spark application that takes data from Kafka, and now I'm saving the RDD to Cassandra with
rdd.saveToCassandra("db_name", "table_name")
I'm consuming with a time interval of 10 seconds; every batch has 10k records, and each batch is around 2.5 MB.
In the Spark conf I have set
.set("spark.cassandra.output.consistency.level", "ONE")
The application takes around 2-3 seconds to insert. Why so? I would like to optimise this. Earlier, when I was using a single-node machine with RF 1, I was able to insert at a rate of 0.8-1 second per batch. So why this much delay after increasing the nodes and the RF?
Are there any other settings I need to make in the Spark conf or on the Cassandra side to increase the write speed?
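For reference, these are the output-related settings I have found in the spark-cassandra-connector docs that look relevant (names per the connector reference; the values below are untested starting points, not validated recommendations):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.output.consistency.level", "ONE")
  .set("spark.cassandra.output.concurrent.writes", "16")       // parallel batches per task (default 5)
  .set("spark.cassandra.output.batch.size.rows", "auto")       // or a fixed row count
  .set("spark.cassandra.output.throughput_mb_per_sec", "128")  // per-core write throttle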

Time-consuming write of a Spark Dataset into Oracle DB using the JDBC driver

I am using Apache Spark for loading and processing a dataset, and for outputting it into Oracle DB using the JDBC driver.
I am using Spark's JDBC write method for writing the Dataset into the database.
But when writing the Dataset into the DB, it takes the same time to write 10 rows as it does 10 million rows into different tables of the database.
I want to know how to performance-tune this write method in Spark, so that we make wise use of the Apache Spark compute engine; otherwise there is no benefit in using it for fast computation if it is slow at writing the dataset into the database.
The code to write the 10 rows and the 10M rows is as follows:
With 10 rows to write:
finalpriceItemParamsGroupTable.distinct().write()
    .mode("append")
    .format("jdbc")
    .option("url", connection)
    .option("dbtable", CI_PRICEITEM_PARM)
    .save();
With 10M rows to write:
finalPritmOutput.distinct().write()
    .mode("append")
    .format("jdbc")
    .option("url", connection)
    .option("dbtable", CI_TXN_DTL)
    .save();
Attaching a screenshot of the Apache Spark dashboard (Spark Stages).
If someone can help out, that would be great.
You can bulk-insert the records at once, rather than inserting 1000 records at a time (the default), by adding the batchsize option and increasing its value:
finalPritmOutput.distinct().write()
.mode("append")
.format("jdbc").option("url", connection)
.option("dbtable", CI_TXN_DTL)
.option("batchsize", "100000")
.save()
Refer to https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases for how to configure JDBC for better performance.
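Along the same lines, and worth checking against your database's connection limits: the number of DataFrame partitions determines how many JDBC connections write in parallel, and the documented numPartitions option caps it. A sketch combining it with the larger batch (20 is an arbitrary example value):
finalPritmOutput.distinct().write()
    .mode("append")
    .format("jdbc")
    .option("url", connection)
    .option("dbtable", CI_TXN_DTL)
    .option("batchsize", "100000")
    .option("numPartitions", "20") // caps concurrent write connections
    .save()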

Performance issue when writing data to Snowflake using a Spark DataFrame

I am trying to read data from an AWS RDS system and write to Snowflake using Spark.
My Spark job makes a JDBC connection to RDS and pulls the data into a DataFrame; on the other hand, I write the same DataFrame to Snowflake using the Snowflake connector.
Problem statement: when I try to write the data, even 30 GB takes a long time to write.
Solutions I tried:
1) Repartitioning the DataFrame before writing.
2) Caching the DataFrame.
3) Taking a count of the DF before writing to reduce scan time at write.
It may have been a while since this question was asked. If you are preparing the DataFrame, or using another tool to prepare your data to move into Snowflake, the Python connector integrates very nicely.
Some general recommendations for troubleshooting the query, in addition to the comments above (which are great): were you able to resolve the JDBC connection with the recent updates?
Some other troubleshooting to consider:
Save time by going directly from Spark to Snowflake with the Spark connector: https://docs.snowflake.net/manuals/user-guide/spark-connector.html
For larger data sets, increasing the warehouse size for the session you are using, and loading data in smaller 10 MB to 100 MB files, will generally increase compute speed.
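A minimal sketch of that direct Spark-to-Snowflake write (the format name and sf* option keys follow the Snowflake Spark connector docs linked above; every connection value here is a placeholder):
val sfOptions = Map(
  "sfURL" -> "<account>.snowflakecomputing.com",
  "sfUser" -> "<user>",
  "sfPassword" -> "<password>",
  "sfDatabase" -> "<database>",
  "sfSchema" -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

df.write
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "TARGET_TABLE") // placeholder target
  .mode("append")
  .save()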
Let me know what you think; I would love to hear how you solved it.

Combining/Updating Cassandra queried data with Structured Streaming received from Kafka

I'm creating a Spark Structured Streaming application which will calculate over data received from Kafka every 10 seconds.
To be able to do some of the calculations, I need to look up some information about sensors and their placement in a Cassandra database.
I'm a little stuck wrapping my head around how to keep the Cassandra data available throughout the cluster, and how to update it from time to time in case we have made changes to the database table.
Currently, I query the database as soon as I start Spark locally, using the DataStax Spark-Cassandra connector:
val cassandraSensorDf = spark
  .read
  .cassandraFormat("specifications", "sensors")
  .load
From here on I can use this cassandraSensorDf by joining it with my Structured Streaming Dataset:
.join(
  cassandraSensorDf,
  sensorStateDf("plantKey") <=> cassandraSensorDf("cassandraPlantKey")
)
How do I run additional queries to update this Cassandra data while Structured Streaming is running?
And how can I make the queried data available in a cluster setting?
Using broadcast variables, you may write a wrapper to fetch data from Cassandra periodically and update a broadcast variable. Then do a map-side join on the stream with the broadcast variable. I have not tested this approach, and I think it might be overkill depending on your use case (throughput).
How can I update a broadcast variable in spark streaming?
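An untested sketch of that wrapper idea (the class and method names are mine; the table, keyspace, and key column come from the question). A driver-side timer calls refresh(), which re-reads the table and swaps the broadcast:
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{Row, SparkSession}

class SensorBroadcast(spark: SparkSession) {
  @volatile private var bc: Broadcast[Map[String, Row]] = load()

  // Re-read the Cassandra table and key it by the join column.
  private def load(): Broadcast[Map[String, Row]] = {
    val rows = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "specifications", "keyspace" -> "sensors"))
      .load()
      .collect()
    val byKey = rows.map(r => r.getString(r.fieldIndex("cassandraPlantKey")) -> r).toMap
    spark.sparkContext.broadcast(byKey)
  }

  // Read the current snapshot inside per-batch logic for a map-side join.
  def value: Map[String, Row] = bc.value

  // Call periodically from a driver-side thread to pick up table changes.
  def refresh(): Unit = synchronized {
    val old = bc
    bc = load()
    old.unpersist(blocking = false)
  }
}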
Another approach is to query Cassandra for every item in your stream. To optimise the connections, make sure you use connection pooling and create only one connection per JVM/partition. This approach is simpler; you don't have to worry about refreshing the Cassandra data periodically.
spark-streaming and connection pool implementation
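An untested sketch of this second approach using the connector's own session management: CassandraConnector caches one session per JVM, so withSessionDo inside mapPartitions does not open a connection per record. SensorState and sensorStateDs are placeholder names, and this assumes cassandraPlantKey is the partition key of specifications:
import com.datastax.spark.connector.cql.CassandraConnector

case class SensorState(plantKey: String, value: Double) // placeholder schema

val connector = CassandraConnector(spark.sparkContext.getConf)
import spark.implicits._

val enriched = sensorStateDs.mapPartitions { states =>
  connector.withSessionDo { session =>
    val stmt = session.prepare(
      "SELECT name FROM sensors.specifications WHERE cassandraPlantKey = ?")
    // Materialize inside the block so every query runs while the session is open.
    states.map { s =>
      val row = session.execute(stmt.bind(s.plantKey)).one()
      (s.plantKey, Option(row).map(_.getString("name")).orNull)
    }.toList.iterator
  }
}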