Spark JDBC Save to HDFS Performance - scala

Below is my problem statement; looking for suggestions.
1) I have 4-5 dataframes that are reading data from a Teradata source using the Spark JDBC read API.
2) These 4-5 dataframes are combined into a final dataframe FinalDF that uses a shuffle partition of 1000.
3) My data volume is really high; currently each of the tasks is processing > 2 GB of data.
4) Lastly I am writing the FinalDF into an ORC file in HDFS.
5) For the queries I am populating into the dataframes via JDBC, I am using predicates in the JDBC API for the date ranges (see the sketch below).
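For context, each JDBC read looks roughly like this (URL, credentials and the date ranges are placeholders, not my real values):
import java.util.Properties

val connProps = new Properties()
connProps.setProperty("user", "<user>")
connProps.setProperty("password", "<password>")
connProps.setProperty("driver", "com.teradata.jdbc.TeraDriver")

// Each predicate becomes one partition, i.e. one JDBC session against Teradata,
// so the length of this array bounds the concurrent sessions per dataframe.
val datePredicates = Array(
  "sale_date >= DATE '2020-01-01' AND sale_date < DATE '2020-02-01'",
  "sale_date >= DATE '2020-02-01' AND sale_date < DATE '2020-03-01'"
)

val df1 = spark.read.jdbc(
  "jdbc:teradata://<host>/DATABASE=<db>",
  "mydb.sales",
  datePredicates,
  connProps
)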
My questions are as below:
1) While it writes the DF as ORC, does it internally work like a foreachPartition, i.e. is the action triggered per partition while it fetches the data from the source via the JDBC call?
2) How can I improve the process performance-wise? Currently some of my tasks die due to the large data in the RDDs and my stages are going into memory spill.
3) I have a limitation on opening too many sessions to the Teradata source, as there is a limit set on the source database. This stops me from running many executors, since I can only hold on to the limit of 300 concurrent sessions.

Related

Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk

Checkpoint version:
val savePath = "/some/path"
spark.sparkContext.setCheckpointDir(savePath)
df.checkpoint()
Write to disk version:
df.write.parquet(savePath)
val df = spark.read.parquet(savePath)
I think both break the lineage in the same way.
In my experiments checkpoint is almost 30 times bigger on disk than parquet (689 GB vs. 24 GB). In terms of running time, checkpoint takes 1.5 times longer (10.5 min vs. 7.5 min).
Considering all this, what would be the point of using checkpoint instead of saving to file? Am I missing something?
Checkpointing is the process of truncating the RDD lineage graph and saving it to a reliable distributed (HDFS) or local file system. If you have a large RDD lineage graph and you want to freeze the content of the current RDD, i.e. materialize the complete RDD before proceeding to the next step, you generally use persist or checkpoint. The checkpointed RDD can then be used for some other purpose.
When you checkpoint, the RDD is serialized and stored on disk. It is not stored in Parquet format, so the data is not properly storage-optimized on disk, contrary to Parquet, which provides various compaction and encoding schemes to store the data efficiently. This would explain the difference in size.
You should definitely think about checkpointing in a noisy cluster. A cluster is called noisy if there are lots of jobs and users which compete for resources and there are not enough resources to run all the jobs simultaneously.
You must think about checkpointing if your computations are really expensive and take a long time to finish, because it can be faster to write an RDD to HDFS and read it back in parallel than to recompute it from scratch.
And there's a slight inconvenience prior to the Spark 2.1 release: there was no way to checkpoint a dataframe, so you had to checkpoint the underlying RDD. This issue was resolved in Spark 2.1 and above.
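A minimal sketch of both variants (the path is a placeholder, and note that Dataset.checkpoint() returns a new Dataset that you need to keep a reference to):
spark.sparkContext.setCheckpointDir("/some/checkpoint/dir")

// Spark 2.1+: checkpoint the Dataset directly (eager by default).
val checkpointedDf = df.checkpoint()

// Before Spark 2.1: checkpoint the underlying RDD and rebuild the DataFrame.
val rowRdd = df.rdd
rowRdd.checkpoint()
rowRdd.count()                                   // an action forces materialization
val rebuiltDf = spark.createDataFrame(rowRdd, df.schema)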
The problems with saving to disk as Parquet and reading it back are that:
It can be inconvenient in coding: you need to save and read multiple times.
It can be slower for the overall performance of the job, because when you save as Parquet and read it back the DataFrame needs to be reconstructed again.
This wiki could be useful for further investigation
As presented in the dataset checkpointing wiki
Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming - the now-obsolete Spark module for stream processing based on RDD API.
Checkpointing truncates the lineage of a RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.
Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of a Dataset being checkpointed.
One difference is that if your Spark job needs a certain in-memory partitioning scheme, e.g. if you use a window function, then checkpoint will persist that to disk, whereas writing to Parquet will not.
I'm not aware of a way with the current versions of Spark to write Parquet files and then read them in again with a particular in-memory partitioning strategy. Folder-level partitioning doesn't help with this.
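As an illustration (the column names and output path are made up, and a checkpoint directory is assumed to be set as in the question's snippet):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy("userId").orderBy("eventTime")
val ranked = df.withColumn("rank", row_number().over(w))

val viaCheckpoint = ranked.checkpoint()                   // keeps the hash partitioning on userId
ranked.write.parquet("/some/path/ranked")
val viaParquet = spark.read.parquet("/some/path/ranked")  // the partitioning information is gone
A second window over userId would reuse the partitioning of viaCheckpoint but reshuffle viaParquet.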

Time consuming write process of Spark Dataset into the Oracle DB using JDBC driver

I am using Apache Spark for loading, processing, and outputting the dataset into the Oracle DB using the JDBC driver.
I am using the Spark JDBC write method for writing the Dataset into the database.
But while writing the Dataset into the DB, it takes the same time to write 10 rows as to write 10 million rows into the different tables of the database.
I want to know how to performance-tune this write method using Spark, so that we make wise use of the Apache Spark compute engine. Otherwise there is no benefit in using it for fast computation if writing the dataset into the database takes so long.
The code to write the 10 rows and 10M rows is as follows:
with 10 rows to write
finalpriceItemParamsGroupTable.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_PRICEITEM_PARM).save();
with 10M rows to write
finalPritmOutput.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_TXN_DTL).save();
Attaching a screenshot of the Apache Spark dashboard (Spark Stages).
If someone can help out, that would be helpful...
You can bulk insert the records at once, rather than inserting 1000 records (the default) at a time, by adding the batchsize option and increasing its value:
finalPritmOutput.distinct().write()
.mode("append")
.format("jdbc").option("url", connection)
.option("dbtable", CI_TXN_DTL)
.option("batchsize", "100000")
.save()
Refer to https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases on how to configure your jdbc for better performance.

Performance Issue with writing data to snowflake using spark df

I am trying to read data from an AWS RDS system and write it to Snowflake using Spark.
My Spark job makes a JDBC connection to RDS and pulls the data into a dataframe, and on the other hand I write the same dataframe to Snowflake using the Snowflake connector.
Problem statement: when I try to write the data, even 30 GB takes a long time to write.
Solutions I tried:
1) repartitioning the dataframe before writing.
2) caching the dataframe.
3) taking a count of the df before writing to reduce scan time at write.
It may have been a while since this question was asked. If you are preparing the dataframe, or using another tool for preparing your data to move to Snowflake, the Python connector integrates very nicely.
Some general recommendations for troubleshooting the query, in addition to the comments recommended above (which are great): were you able to resolve the JDBC connection with the recent updates?
Some other troubleshooting to consider:
Saving time and going directly from Spark to Snowflake with the Spark connector (a minimal write sketch follows below): https://docs.snowflake.net/manuals/user-guide/spark-connector.html
For larger data sets, in general, increasing the warehouse size for the session you are using, and loading data in smaller 10 MB to 100 MB files, will increase compute speed.
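A minimal write sketch with the Spark connector (all option values are placeholders, df stands for the dataframe you are writing, and the full option list is in the connector docs linked above):
val sfOptions = Map(
  "sfURL"       -> "<account>.snowflakecomputing.com",
  "sfUser"      -> "<user>",
  "sfPassword"  -> "<password>",
  "sfDatabase"  -> "<database>",
  "sfSchema"    -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

df.write
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "TARGET_TABLE")
  .mode("append")
  .save()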
Let me know what you think, I would love to hear how you solved it.

Combining/Updating Cassandra Queried data to Structured Streaming received from Kafka

I'm creating a Spark Structured streaming application which is going to be calculating data received from Kafka every 10 seconds.
To be able to do some of the calculations, I need to look up some information about sensors and placement in a Cassandra database.
I'm a little stuck wrapping my head around how to keep the Cassandra data available throughout the cluster, and somehow update the data from time to time, in case we have made some changes to the database table.
Currently, I'm querying the database as soon as I start Spark locally, using the Datastax Spark-Cassandra connector:
val cassandraSensorDf = spark
.read
.cassandraFormat("specifications", "sensors")
.load
From here on I can use this cassandraSensorDf by joining it with my Structured Streaming Dataset:
.join(
  cassandraSensorDf,
  sensorStateDf("plantKey") <=> cassandraSensorDf("cassandraPlantKey")
)
How do I run additional queries to update this Cassandra data while Structured Streaming is running?
And how can I make the queried data available in a cluster setting?
Using broadcast variables, you may write a wrapper to fetch data from Cassandra periodically and update a broadcast variable. Then do a map-side join on the stream with the broadcast variable. I have not tested this approach, and I think it might well be overkill depending on your use case (throughput).
How can I update a broadcast variable in spark streaming?
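A rough sketch of that wrapper idea (the class name, TTL and the Map[String, Row] shape are assumptions for illustration, not a tested implementation):
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.cassandra._

// Driver-side wrapper that reloads the Cassandra lookup table and re-broadcasts
// it whenever the previous copy is older than ttlMs.
class RefreshableSensorLookup(spark: SparkSession, ttlMs: Long) {
  private var loadedAt = 0L
  private var bc: Broadcast[Map[String, Row]] = _

  private def load(): Map[String, Row] =
    spark.read
      .cassandraFormat("specifications", "sensors")   // same read as in the question
      .load()
      .collect()
      .map(r => r.getAs[String]("cassandraPlantKey") -> r)
      .toMap

  // Call this on the driver, e.g. once per micro-batch; use the returned
  // broadcast value inside a map/UDF for the map-side join.
  def get(): Broadcast[Map[String, Row]] = synchronized {
    if (bc == null || System.currentTimeMillis() - loadedAt > ttlMs) {
      if (bc != null) bc.unpersist(blocking = false)
      bc = spark.sparkContext.broadcast(load())
      loadedAt = System.currentTimeMillis()
    }
    bc
  }
}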
Another approach is to query Cassandra for every item in your stream. To optimise the connections, you should make sure that you use connection pooling and create only one connection per JVM/partition. This approach is simpler since you don't have to worry about refreshing the Cassandra data periodically.
spark-streaming and connection pool implementation
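A hedged sketch of that per-partition variant, assuming the Datastax spark-cassandra-connector is on the classpath; keyspace, table and column names are the ones from the question and otherwise unverified. You would call it per micro-batch (e.g. from foreachBatch):
import org.apache.spark.sql.DataFrame
import com.datastax.spark.connector.cql.CassandraConnector

// CassandraConnector is serializable and pools sessions per executor JVM, so
// each partition reuses a pooled session instead of opening one per record.
def enrichWithCassandra(batch: DataFrame, connector: CassandraConnector): Unit = {
  batch.rdd.foreachPartition { rows =>
    connector.withSessionDo { session =>
      val stmt = session.prepare(
        "SELECT * FROM sensors.specifications WHERE cassandraPlantKey = ?")
      rows.foreach { row =>
        val spec = session.execute(stmt.bind(row.getAs[String]("plantKey"))).one()
        // ... combine `row` with `spec` and write the enriched record onward ...
      }
    }
  }
}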

Storing streaming data in Hive using Spark

I am creating an application that receives streaming data, which goes into Kafka and then to Spark. I consume the data, apply some logic and then save the processed data into Hive. The velocity of data is very fast; I am getting 50K records per minute. There is a 1-minute window in Spark Streaming in which it processes the data and saves it into Hive.
My question is: is this architecture fine from a production perspective? If yes, how can I save the streaming data into Hive? What I am doing is creating a dataframe of the 1-minute window data and saving it into Hive by using
results.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("stocks")
I have not created the pipeline yet. Is this fine, or do I have to modify the architecture?
Thanks
I would give it a try!
BUT kafka->spark->hive is not the optimal pipeline for your use case.
Hive is normally backed by HDFS, which is not designed for frequent small inserts/updates/selects.
So your plan can end up with the following problems:
many small files, which leads to bad performance
your window gets too small because processing takes too long
Suggestion:
Option 1:
- use Kafka just as a buffer queue and design your pipeline like
- kafka->hdfs (e.g. with Spark or Flume)->batch Spark to Hive/Impala table (see the sketch below)
Option 2:
- kafka->Flume/Spark to HBase/Kudu->batch Spark to Hive/Impala
Option 1 has no "realtime" analysis option; it depends on how often you run the batch Spark job.
Option 2 is a good choice that I would recommend: store, say, 30 days in HBase and all older data in Hive/Impala. With a view you will be able to join new and old data for realtime analysis.
Kudu makes the architecture even easier.
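For the kafka->hdfs leg of option 1, a rough Structured Streaming sketch (broker, topic and paths are placeholders):
import org.apache.spark.sql.streaming.Trigger

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "stocks")
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")

stream.writeStream
  .format("parquet")
  .option("path", "/data/stocks/raw")
  .option("checkpointLocation", "/data/stocks/_checkpoints")
  .trigger(Trigger.ProcessingTime("1 minute"))   // a larger trigger interval means fewer, bigger files
  .start()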
Saving data into Hive tables can be tricky if you want to partition it and use it via HiveQL.
But basically it would work like the following:
xml.write.format("parquet").mode("append").saveAsTable("test_ereignis_archiv")
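If you want the table partitioned (the partition column here is just an assumption for illustration; the table should be created with this partitioning from the start):
xml.write
  .format("parquet")
  .mode("append")
  .partitionBy("trade_date")
  .saveAsTable("test_ereignis_archiv")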
BR