I am trying to read data from AWS RDS system and write to Snowflake using SPARK.
My SPARK job makes a JDBC connection to RDS and pulls the data into a dataframe and on other hand same dataframe I write to snowflake using snowflake connector.
Problem Statement : When I am trying to write the data, even 30 GB data is taking long time to write.
Solution I tried :
1) repartition the dataframe before writing.
2) caching the dataframe.
3) taking a count of df before writing to reduce scan time at write.
It may have been a while since this question was asked. If you are preparing the dataframe, or using another tool for preparing your data to move to Snowflake, the python connector integrates very nicely.
Some recommendations in general for troubleshooting the query, including the comments that were recommended above, which are great, were you able to resolve the jdbc connection with the recent updates?
Some other troubleshooting to consider:
Saving time and going directly from Spark to Snowflake with the Spark connector https://docs.snowflake.net/manuals/user-guide/spark-connector.html \
For larger data sets, in general increasing the warehouse size for the session you are using, and looping in data in smaller 10 mb to 100 mb size files will increase compute speed.
Let me know what you think, I would love to hear how you solved it.
Related
Below is my problem statement,looking for suggestions
1)I have 4-5 dataframes that arte reading data from a teradata source using spark jdbc read API.
2)These 4-5 dataframes are combined into a final dataframe FinalDF that uses a shuffle partition of 1000
3)My data volume is really high ,currently each of the tasks are processing > 2GB of data
4)Lastly i am writing the FinalDF into an ORC File in HDFS.
5)The queries i am populating into the dataframes using jdbc,i am using predicates in the jdbc api for the date ranges.
My questions are as below :
1)While it writes the DF as ORC,does it internally works like a foreachpartition.Like the Action is called for each partition while it tries to fetch the data from source via JDBC call?
2)How can i improve the process performance wise,currently some of my tasks die out due large data in the RDDs and my stages are going for a memory spill
3)I have limitation opening too many sessions to the teradata source as there is a limit set on the source database,this stops me running multiple execotrs as i could only hold on to the limit of 300 concurrent sessions.
I'm creating a Spark Structured streaming application which is going to be calculating data received from Kafka every 10 seconds.
To be able to do some of the calculations, I need to look up some information about sensors and placement in a Cassandra Database
I'm a little stuck at wrapping my head around how to keep the Cassandra data available throughout the cluster, and somehow update data from time to time, in-case we have done some changes to database table.
Currently, I'm querying the database as soon as I start the Spark locally using the Datastax Spark-Cassandra-connector
val cassandraSensorDf = spark
.read
.cassandraFormat("specifications", "sensors")
.load
From here on I can use this cassandraSensorDs by joining it with my Structured Streaming Dataset.
.join(
cassandraSensorDs ,
sensorStateDf("plantKey") <=> cassandraSensorDf ("cassandraPlantKey")
)
How do I do additional queries to update this Cassandra data while having Structured Streaming Running?
And how can I make the queried data available in a cluster setting?
Using broadcast variables, you may write a wrapper to fetch data from Cassandra periodically and update a broadcast variable. Do a map-side join on the stream with the broadcast variable. I have not tested this approach and I think this might as well be an overkill depending on your use case(throughput).
How can I update a broadcast variable in spark streaming?
Another approach is to query Cassandra for every item in your stream, to optimise on the connections you should make sure that you use connection pooling and create only one connection for a JVM/partition. This approach is simpler you don't have to worry about warming the Cassandra data periodically.
spark-streaming and connection pool implementation
Problem statement:
To transfer data from mongoDB to spark optimally with minimal latency
Problem Description:
I have my data stored in mongoDB and want to process the data (of the order ~100-500GB) using apache spark.
I used the mongoDB-spark connector and was able to read/write data from/to mongoDB (https://docs.mongodb.com/spark-connector/master/python-api/)
The problem was to create spark dataframe each time on the fly.
Is there a solution to handling such huge data transfers?
I looked into :
spark streaming API
Apache Kafka
Amazon S3 and EMR
But couldn't make a decision as to whether it was the optimal way to do it.
What strategy would you reckon to handle transferring such data?
Would having the data on the spark cluster and syncing just the deltas (changes in database) to the local file would be the way to go or just reading from mongoDB each time is the only way (or the optimal way) to go about it?
EDIT 1:
The following suggests to read data of mongoDB (due to secondary indexes, data retrieval is faster): https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
EDIT 2:
The advantages of using parquet format : What are the pros and cons of parquet format compared to other formats?
I am creating a application in which getting streaming data which goes into kafka and then on spark. consume the data, apply some login and then save processed data into the hive. velocity of data is very fast. I am getting 50K records in 1min. There is window of 1 min in spark streaming in which it process the data and save the data in the hive.
my question is for production prospective architecture is fine? If yes how can I save the streaming data into hive. What I am doing is, creating dataframe of 1 min window data and will save it in hive by using
results.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("stocks")
I have not created the pipeline. Is it fine or I have to modified the architecture?
Thanks
I would give it a try!
BUT kafka->spark->hive is not the optimal pipline for your usecase.
hive is normally based on hdfs which is not designed for small number of inserts/updates/selects.
So your plan can end up in the following problems:
many small files which ends in bad performance
your window gets to small because it takes to long
Suggestion:
option 1:
- use kafka just as buffer queue and design your pipeline like
- kafka->hdfs(e.g. with spark or flume)->batch spark to hive/impala table
Option 2:
kafka->flume/spark to hbase/kudu->batch spark to hive/impala
option 1 has no "realtime" analysis option. It depends on how often you run the batch spark
option2 is a good choice i would recommend, store like 30 days in hbase and all older data in hive/impala. With a view you will be able to join new and old data for realtime analysis.
Kudu makes the architecture even easier.
Saving data into hive tables can be tricky if you like to partition it and use it via HIVEsql.
But basicly it would work like the following:
xml.write.format("parquet").mode("append").saveAsTable("test_ereignis_archiv")
BR
Currently, we have mysql based analytics in place. We read our logs after every 15 mins, process them & add to mysql database.
As our data is growing(In one case, 9 million rows added till now & 0.5 million rows are adding in each month), we are planning to move analytics to no sql database.
As per my study, Hadoop seems to be better fit as we need to process the logs & it can handle very large data set.
However, it would be great if I can get some suggests from experts.
I agree with the other answers and comments. But if you want to evaluate Hadoop option then one solution can be following.
Apache Flume with Avro for log collection, agregation. Flume can ingest data into Hadoop File System (HDFS)
Then you can have Hbase as distributed scalable data store.
with Cloudera Impala on top of hbase you can have a near to real time (streaming) query engine. Impala uses SQL as its query language so it will be beneficial for you.
This is just one option. There can be multiple alternatives e.g. flume + hdfs + hive.
This is probably not a good q. for this forum but I would say that 9 million row and 0.5m per month hardly seems like a good reason to go to noSQL. This is a very small database and your best action would be to scale up the server a little (RAM, more disks, move to SSDs etc.)