I am new to Spark Streaming and Azure Databricks. I have read many articles on how Spark works and processes data. But what about old data? If Spark works on interactive data, can it hold data that is 2 weeks or 2 months old? Or, if I have to move the data after transformation, where should I move it so that I can clear Spark's memory? Will it be stored on SSD only?
Azure Databricks supports several data stores, both as a source and as a target for data at rest. A good practice for big data is to mount an Azure Data Lake Store. If you have a streaming data source (like Kafka or Event Hubs), you can also use it as a sink and, of course, reuse it for further analytics.
See https://docs.azuredatabricks.net/spark/latest/data-sources/index.html for supported data sources.
Related
I'm currently attempting to process telemetry data which has a volume of around 4TB a day using Delta Lake on Azure Databricks.
I have a dedicated Event Hubs cluster where the events are written, and I am attempting to ingest from this event hub into Delta Lake with Databricks Structured Streaming. A relatively simple job takes the event hub output, extracts a few columns, and then writes with a stream writer to ADLS Gen2 storage that is mounted to DBFS, partitioned by date and hour.
Initially, on a clean Delta table directory, the performance keeps up with the event hub, writing around 18k records a second, but after a few hours this drops to 10k a second and then further until it seems to stabilize around 3k records a second.
I tried a few things on the Databricks side with different partition schemes, and the day/hour partitions seemed to perform the best for the longest. Still, after a pause and restart the performance dropped and the job started to lag behind the event hub.
I am looking for suggestions on how I might be able to maintain performance.
I had a similar issue once, and the cause was not Delta Lake but the Spark Azure Event Hubs connector. It was extremely slow and used up a lot of resources.
I solved this problem by switching to the Kafka interface of Azure EventHubs: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
It's a little tricky to set up but it has been working very well for a couple of months now.
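As a rough sketch of what that setup involves, the helper below builds the option map that Spark's Kafka source would need in order to talk to the Kafka endpoint of an Event Hubs namespace. The function name, the namespace `my-namespace`, the hub name `telemetry`, and the connection string are placeholders invented for illustration; the option keys and the `$ConnectionString` username follow the Event Hubs for Kafka documentation linked above, but check them against your runtime version.

```python
def eventhubs_kafka_options(namespace: str, event_hub: str, connection_string: str) -> dict:
    """Build Kafka source options that point at an Event Hubs namespace.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Event Hubs authenticates Kafka clients with SASL PLAIN, using the
    # literal user name "$ConnectionString" and the connection string as password.
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        'username="$ConnectionString" '
        f'password="{connection_string}";'
    )
    return {
        # The Kafka endpoint of Event Hubs listens on port 9093.
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        # The event hub name plays the role of the Kafka topic.
        "subscribe": event_hub,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
    }


# Placeholder values, not real credentials.
options = eventhubs_kafka_options("my-namespace", "telemetry", "Endpoint=sb://...")

# With a SparkSession available, the options would be used roughly like this:
#   stream = spark.readStream.format("kafka").options(**options).load()
```

On Databricks the Kafka client classes are shaded, so the login module may have to be written as `kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule`; treat that as something to verify for your runtime rather than a given.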
I am trying to read data from an AWS RDS system and write it to Snowflake using Spark.
My Spark job makes a JDBC connection to RDS and pulls the data into a DataFrame; I then write the same DataFrame to Snowflake using the Snowflake connector.
Problem statement: When I try to write the data, even 30 GB takes a long time to write.
Solutions I tried:
1) repartition the dataframe before writing.
2) caching the dataframe.
3) taking a count of df before writing to reduce scan time at write.
It may have been a while since this question was asked. If you are preparing the DataFrame, or using another tool to prepare your data for the move to Snowflake, the Python connector integrates very nicely.
The comments above already contain some good general recommendations for troubleshooting the query. Were you able to resolve the JDBC connection issue with the recent updates?
Some other troubleshooting to consider:
Save time by going directly from Spark to Snowflake with the Spark connector: https://docs.snowflake.net/manuals/user-guide/spark-connector.html
For larger data sets, increasing the warehouse size for the session you are using and loading the data in smaller files of 10 MB to 100 MB will generally speed up the load.
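To make the file-size advice concrete, here is a small sketch that estimates how many partitions a DataFrame would need so that each written file lands near a target size. The function name is made up for this example, and the estimate ignores compression and row-size skew, so treat the result as a starting point only.

```python
MB = 1024 * 1024
GB = 1024 * MB


def partitions_for_target_size(total_bytes: int, target_file_bytes: int) -> int:
    """Return how many roughly equal files are needed to stay at or below the target size.

    Hypothetical helper: assumes data is spread evenly across partitions.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division, with a floor of one partition.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


# 30 GB split into files of at most 100 MB each.
n = partitions_for_target_size(30 * GB, 100 * MB)
print(n)  # 308

# With a DataFrame `df` available, this would feed a repartition before the write:
#   df.repartition(n).write ...
```

At the lower end of the range (10 MB files) the same 30 GB would need 3072 partitions, which shows how quickly the file count grows as the target size shrinks.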
Let me know what you think, I would love to hear how you solved it.
Problem statement:
Transfer data from MongoDB to Spark optimally, with minimal latency.
Problem Description:
I have my data stored in MongoDB and want to process it (on the order of ~100-500 GB) using Apache Spark.
I used the MongoDB Spark connector and was able to read/write data from/to MongoDB (https://docs.mongodb.com/spark-connector/master/python-api/)
The problem was having to create the Spark DataFrame on the fly each time.
Is there a solution for handling such huge data transfers?
I looked into:
Spark Streaming API
Apache Kafka
Amazon S3 and EMR
But I couldn't decide whether any of these was the optimal way to do it.
What strategy would you recommend for transferring such data?
Would having the data on the Spark cluster and syncing just the deltas (changes in the database) to the local file be the way to go, or is reading from MongoDB each time the only (or the optimal) way to go about it?
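To illustrate what "syncing just the deltas" would mean in practice, here is a minimal sketch that applies a list of change records to a local snapshot keyed by `_id`. The change-record shape (`op`, `_id`, `doc`) is a simplified format assumed for this example; it is not the exact format MongoDB emits.

```python
def apply_deltas(snapshot: dict, deltas: list) -> dict:
    """Return a new snapshot with inserts, updates, and deletes applied in order.

    Hypothetical format: each delta is {"op": "upsert" | "delete", "_id": ..., "doc": {...}}.
    """
    merged = dict(snapshot)  # leave the caller's snapshot untouched
    for change in deltas:
        key = change["_id"]
        if change["op"] == "delete":
            merged.pop(key, None)  # deleting a missing key is a no-op
        else:
            merged[key] = change["doc"]  # insert or overwrite
    return merged


snapshot = {1: {"name": "a"}, 2: {"name": "b"}}
deltas = [
    {"op": "upsert", "_id": 2, "doc": {"name": "b2"}},
    {"op": "upsert", "_id": 3, "doc": {"name": "c"}},
    {"op": "delete", "_id": 1},
]
print(apply_deltas(snapshot, deltas))  # {2: {'name': 'b2'}, 3: {'name': 'c'}}
```

The trade-off is that the full read happens once and every later run only processes the changes, at the cost of maintaining the snapshot and the change feed.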
EDIT 1:
The following suggests reading the data from MongoDB (thanks to secondary indexes, data retrieval is faster): https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
EDIT 2:
The advantages of using the Parquet format: What are the pros and cons of parquet format compared to other formats?
I am creating an application that receives streaming data, which goes into Kafka and then on to Spark. Spark consumes the data, applies some logic, and then saves the processed data into Hive. The data arrives very quickly: I am getting 50K records per minute. There is a 1-minute window in Spark Streaming in which it processes the data and saves it to Hive.
My question is: from a production perspective, is this architecture fine? If yes, how can I save the streaming data into Hive? What I am doing is creating a DataFrame from each 1-minute window of data and saving it to Hive using
results.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("stocks")
I have not created the pipeline yet. Is it fine, or do I have to modify the architecture?
Thanks
I would give it a try!
BUT Kafka->Spark->Hive is not the optimal pipeline for your use case.
Hive is normally based on HDFS, which is not designed for many small inserts/updates/selects.
So your plan can end up with the following problems:
many small files, which results in bad performance
your window gets too small because processing takes too long
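A quick back-of-the-envelope sketch of the small-files problem: with one append per window, the number of files grows with every batch. The figure of 8 files per batch is purely an assumption for illustration (in practice it depends on how many tasks write in each batch); the 50K records per minute and the 1-minute window come from the question.

```python
def files_per_day(batch_interval_seconds: int, files_per_batch: int) -> int:
    """Number of files written per day if every batch appends files_per_batch files."""
    batches_per_day = 24 * 60 * 60 // batch_interval_seconds
    return batches_per_day * files_per_batch


def avg_records_per_file(records_per_minute: int, batch_interval_seconds: int,
                         files_per_batch: int) -> float:
    """Average record count per file, assuming records spread evenly over the files."""
    records_per_batch = records_per_minute * batch_interval_seconds / 60
    return records_per_batch / files_per_batch


# 1-minute window, assumed 8 output files per batch.
print(files_per_day(60, 8))                # 11520
print(avg_records_per_file(50_000, 60, 8)) # 6250.0
```

Under these assumptions the table gains more than eleven thousand files a day, each holding only a few thousand records, which is the pattern HDFS handles poorly.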
Suggestion:
Option 1:
- use Kafka just as a buffer queue and design your pipeline like
- Kafka->HDFS (e.g. with Spark or Flume)->batch Spark to Hive/Impala table
Option 2:
- Kafka->Flume/Spark to HBase/Kudu->batch Spark to Hive/Impala
Option 1 has no "realtime" analysis option. It depends on how often you run the batch Spark job.
Option 2 is a good choice that I would recommend: store something like 30 days in HBase and all older data in Hive/Impala. With a view you will be able to join new and old data for realtime analysis.
Kudu makes the architecture even easier.
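The hot/cold split described for option 2 can be sketched as a simple routing rule on event time. The 30-day retention comes from the suggestion above; the store labels `"hot"` and `"archive"` and the function name are illustrative stand-ins for HBase/Kudu and Hive/Impala.

```python
from datetime import datetime, timedelta

HOT_RETENTION = timedelta(days=30)  # recent data kept in the fast store


def target_store(event_time: datetime, now: datetime) -> str:
    """Return "hot" for events within the retention window, otherwise "archive"."""
    return "hot" if now - event_time <= HOT_RETENTION else "archive"


now = datetime(2024, 3, 31)
print(target_store(datetime(2024, 3, 15), now))  # hot
print(target_store(datetime(2024, 1, 1), now))   # archive
```

A batch job would apply the same rule periodically to move rows that have aged out of the hot store into the archive, and the view would union both sides.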
Saving data into Hive tables can be tricky if you want to partition it and use it via Hive SQL.
But basically it would work like the following:
xml.write.format("parquet").mode("append").saveAsTable("test_ereignis_archiv")
BR
I am planning to use Amazon Kinesis to capture a stream from lots of sources, and after a certain level of data transformation I want to direct the stream to a Redshift cluster in some table schema. I am not sure whether this is the right way to do it.
From the Kinesis documentation I found that there is a direct connector to Redshift. However, I have also found that Redshift works better with bulk uploads, as a data warehouse system needs indexing. So the recommendation was to store the whole stream in S3 and then use the COPY command to bulk-load it into Redshift. Could someone please add some more perspective?
When you use the connector library for Kinesis you will be pushing data into Redshift, both through S3 and in batch.
It is true that calling INSERT INTO on Redshift is not efficient, as you are sending all the data through a single leader node instead of using the parallel power of Redshift that you get when running COPY from S3.
Since Kinesis is designed to handle thousands of events per second, running a COPY every few seconds or minutes will already batch many thousands of records.
If you want to squeeze the juice from Kinesis and Redshift, you can calculate exactly how many shards you need, how many nodes in Redshift you need, and how many temporary files in S3 you need to accumulate from Kinesis before calling the COPY command to Redshift.
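A rough sketch of that sizing arithmetic follows. It assumes the commonly documented per-shard write limits of 1,000 records per second and 1 MB per second (taken here as 1,000,000 bytes, a conservative reading), and the Redshift guidance that the number of files loaded by COPY should be a multiple of the number of slices in the cluster. The function names and example numbers are made up for illustration; check current service quotas before relying on them.

```python
SHARD_MAX_RECORDS_PER_SEC = 1_000     # assumed per-shard write limit (records)
SHARD_MAX_BYTES_PER_SEC = 1_000_000   # assumed per-shard write limit (bytes)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def shards_needed(records_per_second: int, avg_record_bytes: int) -> int:
    """Shards required to absorb the write load, whichever limit binds first."""
    by_count = ceil_div(records_per_second, SHARD_MAX_RECORDS_PER_SEC)
    by_bytes = ceil_div(records_per_second * avg_record_bytes, SHARD_MAX_BYTES_PER_SEC)
    return max(1, by_count, by_bytes)


def copy_file_count(min_files: int, total_slices: int) -> int:
    """Round the file count up to a multiple of the cluster's slice count."""
    return max(1, ceil_div(min_files, total_slices)) * total_slices


print(shards_needed(5_000, 500))    # 5 (record-count limit binds)
print(shards_needed(500, 10_000))   # 5 (byte limit binds)
print(copy_file_count(10, 4))       # 12
```

Splitting each batch into a multiple of the slice count lets every slice load a similar share of the data during COPY, which is where the parallelism mentioned above comes from.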