Databricks Delta Lake Structured Streaming performance with Event Hubs and ADLS Gen2

I'm currently attempting to process telemetry data, around 4 TB a day, using Delta Lake on Azure Databricks.
I have a dedicated Event Hubs cluster where the events are written, and I am ingesting that event hub into Delta Lake with Databricks Structured Streaming. There's a relatively simple job that takes the Event Hubs output, extracts a few columns, and then writes with a stream writer to ADLS Gen2 storage that is mounted to DBFS, partitioned by date and hour.
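Roughly, the job has this shape (a minimal sketch, not the actual code; the connection string, column names, and paths are placeholders, and it assumes a Databricks notebook where spark, sc, and dbutils already exist):

# Sketch of the ingest job described above; connection string, column
# names, and paths are placeholders, not the real ones.
from pyspark.sql.functions import col, to_date, hour

ehConf = {
    # Newer versions of the azure-event-hubs-spark connector expect the
    # connection string to be encrypted like this.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("<connection-string>")
}

raw = spark.readStream.format("eventhubs").options(**ehConf).load()

parsed = (raw
    .select(col("enqueuedTime").alias("ts"), col("body").cast("string").alias("body"))
    .withColumn("date", to_date("ts"))
    .withColumn("hour", hour("ts")))

(parsed.writeStream
    .format("delta")
    .partitionBy("date", "hour")
    .option("checkpointLocation", "/mnt/lake/_checkpoints/telemetry")
    .start("/mnt/lake/delta/telemetry"))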
Initially, on a clean Delta table directory, the performance keeps up with the event hub at around 18k records a second, but after a few hours this drops to 10k a second and then falls further until it seems to stabilize around 3k records a second.
I've tried a few things on the Databricks side with different partition schemes, and the day/hour partitions performed best for the longest, but even then, after a pause and restart, the performance dropped and started to lag behind the event hub.
I'm looking for suggestions on how I might be able to maintain performance.

I had a similar issue once, and it was not Delta Lake but the Spark Azure Event Hubs connector. It was extremely slow and used up a lot of resources.
I solved this problem by switching to the Kafka interface of Azure EventHubs: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
It's a little tricky to set up but it has been working very well for a couple of months now.
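For what it's worth, the Kafka-surface read looks roughly like this (a sketch; the namespace, event hub name, and secret scope are placeholders, and on OSS Spark the JAAS class loses the kafkashaded. prefix):

# Reading the same event hub through its Kafka endpoint instead of the
# native connector. Namespace, event hub (topic) name, and secret scope
# are placeholders.
connection_string = dbutils.secrets.get("my-scope", "eh-connection-string")

kafka_options = {
    "kafka.bootstrap.servers": "<NAMESPACE>.servicebus.windows.net:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    # Databricks ships a shaded Kafka client, hence the kafkashaded. prefix.
    "kafka.sasl.jaas.config":
        'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="$ConnectionString" password="{}";'.format(connection_string),
    "subscribe": "<event-hub-name>",
    "startingOffsets": "latest",
}

df = spark.readStream.format("kafka").options(**kafka_options).load()
# The event body arrives in the Kafka 'value' column as bytes.
events = df.selectExpr("CAST(value AS STRING) AS body", "timestamp")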

Related

AWS DMS real time replication CDC from Oracle and PostgreSQL to Kinesis on a single thread is taking a lot of time

The goal:
Real-time CDC from Oracle and PostgreSQL to Kinesis on a single thread/process, with little time lag and no dropped records.
The system:
We have a system where we are doing real-time CDC from Oracle and PostgreSQL to Kinesis using AWS DMS.
The problem with doing real-time CDC with only one thread is that it takes many hours to replicate the changes to Kinesis when the data grows large (MBs).
Alternate approach:
The approach we took was to pull the real-time changes from Oracle and PostgreSQL using multiple threads and push them to Kinesis while still using DMS.
The challenge:
We noticed that while pulling data in real time using multiple threads, some records from Oracle and PostgreSQL are dropped. This happens for roughly 1 in 3 million records.
We tried different solutions on the Oracle and PostgreSQL side and talked to AWS, but nothing works.
Notes:
We are using LogMiner or Binary Reader on the Oracle and PostgreSQL side.
Is there a solution to this, or has anybody tried to build this kind of system? Please let me know.
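For illustration only, the multi-threaded push to Kinesis could look something like the sketch below; change_batches(), the stream name, and the partition-key choice are placeholders, not part of the actual system. One thing worth ruling out for the 1-in-3-million drops is unretried partial failures from PutRecords, which reports per-record errors in its response.

# Hypothetical sketch of pushing CDC changes to Kinesis from worker threads;
# change_batches() and the stream name are placeholders.
import json
import boto3
from concurrent.futures import ThreadPoolExecutor

kinesis = boto3.client("kinesis", region_name="us-east-1")

def push_batch(records):
    # PutRecords takes up to 500 records per call; individual records can
    # fail and must be retried, otherwise they are silently lost.
    entries = [{
        "Data": json.dumps(r).encode("utf-8"),
        "PartitionKey": str(r["primary_key"]),  # keeps one row's changes on one shard, in order
    } for r in records]
    resp = kinesis.put_records(StreamName="cdc-stream", Records=entries)
    failed = [e for e, r in zip(entries, resp["Records"]) if "ErrorCode" in r]
    if failed:
        kinesis.put_records(StreamName="cdc-stream", Records=failed)  # naive single retry

with ThreadPoolExecutor(max_workers=8) as pool:
    for batch in change_batches():  # placeholder generator of CDC change batches
        pool.submit(push_batch, batch)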

Delta Lake OSS table large volume due to versioning

I have set up a Spark standalone cluster and use Spark Structured Streaming to write data from Kafka to multiple Delta Lake tables, stored simply in the file system. So there are multiple writes per second.
After running the pipeline for a while, I noticed that the tables require a large amount of storage on disk. Some tables require 10x the storage of their sources.
I investigated Delta Lake table versioning. When I DESCRIBE a selected table, it states that sizeInBytes is actually around 10 GB, although the corresponding folder on disk takes over 100 GB.
DESCRIBE DETAIL delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
So I set the following properties:
ALTER TABLE delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 24 hours', 'delta.deletedFileRetentionDuration'='interval 1 hours')
and then performed a VACUUM:
VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
But still, after several days, the size on disk stays at around 100 GB, although I am constantly performing a VACUUM.
How can I overcome this issue?
Spark 3.2.1
Delta Lake 1.2.0
Thanks in advance!
UPDATE:
Looks like when I manually delete the JSON logs in _delta_log, the VACUUM actually cleans old versions, so I assume some of the properties are not being applied?
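For what it's worth, one way to check whether the properties actually took effect and to force a shorter retention window (a sketch; note that logRetentionDuration only trims _delta_log at checkpoint time, while deletedFileRetentionDuration is what VACUUM uses):

# Verify the table properties, then vacuum with an explicit retention.
path = "/mnt/delta/bronze/algod_indexer_public_txn_flat"

# DESCRIBE DETAIL exposes the applied table properties in the 'properties' column.
spark.sql(f"DESCRIBE DETAIL delta.`{path}`").select("properties").show(truncate=False)

# Delta refuses retention below 7 days unless this safety check is disabled.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

spark.sql(f"VACUUM delta.`{path}` RETAIN 1 HOURS")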

Transfer data from Kinesis (or S3) to RDS Postgres cron job

I'm pretty new to AWS, and I'm trying to find a way to reliably transfer data from a Kinesis stream to an AWS RDS Postgres database table. The records will need to undergo small transformations on the way in, like filtering (not all records will be inserted, depending on a key) and parsing for insertion into Postgres. Currently, the data from the Kinesis stream is being dumped by Firehose into S3 buckets as Parquet.
I'm a bit lost in the many possible ways there seems to be of doing this, like maybe:
Kinesis streams -> Firehose -> Lambda -> RDS
Kinesis streams -> Firehose -> S3 -> Data Pipeline ETL job -> RDS
Database migration for S3 -> RDS?
AWS Glue?
others...?
In a non-serverless world, I would run a cron job every, say, one hour that would take the files in the most recent S3 bucket partition (which is year/month/day/hour), so the latest hour, filter out the records not needed in RDS, and bulk insert the rest into RDS. I don't want an EC2 instance that sits idle 95% of the time to do this. Any advice?
Thanks for the clarification. Doing it the traditional ETL way with servers has some drawbacks: either you need to keep a machine idle most of the time, or you need to wait every time for the machine to be created on demand - exactly as you're saying.
As for Firehose, IMO it's interesting when you have a lot of real-time data to ingest. Regarding AWS Glue, to me it's more like "managed" Apache Spark, so if you have data processing logic to run over a large amount of batch data, it can be interesting. But according to your description, that's not the case, right?
To sum up, if you think the amount of inserted data will always stay at a few MB at a time, for me the simplest solution is the best, i.e. Kinesis -> Lambda -> RDS, with maybe another Lambda to back up data to S3 (Kinesis retention is limited to 7 days). It's especially interesting from the pricing point of view - since you apparently don't have a lot of data and Lambda is executed on demand, for instance by batching 1000 Kinesis records, it's a good occasion to save some money. Otherwise, if you expect to have more and more data, the "Firehose -> Lambda" version seems a better fit to me because you don't load the database with a large amount of data at once.
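A rough shape of the Lambda in the Kinesis -> Lambda -> RDS option (a sketch; the filter condition, table, and connection settings are placeholders, and psycopg2 has to be packaged with the function or in a layer):

# Hypothetical Lambda handler for the Kinesis -> Lambda -> RDS option.
import base64
import json
import os
import psycopg2

conn = psycopg2.connect(os.environ["PG_DSN"])  # reused across warm invocations

def handler(event, context):
    rows = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("key") != "wanted":  # drop records not needed in RDS
            continue
        rows.append((payload["id"], payload["value"]))
    if rows:
        with conn.cursor() as cur:
            cur.executemany("INSERT INTO telemetry (id, value) VALUES (%s, %s)", rows)
        conn.commit()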

How to store old streaming data in Databricks Spark?

I am new to Spark Streaming and Azure Databricks. I have read many articles on how Spark works and processes data, etc. But what about old data? If Spark works on data interactively (in memory), can it hold my data that is 2 weeks or 2 months old? Or, if I have to move the data somewhere after transformation, where should I move it and how do I clear Spark's memory? Will it be stored on SSD only?
Azure Databricks supports several data stores (as a source and as a target for data at rest). A good practice for big data is to mount an Azure Data Lake Store. If you have a streaming data source (like Kafka or Event Hubs), you can use such a store as the sink and of course reuse it for further analytics.
See https://docs.azuredatabricks.net/spark/latest/data-sources/index.html for supported data sources.
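A minimal sketch of that pattern, assuming an already-mounted data lake path and a streaming DataFrame called events (both placeholders): the stream lands on storage rather than staying in Spark memory, and older data is read back later as an ordinary batch job.

# 'events' is a placeholder streaming DataFrame; the paths are placeholders too.
(events.writeStream
    .format("parquet")
    .option("checkpointLocation", "/mnt/datalake/_checkpoints/events")
    .option("path", "/mnt/datalake/raw/events")
    .start())

# Weeks- or months-old data is then just a normal batch read from storage;
# nothing has to be held in Spark memory between jobs.
history = spark.read.parquet("/mnt/datalake/raw/events")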

Can I use Amazon Kinesis to connect to Amazon Redshift for data loads every couple of minutes?

From lots of sources, I am planning to use Amazon Kinesis to catch the stream, and after a certain level of data transformation I want to direct the stream to a Redshift cluster into some table schema. I am not sure whether this is the right way to do it or not.
From the Kinesis documentation I have found that there is a direct connector to Redshift. However, I have also found that Redshift works better with bulk uploads, since a data warehouse system needs indexing. So the recommendation was to store the whole stream in S3 and then use the COPY command to bulk-push it to Redshift. Could someone please add some more views?
When you use the connector library for Kinesis, you will be pushing data into Redshift through S3, in batches.
It is true that running INSERT INTO against Redshift is not efficient, as you are sending all the data through a single leader node instead of using the parallel power of Redshift that you get when running COPY from S3.
Since Kinesis is designed to handle thousands of events per second, running a COPY every few seconds or minutes will already batch many thousands of records.
If you want to squeeze the juice out of Kinesis and Redshift, you can calculate exactly how many shards you need, how many Redshift nodes you need, and how many temporary files you need to accumulate in S3 from Kinesis before calling the COPY command to Redshift.
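The accumulate-then-COPY step itself boils down to something like this (a sketch; the bucket, prefix, table, IAM role, and cluster endpoint are placeholders):

# Hypothetical batch-load step: stream output has been staged to S3, and a
# single COPY pulls it into Redshift in parallel across the slices.
import psycopg2

copy_sql = """
    COPY analytics.events
    FROM 's3://my-bucket/staging/2019/06/01/12/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    FORMAT AS JSON 'auto';
"""

with psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                      port=5439, dbname="dw", user="loader",
                      password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(copy_sql)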