Delta Lake OSS table large volume due to versioning - spark-structured-streaming

I have set up a Spark standalone cluster and use Spark Structured Streaming to write data from Kafka to multiple Delta Lake tables - simply stored in the file system. So there are multiple writes per second.
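The write path per table looks roughly like the following sketch (simplified; the Kafka options, topic name and column handling here are placeholders rather than my exact job). Every micro-batch commits a new version of the table.

# Simplified sketch of one Kafka-to-Delta writer (Kafka options and topic are placeholders)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("kafka-to-delta-bronze")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")      # placeholder
    .option("subscribe", "algod_indexer_public_txn")      # placeholder topic
    .load()
)

query = (
    raw.select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
    .writeStream.format("delta")
    .option("checkpointLocation", "/mnt/delta/checkpoints/algod_indexer_public_txn_flat")
    .outputMode("append")
    .start("/mnt/delta/bronze/algod_indexer_public_txn_flat")
)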
After running the pipeline for a while, I noticed that the tables require a large amount of storage on disk. Some tables require 10x the storage of their sources.
I investigated the Delta Lake table versioning. When I run DESCRIBE DETAIL on a selected table, it reports a sizeInBytes of around 10 GB, although the corresponding folder on disk takes up over 100 GB.
DESCRIBE DETAIL delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
So I set the following properties:
ALTER TABLE delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 24 hours', 'delta.deletedFileRetentionDuration'='interval 1 hours')
and then performed a VACUUM:
VACUUM delta.`/mnt/delta/bronze/algod_indexer_public_txn_flat`
But even after several days, the size on disk stays at around 100 GB, although I am constantly running VACUUM.
How can I overcome this issue?
Spark 3.2.1
Delta Lake 1.2.0
Thanks in advance!
UPDATE:
It looks like VACUUM only actually cleans up old versions once I manually delete the JSON logs in _delta_log, so I assume some of the properties are not being applied?
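For reference, the same maintenance steps through the Delta Lake Python API, plus a check that the table properties were actually committed (a sketch; the retention-check override is only an assumption about what may be needed when vacuuming below the 7-day default):

# Sketch: verify the table properties and vacuum through the Python API
from delta.tables import DeltaTable

path = "/mnt/delta/bronze/algod_indexer_public_txn_flat"

# Check whether the ALTER TABLE properties were actually committed to the log
spark.sql(f"DESCRIBE DETAIL delta.`{path}`").select("properties").show(truncate=False)

# Only needed when asking for a retention below the default 7 days
# and delta.deletedFileRetentionDuration has not taken effect
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Retain 1 hour of history, matching delta.deletedFileRetentionDuration above
DeltaTable.forPath(spark, path).vacuum(1)

As far as I understand, VACUUM only removes unreferenced data files; the JSON commit files in _delta_log are cleaned up separately when a checkpoint is written and delta.logRetentionDuration has expired.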

Related

Debezium postgres incremental snapshot performance issues

I am trying to use Debezium incremental snapshots with the latest Debezium (1.7) and Postgres (v13). For testing, I populated a table with 1M rows; each row is 4 KB, with a UUID primary key and 20 varchar columns. Since I just wanted to measure snapshot performance, the table data does not change for the duration of the test.
It seems that incremental snapshots are an order of magnitude slower than regular snapshots. For example, in my testing I observed around 10,000 change events per second with the vanilla snapshot, whereas I observed around 500 change events per second with incremental snapshots.
I tried increasing the incremental.snapshot.chunk.size to 10,000 but I didn't see much effect on the performance.
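For reference, the relevant part of my connector registration looks roughly like this sketch (posted to the Kafka Connect REST API; host names, credentials and the signal table are placeholders):

# Sketch: register the Postgres connector with a larger incremental snapshot
# chunk size via the Kafka Connect REST API (names/credentials are placeholders)
import json
import requests

config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",                       # placeholder
    "database.port": "5432",
    "database.user": "debezium",                           # placeholder
    "database.password": "***",
    "database.dbname": "testdb",                           # placeholder
    "database.server.name": "pg13",                        # placeholder
    "table.include.list": "public.snapshot_test",          # placeholder test table
    "signal.data.collection": "public.debezium_signal",    # required for incremental snapshots
    "incremental.snapshot.chunk.size": "10000",            # the setting I increased
}

resp = requests.put(
    "http://localhost:8083/connectors/pg-incremental-test/config",  # placeholder Connect host
    headers={"Content-Type": "application/json"},
    data=json.dumps(config),
)
resp.raise_for_status()

# The incremental snapshot itself is then triggered by inserting an
# 'execute-snapshot' row into the signal table.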
I just wanted to confirm whether this is a known/expected issue or am I doing something wrong?
Thanks

Amazon RDS PostgreSQL: Sudden increase in Read IOPS

We are using Amazon RDS to host our PostgreSQL databases. Our production instance (db.t3.xlarge, Single-AZ) was running smoothly until the Read IOPS, Read Latency, Read Throughput and Disk Queue Depth metrics in the AWS console suddenly increased rapidly and stayed high afterward (with lower variability), whereas Write IOPS and Write Throughput remained normal.
[CloudWatch graphs: Read IOPS, Read Throughput, Disk Queue Depth, Write IOPS]
There were no code changes or deployments on the date of the increase. There were no significant increases in user activity either.
About our DB structure: we have a single table that holds all of our data, with these fields: id as UUID (primary key), type as VARCHAR, data as JSONB (holds the actual data), and createdAt and updatedAt as timestamp with time zone. Most of our data values are larger than 2 KB, so most rows are stored in the TOAST table. We have 20 BTREE indexes created on frequently used fields inside the JSONB column.
So far we have tried VACUUM ANALYZE and also completely rebuilding our table: creating a new table, copying all data from the old table, and recreating all indexes. Neither changed the behavior.
We also tried increasing storage, which also increases IOPS performance. It helped a bit, but it is still not the same as before.
What could be the root cause of this problem? How can we fix it permanently (without increasing storage or instance type)? For now, we are looking for easy changes and we will improve our data model in the future.
T3 instances are not suitable for production. Try moving to another family, like a C or M type. You may have hit some burst limits that are now causing odd behaviour.
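If you want to confirm that, the burst-related metrics can be pulled from CloudWatch, for example with a sketch like this (boto3; the instance identifier is a placeholder):

# Sketch: look for a drop in gp2 I/O burst balance and T3 CPU credits around
# the time of the incident (instance identifier is a placeholder)
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")

def rds_metric(name):
    datapoints = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-prod-db"}],  # placeholder
        StartTime=datetime.utcnow() - timedelta(days=14),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=["Minimum"],
    )["Datapoints"]
    return sorted(datapoints, key=lambda p: p["Timestamp"])

for metric in ("BurstBalance", "CPUCreditBalance"):
    print(metric, [round(p["Minimum"], 1) for p in rds_metric(metric)])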

Databricks Delta Lake Structured Streaming Performance with event hubs and ADLS g2

I'm currently attempting to process telemetry data which has a volume of around 4TB a day using Delta Lake on Azure Databricks.
I have a dedicated Event Hubs cluster where the events are written to, and I am attempting to ingest this event hub into Delta Lake with Databricks Structured Streaming. There is a relatively simple job that takes the event hub output, extracts a few columns, and then writes with a stream writer to ADLS Gen2 storage mounted to DBFS, partitioned by date and hour.
Initially, on a clean Delta table directory, performance keeps up with the event hub, writing around 18k records a second, but after a few hours this drops to 10k a second and then further, until it seems to stabilize around 3k records a second.
I tried a few things on the Databricks side with different partition schemes, and the date/hour partitions performed best for the longest, but even then, after a pause and restart the performance dropped and started to lag behind the event hub.
I'm looking for suggestions on how I might be able to maintain performance.
I had a similar issue once, and it was not Delta Lake but the Spark Azure Event Hubs connector. It was extremely slow and used up a lot of resources.
I solved this problem by switching to the Kafka interface of Azure EventHubs: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
It's a little tricky to set up but it has been working very well for a couple of months now.
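A rough sketch of what the Kafka-endpoint reader can look like in a Databricks notebook (namespace, event hub name, secret scope and paths are placeholders; on Databricks the Kafka client is shaded, hence the kafkashaded prefix in the JAAS config):

# Sketch: read Event Hubs through its Kafka endpoint and write to a Delta table
# on ADLS Gen2, partitioned by date and hour (all names/paths are placeholders)
from pyspark.sql.functions import col, to_date, hour

EH_NAMESPACE = "my-namespace"                                   # placeholder
EH_NAME = "telemetry"                                           # placeholder
EH_CONN_STR = dbutils.secrets.get("my-scope", "eh-conn-str")    # placeholder secret

jaas = (
    "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="$ConnectionString" password="{EH_CONN_STR}";'
)

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", f"{EH_NAMESPACE}.servicebus.windows.net:9093")
    .option("subscribe", EH_NAME)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", jaas)
    .load()
    .selectExpr("cast(value as string) as body", "timestamp")
    .withColumn("date", to_date(col("timestamp")))
    .withColumn("hour", hour(col("timestamp")))
)

(
    events.writeStream.format("delta")
    .partitionBy("date", "hour")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/telemetry")  # placeholder
    .start("/mnt/datalake/bronze/telemetry")                              # placeholder
)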

Can I use Amazon Kinesis to connect to Amazon Redshift for data loads every couple of minutes?

I am planning to use Amazon Kinesis to capture streams from many sources and, after a certain level of data transformation, direct the stream into a Redshift cluster under some table schema. I am not sure whether this is the right way to do it or not.
From the Kinesis documentation I have found that they have a direct connector to Redshift. However, I have also found that Redshift works better with bulk uploads, as a data warehouse system needs indexing. So the recommendation was to store the whole stream in S3 and then use the COPY command to bulk load it into Redshift. Could someone please add some more perspective?
When you use the connector library for Kinesis, you will already be pushing data into Redshift both through S3 and in batches.
It is true that calling INSERT INTO Redshift is not efficient, as you send all the data through a single leader node instead of using the parallelism of Redshift that you get when running COPY from S3.
Since Kinesis is designed to handle thousands of events per second, running a COPY every few seconds or minutes will already batch many thousands of records.
If you want to squeeze the most out of Kinesis and Redshift, you can calculate exactly how many shards you need, how many Redshift nodes you need, and how many temporary files in S3 you need to accumulate from Kinesis before calling the COPY command to Redshift.
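To make the pattern concrete, here is a minimal hand-rolled sketch of the accumulate-in-S3-then-COPY approach (this is not the Kinesis connector library itself; stream, bucket, table, cluster and IAM role names are placeholders):

# Sketch: micro-batch Kinesis records into S3, then bulk load into Redshift
# with COPY (all names are placeholders)
import time
import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")
redshift = boto3.client("redshift-data")

shard_it = kinesis.get_shard_iterator(
    StreamName="events-stream", ShardId="shardId-000000000000",
    ShardIteratorType="LATEST")["ShardIterator"]

batch, batch_no = [], 0
while True:
    out = kinesis.get_records(ShardIterator=shard_it, Limit=1000)
    shard_it = out["NextShardIterator"]
    batch.extend(record["Data"].decode() for record in out["Records"])

    if len(batch) >= 10_000:  # accumulate many records before each COPY
        key = f"staging/batch-{batch_no}.json"
        s3.put_object(Bucket="my-staging-bucket", Key=key,
                      Body="\n".join(batch).encode())
        redshift.execute_statement(
            ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser",  # placeholders
            Sql=f"COPY events FROM 's3://my-staging-bucket/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "     # placeholder role
                "FORMAT AS JSON 'auto';",
        )
        batch, batch_no = [], batch_no + 1
    time.sleep(1)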

Analytics implementation in hadoop

Currently, we have MySQL-based analytics in place. We read our logs every 15 minutes, process them, and add them to a MySQL database.
As our data is growing (in one case, 9 million rows added so far and 0.5 million rows being added each month), we are planning to move analytics to a NoSQL database.
From my study, Hadoop seems to be a better fit, as we need to process the logs and it can handle very large data sets.
However, it would be great if I could get some suggestions from experts.
I agree with the other answers and comments, but if you want to evaluate the Hadoop option, then one solution could be the following.
Apache Flume with Avro for log collection and aggregation. Flume can ingest data into the Hadoop Distributed File System (HDFS).
Then you can use HBase as a distributed, scalable data store.
With Cloudera Impala on top of HBase you get a near-real-time (streaming) query engine. Impala uses SQL as its query language, so it will be beneficial for you.
This is just one option. There are multiple alternatives, e.g. Flume + HDFS + Hive.
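To illustrate the Impala part, querying the ingested logs with plain SQL could look roughly like this (impyla client; host and table names are placeholders):

# Sketch: near-real-time aggregation over the ingested logs through Impala
# (host and table names are placeholders)
from impala.dbapi import connect

conn = connect(host="impala-daemon-host", port=21050)   # placeholder host
cur = conn.cursor()

# Aggregate the last 15 minutes of logs, i.e. the same window as the
# current MySQL-based job
cur.execute("""
    SELECT status, COUNT(*) AS hits
    FROM logs
    WHERE event_time >= now() - INTERVAL 15 MINUTES
    GROUP BY status
""")
for status, hits in cur.fetchall():
    print(status, hits)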
This is probably not a good question for this forum, but I would say that 9 million rows plus 0.5 million per month hardly seems like a good reason to go to NoSQL. This is a very small database, and your best action would be to scale up the server a little (RAM, more disks, move to SSDs, etc.).