Kafka data compression technique - apache-kafka

I loaded data (selective data) from Oracle into Kafka with a replication factor of 1 (so, only one copy), and the data size in Kafka is 1 TB. Kafka stores the data in a compressed format, but I want to know the actual data size in Oracle. Since we loaded only selected tables and data, I am not able to check the actual data size in Oracle. Is there a formula I can apply to estimate the Oracle data size behind this 1 TB loaded into Kafka?
Kafka version - 2.1
Also, it took 4 hours to move the data from Oracle to Kafka. The data size over the wire could be different. How can I estimate the amount of data sent over the wire and the bandwidth consumed?

There is as yet insufficient data for a meaningful answer.
Kafka supports gzip, Snappy, LZ4 and, since 2.1.0, zstd compression, with different compression factors and different saturation thresholds. All of these methods are "learning based", i.e. they consume bytes from a stream, build a dictionary and output bytes that are symbols from the dictionary. As a result, short data streams will not be compressed very much, because the dictionary has not yet learned much. And if the characteristics of the incoming bytes drift away from what the dictionary has learned, the compression ratio again goes down.
This means that the structure of the data can completely change the compression performance.
On the whole, in real-world applications with reasonable data (i.e. not a sparse document-term matrix or a storage system for PDF or Office documents) you can expect an average compression ratio between 1.2x and 2.0x. The larger the data chunks, the higher the compression. The actual content of the "message" also carries great weight, as you can imagine.
Oracle then allocates data in data blocks, which means you get some slack space overhead, but then again it can compress those blocks. Oracle also performs deduplication in some instances.
Therefore, a meaningful and reasonably precise answer would have to depend on several factors that we don't know here.
As a ballpark figure, I'd say that the actual "logical" data behind the 1 TB in Kafka ought to range between 0.7 and 2 TB, and I'd expect the Oracle occupation to be anywhere from 0.9 to 1.2 TB if compression is available on the Oracle side, and 1.2 to 2.4 TB if it is not.
But this is totally a shot in the dark. You could have compressed binary information stored (say, XLSX or JPEG 2000 files or MP3 songs), and those would actually grow in size when compression was applied. Or you might have swaths of sparse matrix data that easily compress 20:1 or more even with the most cursory gzipping. In the first case, the 1 TB might remain more or less 1 TB once decompressed; in the second case, the same 1 TB could just as easily grow to 20 TB or more.
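As a quick illustration of how much the content matters, here is a small sketch using Python's zlib (just a stand-in for the codecs Kafka actually uses) comparing highly repetitive data with already-incompressible random bytes:

    import os
    import zlib

    # Highly repetitive "sparse-matrix-like" payload vs. incompressible random bytes.
    repetitive = b"0,0,0,0,0,0,0,1\n" * 65536     # ~1 MiB of near-identical rows
    random_blob = os.urandom(len(repetitive))     # ~1 MiB of random data

    for name, payload in [("repetitive", repetitive), ("random", random_blob)]:
        compressed = zlib.compress(payload, 6)
        print(f"{name}: {len(payload)} -> {len(compressed)} bytes "
              f"(ratio {len(payload) / len(compressed):.1f}:1)")

The repetitive payload shrinks by a couple of orders of magnitude, while the random payload does not shrink at all; real data sits somewhere in between.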
I am afraid the simplest way to know is to instrument both storage systems and the network, and directly monitor traffic and data usage.
Once you know the parameters of your databases, you can extrapolate them to different storage amounts: if you know that 1 TB in Kafka requires 2.5 TB of network traffic to become 2.1 TB of Oracle tablespace, then it stands to reason that 2 TB in Kafka would require 5 TB of traffic and occupy 4.2 TB on the Oracle side (as in the sketch below)... but, even then, only provided the nature of the data does not change in the interim.
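A minimal extrapolation sketch, assuming you have measured the three quantities once (the "measured" figures below are the example numbers from the previous paragraph, not real measurements):

    measured_kafka_tb = 1.0     # compressed size observed in Kafka
    measured_network_tb = 2.5   # traffic observed on the wire
    measured_oracle_tb = 2.1    # tablespace observed in Oracle

    network_per_kafka = measured_network_tb / measured_kafka_tb   # 2.5x
    oracle_per_kafka = measured_oracle_tb / measured_kafka_tb     # 2.1x

    def estimate(kafka_tb):
        """Scale the measured ratios linearly to a new Kafka size."""
        return kafka_tb * network_per_kafka, kafka_tb * oracle_per_kafka

    net, ora = estimate(2.0)
    print(f"2 TB in Kafka -> ~{net:.1f} TB of traffic, ~{ora:.1f} TB in Oracle")

The linear scaling only holds as long as the mix of data stays roughly the same.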

Related

how large storage do I need for postgres comparing to data size

I did some analysis using some sample data and found that the table size is usually about twice the raw data size (by importing a CSV file into a Postgres table, taking the CSV file size as the raw data size).
And the disk space used seems to be about 4 times the raw data, most likely because of the WAL log.
Is there any commonly used formula to estimate how much disk space I need if I want to store, say, 1 GB of data?
I know there are many factors affecting this; I just would like a quick estimate.
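A back-of-envelope sketch using only the multipliers measured above (these are the question's own sample observations, not general Postgres constants):

    TABLE_FACTOR = 2.0   # table size ~2x the raw CSV size (observed)
    DISK_FACTOR = 4.0    # total disk ~4x the raw CSV size, WAL included (observed)

    def estimate_footprint_gb(raw_gb):
        return {"table_gb": raw_gb * TABLE_FACTOR, "disk_gb": raw_gb * DISK_FACTOR}

    print(estimate_footprint_gb(1.0))   # for 1 GB raw: ~2 GB table, ~4 GB on disk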

How should I store data in a recommendation engine?

I am developing a recommendation engine. I think I can’t keep the whole similarity matrix in memory.
I calculated the similarities for 10,000 items, which is over 40 million float values. I stored them in a binary file and it comes to 160 MB.
Wow!
The problem is that I could have nearly 200,000 items.
Even if I cluster them into several groups and create a similarity matrix for each group, I still have to load them into memory at some point.
But that will consume a lot of memory.
So, is there any way to deal with this data?
How should I store it and load it into memory while ensuring my engine responds reasonably fast to an input?
You could use memory mapping to access your data. This way you can view your data on disk as one big memory area (and access it just as you would access memory), with the difference that only the pages where you read or write data are (temporarily) loaded into memory.
If you can group the data somewhat, only smaller portions would have to be read into memory while accessing the data.
As for the floats, if you can do with less resolution and store the values in, say, 16-bit integers, that would also halve the size.
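A minimal sketch of both ideas combined, assuming the similarities live in [0, 1] and using NumPy's memmap (the file name, matrix size and scaling are illustrative):

    import numpy as np

    N = 5_000   # smaller stand-in for the 10,000 items in the question

    # One-time conversion: quantize similarities in [0, 1] to 16-bit integers
    # (2 bytes each instead of 4), written to a memory-mapped file on disk.
    sims = np.random.rand(N, N).astype(np.float32)            # stand-in for real similarities
    store = np.memmap("sims_int16.dat", dtype=np.int16, mode="w+", shape=(N, N))
    store[:] = np.round(sims * 32767).astype(np.int16)
    store.flush()
    del store

    # At query time: open read-only; the OS pages in only the rows you touch.
    lookup = np.memmap("sims_int16.dat", dtype=np.int16, mode="r", shape=(N, N))
    item_id = 42
    row = lookup[item_id].astype(np.float32) / 32767.0        # one item's similarities
    top10 = np.argsort(row)[::-1][:10]                        # indices of the 10 most similar items
    print(top10)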

Datalab kernel crashes because of data set size. Is load balancing an option?

I am currently running the virtual machine with the most memory, n1-highmem-32 (32 vCPUs, 208 GB memory).
My data set is around 90 GB, but has the potential to grow in the future.
The data is stored in many zipped CSV files. I am loading the data into a sparse matrix in order to perform some dimensionality reduction and clustering.
The Datalab kernel runs on a single machine. Since you are already running on a 208GB RAM machine, you may have to switch to a distributed system to analyze the data.
If the operations you are doing on the data can be expressed as SQL, I'd suggest loading the data into BigQuery, which Datalab has a lot of support for. Otherwise you may want to convert your processing pipeline to use Dataflow (which has a Python SDK). Depending on the complexity of your operations, either of these may be difficult, though.
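If the workload does fit SQL, here is a minimal sketch of the BigQuery route, using the standard google-cloud-bigquery client rather than Datalab's own helpers (the project, dataset, table and column names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()   # picks up the project configured in the environment

    # Hypothetical table/columns: assumes the zipped CSVs were already loaded
    # into BigQuery (e.g. with `bq load`).
    query = """
        SELECT feature_id, COUNT(*) AS n, AVG(value) AS mean_value
        FROM `my_project.my_dataset.events`
        GROUP BY feature_id
    """

    # The aggregation runs inside BigQuery; only the much smaller result set
    # comes back into the notebook as a DataFrame.
    df = client.query(query).to_dataframe()
    print(df.head())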

Estimating Redshift Table Size

I am trying to create an estimate of how much space a table in Redshift is going to use; however, the only resource I found was about calculating the minimum table size:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
The purpose of this estimate is to calculate how much space a table with the following dimensions is going to occupy without running out of space on Redshift (i.e. it will define how many nodes we end up using):
Rows : ~500 Billion (The exact number of rows is known)
Columns: 15 (The data types are known)
Any help in estimating this size would be greatly appreciated.
Thanks!
The article you reference (Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?) does an excellent job of explaining how storage is consumed.
The main difficulty in predicting storage is predicting the efficiency of compression. Depending upon your data, Amazon Redshift will select an appropriate Compression Encoding that will reduce the storage space required by your data.
Compression also greatly improves the speed of Amazon Redshift queries by using Zone Maps, which identify the minimum and maximum values stored in each 1 MB block. Highly compressed data will be stored in fewer blocks, thereby requiring fewer blocks to be read from disk during query execution.
The best way to estimate your storage space would be to load a subset of the data (e.g. 1 billion rows), allow Redshift to automatically select the compression types, and then extrapolate to your full data size.
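A minimal sketch of that extrapolation, assuming you have loaded a ~1-billion-row sample and read its compressed size from Redshift's SVV_TABLE_INFO view (the sample size used below is a placeholder, not a measurement):

    # Size of the sample as reported by Redshift, e.g. via:
    #   SELECT "table", size, tbl_rows FROM svv_table_info WHERE "table" = 'my_table';
    # `size` is reported in 1 MB blocks. The figure below is a placeholder.
    sample_rows = 1_000_000_000
    sample_size_mb = 25_000

    target_rows = 500_000_000_000        # ~500 billion rows in the full table

    estimated_mb = sample_size_mb * (target_rows / sample_rows)
    print(f"Estimated full table size: ~{estimated_mb / 1_000_000:.1f} TB")

Because blocks are allocated per column and per slice, a very small sample can overstate the per-row cost, which is another reason to use a reasonably large sample before extrapolating.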

How much data per node in Cassandra cluster?

What are the boundaries of SSTable compaction (major and minor), and when does it become ineffective?
If a major compaction merges a couple of 500 GB SSTables and the resulting SSTable is over 1 TB, will it be effective for one node to "rewrite" this big dataset?
This can take about a day on an HDD and needs double the disk space, so are there best practices for this?
1 TB is a reasonable limit on how much data a single node can handle, but in reality, a node is not at all limited by the size of the data, only the rate of operations.
A node might have only 80 GB of data on it, but if you absolutely pound it with random reads and it doesn't have a lot of RAM, it might not even be able to handle that number of requests at a reasonable rate. Similarly, a node might have 10 TB of data, but if you rarely read from it, or you have a small portion of your data that is hot (so that it can be effectively cached), it will do just fine.
Compaction certainly is an issue to be aware of when you have a large amount of data on one node, but there are a few things to keep in mind:
First, the "biggest" compactions, ones where the result is a single huge SSTable, happen rarely, even more so as the amount of data on your node increases. (The number of minor compactions that must occur before a top-level compaction occurs grows exponentially by the number of top-level compactions you've already performed.)
Second, your node will still be able to handle requests, reads will just be slower.
Third, if your replication factor is above 1 and you aren't reading at consistency level ALL, other replicas will be able to respond quickly to read requests, so you shouldn't see a large difference in latency from a client perspective.
Last, there are plans to improve the compaction strategy that may help with some larger data sets.
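As a rough illustration of the first point, here is a toy size-tiered model (the 64 MB flush size is an assumption; 4 is the usual STCS merge threshold) showing how many flushes lie behind each ever-larger SSTable:

    # Toy size-tiered compaction model: each compaction merges `threshold`
    # SSTables of one tier into a single SSTable of the next tier.
    flush_mb = 64            # assumed size of a freshly flushed SSTable
    threshold = 4            # SSTables merged per compaction (STCS-style)

    size_mb, tier = flush_mb, 0
    while size_mb <= 1_048_576:                       # stop past an SSTable of ~1 TB
        flushes = threshold ** tier                   # flushes behind one SSTable of this tier
        print(f"tier {tier}: SSTable ~{size_mb / 1024:.1f} GB, built from ~{flushes} flushes")
        size_mb *= threshold
        tier += 1

Each merge needs roughly the combined input size free on disk, which is why compacting around 1 TB of SSTables requires about that much headroom.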