How much storage do I need for Postgres compared to the data size? - postgresql

I did some analysis using some sample data and found that the table size is usually about twice the raw data size (I imported a CSV file into a Postgres table and took the CSV file size as the raw data size).
The disk space used seems to be about four times the raw data size, most likely because of the WAL.
Is there any commonly used formula to estimate how much disk space I need if I want to store, say, 1 GB of data?
I know there are many factors affecting this; I just want a quick estimate.
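For a quick estimate, here is a back-of-the-envelope sketch in Python using the multipliers measured above (the 2x and 4x factors are just what this experiment showed, not a general PostgreSQL rule; re-measure them on your own schema):

def estimate_postgres_disk(raw_csv_bytes, table_factor=2.0, peak_factor=4.0):
    # table_factor: observed ratio of table size to raw CSV size
    # peak_factor: observed ratio of total disk usage (table + indexes + WAL) to raw CSV size
    return raw_csv_bytes * table_factor, raw_csv_bytes * peak_factor

table_size, peak_disk = estimate_postgres_disk(1 * 1024**3)  # 1 GiB of raw CSV
print(table_size / 1024**3, peak_disk / 1024**3)  # ~2.0 GiB table, ~4.0 GiB peak disk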

Related

Kafka data compression technique

I loaded selective data from Oracle into Kafka with a replication factor of 1 (so only one copy), and the data size in Kafka is 1 TB. Kafka stores the data in a compressed format, but I want to know the actual data size in Oracle. Since we loaded only selected tables and data, I am not able to check the actual data size in Oracle. Is there a formula I can apply to estimate the data size in Oracle for this 1 TB of data loaded into Kafka?
Kafka version - 2.1
Also, it took 4 hours to move the data from Oracle to Kafka. The data size over the wire could be different. How can I estimate the amount of data sent over the wire and the bandwidth consumed?
There is as yet insufficient data for a meaningful answer.
Kafka supports GZip, LZ4 and Snappy compression, with different compression factors and different saturation thresholds. All three methods are "learning based": they consume bytes from a stream, build a dictionary, and output bytes that are symbols from the dictionary. As a result, short data streams will not be compressed very much, because the dictionary has not yet learned much. And if the characteristics of the incoming bytes stop matching the dictionary, the compression ratio goes down again.
This means that the structure of the data can completely change the compression performance.
On the whole, in real world applications with reasonable data (i.e. not a DTM sparse matrix or a PDF or Office document storage system) you can expect on average between 1.2x and 2.0x. The larger the data chunks, the higher the compression. The actual content of the "message" also has great weight, as you can imagine.
Oracle then allocates data in data blocks, which means you get some slack space overhead, but then again it can compress those blocks. Oracle also performs deduplication in some instances.
Therefore, a meaningful and reasonably precise answer would have to depend on several factors that we don't know here.
As a ballpark figure, I'd say that the actual "logical" data behind the 1 TB in Kafka ought to range between 0.7 and 2 TB, and I'd expect the Oracle occupation to be anywhere from 0.9 to 1.2 TB if compression is available on the Oracle side, and 1.2 to 2.4 TB if it is not.
But this is totally a shot in the dark. You could have compressed binary information stored (say, XLSX or JPEG-2000 files or MP3 songs), and those would actually grow in size when compression was used. Or you might have swaths of sparse matrix data, which can easily compress 20:1 or more even with the most cursory gzipping. In the first case, the 1 TB might remain more or less 1 TB when compression was removed; in the second case, the same 1 TB could just as easily grow to 20 TB or more.
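To turn those ballpark ratios into numbers, a trivial sketch (the 1.2x and 2.0x bounds are the rough typical compression ratios quoted above, not measured values):

kafka_tb = 1.0                      # compressed size observed in Kafka
low_ratio, high_ratio = 1.2, 2.0    # assumed typical compression ratios for "reasonable" data
print(kafka_tb * low_ratio, "to", kafka_tb * high_ratio, "TB of logical data")
# Already-compressed payloads can push this below 1 TB, and highly
# repetitive data (20:1 or better) can push it far above 2 TB.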
I am afraid the simplest way to know would be to instrument both storage systems and the network, and directly monitor traffic and data usage.
Once you know the parameters of your DBs, you can extrapolate them to different storage amounts (so, say, if you know that 1 TB in Kafka requires 2.5 TB of network traffic to become 2.1 TB of Oracle tablespace, then it stands to reason that 2 TB in Kafka would require 5 TB of traffic and occupy 4.2 TB on the Oracle side)... but, even here, only provided the nature of the data does not change in the interim.
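Once you have measured one run end to end, the extrapolation itself is just linear scaling; the 1 TB / 2.5 TB / 2.1 TB figures below are the hypothetical numbers from the example above:

measured_kafka_tb = 1.0      # compressed size in Kafka for the measured run
measured_traffic_tb = 2.5    # network traffic observed for that run (hypothetical)
measured_oracle_tb = 2.1     # Oracle tablespace occupied by the same data (hypothetical)

def extrapolate(target_kafka_tb):
    scale = target_kafka_tb / measured_kafka_tb
    return measured_traffic_tb * scale, measured_oracle_tb * scale

traffic, oracle = extrapolate(2.0)
print(traffic, oracle)   # 5.0 TB of traffic, 4.2 TB on the Oracle side
# Only holds while the nature of the data does not change.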

MATLAB: Are there any problems with many (millions) small files compared to few (thousands) large files?

I'm working on real-time test software in MATLAB. On user input I want to extract the value of one pixel (or a few neighbouring pixels) from 50-200 high-resolution images (~25 MB each).
My problem is that the total image set is too big (~2000 images) to store in RAM; consequently I need to read each of the 50-200 images from disk after each user input, which of course is way too slow!
So I was thinking about splitting the images into sub-images (~100x100 pixels) and saving these separately. This would make the image-read process quick enough.
Are there any problems I should be aware of with this approach? For instance, I've read about people having trouble copying many small files; will this affect me, e.g. by making the image reads slower?
rahnema1 is right: imread(...,'PixelRegion') will speed up the read operation. If that is not enough for you, even if your files are not fragmented, maybe it is time to think about some kind of database?
Disk operations are always the bottleneck. First we switch to disk caches, then distributed storage, then RAID, and after some more time we end up with in-memory databases. You should decide what access speed is reasonable for you.

How should I store data in a recommendation engine?

I am developing a recommendation engine. I think I can’t keep the whole similarity matrix in memory.
I calculated the similarities of 10,000 items, which comes to over 40 million floats. I stored them in a binary file, and it is 160 MB.
Wow!
The problem is that I could have nearly 200,000 items.
Even if I cluster them into several groups and create a similarity matrix for each group, I still have to load them into memory at some point.
But that will consume a lot of memory.
So, is there any way to deal with this data?
How should I store it and load it into memory while ensuring my engine responds reasonably fast to an input?
You could use memory mapping to access your data. This way you can view your data on disk as one big memory area (and access it just as you would access memory), with the difference that only the pages where you read or write data are (temporarily) loaded into memory.
If you can group the data somewhat, only smaller portions would have to be read into memory while accessing the data.
As for the floats, if you can do with less resolution and store the values in, say, 16-bit integers, that would also halve the size.
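A minimal sketch of both ideas combined, using NumPy and assuming similarities in [-1, 1] scaled to 16-bit integers (the file name is made up, and the 10,000-item size is taken from the question for illustration; the full 200,000-item matrix would be ~80 GB, so in practice you would map per-cluster blocks):

import numpy as np

N = 10_000                     # items in one block (200,000 x 200,000 would be ~80 GB)
PATH = "similarity_int16.dat"  # hypothetical file name
SCALE = 32767                  # map a similarity in [-1, 1] onto int16

# Create (or later reopen with mode="r") the on-disk matrix; only the
# pages actually touched are brought into RAM.
sim = np.memmap(PATH, dtype=np.int16, mode="w+", shape=(N, N))

def store(i, j, value):
    sim[i, j] = np.int16(round(value * SCALE))

def load(i, j):
    return sim[i, j] / SCALE

store(3, 5, 0.87)
print(load(3, 5))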

Estimating Redshift Table Size

I am trying to create an estimate of how much space a table in Redshift is going to use; however, the only resources I found were about calculating the minimum table size:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
The purpose of this estimate is that I need to calculate how much space a table with the following dimensions is going to occupy, so that we don't run out of space on Redshift (i.e. it will determine how many nodes we end up using):
Rows : ~500 Billion (The exact number of rows is known)
Columns: 15 (The data types are known)
Any help in estimating this size would be greatly appreciated.
Thanks!
The article you reference (Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?) does an excellent job of explaining how storage is consumed.
The main difficulty in predicting storage is predicting the efficiency of compression. Depending upon your data, Amazon Redshift will select an appropriate Compression Encoding that will reduce the storage space required by your data.
Compression also greatly improves the speed of Amazon Redshift queries via Zone Maps, which identify the minimum and maximum value stored in each 1 MB block. Highly compressed data is stored in fewer blocks, so fewer blocks need to be read from disk during query execution.
The best way to estimate your storage space would be to load a subset of the data (e.g. 1 billion rows), allow Redshift to automatically select the compression types, and then extrapolate to your full data size.
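Here is a rough sketch of that "load a subset and extrapolate" approach, with the minimum-size floor from the linked article as a sanity check. The slice count, segment count and sample measurements below are placeholders for your cluster's actual values (the measured size of the sample table can be read, in 1 MB blocks, from the size column of SVV_TABLE_INFO), and the floor formula is the one I recall from that article:

BLOCK_MB = 1   # Redshift block size is 1 MB

def minimum_table_size_mb(user_columns, populated_slices, table_segments=1):
    # Minimum size ~= 1 MB * (user columns + 3 system columns) * slices * segments
    return BLOCK_MB * (user_columns + 3) * populated_slices * table_segments

def extrapolated_size_mb(sample_rows, sample_size_mb, total_rows):
    # Assumes compression efficiency stays roughly constant as the table grows.
    return sample_size_mb * (total_rows / sample_rows)

floor_mb = minimum_table_size_mb(user_columns=15, populated_slices=32)   # placeholder slice count
est_mb = extrapolated_size_mb(1_000_000_000, 25_000, 500_000_000_000)    # 25,000 MB is a made-up sample measurement
print(floor_mb, "MB floor,", est_mb / 1024, "GB extrapolated")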

Loading large CSV files in MATLAB

I have CSV files of size 6 GB and I tried using the import function in MATLAB to load them, but it failed due to a memory issue. Is there a way to reduce the size of the files?
I think the number of columns is causing the problem. I have 133076 rows by 2329 columns. I had another file with the same number of rows but only 12 columns, and MATLAB could handle that. However, once the number of columns increases, the files get really big.
Ultimately, if I can read the data column-wise so that I have 2329 column vectors of length 133076, that would be great.
I am using Matlab 2014a
Numeric data are by default stored by Matlab in double precision format, which takes up 8 bytes per number. Data of size 133076 x 2329 therefore take up 2.3 GiB in memory. Do you have that much free memory? If not, reducing the file size won't help.
If the problem is not that the data themselves don't fit into memory, but is really about the process of reading such a large csv-file, then maybe using the syntax
M = csvread(filename,R1,C1,[R1 C1 R2 C2])
might help, which allows you to only read part of the data at one time. Read the data in chunks and assemble them in a (preallocated!) array.
If you do not have enough memory, another possibility is to read chunkwise and then convert each chunk to single precision before storing it. This reduces memory consumption by a factor of two.
And finally, if you don't process the data all at once, but can implement your algorithm such that it uses only a few rows or columns at a time, that same syntax may help you to avoid having all the data in memory at the same time.