Redshift free storage doesn't increase after adding 2 nodes - amazon-redshift

My 4-node (dc2.large, 160 GB storage per node) Redshift cluster was around 75% full, so I added 2 more nodes to make a total of 6 nodes. I expected the disk usage to drop to around 50%, but after making that change it still sits at 75% (even after a few days and after running VACUUM).
75% of 4*160 GB = 480 GB of data
6*160 = 960 GB of available storage in the new configuration, which means usage should have dropped to 480/960, i.e. somewhere close to 50%.
The image shows the disk space percentage before and after adding two nodes.
I also checked whether there are any large tables using DISTSTYLE ALL, which replicates data across the nodes, but the tables that use it are very small compared to the total storage capacity, so I don't think they have any significant impact on storage.
What can I do here to reduce the storage usage? I don't want to add more nodes and end up in the same situation again.

It sounds like your tables are affected by the minimum table size. It may be counter-intuitive but you can often reduce the size of small tables by converting them to DISTSTYLE ALL.
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
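As a starting point, a query along these lines (a sketch; the thresholds are arbitrary examples) lists each table's distribution style, row count, and footprint from svv_table_info, which helps spot small tables that are paying the minimum-table-size overhead and are candidates for DISTSTYLE ALL:

-- Sketch: small tables whose footprint is dominated by the minimum table size.
-- svv_table_info reports size in 1 MB blocks.
select schema,
       "table",
       diststyle,
       tbl_rows,
       size as size_mb
from svv_table_info
where tbl_rows < 1000000      -- "small" by row count; adjust to taste
order by size desc
limit 50;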

Can you clarify what distribution style you are using for some of the bigger tables?
If you are not specifying a distribution style then Redshift will automatically pick one (see here), and it's possible that it will choose ALL distribution at first and only switch to EVEN or KEY distribution once you reach a certain disk usage %.
Also, have you run the ANALYZE command to make sure the table stats are up to date?
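For example (the table name below is hypothetical), the current style can be read from svv_table_info, where automatically assigned styles are shown with an AUTO prefix, and statistics can be refreshed with ANALYZE:

-- Check the distribution style Redshift picked for the biggest tables.
select "table", diststyle, tbl_rows, size as size_mb
from svv_table_info
order by size desc
limit 20;

-- Refresh planner statistics for one table (hypothetical name).
analyze my_big_table;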

Related

About Ram & Secondary Storage

Why is RAM size always smaller than secondary storage (HDD/SSD)? If you look at any device, you will notice the same thing.
The primary reason is price. For example (depending a lot on type, etc.), RAM is currently around $4 per GiB and "rotating disk" HDD is around $0.04 per GiB, so RAM costs about 100 times as much per GiB.
Another reason is that HDD/SSD is persistent (the data remains when you turn the power off), and the amount of data you want to keep when the power is off is typically much larger than the amount of data you don't need to keep. A special case is putting a computer into a "hibernate" state, where the OS stores the contents of RAM to disk and turns the power off, then loads everything back into RAM when power returns so that it looks the same; in that case the amount of persistent storage needs to be larger than the amount of RAM.
Another (much smaller) reason is speed. It's not enough to be able to store data, you have to be able to access it too, and the speed of accessing data gets worse as the amount of storage increases. This holds true for all kinds of storage for different reasons (and is why you also have L1, L2, L3 caches ranging from "very small and very fast" to "larger and slower"). For RAM it's caused by the number of address lines and the size of "row select" circuitry. For HDD it's caused by seek times. For humans getting the milk out of a refrigerator it's "search time + movement speed" (faster to get the milk out of a tiny bar fridge than to walk around inside a large industrial walk-in refrigerator).
However, there are special cases (there are always special cases). For example, you might have a computer that boots from the network and then uses the network for persistent storage, where there's literally no secondary storage in the computer at all. Another special case is small embedded systems, where RAM is often larger than persistent storage.

Redshift database size grew significantly when the cluster was resized

In Redshift I had a cluster with 4 nodes of the type dc2.large
The total size of the cluster was 160*4 = 640 GB. The system showed storage 100% full, and the size of the database was close to 640 GB.
Query I use to check the size of the db:
select sum(used_mb)
from (
    select schema as table_schema,
           "table" as table_name,
           size as used_mb        -- svv_table_info reports size in 1 MB blocks
    from svv_table_info
) t;                              -- the derived table needs an alias in Redshift
I added 2 dc2.large nodes via a classic resize, which set the size of the cluster to 160*6 = 960 GB, but when I checked the size of the database I saw that it had also grown and again takes up almost 100% of the enlarged cluster.
Database size grew with the size of the cluster!
I had to perform an additional resize operation, an elastic one this time, from 6 nodes to 12 nodes. The size of the data remained close to 960 GB.
How is it possible that the size of the database grew from 640 GB to 960 GB as a result of a cluster resize operation?
I'd guess that your database has a lot of small tables in it. There are other ways this can happen but this is by far the most likely cause. You see, Redshift uses a 1MB "block" as the minimum storage unit, which is great for large tables but inefficient for small ones (< 1M rows per slice in the cluster).
If you have a table with, say, 100K rows split across your 4 dc2.large nodes (8 slices), each slice holds 12.5K rows. Each column of this table needs at least 1 block (1MB) per slice to store its data. However, a block can on average hold about 200K rows (per column), so most of the blocks for this table are mostly empty. If you add rows, the on-disk size (post-vacuum) doesn't increase. Now if you add 50% more nodes you are also adding 50% more slices, which just adds 50% more nearly empty blocks to the table's storage.
If this isn't your case I can expand on other ways this can happen but this really is the most likely in my experience. Unfortunately the fix for this is often to revamp your data model or to offload some less used data to Spectrum (S3).
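If you want to see this per-slice block overhead directly, a query like the following (a sketch against the STV system tables; it groups by table name only) counts the 1 MB blocks each table occupies, which you can compare against its row count in svv_table_info:

-- Sketch: count 1 MB blocks per table; each (table, slice, column) uses at least one block.
select p.name   as table_name,
       count(*) as blocks_mb
from stv_blocklist b
join stv_tbl_perm p
  on b.tbl = p.id
 and b.slice = p.slice
group by p.name
order by blocks_mb desc;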

Estimating Redshift Table Size

I am trying to create an estimate for how much space a table in Redshift is going to use; however, the only resources I found deal with calculating the minimum table size:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
The purpose of this estimate is that I need to calculate how much space a table with the following dimensions is going to occupy without running out of space on Redshift (i.e. it will define how many nodes we end up using).
Rows : ~500 Billion (The exact number of rows is known)
Columns: 15 (The data types are known)
Any help in estimating this size would be greatly appreciated.
Thanks!
The article you reference (Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?) does an excellent job of explaining how storage is consumed.
The main difficulty in predicting storage is predicting the efficiency of compression. Depending upon your data, Amazon Redshift will select an appropriate Compression Encoding that will reduce the storage space required by your data.
Compression also greatly improves the speed of Amazon Redshift queries by using Zone Maps, which identify the minimum and maximum value stored in each 1MB block. Highly compressed data will be stored in fewer blocks, thereby requiring fewer blocks to be read from disk during query execution.
The best way to estimate your storage space would be to load a subset of the data (eg 1 billion rows), allow Redshift to automatically select the compression types and then extrapolate to your full data size.
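A rough sketch of that workflow (the table name and row counts are placeholders): load the sample with COPY, ask Redshift which encodings it would recommend, check the sample's on-disk size, and scale up:

-- 1) After loading ~1 billion sample rows into sample_table with COPY,
--    see which compression encodings Redshift recommends per column.
analyze compression sample_table;

-- 2) Check how much space the sample occupies (size is in 1 MB blocks).
select "table", tbl_rows, size as size_mb
from svv_table_info
where "table" = 'sample_table';

-- 3) Extrapolate to the full 500 billion rows (very rough; ignores the
--    per-slice minimum block overhead, which matters little for large tables).
select size * (500000000000.0 / tbl_rows) / 1024.0 as estimated_full_size_gb
from svv_table_info
where "table" = 'sample_table';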

What is the max size of collection in mongodb

I would like to know what is the max size of collection in mongodb.
In the MongoDB limitations documentation it is mentioned that a single MMAPv1 database has a maximum size of 32TB.
Does this mean the max size of a collection is 32TB?
If I want to store more than 32TB in one collection what is the solution?
There are theoretical limits, as I will show below, but even the lower bound is pretty high. It is not easy to calculate the limits correctly, but the order of magnitude should be sufficient.
mmapv1
The actual limit depends on a few things like the length of shard names and the like (which adds up if you have a couple of hundred thousand of them), but here is a rough calculation with real-life data.
Each shard needs some space in the config db, which is limited as any other database to 32TB on a single machine or in a replica set. On the servers I administrate, the average size of an entry in config.shards is 112 bytes. Furthermore, each chunk needs about 250 bytes of metadata information. Let us assume optimal chunk sizes of close to 64MB.
We can have at maximum 500,000 chunks per server. 500,000 * 250 bytes equals 125MB of chunk information per shard. So, per shard, we have 125.000112 MB (125MB of chunk metadata plus the 112-byte config.shards entry) if we max everything out. Dividing 32TB by that value shows us that we can have a maximum of slightly under 256,000 shards in a cluster.
Each shard in turn can hold 32TB worth of data. 256,000 * 32TB is 8,192,000 terabytes, or about 8.192 exabytes. That would be the limit for our example.
Let's say it's 8 exabytes. As of now, this easily translates to "enough for all practical purposes". To give you an impression: all the data held by the Library of Congress (arguably one of the biggest libraries in the world in terms of collection size) is estimated at around 20TB, including audio, video, and digital materials. You could fit that into our theoretical MongoDB cluster some 400,000 times. Note that this is the lower bound of the maximum size, using conservative values.
WiredTiger
Now for the good part: the WiredTiger storage engine does not have this limitation: the database size is not limited (since there is no limit on how many data files can be used), so we can have an unlimited number of shards. Even when those shards run on mmapv1 and only our config servers on WT, the size of a cluster becomes nearly unlimited; the limitation of 16.8M TB of RAM on a 64-bit system might cause problems somewhere and cause the indices of the config.shards collection to be swapped to disk, stalling the system. I can only guess, since my calculator refuses to work with numbers in that area (and I am too lazy to do it by hand), but I estimate the limit here to be in the two-digit yottabyte area (and the space needed to host it somewhere around the size of Texas).
Conclusion
Do not worry about the maximum data size in a sharded environment. No matter what, it is more than enough, even with the most conservative approach. Use sharding, and you are done. Btw: even 32TB is a hell of a lot of data: most clusters I know hold less than that and shard because IOPS and RAM utilization exceeded a single node's capacity.

Total MongoDB storage size

I have a sharded and replicated MongoDB with dozens of millions of records. I know that Mongo writes data with some padding factor to allow fast updates, and I also know that to replicate the database Mongo has to store an operation log, which requires some (actually, a lot of) space. Even with that knowledge I have no idea how to estimate the actual size required by Mongo given the size of a typical database record. Right now I see a discrepancy by a factor of 2 to 3 between weekly repairs.
So the question is: How to estimate a total storage size required by MongoDB given an average record size in bytes?
The short answer is: you can't, not based solely on avg. document size (at least not in any accurate way).
To explain more verbosely:
The space needed on disk is not simply a function of the average document size. There is also the space needed for any indexes you create. Then there is the space consumed when document moves are triggered (despite padding, this does happen); that space is placed on a free list to be re-used, but depending on the data you subsequently insert, it may or may not be possible to re-use it.
Add to that the fact that pre-allocation means a handful of documents will occasionally increase your on-disk space utilization by ~2GB as a new data file is allocated. Of course, with sufficient data this is essentially a rounding error, but it is worth bearing in mind.
The only way to estimate this kind of data-to-size ratio, assuming a consistent usage pattern, is to trend it over time for your particular use case and track the disk space used versus the data inserted (the number of documents may be a better measure than data volume, depending on how variable your doc size is).
Similarly, track the insertion rate, doc size, and the space gained back from a resync/repair. FYI: you can resync a secondary from scratch to get a "fresh" copy of the data files rather than running a repair, which can be less disruptive and use less space, depending on your setup.