Redshift database size grew significantly when the cluster was resized - amazon-redshift

In Redshift I had a cluster with 4 nodes of the type dc2.large.
The total size of the cluster was 160*4 = 640 GB. The system showed 100% storage full, and the size of the database was close to 640 GB.
Query I use to check the size of the database:
select sum(used_mb) from (
    select "schema" as table_schema,
           "table" as table_name,
           size as used_mb
    from svv_table_info
    order by size desc
) t;
I added 2 dc2.large nodes via a classic resize, which set the size of the cluster to 160*6 = 960 GB, but when I checked the size of the database I suddenly saw that it had also grown and again takes up almost 100% of the enlarged cluster.
The database size grew with the size of the cluster!
I had to perform an additional resize operation - an elastic one, from 6 nodes to 12 nodes. The size of the data remained close to 960 GB.
How is it possible that the size of the database grew from 640 GB to 960 GB as a result of a cluster resize operation?

I'd guess that your database has a lot of small tables in it. There are other ways this can happen, but this is by far the most likely cause. Redshift uses a 1 MB "block" as its minimum storage unit, which is great for storing large tables but inefficient for small ones (fewer than ~1M rows per slice in the cluster).
If you have a table with, say, 100K rows split across your 4 dc2.large nodes (8 slices), each slice holds 12.5K rows. Each column of this table needs at least 1 block (1 MB) per slice to store its data. However, a block can on average hold about 200K rows (per column), so most of the blocks for this table are mostly empty, and adding a few more rows doesn't increase the on-disk size (post vacuum). Now, if you add 50% more nodes you are also adding 50% more slices, which just adds 50% more nearly empty blocks to the table's storage.
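One way to check for this (a rough sketch only; the 1M-row threshold is just illustrative) is to list small tables ordered by how much space they actually occupy:
select "table", tbl_rows, size as used_mb
from svv_table_info
where tbl_rows < 1000000
order by size desc;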
If this isn't your case I can expand on other ways this can happen, but this really is the most likely one in my experience. Unfortunately, the fix for this is often to revamp your data model or to offload some less-used data to Spectrum (S3).

Related

Redshift free storage doesn't increase after adding 2 nodes

My 4-node (dc2.large, 160 GB storage per node) Redshift cluster had around 75% of its storage full, so I added 2 more nodes to make a total of 6 nodes, and I was expecting the disk usage to drop to around 50%, but after making that change the disk usage still remains at 75% (even after a few days and after VACUUM).
75% of 4*160 GB = 480 GB of data.
6*160 = 960 GB of available storage in the new configuration, which means usage should have dropped to 480/960, i.e. somewhere close to 50%.
The image shows the disk space percentage before and after adding two nodes.
I also checked whether there are any large tables using DISTSTYLE ALL, which causes data replication across the nodes, but the tables with that style are very small compared to the total storage capacity, so I don't think they have any significant impact on storage.
What can I do here to reduce the storage usage? I don't want to add more nodes and then end up in the same situation again.
It sounds like your tables are affected by the minimum table size. It may be counter-intuitive but you can often reduce the size of small tables by converting them to DISTSTYLE ALL.
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
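A minimal sketch of that conversion (the table name small_dim is hypothetical; svv_table_info reports size in 1 MB blocks):
alter table small_dim alter diststyle all;
select "table", diststyle, size as used_mb
from svv_table_info
where "table" = 'small_dim';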
Can you clarify what distribution style you are using for some of the bigger tables?
If you are not specifying a distribution style then Redshift will automatically pick one (see the Redshift documentation on automatic distribution), and it's possible that it will choose ALL distribution at first and only switch to EVEN or KEY distribution once you reach a certain disk usage percentage.
Also, have you run the ANALYZE command to make sure the table stats are up to date?
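For example (the schema and table names here are placeholders; stats_off close to zero means the statistics are current):
analyze my_schema.my_table;
select "table", stats_off
from svv_table_info
where "table" = 'my_table';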

Slow indexing of 300GB Postgis table

I am loading about 300 GB of contour line data into a PostGIS table. To speed up the process I read that it is fastest to first load the data and then create an index. Loading the data took only about 2 days, but I have now been waiting for the index for about 30 days and it is still not ready.
The query was:
create index idx_contour_geom on contour.contour using gist(geom);
I ran it in pgAdmin 4, and the memory consumption of the program has varied from 500 MB to more than 100 GB since.
Is it normal for indexing such a table to take this long?
Any tips on how to speed up the process?
Edit:
The data is loaded from 1x1 degree (lat/lon) cells (about 30,000 cells), so no line has a bounding box larger than 1x1 degree, and most of them should be much smaller. They are in the EPSG:4326 projection and the only attributes are height and the geometry (geom).
I changed maintenance_work_mem to 1 GB and stopped all other writing to disk (a lot of insert operations had ANALYZE appended, which took a lot of resources). The index build now ran in 23 minutes.
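For reference, a sketch of that session-level tuning (the 1GB value is the one mentioned in the edit):
set maintenance_work_mem = '1GB';
create index idx_contour_geom on contour.contour using gist(geom);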

Estimating Redshift Table Size

I am trying to create an estimate of how much space a table in Redshift is going to use; however, the only resources I found were for calculating the minimum table size:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
The purpose of this estimate is that I need to calculate how much space a table with the following dimensions is going to occupy without running out of space on Redshift (i.e. it will determine how many nodes we end up using):
Rows: ~500 billion (the exact number of rows is known)
Columns: 15 (the data types are known)
Any help in estimating this size would be greatly appreciated.
Thanks!
The article you reference (Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?) does an excellent job of explaining how storage is consumed.
The main difficulty in predicting storage is predicting the efficiency of compression. Depending upon your data, Amazon Redshift will select an appropriate Compression Encoding that will reduce the storage space required by your data.
Compression also greatly improves the speed of Amazon Redshift queries by using Zone Maps, which identify the minimum and maximum value stored in each 1MB block. Highly compressed data is stored in fewer blocks, thereby requiring fewer blocks to be read from disk during query execution.
The best way to estimate your storage space would be to load a subset of the data (eg 1 billion rows), allow Redshift to automatically select the compression types and then extrapolate to your full data size.
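A rough sketch of that extrapolation (the sample table name is hypothetical, and it assumes the sample compresses like the full data set):
-- after loading the ~1 billion sample rows with COPY so Redshift applies automatic compression
select "table",
       tbl_rows,
       size as sample_mb,
       size::float / tbl_rows * 500000000000 / 1024 as estimated_full_gb
from svv_table_info
where "table" = 'my_table_sample';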

What is the max size of collection in mongodb

I would like to know the maximum size of a collection in MongoDB.
The MongoDB limitations documentation mentions that a single MMAPv1 database has a maximum size of 32TB.
Does this mean the max size of a collection is 32TB?
If I want to store more than 32TB in one collection, what is the solution?
There are theoretical limits, as I will show below, but even the lower bound is pretty high. It is not easy to calculate the limits correctly, but the order of magnitude should be sufficient.
mmapv1
The actual limit depends on a few things, like the length of shard names and the like (that adds up if you have a couple of hundred thousand of them), but here is a rough calculation with real-life data.
Each shard needs some space in the config database, which is limited, like any other database, to 32TB on a single machine or in a replica set. On the servers I administer, the average size of an entry in config.shards is 112 bytes. Furthermore, each chunk needs about 250 bytes of metadata. Let us assume optimal chunk sizes of close to 64MB.
We can have at most 500,000 chunks per server. 500,000 * 250 bytes equals 125MB of chunk information per shard. So each shard accounts for 125.000112 MB in the config database if we max everything out. Dividing 32TB by that value shows that we can have a maximum of slightly under 256,000 shards in a cluster.
Each shard in turn can hold 32TB worth of data. 256,000 * 32TB is 8.192 exabytes, or 8,192,000 terabytes. That would be the limit for our example.
Let's say it's 8 exabytes. As of now, this can easily be translated to "enough for all practical purposes". To give you an impression: all the data held by the Library of Congress (arguably one of the biggest libraries in the world in terms of collection size) is estimated at around 20TB, including audio, video, and digital materials. You could fit that into our theoretical MongoDB cluster some 400,000 times. Note that this is the lower bound of the maximum size, using conservative values.
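Summarizing the arithmetic above:
\[
\frac{32\,\mathrm{TB}}{500{,}000 \times 250\,\mathrm{B} + 112\,\mathrm{B}} \approx 256{,}000 \ \text{shards},
\qquad 256{,}000 \times 32\,\mathrm{TB} \approx 8.192\,\mathrm{EB}
\]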
WiredTiger
Now for the good part: the WiredTiger storage engine does not have this limitation: the database size is not limited (since there is no limit on how many data files can be used), so we can have an unlimited number of shards. Even when we have those shards running on mmapv1 and only our config servers on WiredTiger, the size of a cluster becomes nearly unlimited - the limitation to 16.8M TB of RAM on a 64-bit system might cause problems somewhere and cause the indices of the config.shards collection to be swapped to disk, stalling the system. I can only guess, since my calculator refuses to work with numbers in that area (and I am too lazy to do it by hand), but I estimate the limit here to be in the two-digit yottabyte area (and the space needed to host that somewhere around the size of Texas).
Conclusion
Do not worry about the maximum data size in a sharded environment. No matter what, it is more than enough, even with the most conservative approach. Use sharding, and you are done. By the way: even 32TB is a hell of a lot of data; most clusters I know hold less data than that and shard because IOPS and RAM utilization exceeded a single node's capacity.

Total MongoDB storage size

I have a sharded and replicated MongoDB with dozens of millions of records. I know that Mongo writes data with some padding factor to allow fast updates, and I also know that to replicate the database Mongo has to store an operation log which requires some (actually, a lot of) space. Even with that knowledge I have no idea how to estimate the actual size required by Mongo given the size of a typical database record. By now I have a discrepancy of a factor of 2-3 between weekly repairs.
So the question is: how do I estimate the total storage size required by MongoDB given an average record size in bytes?
The short answer is: you can't, not based solely on avg. document size (at least not in any accurate way).
To explain more verbosely:
The space needed on disk is not simply a function of the average document size. There is also the space needed for any indexes you create. Then there is the space left behind when a document outgrows its padding and is moved (despite padding, this does happen) - that space is placed on a free list to be re-used, but depending on the data you subsequently insert, it may or may not be possible to re-use that space.
You can also add in the fact that pre-allocation means that occasionally a handful of documents will increase your on-disk space utilization by ~2GB as a new data file is allocated. Of course, with sufficient data, this will be essentially a rounding error, but it is worth bearing in mind.
The only way to estimate this kind of data-to-size ratio, assuming a consistent usage pattern, is to trend it over time for your particular use case and track disk space usage versus data inserted (the number of documents might be a better metric than data volume, depending on the variability of doc size).
Similarly, it is worth tracking the insertion rate, doc size, and the space gained back from a resync/repair. FYI - you can resync a secondary from scratch to get a "fresh" copy of the data files rather than running a repair, which can be less disruptive and use less space, depending on your setup.