I am loading about 300 GB of contour line data into a PostGIS table. To speed up the process, I read that it is fastest to load the data first and then create the index. Loading the data only took about 2 days, but I have now been waiting for the index for about 30 days, and it is still not ready.
The query was:
create index idx_contour_geom on contour.contour using gist(geom);
I ran it in pgAdmin 4, and the memory consumption of the program has varied between 500 MB and well over 100 GB since.
Is it normal for indexing a database of this size to take this long?
Any tips on how to speed up the process?
Edit:
The data is loaded from 1x1 degree (lat/lon) cells (about 30,000 cells), so no line has a bounding box larger than 1x1 degree, and most of them should be much smaller. The data is in EPSG:4326, and the only attributes are the height and the geometry (geom).
I changed maintenance_work_mem to 1 GB and stopped all other writing to disk (a lot of insert operations had ANALYZE appended, which took a lot of resources). The index then built in 23 minutes.
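For reference, a minimal sketch of the sequence that worked here, assuming the same table and a session-level memory setting:
-- Give the index build more working memory for this session (the 1 GB value from above).
SET maintenance_work_mem = '1GB';
-- Build the spatial index after the bulk load has finished.
CREATE INDEX idx_contour_geom ON contour.contour USING gist (geom);
-- Refresh planner statistics once the index exists.
ANALYZE contour.contour;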
I did some analysis using sample data and found that the table size is usually about twice the raw data size (by importing a CSV file into a Postgres table and taking the CSV file size as the raw data size).
And the total disk space used seems to be about 4 times the raw data size, most likely because of the WAL.
Is there any commonly used formula to estimate how much disk space I need if I want to store, say, 1 GB of data?
I know there are many factors affecting this; I just want a quick estimate.
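To turn the rough 2x/4x observation into a measured number, one option is to load a sample and compare the table's on-disk footprint with the raw file size. A sketch (the table name is illustrative):
-- On-disk footprint of a loaded table, with and without indexes/TOAST.
SELECT pg_size_pretty(pg_relation_size('my_table'))       AS heap_only,
       pg_size_pretty(pg_indexes_size('my_table'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('my_table')) AS total_incl_toast_and_indexes;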
In Redshift I had a cluster with 4 nodes of type dc2.large.
The total storage of the cluster was 4 × 160 GB = 640 GB. The system showed storage 100% full, and the size of the database was close to 640 GB.
The query I use to check the size of the database:
select sum(used_mb) from (
    SELECT "schema" as table_schema,
           "table"  as table_name,
           size     as used_mb
    FROM svv_table_info
    ORDER BY size DESC
) d;
I added 2 more dc2.large nodes with a classic resize, which set the size of the cluster to 6 × 160 GB = 960 GB, but when I checked the size of the database I saw that it had also grown and again takes up almost 100% of the enlarged cluster.
Database size grew with the size of the cluster!
I had to perform an additional resize operation, an elastic one this time, from 6 nodes to 12 nodes. The size of the data remained close to 960 GB.
How is it possible that the size of the database grew from 640 GB to 960 GB as a result of a cluster resize operation?
I'd guess that your database has a lot of small tables in it. There are other ways this can happen, but this is by far the most likely cause. Redshift uses a 1 MB "block" as its minimum storage unit, which is great for storing large tables but is inefficient for small ones (fewer than about 1M rows per slice in the cluster).
If you have a table with, say, 100K rows split across your 4 dc2.large nodes (8 slices), each slice holds 12.5K rows. Each column of this table needs at least 1 block (1 MB) per slice to store its data. However, a block can on average hold around 200K rows (per column), so most of the blocks for this table are nearly empty, and adding rows doesn't increase the on-disk size (post vacuum). Now if you add 50% more nodes you are also adding 50% more slices, which just adds 50% more nearly empty blocks to the table's storage.
If this isn't your case I can expand on other ways this can happen, but this really is the most likely in my experience. Unfortunately the fix for this is often to revamp your data model or to offload some less-used data to Spectrum (S3).
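To check whether this applies to your cluster, a query along these lines (a sketch using the standard svv_table_info view) lists small tables whose footprint is dominated by the per-slice block minimum:
-- Small tables that still occupy many 1 MB blocks.
SELECT "schema", "table", tbl_rows, size AS size_mb
FROM svv_table_info
WHERE tbl_rows < 1000000
ORDER BY size DESC
LIMIT 20;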
My 4-node Redshift cluster (dc2.large, 160 GB storage per node) was around 75% full, so I added 2 more nodes for a total of 6 nodes. I expected the disk usage to drop to around 50%, but after making that change the disk usage still remains at 75% (even after a few days and after running VACUUM).
75% of 4 × 160 GB = 480 GB of data.
6 × 160 GB = 960 GB of available storage in the new configuration, so usage should have dropped to 480/960, i.e. somewhere close to 50%.
The image shows the disk space percentage before and after adding two nodes.
I also checked whether there are any large tables using DISTSTYLE ALL, which replicates data across the nodes, but the tables that use it are very small compared to the total storage capacity, so I don't think they have any significant impact on the storage.
What can I do here to reduce the storage usage, since I don't want to add more nodes and then end up in the same situation again?
It sounds like your tables are affected by the minimum table size. It may be counter-intuitive but you can often reduce the size of small tables by converting them to DISTSTYLE ALL.
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
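A minimal sketch of that conversion, using a hypothetical small table named dim_small; with ALL distribution the table is stored on one slice per node rather than spread across every slice, which lowers the per-column block minimum for small tables:
-- Convert the small table to ALL distribution.
ALTER TABLE dim_small ALTER DISTSTYLE ALL;
-- Reclaim space and refresh statistics afterwards.
VACUUM dim_small;
ANALYZE dim_small;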
Can you clarify what distribution style you are using for some of the bigger tables?
If you are not specifying a distribution style then Redshift will automatically pick one (see here), and it's possible that it will choose ALL distribution at first and only switch to EVEN or KEY distribution once you reach a certain disk usage percentage.
Also, have you run the ANALYZE command to make sure the table stats are up to date?
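To answer the distribution-style question concretely, a sketch like this shows what Redshift picked for the largest tables, and ANALYZE then refreshes the statistics:
-- Distribution style and size of the biggest tables.
SELECT "schema", "table", diststyle, size AS size_mb, tbl_rows
FROM svv_table_info
ORDER BY size DESC
LIMIT 20;
-- Refresh table statistics across the database.
ANALYZE;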
I am trying to estimate how much space a table in Redshift is going to use; however, the only resources I found cover calculating the minimum table size:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
The purpose of this estimate is that I need to calculate how much space a table with the following dimensions will occupy without running out of space on Redshift (i.e. it will determine how many nodes we end up using):
Rows: ~500 billion (the exact number of rows is known)
Columns: 15 (the data types are known)
Any help in estimating this size would be greatly appreciated.
Thanks!
The article you reference (Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?) does an excellent job of explaining how storage is consumed.
The main difficulty in predicting storage is predicting the efficiency of compression. Depending upon your data, Amazon Redshift will select an appropriate Compression Encoding that will reduce the storage space required by your data.
Compression also greatly improves the speed of Amazon Redshift queries by using Zone Maps, which record the minimum and maximum value stored in each 1 MB block. Highly compressed data is stored in fewer blocks, thereby requiring fewer blocks to be read from disk during query execution.
The best way to estimate your storage space would be to load a subset of the data (e.g. 1 billion rows), allow Redshift to automatically select the compression types, and then extrapolate to your full data size.
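As a sketch of that extrapolation, assuming a sample has been loaded into a hypothetical table my_sample_table and the full data set has ~500 billion rows:
-- On-disk size and row count of the sample, scaled to the full row count.
SELECT "table",
       tbl_rows,
       size AS size_mb,
       size * (500000000000.0 / tbl_rows) AS projected_full_size_mb
FROM svv_table_info
WHERE "table" = 'my_sample_table';
-- The per-column compression encodings Redshift chose for the sample.
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'my_sample_table';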
I have CSV files of size 6 GB, and I tried using the import function in MATLAB to load them, but it failed due to a memory issue. Is there a way to reduce the size of the files?
I think the number of columns is causing the problem. I have 133076 rows by 2329 columns. I had another file with the same number of rows but only 12 columns, and MATLAB could handle that. However, once the number of columns increases, the files get really big.
Ultimately, if I can read the data column-wise so that I get 2329 column vectors of length 133076, that would be great.
I am using Matlab 2014a
Numeric data are by default stored by Matlab in double precision format, which takes up 8 bytes per number. Data of size 133076 x 2329 therefore take up 2.3 GiB in memory. Do you have that much free memory? If not, reducing the file size won't help.
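For reference, the arithmetic behind that figure (a quick check, nothing more):
% 133076 x 2329 doubles at 8 bytes each.
nBytes = 133076 * 2329 * 8;            % about 2.48e9 bytes
fprintf('%.2f GiB\n', nBytes / 2^30);  % about 2.31 GiB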
If the problem is not that the data themselves don't fit into memory, but is really about the process of reading such a large CSV file, then maybe using the syntax
M = csvread(filename,R1,C1,[R1 C1 R2 C2])
might help; it allows you to read only part of the data at a time. Read the data in chunks and assemble them in a (preallocated!) array.
If you do not have enough memory, another possibility is to read chunkwise and then convert each chunk to single precision before storing it. This reduces memory consumption by a factor of two.
And finally, if you don't process the data all at once, but can implement your algorithm such that it uses only a few rows or columns at a time, that same syntax may help you to avoid having all the data in memory at the same time.
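Putting these pieces together, a minimal sketch of the chunked read with single-precision storage (the file name and chunk size are assumptions):
% Read the CSV in blocks of rows, convert each block to single precision,
% and store it in a preallocated array. csvread uses zero-based indices.
filename  = 'data.csv';   % hypothetical file name
nRows     = 133076;       % total rows (from the question)
nCols     = 2329;         % total columns
chunkRows = 10000;        % rows per chunk; tune to available memory

M = zeros(nRows, nCols, 'single');            % preallocate in single precision

for r0 = 0:chunkRows:nRows-1
    r1 = min(r0 + chunkRows - 1, nRows - 1);  % last row of this chunk
    block = csvread(filename, r0, 0, [r0 0 r1 nCols-1]);
    M(r0+1:r1+1, :) = single(block);          % convert and store
end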