I have created a number of small staging tables in RedShift as part of an ETL process. Each table has between 50-100 rows (on average) with ~100 columns. When I query to see how much disk space each staging table requires, all columns are taking up exactly the same amount of space. The amount of space taken is far in excess of what is required. For example, 6 MB for 59 BOOLEAN values. I have tried multiple permutations of:
Column data types (varchar, timestamp, etc)
Column encodings (lzo, bytedict, etc)
Loading styles (individual insert, deep copy, etc)
Repeated VACUUMs in between all the above steps
Nothing seems to change the amount of space required for these staging tables. Why does RedShift not compress these tables more aggressively? Can I configure this in RedShift? Or should I simply force everything to be in one large staging table?
I'm using this query to determine disk space:
select name
, col
, sum(num_values) as num_values
, count(blocknum) as size_in_mb
from svv_diskusage
group by name
, col
Since the block size in RedShift is 1 MB, each column takes up at least 1 MB. On top of this, if the DISTSTYLE is EVEN it will be closer to one block per slice per column. Since there is no way to tweak the block size in RedShift, there is no way to reduce the size of an empty table below (number of columns) * (slices containing data for each column) * 1 MB.
It's basically:
For tables created using the KEY or EVEN distribution style:
Minimum table size = block_size (1 MB) * (number_of_user_columns + 3 system columns) * number_of_populated_slices * number_of_table_segments
For tables created using the ALL distribution style:
Minimum table size = block_size (1 MB) * (number_of_user_columns + 3 system columns) * number_of_cluster_nodes * number_of_table_segments
number_of_table_segments is 1 for an unsorted table and 2 for a table defined with a sort key.
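As a purely illustrative example (assuming each column's data ends up on 6 populated slices): an unsorted table needs at least 6 * 1 MB = 6 MB per column, which matches the 6 MB observed above for 59 BOOLEAN values, and a 100-column table with KEY or EVEN distribution needs at least 1 MB * (100 + 3) * 6 * 1 ≈ 618 MB, no matter how few rows it holds.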
I am relatively new to using Postgres, but am wondering what could be the workaround here.
I have a table with about 20 columns and 250 million rows, and an index created for the timestamp column time (but no partitions).
Queries sent to the table have been running endlessly and failing, even simple select * queries (although the "view first/last 100 rows" function in pgAdmin works).
For example, if I want to LIMIT a selection of the data to 10 rows:
SELECT * from mytable
WHERE time::timestamp < '2019-01-01'
LIMIT 10;
Such a query hangs - what can be done to optimize queries in a table this large? When the table was of a smaller size (~ 100 million rows), queries would always complete. What should one do in this case?
If time is of data type timestamp, or the index is created on (time::timestamp), the query should be lightning fast.
Please show the CREATE TABLE and CREATE INDEX statements, and the EXPLAIN output for the query, so we have more details to work with.
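For what it's worth, a minimal sketch of how to collect that plan (table and column names taken from the question); plain EXPLAIN, without ANALYZE, only plans the query, so it returns quickly even when running the query itself would hang:
EXPLAIN
SELECT *
FROM mytable
WHERE time::timestamp < '2019-01-01'
LIMIT 10;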
"Query that doesn't complete" usually means that it does disk swaps. Especially when you mention the fact that with 100M rows it manages to complete. That's because index for 100M rows still fits in your memory. But index twice this size doesn't.
Limit won't help you here, as database probably decides to read the index first, and that's what kills it.
You could try and increase available memory, but partitioning would actually be the best solution here.
Partitioning means smaller tables. Smaller tables mean smaller indexes. Smaller indexes have a better chance of fitting into your memory.
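A minimal sketch of what that could look like with declarative range partitioning (this assumes PostgreSQL 11 or later; only the "time" column comes from the question, every other name and the partition bounds are hypothetical):
CREATE TABLE mytable_part (
    id      bigint,
    payload text,
    "time"  timestamp NOT NULL
) PARTITION BY RANGE ("time");

CREATE TABLE mytable_2018 PARTITION OF mytable_part
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
CREATE TABLE mytable_2019 PARTITION OF mytable_part
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

-- On PostgreSQL 11+ an index on the parent is created on every partition automatically.
CREATE INDEX ON mytable_part ("time");
Each partition then carries its own, much smaller index on "time", and a query like the one above only touches the partitions whose range overlaps the WHERE condition.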
We run postgresql 9.5.2 in an RDS instance. One thing we noticed was that a certain table sometimes grows very rapidly in size.
The table in question has only 33k rows and ~600 columns. All columns are numeric (decimal(25, 6)). After vacuum full, the "total_bytes" as reported in the following query
select c.relname, pg_total_relation_size(c.oid) AS total_bytes
from pg_class c;
is about 150MB. However, we observed this grew to 71GB at one point. In a recent episode, total_bytes grew by 10GB in a 30 minute period.
During the episode mentioned above, we had a batch update query that runs ~4 times per minute that updates every record in the table. However, during other times table size remained constant despite similar update activities.
I understand this is probably caused by "dead records" being left over from the updates. Indeed, when this table grows too big, simply running vacuum full will shrink it to its normal size (150 MB). My questions are:
Have other people experienced similar rapid growth in table size in postgresql, and is this normal?
If our batch update queries are causing the rapid growth in table size, why doesn't it happen every time? In fact I tried to reproduce it manually by running something like
update my_table set x = x * 2
but couldn't -- table size remained the same before and after the query.
The problem is having 600 columns in a single table, which is never a good idea. This is going to cause a lot of problems; table size is just one of them.
From the PostgreSQL docs...
The actual storage requirement [for numeric values] is two bytes for each group of four decimal digits, plus three to eight bytes overhead.
So decimal(25, 6) is something like 8 + (31 / 4 * 2), or about 24 bytes per column. At 600 columns that's about 14,400 bytes, or roughly 14 KB, per row. At 33,000 rows that's about 450 MB.
If you're updating every row 4 times per minute, that's going to leave about 1.8 gigs per minute of dead rows.
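If you want to watch that happen, here's a rough sketch using the standard statistics view (the table name is taken from your reproduction attempt):
select relname,
       n_live_tup,
       n_dead_tup,       -- dead row versions left behind by the updates
       last_vacuum,
       last_autovacuum
from pg_stat_user_tables
where relname = 'my_table';
If n_dead_tup keeps climbing while last_autovacuum stays old, autovacuum isn't keeping up with the batch updates, which would also explain why the growth only shows up some of the time.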
You should fix your schema design.
You shouldn't need to touch every row of a table 4 times a minute.
You should ask a question about redesigning that table and process.
The size of the SQL dump is the same (30 GB) even after I delete a large number of rows from a MySQL (MyISAM) table.
Note: the variable innodb_file_per_table is ON.
mysql> delete from radacct where YEAR(acctstarttime)='2014';
Query OK, 1963534 rows affected (1 hour 30.58 sec)
What is the question?
If you have trouble storing the backup in one piece, maybe it would be easier to transport if it were smaller in size but split into more parts?
part 1
select * from radacct where YEAR(acctstarttime)='2014' and id<100000000 order by id asc;
part 2
select * from radacct where YEAR(acctstarttime)='2014' and id<200000000 order by id asc;
etc ...
And afterwards you could compress it.
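As a sketch of one way to export those parts with plain SQL (the output path is hypothetical), SELECT ... INTO OUTFILE writes a delimited file on the server host:
select *
from radacct
where YEAR(acctstarttime) = '2014' and id < 100000000
order by id
into outfile '/tmp/radacct_part1.csv'
fields terminated by ',' enclosed by '"'
lines terminated by '\n';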
PS: I can't add a reply to your comment, so I will add it here:
You can view this page, it has very useful info: MySQL InnoDB not releasing disk space after deleting data rows from table
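Also note that neither MyISAM nor InnoDB shrinks the data files automatically after a big DELETE; rebuilding the table reclaims the space. A minimal sketch (table name taken from your DELETE):
OPTIMIZE TABLE radacct;
For MyISAM this defragments the data file; for InnoDB with innodb_file_per_table = ON it is mapped to a table rebuild, which recreates the .ibd file at its real size.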
I'm trying to set up a testing environment for performance testing. Currently we have a table with 8 million records and we want to duplicate these records across 30 days.
In other words:
- Table 1
--Partition1(8 million records)
--Partition2(0 records)
.
.
--Partition30(0 records)
Now I want to take the 8 million records in Partition1 and duplicate them across the rest of the partitions. The only difference between the copies is a column that contains a DATE, which should advance by 1 day in each copy.
Partition1(DATE)
Partition2(DATE+1)
Partition3(DATE+2)
And so on.
The last restrictions are that there are 2 indexes on the original table which must be preserved in the copies, and the Oracle DB is 10g.
How can I duplicate this content?
Thanks!
It seems to me to be as simple as running as efficient an insert as possible.
Probably if you cross-join the existing data to a list of integers, 1 .. 29, then you can generate the new dates you need.
insert /*+ append */ into ...
with list_of_numbers as (
  select rownum as day_add
  from dual
  connect by level <= 29
)
select date_col + day_add, ...
from ...,
     list_of_numbers;
You might want to set NOLOGGING on the table, since this is test data.
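A concrete version of that sketch, with purely hypothetical names ("big_table" for the table, "event_date" for the date column, "id" and "payload" standing in for the remaining columns), might look like:
insert /*+ append */ into big_table (id, payload, event_date)
with list_of_numbers as (
  select rownum as day_add
  from dual
  connect by level <= 29
)
select t.id, t.payload, t.event_date + n.day_add
from big_table t
     cross join list_of_numbers n;
Since only Partition1 currently holds rows, the select reads just those 8 million records, and adding day_add to the date routes each copy into the matching partition; the two existing indexes are maintained by the insert itself.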
I'm developing against a DB2 database, and at some point I get an error code "-670" when trying to add a new column.
The error code indicates an insufficiently sized tablespace page size. I just went and ran a DESCRIBE command and I estimate I don't have more than 17K for the table width (I simply added up the numeric values in the "Length" column), but I'm not sure of that estimate since I have many BLOB columns. Is there a SQL command (or DB2 command line utility) I could use to retrieve the exact info regarding the table width?
The sum of the LENGTH values in the output of the DESCRIBE TABLE command is a fairly accurate gauge of row width if you don't count the BLOB, CLOB, or LONG VARCHAR columns, which are not stored inline with the rest of the columns. There is a small amount of overhead bytes that aren't shown in that report, but it's usually not a significant portion of the table. DB2 has historically stored large objects separately to improve manageability and performance of the rest of the data in the table. DB2 has recently supported storing large objects inline in order to make use of compression and buffering, but I haven't seen it used widely and I doubt it will become a popular approach.
It sounds like it's time for you to relocate your table to a tablespace with a larger page size. Unless you're maxed out at a 32K page already, you have the option of doubling your page size by migrating your table to a larger bufferpool and tablespace, which will give you more room for additional columns. If you need to keep the data from the old table, loading from a cursor is a quick way to copy a large amount of data from one table to another within the same database. Your other option is to export the table's contents to a flatfile so you can drop and recreate the table in the wider tablespace and load the data back in.
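As a rough sketch of that migration (assuming DB2 for LUW on automatic storage, and that you are not already on 32K pages; every object name below is hypothetical, and the DECLARE/LOAD lines are CLP commands rather than plain SQL):
CREATE BUFFERPOOL bp32k PAGESIZE 32K;
CREATE TABLESPACE ts32k PAGESIZE 32K BUFFERPOOL bp32k;

-- Recreate the table in the wider tablespace, then copy the data with a cursor load
CREATE TABLE myschema.my_table_new LIKE myschema.my_table IN ts32k;
DECLARE c1 CURSOR FOR SELECT * FROM myschema.my_table;
LOAD FROM c1 OF CURSOR INSERT INTO myschema.my_table_new;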
Answering my own question: this script can be very useful for getting a good estimate of the used table width (and hence an idea of the remaining free space):
select SUM(300) from sysibm.syscolumns where tbname = 'MY_TABLE' and (typename = 'BLOB' or typename = 'DBCLOB')
select 2 * SUM(length) from sysibm.syscolumns where tbname = 'MY_TABLE' and typename = 'VARGRAPHIC'
select SUM(length) from sysibm.syscolumns where tbname = 'MY_TABLE' and typename != 'BLOB' and typename != 'DBCLOB' and typename != 'VARGRAPHIC'
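The three queries above can also be rolled into a single pass over the catalog (same assumptions, same view):
select SUM(case
             when typename in ('BLOB', 'DBCLOB') then 300        -- estimate for the inline LOB descriptor
             when typename = 'VARGRAPHIC'        then 2 * length
             else length
           end) as estimated_row_width
from sysibm.syscolumns
where tbname = 'MY_TABLE';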