This is a puzzling question for me, and I would like to understand why.
I have two tables that are almost identical; the only differences are one column's data type and the sort key.
table                               mbytes    rows
stg_user_event_properties_hist      460948    2378751028
stg_user_event_properties_hist_1    246442    2513860837
Even though they have almost the same number of rows, the size of the first is close to double that of the second.
Here are the table structures
CREATE TABLE stg.stg_user_event_properties_hist
(
id bigint,
source varchar(20),
time_of_txn timestamp,
product varchar(50),
region varchar(50),
city varchar(100),
state varchar(100),
zip varchar(10),
price integer,
category varchar(50),
model varchar(50),
origin varchar(50),
l_code varchar(10),
d_name varchar(100),
d_id varchar(10),
medium varchar(255),
network varchar(255),
campaign varchar(255),
creative varchar(255),
event varchar(255),
property_name varchar(100),
property_value varchar(4000),
source_file_name varchar(255),
etl_batch_id integer,
etl_row_id integer,
load_date timestamp
);
CREATE TABLE stg.stg_user_event_properties_hist_1
(
id bigint,
source varchar(20),
time_of_txn timestamp,
product varchar(50),
region varchar(50),
city varchar(100),
state varchar(100),
zip varchar(10),
price integer,
category varchar(50),
model varchar(50),
origin varchar(50),
l_code varchar(10),
d_name varchar(100),
d_id varchar(10),
medium varchar(255),
network varchar(255),
campaign varchar(255),
creative varchar(255),
event varchar(255),
property_name varchar(100),
property_value varchar(4000),
source_file_name varchar(255),
etl_batch_id integer,
etl_row_id varchar(20),
load_date timestamp
);
The differences again: etl_row_id has data type varchar(20) in _1 and integer in the other table, and the first table has a sort key on the source column.
What would be the explanation for the size difference?
UPDATE:
The problem was both compression and sort keys. Even though the _1 table was created with CTAS, 11 of its 26 columns had different compression settings. Also, the first table was created with a compound sort key of 14 columns; after recreating it with no sort keys (it's a history table after all), its size went down to 231 GB.
I suspect that the larger table has different compression settings or no compression at all. You can use our view v_generate_tbl_ddl to generate the table DDL, which includes the compression settings.
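For a quick look at just the column encodings, here is a minimal sketch using the PG_TABLE_DEF catalog view (it only lists tables whose schema is on the current search_path, hence the SET first; schema and table names are taken from the question):
SET search_path TO stg;  -- PG_TABLE_DEF only shows tables on the search_path
SELECT tablename, "column", type, encoding  -- encoding = compression setting for each column
FROM pg_table_def
WHERE tablename LIKE 'stg_user_event_properties_hist%'
ORDER BY tablename, "column";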
Even with the same compression settings, table size can vary with different sort keys. The sort key is used to place the data into blocks on disk. If one sort key places lots of similar column values together, it will compress better and require less space.
The sizes are different for these two tables because one table is being allocated more blocks than the other based on its sort keys. For your bigger table, the distribution is happening in such a way that the disk blocks are not fully occupied, so more blocks are needed to store the same amount of data.
This happens because of Redshift's 1 MB block size and the way it stores data across slices and nodes. In general, data gets distributed across different nodes and slices based on the diststyle. For your case I am assuming this distribution is happening in a round-robin way, so slice 1 gets the first record, slice 2 gets the second record, and so on. Because the minimum block size in Redshift is 1 MB, every time a new record goes to a new slice, 1 MB gets allocated (even if the record only takes a few KB). For subsequent records to the same slice, data goes into the same 1 MB block until it is full, after which a new 1 MB block gets allocated on that slice. But if there are no more records after the first record for this slice, it still occupies a full 1 MB block. The total size of the table is the sum of all blocks being occupied, irrespective of how much data is present in those blocks.
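To see this directly, here is a hedged sketch that counts the 1 MB blocks allocated per column for the two tables, using the Redshift system tables STV_BLOCKLIST and STV_TBL_PERM (the table name pattern is taken from the question):
-- Each row in STV_BLOCKLIST is one 1 MB block, so COUNT(*) approximates the MB used per column
SELECT TRIM(p.name) AS table_name, b.col, COUNT(*) AS blocks_1mb
FROM stv_blocklist b
JOIN stv_tbl_perm p ON p.id = b.tbl AND p.slice = b.slice
WHERE TRIM(p.name) LIKE 'stg_user_event_properties_hist%'
GROUP BY 1, 2
ORDER BY 1, 2;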
The difference in table size could be due to the following reasons (a combined check is sketched after the list).
The encoding used for each column. (query PG_TABLE_DEF)
Distribution key used for the table. (query PG_TABLE_DEF)
Vacuum performed on the table. (query SVV_VACUUM_SUMMARY)
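As the combined check mentioned above, here is a sketch using SVV_TABLE_INFO, which summarizes encoding, distribution style, first sort key column, size (in 1 MB blocks), row count, and unsorted percentage per table:
SELECT "table", encoded, diststyle, sortkey1, size, tbl_rows, unsorted
FROM svv_table_info
WHERE "schema" = 'stg'
  AND "table" LIKE 'stg_user_event_properties_hist%';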
If I’ve made a bad assumption please comment and I’ll refocus my answer.
Related
I am experimenting with creating a data warehouse for a relatively big dataset. Based on a ~10% sample of the data, I decided to partition some tables that are expected to exceed memory, which is currently 16 GB.
Based on the recommendation in the PostgreSQL docs:
These benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
There is one particular table I am not sure how to partition. This table is frequently queried in two different ways, with a WHERE clause that may include the primary key OR another indexed column, so I figured I need a range partition using the existing primary key and the other column (with the other column added to the primary key).
Knowing that the order of columns matters, and given the information below, my question is:
What is the best order for primary key and range partitioning columns?
Original table:
CREATE TABLE items (
item_id BIGSERIAL NOT NULL, -- primary key
src_doc_id bigint NOT NULL, -- every item can exist in one src_doc only and a src_doc can have multiple items
item_name character varying(50) NOT NULL, -- used in `WHERE` clause with src_doc_id and guaranteed to be unique from source
attr_1 bool,
attr_2 bool, -- +15 other columns all bool or integer types
PRIMARY KEY (item_id)
);
CREATE INDEX index_items_src_doc_id ON items USING btree (src_doc_id);
CREATE INDEX index_items_item_name ON items USING hash (item_name);
The table size for 10% of the dataset is ~2 GB (the result of pg_total_relation_size) with 3M+ rows. Loading and querying performance is excellent, but given that this table is expected to grow to 30M rows and ~20 GB, I do not know what to expect in terms of performance.
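(For reference, that size figure comes from a call along these lines, a one-line sketch using the table above:)
SELECT pg_size_pretty(pg_total_relation_size('items'));  -- table + indexes + TOAST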
Partitioned table being considered:
CREATE TABLE items (
item_id BIGSERIAL NOT NULL,
src_doc_id bigint NOT NULL,
item_name character varying(50) NOT NULL,
attr_1 bool,
attr_2 bool,
PRIMARY KEY (item_id, src_doc_id) -- should the order be reversed?
) PARTITION BY RANGE (item_id, src_doc_id); -- should the order be reversed?
CREATE INDEX index_items_src_doc_id ON items USING btree (src_doc_id);
CREATE INDEX index_items_item_name ON items USING hash (item_name);
-- Ranges are not initially known, so MAXVALUE is used as the upper bound;
-- when the upper bound is known, the partition is detached and reattached
-- using the known upper bound, and a new partition is added for the next range.
CREATE TABLE items_00 PARTITION OF items FOR VALUES FROM (MINVALUE, MINVALUE) TO (MAXVALUE, MAXVALUE);
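A sketch of the maintenance step described in the comment above, with hypothetical bound values (1000000, 500) standing in for the real upper bound once it is known:
-- Detach the open-ended partition, reattach it with the now-known upper bound,
-- then add a new open-ended partition for the next range
ALTER TABLE items DETACH PARTITION items_00;
ALTER TABLE items ATTACH PARTITION items_00
  FOR VALUES FROM (MINVALUE, MINVALUE) TO (1000000, 500);
CREATE TABLE items_01 PARTITION OF items
  FOR VALUES FROM (1000000, 500) TO (MAXVALUE, MAXVALUE);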
Table usage
On loading data, the load process (a Python script) looks up existing items based on src_doc_id and item_name and stores item_id, so it does not reinsert existing items. item_id gets referenced in a lot of other tables; no foreign keys are used.
On querying for analytics, item information is always looked up based on item_id.
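For reference, minimal sketches of the two access patterns described above ($1 and $2 are parameter placeholders):
-- Load-time lookup: find an existing item by (src_doc_id, item_name)
SELECT item_id FROM items WHERE src_doc_id = $1 AND item_name = $2;
-- Analytics lookup: fetch item information by item_id
SELECT * FROM items WHERE item_id = $1;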
So I can't decide on the suitable order for the table's PRIMARY KEY and PARTITION BY RANGE:
Should it be (item_id, src_doc_id) or (src_doc_id, item_id)?
I am working with the following use case:
A large number of users, each with their own separate time series. Each measurement in the time series has been made with a device that has some accompanying metadata.
To create a TimescaleDB hypertable for this, I did the following:
CREATE TABLE devices (
id VARCHAR NOT NULL,
brand VARCHAR,
model VARCHAR,
serial_number VARCHAR,
mac VARCHAR,
firmware VARCHAR,
UNIQUE (id),
PRIMARY KEY (id)
);
CREATE TABLE measurements (
time TIMESTAMPTZ NOT NULL,
measurement_location VARCHAR,
measurement_value DOUBLE PRECISION NOT NULL,
device_id VARCHAR NOT NULL,
customer_id VARCHAR NOT NULL,
FOREIGN KEY (device_id) REFERENCES devices (id)
);
SELECT create_hypertable('measurements', 'time');
ALTER TABLE measurements SET (
timescaledb.compress,
timescaledb.compress_segmentby='customer_id'
);
I wanted to segment by the customer id, since all measurements for a customer are what will generally be queried.
However, when I do this I get the following error:
ERROR: column "device_id" must be used for segmenting
DETAIL: The foreign key constraint "measurements_device_id_fkey" cannot be enforced with the given compression configuration.
Why is it that I must use the foreign key for my segmentation? Is there another better way to accomplish what I want to do here?
Timescale engineer here. One of the limitations with compression is that we cannot cascade deletes from foreign tables to compressed hypertables unless the referencing column is a non-compressed column. Segment-by columns are stored in non-compressed form. That's the reason behind the restriction on foreign key constraints.
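A sketch of one workaround consistent with that restriction (an assumption on my part, not something stated in the answer): keep customer_id as the primary segment-by column but add the foreign key column to the list, since compress_segmentby accepts a comma-separated column list:
ALTER TABLE measurements SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'customer_id, device_id'  -- device_id stays non-compressed, so the FK can be enforced
);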
I am currently modeling a table schema for PostgreSQL that has a lot of columns and is intended to hold a lot of rows. I don't know if it is faster to have more columns or to split the data into more rows.
The schema looks like this (shortened):
CREATE TABLE child_table (
PRIMARY KEY(id, position),
id bigint REFERENCES parent_table(id) ON DELETE CASCADE,
position integer,
account_id bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
attribute_1 integer,
attribute_2 integer,
attribute_3 integer,
-- about 60 more columns
);
At most 10 rows of child_table are related to one row of parent_table. The order is given by the value in position, which ranges from 1 to 10. parent_table is intended to hold 650 million rows. With this schema I would end up with 6.5 billion rows in child_table.
Is it smart to do this? Or is it better to model it this way so that I only have 650 million rows:
CREATE TABLE child_table (
PRIMARY KEY(id),
id bigint,
parent_id bigint REFERENCES parent_table(id) ON DELETE CASCADE,
account_id_1 bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
attribute_1_1 integer,
attribute_1_2 integer,
attribute_1_3 integer,
account_id_2 bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
attribute_2_1 integer,
attribute_2_2 integer,
attribute_2_3 integer,
-- [...]
);
The number of columns and rows matters less than how well they are indexed. Indexes drastically reduce the number of rows which need to be searched. In a well-indexed table, the total number of rows is irrelevant. If you try to smash 10 rows into one row you'll make indexing much harder. It will also make writing efficient queries which use those indexes harder.
Postgres has many different types of indexes to cover many different types of data and searches. You can even write your own (though that shouldn't be necessary).
At most 10 rows of child_table are related to one row of parent_table.
Avoid encoding business logic in your schema. Business logic changes all the time, especially arbitrary numbers like 10.
One thing you might consider is reducing the number of attribute columns; 60 is a lot, especially if they are actually named attribute_1, attribute_2, etc. Instead, if your attributes are not well defined, store them in a single JSONB column with keys and values. Postgres' JSON operations are very efficient (provided you use the jsonb type) and provide a nice middle ground between a key/value store and a relational database.
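A hedged sketch of that JSONB variant, reusing the column names from the question; the GIN index is an assumption about how the attributes would be searched:
CREATE TABLE child_table (
  id         bigint REFERENCES parent_table(id) ON DELETE CASCADE,
  position   integer,
  account_id bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
  attributes jsonb,  -- e.g. {"attribute_1": 1, "attribute_2": 0, ...}
  PRIMARY KEY (id, position)
);
CREATE INDEX child_table_attributes_idx ON child_table USING gin (attributes);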
Similarly, if any sets of attributes are simple lists (like address1, address2, address3), you can also consider using Postgres arrays.
I can't give better advice than this without specifics.
I've set up a table accordingly:
CREATE TABLE raw (
id SERIAL,
regtime float NOT NULL,
time float NOT NULL,
source varchar(15),
sourceport INTEGER,
destination varchar(15),
destport INTEGER,
blocked boolean
); ... + index and grants
I've successfully used this table for a while now, and all of a sudden the following insert no longer works:
INSERT INTO raw(
time, regtime, blocked, destport, sourceport, source, destination
) VALUES (
1403184512.2283964, 1403184662.118, False, 2, 3, '192.168.0.1', '192.168.0.2'
);
The error is: ERROR: integer out of range
I'm not even sure where to begin debugging this. I'm not out of disk space, and the error itself doesn't give much to go on.
SERIAL columns are stored as INTEGERs, giving them a maximum value of 2^31 - 1. So after ~2 billion inserts, your new id values will no longer fit.
If you expect this many inserts over the life of your table, create it with a BIGSERIAL (internally a BIGINT, with a maximum of 2^63 - 1).
If you discover later on that a SERIAL isn't big enough, you can increase the size of an existing field with:
ALTER TABLE raw ALTER COLUMN id TYPE BIGINT;
Note that it's BIGINT here, rather than BIGSERIAL (as serials aren't real types). And keep in mind that, if you actually have 2 billion records in your table, this might take a little while...
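If you want to see how close you are before (or instead of) altering the column, here is a minimal sketch, assuming the default sequence name raw_id_seq that SERIAL creates:
SELECT last_value,
       2147483647 - last_value AS ids_left_before_overflow  -- 2^31 - 1 is the INTEGER ceiling
FROM raw_id_seq;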
I have a table that is partitioned on a time field. I have 25 partitions. Now I am considering partitioning it further using the object type field. I have ten object types, so it will result in 250 partitions. According to what I have read, the recommended number of partitions is a few dozen, but in my case the schema is very simple and doesn't include any joins, so I wonder if it's OK to define that many partitions. I am using Postgres version 9.1.2.
CREATE TABLE metric_store.lc_aggregated_data_master_10_minutes
(
from_time integer,
object_id integer,
object_type integer,
latencies_client_fetch_sec_sum bigint,
latencies_client_rttsec_sum bigint,
latencies_db_bci_res_sec_sum bigint,
latencies_net_infrastructure_ttlb_sec_sum bigint,
latencies_retransmissions_sec_sum bigint,
latencies_ttfbsec_sum bigint,
latencies_ttlbsec_sum bigint,
latencies_ttlbsec_sumsqr bigint,
latencies_ttlbsec_histogram_level0 integer,
latencies_ttlbsec_histogram_level1 integer,
latencies_ttlbsec_histogram_level2 integer,
latencies_ttlbsec_histogram_level3 integer,
latencies_ttlbsec_histogram_level4 integer,
latencies_ttlbsec_histogram_level5 integer,
latencies_ttlbsec_histogram_level6 integer,
latencies_ttlbsec_histogram_level7 integer,
usage_bytes_total bigint,
usage_hits_total integer,
latencies_server_net_ttlbsec_sum bigint,
latencies_server_rttsec_sum bigint,
avaiability_errors_total integer
)
WITH (
OIDS=FALSE
);
ALTER TABLE metric_store.lc_aggregated_data_master_10_minutes
OWNER TO postgres;
CREATE TABLE metric_store.lc_aggregated_data_10_minutes_from_1353070800
(
CONSTRAINT lc_aggregated_data_10_minutes_from_1353070800_pkey PRIMARY KEY (from_time , object_id ),
CONSTRAINT lc_aggregated_data_10_minutes_from_1353070800_from_time_check CHECK (from_time >= 1353070800 AND from_time < 1353190800)
)
INHERITS (metric_store.lc_aggregated_data_master_10_minutes)
WITH (
OIDS=FALSE
);
ALTER TABLE metric_store.lc_aggregated_data_10_minutes_from_1353070800
OWNER TO postgres;
CREATE INDEX lc_aggregated_data_10_minutes_from_1353070800_obj_typ_idx
ON metric_store.lc_aggregated_data_10_minutes_from_1353070800
USING btree
(from_time , object_type );
The current version (9.2) has this guidance about the number of partitions. (That guidance hasn't changed since 8.3.)
All constraints on all partitions of the master table are examined
during constraint exclusion, so large numbers of partitions are likely
to increase query planning time considerably. Partitioning using these
techniques will work well with up to perhaps a hundred partitions;
don't try to use many thousands of partitions.
From reading the PostgreSQL mailing lists, I believe increasing the time for query planning is the main problem you face.
If your partitions can segregate hot data from cold data, or if your partitions can group clustered sets of data you frequently query, you will probably be OK. But testing is your best bet: run EXPLAIN ANALYZE on representative queries against the unpartitioned table, then do the same after partitioning. Choose the representative queries before you analyze any of them.
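A sketch of that comparison, with a representative query shape assumed from the schema above; run it against the current layout, then again after adding the object_type partitions, and compare the plans and timings:
EXPLAIN ANALYZE
SELECT object_id, SUM(usage_hits_total) AS hits
FROM metric_store.lc_aggregated_data_master_10_minutes
WHERE from_time >= 1353070800 AND from_time < 1353190800
  AND object_type = 3
GROUP BY object_id;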