Good day everyone!
I have insert performance issues when inserting data into a table with a clustered columnstore index.
Table:
CREATE TABLE T
(
ID int NOT NULL,
cal int NOT NULL,
cod varchar(300) NULL
);
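For clarity: the clustered columnstore index is not shown in the DDL above. Assuming it was created the standard way, it would be something like the following (the index name is just illustrative):
-- Assumed/illustrative: the clustered columnstore index on T
CREATE CLUSTERED COLUMNSTORE INDEX CCI_T ON T;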
I am running a huge query in 100 batches; the number of rows inserted per batch is about 1.000.000.000.
Running the query and inserting into that table takes about 2 hours. Doing the same but inserting into a table without a clustered columnstore index takes 40 minutes.
Am I doing something wrong?
Any suggestions?
Thanks a lot
I am experimenting with creating a data warehouse for a relatively big dataset. Based on a ~10% sample of the data, I decided to partition some tables that are expected to exceed memory, which is currently 16 GB.
This is based on the recommendation in the PostgreSQL docs:
These benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
There is one particular table I am not sure how to partition. It is frequently queried in two different ways, with a WHERE clause that may include the primary key OR another indexed column, so I figured I need a range partition on the existing primary key and the other column (with the other column added to the primary key).
Knowing that the order of columns matters, and given the below information my question is:
What is the best order for primary key and range partitioning columns?
Original table:
CREATE TABLE items (
item_id BIGSERIAL NOT NULL, -- primary key
src_doc_id bigint NOT NULL, -- every item can exist in one src_doc only and a src_doc can have multiple items
item_name character varying(50) NOT NULL, -- used in `WHERE` clause with src_doc_id and guaranteed to be unique from source
attr_1 bool,
attr_2 bool, -- +15 other columns all bool or integer types
PRIMARY KEY (item_id)
);
CREATE INDEX index_items_src_doc_id ON items USING btree (src_doc_id);
CREATE INDEX index_items_item_name ON items USING hash (item_name);
The table size for 10% of the dataset is ~2 GB (result of pg_total_relation_size) with 3M+ rows, and loading and querying performance is excellent. But given that this table is expected to grow to 30M rows and ~20 GB, I do not know what to expect in terms of performance.
Partitioned table being considered:
CREATE TABLE items (
item_id BIGSERIAL NOT NULL,
src_doc_id bigint NOT NULL,
item_name character varying(50) NOT NULL,
attr_1 bool,
attr_2 bool,
PRIMARY KEY (item_id, src_doc_id) -- should the order be reversed?
) PARTITION BY RANGE (item_id, src_doc_id); -- should the order be reversed?
CREATE INDEX index_items_src_doc_id ON items USING btree (src_doc_id);
CREATE INDEX index_items_item_name ON items USING hash (item_name);
-- Ranges are not initially known, so MAXVALUE is used as the upper bound.
-- When the upper bound is known, the partition is detached and reattached using
-- the known upper bound, and a new partition is added for the next range (sketched below).
CREATE TABLE items_00 PARTITION OF items FOR VALUES FROM (MINVALUE, MINVALUE) TO (MAXVALUE, MAXVALUE);
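For illustration, the detach/reattach step I have in mind would look something like this (the cutoff 1000000 is purely a placeholder value, not a recommendation):
-- Sketch only: re-range items_00 once its real upper bound is known,
-- then add a partition for the next range (1000000 is a placeholder)
ALTER TABLE items DETACH PARTITION items_00;
ALTER TABLE items ATTACH PARTITION items_00
    FOR VALUES FROM (MINVALUE, MINVALUE) TO (1000000, MINVALUE);
CREATE TABLE items_01 PARTITION OF items
    FOR VALUES FROM (1000000, MINVALUE) TO (MAXVALUE, MAXVALUE);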
Table usage
On loading data, the load process (a Python script) looks up existing items based on src_doc_id and item_name and stores item_id, so it does not reinsert existing items. item_id gets referenced in a lot of other tables; no foreign keys are used.
When querying for analytics, item information is always looked up based on item_id.
So I can't decide on the suitable column order for the table's PRIMARY KEY and PARTITION BY RANGE.
Should it be (item_id, src_doc_id) or (src_doc_id, item_id)?
This table started out as short-term storage for meter data before it was to be validated and added to some long-term storage tables.
Turns out the clients want to keep this data for a long time since we saved it, and it is growing fast.
create table metering_meterreading
(
id bigserial not null, -- primary key
created_at timestamp with time zone not null,
updated_at timestamp with time zone not null,
timestamp timestamp with time zone not null, -- BTREE index
value numeric(15, 3) not null,
meter_device_id uuid not null, -- FK to meter_device, BTREE index
series_id uuid not null, -- FK to series, BTREE index
organization_id uuid not null -- FK to org, BTREE index
);
I am planning on dropping the primary key, since (org_id, meter_device_id, series_id, timestamp) makes a row unique. It was just added by my ORM (Django), and I didn't care about it when we started.
But since I pretty much always want to filter in organization, meter_device, and series to get a range of time series data I am wondering if it would be more efficient to have a multicolumn index on (organization_id, meter_device_id, series_id, timestamp) instead of the separate indexes.
I read somewhere that if I filter on a range, that column should be the rightmost one in the index.
This is still not a super-efficient table for time-series data, since it will grow large, but I am planning on fixing that by partitioning on a range, or maybe even using Timescale. But before partitioning I would like lookups in it to be as efficient as possible.
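For concreteness, the composite index I am considering would be something like this (the index name is just illustrative):
-- Illustrative only: one composite index with the range column (timestamp) last
CREATE INDEX meterreading_org_device_series_ts_idx
    ON metering_meterreading (organization_id, meter_device_id, series_id, "timestamp");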
I also saw an example somewhere that used a separate table to identify the metric:
create table metric
(
id bigserial primary key,
organization_id uuid not null,
meter_device_id uuid not null,
series_id uuid not null,
unique (organization_id, meter_device_id, series_id)
);
create table metering_meterreading
(
metric_id bigint not null, -- FK to metric, BTREE index
timestamp timestamp with time zone not null, -- BTREE index
value numeric(15, 3) not null,
created_at timestamp with time zone not null,
updated_at timestamp with time zone not null
);
But I am not sure if that is actually better than just putting them all in one table. It might impact the ingestion rate, since another table is involved now.
If (org_id, meter_device_id, series_id, timestamp) uniquely determines a table row, you need to use a multi-column primary key over all of them. That way you automatically have a four-column index on these columns. Just make sure that timestamp is last in the list; then that index will support your query ideally.
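As a sketch, assuming the ORM-generated primary key constraint carries PostgreSQL's default name (an assumption), that would be something like:
-- Sketch: replace the surrogate primary key with the natural four-column key;
-- the constraint name metering_meterreading_pkey is assumed (default naming)
ALTER TABLE metering_meterreading
    DROP CONSTRAINT metering_meterreading_pkey,
    ADD PRIMARY KEY (organization_id, meter_device_id, series_id, "timestamp");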
I have a PostgreSQL table which I am trying to convert to a TimescaleDB hypertable.
The table looks as follows:
CREATE TABLE public.data
(
event_time timestamp with time zone NOT NULL,
pair_id integer NOT NULL,
entry_id bigint NOT NULL,
event_data int NOT NULL,
CONSTRAINT con1 UNIQUE (pair_id, entry_id ),
CONSTRAINT pair_id_fkey FOREIGN KEY (pair_id)
REFERENCES public.pairs (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
When I attempt to convert this table to a TimescaleDB hypertable using the following command:
SELECT create_hypertable(
'data',
'event_time',
chunk_time_interval => INTERVAL '1 hour',
migrate_data => TRUE
);
I get the error:
ERROR: cannot create a unique index without the column "event_time" (used in partitioning)
Question 1: From this post, "How to convert a simple postgresql table to hypertable or timescale db table using created_at for indexing", my understanding is that this happens because I have specified a unique constraint (con1) which does not contain the column I am partitioning by, event_time. Is that correct?
Question 2: How should I change my table or hypertable to be able to convert it? I have added some details on how I plan to use the data and on the structure of the data below.
Data Properties and usage:
There can be multiple entries with the same event_time - those entries would have entry_ids that are in sequence
This means that if I have 2 entries (event_time 2021-05-18::10:16, id 105, <some_data>) and (event_time 2021-05-18::10:16, id 107, <some_data>) then the entry with id 106 would also have event_time 2021-05-18::10:16
The entry_id is not generated by me and I use the unique constraint con1 to ensure that I am not inserting duplicate data
I will query the data mainly on event_time e.g. to create plots and perform other analysis
At this point the database contains around 4.6 billion rows, but it should contain many more soon
I would like to take advantage of TimescaleDB's speed and good compression
I don't care too much about insert performance
Solutions I have been considering:
Pack all the events which have the same timestamp into an array somehow and keep them in one row. I think this would hurt compression and provide less flexibility when querying the data. Also, I would probably end up having to unpack the data on each query.
Remove the unique constraint con1 - then how do I ensure that I don't add the same row twice?
Expand unique constraint con1 to include event_time - would that not somehow decrease performance, while at the same time opening up the possibility that I accidentally insert 2 rows with the same entry_id and pair_id but different event_time? (I doubt this is a likely thing to happen, though.)
You understand correctly that UNIQUE (pair_id, entry_id) doesn't allow creating a hypertable from the table, since unique constraints need to include the partition key, i.e., event_time in your case.
I don't follow how the first option, where records with the same timestamp are packed into a single record, will help with uniqueness.
Removing the unique constraint will allow creating the hypertable, but, as you mentioned, you will lose the ability to enforce the constraint.
Adding the time column, e.g., UNIQUE (pair_id, entry_id, event_time), is quite a common approach, but it allows inserting duplicates with different timestamps, as you mentioned. It will perform worse than option 2 during inserts. You can replace the index on event_time (which you need, since you query on this column, and which is created automatically by TimescaleDB) with a unique index, so you save a little bit, e.g.:
CREATE UNIQUE INDEX indx ON data (event_time, pair_id, entry_id);
Another option is to manually create a unique constraint on each chunk table. This will guarantee uniqueness within a chunk, but it will still be possible to have duplicates in different chunks. The main drawback is that you will need to figure out how to create the constraint whenever a new chunk is created.
Unique constraints without the partition key are not supported in TimescaleDB, since checking uniqueness would require accessing all existing chunks, which would kill performance (or it would require a global index, which can be large). I don't think it is a common case for time-series data to have unique constraints, as they are usually related to artificially generated counter-based identifiers.
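Putting the pieces together, a minimal sketch of the drop-constraint-plus-unique-index route, assuming the DDL from the question, could be:
-- Sketch only: widen uniqueness to include the partition column, then convert
ALTER TABLE public.data DROP CONSTRAINT con1;
CREATE UNIQUE INDEX indx ON public.data (event_time, pair_id, entry_id);
SELECT create_hypertable(
    'data',
    'event_time',
    chunk_time_interval => INTERVAL '1 hour',
    migrate_data => TRUE,
    create_default_indexes => FALSE -- skip the default event_time index; the unique index covers it
);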
I have a table with the following structure (massiveJS default):
CREATE TABLE "myTable" (
id uuid NOT NULL DEFAULT uuid_generate_v4(),
body jsonb NOT NULL,
created_at timestamptz NULL DEFAULT now(),
updated_at timestamptz NULL,
CONSTRAINT mytable_pkey PRIMARY KEY (id)
);
This table has a GIN index on the body column and a btree index on the primary key (default).
The json object stored inside the body column has an array that can grow really large (up to 50k items).
We use this statement to append new items to the array:
UPDATE "myTable"
SET body = jsonb_set(body, '{myArray}', (body->'myArray')
|| '["item1", "item2", "item3", "itemx""]')
WHERE id = 'e6e325da-0e8b-4d2e-9481-bc7e03c195b1'
The system updates this array in several records frequently, and when the system is under heavy load this UPDATE starts getting really slow: PostgreSQL CPU usage goes to 100% and the whole application slows down.
Is there any way we could speed up this UPDATE statement so we can append/remove elements from the array without putting too much strain on PostgreSQL?
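For context, a rough way to see how much data each such UPDATE has to rewrite (the whole jsonb value is rewritten on every update) would be something like:
-- Rough diagnostic: stored size of the document and current array length,
-- using the id from the UPDATE above
SELECT pg_column_size(body) AS body_bytes,
       jsonb_array_length(body->'myArray') AS array_items
FROM "myTable"
WHERE id = 'e6e325da-0e8b-4d2e-9481-bc7e03c195b1';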
create table car_park (
id bigint primary key,
car_number bytea,
the_date date,
the_time timestamp
);
This table gets thousands of new rows daily, registering cars entering and leaving the car park, so it will be a huge table after several months. How should I rebuild or optimize my table to get fast query results?
Table Partitioning can be useful.
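For example (PostgreSQL declarative partitioning syntax as a sketch; partition names and date ranges are placeholders), the table could be range-partitioned by the_date:
-- Illustrative sketch: range partitioning by the_date, one partition per month
create table car_park (
    id bigint,
    car_number bytea,
    the_date date,
    the_time timestamp,
    primary key (id, the_date) -- the partition key must be part of the primary key
) partition by range (the_date);

create table car_park_2024_01 partition of car_park
    for values from ('2024-01-01') to ('2024-02-01');
create table car_park_2024_02 partition of car_park
    for values from ('2024-02-01') to ('2024-03-01');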