timescaledb compression and relations on multiple segment - postgresql

I'm currently trying to store a "blockchain" like into TimescaleDB and wanted to leverage the power of compression but without losing the performance on relational queries.
Given this simple schema with millions of blocks and billions of transactions:
blocks
-------------------------
block_number bigint
created_at timestamp
transactions
-------------------------
hash text
block_number bigint
from text
to text
created_at timestamp
Creating hypertables with this design and adding compression on the transactions table gives me terrible performance when I want to query for instance:
SELECT * FROM transactions WHERE transactions.from = lower('xxxx') AND transactions.to = lower('xxxx');
From what I understand, this is expected because data is having index on created_at and segmentby cannot help me there's because it's a "two-dimension" with "incoming transactions" and "outgoing transactions".
Because compression is really effective with TimescaleDB it should not bother to duplicate the data, am I right?
I was thinking if a database design like this could give me good performance and compression:
blocks
-------------------------
block_number bigint
created_at timestamp
transactions
-------------------------
hash text
block_number bigint
from text
to text
created_at timestamp
incoming_transactions
-------------------------
hash text
account text
created_at timestamp
outgoing_transactions
-------------------------
hash text
account text
created_at timestamp
If I compress and segmentby account the outgoing_transactions and incoming_transactions would this query be efficient:
SELECT
transactions.*
FROM
transactions
INNER JOIN outgoing_transactions ON
incoming_transactions.hash = transactions.hash AND account = lower('xxx')
INNER JOIN outgoing_transactions ON
outgoing_transactions.hash = transactions.hash AND account = lower('xxx')
Or am I completely wrong and should stick with indices on transactions.from and transactions.to and without compression?
Thank you

Related

Sometime Postgresql Query takes to much time to load Data

I am using PostgreSQL 13.4 and has intermediate level experience with PostgreSQL.
I have a table which stores reading from a device for each customer. For each customer we receive around ~3,000 data in a day and we store them in to our usage_data table as below
We have the proper indexing in place.
Column | Data Type | Index name | Idx Access Type
-------------+-----------------------------+---------------------------+---------------------------
id | bigint | |
customer_id | bigint | idx_customer_id | btree
date | timestamp | idx_date | btree
usage | bigint | idx_usage | btree
amount | bigint | idx_amount | btree
Also I have common index on 2 columns which is below.
CREATE INDEX idx_cust_date
ON public.usage_data USING btree
(customer_id ASC NULLS LAST, date ASC NULLS LAST)
;
Problem
Few weird incidents I have observed.
When I tried to get data for 06/02/2022 for a customer it took almost 20 seconds. It's simple query as below
SELECT * FROM usage_data WHERE customer_id =1 AND date = '2022-02-06'
The execution plan
When I execute same query for 15 days then I receive result in 32 milliseconds.
SELECT * FROM usage_data WHERE customer_id =1 AND date > '2022-05-15' AND date <= '2022-05-30'
The execution plan
Tried Solution
I thought it might be the issue of indexing as for this particular date I am facing this issue.
Hence I dropped all the indexing from the table and recreate it.
But the problem didn't resolve.
Solution Worked (Not Recommended)
To solve this I tried another way. Created a new database and restored the old database
I executed the same query in new database and this time it takes 10 milliseconds.
The execution plan
I don't think this is proper solution for production server
Any idea why for any specific data we are facing this issue?
Please let me know if any additional information is required.
Please guide. Thanks

Benefit to adding an Index for an order by column?

We have a large table (2.8M rows) where we are finding a single row by our device_token column
CREATE TABLE public.rpush_notifications (
id bigint NOT NULL,
device_token character varying,
data text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
...
We are constantly doing the following query:
SELECT * FROM rpush_notifications WHERE device_token = '...' ORDER BY updated_at DESC LIMIT 1
I'd like to add a index for our device_token column, and I'm wondering if there is any benefit to creating an additional index for updated_at or creating a multicolumn index for both columns device_token and updated_at given that we are ordering by, i.e.:
CREATE INDEX foo ON rpush_notifications(device_token, updated_at)
I have been unable to find an answer that would help me understand if there would be any performance benefit to adding updated_at to the index given the query we are running above. Any help appreciated. We are running Postgresql11
There is a performance benefit if you combine both columns just like you did ((device_token, updated_at)), because the database can easily find the entries with the specific device_token and it does not need to do the sorting during the query.
Even better would be an index on (device_token, updated_at DESC) as it gives you the requested row as the first one with this device_token, so there is no need to get the first and start a sequential scan from there on to find the last.

Can Postgres table partitioning by hash boost query performance?

I have an OLTP table on a Postgres 14.2 database that looks something like this:
Column | Type | Nullable |
----------------+-----------------------------+-----------
id | character varying(32) | not null |
user_id | character varying(255) | not null |
slug | character varying(255) | not null |
created_at | timestamp without time zone | not null |
updated_at | timestamp without time zone | not null |
Indexes:
"playground_pkey" PRIMARY KEY, btree (id)
"playground_user_id_idx" btree (user_id)
The database host has 8GB of RAM and 2 CPUs.
I have roughly 500M records in the table which adds up to about 80GB in size.
The table gets about 10K INSERT/h, 30K SELECT/h, and 5K DELETE/h.
The main query run against the table is:
SELECT * FROM playground WHERE user_id = '12345678' and slug = 'some-slug' limit 1;
Users have anywhere between 1 record to a few hundred records.
Thanks to the index on the user_id I generally get decent performance (double-digit milliseconds), but 5%-10% of the queries will take a few hundred milliseconds to maybe a second or two at worst.
My question is this: would partitioning the table by hash(user_id) help me boost lookup performance by taking advantage of partition pruning?
No, that wouldn't improve the speed of the query at all, since there is an index on that attribute. If anything, the increased planning time will slow down the query.
If you want to speed up that query as much as possible, create an index that supports both conditions:
CREATE INDEX ON playground (user_id, slug);
If slug is large, it may be preferable to index a hash:
CREATE INDEX ON playground (user_id, hashtext(slug));
and query like this:
SELECT *
FROM playground
WHERE user_id = '12345678'
AND slug = 'some-slug'
AND hashtext(slug) = hashtext('some-slug');
LIMIT 1;
Of course, partitioning could be a good idea for other reasons, for example to speed up autovacuum or CREATE INDEX.

Does the index column order matter on row insert in Postgresql?

I have a table similar to this one:
create table request_journal
(
id bigint,
request_body text,
request_date timestamp,
user_id bigint,
);
It is used for request logging purposes, so frequent inserts in it are expected (2k+ rps).
I want to create composite index on columns request_date and user_id to speed up execution of select queries like this:
select *
from request_journal
where request_date between '2021-07-08 10:00:00' and '2021-07-08 16:00:00'
and user_id = 123
order by request_date desc;
I tested select queries with (request_date desc, user_id) btree index and (user_id, request_date desc) btree index. With request_date leading column index select queries are executed about 10% faster, but in general performance of any of this indexes is acceptable.
So my question is does the index column order affect insertion time? I have not spotted any differences using EXPLAIN/EXPLAIN ANALYZE on insert query. Which index will be more build time efficient under "high load"?
It is hard to believe your test were done on any vaguely realistic data size.
At the rate you indicate, a 6 hour range would include over 43 million records. If the user_ids are spread evenly over 1e6 different values, I get the index leading with user_id to be a thousand times faster for that query than the one leading with request_date.
But anyway, for loading new data, assuming the new data is all from recent times, then the one with request_date should be faster as the part of the index needing maintenance while loading will be more concentrated in part of the index, and so better cached. But this would depend on how much RAM you have, what your disk system is like, and how many distinct user_ids you are loading data for.

Postgres: How to efficiently bucketize random event ids below (hour,config_id,sensor_id)

I have a large table "measurement" with 4 columns:
measurement-service=> \d measurement
Table "public.measurement"
Column | Type | Collation | Nullable | Default
-----------------------+-----------------------------+-----------+----------+---------
hour | timestamp without time zone | | not null |
config_id | bigint | | not null |
sensor_id | bigint | | not null |
event_id | uuid | | not null |
Partition key: RANGE (hour)
Indexes:
"hour_config_id_sensor_id_event_id_key" UNIQUE CONSTRAINT, btree (hour, config_id, sensor_id, event_id)
Number of partitions: 137 (Use \d+ to list them.)
An example of a partition name: "measurement_y2019m12d04"
And then i insert a lot of events as CSV via COPY to a temporary table, and from there i copy the table directly into the partition using ON CONFLICT DO NOTHING.
Example:
CREATE TEMPORARY TABLE 'tmp_measurement_y2019m12d04T02_12345' (
hour timestamp without timezone,
config_id bigint,
sensor_id bigint,
event_id uuid
) ON COMMIT DROP;
[...]
COPY tmp_measurement_y2019m12d04T02_12345 FROM STDIN DELIMITER ',' CSV HEADER;
INSERT INTO measurement_y2019m12d04 (SELECT * FROM tmp_measurement_y2019m12d04T02_12345) ON CONFLICT DO NOTHING;
I think i help postgres by sending CSV with data of the same hour only. Also within that hour, i remove all duplicates in the CSV. Therefore the CSV only contains unique rows.
But i send many batches for different hours. There is no order. It can be the hour of today, yesterday, the last week. Etc.
My approach worked alright so far, but i think i have reached a limit now. The insertion speed has become very slow. While the CPU is idle, i have 25% i/o wait. Subsystem is a RAID with several TB, using disks, that are not SSD.
maintenance_work_mem = 32GB
max_wal_size = 1GB
fsync = off
max_worker_processes = 256
wal_buffers = -1
shared_buffers = 64GB
temp_buffers = 4GB
effective_io_concurrency = 1000
effective_cache_size = 128GB
Each partition per day is around 20gb big and contains no more than 500m rows. And by maintaining the unique index per partition, i just duplicated the data once more.
The lookup speed, on the other hand, is quick.
I think the limit is in the maintenance of the btree with the rather random UUIDs in (hour,config_id,sensor_id). I constantly change it, its written out and has to be re-read.
I am wondering, if there is another approach. Basically i want uniqueness for (hour,config_id,sensor_id,event_id) and then a quick lookup per (hour,config_id,sensor_id).
I am considering removal of the unique index and only having an index over (hour,config_id,sensor_id). And then providing the uniqueness on the reader side. But it may slow down the reading, as the event_id can no longer be delivered via the index, when i lookup via (hour,config_id,sensor_id). It has to access the actual row to get the event_id.
Or i provide uniqueness via a hash index.
Any other ideas are welcome!
Thank you.
When you do the insert, you should specify an ORDER BY which matches the index of the table being inserted into:
INSERT INTO measurement_y2019m12d04
SELECT * FROM tmp_measurement_y2019m12d04T02_12345
order by hour, config_id, sensor_id, event_id
Only if this fails to give enough improvement would I consider any of the other options you list.
Hash indexes don't provide uniqueness. You can simulate it with an exclusion constraint, but I think they are less efficient. Exclusion constraints do support DO NOTHING, but not support DO UPDATE. So as long as your use case does not evolve to want DO UPDATE, you would be good on that front, but I still doubt it would actually solve the problem. If your bottleneck is IO from updating the index, hash would only make it worse as it is designed to scatter your data all over the place, rather than focus it in a small cacheable area.
You also mention parallel processing. For inserting into the temp table, that might be fine. But I wouldn't do the INSERT...SELECT in parallel. If IO is your bottleneck, that would probably just make it worse. Of course if IO is no longer the bottleneck after my ORDER BY suggestion, then ignore this part.