Database design for time series

Database design for time series - postgresql

Approximately every 10 min I insert ~50 records with the same timestamp.
It means ~600 records per hour or 7.200 records per day or 2.592.000 records per year.
User wants to retrieve all records for the timestamp closest to the asked time.
Design #1 - one table with index on timestamp column:
CREATE TABLE A (t timestamp, value int);
CREATE a_idx ON A (t);
Single insert statement creates ~50 records with the same timestamp:
INSERT INTO A VALUES (
(‘2019-01-02 10:00’, 5),
(‘2019-01-02 10:00’, 12),
(‘2019-01-02 10:00’, 7),
….
)
Get all records which are closest to the asked time
(I use the function greatest() available in PostgreSQL):
SELECT * FROM A WHERE t =
(SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
I think this query is not efficient because it requires the full table scan.
I plan to partition the A table by timestamp to have 1 partition per year, but the approximate match above still will be slow.
Design #2 - create 2 tables:
1st table: to keep the unique timestamps and auto-incremented PK,
2nd table: to keep data and the foreign key on 1st table PK
CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
CREATE INDEX data_time_idx ON DATA (id);
Get all records which are closest to the asked time:
SELECT * FROM DATA WHERE id =
(SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
It should run faster compared to Design #1 because the nested select scans the smaller table.
Disadvantage of this approach:
- I have to insert into 2 tables instead just one
- I lost the ability to partition the DATA table by timestamp
What you could recommend?

I'd go with tje single table approach, perhaps partitioned by year so that it becomes easy to get rid of old data.
Create an index like
CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));
Then use your query like you wrote it, but add
AND date_trunc('hour', t + INTERVAL '30 minutes')
= date_trunc('hour', asked_time + INTERVAL '30 minutes')
The additional condition acts as a filter and can use the index.

You can use a UNION of two queries to find all timestamps closest to a given one:
(
select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1
)
union all
(
select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1
)
That will efficiently make use of an index on t. On a table with 10 million rows (~3 years of data), I get the following execution plan:
Append (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
Buffers: shared hit=6 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
Output: a.t
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Index Only Scan using a_t_idx on stuff.a (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
Output: a.t
Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
Output: a_1.t
Buffers: shared hit=5
-> Index Only Scan Backward using a_t_idx on stuff.a a_1 (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
Output: a_1.t
Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms
As you can see it only requires very few I/O operations and that is pretty much independent of the table size.
The above can be used for an IN condition:
select *
from a
where t in (
(select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1)
union all
(select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1)
);
If you know you will never have more than 100 values close to that requested timestamp, you could remove the IN query completely and simply use a limit 100 in both parts of the union. That makes the query a bit more efficient as there is no second step for evaluating the IN condition, but might return more rows than you want.
If you always look for timestamps in the same year, then partitioning by year will indeed help with this.
You can put that into a function if it is too complicated as a query:
create or replace function get_closest(p_tocheck timestamp)
returns timestamp
as
$$
select *
from (
(select t
from a
where t >= p_tocheck
order by t
limit 1)
union all
(select t
from a
where t <= p_tocheck
order by t desc
limit 1)
) x
order by greatest(t - p_tocheck, p_tocheck - t)
limit 1;
$$
language sql stable;
The the query gets as simple as:
select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');
Another solution is to use the btree_gist extension which provides a "distance" operator <->
Then you can create a GiST index on the timestamp:
create index on a using gist (t) ;
and use the following query:
select *
from a where t in (select t
from a
order by t <-> timestamp '2019-03-01 17:00:00'
limit 1);

Related

Forcing Postgresql to use Merge Append

Say I have the following tables and indices:
create table inbound_messages(id int, user_id int, received_at timestamp);
create table outbound_messages(id int, user_id int, sent_at timestamp);
create index on inbound_messages(user_id, received_at);
create index on outbound_messages(user_id, sent_at);
Now I want to pull out the last 20 messages for a user, either inbound or outbound in a specific time range. I can do the following and from the explain it looks like PG walks back both indices in 'parallel' so it minimises the amount of rows it needs to scan.
explain select * from (select id, user_id, received_at as time from inbound_messages union all select id, user_id, sent_at as time from outbound_messages) x where user_id = 5 and time between '2018-01-01' and '2020-01-01' order by user_id,time desc limit 20;
Limit (cost=0.32..16.37 rows=2 width=16)
-> Merge Append (cost=0.32..16.37 rows=2 width=16)
Sort Key: inbound_messages.received_at DESC
-> Index Scan Backward using inbound_messages_user_id_received_at_idx on inbound_messages (cost=0.15..8.17 rows=1 width=16)
Index Cond: ((user_id = 5) AND (received_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (received_at <= '2020-01-01 00:00:00'::timestamp without time zone))
-> Index Scan Backward using outbound_messages_user_id_sent_at_idx on outbound_messages (cost=0.15..8.17 rows=1 width=16)
Index Cond: ((user_id = 5) AND (sent_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (sent_at <= '2020-01-01 00:00:00'::timestamp without time zone))
For example it could do something crazy like find all the matching rows in memory, and then sort the rows. Lets say there were millions of matching rows then this could take a long time. But because it walks the indices in the same order we want the results in this is a fast operation. It looks like the 'Merge Append' operation is done lazily and it doesn't actually materialize all the matching rows.
Now we can see postgres supports this operation for two distinct tables, however is it possible to force Postgres to use this optimisation for a single table.
Lets say I wanted the last 20 inbound messages for user_id = 5 or user_id = 6.
explain select * from inbound_messages where user_id in (6,7) order by received_at desc limit 20;
Then we get a query plan that does a bitmap heap scan, and then does an in-memory sort. So if there are millions of messages that match then it will look at millions of rows even though theoretically it could use the same Merge trick to only look at a few rows.
Limit (cost=15.04..15.09 rows=18 width=16)
-> Sort (cost=15.04..15.09 rows=18 width=16)
Sort Key: received_at DESC
-> Bitmap Heap Scan on inbound_messages (cost=4.44..14.67 rows=18 width=16)
Recheck Cond: (user_id = ANY ('{6,7}'::integer[]))
-> Bitmap Index Scan on inbound_messages_user_id_received_at_idx (cost=0.00..4.44 rows=18 width=0)
Index Cond: (user_id = ANY ('{6,7}'::integer[]))
We could think of just adding (received_at) as an index on the table and then it will do the same backwards scan. However, if we have a large number of users then we are missing out on a potentially large speedup because we are scanning lots of index entries that would not match the query.

The following approach should work as a way of forcing Postgres to use the "merge append" plan when you are interested in most recent messages for two users from the same table.
[Note: I tested this on YugabyteDB (which is based on Postgres)- so I expect the same to apply to Postgres also.]
explain select * from (
(select * from inbound_messages where user_id = 6 order by received_at DESC)
union all
(select * from inbound_messages where user_id = 7 order by received_at DESC)
) AS result order by received_at DESC limit 20;
which produces:
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.01..3.88 rows=20 width=16)
-> Merge Append (cost=0.01..38.71 rows=200 width=16)
Sort Key: inbound_messages.received_at DESC
-> Index Scan Backward using inbound_messages_user_id_received_at_idx on inbound_messages (cost=0.00..17.35 rows=100 width=16)
Index Cond: (user_id = 6)
-> Index Scan Backward using inbound_messages_user_id_received_at_idx on inbound_messages inbound_messages_1 (cost=0.00..17.35 rows=100 width=16)
Index Cond: (user_id = 7)

How to auto-extract or index day from timestamp in Postgres?

We have a Postgres table with a timestamp field created_at. On a regular basis, we need to find all the records with the day field of created_at being a certain number.
We can run a query like
select * from table where extract(day from created_at) = 3;
I suspect this isn't efficient, ie it's doing a full-table scan. If so, can I create an index somehow to make the above efficient?
If it's not possible, we can create a separate column called created_at_day and create an index on it.
So we can simply run the query like
select * from table where created_at_day = 3;
Let's say created_at can be updated. Whenever this happens, created_at_day should be updated, too.
Does Postgres provide any support to automatically keep created_at_day in sync with created_at? If so, how?
Of course this can be done in the application logic. So whenever created_at is created or updated, we update the created_at_day column. But just wondering if there's an easier, automated way to do this.
Thanks

You can create an index on extract(day from created_at)
To see the difference:
Create a table
knayak=# create table t as select i ,now()::timestamp + interval '1 days' * i as created_at from generate_series(1,10000) as i;
SELECT 10000
Create normal index on created_at
knayak=# create index ind_created_at on t(created_at);
CREATE INDEX
knayak=# explain analyze select * from t where extract(day from created_at) = 3;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..205.00 rows=50 width=12) (actual time=1.049..6.020 rows=328 loops=1)
Filter: (date_part('day'::text, created_at) = '3'::double precision)
Rows Removed by Filter: 9672
Planning time: 0.392 ms
Execution time: 6.070 ms
(5 rows)
Create index with extract
knayak=# drop index ind_created_at;
DROP INDEX
knayak=# create index ind_created_at on t( extract(day from created_at) );
CREATE INDEX
knayak=# explain analyze select * from t where extract(day from created_at) = 3;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=4.67..61.66 rows=50 width=12) (actual time=0.110..0.260 rows=328 loops=1)
Recheck Cond: (date_part('day'::text, created_at) = '3'::double precision)
Heap Blocks: exact=54
-> Bitmap Index Scan on ind_created_at (cost=0.00..4.66 rows=50 width=0) (actual time=0.093..0.093 rows=328 loops=1)
Index Cond: (date_part('day'::text, created_at) = '3'::double precision)
Planning time: 0.316 ms
Execution time: 0.314 ms
(7 rows)

PostgreSQL increase group by over 30 million rows

Is there any way to increase speed of dynamic group by query ? I have a table with 30 million rows.
create table if not exists tb
(
id serial not null constraint tb_pkey primary key,
week integer,
month integer,
year integer,
starttime varchar(20),
endtime varchar(20),
brand smallint,
category smallint,
value real
);
The query below takes 8.5 seconds.
SELECT category from tb group by category
Is there any way to increase the speed. I have tried with and without index.

For that exact query, not really; doing this operation requires scanning every row. No way around it.
But if you're looking to be able to quickly get the set of unique categories, and you have an index on that column, you can use a variation of the WITH RECURSIVE example shown in the edit to the question here (look towards the end of the question):
Counting distinct rows using recursive cte over non-distinct index
You'll need to change it to return the set of unique values instead of counting them, but it looks like a simple change:
testdb=# create table tb(id bigserial, category smallint);
CREATE TABLE
testdb=# insert into tb(category) select 2 from generate_series(1, 10000)
testdb-# ;
INSERT 0 10000
testdb=# insert into tb(category) select 1 from generate_series(1, 10000);
INSERT 0 10000
testdb=# insert into tb(category) select 3 from generate_series(1, 10000);
INSERT 0 10000
testdb=# create index on tb(category);
CREATE INDEX
testdb=# WITH RECURSIVE cte AS
(
(SELECT category
FROM tb
WHERE category >= 0
ORDER BY 1
LIMIT 1)
UNION ALL SELECT
(SELECT category
FROM tb
WHERE category > c.category
ORDER BY 1
LIMIT 1)
FROM cte c
WHERE category IS NOT NULL)
SELECT category
FROM cte
WHERE category IS NOT NULL;
category
----------
1
2
3
(3 rows)
And here's the EXPLAIN ANALYZE:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on cte (cost=40.66..42.68 rows=100 width=2) (actual time=0.057..0.127 rows=3 loops=1)
Filter: (category IS NOT NULL)
Rows Removed by Filter: 1
CTE cte
-> Recursive Union (cost=0.29..40.66 rows=101 width=2) (actual time=0.052..0.119 rows=4 loops=1)
-> Limit (cost=0.29..0.33 rows=1 width=2) (actual time=0.051..0.051 rows=1 loops=1)
-> Index Only Scan using tb_category_idx on tb tb_1 (cost=0.29..1363.29 rows=30000 width=2) (actual time=0.050..0.050 rows=1 loops=1)
Index Cond: (category >= 0)
Heap Fetches: 1
-> WorkTable Scan on cte c (cost=0.00..3.83 rows=10 width=2) (actual time=0.015..0.015 rows=1 loops=4)
Filter: (category IS NOT NULL)
Rows Removed by Filter: 0
SubPlan 1
-> Limit (cost=0.29..0.36 rows=1 width=2) (actual time=0.016..0.016 rows=1 loops=3)
-> Index Only Scan using tb_category_idx on tb (cost=0.29..755.95 rows=10000 width=2) (actual time=0.015..0.015 rows=1 loops=3)
Index Cond: (category > c.category)
Heap Fetches: 2
Planning time: 0.224 ms
Execution time: 0.191 ms
(19 rows)
The number of loops it has to do the WorkTable scan node will be equal to the number of unique categories you have plus one, so it should stay very fast up to, say, hundreds of unique values.
Another route you can take is to add another table where you just store unique values of tb.category and have application logic check that table and insert their value when updating/inserting that column. This can also be done database-side with triggers; that solution is also discussed in the answers to the linked question.

Postgres execution plan order of where clause and order by

I have a very simple SQL:
select * from email.email_task where acquire_time < now() and state IN ('CREATED', 'RELEASED') order by creation_time asc limit 1;
I have 2 indexes created:
Index of state
Index of state, acquire_time, creation_time
Ideally I think Postgres should pick the 2nd one since it matches every column required in this SQL:
However the execution plan shows differently, it uses neither of the indexes:
Limit (cost=187404.36..187404.36 rows=1 width=743)
-> Sort (cost=187404.36..190753.58 rows=1339690 width=743)
Sort Key: creation_time
-> Seq Scan on email_task (cost=0.00..180705.91 rows=1339690 width=743)
Filter: (((state)::text = 'CREATED'::text) AND (acquire_time < now()))
I understand that if the number of rows returned arrives like 10% of total, then it would pick Seq Scan over Index Scan. (As explained at Why does PostgreSQL perform sequential scan on indexed column?
) So that's why index1 is not picked.
What I don't understand is why index2 is not picked since matches all the columns?
Then I tried a 3rd index:
Index of create_time, acquire_time, state
And this time it uses the index3 (I add the index using another smaller database
perf_1 because the original one has 2 million rows and it takes too much time)
Limit (cost=0.29..0.36 rows=1 width=75) (actual time=0.043..0.043 rows=1 loops=1)
-> Index Scan using perf_1 on email_task (cost=0.29..763.76 rows=9998 width=75) (actual time=0.042..0.042 rows=1 loops=1)
Index Cond: (acquire_time < now())
Filter: ((state)::text = ANY ('{CREATED,RELEASED}'::text[]))
It seems that, Postgres execution planner is picking the order by clause first then the where clause which is a little bit counter-intuitive.
Is my understanding correct or there are some other factors that impact the Postgres planner?
Thanks in advance.

Redshift query runs forever from only 2 mins by just changing a specific number to a variable

My query will automatically update every month in rundeck with updated date by using sysdate.
Here is the first query to generate the time variables for further use.
create temp table variables as
select dateadd('MONTH', 0, sysdate) as endMonth,
dateadd('MONTH', 1, sysdate) as endMonthPlusOne,
-- dateadd('MONTH', -12, sysdate) as startMonth,
dateadd('MONTH', -1, sysdate) as startMonth
;
Here is my second query:
create table ml_data.user_events_filtered distkey(userid) sortkey(eventid) as
select *
from user_events
where 1=1
and eventid=1011
and servertime_monthly = date_trunc('month', (select startMonth
from variables limit 1))
This will run forever until I changed the query to:
create table ml_data.user_events_filtered distkey(userid) sortkey(eventid) as
select userid, eventid, clienttime, charparam1, charparam2, charparam3, charparam4
from user_events
where 1=1
and eventid=1011
and servertime_monthly = '2015-01-01'
;
It takes 2 mins to run. I checked every possible reasons. Does it has something to do with leader node. I don't know if the query planner compile the subquery into a fixed date before running the primary query. I ask this questions since if I only run the subquery, it takes 0.3s. Since my table has huge number of rows, in total it might sum up to a formidable number. The table user_events contains 40 billion rows.
Per Sami's response, I run the EXPLAIN clause for both of the code and here is my result.
XN Seq Scan on user_events (cost=0.01..434963947.53 rows=205328 width=65)
Filter: ((eventid = 1011) AND ((servertime_monthly)::timestamp without time zone = date_trunc('month'::text, $0)))
InitPlan
-> XN Limit (cost=0.00..0.01 rows=1 width=8)
-> XN Seq Scan on variables (cost=0.00..0.01 rows=1 width=8)
XN Seq Scan on user_events (cost=0.00..326222960.64 rows=2945182 width=65)
Filter: ((eventid = 1011) AND (servertime_monthly = '2015-01-01'::date))

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Database design for time series - postgresql

Related

Forcing Postgresql to use Merge Append

How to auto-extract or index day from timestamp in Postgres?

PostgreSQL increase group by over 30 million rows

Postgres execution plan order of where clause and order by

Redshift query runs forever from only 2 mins by just changing a specific number to a variable

Categories

Resources