Moving from Influx to Postgres, need tips - postgresql

I've been using Influx to store our time series data. It was cool while it worked, but after about a month it stopped working and I couldn't figure out why. (Similar to this issue: https://github.com/influxdb/influxdb/issues/1386)
Maybe Influx will be great one day, but for now I need to use something that's more stable. I'm thinking about Postgres. Our data comes from many sensors; each sensor has a sensor id. So I'm thinking about structuring our data like this:
(pk), sensorId(string), time(timestamp), value(float)
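For concreteness, a minimal sketch of that schema (table and column names here are just placeholders I'm assuming):
CREATE TABLE sensor_data (
    id          bigserial PRIMARY KEY,
    sensor_id   text             NOT NULL,
    recorded_at timestamptz      NOT NULL,
    value       double precision
);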
Influx is built for time series data, so it probably has some built-in optimizations. Do I need to do those optimizations myself to make Postgres efficient? More specifically, I have these questions:
Influx has this notion of 'series' and it's cheap to create new series, so I had a separate series for each sensor. Should I create a separate Postgres table for each sensor?
How should I set up indexes to make queries fast? A typical query is: select all data for sensor123 for the last 3 days.
Should I use timestamp or integer for the time column?
How do I set a retention policy? E.g. delete data that's older than one week automatically.
Will Postgres scale horizontally? Can I set up EC2 clusters for data replication and load balancing?
Can I downsample in Postgres? I have read in some articles that I can use date_trunc, but it seems that I can't date_trunc to an arbitrary interval, e.g. 25 seconds.
Any other caveats I missed?
Thanks in advance!
Updates
Storing the time column as a big integer is faster than storing it as a timestamp. Am I doing something wrong?
Storing it as a timestamp:
postgres=# explain analyze select * from test where sensorid='sensor_0';
Bitmap Heap Scan on test (cost=3180.54..42349.98 rows=75352 width=25) (actual time=10.864..19.604 rows=51840 loops=1)
Recheck Cond: ((sensorid)::text = 'sensor_0'::text)
Heap Blocks: exact=382
-> Bitmap Index Scan on sensorindex (cost=0.00..3161.70 rows=75352 width=0) (actual time=10.794..10.794 rows=51840 loops=1)
Index Cond: ((sensorid)::text = 'sensor_0'::text)
Planning time: 0.118 ms
Execution time: 22.984 ms
postgres=# explain analyze select * from test where sensorid='sensor_0' and addedtime > to_timestamp(1430939804);
Bitmap Heap Scan on test (cost=2258.04..43170.41 rows=50486 width=25) (actual time=22.375..27.412 rows=34833 loops=1)
Recheck Cond: (((sensorid)::text = 'sensor_0'::text) AND (addedtime > '2015-05-06 15:16:44-04'::timestamp with time zone))
Heap Blocks: exact=257
-> Bitmap Index Scan on sensorindex (cost=0.00..2245.42 rows=50486 width=0) (actual time=22.313..22.313 rows=34833 loops=1)
Index Cond: (((sensorid)::text = 'sensor_0'::text) AND (addedtime > '2015-05-06 15:16:44-04'::timestamp with time zone))
Planning time: 0.362 ms
Execution time: 29.290 ms
Storing it as a big integer:
postgres=# explain analyze select * from test where sensorid='sensor_0';
Bitmap Heap Scan on test (cost=3620.92..42810.47 rows=85724 width=25) (actual time=12.450..19.615 rows=51840 loops=1)
Recheck Cond: ((sensorid)::text = 'sensor_0'::text)
Heap Blocks: exact=382
-> Bitmap Index Scan on sensorindex (cost=0.00..3599.49 rows=85724 width=0) (actual time=12.359..12.359 rows=51840 loops=1)
Index Cond: ((sensorid)::text = 'sensor_0'::text)
Planning time: 0.130 ms
Execution time: 22.331 ms
postgres=# explain analyze select * from test where sensorid='sensor_0' and addedtime > 1430939804472;
Bitmap Heap Scan on test (cost=2346.57..43260.12 rows=52489 width=25) (actual time=10.113..14.780 rows=31839 loops=1)
Recheck Cond: (((sensorid)::text = 'sensor_0'::text) AND (addedtime > 1430939804472::bigint))
Heap Blocks: exact=235
-> Bitmap Index Scan on sensorindex (cost=0.00..2333.45 rows=52489 width=0) (actual time=10.059..10.059 rows=31839 loops=1)
Index Cond: (((sensorid)::text = 'sensor_0'::text) AND (addedtime > 1430939804472::bigint))
Planning time: 0.154 ms
Execution time: 16.589 ms

You shouldn't create a table for each sensor. Instead you could add a field to your table that identifies what series it is in. You could also have another table that describes additional attributes about the series. If data points could belong to multiple series, then you'd need a different structure altogether.
For the query you described in q2, a composite index on (sensor_id, recorded_at) should work (time is a SQL reserved keyword, so best avoid it as a column name).
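A sketch of such an index, reusing the placeholder names from the schema above, together with the typical "last 3 days" query it serves:
CREATE INDEX sensor_data_sensor_time_idx ON sensor_data (sensor_id, recorded_at);

SELECT *
FROM sensor_data
WHERE sensor_id = 'sensor123'
  AND recorded_at > now() - interval '3 days';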
You should use TIMESTAMP WITH TIME ZONE as your time data type.
Retention is up to you: Postgres has no built-in retention policy, so the usual approaches are a scheduled DELETE job (e.g. from cron) or partitioning by time and dropping old partitions.
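A sketch of the scheduled-DELETE variant, again with the placeholder names from above:
-- run periodically, e.g. from cron or the pg_cron extension
DELETE FROM sensor_data
WHERE recorded_at < now() - interval '1 week';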
Postgres has various options for sharding/replication. That's a big topic.
Not sure I understand your objective for #6 (the downsampling question), but I'm sure you can figure something out.
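For reference, one common way to bucket to an arbitrary interval such as 25 seconds (which date_trunc can't do directly) is to round the epoch down to a multiple of that interval; a sketch using the placeholder names from above:
SELECT to_timestamp(floor(extract(epoch FROM recorded_at) / 25) * 25) AS bucket,
       avg(value) AS avg_value
FROM sensor_data
WHERE sensor_id = 'sensor123'
GROUP BY bucket
ORDER BY bucket;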

Related

Why is the execution time reported by EXPLAIN ANALYZE (PostgreSQL) for the same query different each time I execute it, even if I reset the cache?

My setup:
macOS
Homebrew
PostgreSQL 10 installed with brew
To reset the cache I run these commands:
brew services stop postgresql@10
brew services start postgresql@10
Indeed, each time I run the query I don't see any shared hit when I run EXPLAIN (ANALYZE, BUFFERS).
This is what happens:
EXPLAIN (ANALYZE, buffers)
SELECT *
FROM my_table
WHERE key = '...';
From query plan:
Bitmap Heap Scan on mytable (cost=5.31..663.61 rows=169 width=97) (actual time=1.172..32.475 rows=221 loops=1)
Recheck Cond: ((key)::text = '...'::text)
Heap Blocks: exact=220
Buffers: shared read=222
-> Bitmap Index Scan on idx_hash_mytable_key (cost=0.00..5.27 rows=169 width=0) (actual time=0.719..0.719 rows=221 loops=1)
Index Cond: ((key)::text = '...'::text)
Buffers: shared read=2
Planning time: 6.370 ms
Execution time: 32.527 ms
After I restart Postgres and re-run the same query I get:
Bitmap Heap Scan on mytable (cost=5.31..663.61 rows=169 width=97) (actual time=0.705..42.808 rows=221 loops=1)
Recheck Cond: ((key)::text = '...'::text)
Heap Blocks: exact=220
Buffers: shared read=222
-> Bitmap Index Scan on idx_hash_mytable_key (cost=0.00..5.27 rows=169 width=0) (actual time=0.464..0.464 rows=221 loops=1)
Index Cond: ((key)::text = '...'::text)
Buffers: shared read=2
Planning time: 5.611 ms
Execution time: 42.869 ms
As you can see, the difference between the two executions is quite big: the first run is about 25% faster than the second one. Probably there are other variables that influence the final result.
Is there a way to get the same execution time every time I run the query? Maybe my current approach to resetting the cache is not correct.
The goal is to check the performance difference between two indexes, but if the execution time differs for the same query plan even after resetting the cache, I cannot precisely compare the performance of the two indexes.
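One possible approach, offered only as a sketch rather than a definitive method: instead of chasing identical cold-cache numbers, compare the two indexes on warm-cache timings by repeating the query until the Buffers line reports only shared hits, then averaging a few runs:
-- repeat a few times; once "Buffers: shared read=..." turns into
-- "Buffers: shared hit=...", the timings stabilize and can be compared
-- between the two candidate indexes
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM my_table
WHERE key = '...';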

Can I have feedback about my Postgres performance?

This is the query I ran in pgAdmin 4:
update point
set grid_id_new=g.grid_id
from grid as g
where (point.region = 'EMILIA-ROMAGNA' and st_within(point.geom, g.geom))
Point is a 34-million-record table with a point geometry (16 GB, 20 columns).
Grid is a 10-million-record table with a multipolygon geometry, i.e. the grid (4 GB).
I want each point in the point table to be associated with the ID of the grid cell it lies in. The query updated 2.5 million records (since I filter by region) in 24 minutes.
I feel like it took too much time.
These are my computer specifics:
Windows 10 Pro / Intel(R) Core(TM) i9-10920X CPU @ 3.50 GHz / RAM 128 GB / 953 GB SSD (C:) + 3.4 TB HDD (F:)
I have installed Postgres 13 and the data folder is on F: (I know this may be wrong, so I am planning to move it).
I have also tried to tune the postgresql.conf file but got poor results.
Can someone please tell me whether my Postgres performance is as poor as I think? And, if so, how can I make it better? Also, what would be a good postgresql.conf configuration for my hardware?
Update
@jjanes Hi there! It took 8 minutes to run the query you wrote, and this is the result:
QUERY PLAN
Gather (cost=1363.89..273178616690.49 rows=23057026760 width=28) (actual time=76.935..503830.684 rows=2335279 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=18634521 read=2426823
-> Nested Loop (cost=363.89..270872913014.49 rows=9607094483 width=28) (actual time=157.628..503021.991 rows=778426 loops=3)
Buffers: shared hit=18634521 read=2426823
-> Parallel Seq Scan on egon_geom_new (cost=0.00..2657488.69 rows=1064319 width=59) (actual time=1.575..8642.488 rows=855390 loops=3)
Filter: (dsxreg = 'EMILIA-ROMAGNA'::text)
Rows Removed by Filter: 10581246
Buffers: shared hit=259223 read=2225262
-> Bitmap Heap Scan on "6_emilia_grid" (cost=363.89..254491.98 rows=903 width=148) (actual time=0.573..0.573 rows=1 loops=2566171)
Filter: st_within((egon_geom_new.geom_new)::geometry, geom)
Heap Blocks: exact=784879
Buffers: shared hit=18375298 read=201561
-> Bitmap Index Scan on emilia_idx (cost=0.00..363.66 rows=9027 width=0) (actual time=0.283..0.283 rows=1 loops=2566171)
Index Cond: (geom ~ (egon_geom_new.geom_new)::geometry)
Buffers: shared hit=16167046 read=74534
Planning:
Buffers: shared hit=130 read=3 dirtied=2
Planning Time: 22.756 ms
Execution Time: 504042.609 ms
Thanks!
You can create a GiST index on one of the geometry columns; that will speed up the nested loop join. But you cannot use another join strategy, because the join condition does not use the equality operator (=), so joining two big tables like this will always be slow.
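A minimal sketch, assuming the table and column names from the question (grid.geom and point.geom, with PostGIS installed):
CREATE INDEX grid_geom_gist  ON grid  USING gist (geom);
CREATE INDEX point_geom_gist ON point USING gist (geom);
ANALYZE grid;
ANALYZE point;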

Phrase frequency counter with FULL Text Search of PostgreSQL 9.6

I need to calculate the number of times a phrase appears, using a tsquery against an indexed text field (tsvector data type). It works, but it is very slow because the table is huge. For single words I pre-calculated all the frequencies, but I have no idea how to speed up a phrase search.
Edit: Thank you for your reply, @jjanes.
This is my query:
SELECT substring(date_input::text, 0, 5) AS myear,
       ts_headline('simple', text_input, q, 'StartSel=<b>, StopSel=</b>, MaxWords=2, MinWords=1, ShortWord=1, HighlightAll=FALSE, MaxFragments=9999, FragmentDelimiter=" ... "') AS headline
FROM db_test,
     to_tsquery('simple', 'united<->kingdom') AS q
WHERE date_input BETWEEN '2000-01-01'::DATE AND '2019-12-31'::DATE
  AND idxfti_simple @@ q
And this is the EXPLAIN (ANALYZE, BUFFERS) output:
Nested Loop (cost=25408.33..47901.67 rows=5509 width=64) (actual time=286.536..17133.343 rows=38127 loops=1)
Buffers: shared hit=224723
-> Function Scan on q (cost=0.00..0.01 rows=1 width=32) (actual time=0.005..0.007 rows=1 loops=1)
-> Append (cost=25408.33..46428.00 rows=5510 width=625) (actual time=285.372..864.868 rows=38127 loops=1)
Buffers: shared hit=165713
-> Bitmap Heap Scan on db_test (cost=25408.33..46309.01 rows=5509 width=625) (actual time=285.368..791.111 rows=38127 loops=1)
Recheck Cond: ((idxfti_simple @@ q.q) AND (date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
Rows Removed by Index Recheck: 136
Heap Blocks: exact=29643
Buffers: shared hit=165607
-> BitmapAnd (cost=25408.33..25408.33 rows=5509 width=0) (actual time=278.370..278.371 rows=0 loops=1)
Buffers: shared hit=3838
-> Bitmap Index Scan on idxftisimple_idx (cost=0.00..1989.01 rows=35869 width=0) (actual time=67.280..67.281 rows=176654 loops=1)
Index Cond: (idxfti_simple @@ q.q)
Buffers: shared hit=611
-> Bitmap Index Scan on db_test_date_input_idx (cost=0.00..23142.24 rows=1101781 width=0) (actual time=174.711..174.712 rows=1149456 loops=1)
Index Cond: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
Buffers: shared hit=3227
-> Seq Scan on test (cost=0.00..118.98 rows=1 width=451) (actual time=0.280..0.280 rows=0 loops=1)
Filter: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date) AND (idxfti_simple @@ q.q))
Rows Removed by Filter: 742
Buffers: shared hit=106
Planning time: 0.332 ms
Execution time: 17176.805 ms
Sorry, I can't turn track_io_timing on. I do know that ts_headline is not recommended, but I need it to count the number of times a phrase appears in the same field.
Thank you in advance for your help.
Note that fetching the rows in Bitmap Heap Scan is quite fast, <0.8 seconds, and almost all the time is spent in the top-level node. That time is likely to be spent in ts_headline, reparsing the text_input document. As long as you keep using ts_headline, there isn't much you can do about this.
ts_headline doesn't directly give you what you want (frequency), so you must be doing some kind of post-processing of it. Maybe you could move to postprocessing the tsvector directly, so the document doesn't need to be reparsed.
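As a sketch of that idea (my own, not from the thread), assuming PostgreSQL 9.6's unnest(tsvector) and the column names from the query above; with the 'simple' configuration the phrase 'united<->kingdom' is just the lexeme 'united' immediately followed by 'kingdom':
SELECT substring(date_input::text, 0, 5) AS myear,
       -- count positions of 'united' whose next position holds 'kingdom',
       -- working directly on the tsvector instead of reparsing text_input
       (SELECT count(*)
          FROM unnest(idxfti_simple) AS a,
               unnest(a.positions) AS pa(pos)
         WHERE a.lexeme = 'united'
           AND EXISTS (SELECT 1
                         FROM unnest(idxfti_simple) AS b,
                              unnest(b.positions) AS pb(pos)
                        WHERE b.lexeme = 'kingdom'
                          AND pb.pos = pa.pos + 1)) AS phrase_count
FROM db_test
WHERE date_input BETWEEN '2000-01-01'::DATE AND '2019-12-31'::DATE
  AND idxfti_simple @@ to_tsquery('simple', 'united<->kingdom');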
Another option is to upgrade further, which could allow the work of ts_headline to be spread over multiple CPUs. PostgreSQL 9.6 was the first version which supported parallel query, and it was not mature enough in that version to be able to parallelize this type of thing. v10 is probably enough to get parallel query for this, but you might as well jump all the way to v12.
Version 9.2 is old and out of support. It didn't have native support for phrase searching in the first place (introduced in 9.6).
Please upgrade.
And if it is still slow, show us the query, and the EXPLAIN (ANALYZE, BUFFERS) for it, preferably with track_io_timing turned on.

Postgres JSONB timestamp query very slow compared to timestamp column query

I've got a Postgres 9.4.4 database with 1.7 million records with the following information stored in a JSONB column called data in a table called accounts:
data: {
    "lastUpdated": "2016-12-26T12:09:43.901Z",
    "lastUpdatedTimestamp": "1482754183"
}
The actual JSONB column stores much more information, but I've omitted the irrelevant data. The data format cannot be changed since this is legacy information.
I'm trying to efficiently obtain a count of all records where the lastUpdated value is greater or equal to some reference time (I'll use 2015-12-01T10:10:10Z in the following examples):
explain analyze SELECT count(*) FROM "accounts"
WHERE data->>'lastUpdated' >= '2015-12-01T10:10:10Z';
This takes over 22 seconds:
Aggregate (cost=843795.05..843795.06 rows=1 width=0) (actual time=22292.584..22292.584 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..842317.05 rows=591201 width=0)
(actual time=1.410..22142.046 rows=1773603 loops=1)
Filter: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.234 ms
Execution time: 22292.671 ms
I've tried adding the following text index:
CREATE INDEX accounts_last_updated ON accounts ((data->>'lastUpdated'));
But the query is still rather slow, at over 17 seconds:
Aggregate (cost=815548.64..815548.65 rows=1 width=0) (actual time=17172.844..17172.845 rows=1 loops=1)
-> Bitmap Heap Scan on accounts (cost=18942.24..814070.64 rows=591201 width=0)
(actual time=1605.454..17036.081 rows=1773603 loops=1)
Recheck Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Heap Blocks: exact=28955 lossy=397518
-> Bitmap Index Scan on accounts_last_updated (cost=0.00..18794.44 rows=591201 width=0)
(actual time=1596.645..1596.645 rows=1773603 loops=1)
Index Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.373 ms
Execution time: 17172.974 ms
I've also tried following the directions in Create timestamp index from JSON on PostgreSQL and have tried creating the following function and index:
CREATE OR REPLACE FUNCTION text_to_timestamp(text)
RETURNS timestamp AS
$$SELECT to_timestamp($1, 'YYYY-MM-DD HH24:MI:SS.MS')::timestamp; $$
LANGUAGE sql IMMUTABLE;
CREATE INDEX accounts_last_updated ON accounts
(text_to_timestamp(data->>'lastUpdated'));
But this doesn't give me any improvement, in fact it was slower, taking over 24 seconds for the query, versus 22 seconds for the unindexed version:
explain analyze SELECT count(*) FROM "accounts"
WHERE text_to_timestamp(data->>'lastUpdated') >= '2015-12-01T10:10:10Z';
Aggregate (cost=1287195.80..1287195.81 rows=1 width=0) (actual time=24143.150..24143.150 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..1285717.79 rows=591201 width=0)
(actual time=4.044..23971.723 rows=1773603 loops=1)
Filter: (text_to_timestamp((data ->> 'lastUpdated'::text)) >= '2015-12-01 10:10:10'::timestamp without time zone)
Planning time: 1.107 ms
Execution time: 24143.183 ms
In one last act of desperation, I decided to add another timestamp column and update it to contain the same values as data->>'lastUpdated':
alter table accounts add column updated_at timestamp;
update accounts set updated_at = text_to_timestamp(data->>'lastUpdated');
create index accounts_updated_at on accounts(updated_at);
This has given me by far the best performance:
explain analyze SELECT count(*) FROM "accounts" where updated_at >= '2015-12-01T10:10:10Z';
Aggregate (cost=54936.49..54936.50 rows=1 width=0) (actual time=676.955..676.955 rows=1 loops=1)
-> Index Only Scan using accounts_updated_at on accounts
(cost=0.43..50502.48 rows=1773603 width=0) (actual time=0.026..552.442 rows=1773603 loops=1)
Index Cond: (updated_at >= '2015-12-01 10:10:10'::timestamp without time zone)
Heap Fetches: 0
Planning time: 4.643 ms
Execution time: 678.962 ms
However, I'd very much like to avoid adding another column just to improve the speed of this one query.
This leaves me with the following question: is there any way to improve the performance of my JSONB query so it can be as efficient as the individual column query (the last query, where I used updated_at instead of data->>'lastUpdated')? As it stands, it takes 17 to 24 seconds to query the JSONB data using data->>'lastUpdated', while it takes only 678 ms to query the updated_at column. It doesn't make sense that the JSONB query would be so much slower. I was hoping that using the text_to_timestamp function would improve the performance, but it hasn't (or I'm doing something wrong).
In your first and second attempts, most of the execution time is spent on the index recheck or filter, which must read the JSON field for every row the index hits, and reading JSON is expensive. If the index hits a couple of hundred rows the query will be fast, but if it hits thousands or hundreds of thousands of rows, filtering/rechecking the JSON field takes serious time. In the second attempt, wrapping the value in an additional function makes it even worse.
A JSON field is good for storing data, but it is not intended for analytic queries like summaries and statistics, and it's bad practice to use JSON objects in WHERE conditions, at least as the main filtering condition as in your case.
That last act of desperation of yours is the right way to go :)
To improve query performance, add one or several columns holding the key values that are used most often in your WHERE conditions.
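A minimal sketch of that (assuming the text_to_timestamp() function defined in the question): keep the extra updated_at column in sync automatically with a trigger, so application code that only writes the JSONB column does not have to change.
CREATE OR REPLACE FUNCTION accounts_sync_updated_at() RETURNS trigger AS $$
BEGIN
    -- mirror the JSONB value into the plain timestamp column on every write
    NEW.updated_at := text_to_timestamp(NEW.data->>'lastUpdated');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_updated_at_sync
BEFORE INSERT OR UPDATE ON accounts
FOR EACH ROW EXECUTE PROCEDURE accounts_sync_updated_at();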

Postgresql 9.x: Index to optimize `xpath_exists` (XMLEXISTS) queries

We have queries of the form
select sum(acol)
where xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol)
What index can be built to speed up the where clause?
A btree index created using
create index idx_01 using btree(xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol))
does not seem to be used at all.
EDIT
With enable_seqscan set to off, the query using xpath_exists is much faster (by an order of magnitude) and the plan clearly uses the corresponding index (the btree index built with xpath_exists).
Any clue why PostgreSQL would not use the index and instead attempts a much slower sequential scan?
Since I do not want to disable sequential scanning globally, I am back to square one and I am happily welcoming suggestions.
EDIT 2 - Explain plans
See below - the cost of the first plan (seqscan off) is slightly higher, but the processing time is much faster.
b2box=# set enable_seqscan=off;
SET
b2box=# explain analyze
Select count(*)
from B2HEAD.item
where cluster = 'B2BOX' and ( ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) ) ) offset 0 limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.042..606.042 rows=1 loops=1)
-> Aggregate (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.039..606.039 rows=1 loops=1)
-> Bitmap Heap Scan on item (cost=1058.65..22701.38 rows=26102 width=0) (actual time=3.290..603.823 rows=4085 loops=1)
Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
-> Bitmap Index Scan on item_counter_01 (cost=0.00..1052.13 rows=56515 width=0) (actual time=2.283..2.283 rows=4085 loops=1)
Index Cond: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) = true)
Total runtime: 606.136 ms
(7 rows)
plan on explain.depesz.com
b2box=# set enable_seqscan=on;
SET
b2box=# explain analyze
Select count(*)
from B2HEAD.item
where cluster = 'B2BOX' and ( ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) ) ) offset 0 limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.163..10864.163 rows=1 loops=1)
-> Aggregate (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.160..10864.160 rows=1 loops=1)
-> Seq Scan on item (cost=0.00..22490.45 rows=26102 width=0) (actual time=33.574..10861.672 rows=4085 loops=1)
Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
Rows Removed by Filter: 108945
Total runtime: 10864.242 ms
(6 rows)
plan on explain.depesz.com
Planner cost parameters
Cost of first plan (seqscan off) is slightly higher but processing time much faster
This tells me that your random_page_cost and seq_page_cost are probably wrong. You're likely on storage with fast random I/O - either because most of the database is cached in RAM or because you're using SSD, SAN with cache, or other storage where random I/O is inherently fast.
Try:
SET random_page_cost = 1;
SET seq_page_cost = 1.1;
to greatly reduce the cost parameter differences and then re-run. If that does the job, consider changing those parameters in postgresql.conf.
Your row-count estimates are reasonable, so it doesn't look like a planner mis-estimation problem or a problem with bad table statistics.
Incorrect query
Your query is also incorrect. OFFSET 0 LIMIT 1 without an ORDER BY will produce unpredictable results unless you're guaranteed to have exactly one match, in which case the OFFSET ... LIMIT ... clauses are unnecessary and can be removed entirely.
You're usually much better off phrasing such queries as SELECT max(...) or SELECT min(...) where possible; PostgreSQL will tend to be able to use an index to just pluck off the desired value without doing an expensive table scan or an index scan and sort.
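For instance, the count query from the plans above, with the redundant clauses dropped, becomes simply:
SELECT count(*)
FROM B2HEAD.item
WHERE cluster = 'B2BOX'
  AND xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content);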
Tips
BTW, for future questions the PostgreSQL wiki has some good information in the performance category and a guide to asking Slow query questions.