How to improve a Postgres query with pg_trgm?

Problem
I have a Postgres database with more than 100 million rows. I need to find the most similar text across the whole database.
To find similar text I use pg_trgm to filter out the most similar candidates, and then, after fetching those results, I compare the texts more closely in Python with RapidFuzz.
I found that I can set the pg_trgm similarity threshold, so I do that. I also created an index on the column I compare against:
CREATE INDEX trgm_idx ON goods_goods USING GIN (name gin_trgm_ops)
There are two index types for this, GIN and GiST. From what I read, GiST is usually suggested for this kind of work, but in my actual tests GIN performs much better: GiST takes about 9 minutes, GIN about 12 seconds.
I also need to compare several columns, which brings some problems.
I can do that with similarity(concat_ws(' ', col1, col2, col3), 'some text'), and it works fine while the number of rows is small.
But once I set a threshold on the similarity I need to use it in the WHERE clause, and then the execution time is 12 minutes. I found an explanation for that as well: an expression using concat_ws cannot be indexed (the function is not marked IMMUTABLE).
There is, however, the || operator, whose result can be indexed, and then the execution time is again about 12 seconds:
WHERE name % 'Some text'
But the || operator does not skip NULL values: if one of the columns is NULL, the whole concatenated value is NULL, and the threshold check is no longer correct.
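For illustration (a small sketch, not part of the original question), this is the difference between || and concat_ws when a column is NULL:
SELECT 'abc' || ' ' || NULL;           -- NULL: a single NULL collapses the whole concatenation
SELECT concat_ws(' ', 'abc', NULL);    -- 'abc': concat_ws skips NULL arguments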
IDEA
My idea was then not to concatenate the string in the WHERE clause but to use only the single most important column there, and afterwards do the closer comparison on the concatenation with RapidFuzz.
Sometimes it is also better to run the queries with a decreasing threshold: start with a threshold of 0.9; if there are no results I want, decrease to 0.8, and so on, until the threshold reaches 0.3 or enough results have been fetched.
This saves time, because filtering with a threshold of 0.9 takes only about a second if nothing is returned, and 0.8 also takes about a second if it returns nothing or only a few rows.
But if I set the threshold to the minimum right away and then fetch and compare all the results, some cases return more than 50k rows and take about 120 seconds. It is better to try the higher thresholds first: 1 sec + 1 sec + 5 sec (at which point the wanted number of results is found), so only about 7 seconds with the decreasing-threshold approach.
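Roughly, the loop looks like this (a sketch only; the stopping rule and the RapidFuzz re-ranking live in the Python client):
SET pg_trgm.similarity_threshold = 0.9;   -- first pass: very strict, usually returns in about a second
SELECT id, article, brand, vendor_code, name
FROM goods_goods
WHERE name % 'Some text' AND source_id = 37;

SET pg_trgm.similarity_threshold = 0.8;   -- second pass, only if the first pass returned too few rows
SELECT id, article, brand, vendor_code, name
FROM goods_goods
WHERE name % 'Some text' AND source_id = 37;
-- ...and so on down to 0.3, stopping as soon as enough candidates have been fetched.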
Here are my queries and indexes.
The whole table has 110 million rows; source 37 in the example has 45 million rows.
1:
CREATE INDEX trgm_idx ON goods_goods USING GIN (name gin_trgm_ops);
2:
SET pg_trgm.similarity_threshold = 0.3;
EXPLAIN ANALYZE SELECT id, article, brand, vendor_code, name
FROM goods_goods
WHERE name % 'Some text with some numbers 34/44 and chars :}{:">?>"??' AND source_id = 37
3:
Bitmap Heap Scan on goods_goods  (cost=2242.04..40605.89 rows=4026 width=165) (actual time=12704.821..12704.822 rows=0 loops=1)
  Recheck Cond: ((name)::text % 'Some text with some numbers 34/44 and chars :}{:">?>"??'::text)
  Rows Removed by Index Recheck: 330475
  Filter: (source_id = 37)
  Heap Blocks: exact=36338 lossy=48099
  ->  Bitmap Index Scan on trgm_idx  (cost=0.00..2241.03 rows=9737 width=0) (actual time=5923.925..5923.926 rows=91004 loops=1)
        Index Cond: ((name)::text % 'Some text with some numbers 34/44 and chars :}{:">?>"??'::text)
Planning Time: 3.028 ms
Execution Time: 12705.348 ms
Question
In the WHERE clause I need to filter on another column; does that column also need to be indexed?
Like: WHERE name % 'Some text' AND some_id = 123
And what if there are five ANDs? Do I need to index each of those columns one by one?
create index on some_id1
create index on some_id2
create index on some_id3
If one query takes about 10 seconds, can I split the work across multiple processes? For example, if I have 32 queries and each takes 10 seconds, can I run 32 processes and get all 32 results in 10 seconds?
Is there an option in Postgres to pass in an array of queries and get back an array of results?
How can I improve this or make it faster? In real work there are thousands of queries, and with a single process this would take months. I want to do it faster but don't know how. Are there any ideas?

I'm assuming you're largely having trouble with (a) optimising the query and indices and (b) how to get the data for a large number of queries in some realistic time. I'll try to answer them both one by one.
Optimising the query and indices (comparing multiple columns)
I think you've got two doubts here:
Supporting queries on a concatenated string while avoiding NULLs.
Well, this can be handled by making use of coalesce. Your expression should look like:
similarity(coalesce(col1, '') || ' ' || coalesce(col2, '') || ' ' || coalesce(col3, ''), 'some text')
And the corresponding index to support this query would be:
CREATE INDEX trgm_idx ON goods_goods USING GIN ((coalesce(col1, '') || ' ' || coalesce(col2, '') || ' ' || coalesce(col3, '')) gin_trgm_ops);
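A hedged usage sketch (not part of the original answer): for the planner to consider this expression index, the WHERE clause has to repeat exactly the same indexed expression, e.g.:
SET pg_trgm.similarity_threshold = 0.3;
SELECT id, name
FROM goods_goods
WHERE (coalesce(col1, '') || ' ' || coalesce(col2, '') || ' ' || coalesce(col3, '')) % 'some text';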
Supporting queries with columns of different types (WHERE name % 'Some text' AND some_id = 123)
You can create an index like this to support an index search:
CREATE INDEX trgm_idx ON goods_goods USING GIN (name gin_trgm_ops, some_id);
If you wish to combine the above two, the corresponding index would be:
CREATE INDEX trgm_idx ON goods_goods USING GIN ((coalesce(col1, '') || ' ' || coalesce(col2, '') || ' ' || coalesce(col3, '')) gin_trgm_ops, some_id);
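One assumption worth flagging for both of the mixed indexes above: if some_id is a plain scalar column (for example an integer), GIN has no built-in operator class for it, so the btree_gin extension has to be installed first:
CREATE EXTENSION IF NOT EXISTS btree_gin;  -- provides GIN operator classes for plain scalar types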
How to get the data for a large number of queries in some realistic time (I'll try to answer the questions you posed in 2, 3 & 4)
The above changes to the query and indexes will surely improve your query performance. But if you wish to go further, you could look at establishing multiple connections to your database and firing queries in parallel; this would give you a certain performance gain, and you can achieve it with any of the popular ORMs out there. Further, you could have multiple read replicas of your master Postgres machine so that you have more compute power and even more connections. You can also look at partitioning your data, running your queries on individual partitions instead of a single large table, and then merging/recomputing the results.
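For the partitioning idea, a minimal sketch using declarative list partitioning on source_id (available in Postgres 10+; the column list is shortened and the table names are illustrative):
CREATE TABLE goods_goods_part (
    id        bigint,
    source_id integer NOT NULL,
    name      text
) PARTITION BY LIST (source_id);

-- one partition per source; a query filtered on source_id = 37 then only scans this partition
CREATE TABLE goods_goods_src37 PARTITION OF goods_goods_part FOR VALUES IN (37);
CREATE INDEX ON goods_goods_src37 USING GIN (name gin_trgm_ops);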

Related

Postgres partial vs regular / full index on nullable column

I have a table with 1m records, 100k of which have null in colA. The remaining records have pretty distinct values. Is there a difference between creating a regular index on this column and a partial index with where colA is not null?
Since regular Postgres indexes do not store NULL values, wouldn't it be the same as creating a partial index with where colA is not null?
Any pros or cons with either indexes?
If you create a partial index that excludes nulls, Postgres will not use it to find nulls.
Here's a test with a full index on 13.5.
# create index idx_test_num on test(num);
CREATE INDEX
# explain select count(*) from test where num is null;
QUERY PLAN
-------------------------------------------------------------------------------------
Aggregate (cost=5135.00..5135.01 rows=1 width=8)
-> Bitmap Heap Scan on test (cost=63.05..5121.25 rows=5500 width=0)
Recheck Cond: (num IS NULL)
-> Bitmap Index Scan on idx_test_num (cost=0.00..61.68 rows=5500 width=0)
Index Cond: (num IS NULL)
(5 rows)
And with a partial index.
# create index idx_test_num on test(num) where num is not null;
CREATE INDEX
# explain select count(*) from test where num is null;
QUERY PLAN
--------------------------------------------------------------------------------------
Finalize Aggregate (cost=10458.12..10458.13 rows=1 width=8)
-> Gather (cost=10457.90..10458.11 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=9457.90..9457.91 rows=1 width=8)
-> Parallel Seq Scan on test (cost=0.00..9352.33 rows=42228 width=0)
Filter: (num IS NULL)
(6 rows)
Since regular postgres indexes do not store NULL values...
This has not been true since version 8.2 [checks notes] 16 years ago. The 8.2 docs say...
Indexes are not used for IS NULL clauses by default. The best way to use indexes in such cases is to create a partial index using an IS NULL predicate.
8.3 introduced NULLS FIRST and NULLS LAST and many other improvements around nulls, including...
Allow col IS NULL to use an index (Teodor)
It all depends.
NULL values have been included in (default) B-tree indexes since Postgres 8.3, as Schwern noted. However, predicates like the one you mention (where colA is not null) are only properly supported since Postgres 9.0. The release notes:
Allow IS NOT NULL restrictions to use indexes (Tom Lane). This is particularly useful for finding MAX()/MIN() values in indexes that contain many null values.
GIN indexes followed later:
As of PostgreSQL 9.1, null key values can be included in the index.
Typically, a partial index makes sense if it excludes a major part of the table from the index, making it substantially smaller and saving writes to the index. Since B-tree indexes are so shallow, bare seek performance scales fantastically (once the index is cached). 10 % fewer index entries hardly matter in that area.
Your case would exclude only around 10% of all rows, and that rarely pays. A partial index adds some overhead for the query planner and excludes queries that don't match the index condition. (The Postgres query planner doesn't try hard if the match is not immediately obvious.)
OTOH, Postgres will rarely use an index for predicates retrieving 10 % of the table - a sequential scan will typically be faster. Again, it depends.
If (almost) all queries exclude NULL anyway (in a way the Postgres planner understands), then a partial index excluding only 10 % of all rows is still a sensible option. But it may backfire if query patterns change. The added complexity may not be worth it.
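As a small sketch of that point (the index name is illustrative, reusing the test table from above): a partial index is only considered when the query's WHERE clause implies the index predicate.
CREATE INDEX idx_test_num_notnull ON test (num) WHERE num IS NOT NULL;

SELECT count(*) FROM test WHERE num = 42;       -- num = 42 implies num IS NOT NULL, so the index is usable
SELECT count(*) FROM test WHERE num IS NULL;    -- contradicts the predicate, so this index cannot be used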
Also worth noting that there are still corner cases with NULL values in Postgres indexes. I bumped into this case recently where Postgres proved unwilling to read sorted rows from a multicolumn index when the first index expression was filtered with IS NULL (making a partial index preferable for the case):
db<>fiddle here
So, it depends on the complete picture.

Big O notation of Postgresql Max with timescaledb index

I am writing some scripts that need to determine the last timestamp of a timeseries datastream that can be interrupted.
I am currently working out the most efficient way to do this; the simplest would be to look for the largest timestamp using MAX. As the tables in question are timescaledb hypertables, they are indexed, so in theory it should be a case of following the index to find the largest value, which should be a very efficient operation. However, I am not sure whether this is actually true and was wondering if anyone knows how max scales when it can work down an index; I know it is an O(n) function normally.
If there is an index on the column, max can use it: instead of scanning the whole table (O(n)), it just descends to one end of the index, which is O(log n) and effectively constant in practice:
EXPLAIN (COSTS OFF) SELECT max(attrelid) FROM pg_attribute;
QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════════════════
Result
InitPlan 1 (returns $0)
-> Limit
-> Index Only Scan Backward using pg_attribute_relid_attnum_index on pg_attribute
Index Cond: (attrelid IS NOT NULL)
(5 rows)
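For reference, this is the min/max optimization at work: the aggregate is rewritten into a single backward index scan, which written out by hand looks roughly like this (hypothetical timeseries table and column names, not from the question):
SELECT ts
FROM sensor_data
WHERE ts IS NOT NULL
ORDER BY ts DESC
LIMIT 1;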

Different results depending on when I create a GIN index before or after inserting data

I am trying to build a very simple text array GIN index with _text_ops. I know all about tsvectors - I just want to do this with text arrays as a curiosity, and I am seeing strange behavior in PostgreSQL 9.6. Here is my sequence of commands:
drop table docs cascade;
drop index gin1;
CREATE TABLE docs (id SERIAL, doc TEXT, PRIMARY KEY(id));
-- create index gin1 on docs using gin(string_to_array(doc, ' ') _text_ops); -- before
INSERT INTO docs (doc) VALUES
('This is SQL and Python and other fun teaching stuff'),
('More people should learn SQL from us'),
('We also teach Python and also SQL');
SELECT * FROM docs;
create index gin1 on docs using gin(string_to_array(doc, ' ') _text_ops); -- after
explain select doc from docs where '{SQL}' <# string_to_array(doc, ' ');
If I create the gin1 index before the inserts the explain works as expected:
pg4e=> explain select doc FROM docs WHERE '{SQL}' <# string_to_array(doc, ' ');
Bitmap Heap Scan on docs (cost=12.05..21.53 rows=6 width=32)
Recheck Cond: ('{SQL}'::text[] <# string_to_array(doc, ' '::text))
-> Bitmap Index Scan on gin1 (cost=0.00..12.05 rows=6 width=0)
Index Cond: ('{SQL}'::text[] <# string_to_array(doc, ' '::text))
If I create the gin index after the inserts, it never seems to use the index.
pg4e=> explain select doc from docs where '{SQL}' <# string_to_array(doc, ' ');
Seq Scan on docs (cost=0.00..1.04 rows=1 width=32)
Filter: ('{SQL}'::text[] <# string_to_array(doc, ' '::text))
I wondered if it is because I need to wait a while for the index to be fully populated (even with four rows) - but waiting several minutes and doing the explain still gives me a sequential table scan.
Then just for fun I insert 10000 more records
INSERT INTO docs (doc) SELECT 'Neon ' || generate_series(10000,20000);
The explain shows a Seq Scan for about 10 seconds and then after 10 seconds if I do another explain it shows a Bitmap Heap Scan. So clearly some of the index updating took a few moments - that makes sense. But in the first situation where I insert four rows and then create the index - no matter how long I wait explain never uses the index.
I have a workaround (make the index before doing the inserts) - I am mostly just curious whether there is some mechanism like a "flush index" that I missed, or whether some other mechanism is at work.
The explain shows a Seq Scan for about 10 seconds and then after 10 seconds if I do another explain it shows a Bitmap Heap Scan. So clearly some of the index updating took a few moments - that makes sense. But in the first situation where I insert four rows and then create the index - no matter how long I wait explain never uses the index.
When you insert 10,000 rows to a 4 row table, you exceed the level of activity determined by autovacuum_analyze_threshold and autovacuum_analyze_scale_factor. So the next time the autovacuum launcher visits your database, it will execute an ANALYZE of the table, and with new data from that ANALYZE on a largish table it decides the index scan will be useful. But if you just insert 4 rows, that will not trigger an auto analyze (the default value of autovacuum_analyze_threshold is 50). And if it did, the result of the ANALYZE would be that the table is so small that the index is not useful, so the plan would not change anyway.
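If you want to see the plan change without waiting for the autovacuum launcher, the statistics can be refreshed by hand (a sketch; on a table of only a few rows the planner will still prefer a sequential scan):
ANALYZE docs;   -- refresh the planner statistics for the docs table immediately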
I have a workaround (make the index before doing the inserts)
To have a workaround, you need to have a problem. You don't seem to have a genuine problem here (that lasts longer than autovacuum_naptime, anyway), so there is nothing to work around.

Why doesn't PostgreSQL use indexes on "WHERE NOT IN" conditions?

I have two tables db100 and db60 with the same fields: x, y, z.
Indexes are created for both the tables on field z like this:
CREATE INDEX db100_z_idx
ON db100
USING btree
(z COLLATE pg_catalog."default");
CREATE INDEX db60_z_idx
ON db60
USING btree
(z COLLATE pg_catalog."default");
Trying to find z values from db60 that don't exist in db100:
select db60.z from db60 where db60.z not in (select db100.z from db100)
As far as I understand, all the information required to execute the query is present in the indexes, so I would expect only the indexes to be used.
However, it uses sequential scans on the tables instead:
Seq Scan on db60  (cost=0.00..25951290012.84 rows=291282 width=4)
  Filter: (NOT (SubPlan 1))
  SubPlan 1
    ->  Materialize  (cost=0.00..80786.26 rows=3322884 width=4)
          ->  Seq Scan on db100  (cost=0.00..51190.84 rows=3322884 width=4)
Can someone please explain why PostgreSQL doesn't use the indexes in this example?
Both tables contain a few million records, and execution takes a while.
I know that using a left join with "is null" condition gives better results. However, the question is about this particular syntax.
I'm on PG v 9.5
SubPlan 1 is for select db100.z from db100. You select all rows and hence an index is useless. You really want to select DISTINCT z from db100 here and then the index should be used.
In the main query you have select db60.z from db60 where db60.z not in .... Again, you select all rows except where a condition is not true, so again the index does not apply because it applies to the inverse condition.
In general, an index is only used if the planner thinks that such a use will speed up the query processing. It always depends on how many distinct values there are and how the rows are distributed over the physical pages on disk. An index to search for all rows having a column with a certain value is not the same as finding the rows that do not have that same value; the index indicates on which pages and at which locations to find the rows, but that set can not simply be inversed.
Given - in your case - that z is some text type, a meaningful "negative" index can not be constructed (this is actually almost a true-ism, although in some cases a "negative" index could be conceivable). You should look into trigram indexes, as these tend to work much faster than btree on text indexing.
Do you really want to extract all 291,282 rows, or could you perhaps use a DISTINCT clause here too? That should speed things up quite a bit.
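Spelled out, the rewrite suggested above would look something like this (a sketch; whether the planner actually uses db100_z_idx still depends on statistics and row counts):
SELECT DISTINCT db60.z
FROM db60
WHERE db60.z NOT IN (SELECT DISTINCT db100.z FROM db100);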

Postgresql Sorting a Joined Table with an index

I'm currently working on a complex sorting problem in Postgres 9.2.
You can find the source code used in this question (simplified) here: http://sqlfiddle.com/#!12/9857e/11
I have a huge (>>20 million rows) table containing various columns of different types.
CREATE TABLE data_table
(
id bigserial PRIMARY KEY,
column_a character(1),
column_b integer
-- ~100 more columns
);
Let's say I want to sort this table over 2 columns (ASC).
But I don't want to do that with a simple ORDER BY, because later I might need to insert rows into the sorted output, and the user probably only wants to see 100 rows at a time (of the sorted output).
To achieve these goals I do the following:
CREATE TABLE meta_table
(
id bigserial PRIMARY KEY,
id_data bigint NOT NULL -- refers to the data_table
);
--Function to get the Column A of the current row
CREATE OR REPLACE FUNCTION get_column_a(bigint)
RETURNS character AS
'SELECT column_a FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Function to get the Column B of the current row
CREATE OR REPLACE FUNCTION get_column_b(bigint)
RETURNS integer AS
'SELECT column_b FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Creating a index on expression:
CREATE INDEX meta_sort_index
ON meta_table
USING btree
(get_column_a(id_data), get_column_b(id_data), id_data);
And then I copy the ids of the data_table into the meta_table:
INSERT INTO meta_table(id_data) (SELECT id FROM data_table);
Later I can add additional rows to the table with a similarly simple insert.
To get rows 900000 - 900099 (100 rows) I can now use:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
ORDER BY 1,2,3 OFFSET 900000 LIMIT 100;
(With an additional INNER JOIN on data_table if I want all the data.)
The Resulting Plan is:
Limit (cost=498956.59..499012.03 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..554396.21 rows=1000000 width=8)
This is a pretty efficient plan (index-only scans are new in Postgres 9.2).
But what if I want to get rows 20'000'000 - 20'000'099 (100 rows)? Same plan, much longer execution time. Well, to improve the OFFSET performance (see Improving OFFSET performance in PostgreSQL) I can do the following (let's assume I saved every 100'000th row away into another table).
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE (get_column_a(id_data), get_column_b(id_data), id_data ) >= (get_column_a(587857), get_column_b(587857), 587857 )
ORDER BY 1,2,3 LIMIT 100;
This runs much faster. The Resulting Plan is:
Limit (cost=0.51..61.13 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.51..193379.65 rows=318954 width=8)
Index Cond: (ROW((get_column_a(id_data)), (get_column_b(id_data)), id_data) >= ROW('Z'::bpchar, 27857, 587857))
So far everything works perfectly and Postgres does a great job!
Let's assume I now want to change the order of the 2nd column to DESC.
But then I would have to change my WHERE clause, because the > operator compares both columns ASC. The same query as above (ASC ordering) could also be written as:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE
(get_column_a(id_data) > get_column_a(587857))
OR (get_column_a(id_data) = get_column_a(587857) AND ((get_column_b(id_data) > get_column_b(587857))
OR ( (get_column_b(id_data) = get_column_b(587857)) AND (id_data >= 587857))))
ORDER BY 1,2,3 LIMIT 100;
Now the plan changes and the query becomes slow:
Limit (cost=0.00..1095.94 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..1117877.41 rows=102002 width=8)
Filter: (((get_column_a(id_data)) > 'Z'::bpchar) OR (((get_column_a(id_data)) = 'Z'::bpchar) AND (((get_column_b(id_data)) > 27857) OR (((get_column_b(id_data)) = 27857) AND (id_data >= 587857)))))
How can I use the efficient earlier plan with DESC ordering?
Do you have any better ideas for how to solve this problem?
(I already tried to declare my own type with its own operator classes, but that's too slow.)
You need to rethink your approach. Where to begin? This is basically a clear example of the performance limits of the functional approach you are taking to SQL. Functions are largely opaque to the planner, and you are forcing two different lookups on data_table for every row retrieved, because the functions' plans cannot be folded together.
Now, far worse, you are indexing one table based on data in another. This might work for append-only workloads (inserts allowed but no updates) but it will not work if data_table can ever have updates applied. If the data in data_table ever changes, you will have the index return wrong results.
In these cases, you are almost always better off writing in the join as explicit, and letting the planner figure out the best way to retrieve the data.
Now your problem is that your index becomes a lot less useful (and a lot more intensive disk I/O-wise) when you change the order of your second column. On the other hand, if you had two different indexes on the data_table and had an explicit join, PostgreSQL could more easily handle this.
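As an illustration of that last point, here is a sketch under the assumption that the two sort columns stay on data_table itself and the DESC direction is baked into an index there (index name and aliases are illustrative, not from the question):
CREATE INDEX data_table_sort_idx ON data_table (column_a ASC, column_b DESC, id);

SELECT d.column_a, d.column_b, d.id
FROM meta_table m
JOIN data_table d ON d.id = m.id_data
ORDER BY d.column_a ASC, d.column_b DESC, d.id
LIMIT 100;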