Optimizing SELECT using array intersection - postgresql

Inside a BEFORE trigger function, I'm trying to optimize a SELECT that uses an array intersection of the form:
select into matching_product * from products where global_ids && NEW.global_ids
The above pegs the CPU at 100% during some modest batch inserts (without this SELECT in the trigger function the CPU drops to ~5%).
I did define a GIN index on global_ids, but that doesn't seem to help.
Is there any other way to optimize the above? E.g., should I just go ahead and create an N-M relationship between products and global_ids and do some joins to get the same result?
EDIT
It seems the GIN index IS used, but it's still slow. I'm not sure what I can expect (YMMV and all that), but the table has ~200,000 rows. A query like the one below takes 300 ms; I feel this should be near instant.
select * from products where global_ids && '{871712323629}'
Doing an explain on the above shows:
Bitmap Heap Scan on products (cost=40.51..3443.85 rows=1099 width=490)
Recheck Cond: (global_ids && '{871712323629}'::text[])
-> Bitmap Index Scan on "global_ids_GIN" (cost=0.00..40.24 rows=1099 width=0)
Index Cond: (global_ids && '{871712323629}'::text[])
Table definition (irrelevant columns removed):
CREATE TABLE public.products
(
id text COLLATE pg_catalog."default" NOT NULL,
global_ids text[] COLLATE pg_catalog."default",
CONSTRAINT products_pkey PRIMARY KEY (id)
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
Index
CREATE INDEX "global_ids_GIN"
ON public.products USING gin
(global_ids COLLATE pg_catalog."default")
TABLESPACE pg_default;

I cannot think of any reason why such a query should behave differently inside a PL/pgSQL function; my experiments suggest that it doesn't.
Run EXPLAIN (ANALYZE, BUFFERS) on a query like the one you run inside the function several times to get a good estimate of the duration you should expect.
Run EXPLAIN (ANALYZE, BUFFERS) on inserts like the ones you are doing in batch, on a similar table without a trigger, to measure how long the heap insert and index maintenance take.
Add these values and multiply by the number of rows you insert in a batch.
If you end up with roughly the same time as you experience, there is no mystery to solve.
If you end up with a “lossy” bitmap index scan (look at the EXPLAIN (ANALYZE, BUFFERS) output), you can boost performance by increasing work_mem.
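For example, a rough sketch of those two measurements against the table from the question (the array value is made up, and products_no_trigger stands in for a hypothetical trigger-less copy of the table):

-- Measure the SELECT as the trigger would run it:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products WHERE global_ids && '{871712323629}'::text[];

-- Measure a plain insert without trigger overhead (products_no_trigger is hypothetical):
EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO products_no_trigger (id, global_ids)
VALUES ('some-id', '{871712323629}');

-- If the bitmap heap scan reports lossy blocks, raise work_mem and retry, e.g.:
SET work_mem = '64MB';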

Related

Postgres TIMESTAMP index and query performance

I have this table:
CREATE TABLE IF NOT EXISTS CHANGE_REQUESTS (
ID UUID PRIMARY KEY,
FIELD_ID INTEGER NOT NULL,
LAST_CHANGE_DATE TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
);
And I'm always going to be running the exact same query on it:
select * from change_requests where last_change_date > now() - INTERVAL '10 min';
The size of the table is going to be anywhere from 750k to 1million rows on average.
My question is how can I make sure the query is always very fast? I'm thinking of adding an index on last_change_date, but I'm not sure if that will do anything. I tried it (with only 1 row in the table right now) and got this explain:
create index change_requests__dt_index
on change_requests (last_change_date);
Seq Scan on change_requests (cost=0.00..1.02 rows=1 width=28)
Filter: (last_change_date > (now() - '00:10:00'::interval))
So it doesn't appear to use the index at all.
Will this index actually help? If not, what else could I do? Thanks!
Your index is perfect for the task. You see the sequential scan in the execution plan because you don't have a realistic amount of test data in the table, and for very small tables the overhead of using the index is not worth the effort (you'd have to process more 8kB database blocks).
Always test with realistic amounts of data. That will save you some pain later on.
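For example, a rough sketch of filling the table with made-up but realistically distributed data and re-checking the plan (gen_random_uuid() is built into PostgreSQL 13 and later; older versions need the pgcrypto extension):

-- Load one million rows with timestamps spread over the last 30 days:
INSERT INTO change_requests (id, field_id, last_change_date)
SELECT gen_random_uuid(),
       (random() * 100)::int,
       now() - random() * INTERVAL '30 days'
FROM generate_series(1, 1000000);

-- Refresh planner statistics, then look at the plan again:
ANALYZE change_requests;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM change_requests
WHERE last_change_date > now() - INTERVAL '10 min';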

How does Postgres choose which index to use when multiple indexes are present?

I am new to Postgres and a bit confused about how Postgres decides which index to use if I have more than one btree index defined, as below.
CREATE INDEX index_1 ON sample_table USING btree (col1, col2, COALESCE(col3, 'col3'::text));
CREATE INDEX index_2 ON sample_table USING btree (col1, COALESCE(col3, 'col3'::text));
I am using col1, col2, COALESCE(col3, 'col3'::text) in my join condition when I write to sample_table (from source tables), but when I do an EXPLAIN ANALYZE to get the query plan, I sometimes see that it uses index_2 for the scan rather than index_1, and sometimes it just goes with a sequential scan. I want to understand what can make Postgres use one index over another.
Without seeing EXPLAIN (ANALYZE, BUFFERS) output, I can only give a generic answer.
PostgreSQL considers all execution plans that are feasible and estimates the row count and cost for each node. Then it takes the plan with the lowest cost estimate.
It could be that the condition on col2 is sometimes more selective and sometimes less, for example because you sometimes compare it to rare and sometimes to frequent values. If the condition involving col2 is not selective, it does not matter much which of the two indexes is used. In that case PostgreSQL prefers the smaller two-column index.
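A generic way to see this is to compare the plans for a rare and a frequent value (the literal values below are made up; substitute ones you know occur rarely and frequently in col2, and adjust for the actual column types):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sample_table
WHERE col1 = 'a' AND col2 = 'rare_value' AND COALESCE(col3, 'col3'::text) = 'x';

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sample_table
WHERE col1 = 'a' AND col2 = 'frequent_value' AND COALESCE(col3, 'col3'::text) = 'x';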

Different results depending on when I create a GIN index before or after inserting data

I am trying to build a very simple text array GIN index with _text_ops. I know all about tsvector - I just want to do this with text arrays as a curiosity, and I am seeing a strange behavior in PostgreSQL 9.6. Here is my sequence of commands:
drop table docs cascade;
drop index gin1;
CREATE TABLE docs (id SERIAL, doc TEXT, PRIMARY KEY(id));
-- create index gin1 on docs using gin(string_to_array(doc, ' ') _text_ops); -- before
INSERT INTO docs (doc) VALUES
('This is SQL and Python and other fun teaching stuff'),
('More people should learn SQL from us'),
('We also teach Python and also SQL');
SELECT * FROM docs;
create index gin1 on docs using gin(string_to_array(doc, ' ') _text_ops); -- after
explain select doc from docs where '{SQL}' <@ string_to_array(doc, ' ');
If I create the gin1 index before the inserts the explain works as expected:
pg4e=> explain select doc FROM docs WHERE '{SQL}' <@ string_to_array(doc, ' ');
Bitmap Heap Scan on docs (cost=12.05..21.53 rows=6 width=32)
Recheck Cond: ('{SQL}'::text[] <@ string_to_array(doc, ' '::text))
-> Bitmap Index Scan on gin1 (cost=0.00..12.05 rows=6 width=0)
Index Cond: ('{SQL}'::text[] <@ string_to_array(doc, ' '::text))
If I create the gin index after the inserts, it never seems to use the index.
pg4e=> explain select doc from docs where '{SQL}' <@ string_to_array(doc, ' ');
Seq Scan on docs (cost=0.00..1.04 rows=1 width=32)
Filter: ('{SQL}'::text[] <@ string_to_array(doc, ' '::text))
I wondered if it is because I need to wait a while for the index to be fully populated (even with four rows) - but waiting several minutes and doing the explain still gives me a sequential table scan.
Then just for fun I insert 10000 more records
INSERT INTO docs (doc) SELECT 'Neon ' || generate_series(10000,20000);
The explain shows a Seq Scan for about 10 seconds and then after 10 seconds if I do another explain it shows a Bitmap Heap Scan. So clearly some of the index updating took a few moments - that makes sense. But in the first situation where I insert four rows and then create the index - no matter how long I wait explain never uses the index.
I have a workaround (make the index before doing the inserts) - I am mostly just curious whether there is some mechanism like a "flush index" that I missed, or whether some other mechanism is at work.
The explain shows a Seq Scan for about 10 seconds and then after 10 seconds if I do another explain it shows a Bitmap Heap Scan. So clearly some of the index updating took a few moments - that makes sense. But in the first situation where I insert four rows and then create the index - no matter how long I wait explain never uses the index.
When you insert 10,000 rows into a 4-row table, you exceed the level of activity determined by autovacuum_analyze_threshold and autovacuum_analyze_scale_factor. So the next time the autovacuum launcher visits your database, it will execute an ANALYZE of the table, and with the fresh statistics from that ANALYZE on what is now a largish table, it decides that the index scan will be useful. But if you just insert 4 rows, that will not trigger an auto-analyze (the default value of autovacuum_analyze_threshold is 50). And even if it did, the result of the ANALYZE would be that the table is so small that the index is not useful, so the plan would not change anyway.
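If you don't want to wait for autovacuum, you can refresh the statistics yourself and watch the plan change; a minimal sketch with the docs table from the question:

-- Collect fresh statistics for the planner, then re-check the plan:
ANALYZE docs;
EXPLAIN SELECT doc FROM docs WHERE '{SQL}' <@ string_to_array(doc, ' ');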
I have a workaround (make the index before doing the inserts)
To have a workaround, you need to have a problem. You don't seem to have a genuine problem here (that lasts longer than autovacuum_naptime, anyway), so there is nothing to work around.

PostgreSQL doesn't use index with unaccent function

I have the following table:
CREATE TABLE products (
id bigserial NOT NULL PRIMARY KEY,
name varchar(2048)
-- Many other columns
);
I want to make a case- and diacritics-insensitive LIKE query on name.
For that I have created the following function:
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE OR REPLACE FUNCTION immutable_unaccent(varchar)
RETURNS text AS $$
SELECT unaccent($1)
$$ LANGUAGE sql IMMUTABLE;
And then created an index on name using this function:
CREATE INDEX products_search_name_key ON products(immutable_unaccent(name));
However, when I make a query, the query is very slow (about 2.5s for 300k rows). I'm pretty sure PostgreSQL is not using the index
-- Slow (~2.5s for 300k rows)
SELECT products.* FROM products
WHERE immutable_unaccent(products.name) LIKE immutable_unaccent('%Hello world%')
-- Fast (~60ms for 300k rows), and there is no index
SELECT products.* FROM products
WHERE products.name LIKE '%Hello world%'
I have tried creating a separate column with a case- and diacritics-insensitive copy of the name, like so, and in that case the query is fast:
ALTER TABLE products ADD search_name varchar(2048);
UPDATE products
SET search_name = immutable_unaccent(name);
-- Fast (~60ms for 300k rows), and there is no index
SELECT products.* FROM products
WHERE products.search_name LIKE immutable_unaccent('%Hello world%')
What am I doing wrong? Why doesn't my index approach work?
Edit: Execution plan for the slow query
explain analyze SELECT products.* FROM products
WHERE immutable_unaccent(products.name) LIKE immutable_unaccent('%Hello world%')
Seq Scan on products (cost=0.00..79568.32 rows=28 width=2020) (actual time=1896.131..1896.131 rows=0 loops=1)
Filter: (immutable_unaccent(name) ~~ '%Hello world%'::text)
Rows Removed by Filter: 277986
Planning time: 1.014 ms
Execution time: 1896.220 ms
If you want to do a LIKE '%hello world%' type query, you must find another way to index it.
(You may have to install a couple of contrib modules first. To do so, log in as the postgres admin/superuser and issue the following commands.)
Prerequisite:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION fuzzystrmatch;
Try the following:
create index on products using gist (immutable_unaccent(name) gist_trgm_ops);
It should use an index with your query at that point.
select * from products
where immutable_unaccent(name) like '%Hello world%';
Note: this index could get large, but given the 2048-character limit on name, it probably won't get that big.
You could also use full text search, but that's a lot more complicated.
What the above does is index "trigrams" of the name, i.e., each set of three consecutive letters within the name. So if the product is called "hello world" it would index "hel", "ell", "llo", "lo ", " wo", "wor", "orl", and "rld".
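If you are curious, pg_trgm's show_trgm() helper lets you inspect exactly which trigrams it extracts (it lowercases the text and adds space-padded word-boundary trigrams); for example:

SELECT show_trgm('hello world');
-- Typical output:
-- {"  h","  w"," he"," wo","ell","hel","llo","lo ","orl","rld","wor"}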
Then it can use that index against your search term in a more efficient way. You can use either a gist or a gin index type if you like.
Basically
GIST will be slightly slower to query, but faster to update.
GIN is the opposite.
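For completeness, a sketch of the GIN variant under the same assumptions (same immutable_unaccent() function and products table as in the question; the index name is arbitrary):

CREATE INDEX products_name_trgm_gin
ON products USING gin (immutable_unaccent(name) gin_trgm_ops);

-- Check that the planner actually uses it for the original query:
EXPLAIN ANALYZE
SELECT * FROM products
WHERE immutable_unaccent(name) LIKE immutable_unaccent('%Hello world%');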

Postgresql Sorting a Joined Table with an index

I'm currently working on a complex sorting problem in Postgres 9.2
You can find the Source Code used in this Question(simplified) here: http://sqlfiddle.com/#!12/9857e/11
I have a huge (well over 20 million rows) table containing various columns of different types.
CREATE TABLE data_table
(
id bigserial PRIMARY KEY,
column_a character(1),
column_b integer
-- ~100 more columns
);
Let's say I want to sort this table by two columns (ASC).
But I don't want to do that with a simple ORDER BY, because later I might need to insert rows into the sorted output, and the user probably only wants to see 100 rows of the sorted output at a time.
To achieve these goals I do the following:
CREATE TABLE meta_table
(
id bigserial PRIMARY KEY,
id_data bigint NOT NULL -- refers to the data_table
);
--Function to get the Column A of the current row
CREATE OR REPLACE FUNCTION get_column_a(bigint)
RETURNS character AS
'SELECT column_a FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Function to get the Column B of the current row
CREATE OR REPLACE FUNCTION get_column_b(bigint)
RETURNS integer AS
'SELECT column_b FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Creating an index on the expression:
CREATE INDEX meta_sort_index
ON meta_table
USING btree
(get_column_a(id_data), get_column_b(id_data), id_data);
And then I copy the IDs from data_table into meta_table:
INSERT INTO meta_table(id_data) (SELECT id FROM data_table);
Later I can add additional rows to the table with a similar simple insert.
To get rows 900,000 - 900,099 (100 rows) I can now use:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
ORDER BY 1,2,3 OFFSET 900000 LIMIT 100;
(With an additional INNER JOIN on data_table if I want all the data.)
The Resulting Plan is:
Limit (cost=498956.59..499012.03 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..554396.21 rows=1000000 width=8)
This is a pretty efficient plan (Index Only Scans are new in Postgres 9.2).
But what if I want to get rows 20,000,000 - 20,000,099 (100 rows)? Same plan, much longer execution time. Well, to improve the OFFSET performance (see "Improving OFFSET performance in PostgreSQL") I can do the following (let's assume I saved away every 100,000th row into another table):
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE (get_column_a(id_data), get_column_b(id_data), id_data ) >= (get_column_a(587857), get_column_b(587857), 587857 )
ORDER BY 1,2,3 LIMIT 100;
This runs much faster. The Resulting Plan is:
Limit (cost=0.51..61.13 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.51..193379.65 rows=318954 width=8)
Index Cond: (ROW((get_column_a(id_data)), (get_column_b(id_data)), id_data) >= ROW('Z'::bpchar, 27857, 587857))
So far everything works perfectly and Postgres does a great job!
Let's assume I now want to change the order of the 2nd column to DESC.
But then I would have to change my WHERE clause, because the row-wise > operator compares both columns ascending. The same query as above (ASC ordering) could also be written as:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE
(get_column_a(id_data) > get_column_a(587857))
OR (get_column_a(id_data) = get_column_a(587857) AND ((get_column_b(id_data) > get_column_b(587857))
OR ( (get_column_b(id_data) = get_column_b(587857)) AND (id_data >= 587857))))
ORDER BY 1,2,3 LIMIT 100;
Now the plan changes and the query becomes slow:
Limit (cost=0.00..1095.94 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..1117877.41 rows=102002 width=8)
Filter: (((get_column_a(id_data)) > 'Z'::bpchar) OR (((get_column_a(id_data)) = 'Z'::bpchar) AND (((get_column_b(id_data)) > 27857) OR (((get_column_b(id_data)) = 27857) AND (id_data >= 587857)))))
How can I keep the efficient plan from above with DESC ordering?
Do you have any better ideas how to solve the problem?
(I already tried declaring my own type with its own operator classes, but that's too slow.)
You need to rethink your approach. Where to begin? This is a clear example of the performance limits of the functional approach you are taking to SQL. Functions are largely opaque to the planner, and you are forcing two separate lookups on data_table for every row retrieved, because the stored procedures' plans cannot be folded together.
Now, far worse, you are indexing one table based on data in another. This might work for append-only workloads (inserts allowed but no updates) but it will not work if data_table can ever have updates applied. If the data in data_table ever changes, you will have the index return wrong results.
In these cases, you are almost always better off writing in the join as explicit, and letting the planner figure out the best way to retrieve the data.
Now your problem is that your index becomes a lot less useful (and a lot more disk-I/O-intensive) when you change the order of your second column. On the other hand, if you had two different indexes on data_table and an explicit join, PostgreSQL could handle this much more easily.
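As an illustration of that direction (one possible design, not the only one), here is a hedged sketch that indexes data_table directly with the mixed sort order and paginates with a keyset condition; the index name is made up, and the literal values stand in for the last row of the previous page:

-- Mixed-order index matching the desired ORDER BY:
CREATE INDEX data_table_mixed_sort_idx
ON data_table (column_a ASC, column_b DESC, id ASC);

-- Keyset pagination: the leading column_a >= 'Z' part can serve as the index
-- condition, the rest is a cheap filter, and LIMIT stops the scan early.
SELECT column_a, column_b, id
FROM data_table
WHERE column_a >= 'Z'
  AND (column_a > 'Z'
       OR column_b < 27857
       OR (column_b = 27857 AND id > 587857))
ORDER BY column_a ASC, column_b DESC, id ASC
LIMIT 100;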