Slow query in postgresql geonames db despite indexes - postgresql

I imported all tables from http://www.geonames.org/ into my local postgresql 9.5.3.0 database and peppered it with indexes like so:
create extension pg_trgm;
CREATE INDEX name_trgm_idx ON geoname USING GIN (name gin_trgm_ops);
CREATE INDEX fcode_trgm_idx ON geoname USING GIN (fcode gin_trgm_ops);
CREATE INDEX fclass_trgm_idx ON geoname USING GIN (fclass gin_trgm_ops);
CREATE INDEX alternatename_trgm_idx ON alternatename USING GIN (alternatename gin_trgm_ops);
CREATE INDEX isolanguage_trgm_idx ON alternatename USING GIN (isolanguage gin_trgm_ops);
CREATE INDEX alt_geoname_id_idx ON alternatename (geonameid)
And now I would like to query the country names in different languages and cross reference the geonames attributes with these alternative names like so:
select g.geonameid as geonameid ,a.alternatename as name,g.country as country, g.fcode as fcode
from geoname g,alternatename a
where
a.isolanguage=LOWER('de')
and a.alternatename ilike '%Sa%'
and (a.ishistoric = FALSE OR a.ishistoric IS NULL)
and (a.isshortname = TRUE OR a.isshortname IS NULL)
and a.geonameid = g.geonameid
and g.fclass='A'
and g.fcode ='PCLI';
Unfortunately though this query takes as long as 13 to 15 seconds on an octacore machine with a fast SSD. 'Explain analyze verbose' shows this:
Nested Loop (cost=0.43..237138.04 rows=1 width=25) (actual time=1408.443..10878.115 rows=15 loops=1)
Output: g.geonameid, a.alternatename, g.country, g.fcode
-> Seq Scan on public.alternatename a (cost=0.00..233077.17 rows=481 width=18) (actual time=0.750..10862.089 rows=2179 loops=1)
Output: a.alternatenameid, a.geonameid, a.isolanguage, a.alternatename, a.ispreferredname, a.isshortname, a.iscolloquial, a.ishistoric
Filter: (((a.alternatename)::text ~~* '%Sa%'::text) AND ((a.isolanguage)::text = 'de'::text))
Rows Removed by Filter: 10675099
-> Index Scan using pk_geonameid on public.geoname g (cost=0.43..8.43 rows=1 width=11) (actual time=0.006..0.006 rows=0 loops=2179)
Output: g.geonameid, g.name, g.asciiname, g.alternatenames, g.latitude, g.longitude, g.fclass, g.fcode, g.country, g.cc2, g.admin1, g.admin2, g.admin3, g.admin4, g.population, g.elevation, g.gtopo30, g.timezone, g.moddate
Index Cond: (g.geonameid = a.geonameid)
Filter: ((g.fclass = 'A'::bpchar) AND ((g.fcode)::text = 'PCLI'::text))
Rows Removed by Filter: 1
Which to me seems to indicate that somehow a sequence scan is performed on 481 rows (which I deem to be fairly low), but nevertheless takes very long. I currently can't make sense of this. Any ideas?

The trigrams only work if you have minimum of 3 characters you're searching for %Sa% won't work, %foo% will. However your indexes are still not good enough. Depending on what parameters are dynamic use multicolumn or filtered indexes:
CREATE INDEX jkb1 ON geoname(fclass, fcode, geonameid, country);
CREATE INDEX jkb2 ON geoname(geonameid, country) WHERE fclass = 'A' AND fcode = 'PCLI';
Same for the other table:
CREATE INDEX jkb3 ON alternatename(geonameid, alternatename) WHERE (a.ishistoric = FALSE OR a.ishistoric IS NULL)
AND (a.isshortname = TRUE OR a.isshortname IS NULL) AND isolanguage=LOWER('de')

Related

How to use an index when using jsonb_array_elements in Postgres

I have the next table structure:
create table public.listings (id varchar(255) not null, data jsonb not null);
And the next indexes:
create index listings_data_index on public.listings using gin(data jsonb_ops);
create unique index listings_id_index on public.listings(id);
alter table public.listings add constraint listings_id_pk primary key(id);
With this row:
id | data
1 | {"attributes": {"ccid": "123", "listings": [{"vin": "1234","body": "Sleeper", "make": "International"}, { "vin": "5678", "body": "Sleeper", "make": "International" }]}}
The use case needs to retrieve a specific item inside the listings array that matches a specific vin.
I am accomplishing that with the next query:
SELECT elems
FROM public.listings, jsonb_array_elements(data->'attributes'->'listings') elems
WHERE id = '1' AND elems->'vin' ? '1234';
The output is what I need:
{"vin": "1234","body": "Sleeper", "make": "International"}
Now I am in the phase of optimizing this query, since there will be millions of rows, and up to 100K items inside listings array.
When I run the explain over that query is shows this:
Nested Loop (cost=0.01..2.53 rows=1 width=32)
-> Seq Scan on listings (cost=0.00..1.01 rows=1 width=32)
Filter: ((id)::text = '1'::text)
-> Function Scan on jsonb_array_elements elems (cost=0.01..1.51 rows=1 width=32)
Filter: ((value -> 'vin'::text) ? '1234'::text)
I wonder what would be the right way to construct an index for that, or if I need to modify the query to another that is more efficient.
Thank you!
First: with a table as small as that, you will never see PostgreSQL use an index. You need to try with realistic amounts. Second: while PostgreSQL will happily use an index for the condition on id, it can never use an index for such a JSON search, no matter how you write it.

Index not used on filterd aggregate query

I need to increase performance of the following query, which filters on column status_classification and aggregrates on classification -> 'flags' (a jsonb field in the form: '{"flags": ["NO_CLASS_FOUND"]}'::jsonb):
SELECT SUM(CASE WHEN ("result_materials"."classification" -> 'flags') #> '["NO_CLASS_FOUND"]' THEN 1 ELSE 0 END) AS "no_class_found",
SUM(CASE WHEN ("result_materials"."classification" -> 'flags') #> '["RULE"]' THEN 1 ELSE 0 END) AS "rule",
SUM(CASE WHEN ("result_materials"."classification" -> 'flags') #> '["NO_MAPPING"]' THEN 1 ELSE 0 END) AS "no_mapping"
FROM "result_materials"
WHERE "result_materials"."status_classification" = 'PROCESSED';
To improve performance i created an index on status_classification, but the query plan shows that the index was never hit, and a Seq Scan was performed:
Aggregate (cost=1010.15..1010.16 rows=1 width=24) (actual time=19.942..19.946 rows=1 loops=1)
-> Seq Scan on result_materials (cost=0.00..869.95 rows=6231 width=202) (actual time=0.024..4.660 rows=6231 loops=1)
Filter: ((status_classification)::text = 'PROCESSED'::text)
Rows Removed by Filter: 5
Planning Time: 1.212 ms
Execution Time: 20.187 ms
I've tried (all sql at the end of question):
adding an index to status_classification
adding a GIN index to classification -> 'flags'
adding a multi field GIN index, with classification -> 'flags' and status_classification (see here)
The index is still not hit, and performance suffers on as the table grows. Cardinality is low in status_classification field, but the entries in classification -> 'flags' are quite rare, so i would have thought an index very practical here.
Why is the index not used? What am i doing wrong?
SQL to recreate my db:
create table result_materials (
uuid int,
status_classification varchar(30),
classification jsonb
);
insert into result_materials(uuid, classification, status_classification)
select seq
, case(random() *2)::int
when 0 then '{"flags": ["NO_CLASS_FOUND"]}'::jsonb
when 1 then '{"flags": ["RULE"]}'::jsonb
when 2 then '{"flags": ["NO_MAPPING"]}'::jsonb end
as dummy
, case(random() *2)::int
when 0 then 'NOT_PROCESSABLE'
when 1 then 'PROCESSABLE' end
as sta
from generate_series(1, 150000) seq;
Indexes attempted:
-- status_classification
create index other_testes on result_materials (status_classification);
-- classification -> 'flags'
CREATE INDEX idx_testes ON result_materials USING gin ((classification -> 'flags'));
-- multi field gin
-- REQUIRES you to run: CREATE EXTENSION btree_gin;
CREATE INDEX idx_testes ON result_materials USING gin ((classification -> 'flags'), status_classification);
The query takes 20ms and only removes 5 rows of 6k, yes a scan it's a good choice.
Try adding more rows to the table, and check the cardinality of your clause.

Avoid using Nested Loop Join while using a non Equi join condition

Postgres is using a Nested Loop Join algorithm when I use a non equi join condition in my update query. I understand that the Nested Loop Join can be very costly as the right relation is scanned once for every row found in the left relation as per
[https://www.postgresql.org/docs/8.3/planner-optimizer.html]
The update query and the execution plan is below.
Query
explain analyze
UPDATE target_tbl tgt
set descr = stage.descr,
prod_name = stage.prod_name,
item_name = stage.item_name,
url = stage.url,
col1_name = stage.col1_name,
col2_name = stage.col2_name,
col3_name = stage.col3_name,
col4_name = stage.col4_name,
col5_name = stage.col5_name,
col6_name = stage.col6_name,
col7_name = stage.col7_name,
col8_name = stage.col8_name,
flag = stage.flag
from tbl1 stage
where tgt.col1 = stage.col1
and tgt.col2 = stage.col2
and coalesce(tgt.col3, 'col3'::text) = coalesce(stage.col3, 'col3'::text)
and coalesce(tgt.col4, 'col4'::text) = coalesce(stage.col4, 'col4'::text)
and stage.row_number::int >= 1::int
and stage.row_number::int < 50001::int;
Execution Plan
Update on target_tbl tgt (cost=0.56..3557.91 rows=1 width=813) (actual time=346153.460..346153.460 rows=0 loops=1)
-> Nested Loop (cost=0.56..3557.91 rows=1 width=813) (actual time=4.326..163876.029 rows=50000 loops=1)
-> Seq Scan on tbl1 stage (cost=0.00..2680.96 rows=102 width=759) (actual time=3.060..2588.745 rows=50000 loops=1)
Filter: (((row_number)::integer >= 1) AND ((row_number)::integer < 50001))
-> Index Scan using tbl_idx on target_tbl tgt (cost=0.56..8.59 rows=1 width=134) (actual time=3.152..3.212 rows=1 loops=50000)
Index Cond: ((col1 = stage.col1) AND (col2 = stage.col2) AND (COALESCE(col3, 'col3'::text) = COALESCE(stage.col3, 'col3'::text)) AND (COALESCE(col4, 'col4'::text) = COALESCE(stage.col4, 'col4'::text)))
Planning time: 17.700 ms
Execution time: 346157.168 ms
Is there any way to avoid the nested loop join during the execution of the above query?
Or is there a way that can help me to reduce the cost of the the nested loop scan, currently it takes 6-7 minutes to update just 50000 records?
PostgreSQL can choose a different join strategy in that case. The reason why it doesn't is the gross mis-estimate in the sequential scan: 102 instead of 50000.
Fix that problem, and things will get better:
ANALYZE tbl1;
If that is not enough, collect more detailed statistics:
ALTER TABLE tbl1 ALTER row_number SET STATISTICS 1000;
ANALYZE tbl1;
All this assumes that row_number is an integer and the type cast is redundant. If you made the mistake to use a different data type, an index is your only hope:
CREATE INDEX ON tbl1 ((row_number::integer));
ANALYZE tbl1;
I understand that the Nested Loop Join can be very costly as the right relation is scanned once for every row found in the left relation
But the "right relation" here is an index scan, not a scan of the full table.
You can get it to stop using the index by changing the leading column of the join condition to something like where tgt.col1+0 = stage.col1 .... Upon doing that, it will probably change to a hash join or a merge join, but you will have to try it and see if it does. Also, the new plan may not actually be faster. (And fixing the estimation problem would be preferable, if that works)
Or is there a way that can help me to reduce the cost of the the
nested loop scan, currently it takes 6-7 minutes to update just 50000
records?
Your plan shows that over half the time is spent on the update itself, so probably reducing the cost of just the nested loop scan can have only a muted impact on the overall time. Do you have a lot of indexes on the table? The maintenance of those indexes might be a major bottleneck.

Is this the right way to create a partial index in Postgres?

We have a table with 4 million records, and for a particular frequently used use-case we are only interested in records with a particular salesforce userType of 'Standard' which are only about 10,000 out of 4 million. The other usertype's that could exist are 'PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess' and 'CsnOnly'.
So for this use case I thought creating a partial index would be better, as per the documentation.
So I am planning to create this partial index to speed up queries for records with a usertype of 'Standard' and prevent the request from the web from getting timed out:
CREATE INDEX user_type_idx ON user_table(userType)
WHERE userType NOT IN
('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
The lookup query will be
select * from user_table where userType='Standard';
Could you please confirm if this is the right way to create the partial index? It would of great help.
Postgres can use that but it does so in a way that is (slightly) less efficient than an index specifying where user_type = 'Standard'.
I created a small test table with 4 million rows, 10.000 of them having the user_type 'Standard'. The other values were randomly distributed using the following script:
create table user_table
(
id serial primary key,
some_date date not null,
user_type text not null,
some_ts timestamp not null,
some_number integer not null,
some_data text,
some_flag boolean
);
insert into user_table (some_date, user_type, some_ts, some_number, some_data, some_flag)
select current_date,
case (random() * 4 + 1)::int
when 1 then 'PowerPartner'
when 2 then 'CSPLitePortal'
when 3 then 'CustomerSuccess'
when 4 then 'PowerCustomerSuccess'
when 5 then 'CsnOnly'
end,
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,4e6 - 10000) as t(i)
union all
select current_date,
'Standard',
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,10000) as t(i);
(I create tables that have more than just a few columns as the planner's choices are also driven by the size and width of the tables)
The first test using the index with NOT IN:
create index ix_not_in on user_table(user_type)
where user_type not in ('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
explain (analyze true, verbose true, buffers true)
select *
from user_table
where user_type = 'Standard'
Results in:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on stuff.user_table (cost=139.68..14631.83 rows=11598 width=139) (actual time=1.035..2.171 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Recheck Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=262
-> Bitmap Index Scan on ix_not_in (cost=0.00..136.79 rows=11598 width=0) (actual time=1.007..1.007 rows=10000 loops=1)
Index Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=40
Total runtime: 2.506 ms
(The above is a typical execution time after I ran the statement about 10 times to eliminate caching issues)
As you can see the planner uses a Bitmap Index Scan which is a "lossy" scan that needs an extra step to filter out false positives.
When using the following index:
create index ix_standard on user_table(id)
where user_type = 'Standard';
This results in the following plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_standard on stuff.user_table (cost=0.29..443.16 rows=10267 width=139) (actual time=0.011..1.498 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Buffers: shared hit=313
Total runtime: 1.815 ms
Conclusion:
Your index is used but an index on only the type that you are interested in is a bit more efficient.
The runtime is not that much different. I executed each explain about 10 times, and the average for the ix_standard index was slightly below 2ms and the average of the ix_not_in index was slightly above 2ms - so not a real performance difference.
But in general the Index Scan will scale better with increasing table sizes than the Bitmap Index Scan will do. This is basically due to the "Recheck Condition" - especially if not enough work_mem is available to keep the bitmap in memory (for larger tables).
For the index to be used, the WHERE condition must be used in the query as you wrote it.
PostgreSQL has some ability to make deductions, but it won't be able to infer that userType = 'Standard' is equivalent to the condition in the index.
Use EXPLAIN to find out if your index can be used.

PostgreSQL daterange not using index correctly

I have a simple table which has a user_birthday field with a type of date (which can be
NULL value)
CREATE TABLE users
(
user_id bigserial NOT NULL,
user_email text NOT NULL,
user_password text,
user_first_name text NOT NULL,
user_middle_name text,
user_last_name text NOT NULL,
user_birthday date,
CONSTRAINT pk_users PRIMARY KEY (user_id)
)
There's an index (btree) defined on that field, with the rule of NOT
user_birthday IS NULL.
CREATE INDEX ix_users_birthday
ON users
USING btree
(user_birthday)
WHERE NOT user_birthday IS NULL;
Trying to follow up on another idea, I've added the extension btree_gist and created the following index:
CREATE INDEX ix_users_birthday_gist
ON glances.users
USING gist
(user_birthday)
WHERE NOT user_birthday IS NULL;
But it had no affect either, as from what I could read it is not used for range checking.
The PostgreSQL version is 9.3.4.0 (22) Postgres.app
and issue also exists in 9.3.3.0 (21) Postgres.app
I've been intrigued by the following queries:
Query #1:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday <# daterange('[1978-07-15,1983-03-01)')
Query #2:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
which, at first glance both should have the same execution plan, but for some
reason, here are the results:
Query #1:
"Seq Scan on users (cost=0.00..52314.25 rows=11101 width=241) (actual
time=0.014..478.983 rows=208886 loops=1)"
" Filter: (user_birthday <# '[1978-07-15,1983-03-01)'::daterange)"
" Rows Removed by Filter: 901214"
"Total runtime: 489.584 ms"
Query #2:
"Bitmap Heap Scan on users (cost=4468.01..46060.53 rows=210301 width=241)
(actual time=57.104..489.785 rows=209019 loops=1)"
" Recheck Cond: ((user_birthday >= '1978-07-15'::date) AND (user_birthday
<= '1983-03-01'::date))"
" Rows Removed by Index Recheck: 611375"
" -> Bitmap Index Scan on ix_users_birthday (cost=0.00..4415.44
rows=210301 width=0) (actual time=54.621..54.621 rows=209019 loops=1)"
" Index Cond: ((user_birthday >= '1978-07-15'::date) AND
(user_birthday <= '1983-03-01'::date))"
"Total runtime: 500.983 ms"
As you can see, the <# daterange is not utilizing the existing index, while
BETWEEN does.
Important to note that the actual use case for this rule is in a more complex query,
which doesn't result in the Recheck Cond and Bitmap Heap scan.
In the application complex query, the difference between the two methods (with 1.2 million records) is massive:
Query #1 at 415ms
Query #2 at 84ms.
Is this a bug with daterange?
Am I doing something wrong? or datarange <# is performing as designed?
There's also a discussion in the pgsql-bugs mailing list
BETWEEN includes upper and lower border. Your condition
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
matches
WHERE user_birthday <# daterange('[1978-07-15,1983-03-01]')
I see you mention a btree index. For that use simple comparison operators.
Detailed manual page on which index is good for which operators.
The range type operators <# or #> would work with GiST indexes.
Example:
Perform this hours of operation query in PostgreSQL